CUDA Optimization with NVIDIA® Nsight™ Visual Studio Edition 3.0

Julien Demouth, NVIDIA

What Will You Learn?

An iterative method to optimize your GPU code

A way to conduct that method with Nsight VSE

APOD Method (Assess, Parallelize, Optimize, Deploy), Session S3008, Cliff Woolley

https://developer.nvidia.com/content/assess-parallelize-optimize-deploy


What Does the Application Do?

It does not matter!!!

We care about memory accesses, instructions, latency, …

Companion code (with a different input file)

https://github.com/jdemouth/nsight-gtc2013


Our Method

Trace your application

Identify the hot spot and profile it

Identify the performance limiter

— Memory Bandwidth

— Instruction Throughput

— Latency

Optimize the code

Iterate


Our Environment

We use

— Nvidia Tesla K20c (GK110, SM 3.5), ECC OFF,

— Microsoft Windows 7 x64,

— Microsoft Visual Studio 2010 SP1,

— CUDA 5.0,

— Driver 310.34,

— Nvidia Nsight 3.0.


ITERATION 1


Trace the Application


CUDA Launch Summary

spmv_kernel_v0 is a hot spot, let’s start here!!!

Kernel                      Time        Speedup
Original version            457.1 ms    -

Profile the Most Expensive Kernel


CUDA Launches


Identify the Main Limiter

Is it limited by the memory bandwidth?

Is it limited by the instruction throughput?

Is it limited by latency?

Memory Bandwidth

Utilization of DRAM Bandwidth: 37.67%

We are not limited by the memory bandwidth


Instruction Throughput

Instructions Per Clock (IPC): 0.04

We are not limited by instruction throughput


Latency

First two things to check:

— Occupancy

— Memory accesses (coalesced/uncoalesced accesses)

Other things to check (if needed):

— Control flow efficiency (branching, idle threads)

— Divergence

— Bank conflicts in shared memory


Latency

Occupancy: 47.58% Achieved / 50% Theoretical

Eligible Warps per Active Cycle: >4.7 on average

On GK110, 4 Eligible Warps are enough: Not an issue

Latency

Memory Accesses:

— Load: 22 Transactions per Request

— Store: 8 Transactions per Request

We have too many uncoalesced accesses!!!


Where Do Those Accesses Happen?

CUDA Source Profiler:

— Find where most of the uncoalesced requests happen

Tip: Sort “L2 Global Transactions Executed”


Access Pattern

Double precision numbers: 64-bit

Per Warp:

— Up to 32 L1 Transactions / Ideal case: 2 Transactions

— Up to 32 L2 Transactions / Ideal case: 8 Transactions

[Figure: the per-thread 64-bit loads of a warp (Thread 0, Thread 1, Thread 2, ...) scatter across separate 32B L2 transactions within 128B L1 transactions]
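The deck does not reproduce the kernel source; as a point of reference, here is a minimal, hypothetical sketch of a one-thread-per-row CSR SpMV in double precision that produces this kind of pattern (the names spmv_row_per_thread, row_ptr, cols, vals, x, y are illustrative, not the companion code's). At each step of the loop, the 32 threads of a warp read values from 32 unrelated rows, so one request can need up to 32 transactions.

__global__ void spmv_row_per_thread(int num_rows,
                                    const int    *row_ptr,
                                    const int    *cols,
                                    const double *vals,
                                    const double *x,
                                    double       *y)
{
    // One thread per row of A (CSR storage).
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows)
        return;

    double sum = 0.0;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += vals[k] * x[cols[k]];   // neighboring threads read far-apart addresses

    y[row] = sum;
}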

Access Pattern

Idea for the next iteration: use the Read-only cache (LDG load)

— On Fermi: use a texture, or configure 48KB of L1


First Modification: Use __ldg

We change the source code:
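The actual change lives in the companion code; a minimal sketch of what it could look like, reusing the illustrative names from the kernel above, is to route the loads of A's values through the read-only data cache with __ldg (SM 3.5):

// Sketch only: A's values go through the read-only (texture) path;
// x and the column indices keep the normal load path.
for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
    sum += __ldg(&vals[k]) * x[cols[k]];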

It is slower: 625.8ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x

First Modification: Use __ldg

Less L2 to SM traffic: 857.1MB transferred (it was 906.2MB)


First Modification: Use __ldg

Why does ~6% less traffic make the kernel 37% slower?

Instruction Efficiency (Eligible Warps per Active Cycle):

The average number of Eligible Warps dropped below 1


First Modification: Use __ldg

There are already “a lot” of Active Warps per Cycle


First Modification: Use __ldg

Warps cannot issue because they have to wait

Warps wait for Texture in 91.1% of the cases


First Modification: Use __ldg

The loads compete for the cache too much

— Low hit rate: 7.7%

Texture requests introduce too much latency

Things to check in those cases:

— Texture Hit Rate: Low means no reuse

— Issue Efficiency and Stall Reasons

It was actually expected: GPU caches are not CPU caches!!!


First Modification: Use __ldg

Other accesses may benefit from LDGs

Memory blocks accessed several times by several threads

How can we detect it?

— Source code analysis

— There is no way to detect it from Nsight


First Modification: Use __ldg

We change the source code

— In y = Ax, we use __ldg when loading x
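As a sketch with the same illustrative names (not the companion code): only the loads of x go through the read-only cache, since each element of x can be reused by every row that references that column.

// Sketch only: x via __ldg; the reuse across rows gives a high hit rate.
for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
    sum += vals[k] * __ldg(&x[cols[k]]);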

It’s faster: 403.4ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x

First Modification: Use __ldg

Much less L2 to SM traffic: 774MB (it was 906.2MB)

Good hit rate in Texture Cache: 83%


ITERATION 2


CUDA Launch Summary

spmv_kernel_v2 is still a hot spot, so we profile it

Identify the Main Limiter

Is it limited by the memory bandwidth?

Is it limited by the instruction throughput?

Is it limited by latency?

Identify the Main Limiter

We are still limited by latency

— Low DRAM utilization: 36.48%

— Low IPC: 0.06


Identify the Main Limiter

We are not limited by occupancy

— We have > 6 Eligible Warps per Active Cycle

We are limited by uncoalesced accesses: 48.92% of Replays


Second Strategy: Change Memory Accesses

4 consecutive threads load 4 consecutive elements

Per Warp:

— Up to 8 L1 Transactions / Ideal case: 2 Transactions

— Up to 8 L2 Transactions / Ideal case: 8 Transactions

[Figure: threads 0-3 share one 32B L2 transaction, threads 4-7 the next, threads 8-11 the one after; the groups of 4 consecutive loads fall inside 128B L1 transactions]
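A hedged sketch of this strategy (illustrative names and structure, not the companion code): each group of THREADS_PER_ROW consecutive threads cooperates on one row, so consecutive threads load consecutive 64-bit values, and a small shared-memory reduction combines the partial sums. The group size is kept as a template parameter because later iterations in the deck change it.

#define BLOCK_SIZE 128

template <int THREADS_PER_ROW>   // 4 in this iteration
__global__ void spmv_group_per_row(int num_rows,
                                   const int    *row_ptr,
                                   const int    *cols,
                                   const double *vals,
                                   const double *x,
                                   double       *y)
{
    __shared__ double partial[BLOCK_SIZE];

    int lane = threadIdx.x % THREADS_PER_ROW;                             // position in the group
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / THREADS_PER_ROW; // one row per group

    double sum = 0.0;
    if (row < num_rows)
    {
        // Lanes 0..THREADS_PER_ROW-1 read consecutive elements of the row,
        // so the loads of a warp coalesce into a few 32B L2 transactions.
        for (int k = row_ptr[row] + lane; k < row_ptr[row + 1]; k += THREADS_PER_ROW)
            sum += vals[k] * __ldg(&x[cols[k]]);
    }

    partial[threadIdx.x] = sum;
    __syncthreads();

    // Reduce the THREADS_PER_ROW partial sums of each group.
    for (int offset = THREADS_PER_ROW / 2; offset > 0; offset /= 2)
    {
        if (lane < offset)
            partial[threadIdx.x] += partial[threadIdx.x + offset];
        __syncthreads();
    }

    if (lane == 0 && row < num_rows)
        y[row] = partial[threadIdx.x];
}

A possible launch for this iteration, with one group of 4 threads per row:

int grid = (num_rows * 4 + BLOCK_SIZE - 1) / BLOCK_SIZE;
spmv_group_per_row<4><<<grid, BLOCK_SIZE>>>(num_rows, row_ptr, cols, vals, x, y);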

Second Strategy: Change Memory Accesses

It’s much faster: 161.7ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x
Coalescing with 4 Threads   161.7 ms    2.83x

Second Strategy: Change Memory Accesses

We have much fewer Transactions per Request: 5.51 (LD)


Second Strategy: Change Memory Accesses

Much less traffic from L2: 230.5MB (it was 774MB)

Much less DRAM traffic: 210.1MB (it was 503.1MB)


ITERATION 3


CUDA Launch Summary

spmv_kernel_v3 is still a hot spot, so we profile it

Identify the Main Limiter

Is it limited by the memory bandwidth?

Is it limited by the instruction throughput?

Is it limited by latency?

Identify the Main Limiter

We are still limited by latency

— Low DRAM utilization: 37.67%

— Low IPC: 0.31


Latency

Occupancy: 57.80% Achieved / 62.50% Theoretical

Eligible Warps per Active Cycle: ~3 on average

We need 4 eligible warps on GK110, so ~3 could be an issue


Latency

Occupancy is limited by the number of registers

We constrain register usage with __launch_bounds__(BLOCK_SIZE, MIN_BLOCKS)

It does not really help
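For reference, this is how the qualifier is placed on a kernel; the values and the kernel signature are illustrative, not the companion code's:

#define BLOCK_SIZE 128   // maximum threads per block
#define MIN_BLOCKS 8     // minimum resident blocks per SM

// The compiler caps register usage so that MIN_BLOCKS blocks of BLOCK_SIZE
// threads can be resident on one multiprocessor at the same time.
__global__ void __launch_bounds__(BLOCK_SIZE, MIN_BLOCKS)
spmv_kernel(int num_rows, const int *row_ptr, const int *cols,
            const double *vals, const double *x, double *y)
{
    /* ... unchanged kernel body ... */
}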

Latency

Memory Accesses:

— Load: 5.51 Transactions per Request

— Store: 2 Transactions per Request

We still have too many uncoalesced accesses


Latency

We still have too many uncoalesced accesses

— Nearly 70% of Instruction Serialization (Replays)

— Stall Reasons: 48.1% due to Data Requests


Where Do Those Accesses Happen?

Same lines of code as before


What Can We Do?

In our kernel: 4 threads per row of the matrix A

New approach: 1 warp of threads per row of the matrix A

[Figure: with 4 threads per row, each group of 4 threads (0-3, 4-7, 8-11, ...) touches its own 32B L2 transaction; with 1 warp per row, threads 0, 1, 2, ..., 31 (some possibly idle) fill consecutive 32B L2 transactions]
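Under the same hypothetical sketch as in iteration 3, this change is only a larger group size: a full warp cooperates on each row.

int grid = (num_rows * 32 + BLOCK_SIZE - 1) / BLOCK_SIZE;
spmv_group_per_row<32><<<grid, BLOCK_SIZE>>>(num_rows, row_ptr, cols, vals, x, y);

With 32 lanes per row, the loads of the row's values and column indices are contiguous across the warp, but lanes beyond the row length sit idle, which is what the control flow numbers on the following slides pick up.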

One Warp Per Row

It’s faster: 140.4ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x
Coalescing with 4 Threads   161.7 ms    2.83x
1 Warp per Row              140.4 ms    3.26x

One Warp Per Row

Much fewer Transactions Per Request: 1.37 (LD) / 1 (ST)


ITERATION 4


One Warp Per Row

spmv_kernel_v4 is the hot spot


One Warp Per Row

DRAM utilization: 40.64%

IPC: 1.57

We are still limited by latency

One Warp Per Row

Occupancy and memory accesses are OK (not shown)

Control Flow Efficiency: 86.59%

Only 72.5% of the threads are active in the expensive loop
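In the same hypothetical sketch, the fix amounts to giving each row only half a warp, so fewer lanes sit idle on short rows:

int grid = (num_rows * 16 + BLOCK_SIZE - 1) / BLOCK_SIZE;
spmv_group_per_row<16><<<grid, BLOCK_SIZE>>>(num_rows, row_ptr, cols, vals, x, y);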

One Half Warp Per Row

It is faster: 114.7ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x
Coalescing with 4 Threads   161.7 ms    2.83x
1 Warp per Row              140.4 ms    3.26x
½ Warp per Row              114.7 ms    3.99x

ITERATION 5


One Half Warp Per Row

DRAM utilization: 49.79%

IPC: 1.34

We are still limited by latency

One Half Warp Per Row

Memory accesses are good enough

Occupancy could be an issue: ~3.2 Eligible Warps per Cycle

— Occupancy is limited by registers

But forcing the register count does not improve performance

One Half Warp Per Row

Branch divergence induces latency

We have 23.1% divergent branches


One Half Warp Per Row

We fix branch divergence

It is faster: 91.2ms

Kernel                      Time        Speedup
Original version            457.1 ms    -
LDG to load A               625.8 ms    0.73x
LDG to load X               403.4 ms    1.13x
Coalescing with 4 Threads   161.7 ms    2.83x
1 Warp per Row              140.4 ms    3.26x
½ Warp per Row              114.7 ms    3.99x
No divergence                91.2 ms    5.01x

One Half Warp Per Row

DRAM utilization: 62.57%

IPC: 1.56

We achieve much better bandwidth utilization

So Far

In successive iterations, we have:

— Improved caching using __ldg (use with care)

— Improved coalescing

— Improved control flow efficiency

— Improved branching

Our new kernel is 5x faster than our first implementation

Nsight helped us a lot


ITERATION 6


Next Kernel

We are satisfied with the performance of spmv_kernel

We move to the next kernel: jacobi_smooth


What Have You Seen?

An iterative method to optimize your GPU code

— Trace your application

— Identify the hot spot and profile it

— Identify the performance limiter

— Optimize the code

— Iterate

A way to conduct that method with Nsight VSE

