57
Jakob Progsch, Mathias Wagner GTC 2018 S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS

S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

Jakob Progsch, Mathias Wagner

GTC 2018

S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS

Page 2: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

2

BEFORE YOU START

1. Know your hardware

• What are the target machines, how many nodes? Machine-specific optimizations okay?

2. Know your tools

• Strengths and weaknesses of each tool? Learn how to use them (and learn one well!)

3. Know your application

• What does it compute? How is it parallelized? What final performance is expected?

4. Know your process

• Performance optimization is a constant learning process

5. Make it so!

The five steps to enlightenment

Page 3: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

3

THE APOD CYCLE

1. Assess

• Identify Performance Limiter

• Analyze Profile

• Find Indicators

2. Parallelize3. Optimize

3b. Build Knowledge

4. Deploy

and Test

Page 4: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

4

Scope

GUIDING OPTIMIZATION EFFORT

• Challenge: How to know where to start?

• Top-down Approach:

• Find Hotspot Kernel

• Identify Performance Limiter of the Hotspot

• Find performance bottleneck indicators related to the limiter

• Identify associated regions in the source code

• Come up with strategy to fix and change the code

• Start again

“Drilling Down into the Metrics”

Page 5: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

5

KNOW YOUR HARDWARE:VOLTA ARCHITECTURE

Page 6: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

6

VOLTA V100 FEATURES

Volta Architecture

Most Productive GPU

Tensor Core

120 Programmable

TFLOPS Deep Learning

Improved SIMT Model

New Algorithms

Volta MPS

Inference Utilization

Improved NVLink &

HBM2

Efficient Bandwidth

Page 7: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

7

GPU COMPARISON

P100 (SXM2) V100 (SXM2)

Double/Single/Half TFlop/s 5.3/10.6/21.2 7.8/15.7/125 (TensorCores)

Memory Bandwidth (GB/s) 732 900

Memory Size 16GB 16GB

L2 Cache Size 4096 KB 6144 KB

Base/Boost Clock (Mhz) 1328/1480 1312/1530

TDP (Watts) 300 300

Page 8: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

8

VOLTA GV100 SM

GV100

FP32 units 64

FP64 units 32

INT32 units 64

Tensor Cores 8

Register File 256 KB

Unified L1/Shared

memory

128 KB

Active Threads 2048

Page 9: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

9

Shared

Memory64 KB

L1$24 KB

L2$4 MB

Load/Store Units

Pascal SM

L2$6 MB

Load/Store Units

Volta SM

L1$ and Shared Memory128 KBLow Latency

Streaming

IMPROVED L1 CACHE

Page 10: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

10

KNOW YOUR TOOLS:PROFILERS

Page 11: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

11

PROFILING TOOLS

From NVIDIA

• nvprof

• NVIDIA Visual Profiler (nvvp)

• Nsight Visual Studio Edition

Coming Soon:

• NVIDIA Nsight Systems

• NVIDIA Nsight Compute

Third Party

• TAU Performance System

• VampirTrace

• PAPI CUDA component

• HPC Toolkit

• (Tools using CUPTI)

Many Options!

Without loss of generality, in this talk we will be showing nvvp screenshots

Page 12: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

12

THE NVVP PROFILER WINDOW

Timeline

Analysis Results

Summary

Guide

Page 13: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

13

KNOW YOUR APPLICATION: HPGMG

Page 14: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

14

HPGMGHigh-Performance Geometric Multi-Grid, Hybrid Implementation

Fine levels are executed on throughput-optimized processors (GPU)

Coarse levels are executed on latency-optimized processors (CPU)

3/24/2018

GPU

CPU

THRESHOLD

F-CYCLEV-CYCLE

DIRECT SOLVE

SMOOTHER

& RESIDUAL

SMOOTHER

& RESIDUAL

SMOOTHER

SMOOTHER

http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/

Page 15: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

15

MAKE IT SO:ITERATION 1

2ND ORDER 7-POINT STENCIL

Page 16: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

16

Identify the hotspot: smooth_kernel()

IDENTIFY HOTSPOT

Hotspot

Kernel Time Speedup

Original Version 2.079ms 1.00x

Page 17: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

1717

IDENTIFY PERFORMANCE LIMITER

Memory utilization

Compute utilization

Page 18: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

18

Memory Utilization vs Compute Utilization

Four possible combinations:

PERFORMANCE LIMITER CATEGORIES

Comp Mem

Compute

Bound

Comp Mem

Bandwidth

Bound

Comp Mem

Latency

Bound

Comp Mem

Compute and

Bandwidth

Bound

Page 19: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

1919

LATENCY BOUND ON P100

Page 20: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

2020

BANDWIDTH BOUND ON V100

Page 21: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

2121

DRILLING DOWN: LATENCY ANALYSIS (V100)

The profiler warns about

low occupancy

Limited by block size of only 8x4=32 threads

Page 22: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

22

OCCUPANCY

Each SM has limited resources:

• max. 64K Registers (32 bit) distributed between threads

• max. 48KB (96KB opt in) of shared memory per block (96KB per SMM)

• max. 32 Active Blocks per SMM

• Full occupancy: 2048 threads per SM (64 warps)

When a resource is used up, occupancy is reduced

GPU Utilization

Values vary with Compute Capability

Page 23: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

23

LATENCY

GPUs cover latencies by having a lot of work in flight

warp 0

warp 1

warp 2

warp 3

warp 4

warp 5

warp 6

warp 7

warp 8

warp 9

The warp issues

The warp waits (latency)

Fully covered latencywarp 0

warp 1

warp 2

warp 3

No warp issues

Exposed latency, not enough warps

Page 24: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

24

LATENCY AT HIGH OCCUPANCY

Many active warps but with high latency instructions

Exposed latency at high occupancy

No warp issuing

warp 0

warp 1

warp 2

warp 3

warp 4

warp 5

warp 6

warp 7

warp 8

warp 9

Page 25: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

2525

LOOKING FOR MORE INDICATORS

12 Global Load

Transactions per 1 RequestFor line numbers use:nvcc -lineinfo

Source Code Association

Page 26: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

26

MEMORY TRANSACTIONS: BEST CASE

A warp issues 32x4B aligned and consecutive load/store request

Threads read different elements of the same 128B segment

1x L1 transaction: 128B needed / 128B transferred

4x L2 transactions: 128B needed / 128B transferred

1x 128B L1 transaction per warp

4x 32B L2 transactions per warp

1x 128B load/store request per warp

Page 27: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

27

MEMORY TRANSACTIONS: WORST CASE

Threads in a warp read/write 4B words, 128B between words

Each thread reads the first 4B of a 128B segment

32x L1 transactions: 128B needed / 32x 128B transferred

32x L2 transactions: 128B needed / 32x 32B transferred

1x 128B L1 transaction per thread

1x 32B L2 transaction per thread

1x 128B load/store request per warpStride: 32x4B thread 2

Page 28: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

28

TRANSACTIONS AND REPLAYS

With replays, requests take more time and use more resources

More instructions issued

More memory traffic

Increased execution time

Inst. 0

Issued

Inst. 1

Issued

Inst. 2

Issued

Execution time

Threads

0-7/24-31

Threads

8-15

Threads

16-23

Inst. 0

Completed

Inst. 1

Completed

Inst. 2

Completed

Threads

0-7/24-31

Threads

8-15

Threads

16-23

Transfer data for inst. 0

Transfer data for inst. 1

Transfer data for inst. 2

Extra latencyExtra work (SM)

Extra memory traffic

Page 29: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

2929

FIX: BETTER GPU TILINGBefore After

Block Size Up

Memory Utilization Up

Transactions Per Access Down

Kernel Time Speedup

Original Version 2.079ms 1.00x

Better Memory Accesses 1.756ms 1.18x

+10%

Page 30: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

3030

Page 31: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

31

ITERATION 2: DATA MIGRATION

Page 32: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

32

PAGE FAULTS Details

Page 33: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

33

MEMORY MANAGEMENTUsing Unified Memory

No changes to data structures

No explicit data movements

Single pointer for CPU and GPU data

Use cudaMallocManaged for allocations

3/24/2

018

Developer View With Unified Memory

Unified Memory

Page 34: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

34

Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory)

UNIFIED MEMORYEliminating page migrations and faults

3/24/2

018

GPU

CPU

THRESHOLD

F-CYCLE

Page faults

Page 35: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

35

PAGE FAULTS Almost gone

Page 36: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

36

PAGE FAULTSSignificant speedup for affected kernel

Page 37: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

37

MEM ADVICE APINot used here

cudaMemPrefetchAsync(ptr, length, destDevice, stream)

Migrate data to destDevice: overlap with computeUpdate page table: much lower overhead than page fault in kernelAsync operation that follows CUDA stream semantics

cudaMemAdvise(ptr, length, advice, device)

Specifies allocation and usage policy for memory regionUser can set and unset at any time

3/24/2018

Page 38: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

38

ITERATION 3:REGISTER OPTIMIZATION AND CACHING

Page 39: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

3939

LIMITER: STILL MEMORY BANDWIDTH

Page 40: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

40

SM

Functional Units

Register File

SM

Functional Units

Register File

GPU MEMORY HIERARCHYV100

Global Memory (Framebuffer)

L2$

Bring reused data closer to

the SMs

• Registers (256 KB/SM): good for intra-thread data reuse

• Shared mem / L1$ (128 KB/SM): good for explicit intra-block data reuse

• L2$ (6144 KB): implicit data reuse

Shared Memory /

L1$

Shared Memory /

L1$

Page 41: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

41

CACHING IN REGISTERSNo data loaded initially

3/24/2018

Page 42: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

42

CACHING IN REGISTERSLoad first set of data

3/24/2018

load

Page 43: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

43

CACHING IN REGISTERSPerform calculation

3/24/2018

Stencil

Page 44: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

44

CACHING IN REGISTERSNaively load next set of data?

3/24/2018

load

Page 45: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

45

CACHING IN REGISTERSReusing already loaded data is better

3/24/2018

load

keep

keep

Page 46: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

46

CACHING IN REGISTERSRepeat

3/24/2018

Stencil

Higher register usage may

result in reduced

occupancy => trade off

(run experiments!)

Page 47: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

4747

THE EFFECT OF REGISTER CACHING

Transactions for cached

loads reduced by a

factor of 8

Memory utilization still

high, but transferring less

redundant data

Kernel Time Speedup

Original Version 2.079ms 1.00x

Better Memory Accesses 1.756ms 1.18x

Register Caching 1.486ms 1.40x

Page 48: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

48

SHARED MEMORY

Programmer-managed cache

Great for caching data reused across threads in a CTA

128KB split between shared memory and L1 cache per SM

Each block can use at most 96KB shared memory on GV100

Search for cudaFuncAttributePreferredSharedMemoryCarveout in the docs

__global__ void sharedMemExample(int *d) {__shared__ float s[64];int t = threadIdx.x;s[t] = d[t];__syncthreads();if(t>0 && t<63)

stencil[t] = -2.0f*s[t] + s[t-1] + s[t+1];}

global

global

registers

registers

shared

Page 49: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

49

Page 50: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

50

ITERATION 4:KERNELS WITH INCREASED

ARITHMETIC INTENSITY

Page 51: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

51

OPERATIONAL INTENSITY

• Operational intensity = arithmetic operations/bytes written and read

• Our stencil kernels have very low operational intensity

• It might be beneficial to use a different algorithm with higher operational intensity.

• In this case this might be achieved by using higher order stencils

Page 52: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

52

ILP VS OCCUPANCY

• Earlier we looked at how occupancy helps hide latency by providing independent threads of execution.

• When our code requires many registers the occupancy will be limited but we can still get instruction level parallelism inside the threads.

• Occupancy is helpful to achieving performance but not alwaysrequired

• Some algorithms such as matrix multiplications allow increases in operational intensity by using more registers for local storage while simultaneously offering decent ILP.In these cases it might be beneficial to maximize ILP andoperational intensity at the cost of occupancy.

a = b + c;d = e + f;

a = b + c;d = a + f;

Independent instr.

Dependent instr.

Page 53: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

53

Page 54: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

54

SUMMARY

Page 55: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

55

SUMMARY

1. Know your application

2. Know your hardware

3. Know your tools

4. Know your process

• Identify the Hotspot

• Classify the Performance Limiter

• Look for indicators

5. Make it so!

Performance Optimization is a Constant Learning Process

Page 56: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

56

REFERENCESCUDA Documentation

Best Practices: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Pascal Tuning Guide: http://docs.nvidia.com/cuda/pascal-tuning-guide

Volta Tuning Guide: http://docs.nvidia.com/cuda/volta-tuning-guide/

NVIDIA Developer Blog

http://devblogs.nvidia.com/

Pointers to GTC 2018 Sessions:

S8718 - Optimizing HPC Simulation and Visualization Codes using the NVIDIA System Profiler (previous talk, check out recording)

S8430 - Everything You Need to Know About Unified Memory (Tue, 4:30PM)

S8106 - Volta: Architecture and Performance Optimization (Thur, 10:30 AM)

S8481 - CUDA Kernel Profiling: Deep-Dive Into NVIDIA's Next-Gen Tools (Thur, 11:00 AM)

Page 57: S8630 - What the Profiler is telling you: Optimizing GPU ...on-demand.gputechconf.com/gtc/2018/presentation/s... · THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT

developer.nvidia.com/join