
Page 1

Most of the slides are from: GTC’18 – S81006 and GTC’19 – S9234

Mohammad Motamedi, NVIDIA

Page 2

VOLTA V100

Per Streaming Multiprocessor

• 64 FP32 lanes

• 32 FP64 lanes

• 64 INT32 lanes

• 16 SFU lanes (transcendental)

• 32 LD/ST lanes (G-mem/L-mem/S-mem)

• 8 Tensor Cores

• 4 TEX lanes

[Diagram: 80 SMs, each with its own L1 cache, sharing a common L2 cache and DRAM]

Page 3

SM Resources

Thread blocks require registers and shared memory.

SM can schedule any resident warp without context switching.

Volta limits per SM

• Register file size: 256 KB.

• Maximum shared memory: 96 KB.

• Maximum number of threads: 2048.

• Maximum number of warps: 64.

• Maximum number of blocks: 32.

Page 4

Performance Constraints

Compute bound – saturates compute units

• Reduce the number of instructions executed

• Vector types, intrinsic operations, tensor cores, FMAs.

Bandwidth bound – saturates memory bandwidth

• Optimize access pattern

• Use lower precision

Latency bound

• Increase the number of instructions / mem accesses in flight

Page 5

Little’s Law for Escalators

Our escalator parameters

• 1 person per step

• A step arrives every 2 seconds → bandwidth: 0.5 person/s

• 20 steps tall → latency: 40 seconds

Page 6

Little’s Law for Escalators

• One person at a time? Achieved bandwidth = 0.025 person/s.

• To saturate bandwidth: one person must arrive with every step, i.e. 20 persons in flight.

• Need bandwidth × latency persons in flight.

A step arrives every 2 seconds → bandwidth: 0.5 person/s; 20 steps tall → latency: 40 seconds.

Page 7

Optimization Goals

Saturating the compute units

• a = b + c → arithmetic intensity = 1 / (3 × 4) F/B ≈ 0.08 F/B

• GPU’s capability = 15 TFLOP/s / 900 GB/s ≈ 16.6 F/B

• It is hard to saturate the compute units.

• One can improve it by saturating the memory bandwidth.

Saturating the memory bandwidth

• Hiding latency is the key to achieving this goal.

Page 8

Memory Bandwidth

[Plot: % of peak bandwidth vs. bytes in flight per SM, V100]

Volta reaches 90% of peak bandwidth with ~6KB of data in flight per SM.
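As a rough cross-check with Little’s law (the ~400 ns DRAM latency used here is an assumed ballpark, not a number from the slides): bytes in flight = bandwidth × latency ≈ 900 GB/s × 400 ns ≈ 360 KB for the whole GPU, i.e. about 4.5 KB per SM over 80 SMs, the same order of magnitude as the ~6 KB measured above.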

Page 9

Little’s Law

Takeaways

Instructions in flight = (instructions in flight per thread) × (threads executing)

The first factor is instruction-level parallelism; the second is occupancy.

Page 10

Higher occupancy hides latency: the SM has more candidate warps to schedule while other warps are waiting for their instructions to complete.

Occupancy = (achieved number of threads per SM) / (maximum number of threads per SM)

• Use the profiler to compute it.

Achieved occupancy vs. theoretical occupancy: need to run enough thread blocks to fill all the SMs.

Be mindful of diminishing returns.

Occupancy
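A minimal sketch of querying this from the CUDA runtime (the kernel name and block size below are placeholders, not from the slides):

// Sketch: theoretical occupancy of a hypothetical kernel at block size 256.
#include <cstdio>

__global__ void myKernel(float *x) { /* ... */ }

int main() {
    int device = 0, maxThreadsPerSM = 0, blocksPerSM = 0;
    const int blockSize = 256;
    cudaDeviceGetAttribute(&maxThreadsPerSM,
                           cudaDevAttrMaxThreadsPerMultiProcessor, device);
    // Resident blocks of `blockSize` threads per SM, given the kernel's
    // register usage and 0 bytes of dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0);
    printf("Theoretical occupancy: %.0f%%\n",
           100.0 * blocksPerSM * blockSize / maxThreadsPerSM);
    return 0;
}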

Page 11

Instruction Issue

Instructions are issued in-order

If an instruction is not eligible, it stalls the warp.

An instruction is eligible for issue if both are true:

• A pipeline is available for execution

Some pipelines need multiple cycles to issue a warp.

• All the arguments are ready

An argument is not ready if a previous instruction has not produced it yet.

Page 12

Instruction Issue Example

__global__ void kernel(float *a, float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] += a[i] * b[i];
}

Generated SASS:

LDG.E  R2, [R2];
LDG.E  R4, [R4];
LDG.E  R9, [R6];
FFMA   R9, R2, R4, R9;   // stall: waits for the three loads
STG.E  [R6], R9;         // stall: waits for the FFMA

12 B / thread in flight

Page 13

Computing 2 values per thread

__global__ void kernel(float2 *a, float2 *b, float2 *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i].x += a[i].x * b[i].x;
    c[i].y += a[i].y * b[i].y;
}

Generated SASS:

LDG.E.64  R2, [R2];
LDG.E.64  R4, [R4];
LDG.E.64  R6, [R8];
FFMA      R7, R3, R5, R7;   // stall: waits for the loads
FFMA      R6, R2, R4, R6;   // 2 independent instructions
STG.E.64  [R8], R6;         // stall: waits for both FFMAs

24 B / thread in flight
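Pushing the same idea one step further (an added sketch, not a slide from the deck): computing four values per thread with float4 puts 48 B per thread in flight and exposes four independent FMAs. Assumes 16-byte-aligned arrays whose length is a multiple of 4.

__global__ void kernel4(float4 *a, float4 *b, float4 *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 va = a[i], vb = b[i], vc = c[i];   // three 16 B loads in flight
    vc.x += va.x * vb.x;                      // four independent FMAs
    vc.y += va.y * vb.y;
    vc.z += va.z * vb.z;
    vc.w += va.w * vb.w;
    c[i] = vc;
}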

Page 14

Control Flow

Blocks of threads, warps

• Single Instruction Multiple Threads (SIMT) model.

• CUDA hierarchy: Grid -> Blocks -> Threads.

• One warp = 32 threads.

• Why does it matter? Many optimizations depend on behavior at the warp level.

Page 15

Control Flow

• Thread blocks can have 1D, 2D, or 3D representation. Threads are linear in hardware.

• 32 consecutive threads belong to the same warp.

Mapping threads

80 threads: 40 threads in X, 2 rows of threads in Y.

[Diagram: the 40 x 2 thread block]

Page 16

Control Flow

• Thread blocks can have 1D, 2D, or 3D representation. Threads are linear in hardware.

• 32 consecutive threads belong to the same warp.

Mapping threads

80 threads: 40 threads in X, 2 rows of threads in Y.

3 warps (96 threads); 16 inactive threads in the 3rd warp.

[Diagram: warp 1 = the first 32 threads of row 0; warp 2 = the last 8 threads of row 0 plus the first 24 threads of row 1; warp 3 = the last 16 threads of row 1 plus 16 inactive lanes]

Page 17

Control Flow

• Different warps can execute different code. No impact on performance: each warp maintains its own program counter.

• Different code paths inside the same warp? Threads that do not participate are masked out, but the whole warp executes both sides of the branch.

Divergence

Page 18

Control Flow

[Diagram: the 40 x 2 thread block from the previous slide, labeled with warp numbers 1, 2, 2, 3, 3 along threadIdx.x = 0..39 and threadIdx.y = 0..1]

A;
if (threadIdx.y == 0)
    B;
else
    C;
D;

[Diagram: a per-warp execution timeline ("Instructions, time") for warps 1-3, filled in over the next ten slides]

Page 19

Control Flow

(Pages 19-28 fill in the per-warp execution timeline for the code and warp layout of Page 18.)

Timeline so far: Warp 1: A.

Page 20

Control Flow

Timeline so far: Warp 1: A, B.

Page 21

Control Flow

Timeline so far: Warp 1: A, B, D.

Page 22

Control Flow

Timeline so far: Warp 1: A, B, D. Warp 2: A.

Page 23

Control Flow

Timeline so far: Warp 1: A, B, D. Warp 2: A, B.

Page 24

Control Flow

Timeline so far: Warp 1: A, B, D. Warp 2: A, B, C.

Page 25

Control Flow

Timeline so far: Warp 1: A, B, D. Warp 2: A, B, C, D (divergent: both sides of the branch are executed).

Page 26

Control Flow

Timeline so far: Warp 1: A, B, D. Warp 2: A, B, C, D. Warp 3: A.

Page 27

Control Flow

Timeline so far: Warp 1: A, B, D. Warp 2: A, B, C, D. Warp 3: A, C.

Page 28

Control Flow

Final timeline: Warp 1: A, B, D. Warp 2: A, B, C, D. Warp 3: A, C, D.

Warps 1 and 3 take only one side of the branch; warp 2 straddles both rows of the block, so it diverges and executes both B and C, with the non-participating threads masked out.

Page 29

Control Flow

Problem: 2D convolution with padding on an input image of size 1024x1024.

• Divergence in the first and last rows: none.

• Divergence in the other rows:

• 2-way divergence in first and last warp of each row.

• Total warps with divergence: 1022 x 2 = 2044.

• 6% of threads will have a 2-way divergence.

Case study: Thread divergence

Page 30

Control Flow

• Not every branch is a divergence.

• Minimize thread divergence inside a warp.

• Divergence between warps is fine.

• Maximize “useful” cycles for each thread (maximize # of threads executing + minimize thread divergence).

• Do not call block-wide or warp-wide operations (e.g. __syncthreads()) on a divergent branch.

Takeaways
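A minimal sketch of the last point (kernel and names are illustrative, not from the slides): __syncthreads() must be reached by every thread of the block, so keep it out of divergent branches.

__global__ void block_sum(const int *in, int *out, int n) {
    __shared__ int tile[256];                 // assumes blockDim.x <= 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // BAD (illustrative): threads with i >= n would never reach the barrier.
    // if (i < n) { tile[threadIdx.x] = in[i]; __syncthreads(); }

    // OK: every thread of the block reaches the same __syncthreads().
    tile[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int k = 0; k < blockDim.x; ++k) sum += tile[k];
        out[blockIdx.x] = sum;
    }
}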

Page 31

Intrinsic Functions

• Fast but less accurate math intrinsic functions are available.

• 2 ways to use the intrinsic functions

• Whole file: compile with nvcc --use_fast_math.

• Individual calls, e.g. __sinf(x), __logf(x), __fdividef(x,y).

• The programming guide has a list of intrinsic functions and their impact on accuracy.
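A minimal sketch of the trade-off (kernel names are illustrative): the intrinsics map to a few SFU instructions, while the standard functions are slower but more accurate software sequences.

__global__ void accurate(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sinf(x[i]) / (1.0f + x[i] * x[i]);   // full-precision path
}

__global__ void fast(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // __sinf and __fdividef trade a few ULPs of accuracy for higher throughput.
    if (i < n) x[i] = __fdividef(__sinf(x[i]), 1.0f + x[i] * x[i]);
}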

Page 32

Tensor Cores

Dedicated matrix multiplication pipeline.

Input precision: FP16.

Peak Performance: 125 TFLOPS.

Used in CUBLAS, CUDNN, CUTLASS. Optimized libraries can reach ~90% of peak.

Exposed in CUDA.

Volta/Turing

Page 33

Tensor Cores

Warp Matrix Multiply Add (WMMA)

• Warp-wide macro-instructions.

• All threads in the warp must be active.

Performs matrix multiplication on 16x16 tiles (8x32x16 and 32x8x16 tiles also available).

D = A × B + C. A and B: FP16 only. C and D: same type, either FP16 or FP32.

[Diagram: 16 × 16 tiles: D = A × B + C]

Using Tensor Cores in your CUDA code

Page 34

Tensor Cores

Each warp processes a 16x16 output tile

Each warp: loop over all input tiles Ak and Bk

C = C + Ak × Bk

Write the output tile.

Typical use

[Diagram: the C tile accumulated from a row of A tiles and a column of B tiles]

Page 35

Tensor Cores

Each warp processes a 16x16 output tile

Each warp: loop over all input tiles Ak and Bk

C = C + Ak × Bk

Write the output tile.

Typical use

[Diagram: the C tile accumulated from a row of A tiles and a column of B tiles]

Can compute several tiles per block, with inputs staged in shared memory.
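A minimal WMMA sketch for a single 16x16x16 tile (a simplification of the loop above; layouts and leading dimensions are assumptions):

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile: D = A x B, A/B in FP16, accumulator
// in FP32. Assumes row-major data with leading dimension 16.
__global__ void wmma_tile(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // all 32 threads of the warp participate
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // C += A x B
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}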

Page 36

Tensor Cores

Case Study: Deep Learning

Source: https://arxiv.org/pdf/1710.03740.pdf

Scaling is required in the backward pass.

Page 37

Volta’s Memory System

V100

[Diagram: 80 SMs, each with an L1 cache, sharing an L2 cache, DRAM, and a PCIe link]

• 80 Streaming Multiprocessors, 256 KB register file each (20 MB total).

• Unified shared memory / L1 cache: 128 KB per SM, variable split (~10 MB total, 14 TB/s).

• 6 MB L2 cache (2.5 TB/s read, 1.6 TB/s write).

• 16/32 GB HBM2 (900 GB/s), “free” ECC.

• PCIe.

Page 38

Memory System

By default, the driver uses the configuration that maximizes occupancy.

Shared Memory / L1 split per SM:

  Volta             Turing
  96 KB / 32 KB     64 KB / 32 KB
  64 KB / 64 KB     32 KB / 64 KB
  32 KB / 96 KB
  16 KB / 112 KB
  8 KB / 120 KB
  0 KB / 128 KB

Examples:

• 0 KB shared memory per block, other resources allow 16 blocks/SM:
  Volta config 0/128 → 16 blocks/SM; Turing config 32/64 → 16 blocks/SM.

• 40 KB shared memory per block, other resources allow 4 blocks/SM:
  Volta config 96/32 → 2 blocks/SM; Turing config 64/32 → 1 block/SM.
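A minimal sketch of requesting a split explicitly (kernel name and sizes are placeholders; the carveout is a hint to the driver):

__global__ void myKernel(float *data) {
    extern __shared__ float smem[];   // dynamic shared memory
    // ...
}

void launch(float *d_data) {
    const size_t smemBytes = 96 * 1024;   // 96 KB per block (Volta maximum)
    // Hint the driver toward the shared-memory-heavy carveout for this kernel.
    cudaFuncSetAttribute(myKernel, cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    // Opt in to dynamic shared memory beyond the default 48 KB.
    cudaFuncSetAttribute(myKernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smemBytes);
    myKernel<<<80, 256, smemBytes>>>(d_data);
}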

Page 39

Cache Lines and Sectors

Sector-level memory access granularity.

• Sector size in Maxwell, Pascal, and Volta: 32B.

• Sector size in Kepler and before: variable (32 or 128).

A cache line is 128 Bytes, made of 4 sectors.

Moving data between L1, L2, DRAM

[Diagram: a 128-byte, 128-byte-aligned cache line made of Sector 0 | Sector 1 | Sector 2 | Sector 3]

Page 40

Memory Reads

Getting Data from Global Memory

[Diagram: SMs with L1 caches, shared L2 cache, DRAM]

• Check whether the data is in L1 (if not, check L2).

• Check whether the data is in L2 (if not, fetch it from DRAM).

• Unit of data moved: sectors.

Page 41

Memory Writes

[Diagram: SMs with L1 caches, shared L2 cache, DRAM]

Before Volta: writes were not cached in L1. Volta+: L1 caches writes.

L1 is write-through: writes go to both L1 AND L2.

L2 is write-back: data is flushed to DRAM only when needed.

Partial writes are supported (masked portion of sector, but behavior can change with ECC on/off).

Page 42

L1, L2 Caches

In general, not for cache blocking

• 100s to 1000s of threads running per SM; tens of thousands of threads sharing the L2 cache. L1 and L2 are small per thread. E.g. at 2048 threads/SM with 80 SMs: 64 bytes of L1 and 38 bytes of L2 per thread. Running at lower occupancy increases the bytes of cache per thread.

• Shared memory is usually a better option to cache data explicitly: user-managed, no evictions out of your control.

Why Does GPU Have Caches?

Page 43

L1, L2 Caches

Caches on GPUs are useful for:

• “Smoothing” irregular, unaligned access patterns.

• Caching common data accessed by many threads.

• Faster register spills, local memory.

• Fast atomics.

• Codes that do not use shared memory (naïve code).

Why Does GPU Have Caches?

Page 44

Access Patterns

For each warp: how many sectors are needed? Depends on the addresses, the active threads, and the access size. Natural element sizes = 1 B, 2 B, 4 B, 8 B, 16 B.

Warps and Sectors

[Diagram: threads 0-31 of a warp accessing consecutive memory addresses]

4-byte element access → 4 sectors.

Page 45

Access Patterns

Warps and Sectors

[Diagram: threads 0-31 of a warp accessing consecutive memory addresses]

8-byte element access → 8 sectors.

Examples of 8-byte elements: long long, int2, double, float2.

Page 46

Access Patterns

Warps and Sectors

[Diagram: threads 0-31 of a warp; a different arrangement of 4-byte accesses that still touches only 4 sectors]

4-byte access → 4 sectors.

Page 47

Access Patterns

Warps and Sectors

[Diagram: threads 0-31 of a warp accessing consecutive addresses, not 128-byte aligned]

4-byte access, unaligned → 5 sectors.

128 Bytes requested; 160 bytes read (80% efficiency).

Page 48

Access Patterns

Warps and Sectors

[Diagram: the same unaligned 4-byte access (5 sectors); the next warp begins where this one ends]

4-byte access, unaligned → 5 sectors.

With >1 warp per block, this sector might be found in L1 or L2.

Page 49

Access Patterns

Warps and Sectors

[Diagram: all 32 threads of the warp read the same address]

Same address → 1 sector.

Page 50

Access Patterns

Warps and Sectors

[Diagram: threads 0-31 of a warp with a strided access pattern, one sector touched per thread]

4-byte strided access → 32 sectors.

128 bytes requested; 1024 bytes transferred. Only a few bytes of each sector are used: lots of bandwidth is wasted.

Page 51

Access Patterns

• Know your access patterns.

• Use the profiler (metrics, counters) to check how many sectors are moved. Is that what you expect? Is it optimal?

• Using the largest type possible (e.g. float4) will maximize the number of sectors moved per instruction.

Takeaways

Page 52

Shared Memory Addressing

• Number of banks: 32; bank width: 4 B.

• Mapping addresses to banks:
  - Successive 4 B words go to successive banks.
  - Bank index = (4 B word index) % 32 = ((byte address) / 4) % 32.
  - An 8 B word spans two successive banks.
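A minimal sketch of the mapping (a helper written for illustration, not part of the slides):

// Which of the 32 shared-memory banks does a given byte address fall into?
__host__ __device__ inline int bankOf(unsigned byteAddr) {
    return (byteAddr / 4) % 32;   // 32 banks, each 4 bytes wide
}
// bankOf(0) == 0, bankOf(4) == 1, ..., bankOf(124) == 31, bankOf(128) == 0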

Page 53

Logical View Of Shared Memory Banks

With 4-byte data:

[Diagram: 32 banks side by side. Bank 0 holds the 4-byte words at byte addresses 0, 128, 256, 384, ...; bank 1 holds those at 4, 132, 260, ...; bank 31 holds those at 124, 252, ... Each 128-byte row of shared memory spans all 32 banks.]

Page 54

Shared Memory Bank Conflicts

A bank conflict occurs when, inside a warp, 2 or more threads access different 4 B words in the same bank. Think: 2 or more threads access different “rows” of the same bank.

N-way bank conflict: N threads in a warp conflict.
- Increases latency.
- Worst case: 32-way conflict → 31 replays.
- Each replay adds a few cycles of latency.

There is no bank conflict if:
- Several threads access the same 4-byte word.
- Several threads access different bytes of the same 4-byte word.

Page 55

No Bank Conflicts

[Diagram: threads T-0 ... T-31 each access a 4-byte word in a distinct bank: no conflict]

Pages 56-58

No Bank Conflicts

[Diagrams: three more access patterns in which each warp touches at most one 4-byte word per bank: no conflicts]

Page 59

2-way Bank Conflict

[Diagram: two threads of the warp access different 4-byte words in the same bank; the access is replayed once]

Page 60

2-way Bank Conflict

[Diagram: another access pattern with a 2-way conflict]

Page 61

3-way Bank Conflict

[Diagram: three threads of the warp access different 4-byte words in the same bank; the access is replayed twice]

Page 62

Bank Conflict and Access Phase

4B or smaller words

• Process addresses of all threads in a warp in a single phase.

8B words are accessed in 2 phases

• Process addresses of the first 16 threads in a warp.

• Process addresses of the second 16 threads in a warp.

16B words are accessed in 4 phases

• Each phase processes a quarter of a warp.

Bank conflicts occur only between threads in the same phase

Page 63

8B words, No Conflicts

[Diagram: threads T-0 ... T-15 (phase 1) and T-16 ... T-31 (phase 2) each read an 8-byte word; within each phase, no two threads touch different words in the same bank: no conflict]

Page 64

8B words, 2-way Conflict

[Diagram: in phase 1 (threads T-0 ... T-15) two threads touch different 8-byte words that map to the same banks: 2-way conflict. Phase 2 (threads T-16 ... T-31) has no conflict.]

Page 65

Case Study: Matrix Transpose

Staged via SMEM to coalesce GMEM addresses

32x32 blocks, single-precision values.

32x32 array in shared memory.

Initial implementation:

A warp reads a row from GMEM, writes to a row of SMEM.

Synchronize the threads in a block.

A warp reads a column from SMEM, writes to a row in GMEM.

Page 66

Case Study: Matrix Transpose

32x32 SMEM array (e.g. __shared__ float sm[32][32])

Warp accesses a row: no conflict.

Warp accesses a column: 32-way conflict.

[Diagram: the 32x32 array mapped onto banks 0-31. Each row of the array spans all 32 banks, so an entire column of the array lands in a single bank. The number identifies which warp is accessing the data; the color indicates in which bank the data resides.]

Page 67

Case Study: Matrix Transpose

Solution: add a column of padding: 32x33 (e.g. __shared__ float sm[32][33]).

Warp accesses a row or a column: no conflict.

[Diagram: with the padding column, consecutive elements of an array column fall in different banks. The number identifies which warp is accessing the data; the color indicates in which bank the data resides.]

Speedup: 1.3x
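A minimal sketch of the padded-tile transpose described above (grid/block setup and boundary checks are assumptions; the tile size follows the slide):

#define TILE 32

// out = transpose(in) for an n x n matrix, staged through shared memory so
// that both the global reads and the global writes are coalesced.
__global__ void transpose(const float *in, float *out, int n) {
    __shared__ float sm[TILE][TILE + 1];   // +1 column of padding: no bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        sm[threadIdx.y][threadIdx.x] = in[y * n + x];    // warp reads a row of GMEM

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = sm[threadIdx.x][threadIdx.y];   // column of SMEM -> row of GMEM
}
// Launched as transpose<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(in, out, n);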

Page 68

Summary: Shared Memory

Shared memory is a precious resource

Very high bandwidth (14 TB/s), much lower latency than Global Memory.

Data is programmer-managed, no evictions by hardware.

Volta: up to 96KB of shared memory per thread block.

4B granularity.

Performance issues to look out for

Bank conflicts add latency and reduce throughput.

Use profiling tools to identify bank conflicts.

Page 69

2D Stencil Experiment

index = iy * nx + ix;

res = coef[0] * in[index];

for (i = 1; i <= RADIUS; i++)
    res += coef[i] * (in[index - i] + in[index + i] +
                      in[index - i * nx] + in[index + i * nx]);

out[index] = res;

with and without Shared Memory

With Shared Memory:

• Load the input array and halos in shared memory

• __syncthreads()

• Compute the stencil from the shared memory
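A minimal sketch of the shared-memory variant (block shape, halo loading, and the assumption that blocks stay inside the padded interior are editorial choices, not from the slide):

#define RADIUS 4
#define BX 32
#define BY 8

__global__ void stencil_smem(const float *in, float *out,
                             const float *coef, int nx) {
    __shared__ float tile[BY + 2 * RADIUS][BX + 2 * RADIUS];

    int ix = blockIdx.x * BX + threadIdx.x;
    int iy = blockIdx.y * BY + threadIdx.y;
    int tx = threadIdx.x + RADIUS, ty = threadIdx.y + RADIUS;
    int index = iy * nx + ix;

    tile[ty][tx] = in[index];                       // interior point
    if (threadIdx.x < RADIUS) {                     // halos in x
        tile[ty][tx - RADIUS] = in[index - RADIUS];
        tile[ty][tx + BX]     = in[index + BX];
    }
    if (threadIdx.y < RADIUS) {                     // halos in y (needs BY >= RADIUS)
        tile[ty - RADIUS][tx] = in[index - RADIUS * nx];
        tile[ty + BY][tx]     = in[index + BY * nx];
    }
    __syncthreads();

    float res = coef[0] * tile[ty][tx];
    for (int i = 1; i <= RADIUS; i++)
        res += coef[i] * (tile[ty][tx - i] + tile[ty][tx + i] +
                          tile[ty - i][tx] + tile[ty + i][tx]);
    out[index] = res;
}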

Page 70

2D – Small stencils

Relative speed of L1 implementation versus Smem implementation

RADIUS=1: Volta 103%, Pascal 78%

RADIUS=2: Volta 102%, Pascal 55%

L1 implementation is faster than Shared Memory on Volta!

Page 71

2D – Larger Stencils

Relative speed of L1 implementation versus Smem implementation

RADIUS=4: Volta 94%, Pascal 40%

RADIUS=8: Volta 95%, Pascal 32%

RADIUS=16: Volta 79%, Pascal 29%

Shared Memory implementation is always faster for larger stencils

Page 72

Local Memory

• Volta’s max registers per thread: 255.

• Option to use a smaller number of registers: -maxrregcount.

• Developers might use it to increase occupancy

• Issues

• Increased memory traffic

• Increased instruction counts

• Solutions

• Increase register count, increase L1 cache size, careful non-caching loads.
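A per-kernel alternative to the global -maxrregcount flag (an added note, not on the slide) is __launch_bounds__, which limits register usage only where the extra occupancy is wanted:

// Keep register usage low enough that 256-thread blocks can run with at
// least 4 blocks resident per SM; excess live values spill to local memory.
__global__ void __launch_bounds__(256, 4) boundedKernel(float *data) {
    // ...
}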

Page 73

Constant Memory

• Globally-scoped arrays qualified with __constant__.

• Total constant data size limited to 64 KB.

• Throughput = 4B per clock per SM (ideal if entire warp reads the same address).

• Can be used directly in arithmetic instructions (saving registers).

• Example use : Stencil coefficients.
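A minimal sketch of the stencil-coefficient use case (names are illustrative):

#define RADIUS 4
__constant__ float coef[RADIUS + 1];   // lives in the 64 KB constant bank

void setCoefficients(const float *h_coef) {
    // Copy host-side coefficients into constant memory before launching kernels.
    cudaMemcpyToSymbol(coef, h_coef, (RADIUS + 1) * sizeof(float));
}

__global__ void scale(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread of the warp reads the same coef[] address: broadcast at full
    // rate, and the operand can feed the FMA directly without an extra register.
    out[i] = coef[0] * in[i];
}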

Page 74

Running Faster

• Start with something simple. Simplicity and massive parallelism are key.

• Profile first, then optimize. Do not over-optimize (opportunity cost).

• Use libraries as building blocks whenever possible – e.g. CUB for parallel primitives (reduce/scan/sort/partition/..) at warp/block/device level.

• Other libraries : CUBLAS, CUFFT, CUTLASS, CUSOLVER.
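A minimal sketch of the CUB pattern (device-wide reduction; the two-call sequence is how CUB sizes its temporary storage):

#include <cub/cub.cuh>

// Sum n floats on the device into *d_out.
void deviceSum(const float *d_in, float *d_out, int n) {
    void  *d_temp = nullptr;
    size_t temp_bytes = 0;
    // First call with a null buffer only computes the required temp storage size.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);
    // Second call performs the reduction.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaFree(d_temp);
}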

Page 76

Shared Memory

Scratch-pad memory on each SM. User-managed cache: hardware does not evict data; data written to shared memory stays there until the user overwrites it.

Usage: storing frequently accessed data to reduce DRAM accesses; communication among the threads of a block.

Performance benefits compared to DRAM: latency 20x-40x lower, bandwidth 15x higher, access granularity 4x finer (4 B vs. 32 B).

Page 77

Volta Shared Memory

• Shared Memory Size Per Block

• Default: 48 KB, Maximum: 96KB.

• Shared Memory Bandwidth

• Number of banks: 32, bank width: 4 B.

• Bandwidth per SM: 128 B/clock

• Total Bandwidth: 10 KB/clock ~ 14 TB/s

Page 78

Shared Memory Instruction Operation

Threads in a warp provide addresses. HW determines into which 4-byte words the addresses fall.

Reads (LDS): fetch the data and distribute the requested bytes among the threads. Multicast capable.

Writes (STS):

Multiple threads writing the same address: race condition.