
CS179 GPU Programming: CUDA Memory

Lecture originally by Luke Durant and Tamas Szalay

CUDA Memory

● Review of memory spaces
● Memory syntax
● Constant memory
● Allocation issues
● Global memory gotchas
● Shared memory gotchas
● Texture memory
● Memory best practices

2

Review

[Figure: the CUDA memory hierarchy – the host talks to global and constant memory, which are shared by the whole grid; each block has its own shared memory; each thread has its own registers.]

3

Registers

● Read/write per-thread
– Can't access registers outside of each thread
● Physically stored in each MP
● Can't index
– No arrays…

4

CS179 GPU Programming

Local Memory Also read/write per thread Can’t read other thread’s local memory Can index

– This is where local arrays go

5

CS179 GPU Programming

Shared Memory Read/write per-block All threads in a block share the same memory In general, pretty quick

Certain cases can hinder this

6

CS179 GPU Programming

Global Memory Read/write per-application Can share between blocks, and grids Persistent across kernel executions Completely un-cached Really slow

7

CS179 GPU Programming

Constant Memory Read-only from device Cached in each MP Cache can broadcast to every thread currently

running – super efficient!

8

Texture Memory

● Read-only from device
● 2D caching method – shared between two MPs
● Linear filtering available

9

Syntax: Register Memory

● This is the default for memory within a device function
● No special syntax – just declare your local variables as usual
● Remember – no indexing = no arrays
● Obviously, can't access from host code

10

Syntax: Local Memory

● Remember, local to a thread, not necessarily a function
● Can't access or declare it outside of a thread
● Keyword: __device__ __local__, or just __local__

11
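To make the distinction concrete, here is a minimal sketch of ours (not from the slides): scalar locals normally land in registers, while an indexed array is placed in local memory. Note that the compiler does this automatically – no keyword is required on the array.

    __global__ void registers_vs_local(float *out)
    {
        // Scalar locals like these normally live in registers.
        float a = out[threadIdx.x];
        float b = a * 2.0f + 1.0f;

        // An array indexed with a runtime value can't live in
        // registers (registers aren't indexable), so the compiler
        // places it in local memory -- still private to the thread,
        // but as slow as global memory.
        float scratch[8];
        for (int i = 0; i < 8; i++)
            scratch[i] = b + (float)i;

        out[threadIdx.x] = scratch[threadIdx.x % 8];
    }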

Syntax: Shared Memory

● Keyword: __device__ __shared__, or just __shared__
● Can't have a pointer; use an array instead

12
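As a quick illustration (our own sketch, with a made-up BLOCK_SIZE), a statically sized __shared__ array used to reverse each block's chunk of data:

    #define BLOCK_SIZE 128

    __global__ void reverse_block(float *d_data)
    {
        // One copy of this array exists per block, visible to
        // all of the block's threads.
        __shared__ float buf[BLOCK_SIZE];

        int t = threadIdx.x;
        buf[t] = d_data[blockIdx.x * BLOCK_SIZE + t];
        __syncthreads();   // wait until every thread has written

        d_data[blockIdx.x * BLOCK_SIZE + t] = buf[BLOCK_SIZE - 1 - t];
    }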

Syntax: Global Device Memory

● Just __device__ outside of a function gives you global memory
● Unlike the other memory areas, you can pass pointers to global memory into a kernel
● cudaMalloc gives you global memory
● cudaMallocPitch gives you global memory with a more efficient pitch (more on this soon)
– You'll probably want to pass the pitch to the kernel too

13
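A minimal sketch showing both flavors (the names d_scale and scale_all are ours):

    #include <cuda_runtime.h>

    __device__ float d_scale;   // file-scope __device__ = global memory

    __global__ void scale_all(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= d_scale;
    }

    int main()
    {
        const int n = 1024;
        float h_scale = 2.0f;

        // Set the __device__ variable from the host (the symbol-copy
        // functions are covered on the following slides).
        cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));

        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));  // global memory

        // Unlike shared/constant/local memory, pointers into global
        // memory can be passed as kernel arguments.
        scale_all<<<n / 256, 256>>>(d_data, n);

        cudaFree(d_data);
        return 0;
    }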

Useful Functions for Global Memory

● cudaMemcpy
– Takes device pointers into global memory
● cudaMemcpy2D
– Memcpys arrays with differing pitches
● Remember to specify cudaMemcpyHostToDevice, etc.
● Don't mix up your host/global pointers!

14
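Putting the last two slides together, a sketch of a pitched allocation plus a 2D copy (sizes and names are ours):

    #include <cuda_runtime.h>

    int main()
    {
        const int width = 256, height = 64;    // in elements
        static float h_img[height][width];     // host image

        // cudaMallocPitch pads each row for efficient alignment;
        // 'pitch' is the real row size in bytes.
        float *d_img;
        size_t pitch;
        cudaMallocPitch((void **)&d_img, &pitch,
                        width * sizeof(float), height);

        // cudaMemcpy2D handles the differing row pitches.
        cudaMemcpy2D(d_img, pitch,                    // dst, dst pitch
                     h_img, width * sizeof(float),    // src, src pitch
                     width * sizeof(float), height,   // row bytes, rows
                     cudaMemcpyHostToDevice);         // specify the kind!

        cudaFree(d_img);
        return 0;
    }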

Useful Functions for Global Memory

● cudaMemcpyToSymbol
– Takes a symbol (or its name as a char*) instead of a pointer
– DeviceToDevice or HostToDevice
● cudaMemcpyFromSymbol
– DeviceToHost
● Use cudaGetSymbolAddress and cudaGetSymbolSize to find symbols

15
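For a __device__ (global-memory) symbol, the calls look like this sketch (d_table is our own example):

    #include <cuda_runtime.h>

    __device__ float d_table[64];

    int main()
    {
        float h_table[64] = { 0 };

        // Host -> device symbol, and back again.
        cudaMemcpyToSymbol(d_table, h_table, sizeof(h_table));
        cudaMemcpyFromSymbol(h_table, d_table, sizeof(h_table));

        // A global-memory symbol can also be turned into an
        // ordinary device pointer and queried for its size.
        void *ptr;
        size_t size;
        cudaGetSymbolAddress(&ptr, d_table);
        cudaGetSymbolSize(&size, d_table);

        return 0;
    }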

Syntax: Constant Memory

● Slightly awkward interface
● Define with __device__ __constant__, or just __constant__
● Access from device code as-is
● Access from host code with cudaMemcpyToSymbol
– Can't use pointers
– Can't dynamically allocate, even from host

16
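A sketch of the whole constant-memory pattern (c_coeffs and the kernel are ours). Having every thread read the same address lets the constant cache broadcast:

    #include <cuda_runtime.h>

    // Size is fixed at compile time -- no dynamic allocation.
    __constant__ float c_coeffs[16];

    __global__ void apply(float *data, int n, int which)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            // Every thread reads the same address, so the constant
            // cache broadcasts one value to the whole half-warp.
            data[i] *= c_coeffs[which];
    }

    int main()
    {
        float h_coeffs[16] = { 1.0f, 0.5f };  // remaining entries zero
        // Host code can only reach constant memory via the symbol API.
        cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
        // ... allocate data and launch apply<<<...>>>(...) as usual ...
        return 0;
    }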

Useful Functions for Constant Memory

● cudaMemcpyToSymbol
– Takes a symbol instead of a pointer
– DeviceToDevice or HostToDevice
● cudaMemcpyFromSymbol
– DeviceToHost
● cudaGetSymbolAddress FAILS! Why?
– All pointers are assumed to be in global space

17

Syntax: Texture Memory

● Even more awkward than constant memory
● We'll talk about this later today
● But first…

18

Allocation Issues

● As noted, each block runs on only one multiprocessor
● This means the resources each block uses determine how many blocks can run per MP
● Since we have limited MPs, and no swapping is done, this puts a limit on the number of blocks we can run at once

19

Allocation Issues: Registers

● 32KB of register space on each multiprocessor
● Each register is 4 bytes
– i.e. 8K registers per MP
● Registers are never swapped, so this is a hard limit!

20

Allocation Issues: Registers

● However, there are basically no rules on the minimum register allocation per thread
● You can have lots of threads that use a few registers
● Or a few threads that use many registers
– Watch out! You don't want to waste parallel execution power
– In general, it's a good idea to have more than one block running per MP

21

Allocation Issues: Shared Memory

● 16KB of shared memory per MP
● Broken up into 1KB "banks"
– We'll talk about why this is important
● Again, this 16KB might be the limiting factor in the number of blocks per MP

22

Allocation Issues: Local Memory

● Local memory is not stored on-chip
● This means it usually is not a limiting factor for the total number of threads
● However, accesses are as slow as global memory!
– So, try not to use local arrays in kernels

23

Warps

● To understand memory coalescing, we need the concept of a warp
– The name comes from weaving – a group of threads
● In this context, it means a group of threads that are scheduled together
– In CUDA, a warp is 32 threads
● The fact that 32 threads execute together matters elsewhere too, but today we'll just cover what it means for memory

24

Warps

● In current hardware, "half-warps" (i.e. 16 threads) access memory at the same time
– But this is not part of the CUDA standard and will almost certainly change in the future

25

Memory Coalescing

● Remember, global memory accesses are SLOW!
● Often we want to access data indexed by the thread ID
● Luckily, under certain conditions there is hardware support to copy chunks of data at once
● Ensuring this happens is called "memory coalescing"

26

Aligned Accesses

● Threads can read 4, 8, or 16 bytes at a time from global memory
– However, only if accesses are aligned!
● 4-byte reads must be aligned to 4-byte boundaries, 8-byte reads to 8-byte boundaries, etc.
● So, which takes longer: reading 16 bytes from address 0xABCD0 or 0xABCDE?

27

Aligned Accesses

● Built-in vector types such as float2 and float4 are automatically aligned to their size
● float3 is the exception – it is only 4-byte aligned, so arrays of float3 are NOT aligned for single 16-byte accesses!
● For structs, use the __align__(x) keyword
– x = 4, 8, 16

28
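A sketch of the struct case (Vertex is our own example): three floats occupy 12 bytes, so __align__(16) pads each element to start on a 16-byte boundary:

    // Without __align__(16) this struct would be 12 bytes, and
    // array elements would straddle 16-byte boundaries.
    struct __align__(16) Vertex {
        float x, y, z;   // padded from 12 to 16 bytes
    };

    __global__ void shift_x(Vertex *verts, float dx)
    {
        // Because each element starts at a 16-byte boundary, the
        // whole struct can be moved with one aligned 16-byte access.
        Vertex v = verts[threadIdx.x];
        v.x += dx;
        verts[threadIdx.x] = v;
    }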

Memory Coalescing

● Multiple threads making coalesced accesses are much faster
● Coalesced means:
– Contiguous
– In order – i.e. thread 0 takes index 0, thread 15 takes index 15
– Aligned to a multiple of the half-warp's total data size
● i.e. if accessing 4-byte floats, thread 0 should access an address that is a multiple of 4 × 16 = 64 bytes!

29
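For illustration, two copy kernels of ours that differ only in access pattern:

    // Coalesced: thread k of each half-warp reads element base + k --
    // contiguous, in order, starting at an aligned offset.
    __global__ void copy_coalesced(float *dst, const float *src)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[i];
    }

    // Uncoalesced: a stride of two floats means each half-warp touches
    // a 128-byte span to move 64 bytes -- on older hardware each
    // thread's access is issued separately.
    __global__ void copy_strided(float *dst, const float *src)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[2 * i];
    }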

Memory Coalescing

● Luckily, cudaMalloc already aligns the start of each block for you
– Remember to use cudaMallocPitch for 2-dimensional arrays so that each row is aligned as well

30

Memory Coalescing

[Figure: a coalesced access pattern]

● Fast!
– Contiguous
– In-order
– Accesses aligned

31

Memory Coalescing

[Figure: an uncoalesced access pattern]

● Slow:
– Threads 2 and 3 do not access in order
– Thread accesses not aligned

32

Memory Coalescing

[Figure: an uncoalesced access pattern with a gap]

● Slow:
– An address is skipped
– Not in order

33

Memory Coalescing: New Hardware

● On the newest cards, memory coalescing restrictions are slightly relaxed
● Sequential access is no longer mandatory – you can have random access within an aligned chunk

34

Memory Coalescing: New Hardware

● It will try to get around misaligned accesses by grabbing a bigger chunk
– i.e. accessing 64B from 0x101 will access 128B from 0x100
● Can't always do this – 256B is the maximum read/write size
● Also, accesses have to be "sort of" aligned
– There is no way to turn a 64B access at 0x078 into one aligned 128B access

35

Shared Memory

● Shared memory is generally pretty fast, since it's on-chip
● To maximize usage, we want each thread in a half-warp to use shared memory at the same time
● This can only happen in certain situations – similar in concept to global memory coalescing
– However, shared memory is optimized for different use cases

36

Shared Memory Banks

● Shared memory is divided into 1KB banks
● Each bank can service one thread at a time
● If multiple threads access the same bank, one of them will have to wait
● Again, it's thread-safe in the sense that both operations will occur, but the order is undefined

37

Shared Memory Banks

● Each bank services 32-bit words at addresses 16 × 4 = 64 bytes apart
– i.e. bank 0 services 0x00, 0x40, 0x80, …
– Bank 1 services 0x04, 0x44, 0x84, …
● Unlike global memory on older hardware, there is no need for threads to make sequential accesses
– Bank 1 has no special connection with thread 1, for example

38

Shared Memory Banks

● You want to avoid multiple threads accessing the same bank (see the sketch after this slide)
– Don't close-pack data – keep threads' data spread out by at least 4 bytes
– Split data elements that are more than 4 bytes into multiple accesses
– Watch out for structures with even strides

39
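One common idiom for breaking up an even stride (our own sketch, not from the slides): pad a 2D shared array by one word per row, so that a column walk hits 16 different banks instead of the same bank 16 times:

    #define TILE 16

    // Launch with a 16x16 thread block.
    __global__ void transpose_tile(const float *in, float *out)
    {
        // Row length TILE + 1 = 17 words: element (r, c) sits in bank
        // (17*r + c) % 16, so a column of 16 elements spans all 16
        // banks. With row length 16, a column would be a 16-way
        // conflict on a single bank.
        __shared__ float tile[TILE][TILE + 1];

        int x = threadIdx.x, y = threadIdx.y;
        tile[y][x] = in[y * TILE + x];     // row write: conflict-free
        __syncthreads();

        out[y * TILE + x] = tile[x][y];    // column read: conflict-free
                                           // only because of the padding
    }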

Shared Memory Banks

[Figure: a conflict-free access pattern]

● Fast!
– One thread, one bank
– Even the misaligned case is fast

40

Shared Memory Banks

[Figure: a conflicting access pattern]

● Slow:
– Many threads access one bank

41

Shared Memory Banks

● One helpful feature for reading from memory banks: broadcast!
– If multiple threads read the same address at the same time, only one read is performed, and the data is broadcast to all reading threads

42

Shared Memory Banks

[Figure: many threads reading one address]

● Fast (with broadcasting)

43

Notes on Constants and Textures

● Both of these memories have two levels of caching
– To achieve good usage, locality is far more important than alignment
– Textures use 2D locality rules
● In general, writing optimized accesses to constants is about what you'd expect from CPU coding

44

Syntax: Texture Space

● Syntax is similar to GLSL textures
● "CUDA arrays" are stored in texture memory and accessed only through texture lookups on the device side
● You can declare regions of global memory as "texture memory", but you lose many benefits:
– No support for wrapping/clamping
– No support for floating-point coordinates
– No support for filtering

45

Textures from Linear Memory

● However, you still get caching benefits
– Since the cache is stored on-chip, you don't have to worry so much about coalescing
● Step 1: Create the texture reference
– Uses templates (C++): texture<Type> texRef;
– Type: float, float4, etc.
● Step 2: Bind memory to the texture reference
– cudaBindTexture(0, texRef, devPtr, size);

46

Performing Texture Fetches

● Special syntax for accessing textures in device code
● The texRef is a file-scope variable, so the kernel refers to it directly by name (texture references can't be passed as kernel parameters)
– Remember, texRef is a templated type
● Accessing is simple: tex1Dfetch(texRef, x);
– x is an int!

47
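Putting the last two slides together, a sketch of the whole linear-memory texture path (the smoothing kernel is ours):

    #include <cuda_runtime.h>

    // Step 1: a texture reference at file scope.
    texture<float> texRef;   // defaults: dim = 1, cudaReadModeElementType

    __global__ void smooth(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            // tex1Dfetch takes an *integer* index for linear memory.
            out[i] = 0.25f * tex1Dfetch(texRef, i - 1)
                   + 0.50f * tex1Dfetch(texRef, i)
                   + 0.25f * tex1Dfetch(texRef, i + 1);
    }

    int main()
    {
        const int n = 1024;
        float *d_in, *d_out;
        cudaMalloc((void **)&d_in,  n * sizeof(float));
        cudaMalloc((void **)&d_out, n * sizeof(float));

        // Step 2: bind a region of global memory to the reference.
        cudaBindTexture(0, texRef, d_in, n * sizeof(float));

        smooth<<<n / 256, 256>>>(d_out, n);

        cudaUnbindTexture(texRef);
        return 0;
    }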

Texture Fetch Extra Feature

● If you want, you can convert integers to normalized floats automatically (fast)
● To do this, set the read mode of the texRef:
– texture<unsigned char, 1, cudaReadModeNormalizedFloat> texRef;
– 0 -> 0.0, 255 -> 1.0
● Here, the 1 means a 1-dimensional texture

48

Important Caveat

● Since this "texture" is stored in global memory, it is possible to write to it from the same kernel that is reading from it
– While this is possible, it is a TERRIBLE idea!
– Even without caching, it's a synchronization nightmare
● Nothing is done in hardware to keep texture caches coherent with global memory

49

More Notes on Linear "Textures"

● Dimension is forced to be 1
– This means if you're doing some sort of 2D array, the pitch might not be ideal
● Because of caching, this doesn't really matter
● However, caching is done assuming 1D locality of accesses
● Conclusion: since these aren't really read-write anyway, you might as well use "real" textures, i.e. CUDA arrays

50

CUDA Arrays

● CUDA arrays live in texture memory space and can only be accessed through texture fetches
– This lets us use some of the hardware support for texturing, which can come in handy even for non-graphics applications
● Most of our options here will look familiar from GLSL

51

CUDA Arrays

● Step 1: Create a "channel description"
– Describes what format the texture is in
– cudaCreateChannelDesc(int x, int y, int z, int w, enum cudaChannelFormatKind f);
– f: cudaChannelFormatKindFloat, …Signed, or …Unsigned
– x, y, z, w are the number of bits in each field

52

CUDA Arrays

● Step 2: Allocate the array
– Similar to global memory, but we MUST allocate dynamically
– cudaMallocArray
● Takes the channel desc as a parameter
● Most of the functions for interfacing with global memory have array counterparts
– cudaMemcpyToArray, etc.

53
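A sketch of steps 1 and 2 for a single-channel float texture (sizes are ours):

    #include <cuda_runtime.h>

    int main()
    {
        const int w = 256, h = 256;
        static float h_data[256 * 256];   // source image on the host

        // Step 1: channel description -- one 32-bit float component.
        cudaChannelFormatDesc desc =
            cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

        // Step 2: allocate the CUDA array (always dynamic) and fill it.
        cudaArray *arr;
        cudaMallocArray(&arr, &desc, w, h);
        cudaMemcpyToArray(arr, 0, 0, h_data, sizeof(h_data),
                          cudaMemcpyHostToDevice);

        cudaFreeArray(arr);
        return 0;
    }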

CUDA Arrays

● Step 3: Create the texRef
– Same as the global memory case, but the type *must* match the channelDesc given when you allocated the array:
– texture<Type, Dim, ReadMode> texRef;
● Step 4: Edit texture settings
– Settings are encoded as members of the texRef struct

54

CUDA Array Parameters

● normalized
– Texture coords go from 0 to 1
● filterMode
– Point or Linear
– No 3D texture support
– Linear has basically no guarantees on precision/accuracy
● addressMode
– Clamp or Wrap
● All of these are ignored for linear-memory textures

55

CUDA Arrays

● Step 5: Bind the texture reference
– cudaBindTextureToArray(texRef, array);
● Access it from device code in a similar way (see the sketch below):
– tex1D(texRef, x)
● x is a float – in texels, or 0 to 1 if normalized
– tex2D(texRef, x, y)
● Again, x and y are floats
● Remember, coords are either clamped or wrapped

56
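And a sketch of steps 3 through 5 for the 2D case, continuing the previous example (the sample kernel and setup helper are ours):

    // Step 3: a 2D float texture reference matching the channel desc.
    texture<float, 2, cudaReadModeElementType> texRef;

    __global__ void sample(float *out, int w)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // tex2D takes float coordinates; +0.5f samples at texel
        // centers. Out-of-range coords are clamped or wrapped
        // according to addressMode.
        out[y * w + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
    }

    // Host side, with 'arr' being the cudaArray from the previous step.
    void setup(cudaArray *arr)
    {
        // Step 4: texture settings are members of the texRef struct.
        texRef.normalized = false;               // coords in texels
        texRef.filterMode = cudaFilterModePoint;
        texRef.addressMode[0] = cudaAddressModeClamp;
        texRef.addressMode[1] = cudaAddressModeClamp;

        // Step 5: bind the array to the reference.
        cudaBindTextureToArray(texRef, arr);
    }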

One Final Note on Textures

● Textures are cached at two levels
● Texture caches are shared between pairs of MPs

57

Memory Best Practices

● Remember, the key to optimization in CUDA is to balance memory optimization with parallelism
● In general:
– More blocks/threads = more parallelism
– Fewer blocks/threads = better memory usage

58

Memory Best Practices

● The balance usually lies with better memory accesses, until multiprocessors begin idling

59

Memory Best Practices

● Better coalescing means less time waiting for data, which means less latency
● In general, you should break up the problem so you can do coalesced copies into shared memory
– Then do your processing in shared memory without bank conflicts, and copy the results back to global memory (see the sketch below)

60
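As a closing sketch of ours, the recommended pattern in miniature: coalesced loads into shared memory (with halo cells), conflict-free processing, and a coalesced store:

    #define TILE 128

    __global__ void blur1d(const float *in, float *out, int n)
    {
        __shared__ float s[TILE + 2];   // tile plus one halo cell per side

        int g = blockIdx.x * TILE + threadIdx.x;   // global index

        // Coalesced load: consecutive threads read consecutive addresses.
        s[threadIdx.x + 1] = (g < n) ? in[g] : 0.0f;
        if (threadIdx.x == 0)
            s[0] = (g > 0) ? in[g - 1] : 0.0f;
        if (threadIdx.x == TILE - 1)
            s[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
        __syncthreads();

        // Process in shared memory: consecutive threads hit consecutive
        // banks, so there are no conflicts. Then store coalesced.
        if (g < n)
            out[g] = (s[threadIdx.x] + s[threadIdx.x + 1]
                      + s[threadIdx.x + 2]) / 3.0f;
    }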