Lecture: Manycore GPU Architectures and Programming, Part 2

CSCE569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE569/

Manycore GPU Architectures and Programming: Outline

• Introduction – GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA – global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models – OpenACC and OpenMP
How is the GPU controlled?

• The CUDA API is split into:
  – The CUDA Management API
  – The CUDA Kernel API
• The CUDA Management API is used for a variety of operations
  – GPU memory allocation, data transfer, execution, resource creation
  – Mostly regular C functions and calls
• The CUDA Kernel API is used to define the computation to be performed by the GPU
  – C extensions
How is the GPU controlled?

• A CUDA kernel:
  – Defines the operations to be performed by a single thread on the GPU
  – Just as a C/C++ function defines work to be done on the CPU
  – Syntactically, a kernel looks like C/C++ with some extensions

    __global__ void kernel(...) {
        ...
    }

  – Every CUDA thread executes the same kernel logic (SIMT)
  – Initially, the only difference between threads is their thread coordinates
How are GPU threads organized?

• CUDA thread hierarchy
  – Warp = SIMT group
  – Thread block = SIMT groups that run concurrently on an SM
  – Grid = all thread blocks created by the same kernel launch
• Launching a kernel is simple and similar to a function call.
  – kernel name and arguments
  – # of thread blocks/grid and # of threads/block to create:

    kernel<<<nblocks, threads_per_block>>>(arg1, arg2, ...);
How are GPU threads organized?

• In CUDA, only thread blocks and grids are first-class citizens of the programming model.
• The number of warps created and their organization are implicitly controlled by the kernel launch configuration, but never set explicitly.

    kernel<<<nblocks, threads_per_block>>>(arg1, arg2, ...);

  (<<<nblocks, threads_per_block>>> is the kernel launch configuration)
How are GPU threads organized?

• GPU threads can be configured in one-, two-, or three-dimensional layouts
  – One-dimensional blocks and grids:

    int nblocks = 4;
    int threads_per_block = 8;
    kernel<<<nblocks, threads_per_block>>>(...);

[Figure: a one-dimensional grid of four blocks: Block 0, Block 1, Block 2, Block 3]
How are GPU threads organized?

• GPU threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional blocks and grids:

    dim3 nblocks(2, 2);
    dim3 threads_per_block(4, 2);
    kernel<<<nblocks, threads_per_block>>>(...);
How are GPU threads organized?

• GPU threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional grid and one-dimensional blocks:

    dim3 nblocks(2, 2);
    int threads_per_block = 8;
    kernel<<<nblocks, threads_per_block>>>(...);
How are GPU threads organized?

• On the GPU, the number of blocks and threads per block is exposed through intrinsic thread coordinate variables:
  – Dimensions
  – IDs

  Variable                                 Meaning
  gridDim.x, gridDim.y, gridDim.z          Number of blocks in a kernel launch.
  blockIdx.x, blockIdx.y, blockIdx.z       Unique ID of the block that contains the current thread.
  blockDim.x, blockDim.y, blockDim.z       Number of threads in each block.
  threadIdx.x, threadIdx.y, threadIdx.z    Unique ID of the current thread within its block.
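For illustration, a minimal sketch (kernel name and arguments assumed, not from the course materials) of how these coordinates combine to give each thread a unique global (row, column) position in a two-dimensional launch:

    __global__ void kernel2d(float *mat, int width, int height) {
        // global column and row of this thread within the whole grid
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < height && col < width)
            mat[row * width + col] = 0.0f;   // e.g., zero one matrix element per thread
    }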
How are GPU threads organized?

• To calculate a globally unique ID for a thread on the GPU inside a one-dimensional grid and one-dimensional block:

    kernel<<<4, 8>>>(...);

    __global__ void kernel(...) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        ...
    }

[Figure: four blocks of eight threads each (thread indices 0-7); for the highlighted thread, blockIdx.x = 2, blockDim.x = 8, and threadIdx.x = 2, so tid = 2 * 8 + 2 = 18]
How are GPU threads organized?

• Thread coordinates offer a way to differentiate threads and identify thread-specific input data or code paths.
  – Link data and computation: a mapping

    __global__ void kernel(int *arr) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < 32) {
            arr[tid] = f(arr[tid]);   // code path for threads with tid < 32
        } else {
            arr[tid] = g(arr[tid]);   // code path for threads with tid >= 32
        }
    }

• Thread divergence: recall that in the SIMT execution model both code paths are executed by the warp, with threads disabled on the path they do not take.
How is GPU memory managed?

• CUDA memory management API
  – Allocation of GPU memory
  – Transfer of data from the host to GPU memory
  – Freeing GPU memory

  Host Function    CUDA Analogue
  malloc           cudaMalloc
  memcpy           cudaMemcpy
  free             cudaFree
How is GPU memory managed?

    cudaError_t cudaMalloc(void **devPtr, size_t size);

  – Allocates size bytes of GPU memory and stores their address at *devPtr

    cudaError_t cudaFree(void *devPtr);

  – Releases the device memory allocation stored at devPtr
  – Must be an allocation that was created using cudaMalloc
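Since each of these calls returns a cudaError_t, a common pattern is to wrap them in a checking macro. A minimal sketch (the macro name CHECK is assumed, not part of the CUDA API):

    #include <stdio.h>
    #include <stdlib.h>

    #define CHECK(call)                                               \
        do {                                                          \
            cudaError_t err = (call);                                 \
            if (err != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s:%d: %s\n",             \
                        __FILE__, __LINE__, cudaGetErrorString(err)); \
                exit(1);                                              \
            }                                                         \
        } while (0)

    /* usage: CHECK(cudaMalloc((void **)&d_arr, nbytes)); */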
How is GPU memory managed?

    cudaError_t cudaMemcpy(void *dst, const void *src,
                           size_t count, enum cudaMemcpyKind kind);

  – Transfers count bytes from the memory pointed to by src to dst
  – kind can be:
    • cudaMemcpyHostToHost
    • cudaMemcpyHostToDevice
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
  – The locations of dst and src must match kind, e.g. if kind is cudaMemcpyHostToDevice then src must be a host array and dst must be a device array
How is GPU memory managed?

    void *d_arr, *h_arr;
    h_arr = … ; /* init host memory and data */

    // Allocate memory on the GPU; its address is stored in d_arr
    cudaMalloc((void **)&d_arr, nbytes);

    // Transfer data from the host to the device
    cudaMemcpy(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice);

    // Transfer data from the device to the host
    cudaMemcpy(h_arr, d_arr, nbytes, cudaMemcpyDeviceToHost);

    // Free the allocated device memory
    cudaFree(d_arr);
CUDA Program Flow

• At its most basic, the flow of a CUDA program is as follows:
  1. Allocate GPU memory
  2. Populate GPU memory with inputs from the host
  3. Execute a GPU kernel on those inputs
  4. Transfer outputs from the GPU back to the host
  5. Free GPU memory
• Let's take a look at a simple example that manipulates data
AXPY Example with OpenMP: Multicore

• y = α·x + y
  – x and y are vectors of size n
  – α is a scalar
• Data (x, y, and α) are shared
  – Parallelization is relatively easy (see the sketch below)
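A minimal sketch of the multicore version (function name assumed; the slide's original code figure is not reproduced here):

    void axpy(int n, float a, float *x, float *y) {
        // iterations are independent, so the loop parallelizes directly
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }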
CUDA Program Flow

• AXPY is an embarrassingly parallel problem
  – How can vector addition be parallelized?
  – How can we map this to GPUs?
    • Each thread does one element

[Figure: vectors A and B combined element-wise into C, one thread per element]
AXPY Offloading to a GPU Using CUDA

• The offloaded version goes through five steps (see the sketch below):
  – Memory allocation on the device
  – Memcpy from host to device
  – Launch parallel execution
  – Memcpy from device to host
  – Deallocation of device memory
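A minimal sketch of the offloaded AXPY (kernel and host function names assumed; error checking omitted):

    __global__ void axpy_kernel(int n, float a, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void axpy(int n, float a, float *h_x, float *h_y) {
        float *d_x, *d_y;
        size_t nbytes = n * sizeof(float);

        // Memory allocation on the device
        cudaMalloc((void **)&d_x, nbytes);
        cudaMalloc((void **)&d_y, nbytes);

        // Memcpy from host to device
        cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, nbytes, cudaMemcpyHostToDevice);

        // Launch parallel execution: one thread per element
        int threads_per_block = 256;
        int nblocks = (n + threads_per_block - 1) / threads_per_block;
        axpy_kernel<<<nblocks, threads_per_block>>>(n, a, d_x, d_y);

        // Memcpy from device to host
        cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);

        // Deallocation of device memory
        cudaFree(d_x);
        cudaFree(d_y);
    }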
CUDA Program Flow

• Consider the workflow of the example vector addition vecAdd.cu:
  1. Allocate space for A, B, and C on the GPU
  2. Transfer the initial contents of A and B to the GPU
  3. Execute a kernel in which each thread sums Ai and Bi, and stores the result in Ci
  4. Transfer the final contents of C back to the host
  5. Free A, B, and C on the GPU
• Exercises: modify the program to compute C = A + B + C, and A = B * C; we will need both C and A on the host side after the GPU computation.
• Compile and run on bridges:
  – https://passlab.github.io/CSCE569/resources/HardwareSoftware.html#interactive
  – Copy the gpu_code_examples folder from my home folder:
    • cp -r ~yan/gpu_code_examples ~
  – $ nvcc -Xcompiler -fopenmp vectorAdd.cu
  – $ ./a.out
More Examples and Exercises

• Matvec:
  – Version 1: each thread computes one element of the final vector
  – Version 2:
• Matmul in assignment #4
  – Version 1: each thread computes one row of the final matrix C
CUDA SDK Examples

• CUDA Programming Manual:
  – http://docs.nvidia.com/cuda/cuda-c-programming-guide
• CUDA SDK examples on bridges
  – module load gcc/5.3.0 cuda/8.0
  – export CUDA_PATH=/opt/packages/cuda/8.0
  – /opt/packages/cuda/8.0/samples
    • Copy to your home folder:
      – cp -r /opt/packages/cuda/8.0/samples ~/CUDA_samples
    • Do a "make" in the folder and it will build all the sources
    • Or go to a specific example folder and make; it will build only that binary
• Find ones you are interested in and run them to see
Inspecting CUDA Programs

• Debugging a CUDA program:
  – cuda-gdb debugging tool, like gdb
• Profiling a program to examine its performance
  – nvprof tool, like gprof
  – nvprof ./vecAdd
Manycore GPU Architectures and Programming: Outline

• Introduction – GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA – global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models – OpenACC and OpenMP
Storing Data on the CPU

• A memory hierarchy emulates a large amount of low-latency memory
  – Cache data from a large, high-latency memory bank in a small, low-latency memory bank

[Figure: CPU memory hierarchy: CPU → L1 Cache → L2 Cache → DRAM]
GPU Memory Hierarchy

[Figure: SIMT thread group → registers and local memory; SIMT thread groups on an SM → on-chip shared memory/cache; SIMT thread groups on a GPU → global, constant, and texture memory]

• More complex than the CPU memory
  – Many different types of memory, each with special-purpose characteristics
    • SRAM
    • DRAM
  – More explicit control over data movement
Storing Data on the GPU

• Registers (SRAM)
  – Lowest-latency memory space on the GPU
  – Private to each CUDA thread
  – Constant pool of registers per SM, divided among threads in resident thread blocks
  – Architecture-dependent limit on the number of registers per thread
  – Registers are not explicitly used by the programmer; they are implicitly allocated by the compiler
  – The -maxrregcount compiler option allows you to limit the number of registers per thread, for example:
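A hypothetical invocation (the cap of 32 registers is an arbitrary example value):

    $ nvcc --maxrregcount=32 vectorAdd.cu   # cap each thread at 32 registers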
Storing Data on the GPU

• Shared memory (SRAM)
  – Declared with the __shared__ keyword
  – Low latency, high bandwidth
  – Shared by all threads in a thread block
  – Explicitly allocated and managed by the programmer: a manual L1 cache
  – Stored on-SM, in the same physical memory as the GPU L1 cache
  – On-SM memory is statically partitioned between the L1 cache and shared memory
Storing Data on the GPU

• GPU caches (SRAM)
  – The behaviour of GPU caches is architecture-dependent
  – Per-SM L1 cache stored on-chip
  – Per-GPU L2 cache stored off-chip; caches values for all SMs
  – Due to the parallelism of accesses, GPU caches do not follow the same LRU rules as CPU caches
Storing Data on the GPU

• Constant memory (DRAM)
  – Declared with the __constant__ keyword
  – Read-only
  – Limited in size: 64KB
  – Stored in device memory (same physical location as global memory)
  – Cached in a per-SM constant cache
  – Optimized for all threads in a warp accessing the same memory cell
Storing Data on the GPU

• Texture memory (DRAM)
  – Read-only
  – Stored in device memory (same physical location as global memory)
  – Cached in a texture-only on-SM cache
  – Optimized for 2D spatial locality (caches are commonly optimized only for 1D locality)
  – Explicitly used by the programmer
  – Special-purpose memory
Storing Data on the GPU

• Global memory (DRAM)
  – Large, high-latency memory
  – Stored in device memory (along with constant and texture memory)
  – Can be declared statically with __device__
  – Can be allocated dynamically with cudaMalloc
  – Explicitly managed by the programmer
  – Optimized for all threads in a warp accessing neighbouring memory cells
Static Global Memory

• Static global memory has a fixed size throughout execution time:

    __device__ float devData;

    __global__ void checkGlobalVariable() {
        printf("devData has value %f\n", devData);
    }

• Initialized using cudaMemcpyToSymbol:

    cudaMemcpyToSymbol(devData, &hostData, sizeof(float));

• Fetched using cudaMemcpyFromSymbol:

    cudaMemcpyFromSymbol(&hostData, devData, sizeof(float));
Dynamic Global Memory

• We have already seen dynamic global memory
  – cudaMalloc dynamically allocates global memory
  – cudaMemcpy transfers to/from global memory
  – cudaFree frees global memory allocated by cudaMalloc
• cudaMemcpy supports 4 types of transfer:
  – cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
• You can also memset global memory:

    cudaError_t cudaMemset(void *devPtr, int value, size_t count);
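A minimal usage sketch (array and size names assumed): like memset, cudaMemset sets each byte, so it is most often used to zero-fill device buffers.

    float *d_arr;
    cudaMalloc((void **)&d_arr, n * sizeof(float));
    cudaMemset(d_arr, 0, n * sizeof(float));   // all bytes 0, so every float is 0.0f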
Global Memory Access Patterns

• CPU caches are optimized for linear, iterative memory accesses
  – Cache lines ensure that accessing one memory cell brings neighbouring memory cells into the cache
  – If an application exhibits good spatial or temporal locality (which many do), later references will also hit in the cache

[Figure: CPU → Cache → System Memory]
Global Memory Access Patterns

• GPU caching is a more challenging problem
  – Thousands of threads cooperating on a problem
  – Difficult to predict the next round of accesses for all threads
• For efficient global memory access, GPUs instead rely on:
  – Large device memory bandwidth
  – Aligned and coalesced memory access patterns
  – Maintaining sufficient pending I/O operations to keep the memory bus saturated and hide global memory latency
Global Memory Access Patterns

• Achieving aligned and coalesced global memory accesses is key to optimizing an application's use of global memory bandwidth
  – Coalesced: the threads within a warp reference memory addresses that can all be serviced by a single global memory transaction (think of a memory transaction as the process of bringing a cache line into the cache)
  – Aligned: the global memory accesses by threads within a warp start at an address boundary that is an even multiple of the size of a global memory transaction
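A minimal sketch contrasting the two patterns (kernel names assumed): in the first kernel, neighbouring threads of a warp touch neighbouring floats, which coalesces into few transactions; in the second, a stride greater than 1 scatters the warp's accesses across many transactions.

    __global__ void coalesced(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = 2.0f * a[i];           // neighbouring threads, neighbouring cells
    }

    __global__ void strided(float *a, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) a[i] = 2.0f * a[i];           // warp's accesses spread out in memory
    }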
Global Memory Access Patterns

• A global memory transaction is either 32 or 128 bytes
  – The size of a memory transaction depends on which caches it passes through
  – If L1 + L2: 128 bytes
  – If L2 only: 32 bytes
  – Which caches a global memory transaction passes through depends on the GPU architecture and the type of access (read vs. write)
Global Memory Access Patterns

• Aligned and coalesced memory access (w/ L1 cache)
  – 32-thread warp, 128-byte memory transaction
• With a 128-byte access, a single transaction is required and all of the loaded bytes are used
Global Memory Access Patterns

• Misaligned and coalesced memory access (w/ L1 cache)
• With 128-byte accesses, two memory transactions are required to load all requested bytes. Only half of the loaded bytes are used.
Global Memory Access Patterns

• Misaligned and uncoalesced memory access (w/ L1 cache)
• With uncoalesced loads, many more bytes are loaded than requested
Global Memory Access Patterns

• Misaligned and uncoalesced memory access (w/ L1 cache)
• One factor to consider with uncoalesced loads: while the efficiency of this access is very low, it may bring many cache lines into the L1/L2 cache that are used by later memory accesses. The GPU is flexible enough to perform well even for applications that present suboptimal access patterns.
Global Memory Access Patterns

• Memory accesses that are not cached in L1 cache are serviced by 32-byte transactions
  – This can improve memory bandwidth utilization
  – However, the L2 cache is device-wide, has higher latency than L1, and is still relatively small → many applications may take a performance hit if the L1 cache is not used for reads
Global Memory Access Patterns

• Aligned and coalesced memory access (w/o L1 cache)
• With 32-byte transactions, four transactions are required and all of the loaded bytes are used
Global Memory Access Patterns

• Misaligned and coalesced memory access (w/o L1 cache)
• With 32-byte transactions, extra memory transactions are still required to load all requested bytes, but the number of wasted bytes is likely reduced compared to 128-byte transactions
Global Memory Access Patterns

• Misaligned and uncoalesced memory access (w/o L1 cache)
• With uncoalesced loads, more bytes are loaded than requested, but with better efficiency than with 128-byte transactions
Global Memory Access Patterns

• Global memory writes are always serviced by 32-byte transactions
Global Memory and Special-Purpose Memory

• Global memory is widely useful and as easy to use as CPU DRAM
• Limitations
  – It is easy to find applications with memory access patterns that are intrinsically poor for global memory
  – Many threads accessing the same memory cell → poor global memory efficiency
  – Many threads accessing sparse memory cells → poor global memory efficiency
• Special-purpose memory spaces address these deficiencies in global memory
  – Specialized for different types of data and different access patterns
  – Give more control over data movement and data placement than CPU architectures do
Shared Memory

• Declared with the __shared__ keyword
• Low latency, high bandwidth
• Shared by all threads in a thread block
• Explicitly allocated and managed by the programmer: a manual L1 cache
• Stored on-SM, in the same physical memory as the GPU L1 cache
• On-SM memory is statically partitioned between the L1 cache and shared memory
Shared Memory Allocation

• Shared memory can be allocated statically or dynamically
• Statically allocated shared memory
  – Size is fixed at compile time
  – Can declare many statically allocated shared memory variables
  – Can be declared globally or inside a device function
  – Can be multi-dimensional arrays

    __shared__ int s_arr[256][256];
Shared Memory Allocation

• Dynamically allocated shared memory
  – Size in bytes is set at kernel launch with a third kernel launch configuration parameter
  – Can only have one dynamically allocated shared memory array per kernel
  – Must be a one-dimensional array

    __global__ void kernel(...) {
        extern __shared__ int s_arr[];
        ...
    }

    kernel<<<nblocks, threads_per_block, shared_memory_bytes>>>(...);
Matvec Using Shared Memory: Matrix-Vector Multiplication
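A minimal sketch (kernel name, tile size, and data layout assumed, not the course's exact code) of y = A·x for an M×N row-major matrix: each thread computes one element of y, and the threads of a block cooperatively stage tiles of x in shared memory to cut global-memory traffic. It assumes the kernel is launched with TILE threads per block, e.g. matvec<<<(M + TILE - 1) / TILE, TILE>>>(d_A, d_x, d_y, M, N);

    #define TILE 256

    __global__ void matvec(const float *A, const float *x, float *y, int M, int N) {
        __shared__ float s_x[TILE];            // tile of the input vector x
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int t = 0; t < N; t += TILE) {
            // each thread loads one element of the current tile of x
            if (t + threadIdx.x < N)
                s_x[threadIdx.x] = x[t + threadIdx.x];
            __syncthreads();                   // tile fully loaded
            int len = (N - t < TILE) ? (N - t) : TILE;
            if (row < M)
                for (int j = 0; j < len; j++)
                    sum += A[row * N + t + j] * s_x[j];
            __syncthreads();                   // done reading this tile
        }
        if (row < M)
            y[row] = sum;
    }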
Matrix Multiplication V1 and V2 in Assignment #4

• https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory
GPU Memory Hierarchy

[Figure: SIMT thread group → registers and local memory; SIMT thread groups on an SM → on-chip shared memory/cache; SIMT thread groups on a GPU → global, constant, and texture memory]

• More complex than the CPU memory
  – Many different types of memory, each with special-purpose characteristics
    • SRAM
    • DRAM
  – More explicit control over data movement
Constant Memory

• Declared with the __constant__ keyword
• Read-only
• Limited in size: 64KB
• Stored in device memory (same physical location as global memory)
• Cached in a per-SM constant cache
• Optimized for all threads in a warp accessing the same memory cell
Constant Memory

• As its name suggests, constant memory is best used for storing constants
  – Values which are read-only
  – Values that are accessed identically by all threads
• For example: suppose all threads are evaluating the equation

    y = mx + b

  for different values of x, but identical values of m and b
  – All threads would reference m and b with the same memory operation
  – This broadcast access pattern is optimal for constant memory (see the sketch below)
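A minimal sketch of that example (kernel and variable names assumed):

    __constant__ float m;   // slope, identical for all threads
    __constant__ float b;   // intercept, identical for all threads

    __global__ void eval_line(const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = m * x[i] + b;   // broadcast reads served by the constant cache
    }

    // host side: m and b are set once with cudaMemcpyToSymbol
    // cudaMemcpyToSymbol(m, &h_m, sizeof(float));
    // cudaMemcpyToSymbol(b, &h_b, sizeof(float));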
Constant Memory

• A simple 1D stencil
  – The target cell is updated based on its 8 neighbors, weighted by some constants c0, c1, c2, c3
Constant Memory

• constantStencil.cu contains an example 1D stencil that uses constant memory:

    __constant__ float coef[RADIUS + 1];

    cudaMemcpyToSymbol(coef, h_coef, (RADIUS + 1) * sizeof(float));

    __global__ void stencil_1d(float *in, float *out, int N) {
        ...
        for (int i = 1; i <= RADIUS; i++) {
            tmp += coef[i] * (smem[sidx + i] - smem[sidx - i]);
        }
        ...
    }
CUDA Synchronization

• When using shared memory, you often must coordinate accesses by multiple threads to the same data
• CUDA offers synchronization primitives that allow you to synchronize among threads
CUDA Synchronization

__syncthreads
  – Synchronizes execution across all threads in a thread block
  – No thread in a thread block can progress past a __syncthreads before all other threads have reached it
  – __syncthreads ensures that all changes to shared and global memory by threads in this block are visible to all other threads in this block (see the sketch below)

__threadfence_block
  – All writes to shared and global memory by the calling thread are visible to all other threads in its block after this fence
  – Does not block thread execution
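As an illustration of __syncthreads, a minimal block-level sum reduction sketch (kernel name assumed; block size assumed to be a power of two):

    __global__ void block_sum(const float *in, float *out, int n) {
        extern __shared__ float s_data[];   // dynamically allocated shared memory
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s_data[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                    // all loads into shared memory complete

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                s_data[tid] += s_data[tid + s];
            __syncthreads();                // this round's partial sums are visible
        }
        if (tid == 0)
            out[blockIdx.x] = s_data[0];    // one partial sum per block
    }

    // launch with the third configuration parameter supplying the shared memory size:
    // block_sum<<<nblocks, tpb, tpb * sizeof(float)>>>(d_in, d_out, n);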
CUDA Synchronization

__threadfence
  – All writes to global memory by the calling thread are visible to all other threads in its grid after this fence
  – Does not block thread execution

__threadfence_system
  – All writes to global memory, page-locked host memory, and memory of other CUDA devices by the calling thread are visible to all other threads on all CUDA devices and all host threads after this fence
  – Does not block thread execution
Suggested Readings

1. Chapters 2, 4, and 5 in Professional CUDA C Programming
2. Cliff Woolley. GPU Optimization Fundamentals. 2013. https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
3. Mark Harris. Using Shared Memory in CUDA C/C++. http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
4. Mark Harris. Optimizing Parallel Reduction in CUDA. http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf