Lecture: Manycore GPU Architectures and Programming, Part 2

CSCE569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE569/

Manycore GPU Architectures and Programming: Outline

• Introduction – GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA – global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models – OpenACC and OpenMP
How is the GPU controlled?

• The CUDA API is split into:
  – The CUDA Management API
  – The CUDA Kernel API
• The CUDA Management API is used for a variety of operations
  – GPU memory allocation, data transfer, execution, resource creation
  – Mostly regular C functions and calls
• The CUDA Kernel API is used to define the computation to be performed by the GPU
  – C extensions
How is the GPU controlled?

• A CUDA kernel:
  – Defines the operations to be performed by a single thread on the GPU
  – Just as a C/C++ function defines work to be done on the CPU
  – Syntactically, a kernel looks like C/C++ with some extensions

    __global__ void kernel(...) {
        ...
    }

  – Every CUDA thread executes the same kernel logic (SIMT)
  – Initially, the only difference between threads is their thread coordinates
How are GPU threads organized?

• CUDA thread hierarchy
  – Warp = SIMT group
  – Thread block = SIMT groups that run concurrently on an SM
  – Grid = all thread blocks created by the same kernel launch
• Launching a kernel is simple and similar to a function call.
  – kernel name and arguments
  – # of thread blocks/grid and # of threads/block to create:

    kernel<<<nblocks, threads_per_block>>>(arg1, arg2, ...);
How are GPU threads organized?

• In CUDA, only thread blocks and grids are first-class citizens of the programming model.
• The number of warps created and their organization are implicitly controlled by the kernel launch configuration, but never set explicitly.

    kernel<<<nblocks, threads_per_block>>>(arg1, arg2, ...);

  (<<<nblocks, threads_per_block>>> is the kernel launch configuration)
How are GPU threads organized?

• GPU threads can be configured in one-, two-, or three-dimensional layouts
  – One-dimensional blocks and grids:

    int nblocks = 4;
    int threads_per_block = 8;
    kernel<<<nblocks, threads_per_block>>>(...);

[Figure: a one-dimensional grid of four blocks: Block 0, Block 1, Block 2, Block 3]
How are GPU threads organized?

• GPU threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional blocks and grids:

    dim3 nblocks(2, 2);
    dim3 threads_per_block(4, 2);
    kernel<<<nblocks, threads_per_block>>>(...);
How are GPU threads organized?

• GPU threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional grid and one-dimensional blocks:

    dim3 nblocks(2, 2);
    int threads_per_block = 8;
    kernel<<<nblocks, threads_per_block>>>(...);
How are GPU threads organized?

• On the GPU, the number of blocks and threads per block is exposed through intrinsic thread coordinate variables:
  – Dimensions
  – IDs

  Variable                                 Meaning
  gridDim.x, gridDim.y, gridDim.z          Number of blocks in a kernel launch.
  blockIdx.x, blockIdx.y, blockIdx.z       Unique ID of the block that contains the current thread.
  blockDim.x, blockDim.y, blockDim.z       Number of threads in each block.
  threadIdx.x, threadIdx.y, threadIdx.z    Unique ID of the current thread within its block.
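For illustration, a minimal sketch (kernel name and arguments assumed, not from the course materials) of how these coordinates combine to give each thread a unique global (row, column) position in a two-dimensional launch:

    __global__ void kernel2d(float *mat, int width, int height) {
        // global column and row of this thread within the whole grid
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < height && col < width)
            mat[row * width + col] = 0.0f;   // e.g., zero one matrix element per thread
    }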
How are GPU threads organized?

• To calculate a globally unique ID for a thread on the GPU inside a one-dimensional grid and one-dimensional block:

    kernel<<<4, 8>>>(...);

    __global__ void kernel(...) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        ...
    }

[Figure: four blocks of eight threads each (thread indices 0-7); for the highlighted thread, blockIdx.x = 2, blockDim.x = 8, and threadIdx.x = 2, so tid = 2 * 8 + 2 = 18]
How are GPU threads organized?

• Thread coordinates offer a way to differentiate threads and identify thread-specific input data or code paths.
  – Link data and computation: a mapping

    __global__ void kernel(int *arr) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < 32) {
            arr[tid] = f(arr[tid]);   // code path for threads with tid < 32
        } else {
            arr[tid] = g(arr[tid]);   // code path for threads with tid >= 32
        }
    }

• Thread divergence: recall that in the SIMT execution model both code paths are executed by the warp, with threads disabled on the path they do not take.
How is GPU memory managed?

• CUDA memory management API
  – Allocation of GPU memory
  – Transfer of data from the host to GPU memory
  – Freeing GPU memory

  Host Function    CUDA Analogue
  malloc           cudaMalloc
  memcpy           cudaMemcpy
  free             cudaFree
How is GPU memory managed?

    cudaError_t cudaMalloc(void **devPtr, size_t size);

  – Allocates size bytes of GPU memory and stores their address at *devPtr

    cudaError_t cudaFree(void *devPtr);

  – Releases the device memory allocation stored at devPtr
  – Must be an allocation that was created using cudaMalloc
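Since each of these calls returns a cudaError_t, a common pattern is to wrap them in a checking macro. A minimal sketch (the macro name CHECK is assumed, not part of the CUDA API):

    #include <stdio.h>
    #include <stdlib.h>

    #define CHECK(call)                                               \
        do {                                                          \
            cudaError_t err = (call);                                 \
            if (err != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s:%d: %s\n",             \
                        __FILE__, __LINE__, cudaGetErrorString(err)); \
                exit(1);                                              \
            }                                                         \
        } while (0)

    /* usage: CHECK(cudaMalloc((void **)&d_arr, nbytes)); */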
How is GPU memory managed?

    cudaError_t cudaMemcpy(void *dst, const void *src,
                           size_t count, enum cudaMemcpyKind kind);

  – Transfers count bytes from the memory pointed to by src to dst
  – kind can be:
    • cudaMemcpyHostToHost
    • cudaMemcpyHostToDevice
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
  – The locations of dst and src must match kind, e.g. if kind is cudaMemcpyHostToDevice then src must be a host array and dst must be a device array
How is GPU memory managed?

    void *d_arr, *h_arr;
    h_arr = … ; /* init host memory and data */

    // Allocate memory on the GPU; its address is stored in d_arr
    cudaMalloc((void **)&d_arr, nbytes);

    // Transfer data from the host to the device
    cudaMemcpy(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice);

    // Transfer data from the device to the host
    cudaMemcpy(h_arr, d_arr, nbytes, cudaMemcpyDeviceToHost);

    // Free the allocated device memory
    cudaFree(d_arr);
CUDA Program Flow

• At its most basic, the flow of a CUDA program is as follows:
  1. Allocate GPU memory
  2. Populate GPU memory with inputs from the host
  3. Execute a GPU kernel on those inputs
  4. Transfer outputs from the GPU back to the host
  5. Free GPU memory
• Let's take a look at a simple example that manipulates data
AXPY Example with OpenMP: Multicore

• y = α·x + y
  – x and y are vectors of size n
  – α is a scalar
• Data (x, y, and α) are shared
  – Parallelization is relatively easy (see the sketch below)
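A minimal sketch of the multicore version (function name assumed; the slide's original code figure is not reproduced here):

    void axpy(int n, float a, float *x, float *y) {
        // iterations are independent, so the loop parallelizes directly
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }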
CUDA Program Flow

• AXPY is an embarrassingly parallel problem
  – How can vector addition be parallelized?
  – How can we map this to GPUs?
    • Each thread does one element

[Figure: vectors A and B combined element-wise into C, one thread per element]
AXPY Offloading to a GPU Using CUDA

• The offloaded version goes through five steps (see the sketch below):
  – Memory allocation on the device
  – Memcpy from host to device
  – Launch parallel execution
  – Memcpy from device to host
  – Deallocation of device memory
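A minimal sketch of the offloaded AXPY (kernel and host function names assumed; error checking omitted):

    __global__ void axpy_kernel(int n, float a, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void axpy(int n, float a, float *h_x, float *h_y) {
        float *d_x, *d_y;
        size_t nbytes = n * sizeof(float);

        // Memory allocation on the device
        cudaMalloc((void **)&d_x, nbytes);
        cudaMalloc((void **)&d_y, nbytes);

        // Memcpy from host to device
        cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, nbytes, cudaMemcpyHostToDevice);

        // Launch parallel execution: one thread per element
        int threads_per_block = 256;
        int nblocks = (n + threads_per_block - 1) / threads_per_block;
        axpy_kernel<<<nblocks, threads_per_block>>>(n, a, d_x, d_y);

        // Memcpy from device to host
        cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);

        // Deallocation of device memory
        cudaFree(d_x);
        cudaFree(d_y);
    }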
CUDA Program Flow

• Consider the workflow of the example vector addition vecAdd.cu:
  1. Allocate space for A, B, and C on the GPU
  2. Transfer the initial contents of A and B to the GPU
  3. Execute a kernel in which each thread sums Ai and Bi, and stores the result in Ci
  4. Transfer the final contents of C back to the host
  5. Free A, B, and C on the GPU
• Exercises: modify the program to compute C = A + B + C, and A = B * C; we will need both C and A on the host side after the GPU computation.
• Compile and run on bridges:
  – https://passlab.github.io/CSCE569/resources/HardwareSoftware.html#interactive
  – Copy the gpu_code_examples folder from my home folder:
    • cp -r ~yan/gpu_code_examples ~
  – $ nvcc -Xcompiler -fopenmp vectorAdd.cu
  – $ ./a.out
More Examples and Exercises

• Matvec:
  – Version 1: each thread computes one element of the final vector
  – Version 2:
• Matmul in assignment #4
  – Version 1: each thread computes one row of the final matrix C
CUDA SDK Examples

• CUDA Programming Manual:
  – http://docs.nvidia.com/cuda/cuda-c-programming-guide
• CUDA SDK examples on bridges
  – module load gcc/5.3.0 cuda/8.0
  – export CUDA_PATH=/opt/packages/cuda/8.0
  – /opt/packages/cuda/8.0/samples
    • Copy to your home folder:
      – cp -r /opt/packages/cuda/8.0/samples ~/CUDA_samples
    • Do a "make" in the folder and it will build all the sources
    • Or go to a specific example folder and make; it will build only that binary
• Find ones you are interested in and run them to see
Inspecting CUDA Programs

• Debugging a CUDA program:
  – cuda-gdb debugging tool, like gdb
• Profiling a program to examine its performance
  – nvprof tool, like gprof
  – nvprof ./vecAdd
Manycore GPU Architectures and Programming: Outline

• Introduction – GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA – global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models – OpenACC and OpenMP
Storing Data on the CPU

• A memory hierarchy emulates a large amount of low-latency memory
  – Cache data from a large, high-latency memory bank in a small, low-latency memory bank

[Figure: CPU memory hierarchy: CPU → L1 Cache → L2 Cache → DRAM]
GPU Memory Hierarchy

[Figure: SIMT thread group → registers and local memory; SIMT thread groups on an SM → on-chip shared memory/cache; SIMT thread groups on a GPU → global, constant, and texture memory]

• More complex than the CPU memory
  – Many different types of memory, each with special-purpose characteristics
    • SRAM
    • DRAM
  – More explicit control over data movement
Storing Data on the GPU

• Registers (SRAM)
  – Lowest-latency memory space on the GPU
  – Private to each CUDA thread
  – Constant pool of registers per SM, divided among threads in resident thread blocks
  – Architecture-dependent limit on the number of registers per thread
  – Registers are not explicitly used by the programmer; they are implicitly allocated by the compiler
  – The -maxrregcount compiler option allows you to limit the number of registers per thread, for example:
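A hypothetical invocation (the cap of 32 registers is an arbitrary example value):

    $ nvcc --maxrregcount=32 vectorAdd.cu   # cap each thread at 32 registers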
Storing Data on the GPU

• Shared memory (SRAM)
  – Declared with the __shared__ keyword
  – Low latency, high bandwidth
  – Shared by all threads in a thread block
  – Explicitly allocated and managed by the programmer: a manual L1 cache
  – Stored on-SM, in the same physical memory as the GPU L1 cache
  – On-SM memory is statically partitioned between the L1 cache and shared memory
Storing Data on the GPU

• GPU caches (SRAM)
  – The behaviour of GPU caches is architecture-dependent
  – Per-SM L1 cache stored on-chip
  – Per-GPU L2 cache stored off-chip; caches values for all SMs
  – Due to the parallelism of accesses, GPU caches do not follow the same LRU rules as CPU caches
Storing Data on the GPU

• Constant memory (DRAM)
  – Declared with the __constant__ keyword
  – Read-only
  – Limited in size: 64KB
  – Stored in device memory (same physical location as global memory)
  – Cached in a per-SM constant cache
  – Optimized for all threads in a warp accessing the same memory cell
Storing Data on the GPU

• Texture memory (DRAM)
  – Read-only
  – Stored in device memory (same physical location as global memory)
  – Cached in a texture-only on-SM cache
  – Optimized for 2D spatial locality (caches are commonly optimized only for 1D locality)
  – Explicitly used by the programmer
  – Special-purpose memory
Storing Data on the GPU

• Global memory (DRAM)
  – Large, high-latency memory
  – Stored in device memory (along with constant and texture memory)
  – Can be declared statically with __device__
  – Can be allocated dynamically with cudaMalloc
  – Explicitly managed by the programmer
  – Optimized for all threads in a warp accessing neighbouring memory cells
Static Global Memory

• Static global memory has a fixed size throughout execution time:

    __device__ float devData;

    __global__ void checkGlobalVariable() {
        printf("devData has value %f\n", devData);
    }

• Initialized using cudaMemcpyToSymbol:

    cudaMemcpyToSymbol(devData, &hostData, sizeof(float));

• Fetched using cudaMemcpyFromSymbol:

    cudaMemcpyFromSymbol(&hostData, devData, sizeof(float));
Dynamic Global Memory

• We have already seen dynamic global memory
  – cudaMalloc dynamically allocates global memory
  – cudaMemcpy transfers to/from global memory
  – cudaFree frees global memory allocated by cudaMalloc
• cudaMemcpy supports 4 types of transfer:
  – cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
• You can also memset global memory:

    cudaError_t cudaMemset(void *devPtr, int value, size_t count);
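A minimal usage sketch (array and size names assumed): like memset, cudaMemset sets each byte, so it is most often used to zero-fill device buffers.

    float *d_arr;
    cudaMalloc((void **)&d_arr, n * sizeof(float));
    cudaMemset(d_arr, 0, n * sizeof(float));   // all bytes 0, so every float is 0.0f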
Global Memory Access Patterns

• CPU caches are optimized for linear, iterative memory accesses
  – Cache lines ensure that accessing one memory cell brings neighbouring memory cells into the cache
  – If an application exhibits good spatial or temporal locality (which many do), later references will also hit in the cache

[Figure: CPU → Cache → System Memory]
Global Memory Access Patterns

• GPU caching is a more challenging problem
  – Thousands of threads cooperating on a problem
  – Difficult to predict the next round of accesses for all threads
• For efficient global memory access, GPUs instead rely on:
  – Large device memory bandwidth
  – Aligned and coalesced memory access patterns
  – Maintaining sufficient pending I/O operations to keep the memory bus saturated and hide global memory latency
Global Memory Access Patterns

• Achieving aligned and coalesced global memory accesses is key to optimizing an application's use of global memory bandwidth
  – Coalesced: the threads within a warp reference memory addresses that can all be serviced by a single global memory transaction (think of a memory transaction as the process of bringing a cache line into the cache)
  – Aligned: the global memory accesses by threads within a warp start at an address boundary that is an even multiple of the size of a global memory transaction
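A minimal sketch contrasting the two patterns (kernel names assumed): in the first kernel, neighbouring threads of a warp touch neighbouring floats, which coalesces into few transactions; in the second, a stride greater than 1 scatters the warp's accesses across many transactions.

    __global__ void coalesced(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = 2.0f * a[i];           // neighbouring threads, neighbouring cells
    }

    __global__ void strided(float *a, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) a[i] = 2.0f * a[i];           // warp's accesses spread out in memory
    }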
Global Memory Access Patterns

• A global memory transaction is either 32 or 128 bytes
  – The size of a memory transaction depends on which caches it passes through
  – If L1 + L2: 128 bytes
  – If L2 only: 32 bytes
  – Which caches a global memory transaction passes through depends on the GPU architecture and the type of access (read vs. write)
Global Memory Access Patterns

• Aligned and coalesced memory access (w/ L1 cache)
  – 32-thread warp, 128-byte memory transaction
• With a 128-byte access, a single transaction is required and all of the loaded bytes are used
Global Memory Access Patterns

• Misaligned and coalesced memory access (w/ L1 cache)
• With 128-byte accesses, two memory transactions are required to load all requested bytes. Only half of the loaded bytes are used.
Global Memory Access Patterns

• Misaligned and uncoalesced memory access (w/ L1 cache)
• With uncoalesced loads, many more bytes are loaded than requested
Global Memory Access Patterns

• Misaligned and uncoalesced memory access (w/ L1 cache)
• One factor to consider with uncoalesced loads: while the efficiency of this access is very low, it may bring many cache lines into the L1/L2 cache that are used by later memory accesses. The GPU is flexible enough to perform well even for applications that present suboptimal access patterns.
Global Memory Access Patterns

• Memory accesses that are not cached in L1 cache are serviced by 32-byte transactions
  – This can improve memory bandwidth utilization
  – However, the L2 cache is device-wide, has higher latency than L1, and is still relatively small → many applications may take a performance hit if the L1 cache is not used for reads
Global Memory Access Patterns

• Aligned and coalesced memory access (w/o L1 cache)
• With 32-byte transactions, four transactions are required and all of the loaded bytes are used
Global Memory Access Patterns

• Misaligned and coalesced memory access (w/o L1 cache)
• With 32-byte transactions, extra memory transactions are still required to load all requested bytes, but the number of wasted bytes is likely reduced compared to 128-byte transactions
Global Memory Access Patterns

• Misaligned and uncoalesced memory access (w/o L1 cache)
• With uncoalesced loads, more bytes are loaded than requested, but with better efficiency than with 128-byte transactions
Global Memory Access Patterns

• Global memory writes are always serviced by 32-byte transactions
Global Memory and Special-Purpose Memory

• Global memory is widely useful and as easy to use as CPU DRAM
• Limitations
  – It is easy to find applications with memory access patterns that are intrinsically poor for global memory
  – Many threads accessing the same memory cell → poor global memory efficiency
  – Many threads accessing sparse memory cells → poor global memory efficiency
• Special-purpose memory spaces address these deficiencies in global memory
  – Specialized for different types of data and different access patterns
  – Give more control over data movement and data placement than CPU architectures do
Shared Memory

• Declared with the __shared__ keyword
• Low latency, high bandwidth
• Shared by all threads in a thread block
• Explicitly allocated and managed by the programmer: a manual L1 cache
• Stored on-SM, in the same physical memory as the GPU L1 cache
• On-SM memory is statically partitioned between the L1 cache and shared memory
Shared Memory Allocation

• Shared memory can be allocated statically or dynamically
• Statically allocated shared memory
  – Size is fixed at compile time
  – Can declare many statically allocated shared memory variables
  – Can be declared globally or inside a device function
  – Can be multi-dimensional arrays

    __shared__ int s_arr[256][256];
Shared Memory Allocation

• Dynamically allocated shared memory
  – Size in bytes is set at kernel launch with a third kernel launch configuration parameter
  – Can only have one dynamically allocated shared memory array per kernel
  – Must be a one-dimensional array

    __global__ void kernel(...) {
        extern __shared__ int s_arr[];
        ...
    }

    kernel<<<nblocks, threads_per_block, shared_memory_bytes>>>(...);
Matvec Using Shared Memory: Matrix-Vector Multiplication
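A minimal sketch (kernel name, tile size, and data layout assumed, not the course's exact code) of y = A·x for an M×N row-major matrix: each thread computes one element of y, and the threads of a block cooperatively stage tiles of x in shared memory to cut global-memory traffic. It assumes the kernel is launched with TILE threads per block, e.g. matvec<<<(M + TILE - 1) / TILE, TILE>>>(d_A, d_x, d_y, M, N);

    #define TILE 256

    __global__ void matvec(const float *A, const float *x, float *y, int M, int N) {
        __shared__ float s_x[TILE];            // tile of the input vector x
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int t = 0; t < N; t += TILE) {
            // each thread loads one element of the current tile of x
            if (t + threadIdx.x < N)
                s_x[threadIdx.x] = x[t + threadIdx.x];
            __syncthreads();                   // tile fully loaded
            int len = (N - t < TILE) ? (N - t) : TILE;
            if (row < M)
                for (int j = 0; j < len; j++)
                    sum += A[row * N + t + j] * s_x[j];
            __syncthreads();                   // done reading this tile
        }
        if (row < M)
            y[row] = sum;
    }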
Matrix Multiplication V1 and V2 in Assignment #4

• https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory
GPU Memory Hierarchy

[Figure: SIMT thread group → registers and local memory; SIMT thread groups on an SM → on-chip shared memory/cache; SIMT thread groups on a GPU → global, constant, and texture memory]

• More complex than the CPU memory
  – Many different types of memory, each with special-purpose characteristics
    • SRAM
    • DRAM
  – More explicit control over data movement
Constant Memory

• Declared with the __constant__ keyword
• Read-only
• Limited in size: 64KB
• Stored in device memory (same physical location as global memory)
• Cached in a per-SM constant cache
• Optimized for all threads in a warp accessing the same memory cell
Constant Memory

• As its name suggests, constant memory is best used for storing constants
  – Values which are read-only
  – Values that are accessed identically by all threads
• For example: suppose all threads are evaluating the equation

    y = mx + b

  for different values of x, but identical values of m and b
  – All threads would reference m and b with the same memory operation
  – This broadcast access pattern is optimal for constant memory (see the sketch below)
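A minimal sketch of that example (kernel and variable names assumed):

    __constant__ float m;   // slope, identical for all threads
    __constant__ float b;   // intercept, identical for all threads

    __global__ void eval_line(const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = m * x[i] + b;   // broadcast reads served by the constant cache
    }

    // host side: m and b are set once with cudaMemcpyToSymbol
    // cudaMemcpyToSymbol(m, &h_m, sizeof(float));
    // cudaMemcpyToSymbol(b, &h_b, sizeof(float));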
Constant Memory

• A simple 1D stencil
  – The target cell is updated based on its 8 neighbors, weighted by some constants c0, c1, c2, c3
Constant Memory

• constantStencil.cu contains an example 1D stencil that uses constant memory:

    __constant__ float coef[RADIUS + 1];

    cudaMemcpyToSymbol(coef, h_coef, (RADIUS + 1) * sizeof(float));

    __global__ void stencil_1d(float *in, float *out, int N) {
        ...
        for (int i = 1; i <= RADIUS; i++) {
            tmp += coef[i] * (smem[sidx + i] - smem[sidx - i]);
        }
        ...
    }
CUDA Synchronization

• When using shared memory, you often must coordinate accesses by multiple threads to the same data
• CUDA offers synchronization primitives that allow you to synchronize among threads
CUDA Synchronization

__syncthreads
  – Synchronizes execution across all threads in a thread block
  – No thread in a thread block can progress past a __syncthreads before all other threads have reached it
  – __syncthreads ensures that all changes to shared and global memory by threads in this block are visible to all other threads in this block (see the sketch below)

__threadfence_block
  – All writes to shared and global memory by the calling thread are visible to all other threads in its block after this fence
  – Does not block thread execution
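As an illustration of __syncthreads, a minimal block-level sum reduction sketch (kernel name assumed; block size assumed to be a power of two):

    __global__ void block_sum(const float *in, float *out, int n) {
        extern __shared__ float s_data[];   // dynamically allocated shared memory
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s_data[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                    // all loads into shared memory complete

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                s_data[tid] += s_data[tid + s];
            __syncthreads();                // this round's partial sums are visible
        }
        if (tid == 0)
            out[blockIdx.x] = s_data[0];    // one partial sum per block
    }

    // launch with the third configuration parameter supplying the shared memory size:
    // block_sum<<<nblocks, tpb, tpb * sizeof(float)>>>(d_in, d_out, n);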
CUDA Synchronization

__threadfence
  – All writes to global memory by the calling thread are visible to all other threads in its grid after this fence
  – Does not block thread execution

__threadfence_system
  – All writes to global memory, page-locked host memory, and memory of other CUDA devices by the calling thread are visible to all other threads on all CUDA devices and all host threads after this fence
  – Does not block thread execution
Suggested Readings

1. Chapters 2, 4, and 5 in Professional CUDA C Programming
2. Cliff Woolley. GPU Optimization Fundamentals. 2013. https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
3. Mark Harris. Using Shared Memory in CUDA C/C++. http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
4. Mark Harris. Optimizing Parallel Reduction in CUDA. http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf