Introduction to CUDA (1 of n*)
Joseph Kider, University of Pennsylvania, CIS 565 - Spring 2011
* Where n is 2 or 3
Agenda
GPU architecture review
CUDA
First of two or three dedicated classes
Acknowledgements
Many slides are from Kayvon Fatahalian's From Shader Code to a Teraflop: How GPU Shader Cores Work:
http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
David Kirk and Wen-mei Hwu's UIUC course: http://courses.engr.illinois.edu/ece498/al/
GPU Architecture Review
GPUs are: parallel, multithreaded, many-core
GPUs have: tremendous computational horsepower, high memory bandwidth
GPU Architecture Review
GPUs are specialized for compute-intensive, highly parallel computation (graphics!)
Transistors are devoted to: processing
Not: data caching, flow control
GPU Architecture Review
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Transistor Usage
Slides from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
Threading Hardware in G80
Sources
Slides by ECE 498 AL: Programming Massively Parallel Processors: Wen-mei Hwu; John Nickolls, NVIDIA
[Diagram: the graphics pipeline: 3D Application or Game → 3D API (OpenGL or Direct3D) → GPU Front End → Primitive Assembly → Rasterization and Interpolation → Raster Operations → Frame Buffer; vertex and fragment stages transform pre-transformed vertices and fragments; the CPU-GPU boundary is AGP/PCIe]
Fixed-function pipeline
[Diagram: the same pipeline with Programmable Vertex Processors and Programmable Fragment Processors replacing the fixed-function vertex and fragment stages]
Programmable pipeline
[Diagram: the same pipeline with a Unified Vertex, Fragment, Geometry Processor replacing the separate programmable stages]
Unified Programmable pipeline
General Diagram (6800/NV40)
TurboCache
Uses PCI-Express bandwidth to render directly to system memory
Card needs less memory
Performance boost while lowering cost
TurboCache Manager dynamically allocates from main memory
Local memory used to cache data and to deliver peak performance when needed
NV40 Vertex Processor
An NV40 vertex processor is able to execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle
NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer checks; resulting sets of 4 pixels (quads) passed on to fragment units
Why NV40 series was better
Massive parallelism
Scalability: lower-end products have fewer pixel pipes and fewer vertex shader units
Computation power: 222 million transistors; first to comply with Microsoft's DirectX 9 spec
Dynamic Branching in pixel shaders
Dynamic Branching
Helps detect if pixel needs shading
Instruction flow handled in groups of pixels
Specify branch granularity (the number of consecutive pixels that take the same branch)
Better distribution of blocks of pixels between the different quad engines
General Diagram (7800/G70)
General Diagram (6800/NV40)
GeForce Go 7800 – Power Issues
Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
Dynamic clock scaling can run as slow as 16 MHz; this is true for the engine, memory, and pixel clocks
Heavier use of clock gating than the desktop version
Runs at voltages lower than any other mobile performance part
Regardless, you won't get much battery-based runtime for a 3D game
GeForce 7800 GTX Parallelism
[Diagram: 8 vertex engines → triangle setup/raster and Z-cull → shader instruction dispatch → 24 pixel shaders → fragment crossbar → 16 raster operation pipelines → 4 memory partitions]
G80 – Graphics Mode
[Diagram: Host → Input Assembler → Vtx / Geom / Pixel Thread Issue → Setup / Rstr / ZCull → Thread Processor managing an array of SP pairs, each with an L1 cache and texture fetch (TF) unit, backed by L2 caches and frame buffer (FB) partitions]
The future of GPUs is programmable processing, so build the architecture around the processor
G80 CUDA mode – A Device Example
Processors execute computing threads
New operating mode / HW interface for computing
[Diagram: Host → Input Assembler → Thread Execution Manager → array of multiprocessors, each with a Parallel Data Cache and texture unit, with load/store paths to Global Memory]
The GPU has evolved into a very flexible and powerful processor:
It's programmable using high-level languages
It supports 32-bit floating point precision
It offers lots of GFLOPS
GPU in every PC and workstation
[Chart: GFLOPS over time for NV30 (GeForce FX 5800), NV35 (GeForce FX 5950 Ultra), NV40 (GeForce 6800 Ultra), G70 (GeForce 7800 GTX), G71 (GeForce 7900 GTX), and G80 (GeForce 8800 GTX)]
Why Use the GPU for Computing ?
What is Behind such an Evolution?
The GPU is specialized for compute-intensive, highly data parallel computation (exactly what graphics rendering is about)
So, more transistors can be devoted to data processing rather than data caching and flow control
The fast-growing video game industry exerts strong economic pressure that forces constant innovation
[Diagram: CPU (large control logic and cache, few ALUs, DRAM) vs. GPU (many ALUs, small control and cache, DRAM)]
What is (Historical) GPGPU ?
General Purpose computation using GPU and graphics API in applications other than 3D graphics
GPU accelerates critical path of application
Data parallel algorithms leverage GPU attributes: large data arrays, streaming throughput; fine-grain SIMD parallelism; low-latency floating point (FP) computation
Applications – see GPGPU.org: game effects (FX), physics, image processing
Previous GPGPU Constraints
Dealing with graphics API: working with the corner cases of the graphics API
Addressing modes: limited texture size/dimension
Shader capabilities: limited outputs
Instruction sets: lack of integer & bit ops
Communication limited: between pixels; scatter a[i] = p
[Diagram: the fragment-program model: per-thread input and temp registers, per-shader constants, and per-context textures feed a Fragment Program, which writes output registers to FB memory]
An Example of Physical Reality Behind CUDA
[Diagram: CPU (host) connected to a GPU with local DRAM (device)]
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions
[Diagram: threads 0-7, each indexed by threadID, all executing:
float x = input[threadID];
float y = func(x);
output[threadID] = y;]
[Diagram: Thread Block 0 through Thread Block N - 1, each block running the same body:
float x = input[threadID];
float y = func(x);
output[threadID] = y;]
Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks
Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
Threads in different blocks cannot cooperate
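As a hedged illustration of that cooperation (the kernel name, block size, and reversal task are made up for this sketch, not taken from the slides):

// Hypothetical kernel: each 256-thread block reverses its slice of `in`
// using per-block shared memory and a barrier.
__global__ void reverseSlice(const float* in, float* out)
{
    __shared__ float tile[256];                      // per-block shared memory (assumes blockDim.x == 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    tile[threadIdx.x] = in[i];                       // each thread loads one element
    __syncthreads();                                 // barrier: the whole block has loaded
    out[i] = tile[blockDim.x - 1 - threadIdx.x];     // read a value loaded by another thread
}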
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks
All threads share data memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution, for hazard-free shared memory accesses
Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
[Diagram: the Host launches Kernel 1 on Grid 1 (a 3x2 array of Blocks) and Kernel 2 on Grid 2 on the Device; Block (1, 1) is expanded into a 5x3 array of Threads. Courtesy: NVIDIA]
Block and Thread IDs
Threads and blocks have IDs, so each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data (see the sketch below)
Image processing, solving PDEs on volumes, …
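For example, an image-processing kernel can map a 2D grid of 2D blocks directly onto pixels (an illustrative sketch, not from the slides):

// One thread per pixel; launched with 2D blocks and a 2D grid.
__global__ void addBrightness(unsigned char* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)                     // guard against partial blocks
        img[y * width + x] = min(img[y * width + x] + 10, 255);
}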
[Diagram: grid of blocks and expanded block of threads, as above. Courtesy: NVIDIA]
CUDA Device Memory Space Overview
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
[Diagram: (Device) Grid containing Blocks (0, 0) and (1, 0); each block has Shared Memory and per-thread Registers and Local Memory; the grid shares Global, Constant, and Texture Memory with the Host]
Host: the host can R/W global, constant, and texture memories
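A rough sketch of how those spaces appear in CUDA C (names and sizes are illustrative; texture access is omitted):

__constant__ float coeff[16];          // per-grid constant memory (read-only in kernels)
__device__   float table[1024];        // per-grid global memory, statically allocated

__global__ void memorySpaces(float* buf)   // buf: global memory from cudaMalloc()
{
    __shared__ float tile[128];        // per-block shared memory (assumes blockDim.x <= 128)
    float tmp;                         // registers (spills go to per-thread local memory)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tmp = buf[i] * coeff[i % 16] + table[i % 1024];
    tile[threadIdx.x] = tmp;
    __syncthreads();
    buf[i] = tile[threadIdx.x];
}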
Global, Constant, and Texture Memories (Long Latency Accesses)
Global memory: main means of communicating R/W data between host and device; contents visible to all threads
Texture and constant memories: constants initialized by host; contents visible to all threads
[Diagram: the same device memory hierarchy as above. Courtesy: NVIDIA]
[Diagram: the Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Block (1, 1) is expanded into Threads with 3D indices, Thread (0, 0, 0) through Thread (3, 1, 0) and (0, 0, 1) through (3, 0, 1). Courtesy: NVIDIA]
Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
Image processing, solving PDEs on volumes, …
CUDA Memory Model Overview
Global memory: main means of communicating R/W data between host and device; contents visible to all threads; long latency access
We will focus on global memory for now
[Diagram: Host and a Grid of Blocks; each Block has Shared Memory and per-thread Registers; all Blocks access Global Memory]
Parallel Computing on a GPU
8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
Available in laptops, desktops, and clusters
GPU parallelism is doubling every year
Programming model scales transparently
Programmable in C with CUDA tools
GeForce 8800
Tesla S870
Tesla D870
Single-Program Multiple-Data (SPMD)
CUDA integrated CPU + GPU application C program
Serial C code executes on the CPU
Parallel kernel C code executes on GPU thread blocks
[Diagram: CPU serial code, then GPU parallel kernel KernelA<<< nBlk, nTid >>>(args) on Grid 0, more CPU serial code, then GPU parallel kernel KernelB<<< nBlk, nTid >>>(args) on Grid 1]
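A minimal sketch of that structure (the kernels and arguments here are placeholders; only the <<< >>> launch syntax is the real CUDA mechanism):

__global__ void KernelA(float* d) { int i = blockIdx.x * blockDim.x + threadIdx.x; d[i] += 1.0f; }
__global__ void KernelB(float* d) { int i = blockIdx.x * blockDim.x + threadIdx.x; d[i] *= 2.0f; }

void run(float* d_data, int nBlk, int nTid)
{
    // ... serial C code on the CPU ...
    KernelA<<<nBlk, nTid>>>(d_data);   // Grid 0: parallel kernel on the GPU
    // ... more serial C code on the CPU ...
    KernelB<<<nBlk, nTid>>>(d_data);   // Grid 1: second kernel launch
    cudaDeviceSynchronize();           // wait for the device before using the results
}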
[Diagram: Host, Kernel 1 and Kernel 2 launching Grid 1 and Grid 2 of Blocks on the Device; Block (1, 1) is expanded into Threads with 3D indices. Courtesy: NVIDIA]
Grids and Blocks
A kernel is executed as a grid of thread blocks
All threads share global memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution using a barrier
Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
CUDA Thread Block
Programmer declares a (Thread) Block:
Block size: 1 to 512 concurrent threads
Block shape: 1D, 2D, or 3D
Block dimensions in threads
All threads in a Block execute the same thread program
Threads share data and synchronize while doing their share of the work
Threads have thread id numbers within the Block
[Diagram: a CUDA Thread Block with thread IDs 0 1 2 3 … m, all running the same thread program. Courtesy: John Nickolls, NVIDIA]
GeForce-8 Series HW Overview
[Diagram: Streaming Processor Array of Texture Processor Clusters (TPCs); each TPC contains a TEX unit and two SMs; each SM has an Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]
SPA: Streaming Processor Array (variable across GeForce 8-series, 8 in GeForce 8800)
TPC: Texture Processor Cluster (2 SM + TEX)
SM: Streaming Multiprocessor (8 SP); multi-threaded processor core; fundamental processing unit for a CUDA thread block
SP: Streaming Processor; scalar ALU for a single CUDA thread
CUDA Processor Terminology
Streaming Multiprocessor (SM)
8 Streaming Processors (SP)
2 Super Function Units (SFU)
Multi-threaded instruction dispatch
1 to 512 threads active
Shared instruction fetch per 32 threads
Cover latency of texture/memory loads
20+ GFLOPS
16 KB shared memory
Texture and global memory access
[Diagram: a Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]
G80 Thread Computing Pipeline
Processors execute computing threads
Alternative operating mode specifically for computing
[Diagram: in compute mode, the Host feeds an Input Assembler and a Thread Execution Manager that generates thread grids based on kernel calls; the SMs have Parallel Data Caches and texture units, with load/store paths to Global Memory]
The future of GPUs is programmable processing, so build the architecture around the processor
[Diagram: the G80 graphics-mode pipeline (Host → Input Assembler → Vtx / Geom / Pixel Thread Issue → Setup / Rstr / ZCull → SP arrays with L1/TF, L2, FB) shown for comparison]
Thread Life Cycle in HW
Grid is launched on the SPA
Thread Blocks are serially distributed to all the SMs
Potentially more than one Thread Block per SM
Each SM launches Warps of Threads
2 levels of parallelism
SM schedules and executes Warps that are ready to run
As Warps and Thread Blocks complete, resources are freed and the SPA can distribute more Thread Blocks
[Diagram: Host, Kernel 1 and Kernel 2 launching Grid 1 and Grid 2 of Blocks on the Device; Block (1, 1) is expanded into a 5x3 array of Threads]
SM Executes Blocks
Threads are assigned to SMs in Block granularity
Up to 8 Blocks to each SM, as resources allow
An SM in G80 can take up to 768 threads
Could be 256 (threads/block) * 3 blocks, or 128 (threads/block) * 6 blocks, etc.
Threads run concurrently
[Diagram: SM 0 and SM 1, each with MT IU, SP, and Shared Memory, holding Blocks of threads t0 t1 t2 … tm; both share TF, Texture L1, L2, and Memory]
Thread Scheduling/Execution
Each Thread Block is divided into 32-thread Warps
This is an implementation decision, not part of the CUDA programming model
Warps are scheduling units in an SM
If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM?
Each Block is divided into 256/32 = 8 Warps
There are 8 * 3 = 24 Warps
[Diagram: Block 1 Warps and Block 2 Warps (each warp is threads t0 t1 t2 … t31) queued on a Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]
SM Warp Scheduling
SM hardware implements zero-overhead Warp scheduling
Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible Warps are selected for execution on a prioritized scheduling policy
All threads in a Warp execute the same instruction when selected
4 clock cycles needed to dispatch the same instruction for all threads in a Warp in G80
[Diagram: the SM multithreaded Warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]
SM Instruction Buffer – Warp Scheduling
Fetch one warp instruction/cycle from the instruction L1 cache into any instruction buffer slot
Issue one "ready-to-go" warp instruction/cycle from any warp's instruction buffer slot
Operand scoreboarding used to prevent hazards
Issue selection based on round-robin/age of warp
SM broadcasts the same instruction to the 32 threads of a Warp
[Diagram: I$ L1 → Multithreaded Instruction Buffer → RF / C$ L1 / Shared Mem → Operand Select → MAD and SFU units]
Scoreboarding
All register operands of all instructions in the Instruction Buffer are scoreboarded
Instruction becomes ready after the needed values are deposited; prevents hazards; cleared instructions are eligible for issue
Decoupled memory/processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue; allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops
Granularity Considerations
For Matrix Multiplication, should I use 4X4, 8X8, 16X16 or 32X32 tiles?
For 4X4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, thus there will be only 128 threads in each SM!
There are 8 warps but each warp is only half full.
For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, each SM can only take up to 8 Blocks, only 512 threads will go into each SM!
There are 16 warps available for scheduling in each SM
Each warp spans four slices in the y dimension
For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
There are 24 warps available for scheduling in each SM
Each warp spans two slices in the y dimension
Memory Hardware in G80
CUDA Device Memory Space: Review
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
[Diagram: the device memory hierarchy as above: Grid, Blocks with Shared Memory, per-thread Registers and Local Memory, plus Global, Constant, and Texture Memory]
Host: the host can R/W global, constant, and texture memories
Parallel Memory Sharing
Local Memory: per-thread
Private per thread; auto variables, register spill
Shared Memory: per-Block
Shared by threads of the same block; inter-thread communication
Global Memory: per-application
Shared by all threads; inter-Grid communication
[Diagram: a Thread with its Local Memory; a Block with its Shared Memory; sequential Grids in time (Grid 0, Grid 1, …) sharing Global Memory]
SM Memory Architecture
Threads in a block share data & results
In Memory and Shared Memory
Synchronize at barrier instruction
Per-Block Shared Memory Allocation keeps data close to the processor
[Diagram: SM 0 and SM 1 holding Blocks of threads t0 t1 t2 … tm, each SM with MT IU, SP, and Shared Memory, sharing TF, Texture L1, L2, and Memory]
Courtesy: John Nickolls, NVIDIA
SM Register File
Register File (RF): 32 KB (8K entries) for each SM in G80
TEX pipe can also read/write the RF; 2 SMs share 1 TEX
Load/Store pipe can also read/write the RF
[Diagram: I$ L1 → Multithreaded Instruction Buffer → RF / C$ L1 / Shared Mem → Operand Select → MAD and SFU units]
Programmer View of Register File
There are 8192 registers in each SM in G80
This is an implementation decision, not part of CUDA
Registers are dynamically partitioned across all blocks assigned to the SM
Once assigned to a Block, a register is not accessible by threads in other Blocks
[Figure: the register file split among 4 blocks vs. 3 blocks]
Matrix Multiplication Example
If each Block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM?
Each Block requires 10*256 = 2560 registers
8192 = 3 * 2560 + change
So, three blocks can run on an SM as far as registers are concerned
How about if each thread increases the use of registers by 1?
Each Block now requires 11*256 = 2816 registers
8192 < 2816 * 3, so only two Blocks can run on an SM
More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers/programmers
One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each
This allows for finer grain threading than traditional CPU threading models.
The compiler can trade off between instruction-level parallelism and thread-level parallelism
Let’s program this thing!
GPU Computing History
2001/2002 – researchers see the GPU as a data-parallel coprocessor
The GPGPU field is born
2007 – NVIDIA releases CUDA
CUDA – Compute Unified Device Architecture
GPGPU shifts to GPU Computing
2008 – Khronos releases the OpenCL specification
CUDA Abstractions
A hierarchy of thread groups
Shared memories
Barrier synchronization
CUDA Terminology
Host – typically the CPU
Code written in ANSI C
Device – typically the GPU (data-parallel)
Code written in extended ANSI C
Host and device have separate memories
CUDA Program: contains both host and device code
CUDA Terminology
Kernel – data-parallel function
Invoking a kernel creates lightweight threads on the device
Threads are generated and scheduled with hardware
Does a kernel remind you of a shader in OpenGL?
CUDA Kernels
Executed N times in parallel by N different CUDA threads
[Code figure callouts: declaration specifier, execution configuration, thread ID]
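A minimal sketch tying those three callouts together (it follows the vector-add pattern of the CUDA C Programming Guide; the names here are illustrative):

__global__ void VecAdd(const float* A, const float* B, float* C)   // __global__ is the declaration specifier
{
    int i = threadIdx.x;               // thread ID
    C[i] = A[i] + B[i];
}

// In host code, the execution configuration goes between <<< and >>>:
//   VecAdd<<<1, N>>>(d_A, d_B, d_C);   // 1 block of N threads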
CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
Grid – one or more thread blocks; 1D or 2D
Block – array of threads; 1D, 2D, or 3D
Each block in a grid has the same number of threads
Each thread in a block can:
Synchronize
Access shared memory
Thread Hierarchies
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
Block – 1D, 2D, or 3D
Example: index into vector, matrix, volume
Thread Hierarchies
Thread ID: scalar thread identifier
Thread Index: threadIdx
1D: Thread ID == Thread Index
2D block with size (Dx, Dy): Thread ID of index (x, y) == x + y * Dx
3D block with size (Dx, Dy, Dz): Thread ID of index (x, y, z) == x + y * Dx + z * Dx * Dy
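In code, the same flattening can be written in terms of threadIdx and blockDim, with the block dimensions playing the role of Dx, Dy, Dz (a sketch):

__device__ int flatThreadId()
{
    // 3D block of size (Dx, Dy, Dz) = (blockDim.x, blockDim.y, blockDim.z)
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}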
Thread Hierarchies
[Figure: a single thread block as a 2D block with a 2D index]
Thread Hierarchies
Thread Block: group of threads
G80 and GT200: up to 512 threads; Fermi: up to 1024 threads
Reside on the same processor core; share the memory of that core
Thread Hierarchies
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
Block Index: blockIdx
Dimension: blockDim (1D or 2D)
Thread Hierarchies
2D Thread Block: 16x16 threads per block
Thread Hierarchies
Example: N = 32
16x16 threads per block (independent of N)
threadIdx ([0, 15], [0, 15])
2x2 thread blocks in the grid
blockIdx ([0, 1], [0, 1])
blockDim = 16
i = [0, 1] * 16 + [0, 15]
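In kernel code, that example corresponds to the usual global-index computation (a sketch matching the numbers above; the kernel name is made up):

// Launched as: index32<<<dim3(2, 2), dim3(16, 16)>>>(d_out);
__global__ void index32(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // [0, 1] * 16 + [0, 15]  ->  0..31
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // same for the y dimension
    out[j * 32 + i] = i;                             // one thread per element of a 32x32 domain
}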
Thread Hierarchies
Thread blocks execute independently
In any order: parallel or series
Scheduled in any order by any number of cores
Allows code to scale with core count
Thread Hierarchies
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Thread Hierarchies
Threads in a block:
Share (limited) low-latency memory
Synchronize execution, to coordinate memory accesses
__syncthreads()
Barrier: threads in the block wait until all threads reach it
Lightweight
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Host can transfer to/from device:
Global memory
Constant memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
cudaMalloc(): allocate global memory on the device
cudaFree(): free memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
cudaMalloc() takes a pointer to device memory and a size in bytes
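The referenced code figure is not reproduced above; a minimal sketch of the calls it annotates (the matrix name and width are illustrative):

#include <cuda_runtime.h>

void allocateExample(int width)
{
    float* Md = NULL;                              // pointer to device (global) memory
    size_t size = width * width * sizeof(float);   // size in bytes
    cudaMalloc((void**)&Md, size);                 // allocate global memory on the device
    // ... launch kernels that use Md ...
    cudaFree(Md);                                  // release the allocation
}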
CUDA Memory Transfers
cudaMemcpy(): memory transfer
Host to host
Host to device
Device to host
Device to device
[Diagram: Host and Device with its Global Memory]
Does this remind you of VBOs in OpenGL?
Note: plain cudaMemcpy() transfers are blocking; asynchronous transfers use cudaMemcpyAsync()
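A sketch of the corresponding calls; the last argument selects the direction, and the destination pointer comes first (names are illustrative):

void copyExample(float* M, float* P, float* Md, float* Pd, size_t size)
{
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host to device
    // ... launch a kernel that reads Md and writes Pd ...
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);   // device to host
}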
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
[Code figure callouts: a host-to-device cudaMemcpy() with the destination (device pointer) first and the source (host pointer) second]
Matrix Multiply
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
P = M * N
Assume M and N are square for simplicity
Is this data-parallel?
Matrix Multiply
1,000 x 1,000 matrix
1,000,000 dot products
Each is 1,000 multiplies and 1,000 adds
Matrix Multiply: CPU Implementation
Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply
Step 1: add CUDA memory transfers to the skeleton
Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
[Code figure callouts: allocate input, allocate output, read back from device]
Does this remind you of GPGPU with GLSL?
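The referenced code is not reproduced above; a hedged reconstruction of Step 1, following the structure the callouts describe (parameter names may differ from the textbook's):

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    int size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;

    cudaMalloc((void**)&Md, size);                      // allocate input
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);                      // allocate input
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Pd, size);                      // allocate output

    // Step 2/3: kernel definition and invocation go here

    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);    // read back from device
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);           // free device matrices
}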
Matrix Multiply
Step 2: implement the kernel in CUDA C
Matrix Multiply: CUDA Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Accessing a matrix, so using a 2D block
Each kernel computes one output
Where did the two outer for loops in the CPU implementation go?
No locks or synchronization, why?
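A hedged reconstruction of the kernel those callouts describe: one output element per thread, indexed by a 2D thread ID within a single block (the simple, non-tiled version):

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;                      // 2D thread index within the block
    int ty = threadIdx.y;

    float Pvalue = 0;                          // each thread computes one element of Pd
    for (int k = 0; k < width; ++k)            // the outer i/j loops became the thread grid
    {
        float Melement = Md[ty * width + k];   // walk a row of Md
        float Nelement = Nd[k * width + tx];   // walk a column of Nd
        Pvalue += Melement * Nelement;
    }
    Pd[ty * width + tx] = Pvalue;              // no locks needed: each thread writes its own element
}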
Matrix Multiply
Step 3: invoke the kernel in CUDA C
Matrix Multiply: Invoke Kernel
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
One block with width by width threads
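A hedged reconstruction of the launch, matching the callout (a single block of width x width threads); it slots into MatrixMulOnDevice() where the Step 2/3 comment appears in the earlier sketch:

dim3 dimGrid(1, 1);            // one block
dim3 dimBlock(width, width);   // width x width threads in that block
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);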
Matrix Multiply
One Block of threads computes matrix Pd
Each thread computes one element of Pd
Each thread:
Loads a row of matrix Md
Loads a column of matrix Nd
Performs one multiply and addition for each pair of Md and Nd elements
Compute to off-chip memory access ratio close to 1:1 (not very high)
Size of matrix limited by the number of threads allowed in a thread block
[Diagram: Grid 1, Block 1: Thread (2, 2) computes one element of Pd from a row of Md and a column of Nd (e.g. the dot product of (3, 2, 5, 4) with (2, 4, 2, 6) = 48); matrices are WIDTH x WIDTH]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
Matrix Multiply
What is the major performance problem with our implementation?
What is the major limitation?