1. CUDA Programming Model Review
- Parallel kernels are composed of many threads; all threads execute the same sequential program
- Use parallel threads rather than sequential loops
- Threads are grouped into Cooperative Thread Arrays (CTAs)
  - Threads in the same CTA cooperate and share memory
  - A CTA implements a CUDA thread block
- CTAs are grouped into grids
- Threads and blocks have unique IDs: threadIdx, blockIdx
- Blocks and grids have dimensions: blockDim, gridDim
- A warp in CUDA is a group of 32 threads, the minimum unit of data processed in SIMD fashion by a CUDA multiprocessor
[Diagram: threads t0, t1, ..., tB forming a CTA/block; a grid of CTA 0, CTA 1, CTA 2, ..., CTA m]
2. GPU Architecture: Two Main Components
- Global memory
  - Analogous to RAM in a CPU server
  - Accessible by both GPU and CPU
  - Currently up to 6 GB per GPU
  - Bandwidth currently up to ~180 GB/s (Tesla products)
  - ECC on/off (Quadro and Tesla products)
- Streaming Multiprocessors (SMs)
  - Perform the actual computations
  - Each SM has its own: control units, registers, execution pipelines, caches
[Diagram: GPU die with DRAM interfaces, host interface, GigaThread scheduler, L2 cache, and an array of SMs]
3. GPU Architecture - Fermi: Streaming Multiprocessor (SM)
- 32 CUDA cores per SM
  - 32 fp32 ops/clock
  - 16 fp64 ops/clock
  - 32 int32 ops/clock
- 2 warp schedulers
- Up to 1536 threads concurrently
- 4 special-function units
- 64 KB shared memory + L1 cache
- 32K 32-bit registers
[Diagram: SM with instruction cache, 2 schedulers and dispatch units, register file, 32 cores, load/store units x 16, special func units x 4, interconnect network, 64K configurable cache/shared mem, uniform cache]
4. GPU Architecture - Fermi: CUDA Core
- Floating-point & integer unit
  - IEEE 754-2008 floating-point standard
  - Fused multiply-add (FMA) instruction for both single and double precision
- Logic unit
- Move, compare unit
- Branch unit
[Diagram: CUDA core with dispatch port, operand collector, FP unit, INT unit, result queue]
5. CUDA Execution Model
- Kernels are launched by the host
- Blocks run on a multiprocessor (SM); an entire block is scheduled on a single SM
- Multiple blocks can reside on an SM at the same time
  - Limit is 8 blocks/SM on Fermi
  - Limit is 16 blocks/SM on Kepler
[Diagram: device processor array of SMs, each with instruction units, SPs, and shared memory, above device memory]
6. Hardware Multithreading
- Hardware allocates resources to blocks
  - Blocks need: thread slots, registers, shared memory
  - A block doesn't run until resources are available for all of its threads
- Hardware schedules threads in units of warps
  - Threads have their own registers; context switching is (basically) free
  - Every cycle, hardware picks from warps that have an instruction ready (i.e., all operands ready) to execute
- Hardware relies on threads to hide latency
  - i.e., parallelism is necessary for performance
8. SM Schedules Warps & Issues Instructions
- Dual-issue pipelines select two warps to issue each cycle
- SIMT: a warp executes one instruction for up to 32 threads
[Diagram: two warp schedulers, each with an instruction dispatch unit, issuing over time: warp 8 instruction 11 / warp 9 instruction 11, warp 2 instruction 42 / warp 3 instruction 33, warp 14 instruction 95 / warp 15 instruction 95, then warp 8 instruction 12 / warp 9 instruction 12, warp 14 instruction 96 / warp 3 instruction 34, warp 2 instruction 43 / warp 15 instruction 96]
9. Introducing: Kepler GK110
10. Welcome the Kepler GK110 GPU
- Performance
- Efficiency
- Programmability
12. SMX: Efficient Performance
- Power-aware SMX architecture
- Power-aware clocks & feature size
- SMX result: performance up, power down
13. Power vs. Clock Speed Example

                     Logic            Clocking
                     Area    Power    Area    Power
Fermi  (2x clock)    1.0x    1.0x     1.0x    1.0x
Kepler (1x clock)    1.8x    0.9x     1.0x    0.5x

[Diagram: one unit A/B running at 2x clock (Fermi) vs. two parallel units A and B at 1x clock (Kepler)]
14. Kepler SMX vs. Fermi SM
[Diagram: side-by-side SM comparison. Fermi SM: instruction cache, 2 warp schedulers with 2 dispatch units, register file, 32 CUDA cores (each with dispatch port, operand collector, ALU, result queue), load/store units x 16, special func units x 4, interconnect network, 64K configurable cache/shared mem, uniform cache. Kepler SMX: instruction cache, 4 warp schedulers with 8 dispatch units, register file (65,536 x 32-bit), a much larger array of cores plus load/store and special-function units, interconnect network, 64K configurable cache/shared mem, uniform cache]
15. SMX: Balance of Resources

Resource                     Kepler GK110 vs. Fermi
Floating-point throughput    2-3x
Max blocks per SMX           2x
Max threads per SMX          1.3x
Register file bandwidth      2x
Register file capacity       2x
Shared memory bandwidth      2x
Shared memory capacity       1x
16. New ISA Encoding: 255 Registers per Thread
- Fermi limit: 63 registers per thread
  - A common Fermi performance limiter
  - Leads to excessive spilling
- Kepler: up to 255 registers per thread
  - Especially helpful for FP64 apps
17. New High-Performance SMX Instructions
- SHFL (shuffle): intra-warp data exchange
- ATOM: broader functionality, faster
- Compiler-generated high-performance instructions: bit shift, bit rotate, fp32 division, read-only cache
18. New Instruction: SHFL
- Data exchange between threads within a warp
- Avoids use of shared memory
- One 32-bit value per exchange
- 4 variants:
  - __shfl(): indexed any-to-any exchange
  - __shfl_up(): shift right to nth neighbour
  - __shfl_down(): shift left to nth neighbour
  - __shfl_xor(): butterfly (XOR) exchange
[Diagram: lanes a b c d e f g h permuted by each variant]
19. SHFL Example: Warp Prefix-Sum

__global__ void shfl_prefix_sum(int *data)
{
    int id = threadIdx.x;
    int lane_id = threadIdx.x & (warpSize - 1);  // lane within the warp

    int value = data[id];

    // Now accumulate in log2(32) steps
    for (int i = 1; i < warpSize; i *= 2) {
        int n = __shfl_up(value, i);
        if (lane_id >= i)
            value += n;
    }

    // Write out our result
    data[id] = value;
}

Worked example over 8 lanes:
  input:                            3  8  2  6  3  9  1  4
  after value += __shfl_up(., 1):   3 11 10  8  9 12 10  5
  final prefix sum:                 3 11 13 19 22 31 32 36
20. ATOM Instruction Enhancements
- Added int64 functions to match existing int32
- 2-10x performance gains
  - Shorter processing pipeline
  - More atomic processors
  - Slowest operations 10x faster, fastest 2x faster

Atom op       int32   int64
add           x       x
cas           x       x
exch          x       x
min/max       x       X (new)
and/or/xor    x       X (new)
21. High-Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops.
Example: data reduction (sum of all values), without atomics:
1. Divide the input data array into N sections
2. Launch N blocks; each reduces one section
3. Output is N values
4. A second launch of N threads reduces the outputs to a single value
22. High-Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops.
Example: data reduction (sum of all values), with atomics:
1. Divide the input data array into N sections
2. Launch N blocks; each reduces one section
3. Write the output directly via an atomic; no need for a second kernel launch
23. Textures
Using textures in CUDA 4.0:
1. Bind a texture to a memory region: cudaBindTexture2D(ptr, width, height)
2. Launch the kernel
3. Use tex1D / tex2D to access memory from the kernel: int value = tex2D(texture, x, y)
[Diagram: global memory region (ptr, width, height) mapped to a texture addressed by (x, y) from origin (0,0)]
24. Texture Pros & Cons

Good stuff                Bad stuff
Dedicated cache           Explicit global binding
Separate memory pipe      Limited number of global textures
Relaxed coalescing        No dynamic texture indexing
Samplers & filters        No arrays of texture references
                          Different read/write instructions
                          Separate memory region (uses offsets, not pointers)
25. Bindless Textures
Kepler permits dynamic binding of textures:
- Textures are now referenced by ID
- Create a new ID when needed, destroy it when needed
- IDs can be passed as parameters
- Dynamic texture indexing
- Arrays of texture IDs supported
- 1000s of IDs possible
This removes most of the earlier "bad stuff": explicit global binding, the limited number of global textures, no dynamic texture indexing, and no arrays of texture references.
26. Global Load Through Texture
Load from a direct address through the texture pipeline:
- Eliminates the need for texture setup
- Access the entire memory space through texture
- Use normal pointers to read via texture
- Emitted automatically by the compiler where possible
- Can hint to the compiler with "const __restrict"
This removes the remaining "bad stuff": different read/write instructions and the separate memory region (offsets instead of pointers).
27. const __restrict Example
Annotate eligible kernel parameters with const __restrict; the compiler will then automatically use the read-only data cache path for those loads.

__global__ void saxpy(float x, float y,
                      const float * __restrict input,
                      float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // Compiler will automatically use texture loads for "input"
    output[offset] = (input[offset] * x) + y;
}