1. CUDA Programming Model Review
- Parallel kernels are composed of many threads; all threads execute the same sequential program
- Use parallel threads rather than sequential loops
- Threads are grouped into Cooperative Thread Arrays (CTAs)
  - Threads in the same CTA cooperate and share memory
  - A CTA implements a CUDA thread block
- CTAs are grouped into grids
- Threads and blocks have unique IDs: threadIdx, blockIdx
- Blocks and grids have dimensions: blockDim, gridDim
- A warp in CUDA is a group of 32 threads, the minimum unit of data processed in SIMD fashion by a CUDA multiprocessor
[Diagram: threads t0, t1, ..., tB forming a CTA/block; a grid of CTA 0, CTA 1, CTA 2, ..., CTA m]
2. GPU Architecture: Two Main Components
- Global memory
  - Analogous to RAM in a CPU server
  - Accessible by both GPU and CPU
  - Currently up to 6 GB per GPU
  - Bandwidth currently up to ~180 GB/s (Tesla products)
  - ECC on/off (Quadro and Tesla products)
- Streaming Multiprocessors (SMs)
  - Perform the actual computations
  - Each SM has its own: control units, registers, execution pipelines, caches
[Diagram: GPU die with DRAM interfaces, host interface, GigaThread scheduler, L2 cache, and an array of SMs]
3. GPU Architecture - Fermi: Streaming Multiprocessor (SM)
- 32 CUDA cores per SM
  - 32 fp32 ops/clock
  - 16 fp64 ops/clock
  - 32 int32 ops/clock
- 2 warp schedulers
- Up to 1536 threads concurrently
- 4 special-function units
- 64 KB shared memory + L1 cache
- 32K 32-bit registers
[Diagram: SM with instruction cache, 2 schedulers and dispatch units, register file, 32 cores, load/store units x 16, special func units x 4, interconnect network, 64K configurable cache/shared mem, uniform cache]
4. GPU Architecture - Fermi: CUDA Core
- Floating-point & integer unit
  - IEEE 754-2008 floating-point standard
  - Fused multiply-add (FMA) instruction for both single and double precision
- Logic unit
- Move, compare unit
- Branch unit
[Diagram: CUDA core with dispatch port, operand collector, FP unit, INT unit, result queue]
5. CUDA Execution Model
- Kernels are launched by the host
- Blocks run on a multiprocessor (SM); an entire block is scheduled on a single SM
- Multiple blocks can reside on an SM at the same time
  - Limit is 8 blocks/SM on Fermi
  - Limit is 16 blocks/SM on Kepler
[Diagram: device processor array of SMs, each with instruction units, SPs, and shared memory, above device memory]
6. Hardware Multithreading
- Hardware allocates resources to blocks
  - Blocks need: thread slots, registers, shared memory
  - A block doesn't run until resources are available for all of its threads
- Hardware schedules threads in units of warps
  - Threads have their own registers; context switching is (basically) free
  - Every cycle, hardware picks from warps that have an instruction ready (i.e., all operands ready) to execute
- Hardware relies on threads to hide latency
  - i.e., parallelism is necessary for performance
8. SM Schedules Warps & Issues Instructions
- Dual-issue pipelines select two warps to issue each cycle
- SIMT: a warp executes one instruction for up to 32 threads
[Diagram: two warp schedulers, each with an instruction dispatch unit, issuing over time: warp 8 instruction 11 / warp 9 instruction 11, warp 2 instruction 42 / warp 3 instruction 33, warp 14 instruction 95 / warp 15 instruction 95, then warp 8 instruction 12 / warp 9 instruction 12, warp 14 instruction 96 / warp 3 instruction 34, warp 2 instruction 43 / warp 15 instruction 96]
9. Introducing: Kepler GK110
10. Welcome the Kepler GK110 GPU
- Performance
- Efficiency
- Programmability
12. SMX: Efficient Performance
- Power-aware SMX architecture
- Power-aware clocks & feature size
- SMX result: performance up, power down
13. Power vs. Clock Speed Example

                     Logic            Clocking
                     Area    Power    Area    Power
Fermi  (2x clock)    1.0x    1.0x     1.0x    1.0x
Kepler (1x clock)    1.8x    0.9x     1.0x    0.5x

[Diagram: one unit A/B running at 2x clock (Fermi) vs. two parallel units A and B at 1x clock (Kepler)]
14. Kepler SMX vs. Fermi SM
[Diagram: side-by-side SM comparison. Fermi SM: instruction cache, 2 warp schedulers with 2 dispatch units, register file, 32 CUDA cores (each with dispatch port, operand collector, ALU, result queue), load/store units x 16, special func units x 4, interconnect network, 64K configurable cache/shared mem, uniform cache. Kepler SMX: instruction cache, 4 warp schedulers with 8 dispatch units, register file (65,536 x 32-bit), a much larger array of cores plus load/store and special-function units, interconnect network, 64K configurable cache/shared mem, uniform cache]
15. SMX: Balance of Resources

Resource                     Kepler GK110 vs. Fermi
Floating-point throughput    2-3x
Max blocks per SMX           2x
Max threads per SMX          1.3x
Register file bandwidth      2x
Register file capacity       2x
Shared memory bandwidth      2x
Shared memory capacity       1x
16. New ISA Encoding: 255 Registers per Thread
- Fermi limit: 63 registers per thread
  - A common Fermi performance limiter
  - Leads to excessive spilling
- Kepler: up to 255 registers per thread
  - Especially helpful for FP64 apps
17. New High-Performance SMX Instructions
- SHFL (shuffle): intra-warp data exchange
- ATOM: broader functionality, faster
- Compiler-generated high-performance instructions: bit shift, bit rotate, fp32 division, read-only cache
18. New Instruction: SHFL
- Data exchange between threads within a warp
- Avoids use of shared memory
- One 32-bit value per exchange
- 4 variants:
  - __shfl(): indexed any-to-any exchange
  - __shfl_up(): shift right to nth neighbour
  - __shfl_down(): shift left to nth neighbour
  - __shfl_xor(): butterfly (XOR) exchange
[Diagram: lanes a b c d e f g h permuted by each variant]
19. SHFL Example: Warp Prefix-Sum

__global__ void shfl_prefix_sum(int *data)
{
    int id = threadIdx.x;
    int lane_id = threadIdx.x & (warpSize - 1);  // lane within the warp

    int value = data[id];

    // Now accumulate in log2(32) steps
    for (int i = 1; i < warpSize; i *= 2) {
        int n = __shfl_up(value, i);
        if (lane_id >= i)
            value += n;
    }

    // Write out our result
    data[id] = value;
}

Worked example over 8 lanes:
  input:                            3  8  2  6  3  9  1  4
  after value += __shfl_up(., 1):   3 11 10  8  9 12 10  5
  final prefix sum:                 3 11 13 19 22 31 32 36
20. ATOM Instruction Enhancements
- Added int64 functions to match existing int32
- 2-10x performance gains
  - Shorter processing pipeline
  - More atomic processors
  - Slowest operations 10x faster, fastest 2x faster

Atom op       int32   int64
add           x       x
cas           x       x
exch          x       x
min/max       x       X (new)
and/or/xor    x       X (new)
21. High-Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops.
Example: data reduction (sum of all values), without atomics:
1. Divide the input data array into N sections
2. Launch N blocks; each reduces one section
3. Output is N values
4. A second launch of N threads reduces the outputs to a single value
22. High-Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops.
Example: data reduction (sum of all values), with atomics:
1. Divide the input data array into N sections
2. Launch N blocks; each reduces one section
3. Write the output directly via an atomic; no need for a second kernel launch
23. Textures
Using textures in CUDA 4.0:
1. Bind a texture to a memory region: cudaBindTexture2D(ptr, width, height)
2. Launch the kernel
3. Use tex1D / tex2D to access memory from the kernel: int value = tex2D(texture, x, y)
[Diagram: global memory region (ptr, width, height) mapped to a texture addressed by (x, y) from origin (0,0)]
24. Texture Pros & Cons

Good stuff                Bad stuff
Dedicated cache           Explicit global binding
Separate memory pipe      Limited number of global textures
Relaxed coalescing        No dynamic texture indexing
Samplers & filters        No arrays of texture references
                          Different read/write instructions
                          Separate memory region (uses offsets, not pointers)
25. Bindless Textures
Kepler permits dynamic binding of textures:
- Textures are now referenced by ID
- Create a new ID when needed, destroy it when needed
- IDs can be passed as parameters
- Dynamic texture indexing
- Arrays of texture IDs supported
- 1000s of IDs possible
This removes most of the earlier "bad stuff": explicit global binding, the limited number of global textures, no dynamic texture indexing, and no arrays of texture references.
26. Global Load Through Texture
Load from a direct address through the texture pipeline:
- Eliminates the need for texture setup
- Access the entire memory space through texture
- Use normal pointers to read via texture
- Emitted automatically by the compiler where possible
- Can hint to the compiler with "const __restrict"
This removes the remaining "bad stuff": different read/write instructions and the separate memory region (offsets instead of pointers).
27. const __restrict Example
Annotate eligible kernel parameters with const __restrict; the compiler will then automatically use the read-only data cache path for those loads.

__global__ void saxpy(float x, float y,
                      const float * __restrict input,
                      float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // Compiler will automatically use texture loads for "input"
    output[offset] = (input[offset] * x) + y;
}