INF5063 – GPU & CUDA
Håkon Kvale Stensland
Department for Informatics / Simula Research Laboratory
GPU – Graphics Processing Units
Basic 3D Graphics Pipeline
[Pipeline diagram] Host: Application → Scene Management. GPU: Geometry → Rasterization → Pixel Processing → ROP/FBI/Display → Frame Buffer Memory.
PC Graphics Timeline
Challenges:
− Render infinitely complex scenes
− At extremely high resolution
− In 1/60th of one second (60 frames per second)

Graphics hardware has evolved from a simple hardwired pipeline to a highly programmable multiprocessor.

The timeline figure runs 1998-2006:
− DirectX 5: Riva 128
− DirectX 6 (multitexturing): Riva TNT
− DirectX 7 (T&L, TextureStageState): GeForce 256
− DirectX 8 (SM 1.x): GeForce 3, Cg
− DirectX 9 (SM 2.0): GeForce FX
− DirectX 9.0c (SM 3.0): GeForce 6
− 2005: GeForce 7 (DirectX 9.0c, SM 3.0)
− 2006: GeForce 8 (DirectX 10, SM 4.0)
Graphics in the PC Architecture
DMI (Direct Media Interface) between processor and chipset
− Memory controller now integrated in the CPU
− The old "Northbridge" is integrated onto the CPU; Intel calls this part of the CPU the "System Agent"
− PCI Express 3.0 x16 bandwidth at 32 GB/s (16 GB/s in each direction)

"Southbridge" (X99) handles all other peripherals

All mainstream CPUs now come with an integrated GPU
− Same capabilities as discrete GPUs
− Less performance (limited by die space and power)

[Figure: Intel Haswell platform diagram]
GPUs not always for Graphics
GPUs are now common in HPC

The second largest supercomputer in November 2013 is Titan at Oak Ridge National Laboratory
− 18,688 16-core Opteron processors
− 18,688 Nvidia Kepler GPUs
− Theoretical: 27 petaflops

Before: the dedicated compute card was released after the graphics model
Now: Nvidia's high-end Kepler GPU (GK110) was released first as a compute product

[Figure: GeForce GTX Titan Black and Tesla K40]
High-end Compute Hardware: nVIDIA Kepler

Architecture: the latest generation GPU, codenamed GK110
− 7.1 billion transistors
− 2688 processing cores (SP)
− IEEE 754-2008 capable
− Shared coherent L2 cache
− Full C++ support
− Up to 32 concurrent kernels
− 6 GB memory
− Supports GPU virtualization

[Figure: Tesla K40]
High-end Graphics Hardware: nVIDIA Maxwell

Architecture: the latest generation GPU, codenamed GM204
− 5.2 billion transistors
− 2048 processing cores (SP)
− IEEE 754-2008 capable
− Shared coherent L2 cache
− Full C++ support
− Up to 32 concurrent kernels
− 4 GB memory
− Supports GPU virtualization

[Figure: GeForce GTX 980]
GeForce GM204 Architecture

[Figure: full-chip block diagram of the GM204]
Lab Hardware
nVIDIA GeForce GTX 750
− All machines have the same GPU
− Maxwell architecture, based on the GM107 chip
− 1870 million transistors
− 512 processing cores (SP) at 1022 MHz (4 SMM)
− 1024 MB memory with 80 GB/s bandwidth
− 1044 GFLOPS theoretical performance
− Compute capability 5.0
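A minimal sketch (assuming the CUDA runtime API and device 0) for checking that the lab GPU really reports compute capability 5.0 and the specifications above:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // query device 0

    printf("Device:             %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors:    %d\n", prop.multiProcessorCount);
    printf("Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);
    return 0;
}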
CPU and GPU Design Philosophy
[Figure: two chip diagrams]
− GPU, throughput-oriented cores: the chip is divided into many compute units, each with a small cache/local memory, registers, SIMD units, and heavy threading
− CPU, latency-oriented cores: a few cores, each with a local cache, registers, SIMD units, and complex control logic
CPUs: Latency Oriented Design
Large caches
− Convert long-latency memory accesses to short-latency cache accesses

Sophisticated control
− Branch prediction for reduced branch latency
− Data forwarding for reduced data latency

Powerful ALUs
− Reduced operation latency

[Figure: CPU block diagram with control logic, a few powerful ALUs, a large cache, and DRAM]
GPUs: Throughput Oriented Design
Small caches
− To boost memory throughput

Simple control
− No branch prediction
− No data forwarding

Energy-efficient ALUs
− Many, with long latency but heavily pipelined for high throughput

Requires a massive number of threads to tolerate latencies

[Figure: GPU block diagram with many simple cores and DRAM]
Think both about CPU and GPU…
CPUs for sequential parts where latency matters
− CPUs can be 10+x faster than GPUs for sequential code

GPUs for parallel parts where throughput wins
− GPUs can be 10+x faster than CPUs for parallel code
The Core: The basic processing block
The nVIDIA approach:
− Called Stream Processors or CUDA cores; each works on a single operation

The AMD approach:
− VLIW5: the GPU works on up to five operations
− VLIW4: the GPU works on up to four operations
− GCN: 16-wide SIMD vector unit

[Figure: Graphics Core Next (GCN) compute unit]
The Core: The basic processing block
The (failed) Intel approach:
− 512-bit SIMD units in x86 cores
− Failed because of the complex x86 cores and a software ROP pipeline

The (new) Intel approach:
− Used in Sandy Bridge, Ivy Bridge, Haswell & Broadwell
− 128 SIMD-8 32-bit registers
The nVIDIA GPU Architecture Evolving
Streaming Multiprocessor (SM) 2.0 on the Fermi architecture (GF1xx)
− 32 CUDA cores
− 4 Super Function Units (SFU)
− Dual schedulers and dispatch units
− 1 to 1536 active threads
− Local register file (32K)
− 64 KB shared memory / Level 1 cache
− 2 operations per cycle

Streaming Multiprocessor (SM) 1.x on the Tesla architecture
− 8 CUDA cores
− 2 Super Function Units (SFU)
− Dual schedulers and dispatch units
− 1 to 512 or 768 active threads
− Local register file (32K)
− 16 KB shared memory
− 2 operations per cycle
The nVIDIA GPU Architecture Evolving
Streaming Multiprocessor (SMX) 3.5 on the Kepler architecture (compute)
− 192 CUDA cores
− 64 DP CUDA cores
− 32 Super Function Units (SFU)
− Four (simple) schedulers and eight dispatch units
− 1 to 2048 active threads
− Local register file (64K)
− 64 KB shared memory / Level 1 cache
− 48 KB texture cache
− 1 operation per cycle

Streaming Multiprocessor (SMX) 3.0 on the Kepler architecture (graphics)
− 192 CUDA cores
− 8 DP CUDA cores
− 32 Super Function Units (SFU)
− Four (simple) schedulers and eight dispatch units
− 1 to 2048 active threads
− Local register file (32K)
− 64 KB shared memory / Level 1 cache
− 1 operation per cycle
Streaming Multiprocessor (SMM) – 4.0
Streaming Multiprocessor (SMM) on Maxwell (graphics)
− 128 CUDA cores
− 4 DP CUDA cores
− 32 Super Function Units (SFU)
− Four schedulers and eight dispatch units
− 1 to 2048 active threads
− Software-controlled scheduling
− Local register file (64K)
− 64 KB shared memory
− 24 KB Level 1 / texture cache
− 1 operation per cycle
− Used in GM107 / GM204
Execution Pipes in SMM
Scalar MAD pipe
− Float multiply, add, etc.
− Integer operations
− Conversions
− Only one instruction per clock

Scalar SFU pipe
− Special functions like sin, cos, log, etc.
− Only one operation per four clocks

Load/Store pipe
− CUDA has both global and local memory access through Load/Store
− Access to shared memory

Texture units
− Access to texture memory with a texture cache (per SMM)

[Figure: SMM pipeline with L1 instruction cache, multithreaded instruction buffer, register file, L1 constant cache, shared memory, operand select, and MAD/SFU units]
GPGPU
Foils adapted from nVIDIA
What is really GPGPU?
Idea:
• Potential for very high performance at low cost
• Architecture well suited for certain kinds of parallel applications (data parallel)
• Demonstrations of 30-100x speedup over CPU

Early challenges:
− Architectures very customized to graphics problems (e.g., vertex and fragment processors)
− Programmed using graphics-specific programming models or libraries
Previous GPGPU use, and limitations
Working with a graphics API
− Special cases with an API like Microsoft Direct3D or OpenGL

Addressing modes
− Limited by texture size

Shader capabilities
− Limited outputs of the available shader programs

Instruction sets
− No integer or bit operations

Communication is limited
− Between pixels

[Figure: fragment-program model with input registers, constants, texture, and temp registers feeding output registers and frame-buffer memory (per thread / per shader / per context)]
Heterogeneous computing is catching on…
Financial Analysis, Scientific Simulation, Engineering Simulation, Data Intensive Analytics, Medical Imaging, Digital Audio Processing, Computer Vision, Digital Video Processing, Biomedical Informatics, Electronic Design Automation, Statistical Modeling, Ray Tracing, Rendering, Interactive Physics, Numerical Methods
nVIDIA CUDA
"Compute Unified Device Architecture"

General-purpose programming model
− The user starts several batches of threads on a GPU
− The GPU is in this case a dedicated super-threaded, massively data-parallel co-processor

Software stack
− Graphics driver, language compilers (Toolkit), and tools (SDK)

The graphics driver loads programs into the GPU
− All drivers from nVIDIA now support CUDA
− Interface is designed for computing (no graphics)
− "Guaranteed" maximum download & readback speeds
− Explicit GPU memory management
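A rough sketch of what "explicit GPU memory management" means in practice (buffer names are illustrative): the host allocates device memory, downloads data, and reads results back, all explicitly:

#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;
    const size_t bytes = N * sizeof(float);

    float *h_data = (float *) malloc(bytes);        // host buffer
    float *d_data = NULL;                           // device buffer

    cudaMalloc((void **) &d_data, bytes);           // allocate GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // download

    /* ... launch kernels that work on d_data ... */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // readback
    cudaFree(d_data);                               // explicit deallocation
    free(h_data);
    return 0;
}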
Outline
The CUDA Programming Model
− Basic concepts and data types

An example application:
− The good old Motion JPEG implementation!
− Profiling with CUDA
− Make an example program!
The CUDA Programming Model
The GPU is viewed as a compute device that:
− Is a coprocessor to the CPU, referred to as the host
− Has its own DRAM, called device memory
− Runs many threads in parallel

Data-parallel parts of an application are executed on the device as kernels, which run in parallel on many threads

Differences between GPU and CPU threads:
− GPU threads are extremely lightweight
   • Very little creation overhead
− The GPU needs 1000s of threads for full efficiency
   • A multi-core CPU needs only a few
CUDA C – Execution Model
Integrated host + device C program
− Serial or modestly parallel parts in host C code
− Highly parallel parts in device SPMD kernel C code

[Figure: execution alternates between host and device]
Serial Code (host) → Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); → Serial Code (host) → Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
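A minimal sketch of such an integrated program (the kernel body is illustrative, matching the KernelA<<< nBlk, nTid >>> launch syntax above):

#include <cuda_runtime.h>

// Device part: an SPMD kernel, executed by every thread in the grid.
__global__ void KernelA(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 4096;
    float *d_data;
    cudaMalloc((void **) &d_data, n * sizeof(float));

    /* ... serial host code ... */

    int nTid = 256;                        // threads per block
    int nBlk = (n + nTid - 1) / nTid;      // blocks in the grid
    KernelA<<< nBlk, nTid >>>(d_data, n);  // highly parallel device part
    cudaDeviceSynchronize();               // wait for the kernel to finish

    /* ... more serial host code ... */

    cudaFree(d_data);
    return 0;
}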
INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon StenslandUniversity of Oslo
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks
− All threads share the data memory space

A thread block is a batch of threads that can cooperate with each other by:
− Synchronizing their execution
   • Non-synchronous execution is very bad for performance!
− Efficiently sharing data through a low-latency shared memory

Two threads from two different blocks cannot cooperate (see the sketch after the figure)
[Figure: the host launches Kernel 1 as Grid 1 (3×2 blocks) and Kernel 2 as Grid 2; each block, e.g. Block (1, 1), contains a 5×3 arrangement of threads]
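A sketch of how threads locate themselves in the grid and cooperate within a block through shared memory and synchronization (kernel name and sizes are illustrative):

#include <cuda_runtime.h>

__global__ void reverseInBlock(float *data)
{
    __shared__ float tile[256];          // low-latency, per-block shared memory

    // Global index: which block am I in, and which thread within it?
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];
    __syncthreads();                     // synchronize all threads in the block

    // Cooperation: read a value written by another thread of the SAME block.
    // Threads in other blocks are out of reach.
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launch as a grid of blocks, each with 256 threads (must match the tile size):
// reverseInBlock<<< 16, 256 >>>(d_data);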
CUDA Device Memory Space Overview
Each thread can:
− R/W per-thread registers
− R/W per-thread local memory
− R/W per-block shared memory
− R/W per-grid global memory
− Read-only per-grid constant memory
− Read-only per-grid texture memory

The host can R/W global, constant, and texture memories
[Figure: device grid with per-block shared memory, per-thread registers and local memory, and grid-wide global, constant, and texture memories accessible from the host]
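In CUDA C these memory spaces map to declaration qualifiers; a small sketch (all names illustrative, assuming blocks of 128 threads):

#include <cuda_runtime.h>

__constant__ float coeff[16];            // per-grid constant memory (read-only for threads)
__device__   float results[4096];        // per-grid global memory (statically allocated)

__global__ void memorySpaces(const float *global_in)
{
    __shared__ float tile[128];          // per-block shared memory (R/W for the block)

    float acc;                           // per-thread register (automatic variable)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = global_in[i];    // read per-grid global memory
    __syncthreads();

    acc = tile[threadIdx.x] * coeff[threadIdx.x % 16];
    results[i] = acc;                    // write per-grid global memory
}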
Global, Constant, and Texture Memories
Global memory:
− Main means of communicating R/W data between host and device
− Contents visible to all threads

Texture and constant memories:
− Constants initialized by the host
− Contents visible to all threads
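A sketch of the host side (buffer names illustrative): constants are initialized by the host with cudaMemcpyToSymbol, while global memory is read and written with ordinary cudaMemcpy:

#include <cuda_runtime.h>

__constant__ float coeff[16];            // per-grid constant memory

int main(void)
{
    float h_coeff[16] = { 1.0f };        // host-side values
    float *d_global;

    // The host initializes constant memory; device threads can only read it.
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

    // The host reads/writes global memory through cudaMemcpy.
    cudaMalloc((void **) &d_global, 1024 * sizeof(float));
    cudaMemcpy(d_global, h_coeff, sizeof(h_coeff), cudaMemcpyHostToDevice);

    cudaFree(d_global);
    return 0;
}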
Terminology Recap
device = GPU = set of multiprocessors
Multiprocessor = set of processors & shared memory
Kernel = program running on the GPU
Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory     Location   Cached           Access       Who
Local      Off-chip   No               Read/write   One thread
Shared     On-chip    N/A (resident)   Read/write   All threads in a block
Global     Off-chip   No               Read/write   All threads + host
Constant   Off-chip   Yes              Read         All threads + host
Texture    Off-chip   Yes              Read         All threads + host
Access Times
Register – Dedicated HW – Single cycle
Shared memory – Dedicated HW – Single cycle
Local memory – DRAM, no cache – "Slow"
Global memory – DRAM, no cache – "Slow"
Constant memory – DRAM, cached – 1 to 10s to 100s of cycles, depending on cache locality
Texture memory – DRAM, cached – 1 to 10s to 100s of cycles, depending on cache locality
Global Memory Bandwidth
[Figure: ideal access pattern (full bandwidth) vs. the reality of scattered, uncoalesced accesses]
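The difference between the two patterns, sketched as kernels (illustrative): when consecutive threads touch consecutive addresses, the hardware coalesces the accesses into a few wide transactions; with strided accesses, bandwidth collapses.

__global__ void copyCoalesced(float *dst, const float *src)
{
    // Ideal: thread i reads address i; neighbouring threads hit
    // neighbouring words, which coalesce into wide transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

__global__ void copyStrided(float *dst, const float *src, int stride)
{
    // Reality (when the layout is wrong): neighbouring threads hit
    // addresses far apart, and each access can become its own transaction.
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    dst[i] = src[i];
}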
Conflicting Data Accesses Cause Serialization and Delays
Massively parallel execution cannot afford serialization

Contention in accessing critical data causes serialization
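A classic example of such contention and one way to reduce it (illustrative, using CUDA atomics): if every thread in the grid updates one global counter, the atomics serialize; accumulating per block in shared memory first leaves only one global atomic per block.

__global__ void contended(int *counter)
{
    atomicAdd(counter, 1);               // every thread fights for the same word
}

__global__ void lessContended(int *counter)
{
    __shared__ int partial;
    if (threadIdx.x == 0) partial = 0;
    __syncthreads();

    atomicAdd(&partial, 1);              // contention only within the block
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(counter, partial);     // one global atomic per block
}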
Some Information on the Toolkit
Compilation
Any source file containing CUDA language extensions must be compiled with nvcc

nvcc is a compiler driver
− Works by invoking all the necessary tools and compilers, like cudacc, g++, etc.

nvcc can output:
− Either C code
   • That must then be compiled with the rest of the application using another tool
− Or object code directly
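A minimal end-to-end example (illustrative) of a .cu file and the corresponding nvcc invocation:

// hello.cu, compiled with: nvcc -o hello hello.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello(void)
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<< 2, 4 >>>();
    cudaDeviceSynchronize();             // flush device-side printf before exit
    return 0;
}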
Linking & Profiling
Any executable with CUDA code requires two dynamic libraries:
− The CUDA runtime library (cudart)
− The CUDA core library (cuda)

Several tools are available to optimize your application:
− nVIDIA CUDA Visual Profiler
− nVIDIA Occupancy Calculator
− NVIDIA Parallel Nsight for Visual Studio and Eclipse
Before you start…
Four lines have to be added to your group user's .bash_profile or .bashrc file:

PATH=$PATH:/usr/local/cuda-6.5/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.5/lib64:/lib
export PATH
export LD_LIBRARY_PATH

Code samples are installed with CUDA; copy and build them in your user's home directory.
Some useful resources

NVIDIA CUDA Programming Guide 6.5
http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf

NVIDIA CUDA C Best Practices Guide 6.5
http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf

Tuning CUDA Applications for Maxwell
http://docs.nvidia.com/cuda/pdf/Maxwell_Tuning_Guide.pdf

Parallel Thread Execution ISA – Version 4.1
http://docs.nvidia.com/cuda/pdf/ptx_isa_4.1.pdf

NVIDIA CUDA Getting Started Guide for Linux
http://docs.nvidia.com/cuda/pdf/CUDA_Getting_Started_Linux.pdf
Example: Motion JPEG Encoding
14 different MJPEG encoders on GPU
[Figure: performance of the 14 encoder implementations on an Nvidia GeForce GPU]

Problems:
• Only used global memory
• Too much synchronization between threads
• Host part of the code not optimized
Profiling a Motion JPEG encoder on x86
A small selection of DCT algorithms:
• 2D-Plain: standard forward 2D DCT
• 1D-Plain: two consecutive 1D transformations with a transpose in between and after
• 1D-AAN: optimized version of 1D-Plain
• 2D-Matrix: 2D-Plain implemented with matrix multiplication

Single-threaded application profiled on an Intel Core i5 750
Optimizing for the GPU: use the memory correctly!

Several different types of memory on the GPU:
• Global memory
• Constant memory
• Texture memory
• Shared memory

First commandment when using GPUs:
• Select the correct memory space, AND use it correctly!
How about using a better algorithm?

Used the CUDA Visual Profiler to isolate DCT performance

2D-Plain Optimized is optimized for the GPU:
• Shared memory
• Coalesced memory access
• Loop unrolling
• Branch prevention
• Asynchronous transfers

Second commandment when using GPUs:
• Choose an algorithm suited for the architecture!
Effect of offloading VLC to the GPU
VLC (Variable Length Coding) can also be offloaded:
• One thread per macroblock
• The CPU does the bitstream merge

Even though the algorithm is not perfectly suited for the architecture, the offloading effect is still significant! A sketch of the mapping follows.
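The one-thread-per-macroblock mapping, sketched (the encoder internals are only stubbed; real Huffman coding would replace them, and all names and sizes are illustrative):

#define MAX_MB_BYTES 128                 // assumed per-macroblock output bound

// Illustrative stub; a real implementation emits Huffman codes here.
__device__ int encodeMacroblock(const short *block, unsigned char *out)
{
    out[0] = (unsigned char) block[0];
    return 1;                            // bytes written
}

__global__ void vlcKernel(const short *quantized, unsigned char *bitstreams,
                          int *lengths, int numMacroblocks)
{
    int mb = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per macroblock
    if (mb >= numMacroblocks)
        return;

    // Each thread encodes its own 8x8 block into a private buffer;
    // the CPU merges the per-macroblock bitstreams afterwards.
    lengths[mb] = encodeMacroblock(&quantized[mb * 64],
                                   &bitstreams[mb * MAX_MB_BYTES]);
}

The CPU-side merge then concatenates, for each macroblock, the first lengths[mb] bytes of its private buffer into the final bitstream.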