Getting Started with GPU Computing
Dan Negrut, Assistant Professor
Simulation-Based Engineering Lab, Dept. of Mechanical Engineering, University of Wisconsin-Madison
San Diego, August 30, 2009
Acknowledgement
Colleagues helping to organize the GPU Workshop: Sara McMains, Krishnan Suresh, Roshan D’Souza
Wen-mei W. Hwu
NVIDIA Corporation
My students: Hammad Mazhar, Toby Heyn
2
Acknowledgements: Financial Support [Dan Negrut]
NSF
NVIDIA Corporation
British Aerospace Engineering (BAE), Land Division
Argonne National Lab
3
Overview
Parallel computing: why, and why now? (15 mins)
GPU Programming: The democratization of parallel computing (60 mins)
  NVIDIA's CUDA, a facilitator of GPU computing
  Comments on the execution configuration and execution model
  The memory layout
  Gauging resource utilization
  IDE support

Comments on GPU computing (15 mins)
  Sources of information
  Beyond CUDA
4
Scientific Computing: A Change of Tide...
A paradigm shift taking place in Scientific Computing
Moving from sequential to parallel data processing
Triggered by changes in the microprocessor industry
5
CPU: Three Walls to Serial Performance
Memory Wall
Instruction Level Parallelism (ILP) Wall
Power Wall
Source: excellent article, “The Many-Core Inflection Point for Mass Market Computer Systems”, by John L. Manferdelli, Microsoft Corporation
http://www.ctwatch.org/quarterly/articles/2007/02/the-many-core-inflection-point-for-mass-market-computer-systems/
6
Memory Wall
There is a growing disparity of speed between CPU and memory access outside the CPU chip
S. Cray: “Anyone can build a fast CPU. The trick is to build a fast system”
7
Memory Wall
The processor is often data starved (idle) due to latency and limited communication bandwidth beyond chip boundaries
  From 1986 to 2000, CPU speed improved at an annual rate of 55%, while memory access speed improved by only 10%

Some fixes:
  Strong push for ever-growing caches to improve the average memory reference time to fetch or write data
  Hyper-Threading Technology (HTT)
8
The Power Wall
“Power, and not manufacturing, limits traditional general purpose microarchitecture improvements” (F. Pollack, Intel Fellow)
Leakage power dissipation gets worse as gates get smaller, because gate dielectric thicknesses must proportionately decrease
[Figure: power density (W/cm2) vs. process technology (μm), from older to newer, for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, and Core Duo; the trend heads toward the power density of a nuclear reactor. Adapted from F. Pollack (MICRO'99)]
The Power Wall
Power dissipation in clocked digital devices is proportional to the square of clock frequency, imposing a natural limit on clock rates
Significant increase in clock speed without heroic (and expensive) cooling is not possible. Chips would simply melt.
10
The Power Wall
Clock speed increased by a factor of 4,000 in less than two decades
The ability of manufacturers to dissipate heat is limited though…
Looking back at the last five years, clock rates are pretty much flat

Intel's Sandy Bridge microprocessor architecture (2010) is expected to go up to 4.0 GHz
11
The Bright Spot: Moore's Law

1965 paper: doubling of the number of transistors on integrated circuits every two years
Moore himself wrote only about the density of components (or transistors) at minimum cost
Increase in transistor count serves, to some extent, as a rough measure of computer processing performance

http://news.cnet.com/Images-Moores-Law-turns-40/2009-1041_3-5649019.html
Intel's Vision: Evolutionary Configurable Architecture

[Figure: evolution from dual core (symmetric multithreading) with large, scalar cores for high single-thread performance, to a multi-core array (CMP with ~10 cores), to a many-core array (CMP with 10s-100s of low-power scalar cores, capable of TFLOPS+, full system-on-chip, for servers, workstations, embedded…); scalar plus many core for highly threaded workloads]

Micro2015: Evolving Processor Architecture, Intel Developer Forum, March 2005
CMP = "chip multi-processor"
Presentation: Paul Petersen, Sr. Principal Engineer, Intel
Putting things in perspective…
Slide Source: Berkeley View of Landscape

The way business has been run in the past -> It will probably change to this…
  Rely exclusively on frequency increase -> Parallelism is the primary method of performance improvement
  For the commoner: don't bother parallelizing an application (after all, you get a meager speedup) -> No scientific computing application relies on one-core chips
  Less than linear scaling for a multiprocessor is failure -> Sub-linear speedups are ok as long as you beat the sequential
Some numbers would be good…
15
GPU vs. CPU Flop Rate Comparison (single precision rate for GPU)
16
Seymour Cray: "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"
Key Parameters: GPU vs. CPU

                           GPU - NVIDIA Tesla C1060         CPU - Intel Core i7 975 Extreme
Processing cores           240                              4
Memory                     4 GB                             32 KB L1 cache/core; 256 KB L2 (I&D) cache/core; 8 MB L3 (I&D) shared by all cores
Clock speed                1.33 GHz                         3.20 GHz
Memory bandwidth           102 GB/s                         32.0 GB/s
Floating point ops/s       933 x 10^9 (single precision)    70 x 10^9 (double precision)
17
The GPU Hardware
18
19
GPU: Underlying Hardware
NVIDIA nomenclature used below, reminiscent of the GPU's mission
20
The hardware is organized as follows:
  One Stream Processor Array (SPA)…
  …has a collection of Texture Processor Clusters (TPCs, ten of them on the C1060)…
  …and each TPC has three Stream Multiprocessors (SMs)…
  …and each SM is made up of eight Stream (or Scalar) Processors (SPs)
NVIDIA TESLA C1060
21
240 Scalar Processors
4 GB device memory
Memory Bandwidth: 102 GB/s
Clock Rate: 1.3 GHz
Approx. $1,250
Layout of Typical Hardware Architecture
22
[Figure: the CPU (the host) connected to the GPU with its local DRAM (the device)]
GPGPU Computing

GPGPU computing: "General Purpose" GPU computing

The GPU can be used for more than just graphics: the computational resources are there, and most of the time they are underutilized

The GPU can be used to accelerate the data-parallel parts of an application
23
GPGPU: Pluses and Minuses

Simple architecture optimized for compute-intensive tasks
  Large data arrays, streaming throughput
  Fine-grain SIMD (Single Instruction Multiple Data) parallelism
  Low-latency floating point (FP) computation
  High-precision floating point arithmetic support: 32-bit floating point, IEEE 754

However, the GPU was only programmable through graphics library APIs
24
Dealing with the graphics API:
  Addressing modes: limited texture size/dimension
  Shader capabilities: limited outputs
  Instruction sets: lack of integer & bit ops
  Communication limited between pixels: only gather (can read data from other pixels), but no scatter (can only write to one pixel)
[Figure: the fragment-program model — input registers feed the fragment program, which writes output registers to FB memory; constants, texture, and temp registers are available per thread, per shader, per context]

Summing up: mapping computation problems to the graphics rendering pipeline is tedious…
GPGPU: Pluses and Minuses [Cntd.]
CUDA: Addressing the Minuses in GPGPU

"Compute Unified Device Architecture"

It represents a general purpose programming model
  User kicks off batches of threads on the GPU

Targeted software stack
  Scientific computing oriented drivers, language, and tools

Driver for loading computation programs onto the GPU
  Standalone driver, optimized for computation
  Interface designed for compute: graphics-free API
  Guaranteed maximum download & readback speeds
  Explicit GPU memory management
26
The CUDA Execution Model
GPU Computing – The Basic Idea
The GPU is linked to the CPU by a reasonably fast connection
The idea is to use the GPU as a co-processor
Farm out big parallelizable tasks to the GPU
Keep the CPU busy with the control of the execution and “corner” tasks
28
GPU Computing – The Basic Idea [Cntd.]
You have to copy data onto the GPU and later fetch results back.
For this to pay off, the data transfer should be overshadowed by the number crunching that draws on that data
GPUs also work in asynchronous mode: data transfer for a future task can happen while the GPU processes the current job
29
Some Nomenclature…
The HOST: this is your CPU, executing the "master" thread

The DEVICE: this is the GPU card, connected to the HOST through a PCIe X16 connection

The HOST (the master thread) calls the DEVICE to execute a KERNEL

When calling the KERNEL, the HOST also has to inform the DEVICE how many threads should execute the KERNEL
  This is called "defining the execution configuration"
30
Calling a Kernel Function, Details

A kernel function must be called with an execution configuration:

__global__ void KernelFoo(...);   // declaration

dim3 DimGrid(100, 50);    // 5000 thread blocks
dim3 DimBlock(4, 8, 8);   // 256 threads per block

KernelFoo<<< DimGrid, DimBlock >>>(...arg list here…);
31
Any call to a kernel function is asynchronous: by default, execution on the host doesn't wait for the kernel to finish
Example
The host call below instructs the GPU to execute the function (kernel) "foo" using 25,600 threads
  Two arguments are passed down to each thread executing the kernel "foo"
  In this execution configuration, the host instructs the device that it is supposed to run 100 blocks, each having 256 threads in it
  The concept of a block is important, since it represents the entity that gets executed on an SM
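A minimal sketch consistent with that description; the kernel name foo matches the slide, while the argument values and the kernel body are hypothetical placeholders:

#include <cuda_runtime.h>

// Hypothetical kernel body: each of the 25,600 threads computes its own global index.
__global__ void foo(int arg1, float arg2) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 25,599
    // ... per-thread work using arg1, arg2, and tid would go here ...
}

int main() {
    foo<<<100, 256>>>(42, 3.14f);   // 100 blocks x 256 threads = 25,600 threads
    cudaThreadSynchronize();        // the launch is asynchronous; wait for it to finish
    return 0;
}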
32
30,000 Feet Perspective
33
This is what your C code looks like

This is how the code gets executed on the hardware in heterogeneous computing
34
More on the Execution Model
There is a limitation on the number of blocks in a grid:
  The grid of blocks can be organized as a 2D structure: max of 65535 by 65535 grid of blocks (that is, no more than 4,294,836,225 blocks for a kernel call)

Threads in each block:
  The threads can be organized as a 3D structure (x, y, z)
  The total number of threads in each block cannot be larger than 512
35
Kernel Call Overhead
How much time is burnt by the CPU calling the GPU? Values reported below are averages over 100,000 kernel calls.

No arguments in the kernel call:
  GT 8800 series, CUDA 1.1: 0.115305 milliseconds
  Tesla C1060, CUDA 1.3: 0.088493 milliseconds

Arguments present in the kernel call:
  GT 8800 series, CUDA 1.1: 0.146812 milliseconds
  Tesla C1060, CUDA 1.3: 0.116648 milliseconds
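One way such overheads can be measured is with CUDA events wrapped around repeated launches of an empty kernel; a rough sketch (the kernel and launch configuration below are arbitrary illustrative choices, not the setup used for the numbers above):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void emptyKernel() {}   // does nothing; only the launch overhead is of interest

int main() {
    const int N = 100000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // total time in milliseconds
    printf("average launch overhead: %f ms\n", ms / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}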
36
Languages Supported in CUDA
Note that everything is done in C
  Yet minor extensions are needed to flag the fact that a function actually represents a kernel, that there are functions that will only run on the device, etc.
  Called "C with extensions"

FORTRAN is supported; ongoing project with the Portland Group (PGI)

There is support for C++ programming (operator overloading, for instance)
CUDA Function Declarations(the “C with extensions” part)
                                      Executed on the:   Only callable from the:
__device__ float myDeviceFunc()       device             device
__global__ void  myKernelFunc()       device             host
__host__   float myHostFunc()         host               host
__global__ defines a kernel function; it must return void
For a full list, see CUDA Reference Manual
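A small sketch (all function names hypothetical) showing the three qualifiers working together:

__device__ float square(float x) {                 // runs on the device, callable from device code only
    return x * x;
}

__global__ void squareAll(float *d_out, const float *d_in, int n) {   // kernel: returns void
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = square(d_in[i]);
}

__host__ void launchSquareAll(float *d_out, const float *d_in, int n) {   // runs on the host
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    squareAll<<<blocks, threads>>>(d_out, d_in, n);
}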
38
Block Execution Scheduling Issues
Who's Executing Here? [The Stream Multiprocessor (SM)]

[Figure: one Stream Multiprocessor — instruction fetch/dispatch, instruction L1 and data L1 caches, eight SPs, two SFUs, and shared memory]
The SM represents the quantum of scalability on NVIDIA's architecture
  My laptop: 4 SMs
  The Tesla C1060: 30 SMs

Stream Multiprocessor (SM):
  8 Scalar Processors (SPs)
  2 Special Function Units (SFUs)
  It's where a block lands for execution
  Multi-threaded instruction dispatch: from 1 up to 1024 (!) threads active
  Shared instruction fetch per 32 threads
  16 KB shared memory + 16 KB of registers
  DRAM texture and memory access
Scheduling on the Hardware
The Grid is launched on the SPA

Thread Blocks are serially distributed to all the SMs
  Potentially >1 Thread Block per SM

Each SM launches Warps of Threads

The SM schedules and executes Warps that are ready to run

As Warps and Thread Blocks complete, resources are freed
  The SPA can launch the next Block[s] in line

NOTE: two levels of scheduling:
  For running [desirably] a large number of blocks on a small number of SMs (16/14/etc.)
  For running up to 32 warps of threads on the 8 SPs available on each SM
[Figure: the Host launches Kernel 1 as Grid 1 on the Device (Blocks (0,0) through (2,1)) and then Kernel 2 as Grid 2; one block, Block (1,1), is shown expanded into its threads, Thread (0,0) through Thread (4,2)]
41
SM Executes Blocks
Threads are assigned to SMs at Block granularity
  Up to 8 Blocks per SM (doesn't mean you'll have eight though…)
  One SM can take up to 1024 threads
    This is 32 warps
    Could be 256 (threads/block) * 4 blocks
    Or 128 (threads/block) * 8 blocks, etc.

Threads run concurrently, but time slicing is involved
  The SM assigns/maintains thread IDs
  The SM manages/schedules thread execution

There is NO time slicing for block execution
[Figure: blocks of threads (t0, t1, t2, … tm) resident on SM 0 and SM 1; each SM has its MT issue unit, SPs, and shared memory, and is backed by the texture L1 (TF), L2, and device memory]
Thread Scheduling/Execution
Each Thread Block is divided into 32-thread Warps
  This is an implementation decision, not part of the CUDA programming model

Warps are the basic scheduling units in the SM

If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in the SM?
  Each Block is divided into 256/32 = 8 Warps
  There are 8 * 3 = 24 Warps
  At any point in time, only *one* of the 24 Warps will be selected for instruction fetch and execution
[Figure: warps from Block 1 and Block 2 (threads t0 … t31) queued on the Streaming Multiprocessor — instruction fetch/dispatch, instruction L1 and data L1 caches, eight SPs, two SFUs, and shared memory]

HK-UIUC
SM Warp Scheduling
SM hardware implements zero-overhead Warp scheduling
Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible Warps are selected for execution on a prioritized scheduling policy
All threads in a Warp execute the same instruction when selected
4 clock cycles needed to dispatch the same instruction for all threads in a Warp in G80
Side comment: suppose your code has one global memory access every four instructions. Then a minimum of 13 Warps is needed to fully tolerate a 200-cycle memory latency, as worked out below.
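One way to arrive at that count, assuming (as stated above) that dispatching one instruction for a whole warp takes 4 clock cycles, so each warp covers 4 x 4 = 16 cycles of useful work between memory accesses:

  warps needed = ceil( 200 cycles / (4 instructions x 4 cycles/instruction) ) = ceil(12.5) = 13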
[Figure: the SM multithreaded warp scheduler interleaving, over time, warp 1 instruction 42, warp 3 instruction 35, warp 8 instruction 11, warp 8 instruction 12, warp 3 instruction 36, ...]

HK-UIUC
Review: The CUDA Programming Model
GPU Architecture Paradigm: Single Instruction Multiple Data (SIMD)

What's the overall software (application) development model?
  CUDA: integrated CPU + GPU application C program
    Serial C code executes on the CPU
    Parallel kernel C code executes on GPU thread blocks
CPU Serial Code
  GPU Parallel Kernel (Grid 0): KernelA<<< nBlkA, nTidA >>>(args);
CPU Serial Code
  GPU Parallel Kernel (Grid 1): KernelB<<< nBlkB, nTidB >>>(args);
45
The CPU perspective of the GPU…
The GPU is viewed as a compute device that:
  Is a co-processor to the CPU, or host
  Runs many threads in parallel

Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads

When a kernel is invoked, you have to instruct the GPU how many threads are supposed to run this kernel
  You have to indicate the number of blocks of threads
  You have to indicate how many threads are in each block
46
Caveats [1]
Flop rates for GPUs are reported for single precision operations
  Double precision is supported, but the rule of thumb is that you get about a 4X slowdown relative to single precision

Also, some small deviations from IEEE 754 exist
  Combining a multiplication and an addition into one operation is not compliant
47
Caveats [2]
There is no synchronization between threads that live in different blocks

If all threads need to synchronize, this is accomplished by getting out of the kernel and invoking another one (see the sketch below)
  Average overhead for a kernel launch ≈ 90-110 microseconds (small…)

IMPORTANT: global, constant, and texture memory spaces are persistent across successive kernel calls made by the same application
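A sketch of that global-synchronization pattern, assuming two hypothetical kernels phase1 and phase2 whose work must be globally ordered:

// Hypothetical kernels: all of phase1 must complete before phase2 starts.
__global__ void phase1(float *d_data, int n) { /* ... first pass over d_data ... */ }
__global__ void phase2(float *d_data, int n) { /* ... second pass, depends on phase1 ... */ }

void runBothPhases(float *d_data, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    phase1<<<blocks, threads>>>(d_data, n);
    // Kernels issued to the same (default) stream execute in order, so phase2
    // will not start on the device until every thread of phase1 has finished.
    // Leaving the kernel and launching a new one thus acts as a global barrier.
    phase2<<<blocks, threads>>>(d_data, n);
}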
48
CUDA Memory Spaces
49
The Memory Space
The memory space is the union of:
  Registers
  Shared memory
  Device memory, which can be:
    Global memory
    Constant memory
    Texture memory

Remarks:
  The constant memory is cached
  The texture memory is cached
  The global memory is NOT cached

Memory bandwidth, device memory: 102 GB/s
50
CUDA Runtime Partitioning of the Memory Space
The device memory is split into global, constant, and texture memory

Note the presence of local memory, which is virtual memory
  If too many registers are needed for a computation, the overflow data is stored in local memory
  "Local" means that it's local, or specific, to one thread
  In fact, local memory is part of the global memory
  Long access times for local memory
[Figure: the CUDA memory spaces — the Host reads/writes the (Device) Grid's global, constant, and texture memory; each block has its shared memory; each thread has its registers and local memory]
51
CUDA Device Memory Space
Each thread can:
  At thread level: R/W registers
  At thread level: R/W local memory
  At block level: R/W shared memory
  At grid level: R/W global memory
  At grid level: read-only constant memory
  At grid level: read-only texture memory
The host can R/W global, constant, and texture memories
NOTE: the texture, constant, and global memory are persistent across kernels called by the same application
HK-UIUC
Access Times
Register – dedicated HW - single cycle
Shared Memory – dedicated HW - single cycle
Local Memory – DRAM, no cache - *slow*
Global Memory – DRAM, no cache - *slow*
Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
Instruction Memory (invisible) – DRAM, cached
HK-UIUC
Compute Capabilities, Things Change Fast…
Credit: NVIDIA
Most Common Programming Pattern [interacting with the device memory space]
Sequence of steps most commonly used in GPU computing:
Step 1: Host allocates memory on the device
Step 2: Host copies data into the device
Step 3: Host invokes a kernel that gets executed in parallel and which processes/uses data from the device memory for useful computation
Step 4: Host copies back results from the device
55
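A minimal end-to-end sketch of these four steps; the kernel scale() and the data it processes are hypothetical stand-ins for the useful computation of Step 3:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical kernel: multiply an array by a constant, one element per thread.
__global__ void scale(float *d_data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = NULL;
    cudaMalloc((void**)&d_data, bytes);                          // Step 1: allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // Step 2: copy data onto the device
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);            // Step 3: kernel does the computation
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // Step 4: copy results back

    printf("h_data[0] = %f\n", h_data[0]);   // expect 2.0
    cudaFree(d_data);
    free(h_data);
    return 0;
}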
56
CUDA Device Memory Allocation
57
cudaMalloc()
  Allocates an object in the device Global Memory
  Requires two parameters:
    Address of a pointer to the allocated object
    Size of the allocated object

cudaFree()
  Frees an object from device Global Memory
  Takes the pointer to the freed object
HK-UIUC
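A small sketch of the two calls; the pointer name Md and the dimension WIDTH are illustrative placeholders:

#include <cuda_runtime.h>

int main() {
    const int WIDTH = 64;                          // hypothetical matrix dimension
    size_t size = WIDTH * WIDTH * sizeof(float);

    float *Md = NULL;
    cudaMalloc((void**)&Md, size);   // parameter 1: address of the pointer; parameter 2: size in bytes

    // ... use Md in kernels ...

    cudaFree(Md);                    // release the device global memory
    return 0;
}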
CUDA Host-Device Data Transfer

cudaMemcpy(): memory data transfer
  Requires four parameters:
    Pointer to destination
    Pointer to source
    Number of bytes copied
    Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device

Things happen over a PCIe 2.0 16X connection
  Basically 8 GB/s (each way)
HK-UIUC
CUDA Host-Device Data Transfer (cont.)
Example: transfer "size" bytes
  M is in host memory and Md is in device memory
  cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);
59
CUDA GPU Programming
~ Resource Management Considerations ~
60
What Do I Mean By “Resource Management”?
The GPU is a resourceful device
What do you have to do to make sure you capitalize on these resources?
  In other words, how can you ensure that all the SPs are busy all the time?

To fully exploit the GPU's potential, it matters:
  How many threads you decide to use
  What memory requirements are associated with a thread
  How much shared memory gets allocated/used by one block of threads
61
Resource Management – The Key Actors: Threads, Warps, Blocks
A collection of 32 Threads makes up a Warp
  A Warp is something virtual; it's how the GPU groups the threads together for execution

A Block has at most 512 threads, that is, 16 Warps
  Threads are organized in a 3D fashion; each thread has a unique (Tx, Ty, Tz) thread ID
  Threads in a block get to use the shared memory together

Each Block of threads is executed on a single SM
  If you run an application with 100 blocks of threads and your GPU has 16 SMs (GTX 8800, for instance), chances are each SM will get to execute about 6 or 7 blocks
A kernel is executed as a grid of blocks
  Grid: up to 65535 x 65535 blocks
  Each block has a unique (Bx, By) ID

The threads that belong to the *same* block can cooperate with each other by:
  Synchronizing their execution, for hazard-free shared memory accesses
  Efficiently sharing data through a low-latency shared memory
    Shared memory is allocated per block

Threads from two different blocks cannot cooperate!!!
  This has important software design implications (see the sketch below)
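A minimal sketch of intra-block cooperation through shared memory and __syncthreads(); the kernel below (each block reverses its own slice of an array) is a hypothetical illustration:

// Each block stages BLOCK_SIZE elements in shared memory, synchronizes,
// and writes them back in reverse order. Cooperation never crosses blocks.
#define BLOCK_SIZE 256

__global__ void reversePerBlock(float *d_data) {
    __shared__ float tile[BLOCK_SIZE];

    int base = blockIdx.x * BLOCK_SIZE;
    int t    = threadIdx.x;

    tile[t] = d_data[base + t];                    // each thread loads one element
    __syncthreads();                               // barrier: the whole tile is now in shared memory
    d_data[base + t] = tile[BLOCK_SIZE - 1 - t];   // write back reversed
}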
63
Resource Management – The Key Actors: Threads, Warps, Blocks [Cntd.]
Execution Model, Key Observations [1 of 2]
Each block is executed on *one* Stream Multiprocessor (SM)
There is no time slicing when executing a block of threads
Each block is split into warps of threads executed one at a time by the eight SPs of the SM (time slicing in warp execution is constantly done)
64
Execution Model, Key Observations [2 of 2]
A Stream Multiprocessor can execute multiple blocks concurrently
Shared memory and registers are partitioned among the threads of all concurrent blocks
Decreasing shared memory usage (per block) and register usage (per thread) increases number of blocks that can run concurrently (very desirable)
The shared memory “belongs” to the block, not to the threads (which merely use it…)
The shared memory space resides in the on-chip shared memory and it “spans” (or encompasses) a thread block
65
Some Hard Constraints [1 of 2]
Max number of warps that one SM can service simultaneously: 32 (on the latest generation of GPUs)
Max number of blocks that one SM can process simultaneously: 8 (it’s been like this for a while)
66
Some Hard Constraints [2 of 2]
The number of registers available on each SM is limited: 16 K registers (16,384) on the latest NVIDIA hardware

The amount of shared memory available to each SM is limited: 16 KB today
67
The Concept of Occupancy
Ideally, you want to have 32 warps serviced at the same time by one SM
  This keeps the SM busy and hides latencies associated with memory access

Examples:
  Two blocks with 512 threads each, running together on one SM: 100% occupancy
  Four blocks of 256 threads each, running on one SM: 100% occupancy
  16 blocks with 64 threads each: not good, you can't have more than 8 blocks running on an SM
    Effectively, this scenario gives you 50% occupancy
68
The Concept of Occupancy [Cntd.]
What prevents you from getting high occupancy?
  Many warps means many threads and possibly many blocks

Many blocks ⇒ you can't have too much shared memory allocated to each one of them
  Total amount of shared memory in one SM: 16 KB

Many threads ⇒ you can't have too many registers used by each thread
  Size of the register file in one SM: 16 K registers (16,384)
69
Examples, Occupancy of HW

Example 1: if each of your blocks gets assigned 20 KB of shared memory, the kernel will fail to launch
  Not enough memory on the SM to run a block

Example 2: if each of your blocks uses 5 KB of shared memory, you can have three blocks running on one SM (there will be some shared memory that goes unused)

Example 3: like Example 2 above, but with 512 threads per block and each thread using 16 registers. Will one SM be able to handle 2 blocks?
  Total number of registers ⇒ 512 x 2 x 16 = 16,384; all 16,384 available registers are used ⇒ ok
  Number of warps: 2 blocks x 512 threads = 1024 threads = 32 warps ⇒ ok in CUDA 1.3
  You actually have 100% occupancy, maxed out on registers, and lots of shared memory left
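A small host-side sketch of the bookkeeping behind such examples, using the per-SM limits quoted above (8 blocks, 32 warps, 16 KB shared memory, 16,384 registers); this is a rough estimate with a made-up helper name, not the official calculator:

#include <cstdio>

// Rough occupancy estimate for one SM under the limits quoted in these slides.
float estimateOccupancy(int threadsPerBlock, int regsPerThread, int smemPerBlock) {
    const int maxBlocks = 8, maxWarps = 32, maxRegs = 16384, maxSmem = 16 * 1024;

    int warpsPerBlock = (threadsPerBlock + 31) / 32;
    int byWarps = maxWarps / warpsPerBlock;                          // warp-count limit
    int byRegs  = maxRegs  / (threadsPerBlock * regsPerThread);      // register-file limit
    int bySmem  = smemPerBlock ? maxSmem / smemPerBlock : maxBlocks; // shared-memory limit

    int blocks = maxBlocks;
    if (byWarps < blocks) blocks = byWarps;
    if (byRegs  < blocks) blocks = byRegs;
    if (bySmem  < blocks) blocks = bySmem;

    int activeWarps = blocks * warpsPerBlock;
    return 100.0f * activeWarps / maxWarps;
}

int main() {
    // Example 3 above: 512 threads/block, 16 registers/thread, 5 KB shared memory/block
    printf("occupancy: %.0f%%\n", estimateOccupancy(512, 16, 5 * 1024));   // expect 100%
    return 0;
}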
70
Resource Utilization

There is an "occupancy calculator" that can tell you what percentage of the HW gets utilized by your kernel
  It assumes the form of an Excel spreadsheet
  Requires the following input:
    Threads per block
    Registers per thread
    Shared memory per block
  Google "occupancy calculator cuda" to access it
72
CUDA GPU Code Development
73
Code Development Support
How do I compile?
How do I link?
How do I debug?
How do I profile?
74
The CUDA Way: Extended C

Declaration specifications:
  global, device, shared, local, constant

Keywords: threadIdx, blockIdx

Intrinsics: __syncthreads

Runtime API: for memory, symbol, and execution management

Kernel launch
__device__ float filter[N];

__global__ void convolve(float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);

HK-UIUC
Compiling CUDA

nvcc
  Compile driver
  Invokes cudacc, gcc, cl, etc.

PTX: Parallel Thread eXecution
  Like assembly language

[Figure: NVCC takes the C/C++ CUDA application and emits CPU code plus PTX code; a PTX-to-target compiler then produces target code for G80 … GPU]

Example PTX:
  ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
  mad.f32 $f1, $f5, $f3, $f1;

Courtesy NVIDIA
76
More on the nvcc compiler
File suffix        How the nvcc compiler interprets the file
.cu                CUDA source file, containing host and device code
.cup               Preprocessed CUDA source file, containing host code and device functions
.c                 'C' source file
.cc, .cxx, .cpp    C++ source file
.gpu               GPU intermediate file (device code only)
.ptx               PTX intermediate assembly file (device code only)
.cubin             CUDA device-only binary file
77
Compiling CUDA extended C
http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf
Gauging Memory Use on GPU
Compile with the "-keep" flag and investigate the .cubin file:

architecture {sm_10}
abiversion {1}
modname {cubin}
code {
    name = _Z21MatVecMulKernelShared6Matrix6VectorS0_
    lmem = 0
    smem = 1068
    reg = 8
    bar = 1
    const {
        segname = const
        segnum = 1
        offset = 0
        bytes = 8
        mem {
            0x000000ff 0x0000042c
        }
    }
    bincode {
        0x10004209 0x0023c780 0xa000000d 0x04000780
        0x1000c801 0x0423c780 0x301fce11 0xec300780
        ...

The lmem, smem, and reg fields report the local memory, shared memory, and register usage of the kernel.
79
Debugging Using the Device Emulation Mode
An executable compiled in device emulation mode (nvcc -deviceemu) runs entirely on the host using the CUDA runtime
  No need for a device or the CUDA driver
  Each device thread is emulated with a host thread

In Developer Studio, select the "EmuDebug" or "EmuRelease" build configuration for the project
80
When running in device emulation mode, one can:
  Use host native debug support (breakpoints, variable QuickWatch and edit, etc.)
  Access any device-specific data from host code and vice-versa
  Call any host function from device code (e.g. printf) and vice-versa
  Detect deadlock situations caused by improper usage of __syncthreads
Device Emulation Mode Pitfalls [1/3]
Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results
HK-UIUC
Device Emulation Mode Pitfalls [2/3]
Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
HK-UIUC
Device Emulation Mode Pitfalls [3/3]
Results of floating-point computations will differ slightly because of:
  Different compiler outputs, instruction sets
  Use of extended precision for intermediate results

There are various options to force strict single precision on the host

HK-UIUC
Concluding Remarks
84
GPU Computing in Engineering
Who stands to benefit in the Engineering community?
FEA, Monte Carlo, Molecular Dynamics, Granular Dynamics, Image processing, Agent-based modeling, …
Generally, any application that fits the SIMD paradigm
85
[Figure: sample application speedups reported on the GPU]
  146X - Medical Imaging, U of Utah
  36X  - Molecular Dynamics, U of Illinois, Urbana
  18X  - Video Transcoding, Elemental Tech
  50X  - MATLAB Computing, AccelerEyes
  100X - Astrophysics, RIKEN
  149X - Financial simulation, Oxford
  47X  - Linear Algebra, Universidad Jaime
  20X  - 3D Ultrasound, Techniscan
  130X - Quantum Chemistry, U of Illinois, Urbana
  30X  - Gene Sequencing, U of Maryland
  Overall: 50x - 150x

Credit: NVIDIA Corporation
A Word on HPC beyond GPU
We are witnessing a very momentous transformation
Shift from sequential to parallel computing
The support for parallel computing is very homogeneous in structure
The GPU is not alone in this race to capitalize on parallel computing for scientific apps
87
Parallel Computing, SW Side…
Other options for leveraging parallel computing in scientific applications
Threads (Posix, Windows)
OpenMP
MPI standard (see MPICH implementation)
Intel's Threading Building Blocks (TBB) library

OpenCL standard for heterogeneous computing: AMD and NVIDIA provided implementations, Apple to follow up shortly
88
Parallel Computing, HW Side…
Hardware options for HPC
GPU (NVIDIA)
The “fusion” idea (Intel’s Larrabee, AMD’s Fusion)
Cell Blades
Cluster computing (IBM’s BlueGene/P, Q,…)
Cloud Computing
89
Sources of Information, GPU Computing
Read, in this order:
  NVIDIA CUDA Development Tools 2.3: Getting Started (short doc, July 09)
  NVIDIA CUDA Programming Guide 2.3 (July 09)
  NVIDIA CUDA C Programming Best Practices Guide 2.3 (short doc, July 09)
  NVIDIA CUDA Reference Manual 2.3 (comprehensive, July 09)

Lots of very good examples come with the CUDA SDK distribution
  More than 25 applications ready to compile/run
  Makefiles available, ready for use
  Lots of good code available for reuse + templates for applications

Online material
  NVIDIA website: code available for many application fields
  Libs: thrust (http://code.google.com/p/thrust/), cudpp (http://gpgpu.org/developer/cudpp)
  Course on GPU programming: http://sbel.wisc.edu/Courses/ME964/2008/index.htm
Conclusions
91
In the middle of a shift to parallel computing
Hardware changes at a higher pace

CUDA – a bright spot in a software landscape otherwise pretty bleak

GPU computing is not the silver bullet

The GPU, for the right application, can deliver amazing benefits at small time and financial investments

In general, investing in parallel programming skills is bound to pay off
Thank You.
92
Review, Execution Model
Move data to device, launch kernel, transfer relevant data back to host
Kernel is a C function executed on the device
Each thread executes the kernel; this happens in parallel
93
Review, Key Concepts
Kernel = GPU program, executed by each parallel thread in a block
Block = a 3D collection of threads that can cooperate in using the block's shared memory and can synchronize during execution
Grid = 2D array of blocks of threads that execute a kernel
Device = GPU = set of stream multiprocessors (30 SMs)
Stream Multiprocessor = 8 scalar processors + shared memory + registers

Memory     Location   Cached           Access       Who
Local      Off-chip   No               Read/write   One thread
Shared     On-chip    N/A - resident   Read/write   All threads in a block
Global     Off-chip   No               Read/write   All threads + host
Constant   Off-chip   Yes              Read         All threads + host
Texture    Off-chip   Yes              Read         All threads + host

Off-chip means on-device, i.e., slow access time.
"Parallelism for Everyone" - Parallelism changes the game

A large percentage of people who provide applications are going to have to care about parallelism in order to match the capabilities of their competitors.

[Figure: "Vision of the Future" - performance vs. time, with the Frequency Era giving way to the Multi-core Era; the curves for platform potential and for active vs. passive SD show a growing gap where there used to be a fixed gap; competitive pressures = demand for parallel applications]

Presentation: Paul Petersen, Sr. Principal Engineer, Intel
"SD": Software Development
95
2007