Getting Started with GPU Computing
Dan Negrut, Assistant Professor
Simulation-Based Engineering Lab, Dept. of Mechanical Engineering, University of Wisconsin-Madison
San Diego, August 30, 2009
Acknowledgement
Colleagues helping to organize the GPU Workshop: Sara McMains, Krishnan Suresh, Roshan D’Souza
Wen-mei W. Hwu
NVIDIA Corporation
My students: Hammad Mazhar, Toby Heyn
2
Acknowledgements: Financial Support [Dan Negrut]
NSF
NVIDIA Corporation
British Aerospace Engineering (BAE), Land Division
Argonne National Lab
3
Overview
Parallel computing: why, and why now? (15 mins)
GPU Programming: The democratization of parallel computing (60 mins)
  NVIDIA's CUDA, a facilitator of GPU computing
  Comments on the execution configuration and execution model
  The memory layout
  Gauging resource utilization
  IDE support

Comments on GPU computing (15 mins)
  Sources of information
  Beyond CUDA
4
Scientific Computing: A Change of Tide...
A paradigm shift taking place in Scientific Computing
Moving from sequential to parallel data processing
Triggered by changes in the microprocessor industry
5
CPU: Three Walls to Serial Performance
Memory Wall
Instruction Level Parallelism (ILP) Wall
Power Wall
Source: excellent article, “The Many-Core Inflection Point for Mass Market Computer Systems”, by John L. Manferdelli, Microsoft Corporation
http://www.ctwatch.org/quarterly/articles/2007/02/the-many-core-inflection-point-for-mass-market-computer-systems/
6
Memory Wall
There is a growing disparity of speed between CPU and memory access outside the CPU chip
S. Cray: “Anyone can build a fast CPU. The trick is to build a fast system”
7
Memory Wall
The processor is often data starved (idle) due to latency and limited communication bandwidth beyond chip boundaries
  From 1986 to 2000, CPU speed improved at an annual rate of 55%, while memory access speed improved by only 10%

Some fixes:
  Strong push for ever-growing caches to improve the average memory reference time to fetch or write data
  Hyper-Threading Technology (HTT)
8
The Power Wall
“Power, and not manufacturing, limits traditional general purpose microarchitecture improvements” (F. Pollack, Intel Fellow)
Leakage power dissipation gets worse as gates get smaller, because gate dielectric thicknesses must proportionately decrease
[Figure: power density (W/cm2) vs. process technology (μm), from older to newer, for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, and Core Duo; the trend heads toward the power density of a nuclear reactor. Adapted from F. Pollack (MICRO'99)]
The Power Wall
Power dissipation in clocked digital devices is proportional to the square of clock frequency, imposing a natural limit on clock rates
Significant increase in clock speed without heroic (and expensive) cooling is not possible. Chips would simply melt.
10
The Power Wall
Clock speed increased by a factor of 4,000 in less than two decades
The ability of manufacturers to dissipate heat is limited though…
Looking back at the last five years, clock rates are pretty much flat

Intel's Sandy Bridge microprocessor architecture (2010) is expected to go up to 4.0 GHz
11
The Bright Spot: Moore's Law

1965 paper: doubling of the number of transistors on integrated circuits every two years
Moore himself wrote only about the density of components (or transistors) at minimum cost
Increase in transistor count serves, to some extent, as a rough measure of computer processing performance

http://news.cnet.com/Images-Moores-Law-turns-40/2009-1041_3-5649019.html
Intel's Vision: Evolutionary Configurable Architecture

[Figure: evolution from dual core (symmetric multithreading) with large, scalar cores for high single-thread performance, to a multi-core array (CMP with ~10 cores), to a many-core array (CMP with 10s-100s of low-power scalar cores, capable of TFLOPS+, full system-on-chip, for servers, workstations, embedded…); scalar plus many core for highly threaded workloads]

Micro2015: Evolving Processor Architecture, Intel Developer Forum, March 2005
CMP = "chip multi-processor"
Presentation: Paul Petersen, Sr. Principal Engineer, Intel
Putting things in perspective…
Slide Source: Berkeley View of Landscape

The way business has been run in the past -> It will probably change to this…
  Rely exclusively on frequency increase -> Parallelism is the primary method of performance improvement
  For the commoner: don't bother parallelizing an application (after all, you get a meager speedup) -> No scientific computing application relies on one-core chips
  Less than linear scaling for a multiprocessor is failure -> Sub-linear speedups are ok as long as you beat the sequential
Some numbers would be good…
15
GPU vs. CPU Flop Rate Comparison (single precision rate for GPU)
16
Seymour Cray: "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"
Key Parameters: GPU vs. CPU

                           GPU - NVIDIA Tesla C1060         CPU - Intel Core i7 975 Extreme
Processing cores           240                              4
Memory                     4 GB                             32 KB L1 cache/core; 256 KB L2 (I&D) cache/core; 8 MB L3 (I&D) shared by all cores
Clock speed                1.33 GHz                         3.20 GHz
Memory bandwidth           102 GB/s                         32.0 GB/s
Floating point ops/s       933 x 10^9 (single precision)    70 x 10^9 (double precision)
17
The GPU Hardware
18
19
GPU: Underlying Hardware
NVIDIA nomenclature used below, reminiscent of the GPU's mission
20
The hardware is organized as follows:
  One Stream Processor Array (SPA)…
  …has a collection of Texture Processor Clusters (TPCs, ten of them on the C1060)…
  …and each TPC has three Stream Multiprocessors (SMs)…
  …and each SM is made up of eight Stream (or Scalar) Processors (SPs)
NVIDIA TESLA C1060
21
240 Scalar Processors
4 GB device memory
Memory Bandwidth: 102 GB/s
Clock Rate: 1.3 GHz
Approx. $1,250
Layout of Typical Hardware Architecture
22
[Figure: the CPU (the host) connected to the GPU with its local DRAM (the device)]
GPGPU Computing

GPGPU computing: "General Purpose" GPU computing

The GPU can be used for more than just graphics: the computational resources are there, and most of the time they are underutilized

The GPU can be used to accelerate the data-parallel parts of an application
23
GPGPU: Pluses and Minuses

Simple architecture optimized for compute-intensive tasks
  Large data arrays, streaming throughput
  Fine-grain SIMD (Single Instruction Multiple Data) parallelism
  Low-latency floating point (FP) computation
  High-precision floating point arithmetic support: 32-bit floating point, IEEE 754

However, the GPU was only programmable through graphics library APIs
24
Dealing with the graphics API:
  Addressing modes: limited texture size/dimension
  Shader capabilities: limited outputs
  Instruction sets: lack of integer & bit ops
  Communication limited between pixels: only gather (can read data from other pixels), but no scatter (can only write to one pixel)
[Figure: the fragment-program model — input registers feed the fragment program, which writes output registers to FB memory; constants, texture, and temp registers are available per thread, per shader, per context]

Summing up: mapping computation problems to the graphics rendering pipeline is tedious…
GPGPU: Pluses and Minuses [Cntd.]
CUDA: Addressing the Minuses in GPGPU

"Compute Unified Device Architecture"

It represents a general purpose programming model
  User kicks off batches of threads on the GPU

Targeted software stack
  Scientific computing oriented drivers, language, and tools

Driver for loading computation programs onto the GPU
  Standalone driver, optimized for computation
  Interface designed for compute: graphics-free API
  Guaranteed maximum download & readback speeds
  Explicit GPU memory management
26
The CUDA Execution Model
GPU Computing – The Basic Idea
The GPU is linked to the CPU by a reasonably fast connection
The idea is to use the GPU as a co-processor
Farm out big parallelizable tasks to the GPU
Keep the CPU busy with the control of the execution and “corner” tasks
28
GPU Computing – The Basic Idea [Cntd.]
You have to copy data onto the GPU and later fetch results back.
For this to pay off, the data transfer should be overshadowed by the number crunching that draws on that data
GPUs also work in asynchronous mode: data transfer for a future task can happen while the GPU processes the current job
29
Some Nomenclature…
The HOST: this is your CPU, executing the "master" thread

The DEVICE: this is the GPU card, connected to the HOST through a PCIe X16 connection

The HOST (the master thread) calls the DEVICE to execute a KERNEL

When calling the KERNEL, the HOST also has to inform the DEVICE how many threads should execute the KERNEL
  This is called "defining the execution configuration"
30
Calling a Kernel Function, Details

A kernel function must be called with an execution configuration:

__global__ void KernelFoo(...);   // declaration

dim3 DimGrid(100, 50);    // 5000 thread blocks
dim3 DimBlock(4, 8, 8);   // 256 threads per block

KernelFoo<<< DimGrid, DimBlock >>>(...arg list here…);
31
Any call to a kernel function is asynchronous: by default, execution on the host doesn't wait for the kernel to finish
Example
The host call below instructs the GPU to execute the function (kernel) "foo" using 25,600 threads
  Two arguments are passed down to each thread executing the kernel "foo"
  In this execution configuration, the host instructs the device that it is supposed to run 100 blocks, each having 256 threads in it
  The concept of a block is important, since it represents the entity that gets executed on an SM
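A minimal sketch consistent with that description; the kernel name foo matches the slide, while the argument values and the kernel body are hypothetical placeholders:

#include <cuda_runtime.h>

// Hypothetical kernel body: each of the 25,600 threads computes its own global index.
__global__ void foo(int arg1, float arg2) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 25,599
    // ... per-thread work using arg1, arg2, and tid would go here ...
}

int main() {
    foo<<<100, 256>>>(42, 3.14f);   // 100 blocks x 256 threads = 25,600 threads
    cudaThreadSynchronize();        // the launch is asynchronous; wait for it to finish
    return 0;
}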
32
30,000 Feet Perspective
33
This is what your C code looks like

This is how the code gets executed on the hardware in heterogeneous computing
34
More on the Execution Model
There is a limitation on the number of blocks in a grid:
  The grid of blocks can be organized as a 2D structure: max of 65535 by 65535 grid of blocks (that is, no more than 4,294,836,225 blocks for a kernel call)

Threads in each block:
  The threads can be organized as a 3D structure (x, y, z)
  The total number of threads in each block cannot be larger than 512
35
Kernel Call Overhead
How much time is burnt by the CPU calling the GPU? Values reported below are averages over 100,000 kernel calls.

No arguments in the kernel call:
  GT 8800 series, CUDA 1.1: 0.115305 milliseconds
  Tesla C1060, CUDA 1.3: 0.088493 milliseconds

Arguments present in the kernel call:
  GT 8800 series, CUDA 1.1: 0.146812 milliseconds
  Tesla C1060, CUDA 1.3: 0.116648 milliseconds
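One way such overheads can be measured is with CUDA events wrapped around repeated launches of an empty kernel; a rough sketch (the kernel and launch configuration below are arbitrary illustrative choices, not the setup used for the numbers above):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void emptyKernel() {}   // does nothing; only the launch overhead is of interest

int main() {
    const int N = 100000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // total time in milliseconds
    printf("average launch overhead: %f ms\n", ms / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}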
36
Languages Supported in CUDA
Note that everything is done in C
  Yet minor extensions are needed to flag the fact that a function actually represents a kernel, that there are functions that will only run on the device, etc.
  Called "C with extensions"

FORTRAN is supported; ongoing project with the Portland Group (PGI)

There is support for C++ programming (operator overloading, for instance)
CUDA Function Declarations(the “C with extensions” part)
                                      Executed on the:   Only callable from the:
__device__ float myDeviceFunc()       device             device
__global__ void  myKernelFunc()       device             host
__host__   float myHostFunc()         host               host
__global__ defines a kernel function; it must return void
For a full list, see CUDA Reference Manual
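A small sketch (all function names hypothetical) showing the three qualifiers working together:

__device__ float square(float x) {                 // runs on the device, callable from device code only
    return x * x;
}

__global__ void squareAll(float *d_out, const float *d_in, int n) {   // kernel: returns void
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = square(d_in[i]);
}

__host__ void launchSquareAll(float *d_out, const float *d_in, int n) {   // runs on the host
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    squareAll<<<blocks, threads>>>(d_out, d_in, n);
}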
38
Block Execution Scheduling Issues
Who's Executing Here? [The Stream Multiprocessor (SM)]

[Figure: one Stream Multiprocessor — instruction fetch/dispatch, instruction L1 and data L1 caches, eight SPs, two SFUs, and shared memory]
The SM represents the quantum of scalability on NVIDIA's architecture
  My laptop: 4 SMs
  The Tesla C1060: 30 SMs

Stream Multiprocessor (SM):
  8 Scalar Processors (SPs)
  2 Special Function Units (SFUs)
  It's where a block lands for execution
  Multi-threaded instruction dispatch: from 1 up to 1024 (!) threads active
  Shared instruction fetch per 32 threads
  16 KB shared memory + 16 KB of registers
  DRAM texture and memory access
Scheduling on the Hardware
The Grid is launched on the SPA

Thread Blocks are serially distributed to all the SMs
  Potentially >1 Thread Block per SM

Each SM launches Warps of Threads

The SM schedules and executes Warps that are ready to run

As Warps and Thread Blocks complete, resources are freed
  The SPA can launch the next Block[s] in line

NOTE: two levels of scheduling:
  For running [desirably] a large number of blocks on a small number of SMs (16/14/etc.)
  For running up to 32 warps of threads on the 8 SPs available on each SM
[Figure: the Host launches Kernel 1 as Grid 1 on the Device (Blocks (0,0) through (2,1)) and then Kernel 2 as Grid 2; one block, Block (1,1), is shown expanded into its threads, Thread (0,0) through Thread (4,2)]
41
SM Executes Blocks
Threads are assigned to SMs at Block granularity
  Up to 8 Blocks per SM (doesn't mean you'll have eight though…)
  One SM can take up to 1024 threads
    This is 32 warps
    Could be 256 (threads/block) * 4 blocks
    Or 128 (threads/block) * 8 blocks, etc.

Threads run concurrently, but time slicing is involved
  The SM assigns/maintains thread IDs
  The SM manages/schedules thread execution

There is NO time slicing for block execution
[Figure: blocks of threads (t0, t1, t2, … tm) resident on SM 0 and SM 1; each SM has its MT issue unit, SPs, and shared memory, and is backed by the texture L1 (TF), L2, and device memory]
Thread Scheduling/Execution
Each Thread Block is divided into 32-thread Warps
  This is an implementation decision, not part of the CUDA programming model

Warps are the basic scheduling units in the SM

If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in the SM?
  Each Block is divided into 256/32 = 8 Warps
  There are 8 * 3 = 24 Warps
  At any point in time, only *one* of the 24 Warps will be selected for instruction fetch and execution
[Figure: warps from Block 1 and Block 2 (threads t0 … t31) queued on the Streaming Multiprocessor — instruction fetch/dispatch, instruction L1 and data L1 caches, eight SPs, two SFUs, and shared memory]

HK-UIUC
SM Warp Scheduling
SM hardware implements zero-overhead Warp scheduling
Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible Warps are selected for execution on a prioritized scheduling policy
All threads in a Warp execute the same instruction when selected
4 clock cycles needed to dispatch the same instruction for all threads in a Warp in G80
Side comment: suppose your code has one global memory access every four instructions. Then a minimum of 13 Warps is needed to fully tolerate a 200-cycle memory latency, as worked out below.
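One way to arrive at that count, assuming (as stated above) that dispatching one instruction for a whole warp takes 4 clock cycles, so each warp covers 4 x 4 = 16 cycles of useful work between memory accesses:

  warps needed = ceil( 200 cycles / (4 instructions x 4 cycles/instruction) ) = ceil(12.5) = 13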
[Figure: the SM multithreaded warp scheduler interleaving, over time, warp 1 instruction 42, warp 3 instruction 35, warp 8 instruction 11, warp 8 instruction 12, warp 3 instruction 36, ...]

HK-UIUC
Review: The CUDA Programming Model
GPU Architecture Paradigm: Single Instruction Multiple Data (SIMD)

What's the overall software (application) development model?
  CUDA: integrated CPU + GPU application C program
    Serial C code executes on the CPU
    Parallel kernel C code executes on GPU thread blocks
CPU Serial Code
  GPU Parallel Kernel (Grid 0): KernelA<<< nBlkA, nTidA >>>(args);
CPU Serial Code
  GPU Parallel Kernel (Grid 1): KernelB<<< nBlkB, nTidB >>>(args);
45
The CPU perspective of the GPU…
The GPU is viewed as a compute device that:
  Is a co-processor to the CPU, or host
  Runs many threads in parallel

Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads

When a kernel is invoked, you have to instruct the GPU how many threads are supposed to run this kernel
  You have to indicate the number of blocks of threads
  You have to indicate how many threads are in each block
46
Caveats [1]
Flop rates for GPUs are reported for single precision operations
  Double precision is supported, but the rule of thumb is that you get about a 4X slowdown relative to single precision

Also, some small deviations from IEEE 754 exist
  Combining a multiplication and an addition into one operation is not compliant
47
Caveats [2]
There is no synchronization between threads that live in different blocks

If all threads need to synchronize, this is accomplished by getting out of the kernel and invoking another one (see the sketch below)
  Average overhead for a kernel launch ≈ 90-110 microseconds (small…)

IMPORTANT: global, constant, and texture memory spaces are persistent across successive kernel calls made by the same application
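A sketch of that global-synchronization pattern, assuming two hypothetical kernels phase1 and phase2 whose work must be globally ordered:

// Hypothetical kernels: all of phase1 must complete before phase2 starts.
__global__ void phase1(float *d_data, int n) { /* ... first pass over d_data ... */ }
__global__ void phase2(float *d_data, int n) { /* ... second pass, depends on phase1 ... */ }

void runBothPhases(float *d_data, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    phase1<<<blocks, threads>>>(d_data, n);
    // Kernels issued to the same (default) stream execute in order, so phase2
    // will not start on the device until every thread of phase1 has finished.
    // Leaving the kernel and launching a new one thus acts as a global barrier.
    phase2<<<blocks, threads>>>(d_data, n);
}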
48
CUDA Memory Spaces
49
The Memory Space
The memory space is the union of:
  Registers
  Shared memory
  Device memory, which can be:
    Global memory
    Constant memory
    Texture memory

Remarks:
  The constant memory is cached
  The texture memory is cached
  The global memory is NOT cached

Memory bandwidth, device memory: 102 GB/s
50
CUDA Runtime Partitioning of the Memory Space
The device memory is split into global, constant, and texture memory

Note the presence of local memory, which is virtual memory
  If too many registers are needed for a computation, the overflow data is stored in local memory
  "Local" means that it's local, or specific, to one thread
  In fact, local memory is part of the global memory
  Long access times for local memory
[Figure: the CUDA memory spaces — the Host reads/writes the (Device) Grid's global, constant, and texture memory; each block has its shared memory; each thread has its registers and local memory]
51
CUDA Device Memory Space
Each thread can:
  At thread level: R/W registers
  At thread level: R/W local memory
  At block level: R/W shared memory
  At grid level: R/W global memory
  At grid level: read-only constant memory
  At grid level: read-only texture memory
The host can R/W global, constant, and texture memories
NOTE: the texture, constant, and global memory are persistent across kernels called by the same application
HK-UIUC
Access Times
Register – dedicated HW - single cycle
Shared Memory – dedicated HW - single cycle
Local Memory – DRAM, no cache - *slow*
Global Memory – DRAM, no cache - *slow*
Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
Instruction Memory (invisible) – DRAM, cached
HK-UIUC
Compute Capabilities, Things Change Fast…
Credit: NVIDIA
Most Common Programming Pattern [interacting with the device memory space]
Sequence of steps most commonly used in GPU computing:
Step 1: Host allocates memory on the device
Step 2: Host copies data into the device
Step 3: Host invokes a kernel that gets executed in parallel and which processes/uses data from the device memory for useful computation
Step 4: Host copies back results from the device
55
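A minimal end-to-end sketch of these four steps; the kernel scale() and the data it processes are hypothetical stand-ins for the useful computation of Step 3:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical kernel: multiply an array by a constant, one element per thread.
__global__ void scale(float *d_data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = NULL;
    cudaMalloc((void**)&d_data, bytes);                          // Step 1: allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // Step 2: copy data onto the device
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);            // Step 3: kernel does the computation
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // Step 4: copy results back

    printf("h_data[0] = %f\n", h_data[0]);   // expect 2.0
    cudaFree(d_data);
    free(h_data);
    return 0;
}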
56
CUDA Device Memory Allocation
57
cudaMalloc()
  Allocates an object in the device Global Memory
  Requires two parameters:
    Address of a pointer to the allocated object
    Size of the allocated object

cudaFree()
  Frees an object from device Global Memory
  Takes the pointer to the freed object
HK-UIUC
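A small sketch of the two calls; the pointer name Md and the dimension WIDTH are illustrative placeholders:

#include <cuda_runtime.h>

int main() {
    const int WIDTH = 64;                          // hypothetical matrix dimension
    size_t size = WIDTH * WIDTH * sizeof(float);

    float *Md = NULL;
    cudaMalloc((void**)&Md, size);   // parameter 1: address of the pointer; parameter 2: size in bytes

    // ... use Md in kernels ...

    cudaFree(Md);                    // release the device global memory
    return 0;
}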
CUDA Host-Device Data Transfer

cudaMemcpy(): memory data transfer
  Requires four parameters:
    Pointer to destination
    Pointer to source
    Number of bytes copied
    Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device

Things happen over a PCIe 2.0 16X connection
  Basically 8 GB/s (each way)
HK-UIUC
CUDA Host-Device Data Transfer (cont.)
Example: transfer "size" bytes
  M is in host memory and Md is in device memory
  cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);
59
CUDA GPU Programming
~ Resource Management Considerations ~
60
What Do I Mean By “Resource Management”?
The GPU is a resourceful device
What do you have to do to make sure you capitalize on these resources?
  In other words, how can you ensure that all the SPs are busy all the time?

To fully exploit the GPU's potential, it matters:
  How many threads you decide to use
  What memory requirements are associated with a thread
  How much shared memory gets allocated/used by one block of threads
61
Resource Management – The Key Actors: Threads, Warps, Blocks
A collection of 32 Threads makes up a Warp
  A Warp is something virtual; it's how the GPU groups the threads together for execution

A Block has at most 512 threads, that is, 16 Warps
  Threads are organized in a 3D fashion; each thread has a unique (Tx, Ty, Tz) thread ID
  Threads in a block get to use the shared memory together

Each Block of threads is executed on a single SM
  If you run an application with 100 blocks of threads and your GPU has 16 SMs (GTX 8800, for instance), chances are each SM will get to execute about 6 or 7 blocks
A kernel is executed as a grid of blocks
  Grid: up to 65535 x 65535 blocks
  Each block has a unique (Bx, By) ID

The threads that belong to the *same* block can cooperate with each other by:
  Synchronizing their execution, for hazard-free shared memory accesses
  Efficiently sharing data through a low-latency shared memory
    Shared memory is allocated per block

Threads from two different blocks cannot cooperate!!!
  This has important software design implications (see the sketch below)
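A minimal sketch of intra-block cooperation through shared memory and __syncthreads(); the kernel below (each block reverses its own slice of an array) is a hypothetical illustration:

// Each block stages BLOCK_SIZE elements in shared memory, synchronizes,
// and writes them back in reverse order. Cooperation never crosses blocks.
#define BLOCK_SIZE 256

__global__ void reversePerBlock(float *d_data) {
    __shared__ float tile[BLOCK_SIZE];

    int base = blockIdx.x * BLOCK_SIZE;
    int t    = threadIdx.x;

    tile[t] = d_data[base + t];                    // each thread loads one element
    __syncthreads();                               // barrier: the whole tile is now in shared memory
    d_data[base + t] = tile[BLOCK_SIZE - 1 - t];   // write back reversed
}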
63
Resource Management – The Key Actors: Threads, Warps, Blocks [Cntd.]
Execution Model, Key Observations [1 of 2]
Each block is executed on *one* Stream Multiprocessor (SM)
There is no time slicing when executing a block of threads
Each block is split into warps of threads executed one at a time by the eight SPs of the SM (time slicing in warp execution is constantly done)
64
Execution Model, Key Observations [2 of 2]
A Stream Multiprocessor can execute multiple blocks concurrently
Shared memory and registers are partitioned among the threads of all concurrent blocks
Decreasing shared memory usage (per block) and register usage (per thread) increases number of blocks that can run concurrently (very desirable)
The shared memory “belongs” to the block, not to the threads (which merely use it…)
The shared memory space resides in the on-chip shared memory and it “spans” (or encompasses) a thread block
65
Some Hard Constraints [1 of 2]
Max number of warps that one SM can service simultaneously: 32 (on the latest generation of GPUs)
Max number of blocks that one SM can process simultaneously: 8 (it’s been like this for a while)
66
Some Hard Constraints [2 of 2]
The number of registers available on each SM is limited: 16 K registers (16,384) on the latest NVIDIA hardware

The amount of shared memory available to each SM is limited: 16 KB today
67
The Concept of Occupancy
Ideally, you want to have 32 warps serviced at the same time by one SM
  This keeps the SM busy and hides latencies associated with memory access

Examples:
  Two blocks with 512 threads each, running together on one SM: 100% occupancy
  Four blocks of 256 threads each, running on one SM: 100% occupancy
  16 blocks with 64 threads each: not good, you can't have more than 8 blocks running on an SM
    Effectively, this scenario gives you 50% occupancy
68
The Concept of Occupancy [Cntd.]
What prevents you from getting high occupancy?
  Many warps means many threads and possibly many blocks

Many blocks ⇒ you can't have too much shared memory allocated to each one of them
  Total amount of shared memory in one SM: 16 KB

Many threads ⇒ you can't have too many registers used by each thread
  Size of the register file in one SM: 16 K registers (16,384)
69
Examples, Occupancy of HW

Example 1: if each of your blocks gets assigned 20 KB of shared memory, the kernel will fail to launch
  Not enough memory on the SM to run a block

Example 2: if each of your blocks uses 5 KB of shared memory, you can have three blocks running on one SM (there will be some shared memory that goes unused)

Example 3: like Example 2 above, but with 512 threads per block and each thread using 16 registers. Will one SM be able to handle 2 blocks?
  Total number of registers ⇒ 512 x 2 x 16 = 16,384; all 16,384 available registers are used ⇒ ok
  Number of warps: 2 blocks x 512 threads = 1024 threads = 32 warps ⇒ ok in CUDA 1.3
  You actually have 100% occupancy, maxed out on registers, and lots of shared memory left
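A small host-side sketch of the bookkeeping behind such examples, using the per-SM limits quoted above (8 blocks, 32 warps, 16 KB shared memory, 16,384 registers); this is a rough estimate with a made-up helper name, not the official calculator:

#include <cstdio>

// Rough occupancy estimate for one SM under the limits quoted in these slides.
float estimateOccupancy(int threadsPerBlock, int regsPerThread, int smemPerBlock) {
    const int maxBlocks = 8, maxWarps = 32, maxRegs = 16384, maxSmem = 16 * 1024;

    int warpsPerBlock = (threadsPerBlock + 31) / 32;
    int byWarps = maxWarps / warpsPerBlock;                          // warp-count limit
    int byRegs  = maxRegs  / (threadsPerBlock * regsPerThread);      // register-file limit
    int bySmem  = smemPerBlock ? maxSmem / smemPerBlock : maxBlocks; // shared-memory limit

    int blocks = maxBlocks;
    if (byWarps < blocks) blocks = byWarps;
    if (byRegs  < blocks) blocks = byRegs;
    if (bySmem  < blocks) blocks = bySmem;

    int activeWarps = blocks * warpsPerBlock;
    return 100.0f * activeWarps / maxWarps;
}

int main() {
    // Example 3 above: 512 threads/block, 16 registers/thread, 5 KB shared memory/block
    printf("occupancy: %.0f%%\n", estimateOccupancy(512, 16, 5 * 1024));   // expect 100%
    return 0;
}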
70
Resource Utilization

There is an "occupancy calculator" that can tell you what percentage of the HW gets utilized by your kernel
  It assumes the form of an Excel spreadsheet
  Requires the following input:
    Threads per block
    Registers per thread
    Shared memory per block
  Google "occupancy calculator cuda" to access it
72
CUDA GPU Code Development
73
Code Development Support
How do I compile?
How do I link?
How do I debug?
How do I profile?
74
The CUDA Way: Extended C

Declaration specifications:
  global, device, shared, local, constant

Keywords: threadIdx, blockIdx

Intrinsics: __syncthreads

Runtime API: for memory, symbol, and execution management

Kernel launch
__device__ float filter[N];

__global__ void convolve(float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);

HK-UIUC
Compiling CUDA

nvcc
  Compile driver
  Invokes cudacc, gcc, cl, etc.

PTX: Parallel Thread eXecution
  Like assembly language

[Figure: NVCC takes the C/C++ CUDA application and emits CPU code plus PTX code; a PTX-to-target compiler then produces target code for G80 … GPU]

Example PTX:
  ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
  mad.f32 $f1, $f5, $f3, $f1;

Courtesy NVIDIA
76
More on the nvcc compiler
File suffix        How the nvcc compiler interprets the file
.cu                CUDA source file, containing host and device code
.cup               Preprocessed CUDA source file, containing host code and device functions
.c                 'C' source file
.cc, .cxx, .cpp    C++ source file
.gpu               GPU intermediate file (device code only)
.ptx               PTX intermediate assembly file (device code only)
.cubin             CUDA device-only binary file
77
Compiling CUDA extended C
http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf
Gauging Memory Use on GPU
Compile with the "-keep" flag and investigate the .cubin file:

architecture {sm_10}
abiversion {1}
modname {cubin}
code {
    name = _Z21MatVecMulKernelShared6Matrix6VectorS0_
    lmem = 0
    smem = 1068
    reg = 8
    bar = 1
    const {
        segname = const
        segnum = 1
        offset = 0
        bytes = 8
        mem {
            0x000000ff 0x0000042c
        }
    }
    bincode {
        0x10004209 0x0023c780 0xa000000d 0x04000780
        0x1000c801 0x0423c780 0x301fce11 0xec300780
        ...

The lmem, smem, and reg fields report the local memory, shared memory, and register usage of the kernel.
79
Debugging Using the Device Emulation Mode
An executable compiled in device emulation mode (nvcc -deviceemu) runs entirely on the host using the CUDA runtime
  No need for a device or the CUDA driver
  Each device thread is emulated with a host thread

In Developer Studio, select the "EmuDebug" or "EmuRelease" build configuration for the project
80
When running in device emulation mode, one can:
  Use host native debug support (breakpoints, variable QuickWatch and edit, etc.)
  Access any device-specific data from host code and vice-versa
  Call any host function from device code (e.g. printf) and vice-versa
  Detect deadlock situations caused by improper usage of __syncthreads
Device Emulation Mode Pitfalls [1/3]
Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results
HK-UIUC
Device Emulation Mode Pitfalls [2/3]
Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
HK-UIUC
Device Emulation Mode Pitfalls [3/3]
Results of floating-point computations will differ slightly because of:
  Different compiler outputs, instruction sets
  Use of extended precision for intermediate results

There are various options to force strict single precision on the host

HK-UIUC
Concluding Remarks
84
GPU Computing in Engineering
Who stands to benefit in the Engineering community?
FEA, Monte Carlo, Molecular Dynamics, Granular Dynamics, Image processing, Agent-based modeling, …
Generally, any application that fits the SIMD paradigm
85
[Figure: sample application speedups reported on the GPU]
  146X - Medical Imaging, U of Utah
  36X  - Molecular Dynamics, U of Illinois, Urbana
  18X  - Video Transcoding, Elemental Tech
  50X  - MATLAB Computing, AccelerEyes
  100X - Astrophysics, RIKEN
  149X - Financial simulation, Oxford
  47X  - Linear Algebra, Universidad Jaime
  20X  - 3D Ultrasound, Techniscan
  130X - Quantum Chemistry, U of Illinois, Urbana
  30X  - Gene Sequencing, U of Maryland
  Overall: 50x - 150x

Credit: NVIDIA Corporation
A Word on HPC beyond GPU
We are witnessing a very momentous transformation
Shift from sequential to parallel computing
The support for parallel computing is very homogeneous in structure
The GPU is not alone in this race to capitalize on parallel computing for scientific apps
87
Parallel Computing, SW Side…
Other options for leveraging parallel computing in scientific applications
Threads (Posix, Windows)
OpenMP
MPI standard (see MPICH implementation)
Intel's Threading Building Blocks (TBB) library

OpenCL standard for heterogeneous computing: AMD and NVIDIA provided implementations, Apple to follow up shortly
88
Parallel Computing, HW Side…
Hardware options for HPC
GPU (NVIDIA)
The “fusion” idea (Intel’s Larrabee, AMD’s Fusion)
Cell Blades
Cluster computing (IBM’s BlueGene/P, Q,…)
Cloud Computing
89
Sources of Information, GPU Computing
Read, in this order:
  NVIDIA CUDA Development Tools 2.3: Getting Started (short doc, July 09)
  NVIDIA CUDA Programming Guide 2.3 (July 09)
  NVIDIA CUDA C Programming Best Practices Guide 2.3 (short doc, July 09)
  NVIDIA CUDA Reference Manual 2.3 (comprehensive, July 09)

Lots of very good examples come with the CUDA SDK distribution
  More than 25 applications ready to compile/run
  Makefiles available, ready for use
  Lots of good code available for reuse + templates for applications

Online material
  NVIDIA website: code available for many application fields
  Libs: thrust (http://code.google.com/p/thrust/), cudpp (http://gpgpu.org/developer/cudpp)
  Course on GPU programming: http://sbel.wisc.edu/Courses/ME964/2008/index.htm
Conclusions
91
In the middle of a shift to parallel computing
Hardware changes at a higher pace

CUDA – a bright spot in a software landscape otherwise pretty bleak

GPU computing is not the silver bullet

The GPU, for the right application, can deliver amazing benefits at small time and financial investments

In general, investing in parallel programming skills is bound to pay off
Thank You.
92
Review, Execution Model
Move data to device, launch kernel, transfer relevant data back to host
Kernel is a C function executed on the device
Each thread executes the kernel; this happens in parallel
93
Review, Key Concepts
Kernel = GPU program, executed by each parallel thread in a block
Block = a 3D collection of threads that can cooperate in using the block's shared memory and can synchronize during execution
Grid = 2D array of blocks of threads that execute a kernel
Device = GPU = set of stream multiprocessors (30 SMs)
Stream Multiprocessor = 8 scalar processors + shared memory + registers

Memory     Location   Cached           Access       Who
Local      Off-chip   No               Read/write   One thread
Shared     On-chip    N/A - resident   Read/write   All threads in a block
Global     Off-chip   No               Read/write   All threads + host
Constant   Off-chip   Yes              Read         All threads + host
Texture    Off-chip   Yes              Read         All threads + host

Off-chip means on-device, i.e., slow access time.
"Parallelism for Everyone" - Parallelism changes the game

A large percentage of people who provide applications are going to have to care about parallelism in order to match the capabilities of their competitors.

[Figure: "Vision of the Future" - performance vs. time, with the Frequency Era giving way to the Multi-core Era; the curves for platform potential and for active vs. passive SD show a growing gap where there used to be a fixed gap; competitive pressures = demand for parallel applications]

Presentation: Paul Petersen, Sr. Principal Engineer, Intel
"SD": Software Development
95
2007