INF5063 – GPU & CUDA Håkon Kvale Stensland Department for Informatics / Simula Research Laboratory


Page 1: INF5063 – GPU & CUDA

INF5063 – GPU & CUDA

Håkon Kvale Stensland
Department for Informatics / Simula Research Laboratory

Page 2: INF5063 – GPU & CUDA

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland, University of Oslo

GPU – Graphics Processing Units

Page 3: INF5063 – GPU & CUDA


Basic 3D Graphics Pipeline

Application

Scene Management

Geometry

Rasterization

Pixel Processing

ROP/FBI/Display

FrameBuffer

Memory

Host

GPU

Page 4: INF5063 – GPU & CUDA


PC Graphics Timeline

Challenges:
− Render infinitely complex scenes
− And extremely high resolution
− In 1/60th of one second (60 frames per second)

Graphics hardware has evolved from a simple hardwired pipeline to a highly programmable multi-core processor

1997: DirectX 5, Riva 128
1998: DirectX 6 (Multitexturing), Riva TNT
1999: DirectX 7 (T&L, TextureStageState), GeForce 256
2001: DirectX 8 (SM 1.x), GeForce 3, Cg
2003: DirectX 9 (SM 2.0), GeForceFX
2004: DirectX 9.0c (SM 3.0), GeForce 6
2005: DirectX 9.0c (SM 3.0), GeForce 7
2006: DirectX 10 (SM 4.0), GeForce 8

Page 5: INF5063 – GPU & CUDA


Graphics in the PC Architecture

DMI (Direct Media Interface) between processor and chipset
− Memory controller now integrated in the CPU
− The old “Northbridge” is integrated onto the CPU; Intel calls this part of the CPU the “System Agent”
− PCI Express 3.0 x16 bandwidth at 32 GB/s (16 GB/s in each direction)

“Southbridge” (X99) handles all other peripherals

All mainstream CPUs now come with an integrated GPU
− Same capabilities as discrete GPUs
− Less performance (limited by die space and power)

Intel Haswell

Page 6: INF5063 – GPU & CUDA


GPUs not always for Graphics

GPUs are now common in HPC

The second largest supercomputer in November 2013 was Titan at Oak Ridge National Laboratory:
− 18,688 16-core Opteron processors
− 18,688 Nvidia Kepler GPUs
− Theoretical peak: 27 petaflops

Before: dedicated compute cards were released after the graphics model

Now: Nvidia’s high-end Kepler GPU (GK110) was released first as a compute product

GeForce GTX Titan Black

Tesla K40

Page 7: INF5063 – GPU & CUDA


High-end Compute Hardware: nVIDIA Kepler Architecture

The latest generation GPU, codenamed GK110:
− 7.1 billion transistors
− 2688 processing cores (SP)
− IEEE 754-2008 capable
− Shared coherent L2 cache
− Full C++ support
− Up to 32 concurrent kernels
− 6 GB memory
− Supports GPU virtualization

Tesla K40

Page 8: INF5063 – GPU & CUDA


High-end Graphics Hardware: nVIDIA Maxwell Architecture

The latest generation GPU, codenamed GM204:
− 5.2 billion transistors
− 2048 processing cores (SP)
− IEEE 754-2008 capable
− Shared coherent L2 cache
− Full C++ support
− Up to 32 concurrent kernels
− 4 GB memory
− Supports GPU virtualization

GeForce GTX 980

Page 9: INF5063 – GPU & CUDA


GeForce GM204 Architecture

Page 10: INF5063 – GPU & CUDA


Lab Hardware

nVIDIA GeForce GTX 750
− All machines have the same GPU
− Maxwell architecture, based on the GM107 chip
− 1870 million transistors
− 512 processing cores (SP) at 1022 MHz (4 SMM)
− 1024 MB memory with 80 GB/s bandwidth
− 1044 GFLOPS theoretical performance
− Compute capability 5.0

Page 11: INF5063 – GPU & CUDA


CPU and GPU Design Philosophy

GPU: Throughput-Oriented Cores
− Chip built from many compute units, each with a small cache/local memory, registers, SIMD units, and massive threading

CPU: Latency-Oriented Cores
− Chip built from a few cores, each with a large local cache, registers, a SIMD unit, and sophisticated control logic

Page 12: INF5063 – GPU & CUDA


CPUs: Latency Oriented Design

Large caches
− Convert long-latency memory accesses into short-latency cache accesses

Sophisticated control
− Branch prediction for reduced branch latency
− Data forwarding for reduced data latency

Powerful ALUs
− Reduced operation latency

(Figure: a CPU with control logic, a few powerful ALUs, a large cache, and DRAM.)

Page 13: INF5063 – GPU & CUDA


GPUs: Throughput Oriented Design

Small caches
− To boost memory throughput

Simple control
− No branch prediction
− No data forwarding

Energy-efficient ALUs
− Many, with long latency, but heavily pipelined for high throughput

Requires a massive number of threads to tolerate latencies


Page 14: INF5063 – GPU & CUDA


Think both about CPU and GPU…

CPUs for sequential parts where latency matters
− CPUs can be 10+× faster than GPUs for sequential code

GPUs for parallel parts where throughput wins
− GPUs can be 10+× faster than CPUs for parallel code

Page 15: INF5063 – GPU & CUDA


The Core: The basic processing block

The nVIDIA approach:
− Called Stream Processors or CUDA cores; each works on a single operation

The AMD approach:
− VLIW5: the GPU works on up to five operations
− VLIW4: the GPU works on up to four operations
− GCN: 16-wide SIMD vector unit

Graphics Core Next (GCN)

Page 16: INF5063 – GPU & CUDA


The Core: The basic processing block

The (failed) Intel approach:
− 512-bit SIMD units in x86 cores
− Failed because of complex x86 cores and a software ROP pipeline

The (new) Intel approach:
− Used in Sandy Bridge, Ivy Bridge, Haswell & Broadwell
− 128 SIMD-8 32-bit registers

Page 17: INF5063 – GPU & CUDA


The nVIDIA GPU Architecture Evolving

Streaming Multiprocessor (SM) 2.0 on the Fermi architecture (GF1xx):
− 32 CUDA cores
− 4 Super Function Units (SFU)
− Dual schedulers and dispatch units
− 1 to 1536 active threads
− Local register file (32K)
− 64 KB shared memory / Level 1 cache
− 2 operations per cycle

Streaming Multiprocessor (SM) 1.x on the Tesla architecture:
− 8 CUDA cores
− 2 Super Function Units (SFU)
− Dual schedulers and dispatch units
− 1 to 512 or 768 active threads
− Local register file (32K)
− 16 KB shared memory
− 2 operations per cycle

Page 18: INF5063 – GPU & CUDA


The nVIDIA GPU Architecture Evolving

Streaming Multiprocessor (SMX) 3.5 on the Kepler architecture (Compute):
− 192 CUDA cores
− 64 DP CUDA cores
− 32 Super Function Units (SFU)
− Four (simple) schedulers and eight dispatch units
− 1 to 2048 active threads
− Local register file (64K)
− 64 KB shared memory / Level 1 cache
− 48 KB texture cache
− 1 operation per cycle

Streaming Multiprocessor (SMX) 3.0 on the Kepler architecture (Graphics):
− 192 CUDA cores
− 8 DP CUDA cores
− 32 Super Function Units (SFU)
− Four (simple) schedulers and eight dispatch units
− 1 to 2048 active threads
− Local register file (32K)
− 64 KB shared memory / Level 1 cache
− 1 operation per cycle

Page 19: INF5063 – GPU & CUDA


Streaming Multiprocessor (SMM) – 4.0

Streaming Multiprocessor (SMM) on the Maxwell architecture (Graphics):
− 128 CUDA cores
− 4 DP CUDA cores
− 32 Super Function Units (SFU)
− Four schedulers and eight dispatch units
− 1 to 2048 active threads
− Software-controlled scheduling
− Local register file (64K)
− 64 KB shared memory
− 24 KB Level 1 / texture cache
− 1 operation per cycle

GM107 / GM204

Page 20: INF5063 – GPU & CUDA


Execution Pipes in SMM

Scalar MAD pipe
− Float multiply, add, etc.
− Integer operations
− Conversions
− Only one instruction per clock

Scalar SFU pipe
− Special functions like sin, cos, log, etc.
− Only one operation per four clocks

Load/Store pipe
− CUDA has both global and local memory access through Load/Store
− Access to shared memory

Texture units
− Access to texture memory with a texture cache (per SMM)

(Figure: SM pipeline with L1 instruction cache, multithreaded instruction buffer, register file, L1 constant cache, shared memory, operand select, and MAD and SFU units.)

Page 21: INF5063 – GPU & CUDA

GPGPU

Slides adapted from nVIDIA

Page 22: INF5063 – GPU & CUDA


What is really GPGPU?

Idea:
• Potential for very high performance at low cost
• Architecture well suited for certain kinds of parallel applications (data parallel)
• Demonstrations of 30-100× speedup over CPU

Early challenges:
− Architectures were very customized to graphics problems (e.g., vertex and fragment processors)
− Programmed using graphics-specific programming models or libraries

Page 23: INF5063 – GPU & CUDA


Previous GPGPU use, and limitations

Working with a graphics API
− Special cases with an API like Microsoft Direct3D or OpenGL

Addressing modes
− Limited by texture size

Shader capabilities
− Limited outputs of the available shader programs

Instruction sets
− No integer or bit operations

Communication is limited
− Between pixels

(Figure: fragment-shader programming model: input registers and temp registers per thread, constants and texture per shader/context, a fragment program, and output registers writing to framebuffer memory.)

Page 24: INF5063 – GPU & CUDA


Heterogeneous computing is catching on…

Financial Analysis, Scientific Simulation, Engineering Simulation, Data Intensive Analytics, Medical Imaging, Digital Audio Processing, Computer Vision, Digital Video Processing, Biomedical Informatics, Electronic Design Automation, Statistical Modeling, Ray Tracing, Rendering, Interactive Physics, Numerical Methods

Page 25: INF5063 – GPU & CUDA


nVIDIA CUDA

“Compute Unified Device Architecture”

General-purpose programming model
− The user starts several batches of threads on a GPU
− The GPU is in this case a dedicated, super-threaded, massively data-parallel co-processor

Software stack
− Graphics driver, language compilers (Toolkit), and tools (SDK)

The graphics driver loads programs into the GPU
− All drivers from nVIDIA now support CUDA
− The interface is designed for computing (no graphics)
− “Guaranteed” maximum download & readback speeds
− Explicit GPU memory management

Page 26: INF5063 – GPU & CUDA


Outline

The CUDA programming model
− Basic concepts and data types

An example application:
− The good old Motion JPEG implementation!
− Profiling with CUDA
− Make an example program!

Page 27: INF5063 – GPU & CUDA


The CUDA Programming Model

The GPU is viewed as a compute device that:
− Is a coprocessor to the CPU, referred to as the host
− Has its own DRAM, called device memory
− Runs many threads in parallel

Data-parallel parts of an application are executed on the device as kernels, which run in parallel on many threads

Differences between GPU and CPU threads
− GPU threads are extremely lightweight
  • Very little creation overhead
− A GPU needs 1000s of threads for full efficiency
  • A multi-core CPU needs only a few

Page 28: INF5063 – GPU & CUDA


CUDA C – Execution Model

Integrated host + device C program
− Serial or modestly parallel parts in host C code
− Highly parallel parts in device SPMD kernel C code

Serial code (host)
    . . .
Parallel kernel (device):  KernelA<<< nBlk, nTid >>>(args);
Serial code (host)
    . . .
Parallel kernel (device):  KernelB<<< nBlk, nTid >>>(args);

Page 29: INF5063 – GPU & CUDA


Thread Batching: Grids and Blocks

A kernel is executed as a grid of thread blocks
− All threads share the data memory space

A thread block is a batch of threads that can cooperate with each other by:
− Synchronizing their execution
  • Non-synchronous execution is very bad for performance!
− Efficiently sharing data through a low-latency shared memory

Two threads from two different blocks cannot cooperate

(Figure: the host launches Kernel 1 on Grid 1, a 3×2 array of blocks, and Kernel 2 on Grid 2; Block (1, 1) expands to a 5×3 array of threads.)

Page 30: INF5063 – GPU & CUDA


CUDA Device Memory Space Overview

Each thread can:
− R/W per-thread registers
− R/W per-thread local memory
− R/W per-block shared memory
− R/W per-grid global memory
− Read-only per-grid constant memory
− Read-only per-grid texture memory

The host can R/W global, constant, and texture memories

(Figure: device grid of blocks; each block has shared memory, and each thread has registers and local memory; global, constant, and texture memory are per grid and reachable from the host.)

Page 31: INF5063 – GPU & CUDA


Global, Constant, and Texture Memories

Global memory:
− Main means of communicating R/W data between host and device
− Contents visible to all threads

Texture and constant memories:
− Constants initialized by the host
− Contents visible to all threads

Page 32: INF5063 – GPU & CUDA


Terminology Recap

device = GPU = set of multiprocessors
Multiprocessor = set of processors & shared memory
Kernel = program running on the GPU
Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory    Location   Cached          Access      Who
Local     Off-chip   No              Read/write  One thread
Shared    On-chip    N/A (resident)  Read/write  All threads in a block
Global    Off-chip   No              Read/write  All threads + host
Constant  Off-chip   Yes             Read        All threads + host
Texture   Off-chip   Yes             Read        All threads + host

Page 33: INF5063 – GPU & CUDA


Access Times

Register – dedicated HW – single cycle
Shared memory – dedicated HW – single cycle
Local memory – DRAM, no cache – “slow”
Global memory – DRAM, no cache – “slow”
Constant memory – DRAM, cached, 1 to 100s of cycles, depending on cache locality
Texture memory – DRAM, cached, 1 to 100s of cycles, depending on cache locality


Page 35: INF5063 – GPU & CUDA


Global Memory Bandwidth

(Figure: ideal vs. reality of achieved global memory bandwidth.)

Page 36: INF5063 – GPU & CUDA


Conflicting Data Accesses Cause Serialization and Delays

Massively parallel execution cannot afford serialization

Contention in accessing critical data causes serialization

Page 37: INF5063 – GPU & CUDA

Some Information on the Toolkit

Page 38: INF5063 – GPU & CUDA


Compilation

Any source file containing CUDA language extensions must be compiled with nvcc

nvcc is a compiler driver
− Works by invoking all the necessary tools and compilers, like cudacc, g++, etc.

nvcc can output:
− Either C code
  • That must then be compiled with the rest of the application using another tool
− Or object code directly

Page 39: INF5063 – GPU & CUDA


Linking & Profiling

Any executable with CUDA code requires two dynamic libraries:
− The CUDA runtime library (cudart)
− The CUDA core library (cuda)

Several tools are available to optimize your application:
− nVIDIA CUDA Visual Profiler
− nVIDIA Occupancy Calculator
− NVIDIA Parallel Nsight for Visual Studio and Eclipse

Page 40: INF5063 – GPU & CUDA


Before you start…

Four lines have to be added to your group user's .bash_profile or .bashrc file:

PATH=$PATH:/usr/local/cuda-6.5/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.5/lib64:/lib
export PATH
export LD_LIBRARY_PATH

Code samples are installed with CUDA. Copy and build them in your user's home directory.

Page 41: INF5063 – GPU & CUDA


Some useful resources

NVIDIA CUDA Programming Guide 6.5
http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf

NVIDIA CUDA C Best Practices Guide 6.5
http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf

Tuning CUDA Applications for Maxwell
http://docs.nvidia.com/cuda/pdf/Maxwell_Tuning_Guide.pdf

Parallel Thread Execution ISA – Version 4.1
http://docs.nvidia.com/cuda/pdf/ptx_isa_4.1.pdf

NVIDIA CUDA Getting Started Guide for Linux
http://docs.nvidia.com/cuda/pdf/CUDA_Getting_Started_Linux.pdf

Page 42: INF5063 – GPU & CUDA

Example:

Motion JPEG Encoding

Page 43: INF5063 – GPU & CUDA


14 different MJPEG encoders on GPU

Nvidia GeForce GPU

Problems:
• Only used global memory
• Too much synchronization between threads
• Host part of the code was not optimized

Page 44: INF5063 – GPU & CUDA


Profiling a Motion JPEG encoder on x86

A small selection of DCT algorithms:
• 2D-Plain: standard forward 2D DCT
• 1D-Plain: two consecutive 1D transformations with a transpose in between and after
• 1D-AAN: optimized version of 1D-Plain
• 2D-Matrix: 2D-Plain implemented with matrix multiplication

Single-threaded application profiled on an Intel Core i5 750

Page 45: INF5063 – GPU & CUDA


Optimizing for GPU: use the memory correctly!

Several different types of memory on the GPU:
• Global memory
• Constant memory
• Texture memory
• Shared memory

First commandment when using GPUs:
• Select the correct memory space, AND use it correctly!

Page 46: INF5063 – GPU & CUDA


How about using a better algorithm?

Used the CUDA Visual Profiler to isolate DCT performance

2D-Plain Optimized is optimized for the GPU:
• Shared memory
• Coalesced memory access
• Loop unrolling
• Branch prevention
• Asynchronous transfers

Second commandment when using GPUs:
• Choose an algorithm suited for the architecture!

Page 47: INF5063 – GPU & CUDA


Effect of offloading VLC to the GPU

VLC (Variable Length Coding) can also be offloaded:
• One thread per macroblock
• The CPU does the bitstream merge

Even though the algorithm is not perfectly suited for the architecture, the offloading effect is still important!