CUDA GPU Computing


Advisor: Cho-Chin Lin

Student: Chien-Chen Lai


Outline

Introduction and Motivation


What is driving the many-cores?

[Figure: peak performance in GFLOPS from Jan 2003 to Jul 2007 (vertical axis 0-600 GFLOPS), comparing NVIDIA GPUs (NV30, NV35, NV40, G70, G70-512, G71, Quadro FX 5600, GeForce 8800 GTX, Tesla C870) against Intel CPUs (3.0 GHz Pentium 4, 3.0 GHz Core 2 Duo, 3.0 GHz Core 2 Quad).]


Design philosophies are different.

[Figure: block diagrams of a CPU and a GPU over DRAM. The CPU devotes large areas to control logic and cache with a few ALUs; the GPU devotes most of its area to many ALUs with little control logic and cache.]

The GPU is specialized for compute-intensive, massively data-parallel computation (exactly what graphics rendering is about).

So more transistors can be devoted to data processing rather than data caching and flow control.


CPU vs. GPU

Jamie and Adam demonstrate the difference between a CPU and GPU.


This is not your advisor’s parallel computer!

Significant application-level speedup over uni-processor execution

No more “killer micros”

Easy entrance: an initial, naïve code typically gets at least a 2-3x speedup


This is not your advisor’s parallel computer!

Wide availability to end users: available on laptops, desktops, clusters, and supercomputers

Numerical precision and accuracy: IEEE floating-point and double precision


Historic GPGPU Constraints

[Figure: the fragment-program model — a Fragment Program with Input Registers, Temp Registers, Constants, Texture, Output Registers, and FB Memory, with resources scoped per thread, per shader, or per context.]

Dealing with graphics API: working with the corner cases of the graphics API

Addressing modes: limited texture size/dimension

Shader capabilities: limited outputs

Instruction sets: lack of integer & bit ops

Communication limited: no interaction between pixels, no scatter store ability (a[i] = p); contrast with the CUDA sketch below
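For contrast with these constraints, a minimal CUDA sketch of a scatter store (a hypothetical kernel, not from the slides): each thread writes to an arbitrary, data-dependent location — the a[i] = p pattern that the fragment-shader path could not express.

// One thread per element of p; idx[i] picks an arbitrary destination in a.
__global__ void scatter_kernel(float *a, const int *idx, const float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // integer index arithmetic
    if (i < n)
        a[idx[i]] = p[i];                            // scatter store, legal in CUDA
}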


CUDA - No more shader functions.

CUDA integrated CPU+GPU application C program

Serial or modestly parallel C code executes on the CPU

Highly parallel SPMD kernel C code executes on the GPU

[Figure: execution alternates between the two processors — CPU serial code, then parallel kernel Grid 0 launched on the GPU as KernelA<<< nBlk, nTid >>>(args), then more CPU serial code, then Grid 1 launched as KernelB<<< nBlk, nTid >>>(args).]
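A minimal, self-contained sketch of that structure, assuming two hypothetical kernels standing in for KernelA and KernelB (the names nBlk and nTid follow the slide; the kernel bodies are illustrative only):

#include <cuda_runtime.h>

__global__ void KernelA(float *d_data, int n)      // highly parallel SPMD code
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] = 2.0f * d_data[i];
}

__global__ void KernelB(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] = d_data[i] + 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    const int nTid = 256;                           // threads per block
    const int nBlk = (n + nTid - 1) / nTid;         // blocks per grid

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    /* ... serial or modestly parallel C code on the CPU ... */

    KernelA<<<nBlk, nTid>>>(d_data, n);             // Grid 0 on the GPU
    cudaDeviceSynchronize();

    /* ... more serial CPU code ... */

    KernelB<<<nBlk, nTid>>>(d_data, n);             // Grid 1 on the GPU
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}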


CUDA for Multi-Core CPU

A single GPU thread is too small for a CPU thread: CUDA emulation does this and performs poorly

CPU cores designed for ILP, SIMD: optimizing compilers work well with iterative loops

Turn GPU thread blocks from CUDA into iterative CPU loops (sketched below)

[Figure: the compiler maps a CUDA grid onto either the GPU or a multi-core CPU.]
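A hedged sketch of that transformation (illustrative only, not the actual compiler output): the per-thread kernel becomes a pair of nested loops, and a multi-core build would then distribute the outer block loop across CPU threads.

// Original data-parallel kernel: one logical GPU thread per element.
__global__ void scale_kernel(float *a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = s * a[i];
}

// Plausible CPU translation: each thread block becomes an iterative loop
// over threadIdx.x, and the grid becomes an outer loop over block indices.
void scale_kernel_cpu(float *a, float s, int n, int gridDim_x, int blockDim_x)
{
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; ++blockIdx_x)
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
            int i = blockIdx_x * blockDim_x + threadIdx_x;
            if (i < n) a[i] = s * a[i];
        }
}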


CUDA for Multi-Core CPU

Application   | C on single-core CPU (time) | CUDA on 4-core CPU (time) | Speedup* | CUDA on G80 (time)
MRI-FHD       | ~1000 s                     | 230 s                     | ~4x      | 8.5 s
CP            | 180 s                       | 45 s                      | 4x       | 0.28 s
SAD           | 42.5 ms                     | 25.6 ms                   | 1.66x    | 4.75 ms
MM (4Kx4K)    | 7.84 s**                    | 15.5 s                    | 3.69x    | 1.12 s
