GPUs – Under the Hoodstanley.gatech.edu/.../466/2016/07/gpulecture07su16_underthehood.pdf · GPUs – Under the Hood Prof. Aaron Lanterman School of Electrical and Computer Engineering

GPUs – Under the Hood

Prof. Aaron Lanterman School of Electrical and Computer Engineering

Georgia Institute of Technology

Bandwidth – Gravity of modern computer systems
• The bandwidth between key components ultimately dictates system performance
– Especially true for massively parallel systems processing massive amounts of data
– Tricks like buffering, reordering, caching can temporarily defy the rules in some cases
– Ultimately, the performance falls back to what the “feeds and speeds” dictate
– PCIe replaced AGP (Accelerated Graphics Port)

Slide by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, from UIUC ECE498 Lecture 6, Fall 2007; used with permission. See http://courses.engr.illinois.edu/ece498/al

3D buzzwords
• Fill Rate – how fast the GPU can generate pixels, often a strong predictor for application frame rate
• Performance Metrics
– Mtris/sec - Triangle Rate
– Mverts/sec - Vertex Rate
– Mpixels/sec - Pixel Fill (Write) Rate
– Mtexels/sec - Texture Fill (Read) Rate
– Msamples/sec - Antialiasing Fill (Write) Rate
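One way to see why fill rate predicts frame rate is to bound frames per second by the pixel write rate alone. The sketch below does that arithmetic; the fill rate, resolution, and overdraw factor are illustrative assumptions, not figures from the slides.

```python
def max_fps_from_fill_rate(fill_rate_mpixels, width, height, overdraw):
    """Upper bound on frame rate when pixel fill (write) rate is the only limit."""
    pixels_per_frame = width * height * overdraw
    return fill_rate_mpixels * 1e6 / pixels_per_frame

# Hypothetical numbers: 4000 Mpixels/sec, a 1600x1200 screen,
# and each pixel written 2.5 times per frame on average (overdraw).
fps_bound = max_fps_from_fill_rate(4000, 1600, 1200, 2.5)
print(f"fill-rate-limited bound: {fps_bound:.1f} frames/sec")
```

A real application lands below this bound, since vertex work, texture reads, and memory bandwidth also take time.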


Adding programmability to the pipeline
[Figure: programmable graphics pipeline. Stages: 3D Application or Game → 3D API: OpenGL or Direct3D → (CPU – GPU Boundary) → GPU Front End → Programmable Vertex Processor → Primitive Assembly → Rasterization & Interpolation → Programmable Fragment Processor → Raster Operations → Framebuffer. Data streams labeled in the figure: 3D API Commands; GPU Command & Data Stream; Vertex Index Stream; Pre-transformed Vertices; Transformed Vertices; Assembled Polygons, Lines, and Points; Pixel Location Stream; Rasterized Pre-transformed Fragments; Transformed Fragments; Pixel Updates.]


Shader data
• Typically floats, and vectors/matrices of floats
• Fixed size arrays
• Three main types:
– Per-instance data, e.g., per-vertex position
– Per-pixel interpolated data, e.g., texture coordinates
– Per-batch data, e.g., light position
• Data are tightly bound to the GPU

Shader flow control
• Very simple
• No recursion
• Fixed size loops for Shader Model 2.0 or earlier
• Simple if-then-else statements allowed in the latest APIs
• Texkill (asm) or clip (HLSL) or discard (GLSL) allows you to abort a write to a pixel (form of flow control)
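The abort-a-write idea can be modeled outside a shader language. The sketch below is a toy model in Python, not shader code: a hypothetical `alpha_test_shader` returns `None` to stand in for texkill / clip / discard, and a hypothetical `draw` loop skips the frame buffer write in that case. The alpha threshold is an assumption.

```python
def alpha_test_shader(color, threshold=0.5):
    """Toy fragment shader: return None to 'discard', else the color to write."""
    r, g, b, a = color
    if a < threshold:
        return None  # plays the role of texkill / clip / discard
    return color

def draw(framebuffer, fragments, shader):
    """Write each shaded fragment unless the shader discarded it."""
    for (x, y), color in fragments:
        result = shader(color)
        if result is not None:
            framebuffer[(x, y)] = result

framebuffer = {(0, 0): (0.0, 0.0, 0.0, 1.0), (1, 0): (0.0, 0.0, 0.0, 1.0)}
fragments = [((0, 0), (1.0, 0.0, 0.0, 0.9)),   # opaque enough: written
             ((1, 0), (0.0, 1.0, 0.0, 0.1))]   # nearly transparent: discarded
draw(framebuffer, fragments, alpha_test_shader)
```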

Specialized instructions (GeForce 6)
• Dot products
• Exponential instructions:
– EXP, LOG
– LIT (Blinn specular lighting model calculation!)
• Reciprocal instructions:
– RCP (reciprocal)
– RSQ (reciprocal square root!)
• Trigonometric functions
– SIN, COS
• Swizzling (swapping xyzw), write masking (only some xyzw get assigned), and negation are “free”

From GPU Gems 2, p. 484
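To make swizzling, write masking, and negation concrete, here is a small Python model of the three operand modifiers. The helper names (`swizzle`, `masked_write`, `negate`) are made up for illustration; on the hardware these are operand modifiers folded into an instruction rather than separate steps.

```python
COMPONENTS = "xyzw"

def swizzle(vec, pattern):
    """Return a new 4-vector picking components by name, e.g. 'zzzz' or 'yzxw'."""
    return tuple(vec[COMPONENTS.index(c)] for c in pattern)

def masked_write(dest, src, mask):
    """Assign only the components named in mask (e.g. 'xy'); the rest keep old values."""
    return tuple(s if c in mask else d for c, d, s in zip(COMPONENTS, dest, src))

def negate(vec):
    """Negate every component."""
    return tuple(-v for v in vec)

v = (1.0, 2.0, 3.0, 4.0)
r0 = (0.0, 0.0, 0.0, 0.0)
# Roughly what a single move with a swizzled, negated source and a write mask does:
# write -v.z into the x and y components of r0 and leave z and w alone.
r0 = masked_write(r0, negate(swizzle(v, "zzzz")), "xy")
```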

Vertex shader
• Transform to clip space
• Inputs:
– Common inputs:
  • Vertex position (x, y, z, w)
  • Texture coordinate
  • Vertex colors
  • Constant inputs
– Output to a pixel (fragment) shader
• Vertex shader is executed once per vertex, so usually less expensive than pixel shader
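The core job listed above, transforming a vertex to clip space and passing other attributes on, can be sketched as a 4x4 matrix times a 4-component position. This is a Python model under assumptions: the function names and the example matrix are hypothetical, and the matrix stands in for whatever combined model-view-projection constant the application binds.

```python
def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

def vertex_shader(position, texcoord, color, model_view_proj):
    """Toy vertex shader: transform position to clip space, pass the rest through."""
    return {"clip_pos": mat_vec(model_view_proj, position),
            "texcoord": texcoord, "color": color}

# Hypothetical constant input: scales x and y by 0.5 and shifts z by -1.
mvp = [[0.5, 0.0, 0.0, 0.0],
       [0.0, 0.5, 0.0, 0.0],
       [0.0, 0.0, 1.0, -1.0],
       [0.0, 0.0, 0.0, 1.0]]
out = vertex_shader((2.0, 4.0, 3.0, 1.0), (0.25, 0.75), (1.0, 1.0, 1.0, 1.0), mvp)
```

Each row of the product is one 4-component dot product, which is why dot-product instructions matter so much in vertex programs.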

Vertex shader data flow (3.0)
[Figure: vertex shader register model. Inputs: 16 vertex data registers (v0 to v15) fed by the vertex stream. Constants: constant float registers (C0 to Cn, at least 256) and 16 constant integer registers. Working state: 32 temporary registers (r0 to r31), address register a0, loop register aL. Outputs: 12 output registers, including oPos (position), oTn (texture), oFog (fog), oD0 (diffuse color), oD1 (specular color), and oPts (output point size). Each register is a 4-component vector register except aL.]

Vertex shader: logical view
[Figure: Vertex Processing Unit. Per-vertex input data enters a register file (r0, r1, r2, r3, ...), passes through a swizzle/mask unit (.rgba, .xyzw, .zzzz, .xxyz, ...) and a math/logic unit (cosine, log, sine, sub, add, ...), and leaves as per-vertex output data (transformed and lit vertices). Shader resources bound by the application: shader start address, bound textures, bound samplers, bound constants. A sampler unit reads texture memory; shader constants reside in memory. Legend: input data, architectural state, output data, control logic, state information.]

Some uses of vertex shaders
• Transform vertices to clip space
• Pass normal, texture coordinates to PS
• Transform vectors to other spaces (e.g., texture space)
• Calculate per-vertex lighting (e.g., Gouraud shading)
• Distort geometry (waves)

Adapted from Mart Slot’s presentation


Easy cross products and normalization

From Stanford CS448A: Real-Time Graphics Architectures See graphics.stanford.edu/courses/cs448a-01-fall
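The figure for this slide did not survive extraction. A commonly used pattern, sketched here in Python as an assumption about what it showed, builds a cross product from two swizzled multiplies (a MUL followed by a MAD) and normalizes with a DP3 / RSQ / MUL sequence. The helper names are hypothetical.

```python
import math

def swz(v, order):
    """Swizzle a 3-vector, e.g. swz(v, 'yzx')."""
    return tuple(v["xyz".index(c)] for c in order)

def cross(a, b):
    """Cross product as two swizzled multiplies and a subtract (MUL, then MAD)."""
    t = tuple(p * q for p, q in zip(swz(a, "zxy"), swz(b, "yzx")))  # MUL
    return tuple(p * q - r
                 for p, q, r in zip(swz(a, "yzx"), swz(b, "zxy"), t))  # MAD

def normalize(v):
    """Normalize with the DP3 / RSQ / MUL pattern."""
    d = sum(c * c for c in v)          # DP3: dot(v, v)
    rsq = 1.0 / math.sqrt(d)           # RSQ: reciprocal square root
    return tuple(c * rsq for c in v)   # MUL: scale each component
```

For example, `cross((1, 0, 0), (0, 1, 0))` gives the z axis, and `normalize((3.0, 0.0, 4.0))` gives a unit vector along the same direction.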


Blinn lighting in “one” instruction
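The slide's listing is not recoverable, so the following is a sketch of the LIT instruction's semantics as commonly documented for vertex program assembly, written in Python. It takes N·L, N·H, and a specular exponent and returns ambient, diffuse, and specular coefficients in one step. The function name and the omission of the hardware's clamp on the exponent are simplifying assumptions.

```python
def lit(n_dot_l, n_dot_h, shininess):
    """Return (ambient, diffuse, specular, 1) coefficients like the LIT instruction."""
    diffuse = max(n_dot_l, 0.0)
    # Specular is suppressed entirely when the surface faces away from the light.
    specular = max(n_dot_h, 0.0) ** shininess if n_dot_l > 0.0 else 0.0
    return (1.0, diffuse, specular, 1.0)

lit_front = lit(0.5, 0.5, 2.0)    # surface faces the light
lit_back = lit(-0.2, 0.9, 2.0)    # surface faces away: no diffuse, no specular
```

The quotes around “one” are apt: the dot products and normalizations that feed LIT still cost several instructions.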



Simple graphics pipeline


Pixel (or fragment) shader (1)
• Determine each fragment’s color
– Custom (sophisticated) pixel operations
– Texture sampling
• Inputs
– Interpolated output from vertex shader
– Typically vertex position, vertex normals, texture coordinates, etc.
– These registers could be reused for other purposes
• Output
– Color (including alpha)
– Depth value (optional)

Pixel (or fragment) shader (2)

• Executed once per pixel, hence typically executed many more times than a vertex shader

• It is advantageous to compute stuff on a per-vertex basis to improve performance

Pixel shader data flow (3.0)
[Figure: pixel shader register model. Inputs: color (diffuse/specular) and texture coordinate registers (v0 to v9) fed by the pixel stream. Constants: constant registers C0 to Cn (16 INT, 224 Float). Working state: temporary registers r0 to r31. Samplers: sampler registers s0 to s15 (up to 16 texture surfaces can be read in a single pass). Outputs: oC0 (color) and oDepth (depth).]

Pixel shader: logical view
[Figure: Pixel Processing Unit. Per-pixel input data arrives from the interpolator, enters a register file (r0, r1, r2, r3, ...), passes through a swizzle/mask unit (.rgba, .xyzw, .zzzz, .xxyz, ...) and a math/logic unit (cosine, log, sine, sub, add, ...), and leaves as per-pixel output data. Output labels: pixel color, depth info, stencil info, going to the color buffer, depth buffer, and stencil buffer. Shader resources bound by the application: shader start address, bound textures, bound samplers, bound constants. A sampler unit reads texture memory; shader constants reside in memory. Legend: input data, architectural state, output data, control logic, state information.]

Some uses of pixel shaders
• Texturing objects
• Per-pixel lighting (e.g., Phong shading)
• Normal mapping (each pixel has its own normal)
• Shadows (determine whether a pixel is shadowed or not)
• Environment mapping

Adapted from Mart Slot’s presentation

Old GeForce graphics pipeline
[Figure: block diagram. Host → Vertex Control (with Vertex Cache) → VS/T&L → Triangle Setup → Raster → Shader (with Texture Cache) → ROP → FBI → Frame Buffer Memory]


Vertex cache
• Reusing vertices between primitives saves PCIe bus bandwidth and GPU computational resources
• A vertex cache attempts to exploit “commonality” between triangles to generate vertex reuse
• Unfortunately, many applications do not use efficient triangle ordering
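The effect of triangle ordering on vertex reuse can be estimated by replaying an index stream through a small cache and counting misses. The sketch below assumes a FIFO replacement policy and a caller-chosen cache size; the slides state neither, so treat both as illustrative.

```python
from collections import deque

def vertex_shader_runs(indices, cache_size):
    """Count vertex transforms for an index stream with a FIFO vertex cache."""
    cache = deque(maxlen=cache_size)   # oldest entry falls out when full
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1                # cache miss: vertex must be transformed again
            cache.append(idx)
    return misses

# Two triangles sharing an edge (a quad): 6 indices, 4 distinct vertices.
quad = [0, 1, 2, 2, 1, 3]
# Same index count, but no vertex shared between the triangles.
soup = [0, 1, 2, 3, 4, 5]
```

With a 16-entry cache the quad costs 4 transforms and the unshared triangles cost 6; shrinking the cache to one entry pushes the quad to 5, which is the kind of loss a poor ordering causes on a realistically sized cache.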



Texture cache
• Stores temporally local texel values to reduce bandwidth requirements
• Due to the nature of texture filtering, high degrees of efficiency are possible (75% or better hit rates)
• Reduces texture (memory) bandwidth by a factor of four for bilinear filtering
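The factor of four follows from bilinear filtering reading a 2x2 texel block per pixel while neighboring pixels share most of those texels. The sketch below counts fetches for an idealized case: one texel per pixel, sample points on texel corners, and an unbounded cache. All three are assumptions; a real cache is finite and its behavior depends on access order.

```python
def bilinear_footprint(u, v):
    """Texel coordinates of the 2x2 neighborhood a bilinear lookup reads."""
    x0, y0 = int(u - 0.5), int(v - 0.5)
    return [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]

def ideal_hit_rate(width, height):
    """Hit rate of an unbounded cache when each pixel maps to one texel (1:1 scale)."""
    fetched = set()
    total = 0
    for py in range(height):
        for px in range(width):
            # Sample on a texel corner so four distinct texels contribute.
            for texel in bilinear_footprint(px + 1.0, py + 1.0):
                total += 1
                fetched.add(texel)
    return 1.0 - len(fetched) / total

hit_rate = ideal_hit_rate(64, 64)
```

In this model the hit rate approaches 75% from below as the region grows, which corresponds to one memory fetch per four texel reads; magnified textures reuse texels even more.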


Built-in texture filtering (GeForce 6)
• Pixel texturing
– Hardware supports 2D, 3D, and cube map
– Non power-of-2 textures OK
– Hardware handles addressing and interpolation
  • Bilinear, trilinear (3D or mipmap), anisotropic
• Vertex texturing
– Vertex processors can access texture memory too
– Only nearest-neighbor filtering supported in G60 hardware

ROP (Raster Operations)
• C-ROP performs frame buffer blending
– Combinations of colors and transparency
– Antialiasing
– Read/Modify/Write the Color Buffer
• Z-ROP performs the Z operations
– Determine the visible pixels
– Discard the occluded pixels
– Read/Modify/Write the Z-Buffer
• ROP on GeForce also performs
– “Coalescing” of transactions
– Z-Buffer compression/decompression
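The two read/modify/write paths can be sketched for a single fragment. This Python model assumes a less-than depth test and source-alpha "over" blending; real hardware makes both configurable, and the function name is hypothetical.

```python
def rop_write(color_buf, z_buf, x, y, src_rgba, src_z):
    """Z-ROP then C-ROP for one fragment: depth test, then alpha blending."""
    if src_z >= z_buf[y][x]:
        return False                       # occluded: discard, buffers untouched
    z_buf[y][x] = src_z                    # read/modify/write the Z-buffer
    r, g, b, a = src_rgba
    dr, dg, db = color_buf[y][x]           # read the color buffer
    color_buf[y][x] = (a * r + (1 - a) * dr,
                       a * g + (1 - a) * dg,
                       a * b + (1 - a) * db)
    return True

color_buf = [[(0.0, 0.0, 0.0)]]            # 1x1 frame buffer, cleared to black
z_buf = [[1.0]]                            # cleared to the far plane
near_written = rop_write(color_buf, z_buf, 0, 0, (1.0, 0.0, 0.0, 0.5), 0.25)
far_written = rop_write(color_buf, z_buf, 0, 0, (0.0, 1.0, 0.0, 1.0), 0.75)
```

Every accepted fragment touches both buffers, which is why the ROP stage leans so heavily on frame buffer bandwidth.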



The frame buffer
• The primary determinant of graphics performance other than the GPU
• The most expensive component of a graphics product other than the GPU
• Memory bandwidth is the key
• Frame buffer size also determines
– Local texture storage
– Maximum resolutions
– Antialiasing resolution limits
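A quick estimate shows how resolution and antialiasing eat into frame buffer memory. The sketch assumes 4 bytes of color and 4 bytes of depth/stencil per sample and that every antialiasing sample is stored; it ignores double buffering and compression, so read it as a lower-bound illustration only.

```python
def frame_buffer_bytes(width, height, aa_samples=1, color_bytes=4, depth_bytes=4):
    """Bytes for one color buffer plus one depth/stencil buffer at a sample count."""
    return width * height * aa_samples * (color_bytes + depth_bytes)

MIB = 1024 * 1024
no_aa = frame_buffer_bytes(1600, 1200) / MIB                 # about 14.6 MiB
aa_4x = frame_buffer_bytes(1600, 1200, aa_samples=4) / MIB   # four times as much
```

Whatever these buffers consume is no longer available for local texture storage.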



Frame Buffer Interface (FBI)
• Manages reading from and writing to frame buffer
• Perhaps the most performance-critical component of a GPU
• GeForce’s FBI is a crossbar
• Independent memory controllers for 4+ independent memory banks for more efficient access to frame buffer

[Figure: pipeline block diagram. Host → Vertex Control (with Surface Engine and Vertex Cache) → T&L → Triangle Setup → Raster → Shader (with Texture Cache) → ROP → FBI → Frame Buffer Memory]


GeForce 7800 GTX board details
[Figure: annotated board photo. Labels: 256MB/256-bit DDR3, 600 MHz, 8 pieces of 8Mx32; 16x PCI-Express; SLI Connector; DVI x 2; sVideo TV Out; Single slot cooling]
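The memory figures on the board are enough to work out capacity, bus width, and peak bandwidth. The sketch below assumes that "600 MHz" is the memory clock and that the DDR-style interface moves two transfers per clock; the function name is hypothetical.

```python
def peak_bandwidth_gb_per_s(bus_bits, clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes) for a DDR-style bus."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9

# "8 pieces of 8Mx32": eight chips, each 8M words of 32 bits.
chips, words_per_chip, bits_per_word = 8, 8 * 2**20, 32
capacity_mb = chips * words_per_chip * bits_per_word // 8 // 2**20   # 256 MB
bus_width = chips * bits_per_word                                    # 256 bits
bandwidth = peak_bandwidth_gb_per_s(bus_width, 600)                  # about 38.4 GB/s
```

The capacity and bus width reproduce the 256MB/256-bit label, and the bandwidth is the "feeds and speeds" ceiling that the earlier slides say performance eventually falls back to.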


NVIDIA 7800 GTX: G70 Architecture
[Figure: G70 block diagram with the vertex processors, pixel processors, and ROPs (raster operation units) labeled]
From www.xbitlabs.com/articles/video/display/g70-indepth.html

NVIDIA 7800 GTX – Vertex processors
[Figure: vertex processor block diagram]
7800 GTX has 8 of these

NVIDIA 7800 GTX – Pixel processors
[Figure: pixel processor block diagram]
8 MADD (multiply/add) instructions in a single cycle
From http://www.xbitlabs.com/articles/video/display/g70-indepth_3.html
7800 GTX has 24 of these
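Those two numbers give a peak arithmetic rate once a clock is chosen. In the sketch below the 24 processors and 8 MADDs per cycle come from the slide, while the 430 MHz core clock is an assumption that does not appear in the deck.

```python
def peak_madds_per_second(pixel_processors, madds_per_cycle, core_clock_mhz):
    """Peak multiply/add operations per second across all pixel processors."""
    return pixel_processors * madds_per_cycle * core_clock_mhz * 1e6

# 24 processors and 8 MADDs/cycle are from the slides; 430 MHz is an assumed clock.
madds = peak_madds_per_second(24, 8, 430)
gflops = 2 * madds / 1e9    # counting each MADD as one multiply plus one add
```

This is a ceiling for the pixel pipeline alone; sustained rates depend on how well the shader keeps both operations of each MADD busy and on memory bandwidth.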

Modern GPUs: unified design

Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

GeForce 8 architecture


Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

Why unify? (1)

Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

Why unify? (2)


Dynamic load balancing – Company of Heroes


Motivation for shader languages

From The Cg Tutorial

• Programming powerful hardware with assembly code is hard
• Programmers need the benefits of a high-level language:
– Easier programming
– Easier code reuse
– Easier debugging
– Portability

Assembly:
    …
    DP3 R0, c[11].xyzx, c[11].xyzx;
    RSQ R0, R0.x;
    MUL R0, R0.x, c[11].xyzx;
    MOV R1, c[3];
    MUL R1, R1.x, c[0].xyzx;
    DP3 R2, R1.xyzx, R1.xyzx;
    RSQ R2, R2.x;
    MUL R1, R2.x, R1.xyzx;
    ADD R2, R0.xyzx, R1.xyzx;
    DP3 R3, R2.xyzx, R2.xyzx;
    RSQ R3, R3.x;
    MUL R2, R3.x, R2.xyzx;
    DP3 R2, R1.xyzx, R2.xyzx;
    MAX R2, c[3].z, R2.x;
    MOV R2.z, c[3].y;
    MOV R2.w, c[3].y;
    LIT R2, R2;
    ...

High-level language:
    float3 cSpecular = pow(max(0, dot(Nf, H)), phongExp).xxx;
    float3 cPlastic = Cd * (cAmbient + cDiffuse) + Cs * cSpecular;

Shader languages
• HLSL/Cg most common
– Both are more-or-less compatible
• Other alternatives:
– GLSL (for OpenGL)
– Assembly? (not anymore…)
