GPUs – Under the Hood
Prof. Aaron Lanterman School of Electrical and Computer Engineering
Georgia Institute of Technology
Bandwidth – Gravity of modern computer systems
• The bandwidth between key components ultimately dictates system performance
  – Especially true for massively parallel systems processing massive amounts of data
  – Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
  – Ultimately, performance falls back to what the “feeds and speeds” dictate
• PCIe replaced AGP (Accelerated Graphics Port)
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 6, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
3D buzzwords
• Fill Rate – how fast the GPU can generate pixels, often a strong predictor for application frame rate
• Performance Metrics
  – Mtris/sec - Triangle Rate
  – Mverts/sec - Vertex Rate
  – Mpixels/sec - Pixel Fill (Write) Rate
  – Mtexels/sec - Texture Fill (Read) Rate
  – Msamples/sec - Antialiasing Fill (Write) Rate
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
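As a rough illustration of why fill rate is a strong predictor of frame rate, the sketch below computes an upper bound on frames per second from a pixel fill rate. The function name, the 4000 Mpixels/sec figure, the resolution, and the overdraw factor are all illustrative assumptions, not numbers from the slides.

```python
def max_frame_rate(fill_rate_mpixels_per_sec, width, height, overdraw=1.0):
    """Upper bound on frames/sec if pixel fill rate were the only limit.

    overdraw is the (assumed) average number of times each screen pixel
    is written per frame.
    """
    pixels_written_per_frame = width * height * overdraw
    return fill_rate_mpixels_per_sec * 1e6 / pixels_written_per_frame


# Hypothetical GPU: 4000 Mpixels/sec, 1600x1200 screen, 3x overdraw.
fps_bound = max_frame_rate(4000, 1600, 1200, overdraw=3.0)
print(round(fps_bound, 1))
```

Real frame rates will be lower than this bound, because texture reads, vertex work, and memory bandwidth also limit throughput.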
Adding programmability to the pipeline
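[Figure: the programmable graphics pipeline]
• CPU side: 3D Application or Game → (3D API commands) → 3D API: OpenGL or Direct3D
• CPU – GPU boundary: (GPU command & data stream) → GPU Front End
• GPU Front End → (pre-transformed vertices) → Programmable Vertex Processor → (transformed vertices) → Primitive Assembly
• GPU Front End → (vertex index stream) → Primitive Assembly
• Primitive Assembly → (assembled polygons, lines, and points) → Rasterization & Interpolation
• Rasterization & Interpolation → (rasterized pre-transformed fragments) → Programmable Fragment Processor → (transformed fragments) → Raster Operations
• Rasterization & Interpolation → (pixel location stream) → Raster Operations
• Raster Operations → (pixel updates) → Framebuffer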
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
Shader data
• Typically floats, and vectors/matrices of floats
• Fixed size arrays
• Three main types:
  – Per-instance data, e.g., per-vertex position
  – Per-pixel interpolated data, e.g., texture coordinates
  – Per-batch data, e.g., light position
• Data are tightly bound to the GPU
Shader flow control
• Very simple
• No recursion
• Fixed size loops for Shader Model 2.0 or earlier
• Simple if-then-else statements allowed in the latest APIs
• Texkill (asm), clip (HLSL), or discard (GLSL) allows you to abort a write to a pixel (a form of flow control)
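The effect of texkill/clip/discard can be modeled in a few lines. This is a Python sketch of the idea, not shader code: the function names and the 0.5 alpha threshold are hypothetical, and returning None stands in for killing the fragment.

```python
def shade_fragment(alpha, color):
    """Return the color to write, or None to model clip/discard."""
    if alpha < 0.5:          # hypothetical alpha-test threshold
        return None          # fragment is killed: nothing is written
    return color


def draw(framebuffer, fragments):
    """Apply the fragment function; skip the write for killed fragments."""
    for (x, y, alpha, color) in fragments:
        result = shade_fragment(alpha, color)
        if result is not None:
            framebuffer[(x, y)] = result
    return framebuffer


fb = draw({}, [(0, 0, 1.0, "red"), (1, 0, 0.2, "blue")])
print(fb)  # only the opaque fragment at (0, 0) was written
```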
Specialized instructions (GeForce 6)
• Dot products
• Exponential instructions:
  – EXP, LOG
  – LIT (Blinn specular lighting model calculation!)
• Reciprocal instructions:
  – RCP (reciprocal)
  – RSQ (reciprocal square root!)
• Trigonometric functions
  – SIN, COS
• Swizzling (swapping xyzw), write masking (only some xyzw get assigned), and negation are “free”
From GPU Gems 2, p. 484
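To make the operand modifiers concrete, here is a small Python model of swizzling, write masking, and negation on 4-component registers. The helper names are hypothetical; on the hardware these are modifiers on an instruction's operands, not separate function calls.

```python
# Component name -> index within a 4-component register.
IDX = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(v, pattern):
    """Rearrange or replicate components, e.g. 'zzzz' or 'xxyz'."""
    return [v[IDX[c]] for c in pattern]


def write_masked(dst, src, mask):
    """Copy only the components named in mask from src into dst."""
    out = list(dst)
    for c in mask:
        out[IDX[c]] = src[IDX[c]]
    return out


def negate(v):
    """Source negation modifier."""
    return [-c for c in v]


r0 = [1.0, 2.0, 3.0, 4.0]
print(swizzle(r0, "zzzz"))
print(write_masked([0.0, 0.0, 0.0, 0.0], r0, "xz"))
print(negate(r0))
```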
Vertex shader
• Transform to clip space
• Inputs:
  – Common inputs:
    • Vertex position (x, y, z, w)
    • Texture coordinate
    • Vertex colors
    • Constant inputs
  – Output to a pixel (fragment) shader
• Vertex shader is executed once per vertex, so usually less expensive than pixel shader
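The core job of a vertex shader, transforming a position to clip space, is a 4x4 matrix times a 4-component vector. The sketch below shows that step in plain Python. The matrix values are a made-up example (uniform scale by 2 plus a shift of +1 in x), and the function names are hypothetical.

```python
def mat4_mul_vec4(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def vertex_shader(position, model_view_proj):
    """Minimal 'vertex shader': object-space position -> clip-space position."""
    return mat4_mul_vec4(model_view_proj, position)


# Hypothetical combined model-view-projection matrix.
mvp = [
    [2, 0, 0, 1],
    [0, 2, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 1],
]
clip = vertex_shader([1.0, 2.0, 3.0, 1.0], mvp)
print(clip)
```

Each output row is a 4-component dot product, which is why dot-product instructions are central to vertex programs.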
Vertex shader data flow (3.0)
[Figure: vertex shader register model]
• Input: vertex stream into 16 vertex data registers (v0 to v15)
• Constant float registers C0 to Cn (at least 256)
• 16 constant integer registers
• 32 temporary registers (r0 to r31)
• aL loop register and a0 address register
• 12 output registers, including oPos (position), oTn (texture), oFog (fog), oD0 (diffuse color), oD1 (specular color), oPts (output point size)
• Each register is a 4-component vector register except aL
Vertex shader: logical view
[Figure: Vertex Processing Unit block diagram]
• Data path: Per-vertex Input Data → Vertex Processing Unit → Per-vertex Output Data (transformed and lit vertices)
• Inside the unit: Register File (r0 r1 r2 r3 ...), Swizzle / Mask Unit (.rgba, .xyzw, .zzzz, .xxyz, ...), Math/Logic Unit (cosine, log, sine, sub, add, ...), Sampler Unit
• Shader Resources (bound by application): Shader Start Addr, Bound Textures, Bound Samplers, Bound Constants
• Memory: Texture Memory, Shader Constants
• Legend: Input Data, Architectural State, Output Data, Control Logic, State Information, Memory
Some uses of vertex shaders
• Transform vertices to clip space
• Pass normal, texture coordinates to PS
• Transform vectors to other spaces (e.g., texture space)
• Calculate per-vertex lighting (e.g., Gouraud shading)
• Distort geometry (waves)
Adapted from Mart Slot’s presentation
Easy cross products and normalization
From Stanford CS448A: Real-Time Graphics Architectures See graphics.stanford.edu/courses/cs448a-01-fall
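The slide's figure did not survive in this text, so the following is a sketch of the usual tricks rather than the slide's exact code. A cross product can be written as two swizzled multiplies (a.yzx * b.zxy - a.zxy * b.yzx), and normalization as a dot product, a reciprocal square root, and a multiply (the DP3 / RSQ / MUL pattern). The function names are hypothetical.

```python
import math


def cross(a, b):
    """Cross product via swizzles: a.yzx * b.zxy - a.zxy * b.yzx."""
    yzx = (1, 2, 0)
    zxy = (2, 0, 1)
    return [a[yzx[i]] * b[zxy[i]] - a[zxy[i]] * b[yzx[i]] for i in range(3)]


def normalize(v):
    """Normalize a 3-vector the way a shader would: DP3, RSQ, MUL."""
    length_squared = sum(c * c for c in v)      # DP3 v, v
    inv_len = 1.0 / math.sqrt(length_squared)   # RSQ
    return [c * inv_len for c in v]             # MUL


print(cross([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
print(normalize([0.0, 3.0, 4.0]))
```

Because swizzles and negation cost nothing extra, the cross product needs only a multiply followed by a multiply-add.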
Blinn lighting in “one” instruction
From Stanford CS448A: Real-Time Graphics Architectures See graphics.stanford.edu/courses/cs448a-01-fall
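The slide image is not available here. As a hedged sketch, a LIT-style instruction takes N·L, N·H, and a specular exponent and returns ambient, diffuse, and specular coefficients in one step. The Python below models that behavior under those assumptions. The function name is hypothetical, and the exponent clamping that real hardware applies is omitted.

```python
def lit(n_dot_l, n_dot_h, shininess):
    """Model of a LIT-style result: (ambient, diffuse, specular, 1)."""
    diffuse = max(n_dot_l, 0.0)
    if n_dot_l > 0.0:
        # Specular only contributes when the surface faces the light.
        specular = max(n_dot_h, 0.0) ** shininess
    else:
        specular = 0.0
    return (1.0, diffuse, specular, 1.0)


print(lit(0.5, 0.5, 2.0))   # lit surface: diffuse and specular terms
print(lit(-0.1, 0.9, 8.0))  # facing away from the light: both are zero
```

The quotes around “one” on the slide fit this picture: the dot products that feed the instruction still have to be computed first.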
Simple graphics pipeline
From Stanford CS448A: Real-Time Graphics Architectures See graphics.stanford.edu/courses/cs448a-01-fall
Pixel (or fragment) shader (1)
• Determine each fragment’s color
  – Custom (sophisticated) pixel operations
  – Texture sampling
• Inputs
  – Interpolated output from vertex shader
  – Typically vertex position, vertex normals, texture coordinates, etc.
  – These registers could be reused for other purposes
• Output
  – Color (including alpha)
  – Depth value (optional)
Pixel (or fragment) shader (2)
• Executed once per pixel, hence typically executed many more times than a vertex shader
• It is advantageous to compute values on a per-vertex basis where possible to improve performance
Pixel shader data flow (3.0)
[Figure: pixel shader register model]
• Input: pixel stream into color (diff/spec) and texture coord. registers (v0 to v9)
• Constant registers C0 to Cn (16 INT, 224 Float)
• Temporary registers (r0 to r31)
• Sampler registers s0 to s15 (up to 16 texture surfaces can be read in a single pass)
• Output registers: oC0 (color), oDepth (depth)
Pixel shader: logical view
[Figure: Pixel Processing Unit block diagram]
• Data path: Interpolator → Per-pixel Input Data → Pixel Processing Unit → Per-pixel Output Data
• Outputs: Pixel Color → Color buffer, Depth Info → Depth Buffer, Stencil Info → Stencil Buffer
• Inside the unit: Register File (r0 r1 r2 r3 ...), Swizzle / Mask Unit (.rgba, .xyzw, .zzzz, .xxyz, ...), Math/Logic Unit (cosine, log, sine, sub, add, ...), Sampler Unit
• Shader Resources (bound by application): Shader Start Addr, Bound Textures, Bound Samplers, Bound Constants
• Memory: Texture Memory, Shader Constants
• Legend: Input Data, Architectural State, Output Data, Control Logic, State Information, Memory
Some uses of pixel shaders
• Texturing objects
• Per-pixel lighting (e.g., Phong shading)
• Normal mapping (each pixel has its own normal)
• Shadows (determine whether a pixel is shadowed or not)
• Environment mapping
Adapted from Mart Slot’s presentation
Old GeForce graphics pipeline
[Figure: Host → Vertex Control (with Vertex Cache) → VS/T&L → Triangle Setup → Raster → Shader (with Texture Cache) → ROP → FBI → Frame Buffer Memory]
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
Vertex cache
• Reusing vertices between primitives saves PCIe bus bandwidth and GPU computational resources
• A vertex cache attempts to exploit “commonality” between triangles to generate vertex reuse
• Unfortunately, many applications do not use efficient triangular ordering
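The effect of triangle ordering can be seen with a toy simulation. The sketch below assumes a small FIFO post-transform cache (3 entries, far smaller than real hardware) and counts how many times the vertex shader would have to run for the same four triangles submitted in two different orders. The cache policy, size, and index lists are illustrative assumptions.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size):
    """Count cache misses (vertex shader invocations) for an index stream."""
    cache = deque(maxlen=cache_size)  # FIFO: oldest entry drops off the left
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)
    return misses


# Two pairs of adjacent triangles: (0,1,2),(2,1,3) and (10,11,12),(12,11,13).
good_order = [0, 1, 2,  2, 1, 3,  10, 11, 12,  12, 11, 13]
bad_order = [0, 1, 2,  10, 11, 12,  2, 1, 3,  12, 11, 13]

print(count_vertex_shader_runs(good_order, cache_size=3))
print(count_vertex_shader_runs(bad_order, cache_size=3))
```

With neighbors submitted back to back, each of the 8 unique vertices is shaded once. Interleaving the two pairs evicts shared vertices before they are reused, so every index misses.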
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
Texture cache
• Stores temporally local texel values to reduce bandwidth requirements
• Due to the nature of texture filtering, high degrees of efficiency are possible (75% or better hit rates)
• Reduces texture (memory) bandwidth by a factor of four for bilinear filtering
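The factor of four follows from bilinear filtering reading four texels per sample: at a 75% hit rate, on average only one of those four comes from memory. The sketch below shows a plain bilinear lookup and that arithmetic. The function names, the tiny 2x2 texture, and the clamp-to-edge behavior are assumptions for illustration.

```python
import math


def bilinear_sample(texture, u, v):
    """Bilinear filter of a 2D list texture[row][col]; u, v in texel units."""
    height, width = len(texture), len(texture[0])
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)  # clamp to edge
    fx, fy = u - x0, v - y0
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def texels_from_memory_per_sample(hit_rate, texels_per_sample=4):
    """Average texels that miss the cache for one filtered sample."""
    return texels_per_sample * (1 - hit_rate)


tex = [[0.0, 10.0],
       [20.0, 30.0]]
print(bilinear_sample(tex, 0.5, 0.5))
print(texels_from_memory_per_sample(0.75))
```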
Built-in texture filtering (GeForce 6)
• Pixel texturing
  – Hardware supports 2D, 3D, and cube map
  – Non power-of-2 textures OK
  – Hardware handles addressing and interpolation
    • Bilinear, trilinear (3D or mipmap), anisotropic
• Vertex texturing
  – Vertex processors can access texture memory too
  – Only nearest-neighbor filtering supported in G60 hardware
ROP (Raster Operations)
• C-ROP performs frame buffer blending
  – Combinations of colors and transparency
  – Antialiasing
  – Read/Modify/Write the Color Buffer
• Z-ROP performs the Z operations
  – Determine the visible pixels
  – Discard the occluded pixels
  – Read/Modify/Write the Z-Buffer
• ROP on GeForce also performs
  – “Coalescing” of transactions
  – Z-Buffer compression/decompression
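The two read/modify/write loops can be sketched in software. The example below assumes one common configuration, a "less than" depth test followed by source-over alpha blending. Neither choice is stated on the slide, and the function and buffer names are hypothetical.

```python
def rop_write(color_buf, z_buf, x, y, src_rgba, src_z):
    """Model of Z-ROP then C-ROP for one fragment. Returns True if written."""
    # Z-ROP: read the stored depth, compare, and discard occluded fragments.
    if src_z >= z_buf[y][x]:
        return False
    z_buf[y][x] = src_z

    # C-ROP: read the destination color, blend, and write it back.
    r, g, b, a = src_rgba
    dr, dg, db = color_buf[y][x]
    color_buf[y][x] = (r * a + dr * (1 - a),
                       g * a + dg * (1 - a),
                       b * a + db * (1 - a))
    return True


color_buf = [[(0.0, 0.0, 0.0)]]   # 1x1 frame buffer, cleared to black
z_buf = [[1.0]]                   # cleared to the far plane

print(rop_write(color_buf, z_buf, 0, 0, (1.0, 0.0, 0.0, 0.5), 0.5))
print(rop_write(color_buf, z_buf, 0, 0, (0.0, 1.0, 0.0, 1.0), 0.8))
print(color_buf[0][0], z_buf[0][0])
```

Every surviving fragment costs at least one read and one write to each buffer, which is why ROP traffic weighs so heavily on frame buffer bandwidth.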
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
The frame buffer
• The primary determinant of graphics performance other than the GPU
• The most expensive component of a graphics product other than the GPU
• Memory bandwidth is the key
• Frame buffer size also determines
  – Local texture storage
  – Maximum resolutions
  – Antialiasing resolution limits
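A back-of-the-envelope estimate shows how resolution and antialiasing eat into frame buffer memory. The sketch below uses a deliberately simplified model: one color and one depth/stencil value per sample, 4 bytes each. The function name, byte sizes, and resolution are illustrative assumptions, and real drivers allocate additional buffers.

```python
def framebuffer_bytes(width, height, aa_samples=1,
                      bytes_color=4, bytes_depth_stencil=4):
    """Simplified footprint: color + depth/stencil stored for every sample."""
    bytes_per_sample = bytes_color + bytes_depth_stencil
    return width * height * aa_samples * bytes_per_sample


no_aa = framebuffer_bytes(1600, 1200)
aa_4x = framebuffer_bytes(1600, 1200, aa_samples=4)
print(no_aa / 2**20, "MiB without antialiasing")
print(aa_4x / 2**20, "MiB with 4x antialiasing")
```

Whatever these buffers consume is no longer available for local texture storage, which is the trade-off the list above describes.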
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
Frame Buffer Interface (FBI)
• Manages reading from and writing to frame buffer
• Perhaps the most performance-critical component of a GPU
• GeForce’s FBI is a crossbar
• Independent memory controllers for 4+ independent memory banks for more efficient access to frame buffer
[Figure: the GeForce pipeline diagram again, here also including a Surface Engine block]
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 5, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
GeForce 7800 GTX board details
• 256MB/256-bit DDR3, 600 MHz, 8 pieces of 8Mx32
• 16x PCI-Express
• SLI Connector
• DVI x 2
• sVideo TV Out
• Single slot cooling
Slide by David Kirk/NVIDIA and Wen-mei. W. Hwu, 2007, from UIUC ECE498 Lecture 6, Fall 2007; used with permission See http://courses.engr.illinois.edu/ece498/al
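The board figures can be tied back to the bandwidth theme with a short calculation. The sketch below assumes that 600 MHz is the memory clock and that DDR means two transfers per clock, and it reads "8Mx32" as 8 Mi words of 32 bits per chip. Those readings, and the function names, are assumptions rather than statements from the slide.

```python
def memory_bandwidth_gb_per_sec(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Peak bandwidth in GB/s (10^9 bytes/s) for a DDR-style memory bus."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


def board_memory(chips, words_per_chip, bits_per_word):
    """Return (total bus width in bits, total capacity in MiB)."""
    bus_width_bits = chips * bits_per_word
    capacity_mib = chips * words_per_chip * bits_per_word // 8 // 2**20
    return bus_width_bits, capacity_mib


print(board_memory(chips=8, words_per_chip=8 * 2**20, bits_per_word=32))
print(memory_bandwidth_gb_per_sec(256, 600))
```

Under these assumptions, eight 32-bit chips give the 256-bit bus and 256MB capacity listed on the slide, and the peak bandwidth works out to about 38.4 GB/s.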
NVIDIA 7800 GTX: G70 Architecture
[Figure: G70 block diagram with the Vertex Processors, Pixel Processors, and ROPs (Raster Op. Units) labeled]
From www.xbitlabs.com/articles/video/display/g70-indepth.html
NVIDIA 7800 GTX – Vertex processors
[Figure: G70 vertex processor block diagram]
• 7800 GTX has 8 of these
NVIDIA 7800 GTX – Pixel processors
[Figure: G70 pixel processor block diagram]
• 8 MADD (multiply/add) instructions in a single cycle
• 7800 GTX has 24 of these
From http://www.xbitlabs.com/articles/video/display/g70-indepth_3.html
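The two numbers on this slide multiply into a peak throughput figure. The sketch below takes the 24 processors and 8 MADDs per cycle from the slide. The 430 MHz core clock is an assumed value used only for illustration (it does not appear in these slides), and the function names are hypothetical.

```python
def peak_madds_per_clock(num_pixel_processors, madds_per_processor_per_cycle):
    """Total multiply/add operations issued per clock across all processors."""
    return num_pixel_processors * madds_per_processor_per_cycle


def peak_madds_per_sec(num_pixel_processors, madds_per_processor_per_cycle,
                       core_clock_mhz):
    """Peak MADD rate per second for an assumed core clock."""
    per_clock = peak_madds_per_clock(num_pixel_processors,
                                     madds_per_processor_per_cycle)
    return per_clock * core_clock_mhz * 1e6


print(peak_madds_per_clock(24, 8))
# Assumed core clock of 430 MHz; each MADD counts as two floating-point ops.
print(peak_madds_per_sec(24, 8, core_clock_mhz=430) * 2 / 1e9, "GFLOPS (peak)")
```

This is a peak figure that real shaders rarely reach, since texture fetches and data dependencies leave some units idle.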
Modern GPUs: unified design
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
GeForce 8 architecture
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
Why unify? (1)
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
Why unify? (2)
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
Dynamic load balancing – Company of Heroes
Slide by David Luebke from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
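The slide images for the unification argument are not present in this text. As a hedged sketch of the usual reasoning: with a fixed split of vertex and pixel units, a frame takes as long as the busier group needs, while a unified pool can spread all the work over every unit. The model below is idealized (no scheduling overhead), and the work amounts are made-up numbers. The unit counts reuse the 8 vertex and 24 pixel processors of the 7800 GTX.

```python
def frame_time_fixed(vertex_work, pixel_work, vertex_units, pixel_units):
    """Fixed-function split: the slower group determines the frame time."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def frame_time_unified(vertex_work, pixel_work, total_units):
    """Unified pool: any unit can take vertex or pixel work (idealized)."""
    return (vertex_work + pixel_work) / total_units


# Hypothetical vertex-heavy frame, in arbitrary work units.
fixed = frame_time_fixed(240, 80, vertex_units=8, pixel_units=24)
unified = frame_time_unified(240, 80, total_units=32)
print(fixed, unified)
```

In the fixed case the 24 pixel units sit mostly idle while the 8 vertex units are the bottleneck. A pixel-heavy frame would show the opposite imbalance.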
Motivation for shader languages
From The Cg Tutorial
• Programming powerful hardware with assembly code is hard
• Programmers need the benefits of a high-level language:
  – Easier programming
  – Easier code reuse
  – Easier debugging
  – Portability
Assembly
…
DP3 R0, c[11].xyzx, c[11].xyzx;
RSQ R0, R0.x;
MUL R0, R0.x, c[11].xyzx;
MOV R1, c[3];
MUL R1, R1.x, c[0].xyzx;
DP3 R2, R1.xyzx, R1.xyzx;
RSQ R2, R2.x;
MUL R1, R2.x, R1.xyzx;
ADD R2, R0.xyzx, R1.xyzx;
DP3 R3, R2.xyzx, R2.xyzx;
RSQ R3, R3.x;
MUL R2, R3.x, R2.xyzx;
DP3 R2, R1.xyzx, R2.xyzx;
MAX R2, c[3].z, R2.x;
MOV R2.z, c[3].y;
MOV R2.w, c[3].y;
LIT R2, R2;
...

float3 cSpecular = pow(max(0, dot(Nf, H)), phongExp).xxx;
float3 cPlastic = Cd * (cAmbient + cDiffuse) + Cs * cSpecular;
Shader languages
• HLSL/Cg most common
  – Both are more-or-less compatible
• Other alternatives:
  – GLSL (for OpenGL)
  – Assembly? (not anymore…)