Status – Week 276

Victor Moya

Hardware Pipeline

Command Processor.
Vertex Shader.
Rasterization.
Pixel Shader.
Fragment Operations and Tests.

[Figure: hardware pipeline. CPU -> Command Processor -> Vertex Shader -> Rasterization -> Pixel Shader -> Fragment Operations and Tests -> Framebuffer.]

Data passed between stages:
    Vertex data (16x4D): 1 position, 1 weight, 1 normal, 2 colors, 1 fog coord, 8 texture coords.
    Vertex output (15x4D): 1 homogeneous position, 4 colors, 1 fog coord, 1 point size, 8 texture coords.
    Fragment (10x4D): 2 colors, 8 texture coords, plus fragment coords.
    Fragment output (10x4D): 1 color, 1 depth coordinate, plus fragment coords.

Memory and state shown in the figure:
    Memory: vertex program and constants, fragment program and constants, texture memory.
    OGL state.
    Framebuffer: color buffer, Z buffer, stencil buffer.

Command Processor

Receives commands from the CPU (driver, OpenGL/Direct3D).
Fetches data from memory: vertex data (DMA).
Updates and stores OpenGL/Direct3D render state.

Vertex Shader

Transforms and lights vertex streams.
Vertex shader program (from GPU memory?).
Vertex shader constants (from GPU memory?).
Inputs: vertex data 16x4D.
Outputs: vertex data 14x4D.

Rasterization

Includes:
    Clipping
    Divide by w
    Affine transform
    Primitive assembly
    Culling
    Setup
    Fragment generation
Receives vertices and produces fragments.
Uses OpenGL/Direct3D render state.
Input: vertex (15x4D).
Output: fragments (10x4D).
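The "divide by w" and "affine transform" steps can be sketched as below. This is a minimal illustration, assuming OpenGL-style conventions (normalized device coordinates in [-1, 1], depth mapped to [0, 1]); the function name and the viewport tuple are hypothetical and do not come from the slides.

```python
def clip_to_window(clip, viewport):
    """Map a clip-space vertex (x, y, z, w) to window coordinates."""
    x, y, z, w = clip
    vx, vy, width, height = viewport  # hypothetical viewport rectangle
    # Divide by w: clip space -> normalized device coordinates.
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w
    # Affine (viewport) transform: NDC -> window coordinates.
    win_x = vx + (ndc_x + 1.0) * width / 2.0
    win_y = vy + (ndc_y + 1.0) * height / 2.0
    win_z = (ndc_z + 1.0) / 2.0
    return (win_x, win_y, win_z)


print(clip_to_window((1.0, -1.0, 0.0, 2.0), (0, 0, 640, 480)))
# prints (480.0, 120.0, 0.5)
```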

Pixel Shader

Shades fragments: calculates texture addresses, reads textures, performs color operations.
Pixel shader program and constants (from GPU memory?).
Texture read: TMU (texture sample, filter unit, texture cache, GPU memory).
Optional:
    Modify depth coordinate (1 Z output).
    Render to texture (up to 4 color outputs).
Input: fragment (12x4D).
Output: color (2x4D).

Fragment Operations and Tests

Includes (OpenGL):
    Fog.
    Color Sum.
    Ownership Test.
    Scissor Test.
    Alpha Test.
    Stencil Test.
    Depth Test.
    Blend.
    Logic Operation.
Accesses and updates the framebuffer (GPU memory).
Framebuffer: color, Z and stencil.
OpenGL/Direct3D render state defines the operations.
Input: color.
Output: updated framebuffer.
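A rough sketch of how a fragment could walk through a subset of these tests in the listed order. The dictionary layout, the fixed comparison functions (GREATER for alpha, LESS for depth) and the blend factors are assumptions chosen for illustration; in the real pipeline the render state selects them. Fog, color sum, ownership, stencil and logic op are left out.

```python
def fragment_ops(frag, fb, state):
    """Apply scissor, alpha, depth and blend; return True if fb changed."""
    x, y = frag["pos"]
    r, g, b, a = frag["color"]
    # Scissor test: discard fragments outside the scissor rectangle.
    sx, sy, sw, sh = state["scissor"]
    if not (sx <= x < sx + sw and sy <= y < sy + sh):
        return False
    # Alpha test, assuming the comparison function is GREATER.
    if not a > state["alpha_ref"]:
        return False
    # Depth test, assuming the comparison function is LESS.
    if not frag["z"] < fb["depth"][y][x]:
        return False
    fb["depth"][y][x] = frag["z"]
    # Blend, assuming SRC_ALPHA / ONE_MINUS_SRC_ALPHA factors.
    dr, dg, db = fb["color"][y][x]
    fb["color"][y][x] = (r * a + dr * (1 - a),
                         g * a + dg * (1 - a),
                         b * a + db * (1 - a))
    return True


fb = {"depth": [[1.0, 1.0], [1.0, 1.0]],
      "color": [[(0.0, 0.0, 0.0)] * 2 for _ in range(2)]}
state = {"scissor": (0, 0, 2, 2), "alpha_ref": 0.1}
near = {"pos": (1, 0), "color": (1.0, 0.5, 0.0, 0.5), "z": 0.25}
far = {"pos": (1, 0), "color": (0.0, 1.0, 0.0, 1.0), "z": 0.5}
print(fragment_ops(near, fb, state), fragment_ops(far, fb, state))
```

The second fragment fails the depth test because the first one already wrote a nearer Z value at the same position.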

[Figure: detailed pipeline block diagram.]

Blocks: Command Processor, Vertex Buffer, Vertex Setup, Vertex Shader, Vertex Cache, Primitive Assembly, Triangle Setup, Fragment Generator, Early Z Test, Pixel Shader Setup, Pixel Shader, Fog & Color Sum, Ownership & Scissor Tests, Alpha Test, Blend, Stencil Test, Depth Test, Logic Op, Memory.
Data flow labels: AGP, commands, vertex, triangles, fragment (color, position, Z, textures), fragment (color, position, Z), pixel.
Memory contents: vertex array, vertex program, vertex constants, primitive list, textures, pixel shader program, pixel shader constants, Z buffer, stencil buffer, color buffer.
OpenGL state attached to the fragment stages:
    Fog & color sum: GL_COLOR_SUM, GL_FOG, glFog*().
    Scissor test: GL_SCISSOR_TEST, glScissor().
    Alpha test: GL_ALPHA_TEST, glAlphaFunc().
    Stencil test: GL_STENCIL_TEST, glStencilFunc(), glStencilOp().
    Depth test: GL_DEPTH_TEST, glDepthFunc().
    Blend: GL_BLEND, glBlendEquation(), glBlendFuncSeparate(), glBlendFunc(), glBlendColor().
    Logic op: GL_COLOR_LOGIC_OP, glLogicOp().

Vertex Shader

The command processor sends a vertex stream to the vertex shaders.
A vertex buffer stores data read through DMA.
A vertex cache (~10 vertices) can be used to avoid executing the vertex shader twice for the same vertex.
The vertex stream is grouped into primitives and sent to the rasterizer.
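A minimal sketch of the vertex cache idea, assuming the cache is indexed by vertex index and replaced in FIFO order (the slides do not state the replacement policy). The class name, the `shader` callable and the `shader_runs` counter are hypothetical.

```python
from collections import OrderedDict


class VertexCache:
    """FIFO cache of shaded vertices, keyed by vertex index."""

    def __init__(self, shader, size=10):
        self.shader = shader          # hypothetical: raw vertex -> shaded vertex
        self.size = size
        self.entries = OrderedDict()
        self.shader_runs = 0

    def fetch(self, index, vertex_array):
        if index in self.entries:     # hit: reuse the shaded vertex
            return self.entries[index]
        shaded = self.shader(vertex_array[index])  # miss: run the shader
        self.shader_runs += 1
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)       # evict the oldest entry
        self.entries[index] = shaded
        return shaded


# Two triangles sharing an edge: 6 indices, only 4 distinct vertices.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cache = VertexCache(shader=lambda v: (v[0] * 2.0, v[1] * 2.0))
shaded = [cache.fetch(i, vertices) for i in (0, 1, 2, 2, 1, 3)]
print(cache.shader_runs)  # prints 4
```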

Hardware Pipeline

[Figure: vertex stage block diagram.]

Blocks: Command Processor, Fetch, Vertex Buffer, Vertex Shader, Vertex Cache, Index FIFO, Primitive FIFO, Primitive Buffer, Primitive Assembly, Memory.
Signals: AGP commands, vertex array address, index list, index, offset, address, hit/miss, vertex array, vertex data, vertex data (T&L), primitive (n vertices), primitive data (n vertices).

Vertex Shader Architecture

SIMD architecture. Registers are 128 bits wide, with four 32-bit fields.
Instruction set: typical arithmetic instructions (vector mul, add), some special instructions (ARL, DST), some complex mathematical instructions (EXP, COS), and support for branching, loops and procedures.
3 different sources of data:
    Input stream (~16 registers).
    Constants (~256 registers).
    Temporaries (~16 registers).
2 different destinations:
    Output stream (~15 registers).
    Temporaries (~16 registers).
Condition code registers (NV30) and boolean constants (R300, DX9) for conditional ‘execution’.

Vertex Shader Inputs and Outputs

[Figure: register files and datapath.]

Register files: vertex input (16 x 128 bits), constants (256 x 128 bits), temporary (16 x 128 bits), address (2 x 128 bits), vertex output (15 x 128 bits).
Datapath: MUX/ABS/NEGATE/SWIZZLE on the sources, then ALU/MASK on the result.
Instruction fields shown: OP, source registers (SREG), destination registers (DREG).

Vertex Shader Architecture

[Figure: vertex shader block diagram.]

Blocks: instructions, PC (with +1 increment, stack and branch), IR, vertex input, constants, address, temporaries, condition codes (CCs), source MUXes, SWIZZLE, NEG/ABS, ALU, MASK, vertex output.

Vertex Shader: NV20

Exposes programmability of a small part of the geometry pipeline.
Vertex load & store, format conversion, primitive assembly, clipping and triangle setup occur completely in parallel, in pipeline fashion.
4-wide fine-grained SIMD FP to provide the necessary performance, running multiple execution threads to maintain efficiency and provide a very simple programming model.

NV20: Introduction

Independent vertices.
IEEE single precision FP.
4-component vectors (x, y, z, w).
Input registers can have their components arbitrarily rearranged/replicated (swizzled).
Any operation generating a scalar must generate that scalar replicated across all components, and output writes have a component write mask.
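The swizzle, scalar replication and write mask rules can be illustrated with a small sketch. The helper names below are hypothetical, and registers are modeled as plain 4-element lists.

```python
COMPONENTS = "xyzw"


def read_source(reg, swizzle="xyzw", negate=False):
    """Read a 4-component register with a swizzle and optional negation."""
    value = [reg[COMPONENTS.index(c)] for c in swizzle]
    return [-v for v in value] if negate else value


def write_dest(reg, result, mask="xyzw"):
    """Write only the masked components; the others keep their old value."""
    return [result[i] if c in mask else reg[i]
            for i, c in enumerate(COMPONENTS)]


def dp3(a, b):
    """3-component dot product; the scalar is replicated to x, y, z, w."""
    s = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return [s, s, s, s]


r0 = [1.0, 2.0, 3.0, 4.0]
print(read_source(r0, "wzyx"))                 # prints [4.0, 3.0, 2.0, 1.0]
print(read_source(r0, "xxxx", negate=True))    # prints [-1.0, -1.0, -1.0, -1.0]
print(write_dest([0.0, 0.0, 0.0, 1.0], dp3(r0, [1.0, 1.0, 1.0, 9.0]), "xy"))
# prints [6.0, 6.0, 0.0, 1.0]
```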

NV20: Program Model

NV20: Input Attributes

Input attributes:
    16 quad-float vertex source attribute registers.
    Position, normal, two colors, up to 8 texture coordinate sets, skin weights, fog and point size.
    Default 0.0 for the second and third components, 1.0 for the fourth.
    Attributes are persistent.
    Only one vertex attribute may be read per program instruction.
Constant memory:
    96 quad floats.
    Can only be loaded before vertices are processed.
    Only one constant may be read per program instruction.
    The program may not write to constants.

NV20: Input Attributes

Integer address register:
    Loaded using ARL.
    Indexed constant reads, with out-of-range reads returning (0,0,0,0).
Read/write register file:
    12 quad floats.
    Three reads and one write per instruction.
    Initialized to (0,0,0,0) per vertex.
Any vector read may be sourced as multiple operands and individually swizzled/negated each time.
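A sketch of an indexed constant read through the address register. The 96-entry constant memory and the zero result for out-of-range reads come from the slides; the function name and the `c[A0.x + base]` addressing form are assumptions for illustration.

```python
NUM_CONSTANTS = 96  # NV20 constant memory size, from the slides


def read_constant(constants, base, a0=0):
    """Read c[A0.x + base]; out-of-range reads return (0, 0, 0, 0)."""
    index = a0 + base
    if 0 <= index < NUM_CONSTANTS:
        return constants[index]
    return (0.0, 0.0, 0.0, 0.0)


constants = [(float(i),) * 4 for i in range(NUM_CONSTANTS)]
print(read_constant(constants, 4, a0=2))    # prints (6.0, 6.0, 6.0, 6.0)
print(read_constant(constants, 90, a0=10))  # prints (0.0, 0.0, 0.0, 0.0)
```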

NV20: Output Attributes

Standard mapping for the fixed function pipeline at the homogeneous clip space point.
Position for clipping.
Vertex color output clamped to the range 0.0 to 1.0.
Fog distance, point size.
8 texture coordinates.
All instruction writes have an optional 4-component write mask.
Initialized to (0.0, 0.0, 0.0, 1.0).

NV20: Instruction Set

No branching.
Constant latency: issue any instruction each clock and execute all instructions with the same latency. All operands are immediately available, limiting the size of registers and memory banks.

NV20: Hardware Implementation

Two blocks: the vertex attribute buffer (VAB) and the floating point core.

NV20: VAB

The VAB is responsible for vertex attribute persistence.
16 input attributes.
When a write to an address is received, the defaults (0.0, 0.0, 0.0, 1.0) are applied and the valid data overwrites the components.
The VAB drains into a number of input buffers (IB) that feed the FP core in round-robin fashion.
Dirty bits are maintained in the VAB so that only changed attributes are updated when the same buffer is again the drain target.
The transfer of a vertex is triggered by a write to address 0 (vertex position).
To prevent bubbles during simultaneous loading and draining of the VAB, incoming writes may push out the contents of the target address, superseding a default drain sequence.
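A simplified model of the VAB behavior described above: defaults filled in on each write, attribute persistence, dirty bits, and a vertex launch triggered by a write to address 0. It assumes a single input buffer instead of several round-robin ones and ignores the bubble-avoidance path; the class and field names are hypothetical.

```python
DEFAULT = (0.0, 0.0, 0.0, 1.0)


class VertexAttributeBuffer:
    def __init__(self, num_attributes=16):
        self.attributes = [DEFAULT] * num_attributes
        self.dirty = [False] * num_attributes
        self.input_buffer = [DEFAULT] * num_attributes  # single IB (assumption)
        self.launched = []  # vertices handed to the floating point core

    def write(self, address, data):
        # Start from the defaults, then overwrite the supplied components.
        self.attributes[address] = tuple(data) + DEFAULT[len(data):]
        self.dirty[address] = True
        if address == 0:    # writing the position launches the vertex
            self.drain()

    def drain(self):
        # Copy only the attributes that changed since the last drain.
        for i, is_dirty in enumerate(self.dirty):
            if is_dirty:
                self.input_buffer[i] = self.attributes[i]
                self.dirty[i] = False
        self.launched.append(list(self.input_buffer))


vab = VertexAttributeBuffer()
vab.write(3, (1.0, 0.0, 0.0))   # color, alpha defaults to 1.0
vab.write(0, (1.0, 2.0))        # position: z = 0.0, w = 1.0 by default
vab.write(0, (5.0, 6.0, 7.0))   # second vertex reuses the persistent color
print(vab.launched[1][0], vab.launched[1][3])
```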

NV20: VAB

NV20: Floating Point Core

Processes the instruction set.
Multithreaded vector processor operating on quad-float data.
Vertex data is read from input buffers and transformed into output buffers (OB).
Same latency for the vector and special function units.
Multiple vertex threads are used to hide this latency.
SIMD VU: MOV, MUL, ADD, MAD, DP3, DP4, DST, MIN, MAX, SLT, SGE.
Special FU: RCP, RSQ, LOG, EXP, LIT.
VU is approximately IEEE (no denormalized numbers or exceptions, rounding always toward negative infinity).
1 instruction per clock, and all input/output options have no performance penalty.
All input vectors are available with no latency.
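A few of the vector unit instructions can be modeled in software as below. This is only a functional sketch of the arithmetic (Python floats, not the hardware's rounding), and the `transform` helper is a hypothetical example of the common pattern of four DP4 instructions, each writing one output component.

```python
def dp4(a, b):
    """4-component dot product, replicated across all components."""
    s = sum(x * y for x, y in zip(a, b))
    return (s, s, s, s)


def mad(a, b, c):
    """Component-wise multiply-add: a * b + c."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


def sge(a, b):
    """Set on greater-or-equal: 1.0 where a >= b, else 0.0."""
    return tuple(1.0 if x >= y else 0.0 for x, y in zip(a, b))


def transform(matrix_rows, position):
    """Four DP4s; row i produces output component i (masked write)."""
    return tuple(dp4(row, position)[i] for i, row in enumerate(matrix_rows))


translate = [(1.0, 0.0, 0.0, 2.0),
             (0.0, 1.0, 0.0, 3.0),
             (0.0, 0.0, 1.0, 4.0),
             (0.0, 0.0, 0.0, 1.0)]
print(transform(translate, (1.0, 1.0, 1.0, 1.0)))  # prints (3.0, 4.0, 5.0, 1.0)
```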

NV20: Floating Point Core

Vertex Shader: R300

4 vertex shader units.
1 scalar unit, 1 vector unit.
Registers:
    ALU registers:
        Constants: 256 read-only vectors.
        Temporary: 12 read/write vectors.
        Input: 16 read-only vectors.
        Output: 15 write-only vectors.
    Flow control registers:
        Integer constant: 16 read-only vectors.
        Address: 1 read/write vector.
        Loop counter: 1 scalar.
        Boolean constant: 16 read-only bits.

R300: Instructions

Shaders up to 256 instructions long.
Up to 64K executed instructions per vertex.
ALU instructions: ADD, DP3, DP4, EXP, EXPP, EXPE, FRAC, LOG, LOGP, MAD, MADDX2, MAX, MIN, MOV, MUL, POW, RCP, RSQ, SGE, SLT.
Control flow instructions: CALL, LOOP, ENDLOOP, JUMP, JNZ, LABEL, REPEAT, ENDREPEAT, RETURN.
Address instructions: ARL, ARR.
Graphics instructions: DST, LIT.
Instructions based on DX9 VS2.0.

NV30: Overview

Supports all VS1 instructions and features.
Beyond VS2?
Condition codes.
Branches and subroutines.
Modifiers: absolute.
User clip support (new output registers CLP0-CLP5).
New instructions.
More registers.

NV30: Overview

Up to 256 instructions per program.
Up to 64K executed instructions per vertex.
16 temporary registers.
2 vector address registers.
256 program parameters (constants).

NV30: Condition Codes

4-component register:
    LT: less than zero.
    EQ: equal to zero.
    GT: greater than zero.
    UN: unordered, for comparisons involving NaN.
Instructions optionally update the condition code state:
    “C” suffix: DP4C, MOVC.
    “CC” pseudo-register for updating condition codes.
Condition code used in:
    Branches and procedure call/return.
    Result masking.
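A sketch of how a per-component condition code might be derived from a result and then used for result masking. The classification into LT/EQ/GT/UN follows the slide; the rule table (for example, LE passing on LT or EQ) and the helper names are assumptions for illustration.

```python
import math


def condition_code(value):
    """Classify one result component as LT, EQ, GT or UN."""
    if math.isnan(value):
        return "UN"
    if value < 0.0:
        return "LT"
    if value > 0.0:
        return "GT"
    return "EQ"


# Which stored codes satisfy each test (assumed mapping).
RULES = {"LT": {"LT"}, "EQ": {"EQ"}, "GT": {"GT"},
         "LE": {"LT", "EQ"}, "GE": {"GT", "EQ"}}


def masked_write(dest, result, cc, rule):
    """Write result components only where the condition code passes."""
    return [r if c in RULES[rule] else d
            for d, r, c in zip(dest, result, cc)]


cc = [condition_code(v) for v in (-1.0, 0.0, 2.0, float("nan"))]
print(cc)  # prints ['LT', 'EQ', 'GT', 'UN']
print(masked_write([9.0] * 4, [1.0, 2.0, 3.0, 4.0], cc, "LE"))
# prints [1.0, 2.0, 9.0, 9.0]
```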

NV30: Modifiers

Source:
    Swizzle
    Negate
    Absolute
Target:
    Masking
    Conditional masking

NV30: Branching and Subroutines

BRA:
    Unconditional.
    Conditional: BRA label (LE.xyww)
    Computed (indirect): BRA [A1.z] (GT.x)
Call & return for subroutines:
    CAL & RET.
    Same options as with branches.
    Four levels of subroutine execution.
    No parameter stack.
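The CAL/RET behavior can be pictured as a return-address stack limited to four entries, with no parameters pushed. The class below is a hypothetical sketch; what the hardware does on a fifth nested call is not stated in the slides, so the exception is an assumption.

```python
MAX_CALL_DEPTH = 4  # "four levels of subroutine execution"


class ReturnStack:
    """Return addresses only: there is no parameter stack."""

    def __init__(self):
        self.addresses = []

    def cal(self, pc, target):
        """Call: save the return address and jump to target."""
        if len(self.addresses) >= MAX_CALL_DEPTH:
            # Overflow behavior is an assumption, not from the slides.
            raise RuntimeError("subroutine nesting deeper than 4 levels")
        self.addresses.append(pc + 1)
        return target

    def ret(self):
        """Return: resume at the most recently saved address."""
        return self.addresses.pop()


stack = ReturnStack()
pc = stack.cal(10, 100)   # call from instruction 10 to 100
pc = stack.cal(101, 200)  # nested call
print(stack.ret(), stack.ret())  # prints 102 11
```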

NV30: Clipping

New output registers: o[CLP0]..o[CLP5].
When GL_CLIP_PLANEn is enabled:
    Clip coordinate n is interpolated across the primitive.
    Only the portion of the primitive where the clip coordinate is greater than zero is rasterized.
Hardware performs a fast trivial reject if all clip coordinates of a primitive are negative.
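A sketch of the trivial reject test for one user clip plane. It assumes the vertex program computes each clip coordinate as a 4-component dot product between a plane equation and the vertex position, which is one common way to fill o[CLPn]; the slides do not prescribe how the value is computed. Function names are hypothetical.

```python
def clip_coordinate(plane, position):
    """Clip distance for one vertex: dot(plane, position) (assumed)."""
    return sum(p * v for p, v in zip(plane, position))


def trivially_rejected(plane, vertices):
    """True when every vertex of the primitive has a negative clip coordinate."""
    return all(clip_coordinate(plane, v) < 0.0 for v in vertices)


plane = (1.0, 0.0, 0.0, 0.0)  # keeps the half-space x > 0
behind = [(-1.0, 0.0, 0.0, 1.0), (-2.0, 1.0, 0.0, 1.0), (-1.0, 1.0, 0.0, 1.0)]
crossing = [(-1.0, 0.0, 0.0, 1.0), (2.0, 1.0, 0.0, 1.0), (-1.0, 1.0, 0.0, 1.0)]
print(trivially_rejected(plane, behind), trivially_rejected(plane, crossing))
# prints True False
```

A primitive that crosses the plane is not rejected; it is rasterized only where the interpolated clip coordinate is greater than zero.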

NV30: New Instructions

ARL: now supports loading the 4-component A0 and A1 integer registers.
ARR: like ARL, except that it rounds rather than truncates before storing the integer result in an address register.
BRA, CAL, RET: branching instructions.
COS, SIN: high-precision trigonometric functions.
FLR, FRC: floor and fraction of floating point values.
EX2, LG2: high-precision exponentiation and logarithm functions.
ARA: adds pairs of components of an address register, useful for looping and other operations.
SEQ, SFL, SGT, SLE, SNE, STR: six new “set on” instructions similar to SLT and SGE.
SSG: “set sign” operation that generates a vector holding -1.0 for negative operand components, 0 for zero components, and +1.0 for positive components.
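The scalar behavior of some of these instructions can be sketched per component as follows. ARL is modeled here with floor, which matches truncation for non-negative values; its behavior for negative inputs and the tie-breaking rule used by ARR are assumptions. Function names simply mirror the mnemonics.

```python
import math


def arl(x):
    """Address register load: drop the fractional part (floor assumed)."""
    return math.floor(x)


def arr(x):
    """Address register load with rounding to the nearest integer."""
    return round(x)  # Python rounds ties to even; hardware ties are an assumption


def ssg(x):
    """Set sign: -1.0, 0.0 or +1.0."""
    return -1.0 if x < 0.0 else (1.0 if x > 0.0 else 0.0)


def flr(x):
    """Floor of a floating point value."""
    return float(math.floor(x))


def frc(x):
    """Fraction: x - floor(x)."""
    return x - math.floor(x)


print(arl(2.7), arr(2.7))                       # prints 2 3
print([ssg(v) for v in (-3.5, 0.0, 0.25)])      # prints [-1.0, 0.0, 1.0]
print(flr(-1.5), frc(2.75))                     # prints -2.0 0.75
```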

NV30: Instruction List

Add & multiply instructions: ADD, DP3, DP4, DPH, MAD, MOV, SUB.
Math functions: ABS, COS, EX2, FLR, FRC, LG2, LOG, RCP, RSQ, SIN.
Set on instructions: SEQ, SFL, SGE, SGT, SLE, SLT, SNE, STR.
Branching instructions: BRA, CAL, RET.
Address register instructions: ARL, ARA.
Graphics-oriented instructions: DST, LIT, RCC, SSG.
Minimum/maximum instructions: MAX, MIN.

Others

Antialiasing:
    Anisotropic filtering (textures).
    Line antialiasing.
    Edge antialiasing.
    Full screen antialiasing (FSAA):
        Supersampling.
        Multisampling.
TBDR: Tile Based Deferred Rendering (STMicro PowerVR).
HOS (High Order Surfaces): N-Patches, Bezier, Displacement Mapping, TruForm, Tessellation.
