Meltdown 2001. Copyright © 2001 Intel Corporation. *Other names and brands may be claimed as the property of others.

Optimizing DirectX* Graphic Applications using Software Vertex Processing

Ronen Zohar/Kim Pallister

Intel Corporation



Agenda

• Do I need SW vertex processing?

• The PSGP

• Using SW vertex processing for maximum performance: memory, batching and render-states

• SW vertex processing and the new features in DirectX* 8.0



Do I need SW vertex processing?

• Your publisher wants:
  – Eye-candy graphics, using all the latest 3D features
  – Lower “minimum system requirements”
  – and many more

• Problem: older systems do not support all the eye-candy features

• Solution 1: Disable features for low-end systems

• Solution 2: Use SW vertex processing (at least for the features that you can) and keep some features



Inside DirectX* Graphics

[Diagram: the Application calls into the DirectX run-time, which contains the API front-end, SW vertex processing (PSGP), the HW vertex processing path, and communication to the driver (DDI); the run-time sits on top of the Driver.]



PSGP – Processor Specific Geometry Pipeline

• Part of DirectX graphics responsible for the SW vertex processing algorithms, optimized for the client’s processor

• The DirectX 8.0 PSGP is optimized for:
  – Intel® Pentium® III processor
  – Intel Pentium 4 processor



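The PSGP

[Diagram: PSGP data flow. The input VB feeds one of two paths: the vertex shader path (map stream to registers, then execute vertex shader code) or the fixed function path (transformation, lighting, tex gen). Either path formats its data to the output FVF and writes it to internal temporary VBs. These pass through the clipper, which also takes the IB, and then go to the driver.]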



PSGP Principles

• Use SIMD to process multiple vertices in each iteration
  – Vertical processing
  – Data is swizzled on the fly

• Prefetch input streams to hide memory latency

• Write output to temporary VBs based on the XYZRHW FVF code
  – In system memory if transformed vertices need to be read back
  – In driver memory if no read-back is required
  – More on this later…
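As a rough illustration of "vertical processing", the sketch below swizzles four xyz vertices from array-of-structures order into one four-wide array per component, which is the layout a 4-wide SIMD register would hold. It is not the PSGP's actual code; `Vec3`, `Soa4` and `swizzle4` are hypothetical names introduced only for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical types for illustration; not part of DirectX.
struct Vec3 { float x, y, z; };

// Structure-of-arrays block: one 4-wide lane array per component.
struct Soa4 {
    std::array<float, 4> x, y, z;
};

// Swizzle four consecutive vertices into the "vertical" layout,
// so that each component can be handled by one 4-wide operation.
Soa4 swizzle4(const Vec3* v) {
    assert(v != nullptr);  // caller must supply at least four vertices
    Soa4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out.x[i] = v[i].x;
        out.y[i] = v[i].y;
        out.z[i] = v[i].z;
    }
    return out;
}
```

After the swizzle, an operation on the x component of four vertices touches one contiguous group of four floats, which is what makes one-instruction-per-component processing possible.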



Input Stream Memory Allocation

• Create VBs for SW processed primitives in system memory (using the D3DUSAGE_SOFTWAREPROCESSING usage flag at creation)

• If the same VB is processed both in SW and HW
  – Try to avoid it
  – If you must, create multiple copies: one in system memory and one in driver memory

• If the primitive is never clipped, use the D3DUSAGE_DONOTCLIP usage flag



Primitive Batching

• Batch all the SW processed primitives together

• SW processes the entire VB range that you submit; if multiple primitives use the same VB, keep the submitted vertex range as tight as possible

• As with HW, bigger primitives are always better (the PSGP has a long setup)
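One way to keep the submitted range tight is to derive it from the indices a primitive actually uses. The sketch below is a minimal example of that idea, assuming 16-bit indices; `VertexRange` and `tightRange` are hypothetical helpers, not DirectX calls. The two values it returns correspond to what an application would typically pass as the minimum index and vertex count of an indexed draw call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type: the smallest vertex range covering a set of indices.
struct VertexRange {
    std::uint32_t minIndex;     // first vertex referenced
    std::uint32_t numVertices;  // vertices from minIndex through the last one referenced
};

// Scan the indices once and return the tightest enclosing vertex range.
VertexRange tightRange(const std::vector<std::uint16_t>& indices) {
    assert(!indices.empty());
    const auto mm = std::minmax_element(indices.begin(), indices.end());
    const std::uint32_t lo = *mm.first;
    const std::uint32_t hi = *mm.second;
    return VertexRange{lo, hi - lo + 1};
}
```

Since SW processing transforms every vertex in the submitted range, a primitive that references vertices 10 to 13 of a large VB should submit those 4 vertices rather than the whole buffer.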



Primitive Batching (Cont)

• The PSGP batches the processed vertices before sending them to HW (to reduce the HW’s VB changes)

• Primitives are batched as long as their output FVF is equal:
  – XYZ | NORMAL | TEX1 and XYZ | DIFFUSE | TEX1 have the same output FVF (XYZRHW | DIFFUSE | TEX1)
  – In SW mode, changing the VB FVF does not mean a slowdown (unlike HW)



Clipping Render-state

• When clipping is enabled, the PSGP:
  – Stores its output to a system memory buffer
    • It needs to read the vertices back in order to clip
    • The driver then needs to copy the buffer across the AGP
    • When clipping is disabled, the PSGP writes to a driver allocated buffer instead: no copy here!
  – Calculates clip flags (out-codes) for each vertex
    • More execution cycles per vertex
  – Clips

• Minimize the amount of clipping

• Use bounding boxes/spheres on your objects

• Don’t forget to take the guard-band into account



Clipping Render-state (Cont)

• Pseudo-code to minimize clipping
  – If (BB is outside screen)
    • Don’t render primitive
  – Else if (BB is inside guard-band)
    • Render with clipping off
  – Else
    • Render with clipping on

• A typical game scene should have <10% of primitives clipped
  – The biggest problem is front plane clipping
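The pseudo-code above can be sketched in C++ as follows. This is a simplified illustration under stated assumptions: the bounding box has already been projected to a screen-space rectangle, the object lies entirely in front of the front clipping plane, and `Rect`, `ClipMode` and `chooseClipMode` are hypothetical names rather than DirectX types.

```cpp
#include <cassert>

// Hypothetical screen-space rectangle (pixels); not a DirectX type.
struct Rect { float left, top, right, bottom; };

enum class ClipMode { Skip, DrawNoClip, DrawClip };

// True when a and b do not overlap at all.
bool disjoint(const Rect& a, const Rect& b) {
    return a.right < b.left || a.left > b.right ||
           a.bottom < b.top || a.top > b.bottom;
}

// True when inner lies entirely within outer.
bool contains(const Rect& outer, const Rect& inner) {
    return inner.left >= outer.left && inner.right <= outer.right &&
           inner.top >= outer.top && inner.bottom <= outer.bottom;
}

// Decide how to render an object from its projected bounding box.
ClipMode chooseClipMode(const Rect& bb, const Rect& screen, const Rect& guardBand) {
    assert(contains(guardBand, screen));  // the guard band is expected to enclose the screen
    if (disjoint(bb, screen)) return ClipMode::Skip;           // nothing visible
    if (contains(guardBand, bb)) return ClipMode::DrawNoClip;  // HW can handle the overhang
    return ClipMode::DrawClip;                                 // real clipping is required
}
```

An application would then toggle its clipping render-state according to the result before issuing the draw call. Objects that cross the front plane need a separate test, which this sketch leaves out.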



Performance Render-states

• Specular – very expensive

• LocalViewer – smaller performance impact than on HW, but still an extra cost

• NormalizeNormals – extra work for the PSGP, use only when needed

• Fog – written as “specular alpha”, can change PSGP’s output FVF



DirectX* 8.0 Graphics New Features

• Point sprites

• Tweening

• Indexed vertex blending / indexed palette skinning

• Vertex Shaders



Point Sprites

• PSGP writes in native FVF format

• If HW does not support point sprites
  – Each point is expanded to a quad, using the calculated point size
  – The quad list is submitted to the driver

• This is a very slow solution when there is no HW support for point sprites; try to avoid it
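To make the cost of the fallback concrete, here is a minimal sketch of expanding one screen-space point into a quad. The corner order and texture coordinates are assumptions made for the example, since the slides do not state what the run-time emits; `Vertex2D` and `expandPoint` are hypothetical names.

```cpp
#include <array>
#include <cassert>

// Hypothetical screen-space vertex with texture coordinates.
struct Vertex2D { float x, y, u, v; };

// Expand a point centred at (cx, cy) into the four corners of a
// size-by-size quad. Corner order here is an arbitrary choice.
std::array<Vertex2D, 4> expandPoint(float cx, float cy, float size) {
    assert(size >= 0.0f);
    const float h = size * 0.5f;
    std::array<Vertex2D, 4> quad = {{
        {cx - h, cy - h, 0.0f, 0.0f},  // top-left
        {cx + h, cy - h, 1.0f, 0.0f},  // top-right
        {cx - h, cy + h, 0.0f, 1.0f},  // bottom-left
        {cx + h, cy + h, 1.0f, 1.0f},  // bottom-right
    }};
    return quad;
}
```

Every input point becomes four output vertices, so the amount of data written and sent to the driver grows fourfold.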



Tweening

• Tween the position and normal before transformation (in SIMD)

• After tweening, the “standard” PSGP flow continues

• Costs very few cycles
  – But for tweening and transformation only, a vertex shader would run faster
  – Try comparing your exact scenario against a vertex shader
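For reference, tweening is a per-vertex linear blend of two key-frame streams. The sketch below shows the arithmetic for one position, with weights α and (1 − α) as in the backup tweening shader. `Vec3` and `tween` are hypothetical names; the PSGP performs the equivalent work on several vertices at once in SIMD.

```cpp
#include <cassert>

// Hypothetical 3-component vector for illustration.
struct Vec3 { float x, y, z; };

// Blend two key-frame values: alpha weights a, (1 - alpha) weights b.
Vec3 tween(const Vec3& a, const Vec3& b, float alpha) {
    assert(alpha >= 0.0f && alpha <= 1.0f);
    const float beta = 1.0f - alpha;
    return Vec3{a.x * alpha + b.x * beta,
                a.y * alpha + b.y * beta,
                a.z * alpha + b.z * beta};
}
```

The same function applies to normals, although a tweened normal is generally no longer unit length and may need re-normalizing.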



Indexed Skinning

• Transforms all vertices to matrix0 space
  – Using scalar code, with a lookup for the needed matrix

• Then continues with the normal PSGP flow

• DirectX* 7 style skinning is supported by some HW and may run faster, but requires multiple models and DrawPrimitive calls



Vertex shaders

• At vertex shader creation
  – The shader code is compiled to equivalent IA32 code
  – Using all possible assembly optimizations and instructions available on the client’s CPU to achieve the fastest code

• At vertex shader execution
  – The generated code is called

• SW vertex shaders have excellent performance



SW Vs. HW Vertex Shaders

• SW calculates more than one vertex in a single iteration
  – Based on the processor’s SIMD width

• Not every shader instruction takes 1 clock
  – But the CPU runs at a much higher frequency than today’s 3D graphics chips



SW Vs. HW Vertex Shaders (Cont)

• Simple compilation sample:
  – Mul r0.xyz,v0,c0
    • Movaps xmm0,[v0.x]
    • Mulps xmm0,[c0.x]
    • Movaps xmm1,[v0.y]
    • Mulps xmm1,[c0.y]
    • Movaps xmm2,[v0.z]
    • Mulps xmm2,[c0.z]
    • Movaps [r0.x],xmm0
    • Movaps [r0.y],xmm1
    • Movaps [r0.z],xmm2
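The compiled sequence above issues one load, multiply and store per written component, each covering four vertices. The sketch below emulates that behaviour in plain C++ to show how the write mask determines the amount of work. It is an illustration only: `Lane4`, `Reg4`, `mul4` and `mulMasked` are hypothetical names, and the real generated code uses SSE registers rather than arrays.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Lane4 = std::array<float, 4>;  // one component for four vertices
struct Reg4 { Lane4 x, y, z, w; };   // a shader register for four vertices

enum : unsigned { MaskX = 1, MaskY = 2, MaskZ = 4, MaskW = 8 };

// Stand-in for one 4-wide multiply (mulps).
Lane4 mul4(const Lane4& a, const Lane4& b) {
    Lane4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] * b[i];
    return r;
}

// Emulates "mul dst.<mask>, a, b" for four vertices and returns
// how many 4-wide multiplies were needed.
int mulMasked(Reg4& dst, const Reg4& a, const Reg4& b, unsigned mask) {
    assert(mask <= 0xFu);
    int ops = 0;
    if (mask & MaskX) { dst.x = mul4(a.x, b.x); ++ops; }
    if (mask & MaskY) { dst.y = mul4(a.y, b.y); ++ops; }
    if (mask & MaskZ) { dst.z = mul4(a.z, b.z); ++ops; }
    if (mask & MaskW) { dst.w = mul4(a.w, b.w); ++ops; }
    return ops;
}
```

With an `.xyz` mask the function performs three multiplies and leaves `w` untouched, matching the three Mulps instructions in the sample.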



SW Vs. HW Vertex Shaders (Cont)

• Data that you write is data that the CPU has to calculate
  – Write only the needed data (using the vertex shader write mask)
  – Use the swizzle modifiers, and don’t duplicate written data

• Vertex shader instructions are interleaved to achieve maximum performance
  – But keeping dependency chains tight helps the compiler with physical register assignment



Performance Tips for SW Vertex Shaders

• m?x? macros perform better than the equivalent hand-expanded instruction sequences

• Try to minimize the use of the address register
  – The SW vertex shader processes several vertices in parallel, so address values that differ between them are costly
  – Sort the VB by the values used in the address register
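The benefit of sorting can be estimated by counting how many groups of vertices processed together carry differing address values. The sketch below does that, assuming four vertices per iteration; `mixedGroups` is a hypothetical helper, and a real application would reorder the vertices (and indices) themselves rather than a bare list of address values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Count groups of `width` consecutive vertices whose address values
// are not all equal; each such group needs the slower re-arranging path.
std::size_t mixedGroups(const std::vector<int>& addr, std::size_t width = 4) {
    assert(width > 0);
    std::size_t mixed = 0;
    for (std::size_t i = 0; i < addr.size(); i += width) {
        const std::size_t end = std::min(i + width, addr.size());
        for (std::size_t j = i + 1; j < end; ++j) {
            if (addr[j] != addr[i]) { ++mixed; break; }
        }
    }
    return mixed;
}
```

For the address values 1 2 1 2 1 2 1 2, both groups of four are mixed; after sorting to 1 1 1 1 2 2 2 2, none are.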



Performance Tips for SW Vertex Shaders (Cont)

• lit, expp and logp are big cycle consumers
  – Use the lower-accuracy result (i.e. expp.x) when possible
  – Use either .x or .z (but not both)
  – exp and log cost more than expp and logp

• Don’t explicitly saturate color values
  – It is done automatically



Optimized Vertex Shader

Before:

dp4 oPos.x, v0, c2
dp4 oPos.y, v0, c3
dp4 oPos.z, v0, c4
dp4 oPos.w, v0, c5
add r1, c6, -v0
dp3 r2, r1, r1
rsq r2, r2
mov oT0, v2
mul r1, r1, r2
dp3 r3, v1, r1
max r3, r3, c8
add r3, r3, c7
min oD0, r3, c9

After:

m4x4 oPos, v0, c[2]
add r1.xyz, c6, -v0
dp3 r2.w, r1, r1
rsq r2.w, r2.w
mul r1.xyz, r1, r2.w
dp3 r2.w, v1, r1
max r2.w, r2.w, c8
add oD0.xyz, r2.w, c7
mov oT0, v2



Questions??

[email protected]

[email protected]

Intel, Pentium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2001 Intel Corp.



Backup



Tweening + transformation vertex shader

• Mul r0.xyz,v0,c0.x // c0.x – α

• Mad r0.xyz,v1,c0.y,r0 // c0.y – (1- α)

• M4x4 oPos,r0,c1

• Mov oD[0].xyz,v2

• Mov oT[0].xy,v3



Not Equal Address Value

[Diagram: the const register file (x4) holds each constant replicated across four lanes (1.0f, 2.0f and 3.0f). The address register (x4) holds 1 2 1 2, one value per vertex. Because the values differ, a combination register 1.0f 2.0f 1.0f 2.0f has to be re-arranged as the instruction argument for the SIMD instruction to use (costs ~20 cycles).]



Equal Address Value

[Diagram: the const register file (x4) holds each constant replicated across four lanes (1.0f, 2.0f and 3.0f). The address register (x4) holds 2 2 2 2. Because all four values are equal, the instruction argument accesses the x4 constant register file directly, with no penalty for “re-arranging” vertices. The address accessing mode is selected when the address value is stored.]
