Technology Behind AMD’s “Leo Demo” Jay McKee MTS Engineer, AMD


Why Forward Rendering?

● Complex materials
● Multiple light types
● Supports hardware anti-aliasing
● Efficient memory usage
● Supports transparency
● BUT, previously could not support a large number of lights

Forward+ Rendering

● Modified forward renderer. Add compute shader for light culling. Modify main light loop.

● Lighting and shading done in the same place, all information is preserved.

Forward+ Rendering (continued)

● No limits on parameters for lights and materials
    ● Omni
    ● Spot
    ● Cinematic (arbitrary falloffs, barndoor)
    ● BRDF per material instance

● Simple design, concentrate on rendering, not engine maintenance.

Important DX11 features

● Compute Shaders
● UAV support.

Compute Shaders

● In Leo demo we use two compute shaders:
    ● One for culling lights.
    ● Another for spawning Virtual Point Lights (VPLs) for indirect lighting.

● Culling 3,072 lights takes 1.7 ms on a high-end GPU.

UAVs

● Array(s) of scene light information.
● Array of u32 light indices for storing start/end lights per-tile.
● Array of material instance data

Algorithm summary

● Depth Pre-Pass
● Light Culling
    ● Screen divided into tiles. Launch compute shader per tile.
    ● Light info such as position, radius, direction, length passed to light culling compute shader.
    ● Light culling shader projects light bounds to screen-space tiles. Uses scene depth from z pre-pass for z testing against light volumes.
    ● Outputs to UAV describing per-tile light list start/end along with a large UAV of u32 array of light indices.
    ● Output UAVs are passed to main light shaders for looping through lights per-pixel.
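The relationship between a light's projected bounds and the screen tiles can be pictured on the CPU. The sketch below assumes the light's screen-space bounding rectangle (in pixels) has already been computed and only shows which tiles that rectangle touches. The function name `tiles_covered`, the 8-pixel tile size and the 1280x720 resolution are illustrative assumptions, not values taken from the demo. Note that the compute shader works the other way around (each tile tests every light), so this is a geometric illustration rather than the shader's control flow.

```python
def tiles_covered(rect, width, height, tile_res):
    """Return (tx0, ty0, tx1, ty1), the inclusive range of tiles touched by a
    screen-space rectangle (x0, y0, x1, y1) in pixels, or None if the
    rectangle lies entirely off screen."""
    x0, y0, x1, y1 = rect
    if x1 < 0 or y1 < 0 or x0 >= width or y0 >= height:
        return None
    # Clamp to the screen, then convert pixel coordinates to tile coordinates.
    x0 = max(0, int(x0))
    y0 = max(0, int(y0))
    x1 = min(width - 1, int(x1))
    y1 = min(height - 1, int(y1))
    return (x0 // tile_res, y0 // tile_res, x1 // tile_res, y1 // tile_res)


# A light whose bounds cover pixels (100, 50) to (130, 70) on a 1280x720 screen.
print(tiles_covered((100, 50, 130, 70), 1280, 720, 8))  # (12, 6, 16, 8)
```

Only the tiles in that range would need to keep the light in their per-tile list; every other tile skips it entirely.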

Algorithm summary continued

● Render scene materials
    ● Base light accumulation function
        ● Use screen x, y location to determine tileID
        ● From tileID, get light start and end indices
        ● From start index to end index, loop
            ● Entry is index into light array.
            ● Accumulate light hitting pixel
        ● Returns total direct and indirect light hitting pixel.
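The accumulation steps above can be sketched as a small CPU-side function. This is a minimal illustration under stated assumptions: each light is reduced to a single scalar intensity (the real shader evaluates per-light shading), and the names `accumulate_light`, `light_start`, `light_end`, `light_indices` and `light_intensity` are hypothetical stand-ins for the buffers described in the slides.

```python
def accumulate_light(px, py, width, tile_res,
                     light_start, light_end, light_indices, light_intensity):
    """Sum the contribution of every light listed for the tile containing
    pixel (px, py)."""
    # Screen x, y -> tile ID (tiles are laid out row by row).
    tiles_x = (width + tile_res - 1) // tile_res
    tile_id = (px // tile_res) + (py // tile_res) * tiles_x

    total = 0.0
    # Walk the tile's slice of the shared index buffer.
    for k in range(light_start[tile_id], light_end[tile_id]):
        light_idx = light_indices[k]          # entry is an index into the light array
        total += light_intensity[light_idx]   # stand-in for real per-light shading
    return total


# Two 8x8 tiles side by side; tile 0 sees lights 0 and 2, tile 1 sees light 1.
light_start = [0, 2]
light_end = [2, 3]
light_indices = [0, 2, 1]
light_intensity = [0.5, 0.25, 1.0]
print(accumulate_light(3, 4, 16, 8, light_start, light_end,
                       light_indices, light_intensity))   # 1.5
print(accumulate_light(12, 1, 16, 8, light_start, light_end,
                       light_indices, light_intensity))   # 0.25
```

The cost per pixel depends only on the number of lights in its tile, not on the total number of lights in the scene.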

Algorithm summary continued

● Material shader
    ● Decides what to do with total incoming light
    ● Passed into material’s BRDF for example
    ● Uses light accumulation building blocks

● Env. lighting, base light accumulation, BRDF, etc. are put together for final pixel color.

Light Culling Shader Details (1/3)

// 1. prepare
float4 frustum[4];
float minZ, maxZ;
{
    ConstructFrustum( frustum );

    minZ = thread_REDUCE(MIN, depth );
    maxZ = thread_REDUCE(MAX, depth );

    ldsMinZ = SIMD_REDUCE(MIN, minZ );
    ldsMaxZ = SIMD_REDUCE(MAX, maxZ );

    minZ = ldsMinZ;
    maxZ = ldsMaxZ;
}

Light Culling Shader Details (2/3)

__local u32 ldsNLights = 0;
__local u32 ldsLightBuffer[MAX];

// 2. overlap check, accumulate in LDS
for(int i=threadIdx; i<nLights; i+=WG_SIZE)
{
    Light light = fetchAndTransform( lightBuffer[ i ] );

    if( overlaps( light, frustum ) && overlaps ( light, minZ, maxZ ) )
    {
        AtomicAppend( ldsLightBuffer, i );
    }
}
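The slides do not show the bodies of the two `overlaps` tests. One common way to write them for a spherical light is sketched below, under these assumptions: each frustum side plane is given as `(nx, ny, nz, d)` with a unit-length normal pointing into the tile frustum, and depth is a view-space distance that grows away from the camera. The function names are hypothetical, and the plane test is conservative (it can accept a sphere that sits just outside a frustum corner).

```python
def sphere_overlaps_planes(center, radius, planes):
    """True unless the sphere is completely behind at least one plane.
    Each plane is (nx, ny, nz, d) with a unit normal pointing inward."""
    cx, cy, cz = center
    for nx, ny, nz, d in planes:
        signed_distance = nx * cx + ny * cy + nz * cz + d
        if signed_distance < -radius:
            return False
    return True


def sphere_overlaps_depth(center_z, radius, min_z, max_z):
    """True if the sphere's depth extent intersects [min_z, max_z]."""
    return center_z + radius >= min_z and center_z - radius <= max_z


# Four side planes bounding the region -1 <= x <= 1, -1 <= y <= 1.
planes = [(1, 0, 0, 1), (-1, 0, 0, 1), (0, 1, 0, 1), (0, -1, 0, 1)]
print(sphere_overlaps_planes((0.0, 0.0, 5.0), 0.5, planes))  # True
print(sphere_overlaps_planes((3.0, 0.0, 5.0), 0.5, planes))  # False
print(sphere_overlaps_depth(5.0, 0.5, 2.0, 4.0))             # False
```

The depth test is what the z pre-pass buys: a light floating in empty space in front of or behind the visible geometry of a tile is rejected even though it overlaps the tile on screen.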

Light Culling Shader Details (3/3)

// 3. export to global
__local u32 ldsOffset;

if( threadIdx == 0 )
{
    // reserve ldsNLights entries in the global light index buffer
    ldsOffset = AtomAdd( ldsNLights );

    globalLightStart[tileIdx] = ldsOffset;
    globalLightEnd[tileIdx] = ldsOffset + ldsNLights;
}

// all threads copy the tile's light indices from LDS to the reserved range
for(int i=threadIdx; i< ldsNLights; i+=WG_SIZE)
{
    int dstIdx = ldsOffset + i;
    globalLightIndexBuffer[dstIdx] = ldsLightBuffer[i];
}
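The layout produced by the export step can be reproduced on the CPU. The sketch below is an assumption-laden stand-in: it processes tiles one after another, so a running offset replaces the atomic add, and `export_tile_lists` is a hypothetical name. On the GPU the tiles reserve their ranges in whatever order the work groups happen to run, so the order of ranges in the index buffer is not fixed, but each tile's start/end pair still brackets its own lights.

```python
def export_tile_lists(per_tile_lights):
    """Pack per-tile light lists into one flat index buffer plus per-tile
    start/end offsets."""
    light_start, light_end, index_buffer = [], [], []
    for lights in per_tile_lights:
        offset = len(index_buffer)        # sequential stand-in for the atomic add
        light_start.append(offset)
        light_end.append(offset + len(lights))
        index_buffer.extend(lights)       # copy the tile's indices to its range
    return light_start, light_end, index_buffer


start, end, indices = export_tile_lists([[0, 2], [1], [], [2, 3, 4]])
print(start)    # [0, 2, 3, 3]
print(end)      # [2, 3, 3, 6]
print(indices)  # [0, 2, 1, 2, 3, 4]
```

An empty tile gets `start == end`, so the light loop in the material shader runs zero iterations for it.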

Light Accumulation Pseudo-code

// BaseLighting.inc
// THIS INC FILE IS ALL THE COMMON LIGHTING CODE

StructuredBuffer<float4> LightParams : register(u0);
StructuredBuffer<uint> LowerBoundLights : register(u1);
StructuredBuffer<uint> UpperBoundLights : register(u2);
StructuredBuffer<int2> LightIndexBuffer : register(u3);

uint GetTileIndex(float2 screenPos)
{
    float tileRes = (float)m_tileRes;
    uint numCellsX = (m_width + m_tileRes - 1)/m_tileRes;
    uint tileIdx = floor(screenPos.x/tileRes)+floor(screenPos.y/tileRes)*numCellsX;
    return tileIdx;
}
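As a worked example of the tile index arithmetic in `GetTileIndex`, the same computation is transcribed to Python below. The 1280-pixel width and 8-pixel tile size are illustrative assumptions; the slides do not state the demo's values for `m_width` or `m_tileRes`.

```python
def get_tile_index(screen_x, screen_y, width, tile_res):
    """Python transcription of GetTileIndex: tiles are numbered row by row."""
    # Round up so a partial tile at the right edge still gets a column.
    num_cells_x = (width + tile_res - 1) // tile_res
    return int(screen_x // tile_res) + int(screen_y // tile_res) * num_cells_x


# 1280 pixels wide with 8-pixel tiles gives 160 tiles per row.
print(get_tile_index(100.5, 20.5, 1280, 8))   # 12 + 2 * 160 = 332
print(get_tile_index(1279.5, 0.5, 1280, 8))   # 159, last tile of the first row
```

Because `numCellsX` rounds up, a width that is not a multiple of the tile size (1285, say) yields 161 columns rather than 160, and the rightmost pixels map to a partial tile.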

Light Accumulation (2):

StartHLSL BaseLightLoopBegin // THIS IS A MACRO, INCLUDED IN MATERIAL SHADERS

    uint tileIdx = GetTileIndex( pixelScreenPos );
    uint startIdx = LowerBoundLights[tileIdx];
    uint endIdx = UpperBoundLights[tileIdx];

    [loop]
    for ( uint lightListIdx = startIdx; lightListIdx < endIdx; lightListIdx++ )
    {
        int lightIdx = LightIndexBuffer[lightListIdx];

        // Set common light parameters
        float ndotl = max(0, dot(normal, lightVec));

        float3 directLight = 0;
        float3 indirectLight = 0;

Light Accumulation (3):

        if( lightIdx >= numDirectLightsThisFrame )
        {
            CalculateIndirectLight(lightIdx , indirectLight);
        }
        else
        {
            if( IsConeLight( lightIdx ) )
            {   // <<== Can add more light types here
                CalculateDirectSpotlight(lightIdx , directLight);
            }
            else
            {
                CalculateDirectSpherelight(lightIdx , directLight);
            }
        }

        float3 incomingLight = (directLight + indirectLight)*ndotl;
        float shadowTerm = CalcShadow();

EndHLSL

StartHLSL BaseLightLoopEnd
    }
EndHLSL

Material Shader Template:

#include "BaseLighting.inc"

float4 PS ( PSInput i ) : SV_TARGET
{
    float3 totalDiffuse = 0;
    float3 totalSpec = GetEnvLighting();

    $include BaseLightLoopBegin

        // unique material code goes here!! Light accumulation on the pixel for a given light
        // we have total incoming light and direct/indirect light components as well as material params and shadow term
        // use these building blocks to integrate lighting terms

        totalDiffuse += GetDiffuse(incomingLight);
        totalSpec += CalcPhong(incomingLight);

    $include BaseLightLoopEnd

    float3 finalColor = totalDiffuse + totalSpec;
    return float4( finalColor, 1 );
}

Debug Mode Demo

Benchmark: 3k dynamic lights, compute-based Deferred vs. Forward+

[Bar chart: time in ms (axis 0 to 20) for Forward+(L), Forward+(H), Deferred(L) and Deferred(H), each bar split into prepass, light processing and final shading.]

Takahiro Harada, Jay McKee, Jason C. Yang, Forward+: Bringing Deferred Lighting to the Next Level, Eurographics Short Paper (2012)

Depth Pre-Pass Critical

● Pixel overdraw cripples this technique so depth pre-pass is required.

● Depth pre-pass is a good opportunity to use MRT to generate other full-screen data needed for post-fx and other render fx (optional).

Other important points

● Xbox 360 has good bandwidth, so given the limitations on forward rendering, deferred makes a lot of sense there.

● However, ALU computation is growing at a faster rate than bandwidth. It is becoming more and more feasible to just do the calculations rather than read/write so much data.

● Dynamic branching penalties are not nearly as bad as before. As an optimization, the compute shader can sort by light type, for example, to minimize penalties.

● All that "light management" CPU side code to decide which lights hit each object for setting constant registers can be ditched!
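The sort-by-light-type optimization mentioned above can be illustrated on the CPU. The sketch below is only an assumption about what such a sort might look like: it reorders one tile's light list so that lights of the same type are adjacent, then counts how often the type changes between consecutive loop iterations as a rough proxy for branch switches. The names and the integer type codes (0 for sphere, 1 for spot) are hypothetical.

```python
def sort_tile_lights_by_type(light_indices, light_types):
    """Reorder a tile's light list so lights of the same type are adjacent.
    sorted() is stable, so lights of equal type keep their relative order."""
    return sorted(light_indices, key=lambda i: light_types[i])


def count_type_switches(light_indices, light_types):
    """Number of times the light type changes between consecutive entries."""
    return sum(1 for a, b in zip(light_indices, light_indices[1:])
               if light_types[a] != light_types[b])


light_types = [1, 0, 1, 0]   # lights 0 and 2 are spots, 1 and 3 are spheres
tile_list = [0, 1, 2, 3]
sorted_list = sort_tile_lights_by_type(tile_list, light_types)
print(sorted_list)                                    # [1, 3, 0, 2]
print(count_type_switches(tile_list, light_types))    # 3
print(count_type_switches(sorted_list, light_types))  # 1
```

With the sorted list, neighbouring pixels in a tile take the same branch for long runs of iterations, which is the situation where dynamic branching is cheapest.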

Summary

● Modified forward renderer that handles scenes with 1000s of lights.

● Hardware anti-aliasing (MSAA) “automatic”
● Bandwidth friendly.
● Makes the most of the GPU's ALU power (which is growing faster than bandwidth)

Thanks!

Contact: [email protected]

Leo Demo website: http://developer.amd.com/samples/demos/pages/AMDRadeonHD7900SeriesGraphicsReal-TimeDemos.aspx

Eurographics 2012: 'Forward+: Bringing Deferred Lighting to the Next Level'



