58
Practical Performance Analysis and Tuning Practical Performance Analysis and Tuning Ashu Rege and Clint Brewer NVIDIA Developer Technology Group

Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practical Performance Analysis and Tuning

Practical Performance Analysis and Tuning

Ashu Rege and Clint BrewerNVIDIA Developer Technology Group

Page 2: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Overview

Basic principles in practicePractice identifying the problems (and win prizes)Learn how to fix the problemsSummaryQuestion and AnswerPerformance Lore

Page 3: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Basic Principles

Pipelined architectureEach part needs the data from the previous part to do its job

Bottleneck identification and eliminationBalancing the pipeline

Page 4: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Pipelined Architecture (simplified view)

Framebuffer

Fragment Processor

Texture Storage + Filtering

RasterizerGeometry Processor

Geometry StorageCPU

Vertices Pixels

Page 5: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

The Terrible Bottleneck

Framebuffer

Fragment Processor

Texture Storage + Filtering

RasterizerGeometry Processor

Geometry StorageCPU

Limits the speed of the pipeline

Page 6: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification

Need to identify it quickly and correctlyGuessing what it is without testing can waste a lot of coding time

Two ways to identify a stage as the bottleneckModify the stage itselfRule out the other stages

Page 7: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification

Modify the stage itselfBy decreasing its workload

FPS FPS

If performance improves greatly, then you know this is the bottleneckCareful not to change the workload of other stages!

Page 8: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification

Rule out the other stagesBy giving all of them little or no work

FPS

If performance doesn’t change significantly, then you know this is the bottleneckCareful not to change the workload of this stage!

FPS

Page 9: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification

Most changes to a stage affect other stages as wellCan be hard to pick what test to doLet’s go over some tests

Page 10: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: CPU

CPU workloadWhat could the problem be?

Could be the gameComplex physics, AI, game logicMemory managementData structures

Could be incorrect usage of APICheck debug runtime output for errors and warnings

Could be the display driverToo many batches

Page 11: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: CPU

Reduce the CPU workloadTemporarily turn off

Game logicAIPhysicsAny other thing you know to be expensive on the CPU as long as it doesn’t change the rendering workload

Page 12: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: CPU

Rule out other stagesKill the DrawPrimitive calls

Set up everything as you normally would but when the time comes to render something, just do not make the DrawPrimitive* callProblem: you don’t know what the runtime or driver does when a draw primitive call is made

Use VTUNE or NVPerfHUD (more info later)These let you see right away if the CPU time is in your app or somewhere else

Page 13: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: Vertex

Vertex BoundWhat could the problem be?

Transferring the vertices and indices to the cardTurning the vertices and indices into trianglesVertex cache missesUsing an expensive vertex shader

Page 14: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: Vertex

Reduce vertex overheadUse simpler vertex shader

But still include all the data for the pixel shaderSend fewer Triangles??

Not good: can affect pixel shader, texture, and frame buffer

Decrease AGP Aperture??Maybe not good: can affect texture also, depends on where your textures areUse NVPerfHUD to see video memory

If it’s full then you might have textures in AGP

Page 15: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: Vertex

Rule out other stagesRender to a smaller backbuffer; this can rule out

TextureFrame bufferPixel shader

Test for a CPU bottleneckCan also render to smaller view port instead of smaller backbuffer. Still rules out

TextureFrame bufferPixel shader

Page 16: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: Raster

RasterizationRarely the bottleneck, spend your time testing other stages first

Page 17: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: Texture

Texture BoundWhat could the problem be?

Texture cache missesHuge TexturesBandwidthTexturing out of AGP

Page 18: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: Texture

Reduce Texture bandwidthUse tiny (2x2) textures

Good, but if you are using alpha test with texture alpha, then this could actually make things run slower due to increased fill. It is still a good easy test though

Use mipmaps if you aren’t alreadyTurn off anisotropic filtering if you have it on

Page 19: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: Texture

Rule out other stagesSince texture is so easy to test directly, we recommend relying on that

Page 20: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: Fragment

Fragment BoundWhat could the problem be?

Expensive pixel shaderRendering more fragments than necessary

High depth complexityPoor z-cull

Page 21: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: Fragment

Modify the stage itselfJust output a solid color

Good: does no work per fragmentBut also affects texture, so you must then rule out texture

Use simpler mathGood: does less work per fragmentBut make sure that the math still indexes into the textures the same way or you will change the texture stage as well

Page 22: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: FB

Frame Buffer bandwidthWhat could the problem be?

Touching the buffer more times than necessaryMultiple passes

Tons of alpha blendingUsing too big a buffer

Stencil when you don’t need itA lot of time dynamic reflection cube-maps can get away with r5g6b5 color instead of x8r8g8b8

Page 23: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification: FB

Modify the stage itselfUse a 16 bit depth buffer instead of a 24 bit oneUse a 16 bit color buffer instead of a 32 bit one

Page 24: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification

Now we have a bunch of practical ideas to find out if each stage is a bottleneck or not

Questions on Bottleneck Identification?

Page 25: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

A Tool: NVPerfHud

Free tool made to help identify bottlenecksBatchesGPU idleCPU waits for GPUDriver timeTotal timeSolid color pixel shaders2x2 texturesEtc...

Page 26: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice

Now lets look at some sample problems and see if we can find out where the problem isUse NVPerfHUD to help

Page 27: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Clean the Machine

Make sure that your machine is ready for analysisMake sure you have the right driversUse a release build of the game (optimizations on)Check debug output for warnings or errors but.....Use the release d3d runtime!!!No maximum validationNo driver overridden anisotropic filtering or anti-aliasingMake sure v-sync is off

Page 28: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 1

A seemingly simple scene runs horribly slowNarrow in on the bottleneck

Page 29: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 1

HRESULT hr = pd3dDevice->CreateVertexBuffer( 6* sizeof( PARTICLE_VERT ),0, //declares this as staticPARTICLE_VERT::FVF,D3DPOOL_DEFAULT,&m_pVB,NULL );

Dynamic vertex bufferBAD creation flags

Page 30: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 1

HRESULT hr = pd3dDevice->CreateVertexBuffer( 6* sizeof( PARTICLE_VERT ),D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY, PARTICLE_VERT::FVF,D3DPOOL_DEFAULT,&m_pVB,NULL );

Dynamic vertex bufferGOOD creation flags

Page 31: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 1

m_pVB->Lock(0, 0,(void**)&quadTris, 0);

Dynamic Vertex BufferBAD Lock flags

No flags at all!?That can’t be good....

Page 32: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 1

m_pVB->Lock(0, 0,(void**)&quadTris, D3DLOCK_NOSYSLOCK | D3DLOCK_DISCARD);

Dynamic Vertex BufferGOOD Lock flags

Use D3DLOCK_DISCARD the first time you lock a vertex buffer each frame

And again when that buffer is fullOtherwise just use NOSYSLOCK

Page 33: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 2

Another slow sceneWhat’s the problem here

Page 34: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 2

Texture bandwidth overkillUse mipmaps Use dxt1 if possible

Some cards can store compressed data in cacheUse smaller textures when they are fine

Does the grass blade really need a 1024x1024 texture?Maybe

Page 35: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 3

Another slow scene Who wants a prize?

Page 36: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 3

Expensive pixel shaderCan have huge performance effectOnly 3 verts, but maybe a million pixels

That’s only 1024x1024

Look at all the pixels!!

Page 37: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 3

36 cycles BAD

Page 38: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 3

11 cycles GOOD

Page 39: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 3

What changed?Moved math that was constant across the triangle into the vertex shader

Used ‘half’ instead of ‘float’

Got rid of normalize where it wasn’t necessarySee Normalization Heuristics http://developer.nvidia.com

Page 40: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 4

The last oneAudience: there are no more prizes, but we’ve locked the doors

Page 41: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 4

Too many batchesWas sending every quad as it’s own batchInstead, group quads into one big VB then send that with one call

Page 42: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Practice: Example 4

What if they use different textures?Use texture atlasesPut the two textures into a single texture and use a vertex and pixel shader to offset the texture coordinates

Page 43: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Balancing the Pipeline

Once satisfied with performanceBalance the pipeline by making more use of un-bottlenecked stagesCareful not to make too much use of them

FPS FPS

Page 44: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Summary

Pipeline architecture is ruled by bottlenecksDon’t waste time optimizing stages needlesslyIdentify bottlenecks with quick testsUse NVPerfHUD to analyze your pipelineUse Fxcomposer to help tune your shadersCheck your performance early and often

Don’t wait until the last week!

Page 45: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Questions?

Ashu Rege ([email protected])Clint Brewer ([email protected])

Page 46: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Other NVIDIA programming talks

GPU Gems Showcase Wed 5:30 – 6:00

Real-time Translucent Animated ObjectsFri 2:30 – 3:30

Page 47: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Performance Lore

We collected some advice from various developers and include it here so you don’t have to discover it the hard way

Page 48: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Performance Lore

Use low resolution (<256x256) 8-bit normalization cube-maps. Quality isn’t reduced since 50% of texels in high resolution cube-map are identical you are only getting nearest filteringUse oblique frustum clipping to clip geometry for reflection instead of a clip planeRe-use vertex buffers for streaming geometry. Don’t create and delete vertex buffers every frame if they could be re-usedUse multiples of 32 byte sized vertices for transfer over AGP

Page 49: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Performance Lore

Use Occlusion Query and render object’s bounding box this frame. Then use the result next frame to decide whether or not you need to draw the real objectFor ARB fragment programs use ARB_precision_hint_fastest Use 16-bit 565 cube-maps for dynamic reflections on cars. Don’t need 32-bit reflectionsBlend out small game objects and don’t render them when they are far away. cuts down on batches

Page 50: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Performance Lore

use half instead of float optimizations early in developmentIf rendering multiple passes, lay down Depth first then render your expensive pixel shaders. Cuts out depth complexity problems when shadingIf rendering multiple passes, on later additive passes you can set alpha to r + g + b, then use alpha test to cut on fillTerrain was rendered in 4 passes in ps1.1 due to texture limits. Render it in 1 pass in ps2.0

Page 51: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Performance Lore

Communicate with IHVs about your problem, sometimes it really isn’t your code and we can fix the bugs!Use texture pages / atlases to combine objects into a single batchUse anisotropic filtering only on textures that need it. Don’t just set it to default onDon’t lock static vertex buffers multiple times per frame. make them dynamicSorting the scene by render target gave a large perf boost

Page 52: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Performance Lore

When locating the bottleneck, divide and conquer. Lower resolution first, cuts the problem almost in half. rules out just about everything fill and pixel relatedUse float4 to pack multiple float2 texture coordinatesOptimize your index and vertex buffers to take advantage of the cacheMove per object calculations out of the vertex shader and onto the cpuMove per triangle calculations out of the pixel shader and into the vertex shader

Page 53: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Performance Lore

Use swizzles and masks in your vertex and pixel shaders: Value.xy = blahUse the API to clear the color and depth bufferDon’t change the direction of your z test mid frame, going from > ...to... >= ...to... = should be fine, but don’t go from > ...to... <Don’t use polygon offset if something else will workDon’t write depth in your pixel shader if you don’t have to

Page 54: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Performance Lore

Use Mipmaps. If they are too blurry for you, use anisotropic and/or trilinear filtering: that gives better quality than LOD biasRarely is there a single bottleneck in a game. If you find a bottleneck and fix it, and performance doesn’t improve more than a few fps. Don’t give up. You’ve helped yourself by making the real bottleneck apparent. Keep narrowing it down until you find it

Page 55: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

Bottleneck Identification

Run App Vary FB FPSvaries?

FBlimited

Vary texturesize/filtering

FPSvaries?

Vary resolution

FPSvaries?

Texturelimited

Vary fragment

instructions

FPSvaries?

Vary vertex

instructions

FPSvaries?

Transformlimited

Vary vertex size/AGP rate

FPSvaries?

Transferlimited

Fragmentlimited

Rasterlimited

CPUlimited

Yes

No

No

No

No

No

No

Yes

Yes

Yes

Yes

Yes

Page 56: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

references

http://developer.nvidia.com/object/GDC_2004_Presentations.htmlTomas Akenine-Moller and Eric Haines, Real-Time Rendering, second editionhttp://developer.nvidia.com/object/GDCE_2003_Presentations.html, Has other presentations on finding and locating the bottleneck

Page 57: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

developer.nvidia.comdeveloper.nvidia.comThe Source for GPU Programming

Latest documentationSDKsCutting-edge tools

Performance analysis toolsContent creation tools

Hundreds of effectsVideo presentations and tutorialsLibraries and utilitiesNews and newsletter archives

EverQuest® content courtesy Sony Online Entertainment Inc.

Page 58: Practical Performance Analysis and Tuningdownload.nvidia.com › developer › presentations › GDC... · Performance Lore Use Occlusion Query and render object’s bounding box

GPU Gems: Programming Techniques, GPU Gems: Programming Techniques, Tips, and Tricks for RealTips, and Tricks for Real--Time GraphicsTime Graphics

Practical real-time graphics techniques from experts at leading corporations and universities

Great value:Contributions from industry expertsFull color (300+ diagrams and screenshots)Hard cover816 pagesAvailable at GDC 2004

“GPU Gems is a cool toolbox of advanced graphics techniques. Novice programmers and graphics gurus alike will find the gems practical, intriguing, and useful.”Tim SweeneyLead programmer of Unreal at Epic Games

“This collection of articles is particularly impressive for its depth and breadth. The book includes product-oriented case studies, previously unpublished state-of-the-art research, comprehensive tutorials, and extensive code samples and demos throughout.”Eric HainesAuthor of Real-Time Rendering

For more, visit:For more, visit:http://http://developer.nvidia.com/GPUGemsdeveloper.nvidia.com/GPUGems