Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Practical Performance Analysis and Tuning
Practical Performance Analysis and Tuning
Ashu Rege and Clint BrewerNVIDIA Developer Technology Group
Overview
Basic principles in practicePractice identifying the problems (and win prizes)Learn how to fix the problemsSummaryQuestion and AnswerPerformance Lore
Basic Principles
Pipelined architectureEach part needs the data from the previous part to do its job
Bottleneck identification and eliminationBalancing the pipeline
Pipelined Architecture (simplified view)
Framebuffer
Fragment Processor
Texture Storage + Filtering
RasterizerGeometry Processor
Geometry StorageCPU
Vertices Pixels
The Terrible Bottleneck
Framebuffer
Fragment Processor
Texture Storage + Filtering
RasterizerGeometry Processor
Geometry StorageCPU
Limits the speed of the pipeline
Bottleneck Identification
Need to identify it quickly and correctlyGuessing what it is without testing can waste a lot of coding time
Two ways to identify a stage as the bottleneckModify the stage itselfRule out the other stages
Bottleneck Identification
Modify the stage itselfBy decreasing its workload
FPS FPS
If performance improves greatly, then you know this is the bottleneckCareful not to change the workload of other stages!
Bottleneck Identification
Rule out the other stagesBy giving all of them little or no work
FPS
If performance doesn’t change significantly, then you know this is the bottleneckCareful not to change the workload of this stage!
FPS
Bottleneck Identification
Most changes to a stage affect other stages as wellCan be hard to pick what test to doLet’s go over some tests
Bottleneck Identification: CPU
CPU workloadWhat could the problem be?
Could be the gameComplex physics, AI, game logicMemory managementData structures
Could be incorrect usage of APICheck debug runtime output for errors and warnings
Could be the display driverToo many batches
Bottleneck Identification: CPU
Reduce the CPU workloadTemporarily turn off
Game logicAIPhysicsAny other thing you know to be expensive on the CPU as long as it doesn’t change the rendering workload
Bottleneck Identification: CPU
Rule out other stagesKill the DrawPrimitive calls
Set up everything as you normally would but when the time comes to render something, just do not make the DrawPrimitive* callProblem: you don’t know what the runtime or driver does when a draw primitive call is made
Use VTUNE or NVPerfHUD (more info later)These let you see right away if the CPU time is in your app or somewhere else
Bottleneck Identification: Vertex
Vertex BoundWhat could the problem be?
Transferring the vertices and indices to the cardTurning the vertices and indices into trianglesVertex cache missesUsing an expensive vertex shader
Bottleneck Identification: Vertex
Reduce vertex overheadUse simpler vertex shader
But still include all the data for the pixel shaderSend fewer Triangles??
Not good: can affect pixel shader, texture, and frame buffer
Decrease AGP Aperture??Maybe not good: can affect texture also, depends on where your textures areUse NVPerfHUD to see video memory
If it’s full then you might have textures in AGP
Bottleneck Identification: Vertex
Rule out other stagesRender to a smaller backbuffer; this can rule out
TextureFrame bufferPixel shader
Test for a CPU bottleneckCan also render to smaller view port instead of smaller backbuffer. Still rules out
TextureFrame bufferPixel shader
Bottleneck Identification: Raster
RasterizationRarely the bottleneck, spend your time testing other stages first
Bottleneck Identification: Texture
Texture BoundWhat could the problem be?
Texture cache missesHuge TexturesBandwidthTexturing out of AGP
Bottleneck Identification: Texture
Reduce Texture bandwidthUse tiny (2x2) textures
Good, but if you are using alpha test with texture alpha, then this could actually make things run slower due to increased fill. It is still a good easy test though
Use mipmaps if you aren’t alreadyTurn off anisotropic filtering if you have it on
Bottleneck Identification: Texture
Rule out other stagesSince texture is so easy to test directly, we recommend relying on that
Bottleneck Identification: Fragment
Fragment BoundWhat could the problem be?
Expensive pixel shaderRendering more fragments than necessary
High depth complexityPoor z-cull
Bottleneck Identification: Fragment
Modify the stage itselfJust output a solid color
Good: does no work per fragmentBut also affects texture, so you must then rule out texture
Use simpler mathGood: does less work per fragmentBut make sure that the math still indexes into the textures the same way or you will change the texture stage as well
Bottleneck Identification: FB
Frame Buffer bandwidthWhat could the problem be?
Touching the buffer more times than necessaryMultiple passes
Tons of alpha blendingUsing too big a buffer
Stencil when you don’t need itA lot of time dynamic reflection cube-maps can get away with r5g6b5 color instead of x8r8g8b8
Bottleneck Identification: FB
Modify the stage itselfUse a 16 bit depth buffer instead of a 24 bit oneUse a 16 bit color buffer instead of a 32 bit one
Bottleneck Identification
Now we have a bunch of practical ideas to find out if each stage is a bottleneck or not
Questions on Bottleneck Identification?
A Tool: NVPerfHud
Free tool made to help identify bottlenecksBatchesGPU idleCPU waits for GPUDriver timeTotal timeSolid color pixel shaders2x2 texturesEtc...
Practice
Now lets look at some sample problems and see if we can find out where the problem isUse NVPerfHUD to help
Practice: Clean the Machine
Make sure that your machine is ready for analysisMake sure you have the right driversUse a release build of the game (optimizations on)Check debug output for warnings or errors but.....Use the release d3d runtime!!!No maximum validationNo driver overridden anisotropic filtering or anti-aliasingMake sure v-sync is off
Practice: Example 1
A seemingly simple scene runs horribly slowNarrow in on the bottleneck
Practice: Example 1
HRESULT hr = pd3dDevice->CreateVertexBuffer( 6* sizeof( PARTICLE_VERT ),0, //declares this as staticPARTICLE_VERT::FVF,D3DPOOL_DEFAULT,&m_pVB,NULL );
Dynamic vertex bufferBAD creation flags
Practice: Example 1
HRESULT hr = pd3dDevice->CreateVertexBuffer( 6* sizeof( PARTICLE_VERT ),D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY, PARTICLE_VERT::FVF,D3DPOOL_DEFAULT,&m_pVB,NULL );
Dynamic vertex bufferGOOD creation flags
Practice: Example 1
m_pVB->Lock(0, 0,(void**)&quadTris, 0);
Dynamic Vertex BufferBAD Lock flags
No flags at all!?That can’t be good....
Practice: Example 1
m_pVB->Lock(0, 0,(void**)&quadTris, D3DLOCK_NOSYSLOCK | D3DLOCK_DISCARD);
Dynamic Vertex BufferGOOD Lock flags
Use D3DLOCK_DISCARD the first time you lock a vertex buffer each frame
And again when that buffer is fullOtherwise just use NOSYSLOCK
Practice: Example 2
Another slow sceneWhat’s the problem here
Practice: Example 2
Texture bandwidth overkillUse mipmaps Use dxt1 if possible
Some cards can store compressed data in cacheUse smaller textures when they are fine
Does the grass blade really need a 1024x1024 texture?Maybe
Practice: Example 3
Another slow scene Who wants a prize?
Practice: Example 3
Expensive pixel shaderCan have huge performance effectOnly 3 verts, but maybe a million pixels
That’s only 1024x1024
Look at all the pixels!!
Practice: Example 3
36 cycles BAD
Practice: Example 3
11 cycles GOOD
Practice: Example 3
What changed?Moved math that was constant across the triangle into the vertex shader
Used ‘half’ instead of ‘float’
Got rid of normalize where it wasn’t necessarySee Normalization Heuristics http://developer.nvidia.com
Practice: Example 4
The last oneAudience: there are no more prizes, but we’ve locked the doors
Practice: Example 4
Too many batchesWas sending every quad as it’s own batchInstead, group quads into one big VB then send that with one call
Practice: Example 4
What if they use different textures?Use texture atlasesPut the two textures into a single texture and use a vertex and pixel shader to offset the texture coordinates
Balancing the Pipeline
Once satisfied with performanceBalance the pipeline by making more use of un-bottlenecked stagesCareful not to make too much use of them
FPS FPS
Summary
Pipeline architecture is ruled by bottlenecksDon’t waste time optimizing stages needlesslyIdentify bottlenecks with quick testsUse NVPerfHUD to analyze your pipelineUse Fxcomposer to help tune your shadersCheck your performance early and often
Don’t wait until the last week!
Questions?
Ashu Rege ([email protected])Clint Brewer ([email protected])
Other NVIDIA programming talks
GPU Gems Showcase Wed 5:30 – 6:00
Real-time Translucent Animated ObjectsFri 2:30 – 3:30
Performance Lore
We collected some advice from various developers and include it here so you don’t have to discover it the hard way
Performance Lore
Use low resolution (<256x256) 8-bit normalization cube-maps. Quality isn’t reduced since 50% of texels in high resolution cube-map are identical you are only getting nearest filteringUse oblique frustum clipping to clip geometry for reflection instead of a clip planeRe-use vertex buffers for streaming geometry. Don’t create and delete vertex buffers every frame if they could be re-usedUse multiples of 32 byte sized vertices for transfer over AGP
Performance Lore
Use Occlusion Query and render object’s bounding box this frame. Then use the result next frame to decide whether or not you need to draw the real objectFor ARB fragment programs use ARB_precision_hint_fastest Use 16-bit 565 cube-maps for dynamic reflections on cars. Don’t need 32-bit reflectionsBlend out small game objects and don’t render them when they are far away. cuts down on batches
Performance Lore
use half instead of float optimizations early in developmentIf rendering multiple passes, lay down Depth first then render your expensive pixel shaders. Cuts out depth complexity problems when shadingIf rendering multiple passes, on later additive passes you can set alpha to r + g + b, then use alpha test to cut on fillTerrain was rendered in 4 passes in ps1.1 due to texture limits. Render it in 1 pass in ps2.0
Performance Lore
Communicate with IHVs about your problem, sometimes it really isn’t your code and we can fix the bugs!Use texture pages / atlases to combine objects into a single batchUse anisotropic filtering only on textures that need it. Don’t just set it to default onDon’t lock static vertex buffers multiple times per frame. make them dynamicSorting the scene by render target gave a large perf boost
Performance Lore
When locating the bottleneck, divide and conquer. Lower resolution first, cuts the problem almost in half. rules out just about everything fill and pixel relatedUse float4 to pack multiple float2 texture coordinatesOptimize your index and vertex buffers to take advantage of the cacheMove per object calculations out of the vertex shader and onto the cpuMove per triangle calculations out of the pixel shader and into the vertex shader
Performance Lore
Use swizzles and masks in your vertex and pixel shaders: Value.xy = blahUse the API to clear the color and depth bufferDon’t change the direction of your z test mid frame, going from > ...to... >= ...to... = should be fine, but don’t go from > ...to... <Don’t use polygon offset if something else will workDon’t write depth in your pixel shader if you don’t have to
Performance Lore
Use Mipmaps. If they are too blurry for you, use anisotropic and/or trilinear filtering: that gives better quality than LOD biasRarely is there a single bottleneck in a game. If you find a bottleneck and fix it, and performance doesn’t improve more than a few fps. Don’t give up. You’ve helped yourself by making the real bottleneck apparent. Keep narrowing it down until you find it
Bottleneck Identification
Run App Vary FB FPSvaries?
FBlimited
Vary texturesize/filtering
FPSvaries?
Vary resolution
FPSvaries?
Texturelimited
Vary fragment
instructions
FPSvaries?
Vary vertex
instructions
FPSvaries?
Transformlimited
Vary vertex size/AGP rate
FPSvaries?
Transferlimited
Fragmentlimited
Rasterlimited
CPUlimited
Yes
No
No
No
No
No
No
Yes
Yes
Yes
Yes
Yes
references
http://developer.nvidia.com/object/GDC_2004_Presentations.htmlTomas Akenine-Moller and Eric Haines, Real-Time Rendering, second editionhttp://developer.nvidia.com/object/GDCE_2003_Presentations.html, Has other presentations on finding and locating the bottleneck
developer.nvidia.comdeveloper.nvidia.comThe Source for GPU Programming
Latest documentationSDKsCutting-edge tools
Performance analysis toolsContent creation tools
Hundreds of effectsVideo presentations and tutorialsLibraries and utilitiesNews and newsletter archives
EverQuest® content courtesy Sony Online Entertainment Inc.
GPU Gems: Programming Techniques, GPU Gems: Programming Techniques, Tips, and Tricks for RealTips, and Tricks for Real--Time GraphicsTime Graphics
Practical real-time graphics techniques from experts at leading corporations and universities
Great value:Contributions from industry expertsFull color (300+ diagrams and screenshots)Hard cover816 pagesAvailable at GDC 2004
“GPU Gems is a cool toolbox of advanced graphics techniques. Novice programmers and graphics gurus alike will find the gems practical, intriguing, and useful.”Tim SweeneyLead programmer of Unreal at Epic Games
“This collection of articles is particularly impressive for its depth and breadth. The book includes product-oriented case studies, previously unpublished state-of-the-art research, comprehensive tutorials, and extensive code samples and demos throughout.”Eric HainesAuthor of Real-Time Rendering
For more, visit:For more, visit:http://http://developer.nvidia.com/GPUGemsdeveloper.nvidia.com/GPUGems