Performance Tools
Jeff Kiel
Manager, Developer Performance Tools
© NVIDIA Corporation 2007
Performance Tools Agenda
Overview of GPU pipeline and Unified Shader
NVIDIA PerfKit 5.0: Driver & GPU Performance Data
Instrumented Driver: GPU & driver performance information, GLExpert runtime debugging
PerfSDK: Performance data integrated into your application
PerfHUD: The Direct3D GPU Performance Accelerator
gDEBugger: OpenGL performance analysis and debugging
ShaderPerf: Shader program performance
GPU Pipelined Architecture (Logical View)
[Diagram: logical GPU pipeline. Labeled blocks: CPU, GPU, Vertex Assembly, Vertex Shader, Geometry Shader, Rasterizer, Pixel Shader, Texture, Blending, Framebuffer.]
Common Graphics/GPU Problems
New, increasingly complex GPU hardware
GPU is a black box
Unified shaders change everything
Increasing engine and scene complexity
Artists don’t always understand how rendering engines work
CPU tuning insufficient (multiple processors, multi-cores)
Turnaround time for debugging and tuning shaders too long
Hard to debug API/pipeline setup issues
Unified Shader Tuning
No longer “pixel shader bound”
GPU balances workload
Now just “shader unit bound”
Check workload distribution for optimization opportunities
Typical optimizations may not work
Classic: move calculations from pixels to vertices
If #vertices ~= #pixels, no improvement
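The "no improvement" claim can be checked with simple arithmetic. The sketch below assumes a deliberately simplified cost model that is not from the slides: every shader instruction runs on the same pool of unified units, so the work per frame is just vertices times per-vertex instructions plus pixels times per-pixel instructions. The names `FrameWorkload`, `shaderWork`, `moveOpsToVertex`, and `savedWork` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified cost model: on a unified architecture all
// shader instructions share one pool of units, so total work is the
// sum of per-vertex and per-pixel instruction counts.
struct FrameWorkload {
    std::int64_t vertices;      // vertices shaded per frame
    std::int64_t pixels;        // pixels shaded per frame
    std::int64_t opsPerVertex;  // instructions per vertex
    std::int64_t opsPerPixel;   // instructions per pixel
};

// Total shader instructions issued for one frame.
std::int64_t shaderWork(const FrameWorkload& w) {
    return w.vertices * w.opsPerVertex + w.pixels * w.opsPerPixel;
}

// Move `ops` instructions from the pixel shader to the vertex shader.
FrameWorkload moveOpsToVertex(FrameWorkload w, std::int64_t ops) {
    assert(ops >= 0 && ops <= w.opsPerPixel);
    w.opsPerVertex += ops;
    w.opsPerPixel -= ops;
    return w;
}

// Instructions saved by the move. Algebraically this equals
// ops * (pixels - vertices): zero when the counts match, negative
// when there are more vertices than pixels.
std::int64_t savedWork(const FrameWorkload& w, std::int64_t ops) {
    return shaderWork(w) - shaderWork(moveOpsToVertex(w, ops));
}
```

Under this model, moving 5 instructions helps a scene with 10,000 vertices and 1,000,000 pixels, does nothing when both counts are 1,000,000, and costs work when vertices outnumber pixels.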
NVIDIA PerfKit 5: The Solution!
Instrumented Driver
GLExpert
PerfHUD
PerfSDK: PerfAPI, Sample Code, Helper Classes, Documentation
Tools: NVIDIA Plug-In for Microsoft PIX for Windows, gDEBugger 3.1, DevCPL
Platforms (32-bit & 64-bit): Windows XP & Vista; Linux update soon!
PerfKit Instrumented Driver
GLExpert functionality
GPU and Driver Performance Counters
OpenGL and Direct3D
Data exported via NVIDIA API and PDH
Simplified Experiments (SimExp)
Collect GPU and driver data, retain performance
Track intra-frame events & statistics
Gather and collate at end of frame
Performance overhead 1-2%
GLExpert: What is it?
Helps eliminate driver/CPU performance issues
OpenGL portion of the Instrumented Driver
Output to stdout or debugger
Different groups/levels of information detail
Controlled using environment variables in Linux, DevCPL tab in Windows
GLExpert: What is it?
Information provided
GL Errors: print when raised
Software Fallbacks: indicate when the driver is in fallback
GPU Programs: errors during compile or link
VBOs: show where they reside, mapping details
FBOs: print reasons for unsupported configuration
Future Enhancements
Extensive SLI support
Quadro 5600/GeForce 8 Series extensions
More detailed pipeline setup messages, buffer object support, fallback information, and more
PerfKit: Performance Counter Types
SW/Driver Counters: PerfAPI, PDH
Raw GPU Counters: PerfAPI, PDH
Simplified Experiments: PerfAPI
Instrumented GPUs
Quadro FX 5600 & 4500
GeForce 8800 GTX, 8600 GT
GeForce 7950/7900 GTX & GT
GeForce 7800 GTX
GeForce 6800 Ultra & GT
GeForce 6600
OpenGL/Direct3D Driver Counters
General: FPS, ms per frame
Driver: driver frame time (total time spent in driver), driver sleep time (waiting for GPU), detailed wait timers (kernel, locks, rendering, etc.)
Counts: batches, vertices, primitives, triangles and instanced triangles
Memory: total used, render targets, textures, buffers
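As a rough illustration of how the general and driver counters relate, the sketch below converts FPS to ms per frame and expresses a driver timer as a fraction of the frame. The function names, the 25% threshold, and the idea of reading a large driver sleep share as a sign that the CPU is waiting on the GPU are assumptions for illustration, not PerfKit behavior.

```cpp
#include <cassert>

// Convert a frame rate into the equivalent frame time in milliseconds.
double msPerFrame(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Share of the frame taken by one timer, for example driver sleep time.
double fractionOfFrame(double partMs, double frameMs) {
    assert(frameMs > 0.0 && partMs >= 0.0);
    return partMs / frameMs;
}

// Hypothetical heuristic: if the driver spends more than `threshold` of
// the frame asleep waiting for the GPU, the CPU is probably running ahead
// of the GPU. The default threshold is arbitrary.
bool looksGpuLimited(double driverSleepMs, double frameMs,
                     double threshold = 0.25) {
    return fractionOfFrame(driverSleepMs, frameMs) > threshold;
}
```

For example, 50 FPS corresponds to 20 ms per frame; 10 ms of driver sleep time in that frame is half the frame and would trip the heuristic, while 1 ms would not.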
GPU Counters
[Diagram: GPU pipeline blocks annotated with counter names. Blocks: Vertex Assembly, Vertex Shader, Geometry Shader, Pixel Shader, Texture Unit (Filtering), Raster / ZCull, Raster Operations, Frame Buffer (RAM Memory).]
Counters shown:
gpu_idle
vertex_attribute_count
vertex_count
primitive_count
triangle_count
culled_primitive_count
shaded_pixel_count
rop_busy
shader_busy
vertex, geometry, pixel ratios
How do I use PerfKit counters?
PerfAPI: Easy integration of PerfKit
Real time performance monitoring using GPU and driver counters, round robin sampling
Simplified Experiments for single frame analysis
PDH: Performance Data Helper for Windows
Driver data, GPU counters, and OS information
Exposed via Perfmon
Good for rapid prototyping
PerfSDK: Sample code and helper classes
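One way to picture round robin sampling is a scheduler that activates a different subset of the requested counters on each frame. The sketch below is only a model of that idea: the function `countersForFrame` and the notion of a fixed number of hardware `slots` are assumptions, and the slides do not say how the driver actually schedules counters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: only `slots` counters can be sampled at once, so
// each frame samples the next window of the requested list, wrapping
// around when the end of the list is reached.
std::vector<std::string> countersForFrame(
    const std::vector<std::string>& counters,
    std::size_t slots,
    std::size_t frame) {
    assert(slots > 0);
    std::vector<std::string> active;
    if (counters.empty()) {
        return active;
    }
    const std::size_t n = std::min(slots, counters.size());
    const std::size_t start = (frame * n) % counters.size();
    for (std::size_t i = 0; i < n; ++i) {
        active.push_back(counters[(start + i) % counters.size()]);
    }
    return active;
}
```

With four counters and two slots, frame 0 samples the first two counters, frame 1 the last two, and frame 2 wraps back to the first two, so every counter is refreshed every other frame.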
PerfAPI: Real Time
// Somewhere in setup
NVPMAddCounterByName("vertex_shader_busy");
NVPMAddCounterByName("pixel_shader_busy");
NVPMAddCounterByName("shader_waits_for_texture");
NVPMAddCounterByName("gpu_idle");
// In your rendering loop, sample using names
NVPMSample(NULL, &nNumSamples);
NVPMGetCounterValueByName("vertex_shader_busy", 0, &nVSEvents, &nVSCycles);
NVPMGetCounterValueByName("pixel_shader_busy", 0, &nPSEvents, &nPSCycles);
NVPMGetCounterValueByName("shader_waits_for_texture", 0, &nTexEvents, &nTexCycles);
NVPMGetCounterValueByName("gpu_idle", 0, &nIdleEvents, &nIdleCycles);
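Each counter read above yields an events value and a cycles value. A minimal sketch of turning those pairs into percentages follows, assuming (as the variable names suggest, but the slides do not state) that events divided by cycles gives the fraction of time the unit was busy. `CounterSample`, `busyPercent`, and `busiestCounter` are hypothetical helpers and make no PerfAPI calls.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One counter reading: the two values returned per counter name.
struct CounterSample {
    std::string name;
    std::uint64_t events;  // assumed: cycles in which the unit was busy
    std::uint64_t cycles;  // assumed: total cycles in the sample window
};

// Busy percentage for one sample; 0 when no cycles were recorded.
double busyPercent(const CounterSample& s) {
    if (s.cycles == 0) {
        return 0.0;
    }
    return 100.0 * static_cast<double>(s.events) /
           static_cast<double>(s.cycles);
}

// Name of the counter with the highest busy percentage.
std::string busiestCounter(const std::vector<CounterSample>& samples) {
    assert(!samples.empty());
    std::string best;
    double bestPercent = -1.0;
    for (const CounterSample& s : samples) {
        const double p = busyPercent(s);
        if (p > bestPercent) {
            bestPercent = p;
            best = s.name;
        }
    }
    return best;
}
```

In an application, the samples would be filled from the values read in the rendering loop and the percentages drawn as an on-screen graph or logged per frame.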
PerfAPI: SimExp
NVPMAddCounterByName("GPU Bottleneck");
NVPMAllocObjects(50);
// Set up the experiment, get pass count
NVPMBeginExperiment(&nNumPasses);
for (int ii = 0; ii < nNumPasses; ++ii) {
    // Scene setup/clear backbuffer
    NVPMBeginPass(ii);
    NVPMBeginObject(0);
    // Draw calls associated with object 0 and flush
    NVPMEndObject(0);
    ...
    NVPMEndPass(ii);
    // End scene/present/swap buffers
}
// End experiment and retrieve bottleneck
NVPMEndExperiment();
NVPMGetCounterValueByName("GPU Bottleneck", 0, &nGPUBneck, &nGPUCycles);
PerfHUD: Direct3D debugging and tuning
One click bottleneck determination
Graphs and debugging tools overlaid on your application
4 screens for targeted analysis
Performance Dashboard
Debug Console
Frame Debugger
Frame Profiler
Drag and drop application on PerfHUD icon
New! PerfHUD 5.0!
Interactive model
Shader Edit and Continue
Render State Modification
Configurable Graphs
Many more features and usability improvements
New technologies
Windows Vista & DirectX 10
Quadro 5600 and GeForce 8800 with Unified Shader Architecture
Demo: PerfHUD
Company of Heroes used with permission from THQ and Relic Entertainment
Demo: Performance Dashboard
Demo: Frame Debugger
Demo: Advanced Frame Debugger
Demo: Frame Profiler
Frame Profiler
One Touch Performance Analysis
PerfHUD uses PerfSDK
Multiple passes on the scene, sample over 40 performance counters
Need to render THE SAME FRAME until all the counters are read
Must use time-based animation
Do use QueryPerformanceCounter() or timeGetTime()
Don’t use RDTSC or throttle frame rate
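The requirement to render the same frame repeatedly is easier to meet when every animated value is derived from a single clock reading rather than from a frame counter. The sketch below illustrates that design, assuming a tool could hold such a clock still between passes. It substitutes the portable `std::chrono::steady_clock` for QueryPerformanceCounter() and timeGetTime(); `AnimationClock` and `positionAt` are hypothetical names, and nothing here reflects how PerfHUD itself controls time.

```cpp
#include <cassert>
#include <chrono>

// Single source of animation time. While frozen, every query returns the
// same value, so re-rendering produces an identical frame.
class AnimationClock {
public:
    using Clock = std::chrono::steady_clock;

    explicit AnimationClock(Clock::time_point start = Clock::now())
        : start_(start), frozenAt_(start), frozen_(false) {}

    // Animation time in seconds at wall-clock instant `now`.
    double elapsedSeconds(Clock::time_point now = Clock::now()) const {
        const Clock::time_point t = frozen_ ? frozenAt_ : now;
        assert(t >= start_);
        return std::chrono::duration<double>(t - start_).count();
    }

    // Hold animation time at `now` until further notice.
    void freeze(Clock::time_point now = Clock::now()) {
        frozen_ = true;
        frozenAt_ = now;
    }

private:
    Clock::time_point start_;
    Clock::time_point frozenAt_;
    bool frozen_;
};

// Time-based animation: position depends on elapsed time only, never on
// how many frames have been drawn.
double positionAt(double unitsPerSecond, double seconds) {
    return unitsPerSecond * seconds;
}
```

An object moving at 3 units per second sits at 6 units two seconds in; once the clock is frozen there, it stays at 6 units no matter how many additional passes are rendered. Frame-rate throttling or frame-count-based animation would break this property.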
Associated Tools: NVIDIA Plug-In for Microsoft PIX for Windows
Graphic Remedy’s gDEBugger
OpenGL and OpenGL ES Debugger and Profiler
Shorten development time
Improve application quality
Optimize performance
NVIDIA PerfKit and GLExpert integrated
Now supports Linux!
Windows XP & Vista, 32-bit & 64-bit
Discounted academic licenses available
NVIDIA booth Thursday morning
http://www.gremedy.com
PerfGraph
Open source tool for graphing performance counters
Supports PerfKit GPU/Driver signals
System performance information (CPU utilization, memory, etc.)
Cross platform
Windows & Linux
Project Status
PerfKit 5.0 available now: http://developer.nvidia.com/perfkit
PerfHUD 5.0
ForceWare Release 100 Driver
GeForce 8800 support
Windows XP & Vista, 32 and 64 bit
Linux 32 and 64 bit PerfSDK/GLExpert in development
PerfGraph: www.sourceforge.org\perfgraph
Instrumented GPUs
Quadro FX 5600 & 4500
GeForce 8800 Series
GeForce 7950/7900 GTX & GT
GeForce 7800 GTX
GeForce 6800 Ultra & GT
GeForce 6600
Feedback and Support: http://developer.nvidia.com/forums
v2f BumpReflectVS(a2v IN,
                  uniform float4x4 WorldViewProj,
                  uniform float4x4 World,
                  uniform float4x4 ViewIT)
{
    v2f OUT;
    // Position in screen space.
    OUT.Position = mul(IN.Position, WorldViewProj);
    // pass texture coordinates for fetching the normal map
    OUT.TexCoord.xyz = IN.TexCoord;
    OUT.TexCoord.w = 1.0;
    // compute the 3x3 transform from tangent space to object space
    float3x3 TangentToObjSpace;
    // first rows are the tangent and binormal scaled by the bump scale
    TangentToObjSpace[0] = float3(IN.Tangent.x, IN.Binormal.x, IN.Normal.x);
    TangentToObjSpace[1] = float3(IN.Tangent.y, IN.Binormal.y, IN.Normal.y);
    TangentToObjSpace[2] = float3(IN.Tangent.z, IN.Binormal.z, IN.Normal.z);
    OUT.TexCoord1.x = dot(World[0].xyz, TangentToObjSpace[0]);
    OUT.TexCoord1.y = dot(World[1].xyz, TangentToObjSpace[0]);
    OUT.TexCoord1.z = dot(World[2].xyz, TangentToObjSpace[0]);
    OUT.TexCoord2.x = dot(World[0].xyz, TangentToObjSpace[1]);
    OUT.TexCoord2.y = dot(World[1].xyz, TangentToObjSpace[1]);
    OUT.TexCoord2.z = dot(World[2].xyz, TangentToObjSpace[1]);
    OUT.TexCoord3.x = dot(World[0].xyz, TangentToObjSpace[2]);
    OUT.TexCoord3.y = dot(World[1].xyz, TangentToObjSpace[2]);
    OUT.TexCoord3.z = dot(World[2].xyz, TangentToObjSpace[2]);
    float4 worldPos = mul(IN.Position, World);
    // compute the eye vector (going from shaded point to eye) in cube space
    float4 eyeVector = worldPos - ViewIT[3]; // view inv. transpose contains eye position in world space in last row.
    OUT.TexCoord1.w = eyeVector.x;
    OUT.TexCoord2.w = eyeVector.y;
    OUT.TexCoord3.w = eyeVector.z;
    return OUT;
}
///////////////// pixel shader //////////////////
float4 BumpReflectPS(v2f IN,
                     uniform sampler2D NormalMap,
                     uniform samplerCUBE EnvironmentMap,
                     uniform float BumpScale) : COLOR
{
    // fetch the bump normal from the normal map
    float3 normal = tex2D(NormalMap, IN.TexCoord.xy).xyz * 2.0 - 1.0;
    normal = normalize(float3(normal.x * BumpScale, normal.y * BumpScale, normal.z));
    // transform the bump normal into cube space
    // then use the transformed normal and eye vector to compute a reflection vector
    // used to fetch the cube map
    // (we multiply by 2 only to increase brightness)
    float3 eyevec = float3(IN.TexCoord1.w, IN.TexCoord2.w, IN.TexCoord3.w);
    float3 worldNorm;
    worldNorm.x = dot(IN.TexCoord1.xyz, normal);
    worldNorm.y = dot(IN.TexCoord2.xyz, normal);
    worldNorm.z = dot(IN.TexCoord3.xyz, normal);
    float3 lookup = reflect(eyevec, worldNorm);
    return texCUBE(EnvironmentMap, lookup);
}
ShaderPerf 2.0
Inputs:
• GLSL, Cg, HLSL
• PS1.x, PS2.x, PS3.x
• VS1.x, VS2.x, VS3.x
• !!FP1.0
• !!ARBfp1.0
ShaderPerf
GPU Arch:
• Quadro FX series
• GeForce 8X00, 7X00
• GeForce 6X00 & FX
Outputs:
• Resulting assembly code
• # of cycles
• # of temporary registers
• Pixel/vertex throughput
• Test all fp16 and all fp32
ShaderPerf: In your pipeline
Test current performance
Compare with shader cycle budgets
Test optimization opportunities
Not just Tex/ALU balance: cycles & throughput
Automated regression analysis
Integrated in FX Composer 2.0
Artists/TDs code expensive shaders
Achieve optimum performance
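A build pipeline could apply the budget and regression checks above along the following lines. This is a sketch under stated assumptions: the cycle counts would come from ShaderPerf's reported "# of cycles", but no ShaderPerf interface is called here, and `ShaderReport`, `overBudget`, `regressed`, `shadersNeedingAttention`, and all numbers in the usage note are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One shader's measurement for the current build, plus the values it is
// compared against.
struct ShaderReport {
    std::string name;
    int cycles;          // cycle count measured for this build
    int budgetCycles;    // agreed cycle budget for this shader
    int previousCycles;  // cycle count from the last accepted build
};

// True when the shader exceeds its cycle budget.
bool overBudget(const ShaderReport& r) {
    assert(r.budgetCycles > 0);
    return r.cycles > r.budgetCycles;
}

// True when the shader got slower than the last accepted build.
bool regressed(const ShaderReport& r) {
    return r.cycles > r.previousCycles;
}

// Names of shaders that are over budget or have regressed, in input order.
std::vector<std::string> shadersNeedingAttention(
    const std::vector<ShaderReport>& reports) {
    std::vector<std::string> flagged;
    for (const ShaderReport& r : reports) {
        if (overBudget(r) || regressed(r)) {
            flagged.push_back(r.name);
        }
    }
    return flagged;
}
```

With made-up data, a shader at 12 cycles against a 16-cycle budget and a previous 12 passes, one at 40 cycles against a 32-cycle budget is flagged as over budget, and one that went from 18 to 20 cycles is flagged as a regression even though it is still within its 24-cycle budget.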
ShaderPerf 2.0 Alpha
Supports Direct3D/HLSL
GeForce 7, 6, and FX series GPUs
ForceWare Release 162 Unified Compiler
Improved vertex performance simulation and throughput calculation with branching
Multiple drivers from one ShaderPerf
Smaller footprint
New programmatic interface
ShaderPerf 2.0: Beta
Full support for Cg and GLSL, vertex and fragment programs
Support for Quadro 5600 & GeForce 8 series GPUs
Geometry shaders and geometry throughput
Fragment program differencing
Questions?
Stop by our booth for a hands on demo!
Online:
http://developer.nvidia.com/PerfKit
http://developer.nvidia.com/PerfHUD
http://developer.nvidia.com/ShaderPerf
Feedback and Support: http://developer.nvidia.com/forums
The NVIDIA Developer Toolkit
Content Creation: FX Composer 2, Melody, Texture Tools 2, mental mill Artist Edition
Software Development: SDK 10, Cg Toolkit, NVSG
Performance: PerfKit 5 (PerfHUD 5, PerfSDK, GLExpert, gDEBugger, NV PIX Plug-in), ShaderPerf 2
Documentation: GPU Programming Guide, Conference Presentations, Videos, Books
GPU Gems 3 Available Now!
SIGGRAPH Bookstore
Major Book Retailers
Includes chapters from
Adobe Systems
Apple
Crytek
Cornell University
Electronic Arts
Havok
Juniper Networks
Microsoft
SEGA
…and many more