D3D12 A NEW MEANING FOR EFFICIENCY AND PERFORMANCE
DAVE OLDCORN, AMD
STEPHAN HODES, AMD
MAX MCMULLEN, MICROSOFT
DAN BAKER, OXIDE
5TH MARCH 2015
D3D11 to D3D12
| D3D12 A NEW MEANING FOR EFFICIENCY AND PERFORMANCE | GDC | MARCH 5TH 20153
WHAT HASN’T CHANGED
D3D12 is primarily a software change
Hardware programming model is still the same
‒Few new rendering features
WHAT HAS CHANGED
The software model has changed a lot
Not just in the API, but also in the underlying philosophy
‒Closer to the hardware
‒Give more control to the application
APPLICATION IS ARBITER OF CORRECT RENDERING
Trades off safety for power
‒If D3D11 is JavaScript, D3D12 is C++
Large areas of undefined
‒... where behaviour will change with future GPUs
Use the debug layer
Stay away from the corners, don’t take risks
‒Expect “morality guides”
‒... once we know what people keep doing wrong
BROAD STROKE CHANGES D3D11 -> 12
Sequential API -> Queues, Command Lists
Small state blocks -> State object for pipeline
Resource binding: individual objects -> Resource binding: tables
Automatic synchronisation, driver tracks resource state -> Manual synchronisation, app must avoid overwrites
Implicit memory management by OS & driver -> Explicit memory management by application
New in D3D12
COMMAND LISTS
Each command list is executed strictly sequentially
Command lists can call out to second-level command lists (“bundles”)
‒Some restrictions on bundles
‒Replaying bundles is OK
Top level command lists can be replayed too
‒But not until the previous submit has retired
Size them right
‒100s of draws for direct lists; 10+ draws for bundles
COMMAND LISTS ENABLE CPU SIDE THREADING
Command lists can be built on arbitrary threads
‒And very quickly too
Submit is thread-safe
‒Submit in batches
Consider task oriented engines
‒Divide rendering into tasks
‒Run CPU tasks to build command lists
‒Use dependencies to order GPU submission
‒Also helps with resource barriers
ALLOCATOR AND LIST MEMORY MANAGEMENT
Lists / Allocators manage memory
‒Hang on to their resources when reset
‒Must be destroyed to fully release memory
‒Reuse lists / allocators on ‘similar’ data
‒Destroy if data is very dissimilar
‒Don’t use pool of lists / allocators for all possible uses
[Figure: List / Allocator memory usage over successive uses: Initial 100 draws; Reset; Same 100 draws (guaranteed no new allocations); Different 100 draws; 200 draws; 5 draws]
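The retention behaviour described on this slide can be modelled in a few lines. This is only a sketch under stated assumptions: `AllocatorModel` is a hypothetical stand-in, not a D3D12 interface, and it charges a fixed, made-up number of bytes per draw, so it cannot distinguish the "same" from the "different" 100 draws shown in the figure.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a command allocator's memory behaviour.
// It is NOT the D3D12 interface; it only tracks sizes.
class AllocatorModel {
public:
    // Record `draws` draw calls. Returns true if the backing memory had to
    // grow, false if the recording fitted into memory already held.
    bool record(std::size_t draws) {
        assert(draws > 0);
        used_ += draws * kBytesPerDraw;
        if (used_ <= capacity_) {
            return false;
        }
        capacity_ = used_;  // grow to the new high-water mark
        return true;
    }

    // Reset rewinds the write position but keeps the backing memory,
    // mirroring "hang on to their resources when reset".
    void reset() { used_ = 0; }

    std::size_t capacity() const { return capacity_; }

private:
    static constexpr std::size_t kBytesPerDraw = 64;  // assumed, illustrative only
    std::size_t used_ = 0;
    std::size_t capacity_ = 0;
};
```

In this model, recording 200 draws and then reusing the allocator for 5 draws leaves the memory for 200 draws held, which is why the slide suggests destroying allocators whose workloads have become very dissimilar.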
PIPELINE STATE OBJECT (PSO)
Collates most D3D11 renderstates
Compiled into hardware registers at Create time
‒Can easily be tens of ms, so use asynchronous threads
All state set onto command buffer in one go
Keep adjacent PSOs similar
Use sensible defaults for don’t care fields
Example: Rasterizer state
    INT DepthBias;
    FLOAT DepthBiasClamp;
    FLOAT SlopeScaledDepthBias;
    BOOL DepthClipEnable;
None of this matters if depth test is off
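One way to act on the "sensible defaults" advice is to normalise don't-care fields before a state description is hashed or compared, so that two descriptions differing only in ignored fields map to the same PSO. The sketch below is an illustration under assumptions: `RasterKey`, `canonicalize` and `sameKey` are hypothetical names, not D3D12 types, and the sketch conservatively resets only the three depth-bias fields.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical, reduced description of the state that feeds a PSO lookup.
struct RasterKey {
    bool depthTestEnabled;
    std::int32_t depthBias;
    float depthBiasClamp;
    float slopeScaledDepthBias;
};

// Force don't-care fields to fixed defaults so equivalent states compare equal.
RasterKey canonicalize(RasterKey key) {
    if (!key.depthTestEnabled) {
        key.depthBias = 0;
        key.depthBiasClamp = 0.0f;
        key.slopeScaledDepthBias = 0.0f;
    }
    return key;
}

// Field-wise comparison, as a PSO cache lookup might perform.
bool sameKey(const RasterKey& a, const RasterKey& b) {
    return std::tie(a.depthTestEnabled, a.depthBias, a.depthBiasClamp,
                    a.slopeScaledDepthBias) ==
           std::tie(b.depthTestEnabled, b.depthBias, b.depthBiasClamp,
                    b.slopeScaledDepthBias);
}
```

Canonicalising before lookup means fewer distinct PSOs get compiled, and adjacent PSOs are more likely to be similar.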
D3D12 RESOURCE BINDING 1
Table driven
Shared across all shader stages
Two-level table
‒Root Signature describes a top-level layout
  ‒Pointers to descriptor tables
  ‒Direct pointers to constant buffers
  ‒Inline constants
Changing which table is pointed to is cheap
‒It’s just writing a pointer; no synchronisation cost
Changing contents of table is harder
‒Can’t change table in flight on the hardware; no automatic renaming
[Diagram: a Root Signature holding table pointers, a root constant buffer view and a 32-bit constant; the table pointers reference Descriptor Tables containing CB, SR and UA views]
D3D12 RESOURCE BINDING 2
Tables should be grouped by frequency of change
‒Per-draw, per-material, per-light, per-frame
‒Hint update frequency to driver by placing most frequent changes early in root signature
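The ordering hint can be applied mechanically when a root signature layout is assembled. The following is a minimal sketch, not D3D12 API usage: `UpdateRate`, `RootParam` and `orderByUpdateRate` are hypothetical names, and the ranking of the four frequencies (per-draw most frequent, per-frame least) is an assumption that a given engine may need to adjust.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Assumed ranking: lower value = changes more often.
enum class UpdateRate { PerDraw = 0, PerMaterial = 1, PerLight = 2, PerFrame = 3 };

// Hypothetical description of one root signature entry.
struct RootParam {
    std::string name;
    UpdateRate rate;
};

// Place the most frequently changing parameters first. stable_sort keeps the
// caller's relative order among parameters with the same update rate.
std::vector<RootParam> orderByUpdateRate(std::vector<RootParam> params) {
    std::stable_sort(params.begin(), params.end(),
                     [](const RootParam& a, const RootParam& b) {
                         return a.rate < b.rate;
                     });
    return params;
}
```

The sorted list would then be used to assign root parameter indices in order.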
D3D12 RESOURCE BINDING TIPS
Don’t overload root signature size
‒CBVs and constants in root signature should probably be changing every draw call
‒Bulk constant data should be in CBs not root constants
Use static tables where possible
‒Associate with object and prebuild
D3D12 RESOURCE SYNCHRONISATION
No automatic synchronisation
Must insert barriers between usage
Three functions of barrier
‒Format conversion
  ‒e.g. antialiasing resolve or depth decompression
‒Synchronisation
  ‒Ensuring correct order of execution; e.g. compute use of a render output could start before colour buffer is finished working on the data, due to pipelining
‒Visibility
  ‒Typically cache flushes, if unit A and unit B do not share the same visibility of the data
Barrier specifies previous and next usage and driver inserts appropriate work
BARRIER TIPS
Group barriers into same Barrier call
‒Will take the worst case of all, rather than potentially incurring multiple sequential barriers
Set minimal barriers
Barriers must be correct
‒Will be a gigantic headache for IHVs if not
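A common way to follow the grouping advice is to queue transitions as they are discovered and emit them together just before the work that needs them. The sketch below models that pattern only: `Usage`, `Transition` and `BarrierBatcher` are hypothetical, and `flush()` merely counts where a real renderer would make a single barrier call with an array of barriers (in D3D12, `ID3D12GraphicsCommandList::ResourceBarrier` accepts a count and an array).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical usage states; a real engine would use the API's own states.
enum class Usage { RenderTarget, ShaderResource, CopyDest, UnorderedAccess };

struct Transition {
    int resourceId;
    Usage before;
    Usage after;
};

class BarrierBatcher {
public:
    // Queue a transition. A no-op (before == after) is dropped, which keeps
    // the emitted barrier set minimal.
    void transition(int resourceId, Usage before, Usage after) {
        if (before != after) {
            pending_.push_back({resourceId, before, after});
        }
    }

    // Emit everything queued so far as ONE batch and return its size.
    // Here the "barrier call" is only counted.
    std::size_t flush() {
        const std::size_t count = pending_.size();
        if (count > 0) {
            ++barrierCalls_;
        }
        pending_.clear();
        return count;
    }

    std::size_t barrierCalls() const { return barrierCalls_; }

private:
    std::vector<Transition> pending_;
    std::size_t barrierCalls_ = 0;
};
```

Calling `flush()` once before a draw or dispatch, rather than after every `transition()`, is what turns several sequential barriers into one.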
PROFILING
D3D11 was reasonably predictable in profiling
‒Limited set of accessible bottlenecks
‒Usually fairly obvious which one you’re hitting
D3D12 environment adds new factors
‒API features: flexible resource binding, concurrency
‒Hardware limits that were pretty much impossible to bump against in D3D11
‒Even PCIe® and system memory bus
Different hardware much more likely to have divergent behaviour
‒Test on a wide range of hardware
Concurrency in D3D12
QUEUES
Graphics, compute and copy queues
Each is a superset of the next: graphics can do everything compute can, and compute everything copy can
Must specify executing queue type at record time
[Diagram: nested sets, Graphics containing Compute containing Copy]
MULTIPLE QUEUES
Multiple queues of the same type supported
‒Within queue: work is ordered
‒Between separate queues work can be arbitrarily reordered
Use Fences to define work order
[Diagram: Graphics Queue 1 holds Shadowmap L0, Lighting L0; Graphics Queue 2 holds Shadowmap L1, Lighting L1; the Graphics engine executes Shadowmap L0, Shadowmap L1, Lighting L1, Lighting L0]
GAME ENGINE WORKFLOW
[Diagram: frame workflow with stages Physics, Shadowmap Rendering, G-buffer Rendering, Lighting & Shading, Solid Post Processing, Transparent Obj Rendering, Post Processing, UI Rendering, Present. Annotations: TressFX, Particle; Multiple cascades, Point/Spotlights; Prepare, e.g. generate Min/Max Mips; e.g. Particle Rendering. Also shown: Heap Defragmentation, Streaming, Dynamic Data Update]
CONCURRENCY
Graphics, compute and / or copy may run in parallel
‒Profile to verify
‒Very familiar to console programmers
[Diagram: timeline of the Graphics, Compute and Copy engines running in parallel. Copy engine: Defragmentation, Streaming, Dynamic Data Update. Other work shown: Physics, Prepare SM, Shadowmaps, G-buffer, Tile Deferred, AA/AO, Transparent, Tonemap, UI]
DEMO TIME!
Example of gains from async compute:
‒Interleaving 2 frames
Sample code will be available
Sample based on DX11 work by Jason Stewart & Gareth Thomas
[Diagram: interleaved frames, with G-buffer Rendering 1, 2, 3 overlapping Lighting & Shading 0, 1, 2]
PARALLELISE UNALIKE WORKLOADS
Engines may compete for resources
‒Bus bandwidth
‒Shader core, texture fetch for compute / graphics
‒GPRs, Caches…
The less similar the workload, the faster each runs
Bus dominated:
‒Shadow mapping
‒ROP heavy workloads
‒Many G buffer operations
‒DMA operations (texture upload, heap defrag)
Shader throughput:
‒Deferred lighting (usually)
‒Many postprocessing effects
‒Most compute tasks (texture compression, physics, simulations)
Geometry dominated:
‒Rendering highly detailed models
EXPLOITING CONCURRENCY
Profile!
Can align execution across queues with fences
‒Fences have a significant cost
‒Don’t overdo this; “a few” per frame at most
[Diagram: alternative schedules of Shadow map, Animate Particles, Stream Texture and Deferred Lighting across queues, labelled “Win!” and “Big Win!”]
BARRIERS AND MULTIPLE QUEUES
Barrier must be inserted on last queue to write resource
‒Primarily this is for any required format conversion
Fences contain implicit acquire / release barriers
‒One of the reasons they have a high cost
Resource Management in D3D12
Max McMullen, Microsoft
DIRECT3D 12 RESOURCE CREATION OVERVIEW
Direct3D 11 has a simple model, create and use
Works great given the simplicity of the abstraction
A few problems for today’s titles
‒Unpredictable performance differences due to driver workarounds
‒No high performance reuse of memory in a given frame
‒Tiled Resources added on to the original abstraction
DIRECT3D 11
[Diagram: below the API / DDI boundary, each Direct3D 11 resource (Buffer, Texture3D, Texture2D) has its own GPU VA range and its own physical pages]
DIRECT3D 12 RESOURCE HEAPS
Direct3D 12 separates allocation of GPU physical pages and GPU virtual addresses from resources
Applications can better amortize the cost of physical page allocation
‒Reuse memory for temporaries
‒Repurpose memory when the scene no longer requires it
DIRECT3D 12 RESOURCE HEAPS
[Diagram: below the API / DDI boundary, a Resource Heap owns the GPU VA range and physical pages; Buffer, Texture3D and Texture2D resources live inside the heap]
RESOURCE HEAP PROPERTIES
Memory Pool
‒L0 – Closest to CPU
‒L1 – Closest to GPU (Discrete GPU only)
CPU Page Properties
‒Not Accessible (L0 & L1)
‒Write Combine (L0 Only)
‒Write Back (L0 Only)
Alignment
‒64 KB (Default)
‒1 MB (Enable MSAA)
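The alignment property matters as soon as an application places several resources in one heap. The sketch below shows the offset arithmetic only, under stated assumptions: `HeapPlacer` is a hypothetical linear placer, not a D3D12 type, and the 64 KB default is the figure from the slide (a caller would pass the larger value for MSAA resources).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical linear placer: hands out aligned offsets inside one heap.
class HeapPlacer {
public:
    static constexpr std::uint64_t kDefaultAlignment = 64 * 1024;  // from the slide
    static constexpr std::uint64_t kNoSpace = ~std::uint64_t{0};

    explicit HeapPlacer(std::uint64_t heapSize) : size_(heapSize) {}

    // Returns the heap offset for a resource of `bytes`, or kNoSpace.
    // `alignment` must be a power of two.
    std::uint64_t place(std::uint64_t bytes,
                        std::uint64_t alignment = kDefaultAlignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t offset = (next_ + alignment - 1) & ~(alignment - 1);
        if (offset > size_ || bytes > size_ - offset) {
            return kNoSpace;
        }
        next_ = offset + bytes;
        return offset;
    }

    // Start over, e.g. to reuse the same pages for a new set of temporaries.
    // The caller is responsible for making sure the GPU is done with the old ones.
    void reset() { next_ = 0; }

private:
    std::uint64_t size_;
    std::uint64_t next_ = 0;
};
```

Even a 1000-byte resource consumes a full 64 KB slot in this scheme, so small resources are usually better packed into a shared buffer.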
SIMPLIFIED HEAP TYPES
DEFAULT
‒Memory Pool: L1 (Discrete), L0 (Integrated)
‒CPU Properties: No CPU access
‒Usage: Frequent GPU Read/Write; Max GPU Bandwidth
UPLOAD
‒Memory Pool: L0
‒CPU Properties: Write Combine, Write Back*
‒Usage: CPU Write Once, GPU Read Once; Max CPU Write Bandwidth
READBACK
‒Memory Pool: L0
‒CPU Properties: Write Back
‒Usage: GPU Write Once, CPU Read; Max CPU Read Bandwidth
DIRECT3D 12 RESOURCE CREATION APIS
Three types of resource create
‒Committed
‒Placed
‒Reserved
Each has a different pattern of GPU VA and Physical Page usage to enable different scenarios
DIRECT3D 12 RESOURCE CREATION APIS
[Diagram: Committed, Placed and Reserved resources compared by how each uses GPU VA, Resource Heaps and Physical Pages; resources shown: Texture3D, Buffer, Texture2D]
EFFICIENT HEAP USAGE
Prefer default heaps populated by upload heaps
‒Build a ring buffer out of one or more committed upload buffer resources, and leave each buffer perpetually mapped for CPU access
‒Sequentially write data into each buffer with the CPU, aligning offsets as needed
‒Instruct the GPU to signal an increasing fence value at the end of each frame
‒Do not overwrite the data in the upload heap until the fence value indicates the GPU has finished reading the data
Reuse upload heaps for dynamic data sent to GPU throughout rendering
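The bookkeeping behind the ring buffer steps above can be sketched without any graphics API. In this hedged model, `UploadRing` is a hypothetical class that tracks offsets only; the mapped upload buffer and the actual fence are assumed to exist elsewhere, and the caller passes in the fence value it asked the GPU to signal and the value the fence has reached.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical offset tracker for a persistently mapped upload ring buffer.
class UploadRing {
public:
    static constexpr std::size_t kFailed = static_cast<std::size_t>(-1);

    explicit UploadRing(std::size_t capacity) : capacity_(capacity) {}

    // Reserve `size` bytes at an offset aligned to `alignment` (a power of
    // two). Returns the offset, or kFailed if that would overwrite data the
    // GPU may still be reading.
    std::size_t allocate(std::size_t size, std::size_t alignment) {
        assert(size > 0);
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        std::size_t start = (head_ + alignment - 1) & ~(alignment - 1);
        std::size_t padding = start - head_;
        if (start + size > capacity_) {  // does not fit before the end: wrap
            padding = capacity_ - head_;
            start = 0;
        }
        if (size > capacity_ || used_ + padding + size > capacity_) {
            return kFailed;
        }
        head_ = start + size;
        used_ += padding + size;
        frameBytes_ += padding + size;
        return start;
    }

    // End of frame: remember how much this frame consumed, tagged with the
    // fence value the GPU was told to signal when it finishes the frame.
    void endFrame(std::uint64_t fenceValue) {
        inFlight_.push_back({fenceValue, frameBytes_});
        frameBytes_ = 0;
    }

    // Release the space of every frame whose fence value has been reached.
    void retire(std::uint64_t completedFenceValue) {
        while (!inFlight_.empty() &&
               inFlight_.front().fenceValue <= completedFenceValue) {
            used_ -= inFlight_.front().bytes;
            inFlight_.pop_front();
        }
    }

private:
    struct Frame {
        std::uint64_t fenceValue;
        std::size_t bytes;
    };

    std::size_t capacity_;
    std::size_t head_ = 0;        // next write position
    std::size_t used_ = 0;        // bytes not yet known to be free
    std::size_t frameBytes_ = 0;  // bytes consumed by the frame being built
    std::deque<Frame> inFlight_;
};
```

A failed `allocate` is the signal to wait on the fence (or grow the ring) rather than overwrite; sizing the ring for two or three frames of data should make that wait rare.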
PHYSICAL MEMORY REUSE
Both reserved and placed resources must follow the same rules as Direct3D 11 tiled resources:
‒An aliasing barrier must be queued when physical memory is reused with a new resource
‒The application must initialize the resource memory with either a Clear or Copy operation when first using or re-using physical memory with a render target or depth stencil resource
Efficient Memory Use in D3D12
Dan Baker, Co-Founder of Oxide Games
D3D12 MEMORY CONTROL
D3D11 – much guesswork in driver/API on where data went, how it was referenced
ConstantBuffer dynamic map made it difficult to stream huge quantities of data efficiently
D3D12 provides explicit control over memory mapping
‒Can create one large buffer per frame and stage all data
‒No specific need for a constant buffer – becomes application construct if desired
HIGH THROUGHPUT RENDERING
To take advantage of draw call throughput, rendering must be hooked into game logic
For each unit, turret, missile trail, CPU calculates information like position or color
This data must be uploaded to the GPU – as quickly as possible
FAST DATA STREAMING TO GPU
[Diagram: data path between the CPU, its L1 data cache and L2/L3 cache, CPU memory, GPU memory and the GPU]
STREAMING THE DATA
GPU memory is not write-cached, do not read
Should always write whole cache-lines out
_mm_stream_si128
‒Writes cache-line at a time
‒Will bypass L2 and L3 Cache
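A copy loop built on `_mm_stream_si128` might look like the sketch below. Some assumptions are worth stating: `streamCopy` is a hypothetical helper, a 64-byte cache line is assumed, and each intrinsic call stores 16 bytes, so four consecutive stores are issued per line so that the write-combining hardware can send the line out whole. On targets without SSE2 the sketch falls back to `memcpy`. In real use `dst` would be the mapped upload memory; here it is any suitably aligned buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define STREAMCOPY_HAVE_SSE2 1
#endif

// Copy `bytes` from src to dst with non-temporal stores where available.
// Requirements: both pointers 16-byte aligned, `bytes` a multiple of 64.
// The destination is only written, never read.
void streamCopy(void* dst, const void* src, std::size_t bytes) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(bytes % 64 == 0);
#ifdef STREAMCOPY_HAVE_SSE2
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    const std::size_t vectors = bytes / 16;
    for (std::size_t i = 0; i < vectors; i += 4) {
        // Load one assumed 64-byte line from the source...
        const __m128i a = _mm_load_si128(s + i);
        const __m128i b = _mm_load_si128(s + i + 1);
        const __m128i c = _mm_load_si128(s + i + 2);
        const __m128i e = _mm_load_si128(s + i + 3);
        // ...and write it out with four back-to-back streaming stores.
        _mm_stream_si128(d + i, a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }
    _mm_sfence();  // make the streamed writes globally visible before returning
#else
    std::memcpy(dst, src, bytes);  // portable fallback, no cache bypass
#endif
}
```

Writing fields straight into the destination in scattered order would defeat the purpose; staging a full line in registers or cached memory first keeps each line a single burst.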
REAL-WORLD D3D12 EXAMPLE
Ashes of the Singularity – new mega RTS from Oxide and Stardock
Player may have thousands of units
Every turret, bullet and missile simulated by engine
On heavy frame, Ashes uploads 40-50 MB of data to GPU; at 60 fps that is up to 3 GB/s
‒~20% of system bandwidth on DDR3
‒If stored in CPU memory with GPU fetch, would be doubled
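The arithmetic on this slide only works out if the 40-50 figure is megabytes per frame, which is the reading assumed here. A short sketch makes the calculation explicit; the function names are hypothetical, decimal units (1 GB = 1000 MB) are assumed, and the implied total is simply derived from the slide's own "~20%" rather than from any measured DDR3 figure.

```cpp
#include <cassert>
#include <cmath>

// Sustained upload rate in GB/s for a per-frame payload given in MB.
double uploadRateGBps(double megabytesPerFrame, double framesPerSecond) {
    return megabytesPerFrame * framesPerSecond / 1000.0;
}

// Total bandwidth implied when `rateGBps` is `fraction` of that total.
double impliedTotalGBps(double rateGBps, double fraction) {
    assert(fraction > 0.0);
    return rateGBps / fraction;
}
```

With these, 50 MB per frame at 60 fps gives 3 GB/s and 40 MB gives 2.4 GB/s; taking 3 GB/s as 20% implies roughly 15 GB/s of system bandwidth, and the "doubled" case of CPU-resident data fetched by the GPU would be about 6 GB/s.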
WHAT A FRAME LOOKS LIKE IN ASHES
[Diagram: job timeline across Cores 1 to 5. In the current frame each core runs Sim Jobs followed by D3D12 CMD Jobs, alongside AI Jobs and Game Jobs; a D3D12 Present Job and GPU Memory are also shown; Sim, AI and Game Jobs for the next frame follow]
D3D12 DEMO
Demo of Ashes of the Singularity
Questions
We are hiring! Contact: Nicolas.Thibieroz@amd.com
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.