Direct3D12 and the Future of Graphics APIs by Dave Oldcorn

Dave Oldcorn, Direct3D12 Technical Lead, AMD

#| AMD Direct3D Futures | March 20th, 2014

The Problem


Mismatch between existing Direct3D and hardware capabilities:

- Lots of CPU cores, but only one stream of data
- State communication in small chunks
- Hidden work
  - Hard to predict from any one given call what the overhead might be
- Implicit memory management
- Hardware evolving away from classical register programming

API landscape

- Gap between PC raw 3D APIs and the hardware has opened up
- Very high level APIs now ubiquitous; easy to access even for casual developers, plenty of choice
- The PC APIs sit in a middle ground

[Figure: API landscape. Axis: capability, ease of use, distance from 3D engine. Labels: Application; Game Engines (Frostbite, Unity, Unreal, CryEngine, BlitzTech); Flash / Silverlight; D3D7/8, D3D9, OpenGL, D3D11; Opportunity; Console APIs; Metal (register level access).]

What are the consequences? What are the solutions?

Sequential API

- Sequential API: state for given draw comes from arbitrary previous time
  - Some states must be reconciled on the CPU (delayed validation)
  - All contributing state needs to be visible
- GPU isn't like this; it uses command buffers
  - Must save and restore state at start and end

[Figure: API input stream (... Draw, Set PS CB, Draw x 5, Set VS CB, Draw x 3, Set Blend, Set PS, Set RT state, Draw, Set VS VB, Draw ...) and the state contributing to one draw: (more, earlier), PS CB, VS CB, Blend state, PS, RT state.]
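The "state comes from an arbitrary previous time" point can be illustrated with a toy model. This is a minimal sketch, assuming a made-up command stream and state names; nothing here is a real Direct3D call. It shows that a sequential API has to carry a running copy of all state so that each draw can be resolved against whatever was set last, however long ago that was.

```python
# Toy model of a sequential API: every draw inherits whatever state was set
# most recently, possibly many calls earlier. All names are illustrative.

def resolve_draws(stream):
    """Return, for each draw, the full state snapshot it depends on."""
    current = {}      # running state the runtime / driver must track
    resolved = []
    for op, *args in stream:
        if op == "set":
            slot, value = args
            current[slot] = value
        elif op == "draw":
            # State is only reconciled when the draw arrives.
            resolved.append(dict(current))
    return resolved

stream = [
    ("set", "PS CB", "cb0"),
    ("draw",),
    ("set", "VS CB", "cb1"),
    ("set", "Blend", "alpha"),
    ("draw",),
]
draws = resolve_draws(stream)
```

In this model the second draw still depends on the "PS CB" value set before the first draw, which is why all contributing state has to stay visible to the CPU side.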

Threading a sequential API

- Sequential API threading
  - Simple producer / consumer model
  - Extra latency
  - Buffering has a cost
- More threading would mean dividing tasks on finer grain
  - Bottlenecked on application or driver thread
  - Difficult to extract parallelism (Amdahl's Law)

[Figure: application simulation, prebuild thread 0 and prebuild thread 1 feed the application render thread; the runtime / driver and the driver thread place queued buffers 0, 1, 2, ... into the GPU execution queue.]
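The reference to Amdahl's Law can be made concrete with the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of work that can run in parallel and n is the number of threads. The sketch below is illustrative only; the fractions used are assumptions, not measurements from the talk.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Upper bound on speedup when only part of the work can be parallelised."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)

# Assumed example: half of the frame's CPU work is stuck on one thread.
four_threads = amdahl_speedup(0.5, 4)
many_threads = amdahl_speedup(0.5, 1000)
```

With half the work serialised on an application or driver thread, adding threads can never double throughput, however many cores are available.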

Command buffer API

- GPUs only listen to command buffers
- Let the app build them
  - Command Lists, at the API level
- Solves sequential API CPU issues

[Figure: application simulation runs alongside thread 0 and thread 1, each building a command buffer; the buffers pass through the runtime / driver into the GPU execution queue (queued buffer 0, queued buffer 1, ...).]
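The structure on this slide, several threads each building a command buffer that is then queued in order, can be sketched as follows. This is a model of the shape of the work, not of Direct3D12 itself: `build_command_list` and `record_frame` are hypothetical names, and Python threads will not show the real CPU scaling.

```python
from concurrent.futures import ThreadPoolExecutor

def build_command_list(draws):
    """Record one chunk of draws into a self-contained list of commands."""
    commands = ["begin"]
    commands += [f"draw {d}" for d in draws]
    commands.append("end")
    return commands

def record_frame(draws, workers):
    # Split the frame's draws into one contiguous chunk per worker.
    size = max(1, -(-len(draws) // workers))   # ceiling division
    chunks = [draws[i:i + size] for i in range(0, len(draws), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in chunk order, so submission order is kept
        # even though recording happens concurrently.
        return list(pool.map(build_command_list, chunks))

queue = record_frame(list(range(8)), workers=4)
```

Because each list is self-contained, no thread needs to see state recorded by another, which is what removes the single-stream bottleneck of the sequential model.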

Better scheduling

- App has much more control over scheduling work
  - Both CPU side and GPU
- Threads don't really share much resource
- Many more options for streaming assets

[Figure: D3D11: CB building threads tend to interfere (driver thread, create thread; render work, create work); GPU load still added but only after queuing. D3D12: CB building threads more independent (create thread, build threads). GPU executes.]

Pipeline objects

- Pipeline objects get rid of JIT and enable LTCG for GPUs
- Decouple interface and implementation
- We're aware that this is a hairpin bend for many graphics engines to negotiate
  - Many engines don't think in terms of predicting state up front
  - The benefits are worth it

[Figure: simplified dataflow through pipeline. Labels: VS, PS, Index Process, Primitive Generation, Rasteriser, Rendertarget Output, ???]

Render object binding mismatch

- Hardware uses tables in video memory
- BUT still programmed like a register solution
- So one bind becomes:
  - Allocate a new chunk of video memory
  - Create a new copy of the entire table
  - Update the one entry
  - Write the register with the new table base address

[Figure: on-chip root table (1 per stage) with entries SR and CB, each a pointer to a table (here, textures; constant buffers); SRD table in GPU memory, whose entries are pointers to (+ params of) a resource in GPU memory.]
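The four steps a single bind turns into can be counted with a small model. This is a sketch under assumptions: `TableBinder` is a hypothetical class, the 16-entry table size is invented, and Python lists stand in for video memory.

```python
class TableBinder:
    """Counts the work done when a table-based GPU is driven one bind at a time."""

    def __init__(self, table):
        self.table = list(table)    # table the "register" currently points at
        self.allocations = 0        # table-sized chunks allocated so far
        self.entries_copied = 0     # descriptors copied so far

    def bind(self, slot, resource):
        new_table = [None] * len(self.table)    # 1. allocate a new chunk
        self.allocations += 1
        new_table[:] = self.table               # 2. copy the entire table
        self.entries_copied += len(self.table)
        new_table[slot] = resource              # 3. update the one entry
        self.table = new_table                  # 4. write the new base address

binder = TableBinder([f"tex{i}" for i in range(16)])
binder.bind(3, "shadow_map")
binder.bind(5, "noise")
```

Two single-slot binds cost two allocations and 32 descriptor copies in this model, so the cost of a bind scales with the table size rather than with the one entry that changed.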

Descriptor Tables

- Several tables of each type of resource
  - Easy to divide up by frequency
- Tables can be of arbitrary size; dynamically indexed to provide bindless textures
- Changing a pointer in the root table is cheap
- Updating a descriptor in a table is not so cheap
  - Some dynamic descriptors are a requirement, but avoid in general

[Figure: on-chip root table with entries SR.T[0], SR.T[1], SR.T[2], SR.T[3], UAV, CB.T[0], CB.T[1], Samp. A pointer to a table (textures table 0) leads to an SRD table in GPU memory holding SR.T[0][0], SR.T[0][1], SR.T[0][2]; a pointer to a table (constbuf table 1) leads to CB.T[1][0], CB.T[1][1].]
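The split between cheap root-table pointer changes and pre-built tables can be sketched as below. All names are hypothetical: `RootTable`, the slot strings, and the table contents are illustrative, and a dictionary stands in for GPU memory.

```python
# Descriptor tables are built once, up front, and grouped by how often they change.
tables = {
    "per_frame_textures": ["shadow_map", "env_map"],
    "material_a_textures": ["albedo_a", "normal_a"],
    "material_b_textures": ["albedo_b", "normal_b"],
}

class RootTable:
    """Small on-chip table whose entries point at descriptor tables."""

    def __init__(self):
        self.slots = {}
        self.pointer_writes = 0

    def set_table(self, slot, table_name):
        self.slots[slot] = table_name    # only a pointer changes
        self.pointer_writes += 1

    def fetch(self, slot, index):
        # Dynamic indexing into whichever table the slot points at.
        return tables[self.slots[slot]][index]

root = RootTable()
root.set_table("SR.T[0]", "per_frame_textures")
root.set_table("SR.T[1]", "material_a_textures")
first = root.fetch("SR.T[1]", 0)
root.set_table("SR.T[1]", "material_b_textures")   # switch material
second = root.fetch("SR.T[1]", 0)
```

Switching material here writes one pointer and touches no descriptors, which is the cheap path the slide recommends over updating descriptors in place.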

KEY innovations

Table columns: Innovation, CPU-side win, GPU-side win. Wins are listed per innovation in slide order.

- Command buffers: build on many threads; control of scheduling; lower latency; simplified state tracking
- Pipeline state objects: link at create time; no JIT shader compiles; efficient batched updates; cheaper state updates; enables LTCG
- Bind objects in groups: cheap to change group (listed under both CPU and GPU); fits hardware paradigm
- Move work to Create: predictability; enables optimisations

KEY innovations (continued)

Table columns: Innovation, CPU-side win, GPU-side win. Wins are listed per innovation in slide order.

- Explicit synchronisation: efficiency; required for bindless textures; less overhead
- Explicit memory management: efficiency; predictability; application flexibility; zero copy; control over placement
- Do less: predictability, efficiency; enables aggressive schedule
- FEWER BUGS

NEW PROBLEMS (And tips to solve them)

New visible limits

- More draws in does not automatically mean more triangles out
- You will not see full rendering rates with triangles averaging 1 pixel each
- Wireframe mode should look different to filled rendering

New visible limits

- Feeding the GPU much more efficiently means exploring interesting new limits that weren't visible before
- 10k/frame of anything is ~1µs per thing
- GPU pipeline depth is likely to be 1-10µs (1k-10k cycles)
- Specific limit: context registers
  - Root shader table is NOT in the context
  - Compute doesn't bottleneck on context
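The per-item budget on this slide follows from simple arithmetic. The sketch below assumes a 60 Hz frame rate and a GPU clock of about 1 GHz; neither figure is stated on the slide, so treat both as placeholders.

```python
FRAME_RATE_HZ = 60        # assumption: not stated on the slide
GPU_CLOCK_HZ = 1e9        # assumption: roughly 1 GHz
ITEMS_PER_FRAME = 10_000  # "10k/frame of anything"

frame_time_us = 1e6 / FRAME_RATE_HZ
budget_us = frame_time_us / ITEMS_PER_FRAME

def cycles_to_us(cycles, clock_hz=GPU_CLOCK_HZ):
    """Convert a pipeline depth in cycles to microseconds."""
    return cycles / clock_hz * 1e6

depth_low_us = cycles_to_us(1_000)
depth_high_us = cycles_to_us(10_000)
```

Under these assumptions the budget is about 1.7µs per item, the same order of magnitude as a pipeline depth of 1-10µs, so anything that drains the pipeline once per item could consume the whole budget.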

Application in charge

- Application is arbiter of correct rendering
  - This is a serious responsibility
  - The benefits of D3D12 aren't readily available without this condition
- Applications must be warning-free on the debug layer
- Different opportunities for driver intervention
- Consider controlling risk by avoiding riskier techniques

Application in charge

- No driver thread in play
- App can target much lower latency
- BUT implies app has to be ready with new GPU work

[Figure: timelines for driver, app render and GPU across frames 1 to 3 (F1, F2, F3). Captions: "D3D11: No dead GPU time after 1st frame (but extra latency)"; "Dead time"; "First work sent to driver"; "Driver buffers Present; no future dead time"; "No buffered present reveals dead time on GPU".]

Use command buffers sparingly

- Each API command list maps to a single hardware command buffer
- Starting / ending a command list has an overhead
  - Writes full 3D state, may flush caches or idle GPU
- We think a good rule of thumb will be to target around 100 command buffers/frame
  - Use the multiple submission API where possible

[Figure: multiple applications running on system. Application 0 queue (CB0, CB1, CB2, CB0) and application 1 queue (CB0, CB1, CB2, CB0) feed "GPU executes".]
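One way to act on the rule of thumb of around 100 command buffers per frame is to size batches from the draw count. This is a sketch only: `draws_per_list` and `split_into_lists` are hypothetical helpers, and a real engine would more likely split along pass or scene boundaries.

```python
TARGET_LISTS_PER_FRAME = 100   # rule of thumb from the slide

def draws_per_list(total_draws, target_lists=TARGET_LISTS_PER_FRAME):
    """Smallest batch size that keeps the frame at or under the target."""
    return max(1, -(-total_draws // target_lists))   # ceiling division

def split_into_lists(draws, target_lists=TARGET_LISTS_PER_FRAME):
    size = draws_per_list(len(draws), target_lists)
    return [draws[i:i + size] for i in range(0, len(draws), size)]

lists = split_into_lists(list(range(10_000)))
```

With 10,000 draws this gives 100 lists of 100 draws each, so the fixed start / end overhead of a command list is paid 100 times per frame rather than once per draw.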

Round-up

All-new

- There's a learning curve here for all of us
- In the main it's a shallow one
  - Compared at least to the general problem of multithreaded rendering
  - Multithread is always hard
- Simpler design means fewer bugs and more predictable performance

What AMD plan to deliver

- Release driver for Direct3D12 launch
- Continuous engagement
  - With Microsoft
  - With ISVs
- Bring your opinions to us and to Microsoft

QUESTIONS

