Umbra Ignite 2015: Jérémy Virga – Dishonored 2 rendering engine architecture overview – moving...

Dishonored 2 rendering engine architecture overview

Jérémy Virga, Arkane Studios (Lyon, France)

Intro

Why are we doing this? (Other than because programmers want to have fun…)

Where we come from

•  Only one thread generating GPU commands (mono context)
•  Synchronization point to exchange data
•  Often leaves CPU cores underused

[Timeline diagram, one block per frame]
MainThread:   Game logic (N) | sync | Game logic (N+1) | sync | Game logic (N+2) | …
RenderThread: Rendering (N-1) | Rendering (N) | Rendering (N+1) | …

Where we want to go

•  Sequential but spread over all cores
•  Removes an extra frame of latency and data copies
•  Requires both game logic and rendering to be completely multithreaded

[Timeline diagram, one frame: MainThread runs Game logic (N), then RenderLoop (N) (task creation & submit); Worker1, Worker2, Worker4 and Worker5 each run a series of tasks, with a background task on Worker5.]

Where we want to go

Let’s focus on the rendering part…

[Timeline diagram: MainThread runs Game logic, then RenderLoop (task creation & submit); Worker1, Worker4 and Worker5 run the tasks, with a background task on Worker5.]

Where we want to go (2): Scheduling

[Timeline diagram]
MainThread: … RenderLoop (N) …
Worker1:    Shadow 1, Post-FX
Worker2:    Opaque, Shadow 2
Worker3:    Z-prepass, Alpha
GPU (N-?):  … Shadow 1, Shadow 2, Z Prepass, Opaque, Alpha, Post-FX …
Legend: tasks generating GPU commands; “pure-cpu” tasks (no graphics context); GPU execution

•  CPU execution ordering to solve read/write data dependencies
•  Multiple GPU contexts to generate GPU commands from any thread in parallel
•  Each non “pure-cpu” task generates a « local » command buffer
•  Fine scheduling control: CPU execution (command recording) should be driven independently from GPU frame order requirements (command replay)
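
The separation between recording order and replay order can be sketched in a few lines. This is only an illustration, written in Python as compact runnable pseudocode and not taken from the engine: the names `record`, `render_frame` and `GPU_ORDER` are hypothetical, and a "command buffer" is just a list of strings.

```python
from concurrent.futures import ThreadPoolExecutor

# Order in which the GPU must replay the passes (taken from the slide).
GPU_ORDER = ["Shadow 1", "Shadow 2", "Z Prepass", "Opaque", "Alpha", "Post-FX"]

def record(pass_name):
    """Record a task-local command buffer; here simply a list of strings."""
    return [f"{pass_name}: draw {i}" for i in range(2)]

def render_frame(workers=3):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # CPU side: recording is started in an order unrelated to GPU_ORDER
        # (reversed here on purpose) and runs on whichever worker is free.
        futures = {name: pool.submit(record, name) for name in reversed(GPU_ORDER)}
        # GPU side: local command buffers are submitted strictly in GPU_ORDER,
        # whatever order the recording finished in.
        gpu_queue = []
        for name in GPU_ORDER:
            gpu_queue.extend(futures[name].result())
    return gpu_queue

frame = render_frame()
```

Whatever the recording order, `frame` starts with the Shadow 1 commands and ends with the Post-FX commands.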

VoidEngine & Dishonored 2 rendering facts

•  Environments/architecture created from small blocks, with more detail than in the previous game
   -> Lots of objects to process, thousands of draw calls
•  Instanced batch draws
   -> Batches generated based on visibility results from Umbra
•  Dedicated culling for shadow casters
•  Lots of shadow-casting lights, all dynamic with a cache system, nothing pre-baked
•  Several passes
   •  Z prepass for opaque, separate Z prepass for alpha, tiled forward shading with PBR, extra passes dedicated to effects, etc.
•  In-house indirect lighting system & IBL cubemap network
•  …

VoidEngine & Dishonored 2 rendering facts

•  Lots of « almost-independent » passes, lots of data, lots of work.
•  It became a mess to organize, and multithreading will make it worse.
•  Beyond performance, the ideal architecture requirements are:
   •  User-friendly: easy to set up and to insert new work
   •  Readable: easy to understand and follow the frame sequence
   •  Modular: easy to organize/rearrange/split/remove passes and work

…but you know the world is not ideal. So let’s try to make it not too ugly ;)

Split the renderLoop: rendering task setup

Rendering task setup: dependencies

•  Two distinct kinds of dependency
   •  « CPU dependencies » for execution scheduling / data synchronization
      -> Most critical for performance; they can create « holes » in the execution timeline
   •  « GPU dependencies » for submission ordering
      -> Very small performance impact in general; required for a consistent GPU frame, but they don’t affect CPU parallelism

[Diagram: Task A, Task B and Task C shown twice, once linked by “CPU” dependencies and once by “GPU” dependencies.]
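
A small sketch can make the difference between the two kinds concrete. It is an illustration under assumptions, in Python as runnable pseudocode: the `Task` fields and the helpers `cpu_waves` and `submission_order` are hypothetical names, not the engine's API. CPU dependencies decide which tasks may run side by side; GPU dependencies only decide the order of submissions.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    cpu_deps: set = field(default_factory=set)  # tasks that must finish executing first
    gpu_deps: set = field(default_factory=set)  # tasks that must be submitted first

def cpu_waves(tasks):
    """Group tasks into successive waves that could run in parallel on the CPU."""
    done, waves, pending = set(), [], list(tasks)
    while pending:
        wave = [t for t in pending if t.cpu_deps <= done]
        if not wave:
            raise ValueError("cyclic CPU dependencies")
        waves.append([t.name for t in wave])
        done |= {t.name for t in wave}
        pending = [t for t in pending if t.name not in done]
    return waves

def submission_order(tasks):
    """Order command-buffer submissions so that GPU dependencies are respected."""
    submitted, order, pending = set(), [], list(tasks)
    while pending:
        ready = [t for t in pending if t.gpu_deps <= submitted]
        if not ready:
            raise ValueError("cyclic GPU dependencies")
        order.append(ready[0].name)
        submitted.add(ready[0].name)
        pending.remove(ready[0])
    return order

# Same three tasks, chained once with CPU dependencies and once with GPU ones.
cpu_chain = [Task("A"), Task("B", cpu_deps={"A"}), Task("C", cpu_deps={"B"})]
gpu_chain = [Task("C", gpu_deps={"B"}), Task("B", gpu_deps={"A"}), Task("A")]
```

With `cpu_chain` the three tasks land in three successive waves, which is where holes in the timeline come from. With `gpu_chain` they all fit in one wave, and only the submission order A, B, C is constrained.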

Rendering task setup: in/out

•  Explicit input/output declarations
   •  Object lists
   •  Render target reads/writes
   •  Buffers
   •  Arbitrary user data
   •  etc.

[Diagram]
Task A: RT, RT, render list
Task B: RT, buffer
Task C: RT, user data

Rendering task setup (2): chaining

•  Explicit task chaining: an input can come from another task’s output
   •  Used for automatic dependency checking
   •  Helps readability and code maintenance: a task can be removed with limited code modification

[Diagram]
Task A: RT 0 (out)
Task B: RT 0 from A (in), RT 1 (out)
Task C: RT 1 from B (in)
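
One way to picture how chaining yields automatic dependencies is the sketch below. It is an assumption-laden illustration in Python, not the engine's interface: `Output`, `Task.out` and `dependencies` are hypothetical names. Each output remembers its producer, so a consumer's dependencies can be derived from its declared inputs.

```python
from dataclasses import dataclass, field

@dataclass
class Output:
    producer: str   # name of the task that writes the resource
    resource: str   # e.g. a render target name

@dataclass
class Task:
    name: str
    inputs: list = field(default_factory=list)    # Output objects of other tasks
    outputs: dict = field(default_factory=dict)   # resource name -> Output

    def out(self, resource):
        """Declare an output and return a handle that other tasks can chain on."""
        self.outputs[resource] = Output(self.name, resource)
        return self.outputs[resource]

def dependencies(task):
    """Derive the tasks this one depends on from its declared inputs."""
    return sorted({i.producer for i in task.inputs})

# The chain from the slide: A writes RT0, B reads RT0 and writes RT1, C reads RT1.
a = Task("A")
rt0 = a.out("RT0")
b = Task("B", inputs=[rt0])
rt1 = b.out("RT1")
c = Task("C", inputs=[rt1])
```

Here `dependencies(b)` is `["A"]` and `dependencies(c)` is `["B"]`, without any dependency having been written by hand.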

Rendering task setup (2): chaining

•  Skip condition
   •  A task can be skipped at runtime depending on the execution context (skipped effect, etc.)
      -> The scheduler automatically fixes the chaining

[Diagram: Task B is skipped]
Task A: RT 0 (out)
Task B (skipped): RT 0 from A (in), RT 1 (out)
Task C: input « RT 1 from B » is replaced by « RT 0 from A »
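
The chaining fix-up can be sketched as follows. The slides do not say which input a skipped task forwards, so the rule used here (a skipped task forwards its first input) is an assumption, and `Task`, `resolve` and `resolved_inputs` are hypothetical names. Python is used as runnable pseudocode.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    inputs: list = field(default_factory=list)  # (producer task, resource name) pairs
    skipped: bool = False

def resolve(producer, resource):
    """Follow skipped producers back to the first task that really ran."""
    while producer.skipped:
        if not producer.inputs:
            raise ValueError(f"{producer.name} is skipped and has nothing to forward")
        # Assumption: a skipped task forwards its first input unchanged.
        producer, resource = producer.inputs[0]
    return producer.name, resource

def resolved_inputs(task):
    """Inputs of a task after the scheduler has bypassed skipped producers."""
    return [resolve(p, r) for p, r in task.inputs]

a = Task("A")
b = Task("B", inputs=[(a, "RT0")])
c = Task("C", inputs=[(b, "RT1")])

before = resolved_inputs(c)   # B runs: C reads RT1 from B
b.skipped = True
after = resolved_inputs(c)    # B skipped: C reads RT0 from A instead
```

The code that sets up Task C never changes; only the resolution step differs between the two cases.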

Rendering task setup (3): advanced options

•  « Background » task: low priority, the render loop doesn’t wait for it at the end of the frame
   •  Submitted on a later frame if not ready
•  « Forced immediate » task: actually executed inline during submission
   •  Uses the main “immediate” graphics context
   •  Works around graphics middleware or platform-specific API limitations
   •  Keeps frame ordering consistent

Spreading the world: Additional helpers

•  Supports spawning new tasks from another one
   -> Multiple-producer, multiple-consumer scheduling
•  RunAsync( … );
   •  Converts any piece of code into an asynchronous call
•  ParallelFor(…);
   •  Splits processing over several workers in just one line of code
   •  Interface similar to Intel TBB [1], Microsoft PPL [2], …
•  RenderPass
   •  Encapsulation of several tasks sharing dependencies and/or inputs
   •  Scheduling still fully flexible at task level
   •  E.g. each shadow slice/part is a task, encapsulated into a single shadow pass
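
The shape of the two helpers can be sketched as below. The actual signatures of RunAsync and ParallelFor are not given in the slides, so `run_async`, `parallel_for` and the `grain` parameter are hypothetical. Python is used as runnable pseudocode; it shows the interface only, since Python threads do not speed up CPU-bound work the way native workers do.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)   # stands in for the engine's workers

def run_async(fn, *args):
    """Turn any call into an asynchronous one; returns a future."""
    return _pool.submit(fn, *args)

def parallel_for(begin, end, body, grain=64):
    """Call body(i) for every i in [begin, end), split into chunks of `grain`."""
    def run_chunk(lo, hi):
        for i in range(lo, hi):
            body(i)
    futures = [run_async(run_chunk, lo, min(lo + grain, end))
               for lo in range(begin, end, grain)]
    for f in futures:
        f.result()   # wait for the chunk and re-raise any worker exception

squares = [0] * 1000

def square(i):
    squares[i] = i * i   # each index is written by exactly one chunk

parallel_for(0, len(squares), square)          # the "one line of code" split
total = run_async(sum, squares).result()       # any call made asynchronous
```

The `grain` value controls the granularity of the split, which is the overhead versus gain trade-off that task-based schedulers have to tune.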

Rendering task examples

•  Umbra visibility jobs (cpu)
•  Drawing batch gathering/sorting (cpu)
•  Light sorting (cpu)
•  Directional shadow cascade draws (cpu/gpu)
•  Local (point/spot/area light) shadow updates (cpu/gpu)
•  Opaque pass draws (cpu/gpu)
•  Alpha pass draws (cpu/gpu)
•  … etc.

~50-70 tasks currently (about half are cpu/gpu)

Results, issues & guidelines

Results

•  We saved ~40-60% of the renderLoop duration on the first draft (on 6-8 core hardware)
   •  Excellent results on the latest consoles, still improving with SDK updates
   •  We expect the best results on the latest PC APIs (Mantle/DX12/Vulkan)
•  We improved those results significantly by tweaking tasks (see guidelines)
   •  We have to do that constantly during game development as things keep moving
•  Multitasking overhead vs. overall performance
   •  Scheduling cost, submission cost
   •  Cache misses happen more easily (when the cache is shared between cores)
   •  You should still get benefits

Issues

•  NOT for every environment:
   •  PC D3D11: efforts were made in some recent drivers, but the result depends on the IHV (independent hardware vendor)
      •  From really good improvements to horrible performance losses
      •  Could rely on D3D11_FEATURE_DATA_THREADING::DriverCommandLists with recent drivers
   •  We fall back on a hybrid mode when it is not correctly supported
      •  Only pure-cpu tasks are parallelized; gpu tasks run on just one worker, with only one graphics context
      •  GPU dependencies are converted to CPU ones to keep frame ordering consistent
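
The dependency conversion used by the hybrid mode can be sketched as a rewrite of the task list. This is an illustration under assumptions, in Python as runnable pseudocode: the `Task` fields and `to_hybrid` are hypothetical names, and pinning GPU tasks to a single worker is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    uses_gpu: bool
    cpu_deps: set = field(default_factory=set)
    gpu_deps: set = field(default_factory=set)

def to_hybrid(tasks):
    """Fold GPU dependencies into CPU ones for the single-context fallback.

    With one graphics context, commands reach the GPU in execution order,
    so submission ordering has to be enforced as execution ordering.
    """
    return [Task(t.name, t.uses_gpu, t.cpu_deps | t.gpu_deps, set()) for t in tasks]

tasks = [
    Task("Visibility", uses_gpu=False),
    Task("Shadow", uses_gpu=True, cpu_deps={"Visibility"}),
    Task("Opaque", uses_gpu=True, cpu_deps={"Visibility"}, gpu_deps={"Shadow"}),
]
hybrid = to_hybrid(tasks)
```

In the multi-context mode, Opaque could record in parallel with Shadow. In `hybrid`, it has to wait for Shadow to finish executing.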

Issues

•  …it is easy to break rendering, with random artifacts that are hard to understand.
•  We developed in-house debug tools & commands
   •  Can switch on the fly to single-threaded execution
   •  Can display intermediate tasks’ RT outputs on screen
   •  Execution timeline recording
   •  Can record the submission ordering of a buggy frame and replay it
   •  Dependency graph generation
   •  … still evolving

Issue example: the « renaming » case

Update a dynamic GPU resource in task A. Use it in a command buffer in task B.

•  Doesn’t require an extra CPU dependency between them. From the CPU’s point of view, executing update(A) before, during or after binding(B) is completely valid.

[Timeline diagram: MainThread runs the RenderLoop; Worker1 runs A (update) while Worker2 runs B (binding).]

Issue example: the « renaming » case

•  On PC D3D11, the driver handles this for you
   •  On update, it « renames » the resource, i.e. it creates another copy (version)
   •  On binding, it adds a « split point » in the command buffer each time the actual version behind a dynamic resource is unknown (= not updated within the same local command buffer)
   •  On submission, it patches all the split points of the command buffer according to the preceding submissions
      -> Bad performance overhead!
•  On consoles & new PC APIs: manual management
   •  Much more efficient
   •  Requires you to know which renamed « version » to use in the binding task (B)
      -> Input/output task chaining gives you that
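
Manual management could look like the sketch below. The slides do not describe the mechanism, so this is one possible reading: the version handle is reserved once at task setup and chained from A's output to B's input, so that B can record its binding without waiting for A. `DynamicBuffer`, `reserve` and `write` are hypothetical names, and Python is used as runnable pseudocode.

```python
class DynamicBuffer:
    """A dynamic resource whose every update targets a new version ("renaming")."""

    def __init__(self, name):
        self.name = name
        self.versions = []   # storage behind each renamed version

    def reserve(self):
        """Allocate a new, still empty version and return its handle."""
        self.versions.append(None)
        return len(self.versions) - 1

    def write(self, version, data):
        self.versions[version] = bytes(data)

def task_a_update(buffer, version):
    buffer.write(version, b"frame N constants")

def task_b_bind(buffer, version, command_buffer):
    # The exact version is known while recording, so nothing needs patching later.
    command_buffer.append(("bind", buffer.name, version))

constants = DynamicBuffer("per_frame_constants")
constants.write(constants.reserve(), b"frame N-1 constants")   # older version 0

# Task setup: reserve the version once and pass the handle to both tasks.
version = constants.reserve()
commands = []
task_b_bind(constants, version, commands)   # B may record before A runs...
task_a_update(constants, version)           # ...because the handle already exists
```

The binding refers to version 1 explicitly, so its correctness does not depend on the CPU order of the two tasks, only on A's data being written before the GPU consumes it.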

To be continued: guidelines

•  Bench it!
   •  Use low-level profiling tools to observe stalls, holes in the timeline, preemption
   •  PC: Microsoft Concurrency Visualizer [3], …
•  Improving the work split, the CPU dependencies (to prevent holes), the code patterns (to prevent CPU stalls), etc. will increase results significantly
   •  Be careful not to have too many thread context switches
   •  Tweak the core affinities of your tasks (consoles)
   •  Granularity of the split: overhead vs. performance gain

To be continued: next steps

•  Use extra GPU engines (asynchronous compute, DMA, …) to also improve GPU parallelism (consoles & new PC APIs only)
   •  Re-use the tasks’ GPU dependencies to manage GPU queue synchronization
•  Thinking about a system that lets tasks generating very small command buffers hand them to another task at the end, instead of registering for submission directly
   -> Hard to manage submission ordering correctly

Questions?

jvirga@arkane-studios.com

Bonus slide: kill mutexes

•  Mutexes are your nemesis.
•  There is often a more efficient pattern or primitive that avoids them:
   •  Use a spin lock when you know the lock duration is really small
   •  Use lockless queues, etc.
   •  Pre-allocate containers and use them with atomic index increments
   •  Use a read/write mutex when you know there are many more reads than writes on the data (several concurrent reads allowed, exclusive write)
   •  Use thread-local storage in code called concurrently
   •  …
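
The pre-allocation and worker-local ideas can be illustrated without any lock. This is a simplified sketch in Python as runnable pseudocode, with hypothetical names (`count_visible`, `partial`). Python has no standard atomic counter, so instead of an atomic index increment each worker gets a fixed slot in a pre-allocated list.

```python
import threading

def count_visible(objects, workers=4):
    partial = [0] * workers   # pre-allocated: one result slot per worker

    def run(worker_index):
        local = 0             # worker-private accumulator, never shared
        for obj in objects[worker_index::workers]:
            if obj["visible"]:
                local += 1
        partial[worker_index] = local   # single write to a slot nobody else touches

    threads = [threading.Thread(target=run, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)       # merged once, after all workers are done

objects = [{"visible": i % 3 == 0} for i in range(100)]
visible = count_visible(objects)
```

No two workers ever write to the same location, so no mutex is needed. The only synchronization is the final join.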

Bonus slide: scheduler implementation details

•  RenderLoop
   •  (A) For each task, PushTask()
      •  readyToStart (CPU dependencies satisfied + task not skipped at runtime) && there’s an available worker?
         •  Send a signal to the worker
      •  else
         •  Place it in the queue
   •  (B) Wait for a pending task submission
   •  (C) For each pending submission
      •  readyToSubmit (GPU dependencies satisfied)?
         •  Send the command buffer to the GPU queue
         •  Release/recycle it (platform dependent)
   •  Repeat (B) and (C) until all frame tasks are completed & submitted, except for “background” tasks

Bonus slide: scheduler implementation details

•  Worker
   •  (A) Wait for a task readyToStart signal
   •  (B) If the task requires a command buffer, assign it a graphics context
   •  (C) Execute the task
   •  (D) Close the command buffer
   •  (E) Place the task into the pending submission queue if the command buffer was actually filled
   •  (F) Check for any other available task to run in the scheduler queue
      •  If there is one, return to (B); else return to (A)
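
The render loop and worker steps can be condensed into a small runnable sketch. It is a simplification under assumptions and not the engine's implementation. Python is used as runnable pseudocode, all names are hypothetical, standard blocking queues replace the signalling, every finished task (even a pure-cpu one with an empty buffer) is reported to the loop, skipped and background tasks are left out, and the dependencies are assumed to be acyclic.

```python
import queue
import threading

class Task:
    def __init__(self, name, cpu_deps=(), gpu_deps=(), uses_gpu=True):
        self.name, self.uses_gpu = name, uses_gpu
        self.cpu_deps, self.gpu_deps = set(cpu_deps), set(gpu_deps)

    def execute(self, command_buffer):
        if self.uses_gpu:
            command_buffer.append(f"{self.name} commands")

def worker(ready, pending):
    while True:
        task = ready.get()                    # wait for a ready-to-start task
        if task is None:                      # shutdown marker
            return
        command_buffer = []                   # stands in for a graphics context
        task.execute(command_buffer)
        pending.put((task, command_buffer))   # hand over for submission

def render_loop(tasks, worker_count=3):
    ready, pending = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(ready, pending))
               for _ in range(worker_count)]
    for t in threads:
        t.start()
    waiting, executed, submitted = list(tasks), set(), set()
    held, gpu_queue = [], []
    while len(submitted) < len(tasks):
        # Push every task whose CPU dependencies are satisfied.
        for task in [t for t in waiting if t.cpu_deps <= executed]:
            waiting.remove(task)
            ready.put(task)
        # Wait for one task to finish executing.
        task, command_buffer = pending.get()
        executed.add(task.name)
        held.append((task, command_buffer))
        # Submit everything whose GPU dependencies are now satisfied.
        progress = True
        while progress:
            progress = False
            for item in list(held):
                if item[0].gpu_deps <= submitted:
                    held.remove(item)
                    gpu_queue.extend(item[1])
                    submitted.add(item[0].name)
                    progress = True
    for _ in threads:
        ready.put(None)
    for t in threads:
        t.join()
    return gpu_queue

frame_tasks = [
    Task("Visibility", uses_gpu=False),
    Task("Shadow 1", cpu_deps={"Visibility"}),
    Task("Z Prepass", cpu_deps={"Visibility"}, gpu_deps={"Shadow 1"}),
    Task("Opaque", cpu_deps={"Visibility"}, gpu_deps={"Z Prepass"}),
]
gpu_queue = render_loop(frame_tasks)
```

The three GPU tasks may finish recording in any order on the workers, yet `gpu_queue` always lists Shadow 1, then Z Prepass, then Opaque, because the submission step holds back command buffers until their GPU dependencies have been submitted.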

Bonus slide: Concurrency Visualizer

•  Low-level: catches CPU core stalls, memory management, preemption, sleep, IO
•  Blocking & unblocking call stacks
•  Timings
•  Markers API to make it readable
•  Observe context switches

[Screenshot caption: Catch context switches]

Bonus slide: parallelFor sample

Bonus slide: chaining sample

References

•  [1] Intel Threading Building Blocks (library): https://www.threadingbuildingblocks.org/
•  [2] Microsoft Parallel Patterns Library: https://msdn.microsoft.com/en-us/library/vstudio/dd492418.aspx
•  [3] Microsoft Concurrency Visualizer (PC profiling tool)
   •  Bundled with Visual Studio 2012
   •  Optional extension since 2013: https://visualstudiogallery.msdn.microsoft.com/24b56e51-fcc2-423f-b811-f16f3fa3af7a
