© 2018 Arm Limited Hans-Kristian Arntzen 2018-08-16 – SIGGRAPH 2018 Rendering Structures Analyzing modern rendering on mobile


Page 1: Rendering Structures - ARM architecture

© 2018 Arm Limited

• Hans-Kristian Arntzen
• 2018-08-16 – SIGGRAPH 2018

Rendering Structures

Analyzing modern rendering on mobile


Content

1. Motivation
2. Scene and lights
3. Rendering structures overview
4. Benchmark results
5. Post-AA benchmarking


Motivation

• Performance characteristics for mobile architectures differ from desktop

• Very little comparative data on rendering many lights on mobile

• Explore the most promising rendering structures for mobile

• Focus on Mali

• Midgard GPU family is very different to Bifrost


The scene

Sponza, duh! [3]


Classic deferred shading


Pros and Cons

• Pros
  • Arbitrary number of lights
  • Arbitrary different light types
  • Separate shadow maps per light
  • Small, decoupled shaders
  • Robust against geometry overdraw

• Cons
  • No (easy) MSAA
  • No (easy) transparency
  • False positives in shading (single-sided depth test)


Clustered shading


Forward and deferred clustered shading

• Forward
  • Look up cluster and shade all lights when rendering a mesh directly

• Deferred
  • Render a full-screen quad in lighting pass and shade all positional lights in one go based on reconstructed position


Pros and Cons

• Pros
  • Very flexible
    – Deferred, Forward, Transparency, MSAA, Volumetrics
  • Can be computed before knowing depth buffer
  • Can be expanded to support more than just lights

• Cons
  • All resources (e.g. shadow maps) need to be bound
  • Some fixed overhead to sample cluster
  • Very heavy shader


Forward Z pre-pass


The forward depth prepass

• Pixels shaded with forward clustering are heavy

• Want to avoid over-shading

• Perfect front-to-back sorting is impractical

• Can potentially be done on-chip

• Sometimes unavoidable for certain effects
  • Screen-space AO techniques


The forward depth prepass

• Pros
  • No over-shading
  • More flexible drawing order

• Cons
  • Double the geometry load
  • More bandwidth required
  • More CPU overhead
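
To make the over-shading argument concrete, here is a minimal sketch that counts how often the heavy fragment shader runs for a single pixel, with and without a depth prepass. It is a toy model, not the Granite implementation: it assumes opaque geometry, a LESS depth test, an EQUAL test after the prepass, and no hardware hidden-surface removal. The function names are hypothetical.

```python
import random

def shaded_fragments(depths_in_draw_order, prepass):
    """Count heavy shader invocations for one pixel (smaller depth = closer)."""
    if prepass:
        # The prepass has already written the nearest depth, so only
        # fragments equal to it pass the depth test in the main pass.
        nearest = min(depths_in_draw_order)
        return sum(1 for d in depths_in_draw_order if d == nearest)
    shaded = 0
    current = float("inf")
    for d in depths_in_draw_order:
        if d < current:  # LESS test passes, so the fragment gets shaded
            shaded += 1
            current = d
    return shaded

def average_shading(layers, pixels, prepass, seed=0):
    """Average invocations per pixel for randomly ordered opaque layers."""
    rng = random.Random(seed)
    total = 0
    for _ in range(pixels):
        depths = [rng.random() for _ in range(layers)]
        total += shaded_fragments(depths, prepass)
    return total / pixels
```

In this model, back-to-front submission shades every layer, perfect front-to-back submission shades one, and the prepass always shades one regardless of order. The price is that the geometry is submitted twice, which matches the cons listed above.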


Comparison

How many pixels are touched?


Results

Using the Granite renderer [2]


The grand benchmark

• Create a massive benchmark sweep

• Compare the rendering structures head-to-head

• Turn knobs on and off, and see how it affects performance


Test hardware and renderer

• Mobile

• Desktop reference points

• Vulkan

• PBR

• Multipass techniques used for deferred


Midgard (T-880) greatly prefers deferred techniques

[Chart: normalized time vs. # spot lights (0 to 27). Series: Classic deferred, Clustered deferred, Forward clustered, Forward clustered prepass, Clustered stencil culling]


Clustered shading scales very well on Bifrost (G71 & G72)

[Chart: normalized time vs. # spot lights (0 to 27). Series: Classic deferred, Clustered deferred, Forward clustered, Forward clustered prepass, Clustered stencil culling]


Gap between deferred and forward is greater on desktop

[Chart: RX470 & GTX1060 average. Normalized time vs. # spot lights (0 to 27). Series: Classic deferred, Clustered deferred, Forward clustered, Forward clustered prepass, Clustered stencil culling]


Forward depth prepass might be a good idea

[Chart: normalized time (0 to 1.2) for Desktop, Midgard and Bifrost. Series: Off, On]


MSAA is super expensive with microgeometry

[Chart: normalized time (0 to 1.8) for Forward clustered, Forward clustered prepass, Forward clustered (4x MSAA), Forward clustered prepass (4x MSAA), Classic deferred, Stencil culling, Clustered deferred. Series: Desktop, Midgard, Bifrost]


Clustered shading actually improves bandwidth

[Chart: normalized DDR transactions for Read, Write and Total. Series: Classic deferred, Stencil culling, Clustered deferred, Forward Clustered, Forward Clustered Prepass, Forward Clustered (4x MSAA), Forward Clustered Prepass (4x MSAA)]


One tenth performance compared to mid-range desktop

[Chart: normalized time against desktop, per technique (0 to 18), for Desktop, Midgard and Bifrost. Series: Forward clustered, Forward clustered prepass, Forward clustered (4x MSAA), Forward clustered prepass (4x MSAA), Classic deferred, Stencil culling, Clustered deferred]


Post-AA rundown


The usual suspects

• FXAA
  • 9-tap

• SMAA
  • Low, Medium, High, Ultra

• TAA

[Figure: TAA building blocks (Bilinear, Bicubic, 5-tap cross, 3x3, Rounded corner, Clamp, AABB clip, Variance clipping, RGB, YCgCo, None, MAX3) and the presets built from them (Low, Medium, High, Ultra, Extreme, Nightmare)]


Results

[Chart: Post-AA rundown (normalized MP1). Cycles per pixel on a single-core GPU (0 to 90). Series: Midgard, Bifrost]


Links

[1] - http://efficientshading.com/wp-content/uploads/s2015_mobile.pptx

[2] - https://github.com/Themaister/Granite

[3] - https://github.com/KhronosGroup/glTF-Sample-Models/tree/master/2.0/Sponza

[4] - http://advances.realtimerendering.com/s2016/Siggraph2016_idTech6.pdf

[5] - https://www.khronos.org/assets/uploads/developers/library/2017-gdc/GDC_Vulkan-on-Mobile_Vulkan-Multipass-ARM_Mar17.pdf

Thank you


Bonus slides


Clustered stencil culling


Clustered stencil culling

• Basically, bucket the positional lights into N buckets

• Each bucket gets its own stencil bit
  • I used 7 bits for buckets and 1 for masking the background, YMMV

• Render backfaces with greater test at end of G-buffer pass
  • Each bucket sets its own stencil bit if depth passes
  • Instance all lights in a bucket

• In lighting pass
  • Less-than test, also test stencil read-only if the bucket bit is set.
  • Effectively a conservative double-sided test is achieved.
  • Cuts out a lot of overdraw against background.
  • Lights which clip near plane do not participate in cluster, just use back-face test as-is.
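
The two passes above can be modelled per pixel with a few lines of code. This is only a sketch of my reading of the slide: each light is reduced to the (front, back) depth interval its volume covers at the pixel, lights are assigned to buckets round-robin (the slide does not say how binning is done), and the lighting pass is assumed to draw front faces with a less-than test. All names are hypothetical.

```python
NUM_BUCKETS = 7  # stencil bits used for buckets; the eighth masks background

def bucket_of(light_index):
    # Assumption: round-robin binning, chosen only for illustration.
    return light_index % NUM_BUCKETS

def mark_stencil(scene_depth, lights):
    """End of G-buffer pass: back faces drawn with a GREATER depth test."""
    stencil = 0
    for i, (_front, back) in enumerate(lights):
        if back > scene_depth:  # light volume extends behind the surface
            stencil |= 1 << bucket_of(i)
    return stencil

def single_sided_test(scene_depth, lights):
    """Classic deferred: only the LESS test on the front face."""
    return [i for i, (front, _back) in enumerate(lights) if front < scene_depth]

def lights_to_shade(scene_depth, lights, stencil):
    """Lighting pass: LESS test plus a read-only test of the bucket bit."""
    return [
        i
        for i, (front, _back) in enumerate(lights)
        if front < scene_depth and stencil & (1 << bucket_of(i))
    ]
```

For a surface at depth 0.5 and light intervals (0.2, 0.3), (0.4, 0.6) and (0.7, 0.9), the single-sided test shades lights 0 and 1, while the stencil variant shades only light 1. The test stays conservative because a stencil bit is shared by every light in the bucket, so a light can still be shaded when a different light in its bucket set the bit.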


Compared to classic deferred

• Pros
  • Can reduce false positives in shading

• Cons
  • Need to free up some bits in the stencil buffer
  • Some extra early-ZS fill-rate required
  • Some work required to bin lights to stencil bits


Clustered shading

• Has been presented at SIGGRAPH 2015 in the past [1]
  • Unfortunately, no real performance data

• Extremely flexible
  • Does not need to know depth buffer up-front
  • Supports marching-like techniques for volumetrics
  • Can be computed on CPU or GPU (I use GPU)

• Fully supports
  • Forward
  • Deferred
  • MSAA
  • Transparency

• To shade
  • Look up (offset, count) or equivalent from a 3D texture based on rasterized position.
  • Iterate and shade lights, lights can be stored in a large buffer.
  • Needs atlasing techniques (or bindless) for shadowmaps.
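
The "to shade" steps can be sketched on the CPU as follows. This is an illustration under assumptions, not the shader used in the talk: the 3D texture is modelled as a dictionary from cell coordinates to (offset, count), the grid is a uniform world-space grid, and the lighting term is a placeholder quadratic falloff. Every name and parameter here is hypothetical.

```python
def cluster_index(position, grid_min, cell_size, resolution):
    """Map a position to integer (x, y, z) cell coordinates, or None if outside."""
    coords = []
    for p, lo, n in zip(position, grid_min, resolution):
        c = int((p - lo) // cell_size)
        if c < 0 or c >= n:
            return None
        coords.append(c)
    return tuple(coords)

def shade(position, cluster_table, light_indices, lights,
          grid_min, cell_size, resolution):
    """Look up (offset, count) for the cell, then iterate its light list."""
    cell = cluster_index(position, grid_min, cell_size, resolution)
    if cell is None:
        return 0.0
    offset, count = cluster_table.get(cell, (0, 0))
    total = 0.0
    for i in light_indices[offset:offset + count]:
        light = lights[i]  # lights live in one large buffer
        d2 = sum((a - b) ** 2 for a, b in zip(position, light["position"]))
        r2 = light["radius"] ** 2
        if d2 < r2:
            # Placeholder falloff, standing in for the real PBR evaluation.
            total += light["intensity"] * (1.0 - d2 / r2)
    return total
```

The fixed cost mentioned in the cons is visible here: every shaded position pays for the cell lookup even when its light list is empty.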


Implementation

• Async compute shader computes per cell in cluster
  • Low-resolution pre-pass in compute
  • Prunes lights in 4x4x4 blocks before testing at full res.

• Accurate intersection tests are easy
  • Because we have perfect small cubes, we can approximate well treating cube as a sphere
  • No need for conservative raster which is popular for the common «froxel» layout

• Bitmasks limit number of maximum lights
  • Can also use classic «list» of lights for arbitrary amounts, but more expensive to compute list on GPU
  • Shading performance seems the same (within margin of error), so going to leave it at that
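
A CPU-side sketch of the ideas on this slide: cubes treated as their bounding spheres, one bitmask per cell, and a coarse 4x4x4 block test that prunes lights before the per-cell test. It assumes spherical point lights given as (position, radius) pairs and a grid anchored at the origin; spot lights would need a cone test, and the real implementation runs in a compute shader. All names are hypothetical.

```python
import math

def bounding_radius(side):
    """Radius of the sphere that encloses a cube with the given side length."""
    return side * math.sqrt(3.0) / 2.0

def light_mask(center, side, lights, candidates):
    """Bitmask of candidate lights that touch the cube's bounding sphere."""
    reach = bounding_radius(side)
    mask = 0
    for i in candidates:
        position, radius = lights[i]
        if math.dist(center, position) <= radius + reach:
            mask |= 1 << i
    return mask

def build_cluster(resolution, cell_size, lights, block=4):
    """Return {cell: bitmask}, pruning lights per block before the fine test."""
    masks = {}
    nx, ny, nz = resolution
    everything = range(len(lights))
    for bx in range(0, nx, block):
        for by in range(0, ny, block):
            for bz in range(0, nz, block):
                block_center = tuple((b + block / 2) * cell_size for b in (bx, by, bz))
                coarse = light_mask(block_center, block * cell_size, lights, everything)
                survivors = [i for i in everything if coarse >> i & 1]
                for x in range(bx, min(bx + block, nx)):
                    for y in range(by, min(by + block, ny)):
                        for z in range(bz, min(bz + block, nz)):
                            center = tuple((c + 0.5) * cell_size for c in (x, y, z))
                            masks[(x, y, z)] = light_mask(center, cell_size, lights, survivors)
    return masks
```

The sphere approximation only adds false positives, never false negatives, because the cube lies inside its bounding sphere. The width of the mask word is what caps the number of lights, which is the limit the last bullet refers to.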


Light sweep test

Test how the different rendering algorithms react to an increasing number of spot lights:
• Classic deferred (blend light volumes over frame buffer)
• Stencil culling (same as classic deferred, but tries to reduce false positives)
• Forward clustered
• Deferred clustered (all positional lights rendered as a single full-screen quad)
• Forward clustered prepass (On-tile, run prepass depth in same render pass)

Barebones rendering outside positional lights:
• LDR is used to avoid constant overhead of HDR bloom.
• Shadows for the directional light are turned off to avoid overhead of rendering shadows.
• No MSAA.

Numbers on mobile are presented with normalized runtime based on:
• GPU cycle counts
• Peak clock speed
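
One plausible way to turn the two inputs into a normalized runtime is to divide cycles by peak clock and then divide by a chosen baseline. The sketch below assumes exactly that; the slides do not spell out the formula, and the function names and the figures in the usage note are made up for illustration.

```python
def frame_time_ms(gpu_cycles, peak_clock_hz):
    """Time the GPU would need for the given cycles at its peak clock."""
    return gpu_cycles / peak_clock_hz * 1000.0

def normalize(times_ms, baseline):
    """Express every technique's time relative to the chosen baseline."""
    reference = times_ms[baseline]
    return {name: t / reference for name, t in times_ms.items()}
```

For example, a hypothetical 8,500,000 cycles at a hypothetical 850 MHz peak clock corresponds to about 10 ms. Using peak clock removes the effect of frequency scaling during a run, so results from different devices can be compared per technique.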


Method sweep test

Tries to look at aggregate results to answer questions like:
• How expensive are shadows?
• How expensive is VSM vs PCF 1x1?
• Prepass vs no prepass?
• Does anything stick out compared to desktop?

Also have bandwidth numbers captured on the S9+.

[Chart: Directional light shadows bandwidth. DDR transactions for Read, Write and Total. Series: Off, On]

[Chart: Positional light shadows bandwidth. DDR transactions for Read, Write and Total. Series: Off, On]

[Chart: Forward prepass bandwidth. DDR transactions for Read, Write and Total. Series: Off, On]

[Chart: Shadow mapping filter bandwidth. DDR transactions for Read, Write and Total. Series: PCF 1x1, VSM]

[Chart: Stencil culling bandwidth. DDR transactions for Read, Write and Total. Series: Off, On]

[Chart: HDR vs LDR bandwidth. DDR transactions for Read, Write and Total. Series: LDR, HDR + bloom]

[Chart: Deferred methods bandwidth. DDR transactions for Read, Write and Total. Series: Classic deferred, Clustered deferred]


Observations

Bifrost deals way better with forward shading than Midgard
• Far more registers available, and scalar instead of vector

Clustered shading is great on Bifrost
• Deferred with multipass or forward, either is good

Forward prepass is surprisingly good
• Gets better with more complex content
• Bandwidth hit isn’t as extreme as I expected
• Expect there to be a cutoff point

MSAA is expensive with denser geometry
• We avoid all the bandwidth hit, but still need to shade a lot more partial quads

~10x gap to mid-range desktop
• At least with this renderer ☺


The usual suspects

• FXAA
  • The 9-tap version
  • It’s light on arithmetic, so it’s basically a 9-tap filtering benchmark ☺

• SMAA
  • Low
  • Medium
  • High
  • Ultra

• TAA
  • There is no de-facto reference TAA implementation
  • I implemented various well-known refinements as building blocks
  • Made some «presets»


TAA variants

• Low
  • RGB input used for clamping
  • No HDR luminance adjustment
  • Neighbor rejection method based on clamping to 5-tap cross
  • Nearest depth / max velocity found from 5-tap cross

• Medium
  • Turns on HDR luminance adjustment so we blend in tonemapped space (reduces flicker)

• High
  • Uses AABB clipping for neighbors (less color squashing in the AABB corner)
  • Rounded corner method to find neighbor color AABB

• Ultra
  • Converts RGB to YCgCo for neighbor clipping purposes (retains hue better)
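
The difference between clamping and clipping the history color, and the YCgCo conversion used by the Ultra preset, can be sketched as below. This uses the commonly published formulations (per-channel clamp, clip toward the AABB center, the standard YCgCo transform) and is not taken from the presenter's shaders; function names are hypothetical.

```python
def rgb_to_ycgco(color):
    r, g, b = color
    return (0.25 * r + 0.5 * g + 0.25 * b,
            -0.25 * r + 0.5 * g - 0.25 * b,
            0.5 * r - 0.5 * b)

def ycgco_to_rgb(color):
    y, cg, co = color
    tmp = y - cg
    return (tmp + co, y + cg, tmp - co)

def clamp_to_aabb(history, lo, hi):
    """Per-channel clamp: out-of-range colors collapse onto faces and corners."""
    return tuple(min(max(h, l), u) for h, l, u in zip(history, lo, hi))

def clip_to_aabb(history, lo, hi):
    """Pull the history color toward the AABB center until it enters the box."""
    center = tuple(0.5 * (l + u) for l, u in zip(lo, hi))
    extent = tuple(0.5 * (u - l) for l, u in zip(lo, hi))
    offset = tuple(h - c for h, c in zip(history, center))
    scale = 1.0
    for o, e in zip(offset, extent):
        if abs(o) > e:
            scale = max(scale, abs(o) / e) if e > 0 else float("inf")
    return tuple(c + o / scale for c, o in zip(center, offset))
```

With the unit box and a history color of (2.0, 1.0, 0.5), clamping yields (1.0, 1.0, 0.5) while clipping yields roughly (1.0, 0.667, 0.5): clipping keeps the direction from the box center, which is the "less color squashing" effect named for the High preset. Running the same functions on YCgCo triples instead of RGB gives the Ultra behaviour.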


More intense variants

• Extreme
  • Neighbor clamping method changed to rounded corner with variance clipping
  • Nearest depth / max velocity method bumped to a 3x3 grid

• Nightmare
  • When sampling the history buffer, use a 9-tap bicubic filter
    – Trades a lot of arithmetic to avoid full 16-tap bicubic
    – Massive blur reduction in motion
    – This final shader is about 27 texel fetches per pixel
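
Variance clipping, as named in the Extreme preset, is usually described as building the neighbor color box from the mean and standard deviation of the neighborhood rather than from its min and max. The sketch below follows that common description; the `gamma` scale factor and the function name are assumptions, and how the talk combines it with the rounded-corner neighborhood is not shown here.

```python
import math

def variance_aabb(neighbors, gamma=1.0):
    """Color box of mean +/- gamma * stddev, computed per channel."""
    n = len(neighbors)
    lo, hi = [], []
    for channel in range(3):
        m1 = sum(c[channel] for c in neighbors) / n        # first moment
        m2 = sum(c[channel] ** 2 for c in neighbors) / n   # second moment
        sigma = math.sqrt(max(m2 - m1 * m1, 0.0))
        lo.append(m1 - gamma * sigma)
        hi.append(m1 + gamma * sigma)
    return tuple(lo), tuple(hi)
```

The history color is then clipped or clamped against this box. A smaller `gamma` rejects stale history more aggressively, and a larger one tolerates more of it.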


Observations

• Mobile is 10-20x slower than mid-range desktop
  • The gap is larger than for regular rendering

• Post-AA is still hard to fit into a budget
  • At 720p, FXAA seems reasonable (~1 ms)

• Mali-G71 and G72 are neck and neck on some post-AA
  • Texture pipe throughput is theoretically the same per clock
  • µ-arch improvements show up nicely on SMAA
  • Otherwise, uplift seems to be eaten by texture throughput/bandwidth.

• T-880 (and other Midgard GPUs) don’t scale with large, complex shaders