56
SPU Render Arseny “Zeux” Kapoulkine CREAT Studios [email protected] http://zeuxcg.org/

SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

SPU Render

Arseny “Zeux” KapoulkineCREAT Studios

[email protected]://zeuxcg.org/

Page 2: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

2

Introduction

● Smash Cars 2 project

– Static scene of moderate size

– Many dynamic objects

– Multiple render passes

– Totals up to 3000 batches per frame

● PPU render up to 12 ms

– Target – 60 fps :(

Page 3: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

3

Introduction

Page 4: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

4

Optimization techniques

● PPU code optimizations

– Has been done several times

– Would like PPU time to become ~0

● Static command buffers

– Somewhat restricted

– Culling is unclear

● Move code to SPU

Page 5: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

5

Agenda

● Render design

● Brief description of SPU

● Porting

● Development

● Q & A

Page 6: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

6

Agenda

● Render design

● Brief description of SPU

● Porting

● Development

● Q & A

Page 7: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

7

Render – “high” level

● Rendering is done on sets of RenderItem

– The sets are already sorted and culled

● RenderItem aggregates:

– SceneNode

– Material

– Shader

– RenderEntity

Page 8: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

8

Render – SceneNode

● Transform graph node

– Local transform

– Global transform (derived from local)

● Local transforms are set by misc code

– Animations

– Physics

– Game logic

Page 9: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

9

Render – Shader

● Render pipeline setup algorithm

– virtual void apply● Setups auto-parameters

– Are computed automagically by the system– WorldViewProjection, ShadowMap, etc.

– virtual void setup● Setups material

– Material parameters (including textures) – Shaders

● 99% objects are of final type HWShader

Page 10: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

10

Render – Material

● Container of instance data for Shader

– Data layout description● Parameter name/type● Offset in data array

– Data array

– Accessors for name/index (get/set)

– Render states● Blend, alpha test, depth, cull

Page 11: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

11

Render – RenderEntity

● Drawing algorithm

– virtual void render

● Several implementations

– RenderStaticGeometry

– RenderSkinnedGeometry

– RenderMorphedGeometry

– DynamicObject

Page 12: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

12

Render – low level

● Cross-platform wrappers

– State setup (with cache)

– Vertex/pixel constant setup

– Shader setup

● GCM implementation

– PS3-specific API for CB generation● Is mostly present on SPU

– This makes porting easier

Page 13: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

13

Agenda

● Render design

● Brief description of SPU

● Porting

● Development

● Q & A

Page 14: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

14

SPU – what is it?

● 6 like cores

– 3.2 GHz, in-order, dual-issue

– 128 vector registers

– Local Storage (LS)● 256 Kb – code + data● 6 cycle latency● External memory is accessed via DMA

– Asynchronous memcpy (LS ↔ memory)– Alignment/size restrictions

Page 15: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

15

SPU – porting tasks

● Build code for SPU

● Run code on SPU

– Task/job manager

– Code/data size

– Virtual functions

● Optimization

– Effective usage of DMA

– Code optimization

Page 16: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

16

Agenda

● Render design

● Brief description of SPU

● Porting

● Development

● Q & A

Page 17: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

17

Porting steps

● Step 1 – working prototype

● Step 2 – data optimization

● Step 3 – code optimization

Page 18: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

18

Porting steps

● Step 1 – working prototype

● Step 2 – data optimization

● Step 3 – code optimization

Page 19: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

19

Porting steps

● Step 1 – working prototype

– Speed does not matter● Non-optimal code, synchronous DMA

– Complete functionality

● Step 2 – data optimization

● Step 3 – code optimization

Page 20: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

20

Step 1 – PPU interface

● async::Renderer

– Simple interface● push(RenderItem) (+ batch versions)● flush()● kick()

– The limits are set when creating renderer● Maximum number of items● Maximum CB size

– Double-buffering for CB

Page 21: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

21

Step 1 – PPU interfacePPURenderer 1 Renderer 2

pushpushpush

push

itemitem

itemitem

flush

flushSPU 1 SPU 2

CB CB

render jobrender job

kick2 GPU CBrender 2

render miscrender 1

rendermisc

kick1

GPUrender 2

render misc

render 1

Write

Read

Page 22: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

22

Step 1 – DMA helpers

● Convenience functions to simplify DMA

– Allocator● Trivial stack allocator, ptr += size

– fetchData(ea, size)● Memory allocation and synchronous DMA● Can handle misalignment

– fetchObject / fetchObjectArray● Typed versions of fetchData

– Later we made asynchronous variants

Page 23: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

23

Step 1 – DMA helpers

● void* fetchData(alloc, ea, size)uint32_t sizeAligned = (size + (ea & 15) + 15) & ~15;

void* ls = alloc.allocate(sizeAligned);

DmaGet(ls, ea & ~15, sizeAligned);

DmaWait();

return (char*)ls + (ea & 15);

● T* fetchObject(alloc, ea)return (T*)fetchData(alloc, ea, sizeof(T));

Page 24: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

24

Step 1 – virtual functions

● PPU vfptr does not make sense on SPU

● The solution varies across interface

– Shader● Single supported shader type – HWShader

– RenderEntity● Enum for all supported types● Enum value is stored in unused pointer bits

– ptr = actual_ptr | type // actual_ptr % 4 == 0

Page 25: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

25

Step 1 – encapsulation

● Makes porting harder

– Methods with incorrect SPU code● CRT_ASSERT(next->prev == this)

– Additional method parameters● render() → render(Context)

● Makes SPU code refactoring harder

● Solution (some people don't like this...)

– #define private public [SPU-only!]

Page 26: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

26

Step 1 – shader patch

● RSX lacks PS constant registers

– Constants are embedded into microcode

– Microcode has to be patched● RSX blitting

– Huge RSX cost (up to 50% frame time)● PPU render

– Ring buffer for microcode instances– Complex synchronization

● SPU render– Instances are stored in the same buffer where

CB resides

Page 27: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

27

Step 1 – synchronization

● PPU/SPU

– Data races● Transformation matrices● Material parameters

– Objects can be deleted

– Solution● SPU code has to be fast● PPU waits for SPU before changing data

Page 28: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

28

Step 1 – synchronization

● SPU/RSX

– PPU● flush() inserts WAIT at the beginning of CB

– Waits indefinitely● kick() inserts CALL in main CB

– SPU● Fills CB with rendering commands/shaders● Appends RET to the end● Replaces WAIT with NOP*

Page 29: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

29

Step 1 – results

● Porting time – 3 days

● Render time – 25 ms

– PPU render time is 12.5 ms

– How to make it faster?● Brute-force – split queue into 5 chunks

– 5 ms for 5 SPU● Write better code

● Completely separate code branch

– Common data structures

Page 30: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

30

Porting steps

● Step 1 – working prototype

● Step 2 – data optimization

● Step 3 – code optimization

Page 31: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

31

Porting steps

● Step 1 – working prototype

● Step 2 – data optimization

– Change data layout● Lower indirection count

– Asynchronous DMA● Double-buffering for input/output data

● Step 3 – code optimization

Page 32: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

32

Step 2 – memory layout

RenderItem

SceneNode RenderStaticGeometry

Material

VertexDeclaration

HWShader

Parameters

Textures

TextureDesc

TextureData

GCMTexture

Param ctab

Auto ctab

HWShaderImpl

VS command buffer PS program

Page 33: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

33

Step 2 – data layout

● Goal – lower indirection count

– Actually, make graph paths shorter

● Where do they come from?

– Shared data

– “Variable” length arrays● Size is known at load time

– “Good” architecture● Law of Demeter

– Pimpl

Page 34: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

34

● Textures

– struct TextureInfo● Stored in data array● Is updated in setValue● The contents is sufficient for texture setup

– 4b – sampler state, 12b – texture header

● Render States

– Stored in data array● 16b for all states

Step 2 – materials

Page 35: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

35

● Before:

● After:

Step 2 – materials

Material + Render States Parameters

Textures TextureDesc

TextureData GCMTexture

Material Parameters + Textures + Render States

Page 36: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

36

Step 2 – HWShaderImpl

● Lots of “variable” length arrays

– Constant tables

– Shader data

● Solution

– Sequential layout of everything in memory

– Header contains offsets

– DMA get and pointer fixup● vsCB = (char*)impl + impl->vsCBOffset

Page 37: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

37

● Before:

● After:

Step 2 – HWShaderImpl

Param ctab Auto ctabHWShaderImpl VS CB PS program

Param ctab

Auto ctab HWShaderImpl

VS command buffer

PS program

HWShader

Page 38: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

38

Step 2 – VertexDeclaration

● class RenderStaticGeometry

– VertexDeclaration* vdecl● Can store vdecl by value

– Space penalty

● There are not a lot of unique instances

– There is a declaration cache anyway

– Can implement a software cache!● 4 element cache, DMA stall on cache miss● 30 cache misses for 3500 batches

Page 39: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

39

Step 2 – FlatRenderItem

● Graph path to HWShaderImpl is long

– item->material->shader->impl

● FlatRenderItem

– Caches pointers/sizes● Material data EA/size● Shader impl EA/size● Scene node/render entity EA

– Created at level load time

Page 40: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

40

Step 2 – FlatRenderItem

RenderItem SceneNodeRenderStaticGeometry

Material HWShader HWShaderImpl

● Before:

● After:

FlatRenderItemSceneNode

RenderStaticGeometryMaterial data

HWShaderImpl

Material data

Page 41: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

41

Step 2 – DMA optimizations

● Up to now all DMA are synchronous

● Can hide DMA latency!

– Launch several requests● Wait for all at once

– Double buffering● While current batch is being processed

– Source data for next batch is being read– Result for previous batch is being written

● Requires additional LS memory– Not a problem in our case

Page 42: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

42

Step 2 – output DMA

● Command buffer

– Two 8 Kb buffers

– Swap on buffer overflow

● Shader buffer

– Can do double buffering

– It's easier to wait for transfer though● But before processing instead of after!● DmaPut has enough latency to complete

Page 43: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

43

Step 2 – input DMA

● For each batch

– Wait for previous transfers

– Prefetch next batch● 4 DmaGet at once

– Current batch processing

● Requires loop prologue

– Prefetch for first batch

Page 44: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

44

Step 2 – input DMA

● Complex code (lack of experience atm)

– FlatRenderItem are fetched one by one● It's easier to fetch in groups

● Bugs

– The code prefetches one item past the end ● PPU duplicates last item to avoid errors

– Don't forget to wait for last DmaGet !● Otherwise stack corruption is possible

Page 45: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

45

Step 2 – results

● Optimization time – 3 days

● Render time – 8 ms

– Without double buffering – 12 ms

● PPU time did not change (don't ask)

FlatRenderItem

Material data

SceneNode

RenderStaticGeometry VertexDeclaration

HWShaderImpl

Page 46: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

46

Porting steps

● Step 1 – working prototype

● Step 2 – data optimization

● Step 3 – code optimization

Page 47: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

47

Porting steps

● Step 1 – working prototype

● Step 2 – data optimization

● Step 3 – code optimization

– Profiling● SN Tuner● SPUsim

– Optimization

Page 48: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

48

Step 3 – SN Tuner

● CPU/GPU profiler for PS3

– SPU performance counters● DMA stalls● Instruction scheduling

– Overview of code quality

– SPU PC sampling● No overhead as opposed to PPU sampling● Used for function cost overview

– Had to selectively remove inlining

Page 49: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

49

Step 3 – SPUsim

● SPU simulator for PC

– Awesome for prototyping● Lightning fast iterations● Stalls statistics● Instruction trace

– Shows stalls, lack of pairing

– For small self-contained functions● You can setup DMA, but it's not very easy

Page 50: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

50

Step 3 – branching

● Branching carries a lot of overhead

● Reduce branch counts

– Branch flattening

– Loop unrolling

– Switch → function pointer table

● Zero-size DMA

● Branch hinting

Page 51: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

51

Step 3 – LS load/store

● LS load/store is limited to 16b size/align

– Compiler performs shuffle / masking

● 16b reads

– Padding for input data

– Loop unrolling

● 16b writes

– Write several RSX commands at a time

– Padding for output data (via NOP for RSX)

Page 52: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

52

Step 3 – results

● Optimization time – 5 days

● Render time – 2 ms

● Further optimizations

– Code optimization is still possible● But is not worth it for now

– Parallel rendering with N SPUs● Different scene chunks● Different passes

Page 53: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

53

Porting results

● PPU time – 12.5 ms

● SPU time (prototype) – 25 ms (3 days)

● SPU time (layout) – 12 ms (2.5 days)

● SPU time (async DMA) – 8 ms (1 day)

● SPU time (code) – 2 ms (5 days)

● 75 Kb SPU code, 20 Kb PPU code

– Currently 105 / 26 Kb

Page 54: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

54

Agenda

● Render design

● Brief description of SPU

● Porting

● Development

● Q & A

Page 55: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

55

Development

● Already implemented

– Batch sorting

– Culling (frustum, screen size)

– Custom game parameter setup

● Future work

– Occlusion culling (already implemented)

– Single buffered context

– Uber shaders

Page 56: SPU Render - Arseny Kapoulkine · Step 3 – SN Tuner CPU/GPU profiler for PS3 – SPU performance counters DMA stalls Instruction scheduling – Overview of code quality – SPU

56

Q & A

?Arseny “Zeux” Kapoulkine

CREAT [email protected]

http://zeuxcg.org/