108

#GDC15 Code Clinic

Embed Size (px)

Citation preview

Page 1: #GDC15 Code Clinic
Page 2: #GDC15 Code Clinic

I don’t know anything about wood carving, but…

Page 3: #GDC15 Code Clinic
Page 4: #GDC15 Code Clinic
Page 5: #GDC15 Code Clinic

Different problems, different scales, different tools.

Page 6: #GDC15 Code Clinic
Page 7: #GDC15 Code Clinic

Tools scale

• Chainsaw

• Dremel

• Laser

Page 8: #GDC15 Code Clinic

Tools scale

• Chainsaw

• Dremel <- Compiler

• Laser

Page 9: #GDC15 Code Clinic

To make best use of a tool (e.g. compiler)

• Solve the part of the problem the tool can’t help with.

• Prepare the area that the tool can help with.

• Solve details missed by using the tool as intended.

Page 10: #GDC15 Code Clinic

Part 1:Solve the part of the problem

the tool can’t help with.

Page 11: #GDC15 Code Clinic

Approaches…

Page 12: #GDC15 Code Clinic

Approach #1

• Estimate resources available

• Triage based on cost and value estimates

• Collect data

• Adapt as cost and/or value predictions change

• AKA Engineering

Page 13: #GDC15 Code Clinic

Approach #2

• Don’t worry about resources

• Desperately scramble to fix when running out.

• Generously described as irrational

• Desperate scramble != “optimization”

Page 14: #GDC15 Code Clinic

The problems are there even if you ignore them.

Page 15: #GDC15 Code Clinic

Estimate resources available

• What are the bottlenecks? i.e. most insufficient for the demand

Page 16: #GDC15 Code Clinic

Shell + Excel: test_0 vs. test_1 (W=2 -> W=512)

Page 17: #GDC15 Code Clinic

What can we infer?

Page 18: #GDC15 Code Clinic

0

5

10

15

20

25

30

35

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

W=8

test_0 test_1

33ms = 30fpsCol W=8; Ea. R/W avg +/- 32B from next/prev8192 * 1024 * 32bit = 3MB / frame

Page 19: #GDC15 Code Clinic

0

5

10

15

20

25

30

35

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

W=8

test_0 test_1

33ms = 30fpsCol W=8; Ea. R/W avg +/- 32B from next/prev8192 * 1024 * 32bit = 3MB / frame

Row, in order8192 * 1024 * 32bit * 8.9x = 34.7MB / frame

Page 20: #GDC15 Code Clinic

0

10

20

30

40

50

60

70

80

90

100

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

W=512

test_0 test_1

W=512; > 20x diff (1.19MB/frame)Algorithmic complexity equivalent

Page 21: #GDC15 Code Clinic

0

10

20

30

40

50

60

70

80

90

100

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

W=512

test_0 test_1

Gap = doing nothing.

Page 22: #GDC15 Code Clinic

Why?

Page 23: #GDC15 Code Clinic

http://www.gameenginebook.com/SINFO.pdf

Page 24: #GDC15 Code Clinic

By comparison…

Page 25: #GDC15 Code Clinic

http://www.agner.org/optimize/instruction_tables.pdf

(AMD Piledriver)

Page 26: #GDC15 Code Clinic

http://www.agner.org/optimize/instruction_tables.pdf

(AMD Piledriver)

Page 27: #GDC15 Code Clinic

The Battle of North Bridge

L1

L2

RAM

Page 28: #GDC15 Code Clinic

Verify data…

Page 29: #GDC15 Code Clinic
Page 30: #GDC15 Code Clinic

Excel: W=64 (int32_t; mod 8)

Page 31: #GDC15 Code Clinic

0

10

20

30

40

50

60

70

80

90

100

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

W=512

test_0 test_1

Make reasonable use of resources

Page 32: #GDC15 Code Clinic

0

10

20

30

40

50

60

70

80

90

100

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

W=512

test_0 test_1

Make reasonable use of resources

(not the same as optimization)

Page 33: #GDC15 Code Clinic

Memory access is the most significant component.

Page 34: #GDC15 Code Clinic

Bottleneck: most insufficient for the demand

• Q: Is everything always all about memory access / cache?

• A: No. It’s about whatever the most scarce resources are.

• E.g. Disk access seeks; GPU draw counts; etc.

Page 35: #GDC15 Code Clinic

Let’s look at it another way…

Page 36: #GDC15 Code Clinic

Simple, obvious things to look for + Back of the envelope calculations

= Substantial wins

Page 37: #GDC15 Code Clinic

http://deplinenoise.wordpress.com/2013/12/28/optimizable-code/

Page 38: #GDC15 Code Clinic
Page 39: #GDC15 Code Clinic

2 x 32bit read; same cache line = ~200

Page 40: #GDC15 Code Clinic

Waste 56 bytes / 64 bytes

Page 41: #GDC15 Code Clinic

Float mul, add = ~10

Page 42: #GDC15 Code Clinic

Let’s assume callq is replaced. Sqrt = ~30

Page 43: #GDC15 Code Clinic

Mul back to same addr; in L1; = ~3

Page 44: #GDC15 Code Clinic

Read+add from new line= ~200

Page 45: #GDC15 Code Clinic

Waste 60 bytes / 64 bytes

Page 46: #GDC15 Code Clinic

90% waste!

Page 47: #GDC15 Code Clinic

Alternatively,Only 10% capacity used*

* Not the same as “used well”, but we’ll start here.

Page 48: #GDC15 Code Clinic

Time spent waiting for L2 vs. actual work

~10:1

Page 49: #GDC15 Code Clinic

Time spent waiting for L2 vs. actual work

~10:1

This is the compiler’s space.

Page 50: #GDC15 Code Clinic

Time spent waiting for L2 vs. actual work

~10:1

This is the compiler’s space.

Page 51: #GDC15 Code Clinic

Compiler cannot solve the most significant problems.

Page 52: #GDC15 Code Clinic
Page 53: #GDC15 Code Clinic

12 bytes x count(32) = 384 = 64 x 6

4 bytes x count(32) = 128 = 64 x 2

Page 54: #GDC15 Code Clinic

12 bytes x count(32) = 384 = 64 x 6

4 bytes x count(32) = 128 = 64 x 2

(6/32) = ~5.33 loop/cache line

Page 55: #GDC15 Code Clinic

12 bytes x count(32) = 384 = 64 x 6

4 bytes x count(32) = 128 = 64 x 2

Sqrt + math = ~40 x 5.33 = 213.33 cycles/cache line(6/32) = ~5.33 loop/cache line

Page 56: #GDC15 Code Clinic

12 bytes x count(32) = 384 = 64 x 6

4 bytes x count(32) = 128 = 64 x 2

Sqrt + math = ~40 x 5.33 = 213.33 cycles/cache line(6/32) = ~5.33 loop/cache line

+ streaming prefetch bonus

Page 57: #GDC15 Code Clinic

12 bytes x count(32) = 384 = 64 x 6

4 bytes x count(32) = 128 = 64 x 2

Sqrt + math = ~40 x 5.33 = 213.33 cycles/cache line(6/32) = ~5.33 loop/cache line

+ streaming prefetch bonus

Using cache line to capacity* =Est. 10x speedup

* Used. Still not necessarily as efficiently as possible

Page 58: #GDC15 Code Clinic

0

0.5

1

1.5

2

2.5

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

32K GameObjects

test_0 test_1

Measured = 6.8x

Page 59: #GDC15 Code Clinic

Part 2:Prepare the area that the tool

can help with.

Page 60: #GDC15 Code Clinic

“The high-level design of the code generator”

• http://llvm.org/docs/CodeGenerator.html

• Instruction Selection • Map data/code to instructions that exist

• Scheduling and Formation • Give opportunity to schedule / estimate against data access latency

• SSA-based Machine Code Optimizations• Manual SSA form

• Register Allocation• Copy to/from locals in register size/types

• Prolog/Epilog Code Insertion

• Late Machine Code Optimizations• Pay careful attention to inlining

• Code Emission • Verify asm output regularly

Page 61: #GDC15 Code Clinic

“Compilers are good at applying mediocre optimizations

hundreds of times.”- @deplinenoise

Page 62: #GDC15 Code Clinic

Three easy tips to help compiler

• #1 Analyze one value

• #2 Hoist all loop-invariant reads and branches

• #3 Remove redundant transform redundancy

Page 63: #GDC15 Code Clinic

#1 Analyze one value

• Find any one value (or implicit value) you’re interested in, anywhere.

• Printf it out over time. (Helps if you tag it so you can find it.)

• Grep your tty and paste into excel. (Sometimes a quick and dirty script is useful to convert to a more complex piece of data.)

• See what you see. You’re bound to be surprised by something.

Page 64: #GDC15 Code Clinic

e.g. SoundSourceBaseComponent::BatchUpdate

if ( g_DebugTraceSoundSource ){

Printf("#SS-01 %d m_CountSources\n”,i, component->m_CountSources);}

Page 65: #GDC15 Code Clinic
Page 66: #GDC15 Code Clinic

Approximately 0% of the sound source components have a source. (i.e. not playing yet)

Page 67: #GDC15 Code Clinic

Approximately 0% of the sound source components have a source. (i.e. not playing yet)

Switch to evaluating from source->component not component->source

20K vs. 200 cache lines

Page 68: #GDC15 Code Clinic

#2 Hoist all loop-invariant reads and branches

• Don’t re-read member values or re-call functions when you alreadyhave the data.

• Hoist all loop-invariant reads and branches. Even super-obvious ones that should already be in registers (member fields especially.)

Page 69: #GDC15 Code Clinic

How is it used? What does it generate?

Page 70: #GDC15 Code Clinic

How is it used? What does it generate?

Equivalent to:return m_NeedParentUpdate?count:0;

Page 71: #GDC15 Code Clinic

MSVC

Page 72: #GDC15 Code Clinic

MSVC

Re-read and re-test…

Increment and loop…

Page 73: #GDC15 Code Clinic

Re-read and re-test…

Increment and loop…

Why?

Super-conservative aliasing rules…?Member value might change?

Page 74: #GDC15 Code Clinic

What about something more aggressive…?

Page 75: #GDC15 Code Clinic

Test once and return…

What about something more aggressive…?

Page 76: #GDC15 Code Clinic

Okay, so what about…

Page 77: #GDC15 Code Clinic

Okay, so what about…

Equivalent to:return m_NeedParentUpdate?count:0;

Page 78: #GDC15 Code Clinic

…well at least it inlined it?

Page 79: #GDC15 Code Clinic

MSVC doesn’t fare any better…

Page 80: #GDC15 Code Clinic

Don’t re-read member values or re-call functions whenyou already have the data.

Ghost reads and writes

Page 81: #GDC15 Code Clinic

BAM!

Page 82: #GDC15 Code Clinic

:(

Page 83: #GDC15 Code Clinic

Ghost reads and writes

Don’t re-read member values or re-call functions whenyou already have the data.

Hoist all loop-invariant reads and branches. Even super-obvious ones that should already be in registers.

Page 84: #GDC15 Code Clinic

:)

Page 85: #GDC15 Code Clinic

:)

A bit of unnecessary branching, but more-or-less equivalent.

Page 86: #GDC15 Code Clinic

#3 Remove redundant transform redundancy

• Often, especially with small functions, there is not enough context to know when and under what conditions a transform will happen.

• It’s easy to fall into the over-generalization trap

• It’s only real cases, with real data, we’re concerned with.

Page 87: #GDC15 Code Clinic

A little gem…

Page 88: #GDC15 Code Clinic
Page 89: #GDC15 Code Clinic

wchar_t* -> char*

Suspicious. I know we don’t handle wide char conversion consistently.

Page 90: #GDC15 Code Clinic
Page 91: #GDC15 Code Clinic
Page 92: #GDC15 Code Clinic

wchar_t* -> char* -> wchar_t*

it takes the char* we created and converts it right back to a whar*

Page 93: #GDC15 Code Clinic
Page 94: #GDC15 Code Clinic
Page 95: #GDC15 Code Clinic

What’s happening?

• We convert from char* to wchar* to store it in m_DeleteFiles

• …So we can convert it from wchar* to char* to give to FileDelete

• …So we can convert it from char* to whcar* to give it to Windows DeleteFile.

Page 96: #GDC15 Code Clinic

Where does the char* come from in the first place?

Page 97: #GDC15 Code Clinic
Page 98: #GDC15 Code Clinic

Comes from argv. Never needed to touch the memory!

Page 99: #GDC15 Code Clinic

Which turned out to be a command line parameter that

was never used, anywhere.

Page 100: #GDC15 Code Clinic

Which brings us back to the wood carving analogy…

Page 101: #GDC15 Code Clinic
Page 102: #GDC15 Code Clinic

Part 3:Solve details missed by using

the tool as intended.

Page 103: #GDC15 Code Clinic

Optimization begins.See: e.g.

Page 104: #GDC15 Code Clinic

Part 4: Practice

Page 105: #GDC15 Code Clinic

If you don’t practice, you can’t do it when it matters.

Page 106: #GDC15 Code Clinic

Practice https://oeis.org/A000081

Page 107: #GDC15 Code Clinic

What’s the cause?

http://www.insomniacgames.com/three-big-lies-typical-design-failures-in-game-programming-gdc10/

Page 108: #GDC15 Code Clinic