Cranking Floating Point Performance Up To 11

DESCRIPTION

The iPhone has a surprisingly powerful engine under that shiny hood when it comes to floating-point computations. This surprises a lot of programmers because, by default, things can slow down a lot whenever any floating point numbers are involved. This session will explain the secrets to unlocking maximum performance for floating point calculations, from the mysteries of Thumb mode to harnessing the full power of the forgotten vector floating point unit. Stay away from this session if the thought of reading or even (gasp!) writing assembly code scares you.

Noel Llopis
Snappy Touch

http://twitter.com/snappytouch
noel@snappytouch.com

http://gamesfromwithin.com

Floating Point Performance

Floating point numbers

• Representation of rational numbers

• 1.2345, -0.8374, 2.0000, 14388439.34, etc

• Following IEEE 754 format

• Single precision: 32 bits

• Double precision: 64 bits
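
To make the 32-bit single precision layout concrete, here is a minimal sketch that splits a float into its sign, exponent and mantissa fields. The `FloatBits` type and `decompose` function are hypothetical names made up for this illustration; the 1/8/23 bit split and the exponent bias of 127 come from IEEE 754.

```cpp
#include <cstdint>
#include <cstring>

// Fields of an IEEE 754 single-precision value (hypothetical helper type).
struct FloatBits {
    uint32_t sign;      // 1 bit
    uint32_t exponent;  // 8 bits, biased by 127
    uint32_t mantissa;  // 23 bits, implicit leading 1 for normal numbers
};

// Split a float into its three bit fields.
FloatBits decompose(float value) {
    static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
    uint32_t raw;
    std::memcpy(&raw, &value, sizeof raw);  // well-defined way to read the bits
    FloatBits bits;
    bits.sign = raw >> 31;
    bits.exponent = (raw >> 23) & 0xFFu;
    bits.mantissa = raw & 0x7FFFFFu;
    return bits;
}
```

For 1.0f this gives sign 0, a biased exponent of 127 and an all-zero mantissa.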

Why floating point performance?

• Most games use floating point numbers for most of their calculations

• Positions, velocities, physics, etc, etc.

• Maybe not so much for regular apps

CPU

• 32-bit RISC ARM 11

• 400-535 MHz

• iPhone 2G/3G and iPod Touch 1st and 2nd gen

CPU (iPhone 3GS)

• Cortex-A8 600MHz

• More advanced architecture

CPU

• No floating point support in the ARM CPU!!!

How about integer math?

• No need to do any floating point operations

• Fully supported in the ARM processor

• But...

Integer Divide

There is no integer divide instruction
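
Without a divide instruction, every `/` in your code ends up as a call into a software routine. The sketch below is the textbook shift-and-subtract algorithm, just to show roughly how much work that means; it is not the actual runtime routine, and `soft_udiv` and `div_by_16` are made-up names for this example.

```cpp
#include <cstdint>

// Unsigned shift-and-subtract division, one quotient bit per iteration.
// Returns 0 when divisor is 0 (an arbitrary choice for this sketch).
uint32_t soft_udiv(uint32_t dividend, uint32_t divisor) {
    if (divisor == 0) return 0;
    uint32_t quotient = 0;
    uint64_t remainder = 0;  // 64 bits so the shift below cannot overflow
    for (int bit = 31; bit >= 0; --bit) {
        remainder = (remainder << 1) | ((dividend >> bit) & 1u);
        if (remainder >= divisor) {
            remainder -= divisor;
            quotient |= (1u << bit);
        }
    }
    return quotient;
}

// Dividing by a power of two needs no loop at all, only a shift.
inline uint32_t div_by_16(uint32_t x) { return x >> 4; }
```

That is 32 trips around a loop for one divide, which is why divisions by a constant power of two are worth turning into shifts.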

Fixed-point arithmetic

• Sometimes integer arithmetic doesn’t cut it

• You need to represent rational numbers

• Can use a fixed-point library.

• Performs rational arithmetic with integer values at a reduced range/resolution.

• Not so great...
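
A minimal sketch of what a fixed-point library does under the hood, assuming a 16.16 format (16 integer bits, 16 fractional bits). The names `fixed16`, `to_fixed`, `to_float` and `fixed_mul` are hypothetical and not taken from any particular library.

```cpp
#include <cstdint>

typedef int32_t fixed16;     // 16.16 fixed point (hypothetical name)
const int kFracBits = 16;

// Conversions scale by 2^16; anything smaller than 1/65536 is lost.
inline fixed16 to_fixed(float f)   { return static_cast<fixed16>(f * (1 << kFracBits)); }
inline float   to_float(fixed16 x) { return static_cast<float>(x) / (1 << kFracBits); }

// Add and subtract are plain integer ops. Multiply needs a 64-bit
// intermediate and a shift to drop the extra fractional bits.
inline fixed16 fixed_mul(fixed16 a, fixed16 b) {
    return static_cast<fixed16>((static_cast<int64_t>(a) * b) >> kFracBits);
}
```

The reduced range and resolution show up right away: values are limited to roughly ±32768, and something like 0.00001 rounds to zero.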

Floating point support

• There’s a floating point unit

• Compiled C/C++/ObjC code uses the VFP unit for any floating point operations.

Sample program

    struct Particle
    {
        float x, y, z;
        float vx, vy, vz;
    };

    for (int i=0; i<MaxParticles; ++i)
    {
        Particle& p = s_particles[i];
        p.x += p.vx*dt;
        p.y += p.vy*dt;
        p.z += p.vz*dt;
        p.vx *= drag;
        p.vy *= drag;
        p.vz *= drag;
    }

• 7.2 seconds on an iPod Touch 2nd gen
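
If you want to reproduce this kind of measurement yourself, a self-contained harness could look like the sketch below. The particle count, iteration count and the `dt`/`drag` values behind the numbers in this talk are not given, so the values here are placeholders; `update_particles` and `time_updates` are names invented for this example.

```cpp
#include <chrono>
#include <vector>

struct Particle { float x, y, z; float vx, vy, vz; };

// Same update as the sample loop, wrapped in a function.
void update_particles(std::vector<Particle>& particles, float dt, float drag) {
    for (Particle& p : particles) {
        p.x += p.vx * dt;  p.y += p.vy * dt;  p.z += p.vz * dt;
        p.vx *= drag;      p.vy *= drag;      p.vz *= drag;
    }
}

// Run the update `iterations` times and return the elapsed seconds.
// dt and drag are placeholder values, not the ones used in the talk.
double time_updates(std::vector<Particle>& particles, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        update_particles(particles, 0.016f, 0.99f);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Time the same code with each compiler setting and compare; absolute numbers will differ from device to device.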

Floating point support

Trust no one! When in doubt, check the generated assembly.

Thumb Mode

• CPU has a special Thumb mode.

• Less memory, maybe better performance.

• No floating point support.

• Every time there’s an fp operation, it switches out of Thumb, does the fp operation, and switches back.

Thumb Mode

• It’s on by default!

• Potentially HUGE wins turning it off.

Thumb Mode

• Turning off Thumb mode increased performance in Flower Garden by over 2x

• It makes heavy use of floating point operations, though

• Most games will probably benefit from turning it off (especially 3D games)

Sample program with Thumb mode turned off: 2.6 seconds!
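
Since the setting is easy to get wrong per target or per file, it can help to check at compile time what you actually got. This sketch relies on the `__thumb__` and `__arm__` macros that GCC predefines on ARM targets; the function name is made up. With GCC itself the modes are selected with `-mthumb` and `-marm`; in Xcode of this era the switch should be the "Compile for Thumb" build setting, but check your own project to be sure.

```cpp
// Report which instruction set this translation unit was compiled for.
// __thumb__ and __arm__ are predefined by GCC (and Clang) on ARM targets.
const char* instruction_set_name() {
#if defined(__thumb__)
    return "Thumb";
#elif defined(__arm__)
    return "ARM";
#else
    return "not 32-bit ARM";
#endif
}
```

Logging this once at startup makes it obvious whether the build you are profiling really has Thumb turned off.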

ARM assembly

DISCLAIMER: I’m not an ARM assembly expert!!!

Z80!!!

ARM assembly

• Hit the docs

• References included in your USB card

• Or download them from the ARM site

• http://bit.ly/arminfo

ARM assembly

• Reading assembly is a very important skill for high-performance programming

• Writing is more specialized. Most people don’t need to.

VFP unit

A0 + B0 = C0

A1 + B1 = C1

A2 + B2 = C2

A3 + B3 = C3

VFP unit

[A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3]

Sweet! How do we use the vfp?

    "fldmias %2, {s8-s23} \n\t"
    "fldmias %1!, {s0-s3} \n\t"
    "fmuls s24, s8, s0 \n\t"
    "fmacs s24, s12, s1 \n\t"

    "fldmias %1!, {s4-s7} \n\t"

    "fmacs s24, s16, s2 \n\t"
    "fmacs s24, s20, s3 \n\t"
    "fstmias %0!, {s24-s27} \n\t"

Like this!

Writing vfp assembly

• There are two parts to it

• How to write any assembly in gcc

• Learning ARM and VFP assembly

vfpmath library

• Already done a lot of work for you

• http://code.google.com/p/vfpmathlibrary

• Vector/matrix math

• Might not be exactly what you need, but it’s a great starting point

Assembly in gcc

• Only use it when targeting the device

    #include <TargetConditionals.h>
    #if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1)
        #define USE_VFP
    #endif

Assembly in gcc

• The basics

    asm ("cmp r2, r1");

Assembly in gcc

• Multiple lines

    asm (
        "mov r0, #1000\n\t"
        "cmp r2, r1\n\t"
    );

Assembly in gcc

• Accessing C variables

    asm ( // assembly code
        : // output operands
        : // input operands
        : // clobbered registers
    );

    int src = 19;
    int dest = 0;
    asm volatile (
        "add %0, %1, #42"
        : "=r" (dest)
        : "r" (src)
        :
    );

%0, %1, etc. are the operands in order: outputs first, then inputs

Assembly in gcc

    int src = 19;
    int dest = 0;
    asm volatile (
        "add r10, %1, #42\n\t"
        "add %0, r10, #33\n\t"
        : "=r" (dest)
        : "r" (src)
        : "r10"
    );

The clobber list names the registers the asm block overwrites besides its operands

volatile prevents “optimizations” such as the compiler removing or moving the block
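
Putting the pieces together: since the asm only makes sense on the device, one common pattern is to wrap it in a function with a plain C++ fallback. This is a sketch of that idea, not code from the talk; `add_offsets` is a made-up name, and the guard uses GCC's predefined `__arm__` and `__thumb__` macros so the asm path is only taken in ARM mode.

```cpp
// Adds 42 and then 33 to src, mirroring the two-instruction example above.
int add_offsets(int src) {
    int dest = 0;
#if defined(__arm__) && !defined(__thumb__)
    asm volatile (
        "add r10, %1, #42 \n\t"
        "add %0, r10, #33 \n\t"
        : "=r" (dest)   // %0: output operand
        : "r" (src)     // %1: input operand
        : "r10"         // scratch register the block overwrites
    );
#else
    dest = src + 42 + 33;  // portable fallback (simulator, other CPUs)
#endif
    return dest;
}
```

Keeping the fallback next to the asm also gives you a reference implementation to test the assembly against.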

VFP asm

Four banks of 8 32-bit registers each

    #define VFP_VECTOR_LENGTH(VEC_LENGTH) \
        "fmrx r0, fpscr \n\t" \
        "bic r0, r0, #0x00370000 \n\t" \
        "orr r0, r0, #0x000" #VEC_LENGTH "0000 \n\t" \
        "fmxr fpscr, r0 \n\t"
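
The macro is easier to follow as plain bit twiddling. In the VFP status register (FPSCR), the LEN field sits in bits 16-18 and holds the vector length minus one, and the STRIDE field sits in bits 20-21. The macro pastes its argument as the hex digit at bit 16, so the argument looks like it is meant to be the length minus one (3 for four floats). Here is the same arithmetic as a sketch in C++; `fpscr_with_vector_length` is a hypothetical helper that takes the actual length, not something from vfpmathlibrary.

```cpp
#include <cstdint>

const uint32_t kLenStrideMask = 0x00370000u;  // LEN bits 16-18, STRIDE bits 20-21

// Compute the FPSCR value for a short-vector length of 1..8 elements.
// LEN stores length - 1; STRIDE is cleared (consecutive registers).
uint32_t fpscr_with_vector_length(uint32_t fpscr, unsigned length) {
    const uint32_t cleared = fpscr & ~kLenStrideMask;   // the "bic" step
    return cleared | (((length - 1u) & 0x7u) << 16);    // the "orr" step
}
```

Everything else in the register is left alone, which is why the macro reads FPSCR first instead of just writing a constant.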

VFP asm

    for (int i=0; i<MaxParticles; ++i)
    {
        Particle* p = &s_particles[i];
        asm volatile (
            "fldmias %0, {s0-s5} \n\t"    // load x, y, z, vx, vy, vz
            "fldmias %1, {s6-s8} \n\t"    // load dtArray
            "fldmias %2, {s9-s11} \n\t"   // load dragArray
            "fmacs s0, s3, s6 \n\t"       // multiply-accumulate with dt
            "fmuls s3, s3, s9 \n\t"       // multiply by drag
            "fstmias %0, {s0-s5} \n\t"    // store the particle back
            :
            : "r" (p), "r" (dtArray), "r" (dragArray)
            : "memory"
        );
    }

Original loop:

    for (int i=0; i<MaxParticles; ++i)
    {
        Particle& p = s_particles[i];
        p.x += p.vx*dt;
        p.y += p.vy*dt;
        p.z += p.vz*dt;
        p.vx *= drag;
        p.vy *= drag;
        p.vz *= drag;
    }

Was: 2.6 seconds
Now: 1.4 seconds!!

VFP asm

Let’s do 6 operations at once!

    struct Particle2
    {
        float x0, y0, z0;
        float x1, y1, z1;
        float vx0, vy0, vz0;
        float vx1, vy1, vz1;
    };

VFP asm

    for (int i=0; i<iterations; ++i)
    {
        Particle2* p = &s_particles2[i];
        asm volatile (
            "fldmias %0, {s0-s11} \n\t"    // load two particles
            "fldmias %1, {s12-s17} \n\t"   // load dtArray
            "fldmias %2, {s18-s23} \n\t"   // load dragArray
            "fmacs s0, s6, s12 \n\t"       // multiply-accumulate with dt
            "fmuls s6, s6, s18 \n\t"       // multiply by drag
            "fstmias %0, {s0-s11} \n\t"    // store both particles back
            :
            : "r" (p), "r" (dtArray), "r" (dragArray)
            : "memory"
        );
    }

Was: 1.4 seconds
Now: 1.2 seconds
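
The point of the Particle2 layout is that the six position floats and the six velocity floats each sit next to each other in memory, so one load grabs all of them in the right order. A plain C++ sketch of the same idea follows, handy as a reference to check the assembly against; `pack` and `update` are names invented here, and the code assumes the struct is 12 tightly packed floats.

```cpp
#include <cstring>

struct Particle  { float x, y, z; float vx, vy, vz; };
struct Particle2 { float x0, y0, z0; float x1, y1, z1;
                   float vx0, vy0, vz0; float vx1, vy1, vz1; };

// Pack two particles so positions and velocities are each 6 contiguous floats.
Particle2 pack(const Particle& a, const Particle& b) {
    Particle2 p = { a.x,  a.y,  a.z,  b.x,  b.y,  b.z,
                    a.vx, a.vy, a.vz, b.vx, b.vy, b.vz };
    return p;
}

// Reference update: six multiply-adds, then six multiplies.
void update(Particle2& p, float dt, float drag) {
    float lanes[12];
    static_assert(sizeof(Particle2) == sizeof(lanes), "expects 12 packed floats");
    std::memcpy(lanes, &p, sizeof lanes);          // lanes 0-5: positions, 6-11: velocities
    for (int i = 0; i < 6; ++i) lanes[i] += lanes[i + 6] * dt;
    for (int i = 0; i < 6; ++i) lanes[i + 6] *= drag;
    std::memcpy(&p, lanes, sizeof lanes);
}
```

With this layout the loop runs half as many iterations, one per pair of particles.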

VFP asm

What’s the loop/cache overhead?

    for (int i=0; i<MaxParticles; ++i)
    {
        Particle* p = &s_particles[i];
        p->x = p->vx;
        p->y = p->vy;
        p->z = p->vz;
    }

Was: 1.2 seconds
Now: 1.2 seconds!!!!

Matrix multiply

Straight from vfpmathlib

Touch: 0.037919 s
Normal: 0.096855 s
VFP: 0.042216 s

About 2x faster!
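
For comparison, a plain C++ 4x4 multiply of the kind the "Normal" number would come from might look like this. It is a sketch, not the code that was benchmarked; it assumes column-major storage, and `matrix_multiply` here is just an illustrative name.

```cpp
// out = a * b for 4x4 column-major matrices stored as 16 floats each.
// Element (row, col) lives at index col * 4 + row.
// out must not alias a or b.
void matrix_multiply(const float* a, const float* b, float* out) {
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a[k * 4 + row] * b[col * 4 + k];
            }
            out[col * 4 + row] = sum;
        }
    }
}
```

That is 64 multiplies and 48 adds done one at a time, which is exactly the kind of regular work the vector unit is good at.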

Good use of vfp

• Matrix operations

• Particle systems

• Skinning

• Physics

• Procedural content generation

• ....

What about the 3GS?

Time in seconds:

              3G      3GS
    Thumb     7.2     8.0
    Normal    2.6     2.6
    VFP1      1.4     1.30
    VFP2      1.2     0.64
    Touch     1.2     0.18

More 3GS: NEON

• SIMD coprocessor

• Floating point and integer

• Huge potential

• Very little documentation right now :-(

Thank you!

Noel Llopis
Snappy Touch

http://twitter.com/snappytouch
noel@snappytouch.com

http://gamesfromwithin.com
