Cranking Floating Point Performance Up To 11

Noel LlopisSnappy Touch

http://twitter.com/snappytouchnoel@snappytouch.com

http://gamesfromwithin.com

Floating Point Performance

Floating point numbers

• Representation of rational numbers

• 1.2345, -0.8374, 2.0000, 14388439.34, etc

• Following IEEE 754 format

• 1.2345, -0.8374, 2.0000, 14388439.34, etc

• Single precision: 32 bits

• 1.2345, -0.8374, 2.0000, 14388439.34, etc

• Single precision: 32 bits

• Double precision: 64 bits

Why floating point performance?

• Most games use floating point numbers for most of their calculations

• Positions, velocities, physics, etc, etc.

• Maybe not so much for regular apps

• 32-bit RISC ARM 11

• 400-535Mhz

• 32-bit RISC ARM 11

• 400-535Mhz

• iPhone 2G/3G and iPod Touch 1st and 2nd gen

CPU (iPhone 3GS)

• Cortex-A8 600MHz

CPU (iPhone 3GS)

• Cortex-A8 600MHz

• More advanced architecture

• No floating point support in the ARM CPU!!!

How about integer math?

• No need to do any floating point operations

• Fully supported in the ARM processor

• But...

Integer Divide

There is no integer divide

Fixed-point arithmetic

• Sometimes integer arithmetic doesn’t cut it

• You need to represent rational numbers

• Can use a fixed-point library.

• Performs rational arithmetic with integer values at a reduced range/resolution.

• Not so great...

Floating point support

• There’s a floating point unit

• Compiled C/C++/ObjC code uses the VFP unit for any floating point operations.

Sample program

Sample program struct Particle { float x, y, z; float vx, vy, vz; };

for (int i=0; i<MaxParticles; ++i){ Particle& p = s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag;}

Sample program struct Particle { float x, y, z; float vx, vy, vz; };

• 7.2 seconds on an iPod Touch 2nd gen

Trust no one!

Trust no one!When in doubt, check the

assembly generated

Thumb Mode

Thumb Mode• CPU has a special thumb

• Less memory, maybe better performance.

• No floating point support.

• Every time there’s an fp operation, it switches out of Thumb, does the fp operation, and switches back on.

Thumb Mode

• It’s on by default!

Thumb Mode

• Potentially HUGE wins turning it off.

Thumb Mode

• Potentially HUGE wins turning it off.

Thumb Mode

• Turning off Thumb mode increased performance in Flower Garden by over 2x

Thumb Mode

• Heavy usage of floating point operations though

Thumb Mode

• Heavy usage of floating point operations though

• Most games will probably benefit from turning it off (especially 3D games)

2.6 seconds!

ARM assemblyDISCLAIMER:

ARM assembly

I’m not an ARM assembly expert!!!DISCLAIMER:

ARM assembly

Z80!!!

ARM assembly

• Hit the docs

ARM assembly

• Hit the docs

• References included in your USB card

ARM assembly

• Hit the docs

• Or download them from the ARM site

ARM assembly

• Hit the docs

• Or download them from the ARM site

• http://bit.ly/arminfo

ARM assembly

• Reading assembly is a very important skill for high-performance programming

ARM assembly

• Reading assembly is a very important skill for high-performance programming

• Writing is more specialized. Most people don’t need to.

VFP unit

VFP unitA0

VFP unit

VFP unitA0 A1 A2 A3

VFP unit

+A0 A1 A2 A3

VFP unit

+A0 A1 A2 A3

B0 B1 B2 B3

VFP unit

A0 A1 A2 A3

B0 B1 B2 B3

VFP unit

A0 A1 A2 A3

B0 B1 B2 B3

C0 C1 C2 C3

VFP unit

A0 A1 A2 A3

B0 B1 B2 B3

C0 C1 C2 C3

Sweet! How do we use the vfp?

"fldmias %2, {s8-s23} \n\t" "fldmias %1!, {s0-s3} \n\t" "fmuls s24, s8, s0 \n\t" "fmacs s24, s12, s1 \n\t"

"fldmias %1!, {s4-s7} \n\t"

"fmacs s24, s16, s2 \n\t" "fmacs s24, s20, s3 \n\t" "fstmias %0!, {s24-s27} \n\t"

Like this!

Writing vfp assembly

• There are two parts to it

• How to write any assembly in gcc

• Learning ARM and VPM assembly

vfpmath library

• Already done a lot of work for you

vfpmath library

• http://code.google.com/p/vfpmathlibrary

vfpmath library

• Vector/matrix math

vfpmath library

• Vector/matrix math

• Might not be exactly what you need, but it’s a great starting point

Assembly in gcc

• Only use it when targeting the device

Assembly in gcc

• Only use it when targeting the device

#include <TargetConditionals.h>#if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1) #define USE_VFP#endif

Assembly in gcc

• The basics

asm (“cmp r2, r1”);

Assembly in gcc

• The basics

asm (“cmp r2, r1”);

http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

Assembly in gcc

• Multiple lines

asm ( “mov r0, #1000\n\t” “cmp r2, r1\n\t”);

Assembly in gcc• Accessing C variables

asm (//assembly code : // output operands : // input operands : // clobbered registers);

int src = 19; int dest = 0; asm volatile ( "add %0, %1, #42" : "=r" (dest) : "r" (src) : );

%0, %1, etc are the variables in order

Assembly in gcc

Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42\n\t" "add %0, r10, #33\n\t" : "=r" (dest) : "r" (src) : "r10" );

Clobber register list are registers used by

the asm block

Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42\n\t" "add %0, r10, #33\n\t" : "=r" (dest) : "r" (src) : "r10" );

Clobber register list are registers used by

the asm block

volatile prevents “optimizations”

VFP asmFour banks of 8 32-bit registers each

#define VFP_VECTOR_LENGTH(VEC_LENGTH) "fmrx r0, fpscr \n\t" \ "bic r0, r0, #0x00370000 \n\t" \ "orr r0, r0, #0x000" #VEC_LENGTH "0000 \n\t" \ "fmxr fpscr, r0 \n\t"

VFP asm

VFP asmfor (int i=0; i<MaxParticles; ++i){ Particle& p = s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag;}

VFP asm for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; asm volatile ( "fldmias %0, {s0-s5} \n\t" "fldmias %1, {s6-s8} \n\t" "fldmias %2, {s9-s11} \n\t" "fmacs s0, s3, s6 \n\t" "fmuls s3, s3, s9 \n\t" "fstmias %0, {s0-s5} \n\t" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); }

Was: 2.6 seconds

Was: 2.6 secondsNow: 1.4 seconds!!

VFP asmLet’s do 6 operations at once!

struct Particle2 { float x0, y0, z0; float x1, y1, z1; float vx0, vy0, vz0; float vx1, vy1, vz1; };

VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} \n\t" "fldmias %1, {s12-s17} \n\t" "fldmias %2, {s18-s23} \n\t" "fmacs s0, s6, s12 \n\t" "fmuls s6, s6, s18 \n\t" "fstmias %0, {s0-s11} \n\t" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); }

VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} \n\t" "fldmias %1, {s12-s17} \n\t" "fldmias %2, {s18-s23} \n\t" "fmacs s0, s6, s12 \n\t" "fmuls s6, s6, s18 \n\t" "fstmias %0, {s0-s11} \n\t" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); } Was: 1.4 seconds

Now: 1.2 seconds

VFP asmWhat’s the loop/cache overhead?

for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; }

Was: 1.2 seconds

Was: 1.2 secondsNow: 1.2 seconds!!!!

Matrix multiply

Matrix multiplyStraight from vfpmathlib

Matrix multiply

Touch: 0.037919 s

Straight from vfpmathlib

Matrix multiply

Touch: 0.037919 sNormal: 0.096855 s

Matrix multiply

Touch: 0.037919 sNormal: 0.096855 sVFP: 0.042216 s

Matrix multiply

Touch: 0.037919 sNormal: 0.096855 sVFP: 0.042216 s

About 2x faster!

Good use of vfp

• Matrix operations

Good use of vfp

• Particle systems

Good use of vfp

• Skinning

Good use of vfp

• Skinning

• Physics

Good use of vfp

• Skinning

• Physics

• Procedural content generation

Good use of vfp

• Skinning

• Physics

• Procedural content generation

• ....

What about the 3GS?

3G 3GS

Normal

7.2 8.0

2.6 2.6

1.4 1.30

1.2 0.64

1.2 0.18

What about the 3GS?

3G 3GS

Normal

7.2 8.0

2.6 2.6

1.4 1.30

1.2 0.64

1.2 0.18

What about the 3GS?

3G 3GS

Normal

7.2 8.0

2.6 2.6

1.4 1.30

1.2 0.64

1.2 0.18

What about the 3GS?

3G 3GS

Normal

7.2 8.0

2.6 2.6

1.4 1.30

1.2 0.64

1.2 0.18

What about the 3GS?

3G 3GS

Normal

7.2 8.0

2.6 2.6

1.4 1.30

1.2 0.64

1.2 0.18

What about the 3GS?

3G 3GS

Normal

7.2 8.0

2.6 2.6

1.4 1.30

1.2 0.64

1.2 0.18

What about the 3GS?

3G 3GS

Normal

7.2 8.0

2.6 2.6

1.4 1.30

1.2 0.64

1.2 0.18

More 3GS: NEON

• SIMD coprocessor

More 3GS: NEON

• Floating point and integer

More 3GS: NEON

• Huge potential

More 3GS: NEON

• Huge potential

• Very little documentation right now :-(

Thank you!

Noel LlopisSnappy Touch

http://twitter.com/snappytouchnoel@snappytouch.com

http://gamesfromwithin.com

Cranking Floating Point Performance Up To 11

Technology

Integer Arithmetic Floating Point Representation Floating Point Arithmetic

Floating point numbers

LogiCORE IP Floating-Point Operator v6 - Xilinx · The Xilinx® Floating-Point ... (/doc ... The Xilinx Floating-Point Operator core allows a range of floating-point

Floating Point Arithmetic

Integer Arithmetic Floating Point Representation Floating Point Arithmetic Topics

Fujitsu Limited Ver 15, 26 Apr. 2010 Move Selected Floating-Point Register on Floating-Point Register's Condition 122 A.76 Floating-Point Trigonometric Functions 123 A.77 Store Floating-Point

Floating Point. Agenda History Basic Terms General representation of floating point Constructing a simple floating point representation Floating

Set 16 FLOATING POINT ARITHMETIC. TOPICS Binary representation of floating point Numbers Computer representation of floating point numbers Floating point

Fixed Point vs. Floating Point

Floating Point Arithmetic. Outline Overflow Fixed Point Numbers Floating Point Numbers Excess Representation Arithmetic with Floating Point

Outline Floating point arithmetic 80x87 Floating Point Unit

Floating Point

Floating-point numberskbeach/phys420_580/docs/... · 2013-07-25 · Floating-point numbers ‣ Floating-point numbers have the form ‣ The adjustable radix point allows for calculation

Floating-point numberskbeach/courses/fall2017/... · Floating-point numbers ‣ Floating-point numbers have the form ‣ The adjustable radix point allows for calculation over a wide

– 1 – CS213, S’06 Floating Point 4/5/2006 Topics IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties CS213

Floating Point Arithmetic Feb 17, 2000 Topics IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties IA32 floating point

00575 Floating Point

FLOATING-POINT ARITHMETIC â€¢ Floating-point representation and

Veriﬁcarlo: Checking Floating-Point Accuracy Through … · Veriﬁcarlo: Checking Floating-Point Accuracy ... PhD student in ... Checking Floating-Point Accuracy Through Monte

Floating point units