Essential Mathematics for Games Programmers (Fixed/Float Tutorial) Lars Bishop ([email protected])

Number Spaces

• Cardinal – Positive numbers, no fractions

• Integer – Positive, negative, zero; no fractions

• Rational – Fractions

• Irrational – Non-repeating decimals (π, e)

• Real – Rationals + irrationals

• Complex – Real + multiple of √-1 (a+bi)


Numerical Representations

• In graphics, we often deal with Real numbers (3.14159, 1.5, etc)

• Unlike integers, Real numbers have fractional components

• There are several common ways of representing these on a computer


Approximating Reals

• When we write a Real number on paper, we generally write only a few digits past the decimal point: 1.5, 3.45

• Most reals cannot be represented exactly by a few fractional digits

• Any written number has inherent precision


Finite Representations

• Any number in a computer has finite representation

• As a result, any such representation cannot represent every number exactly

• When representing numbers, we will always be coping with these limitations

• We need to understand and limit error


Fixed Point Numbers

• Use integer-like representation

• Assume that the least-significant bit is some negative power of 2 (⅛, ¼, etc)

• The “binary point” is in the middle of the number, not after the least-significant bit

Bit 7 6 5 4 • 3 2 1 0

Value 8 4 2 1 • 1/2 1/4 1/8 1/16


Fixed-Point Nomenclature

• Applications can adjust their precision and range by moving the “binary point”

• If a fixed point number has M bits of integral precision and N bits of fractional precision, it is called an M.N number

• This is pronounced “M dot N”

• A 32-bit integer would be “32 dot 0”


Fixed Point Benefits

• Can represent fractional values with only integer arithmetic

• Simple to use and understand

• Addition and Subtraction are the same as for integers

• Multiplication and Division are only slightly different from their integer siblings


Fixed-point vs. Floating-point

• A 32-bit fixed-point number actually has more inherent precision than a 32-bit floating-point number: all 32 bits carry value, versus only 24 significant bits for a float!

• Floating-point trades accuracy for range by using some bits for the exponent

• “Floating point is for the lazy” - John von Neumann (paraphrased)


Floating-point ↔ Fixed-point

• Assuming an M.N fixed-point system:

• IntToFix(i) = i << N

• FloatToFix(f) = (int)(f * 2^N)

Look out for overflow on these!

• FixToInt(i) = i >> N

• FixToFloat(i) = ((float)i) * 2^-N

Conversion to int can lose precision
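The four conversions above can be sketched in C as follows. This is a minimal sketch that assumes a signed 16.16 format; the names `fixed_t`, `FIX_N` and the function names are hypothetical and do not come from the slides.

```c
#include <stdint.h>

#define FIX_N 16            /* assumption: 16 fractional bits (16.16 format) */
typedef int32_t fixed_t;    /* hypothetical type name for a 16.16 value */

/* IntToFix: multiply by 2^N (same effect as i << N for values in range).
   Overflows if i does not fit in the 16 integral bits. */
static fixed_t int_to_fix(int32_t i)
{
    return (fixed_t)(i * (1 << FIX_N));
}

/* FloatToFix: scale by 2^N, then truncate toward zero.
   Overflows if f is outside the 16.16 range. */
static fixed_t float_to_fix(float f)
{
    return (fixed_t)(f * (float)(1 << FIX_N));
}

/* FixToInt: drop the fractional bits (this is where precision is lost).
   Right-shifting a negative value is implementation-defined in C;
   most compilers use an arithmetic shift. */
static int32_t fix_to_int(fixed_t x)
{
    return x >> FIX_N;
}

/* FixToFloat: scale by 2^-N. */
static float fix_to_float(fixed_t x)
{
    return (float)x * (1.0f / (float)(1 << FIX_N));
}
```

With these definitions, `float_to_fix(1.5f)` yields 98304 (1.5 x 65536), and `fix_to_int` of that value yields 1, dropping the fractional half.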


Basic Fixed-Point Math

• For A and B, both M.N fixed point values:

A+B = (int)A + (int)B

A-B = (int)A - (int)B

• where (int)A is simply the fixed point treated bitwise as an integer

• This works because the binary points line up when A and B are both M.N
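As a small illustration of why no adjustment is needed, here is a sketch that assumes a 16.16 format stored in a 32-bit integer. The type name `q16_t` and the function names are hypothetical.

```c
#include <stdint.h>

typedef int32_t q16_t;   /* hypothetical name: a 16.16 fixed-point value */

/* Both operands share the same binary point, so plain integer
   addition and subtraction give correctly scaled results. */
static q16_t q16_add(q16_t a, q16_t b) { return a + b; }
static q16_t q16_sub(q16_t a, q16_t b) { return a - b; }
```

For example, 1.5 is stored as 98304 and 0.25 as 16384; their integer sum 114688 is exactly 1.75 x 65536.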


Multiplication – Basic Idea

• Multiplication is a bit more complex, but it is analogous to the grade-school base 10 trick. For example, to compute 1.5 x 2.5:

We multiply the numbers as integers (15 x 25 = 375), and then slide the decimal point to the left by 1+1=2 places (3.75)


Multiplication

We want to compute 0.5f x 1.0f = 0.5f.

In 4.4 fixed-point, this is:

0 0 0 0 • 1 0 0 0   x   0 0 0 1 • 0 0 0 0

After the integer multiply, we get (8x16=128):

1 0 0 0 0 0 0 0

Then, we need to shift right by 4 (not 8: we want a 4.4 result, not an 8.0 integer result):

0 0 0 0 • 1 0 0 0
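The worked example above can be sketched in C. This sketch assumes an unsigned 4.4 value stored in a `uint8_t` and keeps the intermediate product in 16 bits; the names `fix4_4_t` and `mul_4_4` are hypothetical.

```c
#include <stdint.h>

typedef uint8_t fix4_4_t;   /* hypothetical name: unsigned 4.4 fixed point */

/* Multiply two 4.4 values. The raw integer product has 8 fractional
   bits (an 8.8 value), so shifting right by 4 returns it to 4.4. */
static fix4_4_t mul_4_4(fix4_4_t a, fix4_4_t b)
{
    uint16_t wide = (uint16_t)((uint16_t)a * (uint16_t)b); /* 8.8 product */
    return (fix4_4_t)(wide >> 4);                           /* back to 4.4 */
}
```

Here `mul_4_4(0x08, 0x10)` reproduces the slide: 8 x 16 = 128, shifted right by 4 gives 8, which is 0.5 in 4.4.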


Challenges with Fixed Point

• Range (Overflow): In the previous multiplication example, if we compute 1.0x1.0=1.0 using the given method, the intermediate value (16x16=256) overflows 8 bits

• Precision (Underflow): If we pre-shift numbers down to avoid overflow, we can end up shifting to 0

• Extra shifting required for mul/div
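Both failure modes can be reproduced in a few lines. This sketch assumes unsigned 4.4 values in a `uint8_t` and deliberately keeps the intermediate product in 8 bits; the function names are hypothetical.

```c
#include <stdint.h>

/* Overflow: the product is truncated to 8 bits before the shift,
   so its high bits are lost. */
static uint8_t mul_4_4_naive(uint8_t a, uint8_t b)
{
    uint8_t product = (uint8_t)(a * b);   /* 16 x 16 = 256 wraps to 0 */
    return (uint8_t)(product >> 4);
}

/* Underflow: pre-shifting each operand right by 2 keeps the product
   small, but small operands are shifted all the way down to 0. */
static uint8_t mul_4_4_preshift(uint8_t a, uint8_t b)
{
    return (uint8_t)((a >> 2) * (b >> 2));
}
```

With the naive version, 1.0 x 1.0 (0x10 x 0x10) returns 0. With the pre-shifted version, 1.0 x 1.0 is correct, but 0.125 x 1.0 (0x02 x 0x10) returns 0 even though 0.125 is representable in 4.4.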


Fixed Point Help

• Some CPUs have special instructions

• ARM9 (common in handhelds)
  – Has a 32-bit x 32-bit → 64-bit multiply
  – Also has a similar 64 + 32 x 32 → 64
  – Can avoid overflow or underflow in intermediate values
  – Allows a free shift in every ALU operation
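In portable C, a widening multiply can be expressed with a 64-bit intermediate; whether a compiler maps it to a single hardware instruction depends on the target and is not guaranteed. The sketch below assumes a signed 16.16 format, and the names `fx32_t`, `fx_mul` and `fx_mac` are hypothetical.

```c
#include <stdint.h>

typedef int32_t fx32_t;   /* hypothetical name: signed 16.16 fixed point */

/* 32 x 32 -> 64 multiply: the full 32.32 product is kept, so the
   intermediate cannot overflow. Shifting by 16 returns to 16.16.
   (Right-shifting a negative value is implementation-defined in C.) */
static fx32_t fx_mul(fx32_t a, fx32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (fx32_t)(wide >> 16);
}

/* 64 + 32 x 32 -> 64 multiply-accumulate: sums full-width products
   so a single shift can be applied after the whole sum. */
static int64_t fx_mac(int64_t acc, fx32_t a, fx32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}
```

For instance, 200.0 x 100.0 = 20000.0 fits in 16.16, but its raw product does not fit in 32 bits; `fx_mul` handles it because the intermediate is 64 bits wide.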


Fixed Point Summary

• Allows fractional math to be done quickly with integer hardware

• Requires careful range and precision analysis

• Is the only option on many embedded and handheld devices, which don’t have FPU hardware


Floating Point Numbers

• Used to represent general non-integers

• Often thought of as the set of Reals
  – This is far from the truth, as you’ll see

• Most of this discussion will be about IEEE 754 32-bit single-precision
  – Known in C/C++ as float
  – Mention of doubles later


FP and Scientific Notation

• FP is analogous to scientific notation

• Scientific notation is: ± D.DDDD x 10^E, where
  – D.DDDD has a nonzero leading digit
  – D.DDDD has a fixed number of fractional digits
  – E is a signed integer value


FP/Sci Notation Range/Precision

• Precision is not fixed: the precision of 1.000x10^-2 is 100 times more fine-grained than that of 1.000x10^0

• Precision and range are related

• The larger the number, the less precise

• 32-bit int is not a subset of float
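The last point can be checked directly: a float has 24 significant bits, so not every 32-bit integer survives a round trip through float. This is a minimal sketch; the function name is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if converting i to float and back preserves its value. */
static int survives_float_round_trip(int32_t i)
{
    volatile float f = (float)i;       /* volatile forces rounding to float */
    return (int64_t)f == (int64_t)i;   /* compare in 64 bits to avoid overflow */
}
```

The integer 16777216 (2^24) survives, but 16777217 does not: it rounds to 16777216 when stored in a float.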


FP/Sci Notation Components

• Sign

• Mantissa
  – Normalized: has a nonzero integer digit
  – Limited precision: fixed number of digits

• Exponent
  – Chosen so that the mantissa is normalized


Sign

• FP numbers have an explicit sign bit

• Float has both 0.0f and -0.0f

• By the standard, 0.0f and -0.0f compare as equal

• But the bits are different

• Do not memcmp floats! This is only one of many reasons
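The difference between comparing values and comparing bits can be sketched as follows. The function names are hypothetical.

```c
#include <string.h>

/* Value comparison: follows the floating-point standard. */
static int floats_equal_by_value(float a, float b)
{
    return a == b;
}

/* Bitwise comparison: treats the floats as raw memory. */
static int floats_equal_by_bits(float a, float b)
{
    return memcmp(&a, &b, sizeof a) == 0;
}
```

For 0.0f and -0.0f, the value comparison reports equality while the bitwise comparison does not, because the sign bit differs.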


Exponent

• Like the exponent in scientific notation
  – But, the exponent’s base is 2, not 10

• Stored as a biased number: ExponentBits = Exponent + Bias
  – For float, Bias = 127
  – For normalized floats, -126 ≤ Exponent ≤ 127

• Exponent term is 2^Exponent = 2^(ExponentBits - Bias)


Mantissa

• Represented as 1-dot-23 fixed point

• Store the 23 fractional bits explicitly
  – Integer bit is implied (“hidden bit”)

• Generally, mantissa is normalized
  – In other words, mantissa is 1.MantissaBits
  – Similar to the scientific notation standard

• In the smallest numbers, the integer bit is assumed to be 0


Binary Representation

• Put together, the representation is:

S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM

• S=Sign bit (1)

• E=Exponent bits (8)

• M=Fractional mantissa bits (23)

• We will write as A = (SA,EA,MA)
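The three fields can be pulled out of a float with shifts and masks. This is a minimal sketch assuming `float` is an IEEE 754 single; the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a 32-bit float. */
typedef struct {
    uint32_t sign;           /* 1 bit */
    uint32_t exponent_bits;  /* 8 bits, biased */
    uint32_t mantissa_bits;  /* 23 fractional bits, hidden bit not included */
} float_fields;

static float_fields decode_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);            /* copy the bits without aliasing issues */

    float_fields r;
    r.sign          = u >> 31;
    r.exponent_bits = (u >> 23) & 0xFFu;
    r.mantissa_bits = u & 0x7FFFFFu;
    return r;
}
```

Decoding 1.0f gives S=0, E=127 (a biased exponent of 0) and M=0; decoding 1.5f gives the same exponent with only the top mantissa bit set.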


Special Values

• 0: S=0, E=All 0s, M=All 0s

• -0: S=1, E=All 0s, M=All 0s

• +∞: S=0, E=All 1s, M=All 0s
  – Ex: 1.0f / 0.0f = +∞

• −∞: S=1, E=All 1s, M=All 0s
  – Ex: -1.0f / 0.0f = -∞


Not a Number

• Represents undefined results
  – 0.0f / 0.0f = NaN
  – ACOS(2.0f) = NaN

• Two kinds – Quiet and Signaling
  – Quiet can be passed on to other ops
  – Signaling traps the code

• NaN: E=All 1s, M=Not all 0s
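A quiet NaN can be produced and detected as sketched below. The sketch relies on the IEEE 754 rule that a NaN compares unequal to everything, including itself; the function names are hypothetical, and the self-comparison test can be defeated by aggressive compiler options such as fast-math modes.

```c
#include <math.h>

/* Produces a NaN at run time via 0.0f / 0.0f. */
static float make_nan(void)
{
    volatile float zero = 0.0f;   /* volatile keeps the division at run time */
    return zero / zero;
}

/* A NaN is the only value that is not equal to itself. */
static int is_nan_by_compare(float x)
{
    return x != x;
}

/* The standard library macro gives the same answer. */
static int is_nan_by_library(float x)
{
    return isnan(x) ? 1 : 0;
}
```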


Very Small Numbers

• What happens when we run out of smaller and smaller exponents?

• Could flush to zero
  – But, this can lead to the following problem:
  – X-Y=0 does not imply X=Y!

• Need to gradually underflow to zero

• FP does this by allowing denormals


Denormals

• A denormal is an FP number whose hidden mantissa bit is 0, not 1

• Indicated by E=0, M=Not all 0s

• A denormal is equal to (-1)^S x 2^-126 x 0.MantissaBits

• This allows precision to gradually roll off to zero.
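The denormal formula and the gradual-underflow property can be checked with a short sketch. It assumes IEEE 754 singles with denormals enabled (no flush-to-zero mode); the function names are hypothetical.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Builds the positive denormal with S=0, E=0 and the given mantissa bits.
   Its value is 2^-126 x 0.MantissaBits, i.e. mantissa_bits x 2^-149. */
static float denormal_from_mantissa(uint32_t mantissa_bits)
{
    uint32_t u = mantissa_bits & 0x7FFFFFu;   /* sign and exponent stay 0 */
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* X and Y are distinct values near the smallest normalized float.
   Their difference is below FLT_MIN, yet it is not zero. */
static float tiny_difference(void)
{
    volatile float x = 1.5f * FLT_MIN;
    volatile float y = FLT_MIN;
    return x - y;
}
```

Here `tiny_difference()` returns 2^-127, which is exactly the denormal whose mantissa has only its top bit set, so X-Y is nonzero whenever X and Y differ.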


Floating-point Add Basics

• To add two positive floating point numbers A (EA, MA) and B (EB, MB):
  – Swap as needed so A has the greater exponent, i.e. EB ≤ EA
  – Shift MB to the right by EA-EB bits
  – Add MA+MB and use as the new mantissa
  – Adjust the new exponent up or down to re-normalize the result
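The steps above can be sketched on the raw bits of two floats. This is a simplified illustration, not how hardware does it: it assumes both inputs are positive and normalized, truncates shifted-out bits instead of rounding, and ignores zero, denormals, infinities, NaN and exponent overflow. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

static float add_positive_floats(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);

    /* Step 1: swap so A has the greater exponent. For positive floats,
       the raw bits order the same way as the values. */
    if (ua < ub) { uint32_t t = ua; ua = ub; ub = t; }

    uint32_t ea = ua >> 23;                       /* biased exponents (sign is 0) */
    uint32_t eb = ub >> 23;
    uint32_t ma = (ua & 0x7FFFFFu) | 0x800000u;   /* restore the hidden bit */
    uint32_t mb = (ub & 0x7FFFFFu) | 0x800000u;

    /* Step 2: shift MB right by EA-EB bits so the binary points line up.
       Bits shifted out are simply dropped in this sketch. */
    uint32_t shift = ea - eb;
    mb = (shift < 32u) ? (mb >> shift) : 0u;

    /* Step 3: add the aligned mantissas. */
    uint32_t m = ma + mb;

    /* Step 4: re-normalize. A carry into bit 24 means the sum is >= 2.0,
       so shift right once and bump the exponent. */
    if (m & 0x1000000u) { m >>= 1; ea += 1; }

    uint32_t ur = (ea << 23) | (m & 0x7FFFFFu);   /* drop the hidden bit again */
    float r;
    memcpy(&r, &ur, sizeof r);
    return r;
}
```

For inputs whose sum needs no rounding, such as 1.5f + 0.25f or 3.0f + 5.0f, the result matches the hardware addition. When the exponents differ by 24 or more, MB is shifted entirely to 0 and the result is just A.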


FP Add Notes

• Not a simple process (even for pos #’s)

• If A>>B, then B can be shifted to 0

• Repeatedly adding small numbers to an accumulator (i.e. A+=B) can gradually lead to huge error

• At some point, A stops growing, no matter how many times B is added!
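The stalled accumulator is easy to reproduce. This is a minimal sketch; the function name is hypothetical.

```c
/* Adds step to start, count times, rounding to float after every add. */
static float accumulate(float start, float step, int count)
{
    volatile float sum = start;   /* volatile forces each result back to float */
    for (int i = 0; i < count; ++i) {
        sum += step;
    }
    return sum;
}
```

Starting from 16777216.0f (2^24) and adding 1.0f a thousand times leaves the sum at 16777216.0f, because 1.0f is below the spacing between adjacent floats at that magnitude. Starting from 0.0f, the same loop reaches 1000.0f as expected.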


Fun with Floats – The Real World

• Can’t discuss every FP issue here: this is an example of why you should care

• 3D engine saw a spike in basic FP code

• Code (SLERP) was +,-,* only

• No loops, no complex functions

• What could be the problem?


Breaking Down the Problem

• Input values looked valid (no NaN)

• After a “while”, in a demo, the spike hit

• We saved the values, along with timing

• In a small app, ran the slow and fast cases in tight loops

• The slow cases all had some tiny numbers (~1.0x10^-43)


Tiny ≠ 0.0f

• Slow cases were denormals

• We assumed these numbers to be 0

• But FPU was taking care to be accurate

• Denormal ops seemed to be slow
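A check like the following could be used to spot such values in saved data. It is a sketch based on the bit pattern given earlier (E=0, M=Not all 0s) and assumes IEEE 754 singles; the function name is hypothetical and the slides do not say how the values were actually identified.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if f is a denormal: exponent bits all 0, mantissa not all 0. */
static int is_denormal(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    uint32_t exponent_bits = (u >> 23) & 0xFFu;
    uint32_t mantissa_bits = u & 0x7FFFFFu;
    return exponent_bits == 0u && mantissa_bits != 0u;
}
```

A value such as 1.0e-43f is reported as a denormal, while 0.0f and ordinary values such as 1.0f are not.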


Denormal Performance

• Did some more timing tests

• Even loading a denormal was slow!

• Pentium takes a big hit on denormals
  – True even with exceptions masked!
  – FPU pipeline gets flushed on denormals

• Little things matter


“Don’t Doubles Solve this?”

• Doubles do help with most range and many precision problems. But:
  – Need twice the memory of floats (duh)
  – Frequently, significantly slower than floats
  – Some platforms don’t support them

• Avoid switching to them without tracking down the problem first


Floating Point Wrap-up

• Floats ≠ Reals

• Understand the limits of FP

• Analyze your FP issues

• Don’t just jump to doubles at the first sign of trouble

• Be willing to rework your math functions to be FP-friendly
