A gentle introduction to floating point arithmetic
Ho Chun Hok ([email protected])
Custom Computing Group Seminar
25 Nov 2005
IEEE 754: floating point standard
A value v is packed into vsize bits as a sign bit s, an exponent field of esize bits, and a fraction field of fsize bits.
• Normal numbers (when 0 < exponent < max exponent)
  – v = (-1)^s x 2^(exponent - bias) x (1.fraction)
• Subnormal numbers (when exponent = 0)
  – v = (-1)^s x 2^(1 - bias) x (0.fraction)
• Special numbers (when exponent = max exponent)
  – Infinity, NaN (not a number)
• Precisions
  – Single: esize = 8, fsize = 23, vsize = 32
  – Double: esize = 11, fsize = 52, vsize = 64
  – Double extended: vsize > 64
• Operations
  – +, -, x, /, sqrt, float-to-int, int-to-float, compare, float-to-double, double-to-float
• Rounding modes
  – Nearest even, towards +inf, towards -inf, towards 0
What the IEEE 754 standard is supposed to provide
• Approximation to a real number with a known error bound
  – the rounding error (epsilon) of mapping any real number to a floating point number can be determined
• Results of all operations are correctly rounded in case of an inexact result
• Some mathematical properties hold (in general):
  – x+y == y+x, -(-a) == a, (a>=b and c>=0) implies a*c >= b*c, x+0 == x, y*y >= 0
• All exceptions can be detected
  – using exception flags
• Same results across different machines
How can the standard be enforced?
• Processor?
  – Round numbers in the different modes
  – Implement gradual underflow
  – Raise exceptions
• Operating system?
  – Handle exceptions
  – Handle operations that may not be supported in hardware (what if a processor cannot handle subnormal numbers?)
  – Keep track of the floating point unit state (precision, rounding mode)
• Programming language?
  – Well-defined semantics for floating point (yes, we have the infamous Java language)
• Compiler?
  – Preserve the semantics defined by the language
• Programmer?
  – Read: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Case study 1

  #include <stdio.h>

  int main(void) {
      double ref, index;
      int i;
      ref = (double) 169.0 / (double) 170.0;
      for (i = 0; i < 250; i++) {
          index = i;
          if (ref == (double) (index / (index + 1.0)))
              break;
      }
      printf("i=%d\n", i);
      return 0;
  }
Visual C compiler, running on a Pentium M
• Same result on lulu (Pentium 3) and irina (Xeon)
GCC, running on a Pentium 4 (skokie)
gcc:
fld1
faddp %st,%st(1)
fldl 0xfffffff0(%ebp)
fdivp %st,%st(1)
fldl 0xfffffff8(%ebp)
fxch %st(1)
fucompp
fnstsw %ax
and $0x45,%ah
cmp $0x40,%ah
je 0x80483d2 <main+102>
VCC:
fld qword ptr [ebp-10h]
fadd qword ptr [__real@8@3fff8000000000000000 (00426028)]
fdivr qword ptr [ebp-10h]
fcomp qword ptr [ebp-8]
fnstsw ax
test ah,40h
je main+5Fh (0040106f)
jmp main+61h (00401071)
VCC stores the intermediate result on the normal stack (via ebp) as a 64-bit value and compares it with a 64-bit double precision value.
GCC keeps the intermediate result on the x87 FPU register stack (st), at 80-bit extended precision, and compares it with a 64-bit double precision value.
Case study 1
• It's a compiler issue
  – Using more precision for intermediate results is a good idea
  – But the compiler should convert the 80-bit floating point number back to 64 bits before the comparison
• And it's a programmer issue too
  – Equality tests between FP variables are dangerous
  – We can detect the problem before it hurts…
• It is not easy to comply with the standard
Case study 2
• Calculate the expression on the slide
• When x is large, result == 0, rather than the expected nonzero value
• Beware: even when everything complies with the standard, the standard cannot guarantee the result is always correct
• Again, the programmer should detect this before it hurts
  – Define a routine to trap the exception
  – Exceptions are not errors as long as they are handled correctly
Case Study 3
• Jean-Michel Muller's recurrence
Using double extended precision (80 bits):
• x[2] = 5.590164e+00• x[3] = 5.633431e+00• x[4] = 5.674649e+00• x[5] = 5.713329e+00• x[6] = 5.749121e+00• x[7] = 5.781811e+00• x[8] = 5.811314e+00• x[9] = 5.837660e+00• x[10] = 5.861018e+00
• x[11] = 5.882514e+00• x[12] = 5.918471e+00• x[13] = 6.240859e+00• x[14] = 1.115599e+01• x[15] = 5.279838e+01• x[16] = 9.469105e+01• x[17] = 9.966651e+01• x[18] = 9.998007e+01• x[19] = 9.999881e+01
It converges to 100.0; it seems correct
Case Study 3
• This series can converge to either 5, 6, or 100
  – depending on the values of x0, x1
  – If x0 = 11/2 and x1 = 61/11, the series should converge to 6
  – A little round-off error may affect the result dramatically
  – We can calculate the result analytically by substituting the exact starting values into the closed-form solution
• In general, it is very difficult to detect this error
Case Study 4
• Table maker's dilemma
  – If we want n-digit accuracy for an elementary function like sine or cosine, we (in most cases) need to calculate up to n+2 digits
  – What if the last 2 digits are "10"?
    • Then we calculate the last 3 digits
  – What if the last 3 digits are "100"?
    • Then we calculate the last 4 digits
  – What if…
• The results of most elementary function libraries (e.g. libm) are not correctly rounded in some cases
Conclusion
• We have a standard representation for floating point numbers
• Conforming to the standard requires collaboration between different parties
• Even on a standard-compliant platform, caution must be taken with underflow and overflow, and the algorithm must be numerically stable
• When using elementary functions, don't expect the results to be comparable between different machines
  – Elementary functions are NOT included in the IEEE standard
• Floating point, when used properly, can do something serious