A gentle introduction to floating point arithmetic
Ho Chun Hok ([email protected])
Custom Computing Group Seminar
25 Nov 2005
IEEE 754: floating point standard
A value v is packed into vsize bits as a sign bit s, an exponent field of esize bits, and a fraction field of fsize bits.
• Normal numbers (when 0 < exponent < max exponent)
  – v = (-1)^s x 2^(exponent - bias) x (1.fraction)
• Subnormal numbers (when exponent = 0)
  – v = (-1)^s x 2^(1 - bias) x (0.fraction)
• Special numbers (when exponent = max exponent)
  – Infinity, NaN (not a number)
• Precisions
  – Single: esize = 8, fsize = 23, vsize = 32
  – Double: esize = 11, fsize = 52, vsize = 64
  – Double extended: vsize > 64
• Operations
  – +, -, x, /, sqrt, float-to-int, int-to-float, compare, float-to-double, double-to-float
• Rounding modes
  – Nearest even, towards +inf, towards -inf, towards 0
What the IEEE 754 standard is supposed to provide
• Approximation to a real number with a known error bound
  – the rounding error (epsilon) of mapping any real number to a floating point number can be determined
• Results of all operations are correctly rounded in case of an inexact result
• Some mathematical properties hold (in general):
  – x+y == y+x, -(-a) == a, (a>=b and c>=0) implies a*c >= b*c, x+0 == x, y*y >= 0
• All exceptions can be detected
  – using exception flags
• Same results across different machines
How can the standard be enforced?
• Processor?
  – Round numbers in the different modes
  – Implement gradual underflow
  – Raise exceptions
• Operating system?
  – Handle exceptions
  – Handle operations that may not be supported in hardware (what if a processor cannot handle subnormal numbers?)
  – Keep track of the floating point unit state (precision, rounding mode)
• Programming language?
  – Well-defined semantics for floating point (yes, we have the infamous Java language)
• Compiler?
  – Preserve the semantics defined by the language
• Programmer?
  – Read: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Case study 1

  #include <stdio.h>

  int main(void) {
      double ref, index;
      int i;
      ref = (double) 169.0 / (double) 170.0;
      for (i = 0; i < 250; i++) {
          index = i;
          if (ref == (double) (index / (index + 1.0)))
              break;
      }
      printf("i=%d\n", i);
      return 0;
  }
Visual C compiler, running on a Pentium M
• Same result on lulu (Pentium 3) and irina (Xeon)
GCC, running on a Pentium 4 (skokie)
gcc:
fld1
faddp %st,%st(1)
fldl 0xfffffff0(%ebp)
fdivp %st,%st(1)
fldl 0xfffffff8(%ebp)
fxch %st(1)
fucompp
fnstsw %ax
and $0x45,%ah
cmp $0x40,%ah
je 0x80483d2 <main+102>
VCC:
fld qword ptr [ebp-10h]
fadd qword ptr [__real@8@3fff8000000000000000 (00426028)]
fdivr qword ptr [ebp-10h]
fcomp qword ptr [ebp-8]
fnstsw ax
test ah,40h
je main+5Fh (0040106f)
jmp main+61h (00401071)
VCC stores the intermediate result on the normal stack (via ebp) as a 64-bit value and compares it with a 64-bit double precision value.
GCC keeps the intermediate result on the x87 FPU register stack (st), at 80-bit extended precision, and compares it with a 64-bit double precision value.
Case study 1
• It's a compiler issue
  – Using more precision for intermediate results is a good idea
  – But the compiler should convert the 80-bit floating point number back to 64 bits before the comparison
• And it's a programmer issue too
  – Equality tests between FP variables are dangerous
  – We can detect the problem before it hurts…
• It is not easy to comply with the standard
Case study 2
• Calculate the expression on the slide
• When x is large, result == 0, rather than the expected nonzero value
• Beware: even when everything complies with the standard, the standard cannot guarantee the result is always correct
• Again, the programmer should detect this before it hurts
  – Define a routine to trap the exception
  – Exceptions are not errors as long as they are handled correctly
Case Study 3
• Jean-Michel Muller's recurrence
Using double extended precision (80 bits):
• x[2] = 5.590164e+00• x[3] = 5.633431e+00• x[4] = 5.674649e+00• x[5] = 5.713329e+00• x[6] = 5.749121e+00• x[7] = 5.781811e+00• x[8] = 5.811314e+00• x[9] = 5.837660e+00• x[10] = 5.861018e+00
• x[11] = 5.882514e+00• x[12] = 5.918471e+00• x[13] = 6.240859e+00• x[14] = 1.115599e+01• x[15] = 5.279838e+01• x[16] = 9.469105e+01• x[17] = 9.966651e+01• x[18] = 9.998007e+01• x[19] = 9.999881e+01
It converges to 100.0; it seems correct
Case Study 3
• This series can converge to either 5, 6, or 100
  – depending on the values of x0, x1
  – If x0 = 11/2 and x1 = 61/11, the series should converge to 6
  – A little round-off error may affect the result dramatically
  – We can calculate the result analytically by substituting the exact starting values into the closed-form solution
• In general, it is very difficult to detect this error
Case Study 4
• Table maker's dilemma
  – If we want n-digit accuracy for an elementary function like sine or cosine, we (in most cases) need to calculate up to n+2 digits
  – What if the last 2 digits are "10"?
    • Then we calculate the last 3 digits
  – What if the last 3 digits are "100"?
    • Then we calculate the last 4 digits
  – What if…
• The results of most elementary function libraries (e.g. libm) are not correctly rounded in some cases
Conclusion
• We have a standard representation for floating point numbers
• Conforming to the standard requires collaboration between different parties
• Even on a standard-compliant platform, caution must be taken with underflow and overflow, and the algorithm must be numerically stable
• When using elementary functions, don't expect the results to be comparable between different machines
  – Elementary functions are NOT included in the IEEE standard
• Floating point, when used properly, can do something serious