An Introduction to Floating Point Arithmetic by Example
Pat Quillen
21 January 2010
Floating Point Arithmetic by Example – p.1/15
Example
What is the value of
1 − 3 ∗ (4/3 − 1)
according to MATLAB?
2.220446049250313e-016
Why?? Essentially because 4/3 cannot be represented
exactly by a binary number with finitely many terms.
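The result above can be reproduced outside MATLAB. The following is a minimal sketch in Python, chosen here only as an assumption of convenience: Python floats are IEEE 754 doubles, the same format MATLAB uses by default, so the arithmetic should match.

```python
# Python floats are IEEE 754 doubles, so this mirrors the MATLAB computation.
result = 1 - 3 * (4 / 3 - 1)

print(result)            # 2.220446049250313e-16
print(result == 2**-52)  # True: the result is exactly 2^-52
```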
Example (continued)
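Notice that
$$\frac{4}{3} = \frac{1}{3/4} = \frac{1}{1 - \frac{1}{4}} = \sum_{k=0}^{\infty} \left(\frac{1}{4}\right)^k.$$
That is,
$$\frac{4}{3} = 1 + \frac{1}{2^2} + \frac{1}{2^4} + \frac{1}{2^6} + \cdots$$
or, in binary,
$$\frac{4}{3} = 1.010101010101\cdots_2$$
which, again, is not exactly representable by finitely many terms.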
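The series can be checked with exact rational arithmetic. The sketch below, in Python (an assumption; the slides themselves use MATLAB), accumulates partial sums with `fractions.Fraction` and then converts the stored double for 4/3 to an exact fraction to see how far it falls short. The variable names are illustrative only.

```python
from fractions import Fraction

# Partial sums of sum_{k>=0} (1/4)^k approach 4/3 but never reach it.
target = Fraction(4, 3)
partial = Fraction(0)
gaps = []
for k in range(6):
    partial += Fraction(1, 4) ** k
    gaps.append(target - partial)

# Fraction(float) converts the stored double exactly, with no further rounding.
stored = Fraction(4 / 3)
shortfall = target - stored

print(gaps[-1])                             # 1/3072
print(shortfall == Fraction(1, 3 * 2**52))  # True
```

Every partial sum leaves a positive gap, and the double nearest to 4/3 is itself short by exactly 1/(3·2^52).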
Floating Point Representation
In binary computers, most floating point numbers are represented as
$(-1)^s \, 2^e \, (1 + f)$
where
s is represented by one bit (called the sign bit).
e is the exponent.
f is the mantissa.
For double precision numbers, e is an eleven bit number
and f is a fifty-two bit number.
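The three fields can be pulled out of a double directly. The following Python sketch (Python is an assumption here; `decompose` and `rebuild` are hypothetical helper names, not part of any library) reinterprets the 64 bits of a double as an integer and splits them into s, the stored exponent, and f. It assumes the stored exponent is offset by 1023, the bias described on the exponent slide, and it handles only normalized numbers.

```python
import struct

def decompose(x: float):
    """Split a double into (sign bit, stored exponent, 52-bit fraction field)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    stored_exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, stored_exponent, fraction

def rebuild(sign, stored_exponent, fraction):
    """Evaluate (-1)^s * 2^e * (1 + f) for a normalized double."""
    e = stored_exponent - 1023   # remove the bias
    f = fraction / 2**52         # fraction field as a value in [0, 1)
    return (-1) ** sign * 2.0**e * (1 + f)

s, eb, frac = decompose(-6.5)
print(s, eb - 1023, frac / 2**52)  # 1 2 0.625
print(rebuild(s, eb, frac))        # -6.5
```

Here -6.5 = (-1)^1 · 2^2 · (1 + 0.625), which matches the fields the sketch recovers.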
Floating Point Exponent
The exponent is stored in 11 bits, so the stored value $e_b$ can range from 0 to $2^{11} - 1 = 2047$.
Negative exponents are represented by biasing e when stored: $e_b = e + \text{bias}$.
The double precision bias is $2^{10} - 1 = 1023$. Thus, $-1023 \le e \le 1024$.
The extreme values $e = -1023$ (stored as $e_b = 0$, used for zero and subnormal numbers) and $e = 1024$ (stored as $e_b = 2047$, used for Inf and NaN) are special, so $-1022 \le e \le 1023$ is the valid range of the exponent.
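The exponent limits can be observed on actual doubles. This Python sketch (an assumption; `stored_exponent` is a hypothetical helper) reads the 11-bit exponent field of the largest finite double, the smallest normalized double, infinity, and zero.

```python
import math
import struct
import sys

def stored_exponent(x: float) -> int:
    """Return the 11-bit stored (biased) exponent field of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF

largest = sys.float_info.max   # largest finite double
smallest = sys.float_info.min  # smallest normalized double, 2**-1022

print(stored_exponent(largest) - 1023)   # 1023
print(stored_exponent(smallest) - 1023)  # -1022
print(stored_exponent(math.inf))         # 2047 (special value)
print(stored_exponent(0.0))              # 0    (special value)
```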
Floating Point Mantissa
f limits the precision of the floating point number.
0 ≤ f < 1
The format $2^e (1 + f)$ provides an implicitly stored 1, so doubles actually have 53 bits of precision.
$2^{52} f$ is an integer ⇒ gaps between successive doubles.
For example, all integers up to $2^{53}$ are exactly representable as floating point numbers, but $2^{53} + 1$ is not.
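The gaps are easy to see experimentally. A short Python sketch (Python is an assumption; the same expressions should behave identically in MATLAB) shows that adding 1 to 2^53 has no effect, and that the gap between neighboring doubles doubles at each power of two.

```python
big = 2.0**53

print(big + 1.0 == big)               # True: 2**53 + 1 rounds back to 2**53
print(big + 2.0 == big)               # False: 2**53 + 2 is representable
print(float(2**53 - 1) == 2**53 - 1)  # True: integers below 2**53 are exact

# The gap just above 1 is 2**-52; just above 2 it is 2**-51.
gap_at_1 = (1.0 + 2**-52) - 1.0
gap_at_2 = (2.0 + 2**-51) - 2.0
print(gap_at_1)  # 2.220446049250313e-16
print(gap_at_2)  # 4.440892098500626e-16
```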
Examples
The number 1 is represented as
$(-1)^0 \, 2^0 \, (1 + 0)$.
That is, s = 0, e = 0, f = 0. Adding the bias (1023), the biased value of e is $e_b = 1023$.
You can use format hex in MATLAB to see the bit pattern of the floating point number in hexadecimal. The first three hex digits (12 bits) represent the sign bit and the biased exponent, and the remaining 13 hex digits (52 bits) represent the mantissa.
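For readers without MATLAB, a rough equivalent of format hex can be sketched in Python (an assumption; `to_hex` is a hypothetical helper, not a built-in). It packs the double into 8 bytes, most significant first, and prints them as 16 hex digits.

```python
import struct

def to_hex(x: float) -> str:
    """Return the 16 hex digits of a double, most significant byte first."""
    return struct.pack(">d", x).hex()

print(to_hex(1.0))   # 3ff0000000000000
print(to_hex(-1.0))  # bff0000000000000 (only the sign bit differs)
print(to_hex(2.0))   # 4000000000000000 (stored exponent 1024, f = 0)
```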
Examples
In the case of the number 1, s = 0 and $e_b = 01111111111_2$, so the first three hex digits are $001111111111_2$ = 3ff. Thus 1 is represented by
3ff0000000000000
For 4/3, $f = 0.01010101 \cdots 0101_2$, or 55···5 in hex. As with 1, 4/3 has e = 0, and so it has representation
3ff5555555555555
which is just slightly smaller than 4/3.
The real number 0.1 has e = −4 and $f = 0.10011001 \cdots 10011010_2$, and thus has representation
3fb999999999999a
which is just slightly larger than 0.1.
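Both bit patterns, and the direction in which each value was rounded, can be confirmed with a small Python sketch (an assumption; `to_hex` is a hypothetical helper). `Fraction(x)` gives the exact value of the stored double, so it can be compared against the true rational number.

```python
import struct
from fractions import Fraction

def to_hex(x: float) -> str:
    """Return the 16 hex digits of a double, most significant byte first."""
    return struct.pack(">d", x).hex()

print(to_hex(4 / 3))  # 3ff5555555555555
print(to_hex(0.1))    # 3fb999999999999a

# Compare the exact stored values with the exact rationals.
print(Fraction(4 / 3) < Fraction(4, 3))  # True: fl(4/3) was rounded down
print(Fraction(0.1) > Fraction(1, 10))   # True: fl(0.1) was rounded up
```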
Round-off
Since $\mathrm{fl}(4/3) \ne 4/3$ (where fl(x) stands for “the floating point representation of x”), we see the behavior
$1 - 3 * (4/3 - 1) \ne 0$.
All of the operations except the division are performed without error, and the special value
$\epsilon = 2^{-52}$
is the result.
ε is referred to as machine epsilon, or the unit roundoff, and it is the distance between 1 and the next larger floating point number.
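The claim that only the division rounds can be checked step by step. In the Python sketch below (an assumption; variable names are illustrative), each intermediate result is compared against exact rational arithmetic on the previous stored value, using `sys.float_info.epsilon` as the counterpart of MATLAB's eps.

```python
import sys
from fractions import Fraction

eps = sys.float_info.epsilon
print(eps == 2**-52)         # True
print(1.0 + eps > 1.0)       # True
print(1.0 + eps / 2 == 1.0)  # True: halfway case rounds back to 1

# Trace 1 - 3*(4/3 - 1) one operation at a time.
q = 4 / 3   # rounded: the only inexact step
d = q - 1   # exact
p = 3 * d   # exact
r = 1 - p   # exact

print(Fraction(q) == Fraction(4, 3))   # False
print(Fraction(d) == Fraction(q) - 1)  # True
print(Fraction(p) == 3 * Fraction(d))  # True
print(Fraction(r) == 1 - Fraction(p))  # True
print(r == eps)                        # True
```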
Example
A very common example of propagation of round-off comes in the form of
0.1 + 0.1 + 0.1
Specifically, is the above expression equal to 0.3?
No! As a matter of fact, MATLAB will tell you that 0.3 is represented by
3fd3333333333333
while 0.1 + 0.1 + 0.1 is represented by
3fd3333333333334
The difference in the last place is due to accumulation of the difference between 0.1 and fl(0.1).
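The same comparison can be sketched in Python (an assumption; `to_hex` is a hypothetical helper). The last line shows one common workaround, a tolerance-based comparison with `math.isclose`, which is offered here as a suggestion rather than something the slides prescribe.

```python
import math
import struct

def to_hex(x: float) -> str:
    """Return the 16 hex digits of a double, most significant byte first."""
    return struct.pack(">d", x).hex()

total = 0.1 + 0.1 + 0.1

print(total == 0.3)              # False
print(to_hex(0.3))               # 3fd3333333333333
print(to_hex(total))             # 3fd3333333333334
print(math.isclose(total, 0.3))  # True: equal within a relative tolerance
```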
Deadly Consequences
Numerical Disasters
1991: Patriot Missile misses Scud!
1996: Ariane Rocket explodes!
Swamping
Due to finiteness of precision, floating point addition can suffer swamping. Suppose we have two floating point numbers $a = 10^5$ and $b = 10^{-12}$. The quantity c = a + b is equal to a, since b is smaller than half the gap between a and its neighboring doubles (about $1.5 \times 10^{-11}$).
To rectify the effects of swamping, one may compute in increasing order of magnitude. For example, try these in MATLAB:
eps/2 + 1 − eps/2
eps/2 − eps/2 + 1
Note: It is frequently infeasible to do this!
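The two expressions, and the effect of summation order more generally, can be tried in Python (an assumption; `add_in_order` is a hypothetical helper). An explicit loop is used instead of the built-in `sum`, because recent Python versions may apply compensated summation to floats, which would hide the effect.

```python
import sys

eps = sys.float_info.epsilon

a, b = 1e5, 1e-12
print(a + b == a)             # True: b is swamped by a

print(eps / 2 + 1 - eps / 2)  # 0.9999999999999999
print(eps / 2 - eps / 2 + 1)  # 1.0

def add_in_order(values):
    """Plain left-to-right floating point summation."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [1.0, eps / 2, eps / 2, eps / 2, eps / 2]
large_first = add_in_order(values)          # each eps/2 is swamped by 1
small_first = add_in_order(sorted(values))  # small terms accumulate first

print(large_first)  # 1.0
print(small_first)  # 1.0000000000000004
```

Sorting is only a sketch of the idea; as the note says, reordering a computation is frequently not practical.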
Cancellation
A phenomenon not dissimilar from swamping is cancellation, which occurs when a number is subtracted from another number of roughly the same magnitude.
For example, for values of x very near 0, the expression
$\sqrt{x + 1} - 1$
suffers cancellation, as 1 swamps x in the computation of x + 1, and the subsequent subtraction results in 0.
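A small Python sketch (an assumption; `naive` is a hypothetical name) shows the loss. For x = 1e-17 the result collapses to 0. For x = 1e-10 the result is nonzero, but compared with a two-term Taylor approximation x/2 − x²/8, used here only as a reference value, few digits are correct.

```python
import math

def naive(x: float) -> float:
    """Evaluate sqrt(x + 1) - 1 exactly as written."""
    return math.sqrt(x + 1) - 1

print(naive(1e-17))  # 0.0, although the true value is about 5e-18

x = 1e-10
reference = x / 2 - x**2 / 8  # Taylor reference; truncation error is negligible here
relative_error = abs(naive(x) - reference) / reference
print(relative_error > 1e-8)  # True: roughly half of the 16 digits are lost
```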
Cancellation
To get around the effects of cancellation, one may rewrite the computation in an equivalent form that avoids the cancellation altogether. For example, computing with
$$\sqrt{x + 1} - 1 = \frac{x}{\sqrt{x + 1} + 1}$$
avoids the cancellation for values of x near zero. Now, the only value of x that results in a zero output is 0 itself.
Note: Not all cancellation can be avoided, and not all cancellation is bad!
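The rewritten form can be compared with the original in a few lines of Python (an assumption; `naive` and `stable` are hypothetical names). Printed values are left unannotated because the exact digits of the naive results depend on rounding.

```python
import math

def naive(x: float) -> float:
    """sqrt(x + 1) - 1 as written; subtracts two nearly equal numbers."""
    return math.sqrt(x + 1) - 1

def stable(x: float) -> float:
    """Algebraically equivalent form with no subtraction of nearby values."""
    return x / (math.sqrt(x + 1) + 1)

for x in (1e-5, 1e-10, 1e-17):
    print(x, naive(x), stable(x))
```

For x = 1e-17 the naive form returns 0 while the stable form returns 5e-18, in line with the claim that only x = 0 now produces a zero output.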
Resources
What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg. Available here.
Numerical Analysis, 8th ed. by Richard L. Burden and J. Douglas Faires.
Numerical Computing with MATLAB by Cleve Moler.
Accuracy and Stability of Numerical Algorithms by Nicholas J. Higham.
Technical Note regarding Floating Point Arithmetic. Available here.