An Introduction to Floating Point Arithmetic by …bellen/Analisi Numerica/laboratorio...Negative exponents are represented by biasing e when stored. Floating Point Arithmetic by Example

An Introduction to Floating PointArithmetic by Example

Pat Quillen

21 January 2010

Floating Point Arithmetic by Example – p.1/15

Example

What is the value of

1 − 3 ∗ (4/3 − 1)

according to MATLAB?


Example


1 − 3 ∗ (4/3 − 1)


2.220446049250313e-016


Example


1 − 3 ∗ (4/3 − 1)


2.220446049250313e-016

Why??


Example


1 − 3 ∗ (4/3 − 1)


2.220446049250313e-016

Why?? Essentially because 4/3 cannot be represented

exactly by a binary number with finitely many terms.


Example (continued)

Notice that4

3=

13

4

=1

1 − 1

4

=

∞∑

k=0

1

4

k

That is,4

3= 1 +

1

22+

1

24+

1

26+ · · ·

or, in binary,4

3= 1.010101010101 · · ·

which, again, is not exactly representable by finitely many

terms.


Floating Point Representation

In binary computers, most floating point numbers arerepresented as

(−1)s 2e (1 + f)

where

s is represented by one bit (called the sign bit).

e is the exponent.

f is the mantissa.


Floating Point Representation

In binary computers, most floating point numbers arerepresented as

(−1)s 2e (1 + f)

where

s is represented by one bit (called the sign bit).

e is the exponent.

f is the mantissa.

For double precision numbers, e is an eleven bit number

and f is a fifty-two bit number.


Floating Point Exponent

As e is represented by 11 bits, it can range in value from0 to 211 − 1 = 2047.




Negative exponents are represented by biasing e whenstored.





The double precision bias is 210 − 1 = 1023. Thus,−1023 ≤ e ≤ 1024.





The double precision bias is 210 − 1 = 1023. Thus,−1023 ≤ e ≤ 1024.

The extreme values e = −1023 (stored as eb = 0) ande = 1024 (stored as eb = 2047) are special, so−1022 ≤ e ≤ 1023 is the valid range of the exponent.


Floating Point Mantissa

f limits the precision of the floating point number.




0 ≤ f < 1




0 ≤ f < 1

The format 2e (1 + f) provides an implicitly stored 1, sodoubles actually have 53 bits of precision.




0 ≤ f < 1


252f is an integer ⇒ gaps between successive doubles.




0 ≤ f < 1


252f is an integer ⇒ gaps between successive doubles.

For example, all integers up to 253 are exactly representable

as floating point numbers, but 253 + 1 is not.


Examples

The number 1 is represented as

(−1)0 20 (1 + 0).

That is, s = 0, e = 0, f = 0. Adding the bias (1023), thebiased value of e is eb = 1023.


Examples

The number 1 is represented as

(−1)0 20 (1 + 0).

That is, s = 0, e = 0, f = 0. Adding the bias (1023), thebiased value of e is eb = 1023.

You can use format hex in MATLAB to see the bit pattern

of the floating point number in hexadecimal. The first three

hex digits (12 bits) represent the sign bit and the biased ex-

ponent, and the remaining 13 hex digits (52 bits) represent

the mantissa.


Examples

In the case of the number 1, s = 0 and eb = 01111111111,so the first three hex digits are 001111111111 = 3ff so, 1is represented by

3ff0000000000000


Examples


3ff0000000000000

For 4

3, f = 0.01010101 · · · 0101, or 55 · · · 5 in hex. As with

1, 4

3has e = 1, and so it has representation

3ff5555555555555

which is just slightly smaller than 4

3.


Examples


3ff0000000000000

For 4

3, f = 0.01010101 · · · 0101, or 55 · · · 5 in hex. As with

1, 4

3has e = 1, and so it has representation

3ff5555555555555

which is just slightly smaller than 4

3.

The real number 0.1 has e = −4, andf = 0.10011001 · · · 10011010, and thus has representation

3fb999999999999a

which is just slightly larger than 0.1.


Round-off

Since fl(

4

3

)

6= 4

3(where fl(x) stands for “the floating point

representation of x”), we see the behavior

1 − 3 ∗ (4/3 − 1) 6= 0.

All of the operations except the division are performedwithout error, and the special value

ǫ = 2−52

is the result.

ǫ is referred to as machine epsilon, or the unit-roundoff, and it

is the distance between 1 and the next closest floating point

number.Floating Point Arithmetic by Example – p.9/15

Example

A very common example of propogation of round-off comesin the form of

0.1 + 0.1 + 0.1

Specifically, is the above expression equal to 0.3?


Example

A very common example of propogation of round-off comesin the form of

0.1 + 0.1 + 0.1

Specifically, is the above expression equal to 0.3?

No! As a matter of fact, MATLAB will tell you that 0.3 isrepresented by

3fd3333333333333while 0.1 + 0.1 + 0.1 is represented by

3fd3333333333334

The difference in the last place is due to accumulation of the

difference between 0.1 and fl(0.1).


Deadly Consequences

Numerical Disasters1991: Patriot Missile misses Scud!

1996: Ariane Rocket explodes!


http://www.ima.umn.edu/~arnold/disasters

Swamping

Due to finiteness of precision, floating point addition cansuffer swamping. Suppose we have two floating pointnumbers a = 105 and b = 10−12. The quantity c = a + b isequal to a, since a and b differ by many orders ofmagnitude.


Swamping


To rectify the effects of swamping, one may compute inincreasing order of magnitude. For example, try these inMATLAB:

eps/2 + 1 − eps/2 eps/2 − eps/2 + 1


Swamping


To rectify the effects of swamping, one may compute inincreasing order of magnitude. For example, try these inMATLAB:

eps/2 + 1 − eps/2 eps/2 − eps/2 + 1

Note: It is frequently infeasible to do this!


Cancellation

A phenomenon not dissimilar from swamping iscancellation, which occurs when a number is subtractedfrom another number of rougly the same magnitude.

For example, for values of x very near 0, the expression√

x + 1 − 1

suffers cancellation, as 1 swamps x in the computation of

x + 1, and the subsuquent subtraction results in 0.


Cancellation

To get around the effects of cancellation, one may rewritetheir computation in an equivalent form that avoids thecancellation altogether. For example, computing with

√x + 1 − 1 =

x√x + 1 + 1

avoids the cancellation for values of x near zero. Now, theonly value of x that results in a zero output is 0 itself.


Cancellation

To get around the effects of cancellation, one may rewritetheir computation in an equivalent form that avoids thecancellation altogether. For example, computing with

√x + 1 − 1 =

x√x + 1 + 1

avoids the cancellation for values of x near zero. Now, theonly value of x that results in a zero output is 0 itself.

Note: Not all cancellation can be avoided, and not all can-

cellation is bad!


Resources

What Every Computer Scientist Should Know AboutFloating-Point Arithmetic by David Goldberg. Availablehere.

Numerical Analysis, 8th ed. by Richard L. Burden andJ. Douglas Faires.

Numerical Computing with MATLAB by Cleve Moler.

Accuracy and Stability of Numerical Algorithms byNicholas J. Higham.

Technical Note regarding Floating Point Arithmetic.Available here


http://docs.sun.com/source/806-3568/ncg_goldberg.html

http://www.mathworks.com/support/tech-notes/1100/1108.html

Documents

An Introduction to Floating Point Arithmetic by …bellen/Analisi Numerica/laboratorio...Negative exponents are represented by biasing e when stored. Floating Point Arithmetic by Example