Lecture Topic: What is a Number? - IBM · We’ll also consider real numbers and some computer...

Preview:

Citation preview

Lecture Topic: What is a Number?

Overview

In the study of numerical methods, one should like to clarify what is a number.

We’ll start with some fundamental mathematics.

Next, we’ll consider integers and operations with integers.

We’ll also consider real numbers and some computer approximations thereof.

Finally, we’ll examine how operations are performed on these approximations, andwhy certain errors arise.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 2 / 1

Overview

In the study of numerical methods, one should like to clarify what is a number.

We’ll start with some fundamental mathematics.

Next, we’ll consider integers and operations with integers.

We’ll also consider real numbers and some computer approximations thereof.

Finally, we’ll examine how operations are performed on these approximations, andwhy certain errors arise.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 2 / 1

Overview

In the study of numerical methods, one should like to clarify what is a number.

We’ll start with some fundamental mathematics.

Next, we’ll consider integers and operations with integers.

We’ll also consider real numbers and some computer approximations thereof.

Finally, we’ll examine how operations are performed on these approximations, andwhy certain errors arise.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 2 / 1

Overview

In the study of numerical methods, one should like to clarify what is a number.

We’ll start with some fundamental mathematics.

Next, we’ll consider integers and operations with integers.

We’ll also consider real numbers and some computer approximations thereof.

Finally, we’ll examine how operations are performed on these approximations, andwhy certain errors arise.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 2 / 1

Overview

In the study of numerical methods, one should like to clarify what is a number.

We’ll start with some fundamental mathematics.

Next, we’ll consider integers and operations with integers.

We’ll also consider real numbers and some computer approximations thereof.

Finally, we’ll examine how operations are performed on these approximations, andwhy certain errors arise.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 2 / 1

Overview

In the study of numerical methods, one should like to clarify what is a number.

We’ll start with some fundamental mathematics.

Next, we’ll consider integers and operations with integers.

We’ll also consider real numbers and some computer approximations thereof.

Finally, we’ll examine how operations are performed on these approximations, andwhy certain errors arise.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 2 / 1

Overview

In the study of numerical methods, one should like to clarify what is a number.

We’ll start with some fundamental mathematics.

Next, we’ll consider integers and operations with integers.

We’ll also consider real numbers and some computer approximations thereof.

Finally, we’ll examine how operations are performed on these approximations, andwhy certain errors arise.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 2 / 1

RingsIn mathematics, there are many approaches to numbers. In Algebra, there is awell developed theory of particular sets, called rings. Elements of a ring allow forthe addition and multiplication, while preserving a number of rules:

(a + b) + c = a + (b + c) for all a, b, c ∈ R (“associativity of +”)

a + b = b+a for all a, b ∈ R (“commutativity of +”)

There is 0 ∈ R, such that a + 0 = a for all a ∈ R (“additive identityelement”)

For each a ∈ R, there is a ∈ R, such that a + (a) = 0 (−a is the “additiveinverse” of a)

(a · b) · c = a · (b · c) for all a, b, c ∈ R (“associativity of ·” )

There is 1 ∈ R, such that a · 1 = a and 1 · a = a for all a ∈ R (“multiplicativeidentity element”)

a · (b + c) = (a · b) + (a · c) for all a, b, c ∈ R (“left distributivity”)

(b + c) · a = (b · a) + (c · a) for all a, b, c ∈ R (“right distributivity”).

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 3 / 1

Rings

Common examples of rings are integers, real numbers, and complex numbers.

One could hence see “numbers” as elements of a ring.

Unfortunately, the most common representation of numbers on computers doesnot follow these rules.

1 x = 1.0+(0.001-1.0)y = (1.0+0.001)-1.0print x, y, x == yprint "%.16f %.16f" % (x,y)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 4 / 1

Rings

Common examples of rings are integers, real numbers, and complex numbers.

One could hence see “numbers” as elements of a ring.

Unfortunately, the most common representation of numbers on computers doesnot follow these rules.

1 x = 1.0+(0.001-1.0)y = (1.0+0.001)-1.0print x, y, x == yprint "%.16f %.16f" % (x,y)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 4 / 1

Rings

Common examples of rings are integers, real numbers, and complex numbers.

One could hence see “numbers” as elements of a ring.

Unfortunately, the most common representation of numbers on computers doesnot follow these rules.

1 x = 1.0+(0.001-1.0)y = (1.0+0.001)-1.0print x, y, x == yprint "%.16f %.16f" % (x,y)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 4 / 1

Positional Number SystemsFrom grade school, we encounter positional number systems. For example,

422 = 4× 102 + 2× 101 + 2× 100

The number system is called positional because a digit’s meaning depends on itsposition: e.g., the digit 2 above stands once for for 20 = 2× 101 and once for 2.In general, we will represent a number by a sequence of digits (dndn−1 . . . d1d0),where di ∈ S .S is a finite set of symbols: its size |S | = b is called the base of the system. Thedecimal number system uses 10 symbols: S = {0, 1, . . . , 9}.An integer N is represented as follows:

N = (dndn−1 . . . d0)10

=n∑

i=0

dibi = dn × 10n + dn−1 × 10n−1 + · · ·+ d0 × 100.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 5 / 1

Positional Number SystemsFrom grade school, we encounter positional number systems. For example,

422 = 4× 102 + 2× 101 + 2× 100

The number system is called positional because a digit’s meaning depends on itsposition: e.g., the digit 2 above stands once for for 20 = 2× 101 and once for 2.In general, we will represent a number by a sequence of digits (dndn−1 . . . d1d0),where di ∈ S .S is a finite set of symbols: its size |S | = b is called the base of the system. Thedecimal number system uses 10 symbols: S = {0, 1, . . . , 9}.An integer N is represented as follows:

N = (dndn−1 . . . d0)10

=n∑

i=0

dibi = dn × 10n + dn−1 × 10n−1 + · · ·+ d0 × 100.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 5 / 1

Positional Number SystemsFrom grade school, we encounter positional number systems. For example,

422 = 4× 102 + 2× 101 + 2× 100

The number system is called positional because a digit’s meaning depends on itsposition: e.g., the digit 2 above stands once for for 20 = 2× 101 and once for 2.In general, we will represent a number by a sequence of digits (dndn−1 . . . d1d0),where di ∈ S .S is a finite set of symbols: its size |S | = b is called the base of the system. Thedecimal number system uses 10 symbols: S = {0, 1, . . . , 9}.An integer N is represented as follows:

N = (dndn−1 . . . d0)10

=n∑

i=0

dibi = dn × 10n + dn−1 × 10n−1 + · · ·+ d0 × 100.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 5 / 1

Positional Number SystemsFrom grade school, we encounter positional number systems. For example,

422 = 4× 102 + 2× 101 + 2× 100

The number system is called positional because a digit’s meaning depends on itsposition: e.g., the digit 2 above stands once for for 20 = 2× 101 and once for 2.In general, we will represent a number by a sequence of digits (dndn−1 . . . d1d0),where di ∈ S .S is a finite set of symbols: its size |S | = b is called the base of the system. Thedecimal number system uses 10 symbols: S = {0, 1, . . . , 9}.An integer N is represented as follows:

N = (dndn−1 . . . d0)10

=n∑

i=0

dibi = dn × 10n + dn−1 × 10n−1 + · · ·+ d0 × 100.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 5 / 1

Binary NumbersComputers store information in two-state devices, i.e., devices that are either onor off.Such a device is said to contain 1 binary digit (bit) of information.Since computers store information in two-state devices, they use the binarynumber system to represent numbers.The binary system uses the symbols S = {0, 1} and so has b = 2.

Example (Binary Numbers)

The binary number N = 10011012 in expanded form is

N = 1× 26 + 0× 25 + 0× 24 + 1× 23 + 1× 22 + 0× 21 + 1× 20

= 64 + 0 + 0 + 8 + 4 + 0 + 1

= 7710.

♦Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 6 / 1

Binary NumbersComputers store information in two-state devices, i.e., devices that are either onor off.Such a device is said to contain 1 binary digit (bit) of information.Since computers store information in two-state devices, they use the binarynumber system to represent numbers.The binary system uses the symbols S = {0, 1} and so has b = 2.

Example (Binary Numbers)

The binary number N = 10011012 in expanded form is

N = 1× 26 + 0× 25 + 0× 24 + 1× 23 + 1× 22 + 0× 21 + 1× 20

= 64 + 0 + 0 + 8 + 4 + 0 + 1

= 7710.

♦Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 6 / 1

Binary NumbersComputers store information in two-state devices, i.e., devices that are either onor off.Such a device is said to contain 1 binary digit (bit) of information.Since computers store information in two-state devices, they use the binarynumber system to represent numbers.The binary system uses the symbols S = {0, 1} and so has b = 2.

Example (Binary Numbers)

The binary number N = 10011012 in expanded form is

N = 1× 26 + 0× 25 + 0× 24 + 1× 23 + 1× 22 + 0× 21 + 1× 20

= 64 + 0 + 0 + 8 + 4 + 0 + 1

= 7710.

♦Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 6 / 1

Binary NumbersComputers store information in two-state devices, i.e., devices that are either onor off.Such a device is said to contain 1 binary digit (bit) of information.Since computers store information in two-state devices, they use the binarynumber system to represent numbers.The binary system uses the symbols S = {0, 1} and so has b = 2.

Example (Binary Numbers)

The binary number N = 10011012 in expanded form is

N = 1× 26 + 0× 25 + 0× 24 + 1× 23 + 1× 22 + 0× 21 + 1× 20

= 64 + 0 + 0 + 8 + 4 + 0 + 1

= 7710.

♦Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 6 / 1

Binary NumbersComputers store information in two-state devices, i.e., devices that are either onor off.Such a device is said to contain 1 binary digit (bit) of information.Since computers store information in two-state devices, they use the binarynumber system to represent numbers.The binary system uses the symbols S = {0, 1} and so has b = 2.

Example (Binary Numbers)

The binary number N = 10011012 in expanded form is

N = 1× 26 + 0× 25 + 0× 24 + 1× 23 + 1× 22 + 0× 21 + 1× 20

= 64 + 0 + 0 + 8 + 4 + 0 + 1

= 7710.

♦Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 6 / 1

Binary NumbersComputers store information in two-state devices, i.e., devices that are either onor off.Such a device is said to contain 1 binary digit (bit) of information.Since computers store information in two-state devices, they use the binarynumber system to represent numbers.The binary system uses the symbols S = {0, 1} and so has b = 2.

Example (Binary Numbers)

The binary number N = 10011012 in expanded form is

N = 1× 26 + 0× 25 + 0× 24 + 1× 23 + 1× 22 + 0× 21 + 1× 20

= 64 + 0 + 0 + 8 + 4 + 0 + 1

= 7710.

♦Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 6 / 1

Binary NumbersComputers store information in two-state devices, i.e., devices that are either onor off.Such a device is said to contain 1 binary digit (bit) of information.Since computers store information in two-state devices, they use the binarynumber system to represent numbers.The binary system uses the symbols S = {0, 1} and so has b = 2.

Example (Binary Numbers)

The binary number N = 10011012 in expanded form is

N = 1× 26 + 0× 25 + 0× 24 + 1× 23 + 1× 22 + 0× 21 + 1× 20

= 64 + 0 + 0 + 8 + 4 + 0 + 1

= 7710.

♦Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 6 / 1

Computer Number SystemsSince each bit can have two states, a string of n bits can have 2 · 2 · · · 2 = 2n

distinct bit patterns. For example,

a byte of 8 bits can contain one of 28 different bit patterns and so canrepresent 28 = 256 different things;

in older computers (IA-32), a word of 4 bytes or 32 bits could represent 232

or 4 billion different things;

in most computers since 2003 (x86-64) and many mobile phones since 2013(ARMv8-A), a word of 8 bytes or 64 bits could represent 264 or18, 446, 744, 073 billion different things;

in some computers, a word of 16 bytes or 128 bits could represent 2128

different things.

There are many ways to interpret a string of bits computers do not use a plainpositional system, especially beyond integers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 7 / 1

Computer Number SystemsSince each bit can have two states, a string of n bits can have 2 · 2 · · · 2 = 2n

distinct bit patterns. For example,

a byte of 8 bits can contain one of 28 different bit patterns and so canrepresent 28 = 256 different things;

in older computers (IA-32), a word of 4 bytes or 32 bits could represent 232

or 4 billion different things;

in most computers since 2003 (x86-64) and many mobile phones since 2013(ARMv8-A), a word of 8 bytes or 64 bits could represent 264 or18, 446, 744, 073 billion different things;

in some computers, a word of 16 bytes or 128 bits could represent 2128

different things.

There are many ways to interpret a string of bits computers do not use a plainpositional system, especially beyond integers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 7 / 1

Computer Number SystemsSince each bit can have two states, a string of n bits can have 2 · 2 · · · 2 = 2n

distinct bit patterns. For example,

a byte of 8 bits can contain one of 28 different bit patterns and so canrepresent 28 = 256 different things;

in older computers (IA-32), a word of 4 bytes or 32 bits could represent 232

or 4 billion different things;

in most computers since 2003 (x86-64) and many mobile phones since 2013(ARMv8-A), a word of 8 bytes or 64 bits could represent 264 or18, 446, 744, 073 billion different things;

in some computers, a word of 16 bytes or 128 bits could represent 2128

different things.

There are many ways to interpret a string of bits computers do not use a plainpositional system, especially beyond integers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 7 / 1

Computer Number SystemsSince each bit can have two states, a string of n bits can have 2 · 2 · · · 2 = 2n

distinct bit patterns. For example,

a byte of 8 bits can contain one of 28 different bit patterns and so canrepresent 28 = 256 different things;

in older computers (IA-32), a word of 4 bytes or 32 bits could represent 232

or 4 billion different things;

in most computers since 2003 (x86-64) and many mobile phones since 2013(ARMv8-A), a word of 8 bytes or 64 bits could represent 264 or18, 446, 744, 073 billion different things;

in some computers, a word of 16 bytes or 128 bits could represent 2128

different things.

There are many ways to interpret a string of bits computers do not use a plainpositional system, especially beyond integers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 7 / 1

IntegersWe view computer memory as a set of words, each contains w bits.

To store an integer, we allow the first bit to represent the sign of the number:0 = + and 1 = −.

The remaining w − 1 bits are used to represent the magnitude of the number.

This means we can represent 2w−1 different magnitudes along with their signs in aw -bit word.

We choose an encoding or representation that maps the 2w bit patterns to asubset of the integers.

We prefer encodings that:

are one-one correspondences(exactly one encoding for each integer represented);

make it easy to find the encoding of −x , given the encoding of x .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 8 / 1

IntegersWe view computer memory as a set of words, each contains w bits.

To store an integer, we allow the first bit to represent the sign of the number:0 = + and 1 = −.

The remaining w − 1 bits are used to represent the magnitude of the number.

This means we can represent 2w−1 different magnitudes along with their signs in aw -bit word.

We choose an encoding or representation that maps the 2w bit patterns to asubset of the integers.

We prefer encodings that:

are one-one correspondences(exactly one encoding for each integer represented);

make it easy to find the encoding of −x , given the encoding of x .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 8 / 1

IntegersWe view computer memory as a set of words, each contains w bits.

To store an integer, we allow the first bit to represent the sign of the number:0 = + and 1 = −.

The remaining w − 1 bits are used to represent the magnitude of the number.

This means we can represent 2w−1 different magnitudes along with their signs in aw -bit word.

We choose an encoding or representation that maps the 2w bit patterns to asubset of the integers.

We prefer encodings that:

are one-one correspondences(exactly one encoding for each integer represented);

make it easy to find the encoding of −x , given the encoding of x .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 8 / 1

IntegersWe view computer memory as a set of words, each contains w bits.

To store an integer, we allow the first bit to represent the sign of the number:0 = + and 1 = −.

The remaining w − 1 bits are used to represent the magnitude of the number.

This means we can represent 2w−1 different magnitudes along with their signs in aw -bit word.

We choose an encoding or representation that maps the 2w bit patterns to asubset of the integers.

We prefer encodings that:

are one-one correspondences(exactly one encoding for each integer represented);

make it easy to find the encoding of −x , given the encoding of x .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 8 / 1

IntegersWe view computer memory as a set of words, each contains w bits.

To store an integer, we allow the first bit to represent the sign of the number:0 = + and 1 = −.

The remaining w − 1 bits are used to represent the magnitude of the number.

This means we can represent 2w−1 different magnitudes along with their signs in aw -bit word.

We choose an encoding or representation that maps the 2w bit patterns to asubset of the integers.

We prefer encodings that:

are one-one correspondences(exactly one encoding for each integer represented);

make it easy to find the encoding of −x , given the encoding of x .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 8 / 1

IntegersWe view computer memory as a set of words, each contains w bits.

To store an integer, we allow the first bit to represent the sign of the number:0 = + and 1 = −.

The remaining w − 1 bits are used to represent the magnitude of the number.

This means we can represent 2w−1 different magnitudes along with their signs in aw -bit word.

We choose an encoding or representation that maps the 2w bit patterns to asubset of the integers.

We prefer encodings that:

are one-one correspondences(exactly one encoding for each integer represented);

make it easy to find the encoding of −x , given the encoding of x .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 8 / 1

IntegersWe view computer memory as a set of words, each contains w bits.

To store an integer, we allow the first bit to represent the sign of the number:0 = + and 1 = −.

The remaining w − 1 bits are used to represent the magnitude of the number.

This means we can represent 2w−1 different magnitudes along with their signs in aw -bit word.

We choose an encoding or representation that maps the 2w bit patterns to asubset of the integers.

We prefer encodings that:

are one-one correspondences(exactly one encoding for each integer represented);

make it easy to find the encoding of −x , given the encoding of x .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 8 / 1

Example encodings for w = 3

Bit pattern: 000 001 010 011 100 101 110 111Unsigned 0 1 2 3 4 5 6 7Signed +0 +1 +2 +3 −0 −1 −2 −3Fixed-point 0 1/8 1/4 3/8 1/2 5/8 3/4 7/81’s compl. +0 +1 +2 +3 −3 −2 −1 −02’s compl. 0 +1 +2 +3 −4 −3 −2 −1

In what follows, let the notation [x ] stand for the encoding of x .

For example, in E1, we have [3] = 011.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 9 / 1

Example encodings for w = 3

Bit pattern: 000 001 010 011 100 101 110 111Unsigned 0 1 2 3 4 5 6 7Signed +0 +1 +2 +3 −0 −1 −2 −3Fixed-point 0 1/8 1/4 3/8 1/2 5/8 3/4 7/81’s compl. +0 +1 +2 +3 −3 −2 −1 −02’s compl. 0 +1 +2 +3 −4 −3 −2 −1

In what follows, let the notation [x ] stand for the encoding of x .

For example, in E1, we have [3] = 011.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 9 / 1

Example encodings for w = 3

Bit pattern: 000 001 010 011 100 101 110 111Unsigned 0 1 2 3 4 5 6 7Signed +0 +1 +2 +3 −0 −1 −2 −3Fixed-point 0 1/8 1/4 3/8 1/2 5/8 3/4 7/81’s compl. +0 +1 +2 +3 −3 −2 −1 −02’s compl. 0 +1 +2 +3 −4 −3 −2 −1

In what follows, let the notation [x ] stand for the encoding of x .

For example, in E1, we have [3] = 011.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 9 / 1

Example encodings for w = 3

Bit pattern: 000 001 010 011 100 101 110 111Unsigned 0 1 2 3 4 5 6 7Signed +0 +1 +2 +3 −0 −1 −2 −3Fixed-point 0 1/8 1/4 3/8 1/2 5/8 3/4 7/81’s compl. +0 +1 +2 +3 −3 −2 −1 −02’s compl. 0 +1 +2 +3 −4 −3 −2 −1

In what follows, let the notation [x ] stand for the encoding of x .

For example, in E1, we have [3] = 011.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 9 / 1

Example encodings for w = 3

Bit pattern: 000 001 010 011 100 101 110 111Unsigned 0 1 2 3 4 5 6 7Signed +0 +1 +2 +3 −0 −1 −2 −3Fixed-point 0 1/8 1/4 3/8 1/2 5/8 3/4 7/81’s compl. +0 +1 +2 +3 −3 −2 −1 −02’s compl. 0 +1 +2 +3 −4 −3 −2 −1

In what follows, let the notation [x ] stand for the encoding of x .

For example, in E1, we have [3] = 011.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 9 / 1

Example encodings for w = 3

Bit pattern: 000 001 010 011 100 101 110 111Unsigned 0 1 2 3 4 5 6 7Signed +0 +1 +2 +3 −0 −1 −2 −3Fixed-point 0 1/8 1/4 3/8 1/2 5/8 3/4 7/81’s compl. +0 +1 +2 +3 −3 −2 −1 −02’s compl. 0 +1 +2 +3 −4 −3 −2 −1

In what follows, let the notation [x ] stand for the encoding of x .

For example, in E1, we have [3] = 011.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 9 / 1

Example encodings for w = 3

Bit pattern: 000 001 010 011 100 101 110 111Unsigned 0 1 2 3 4 5 6 7Signed +0 +1 +2 +3 −0 −1 −2 −3Fixed-point 0 1/8 1/4 3/8 1/2 5/8 3/4 7/81’s compl. +0 +1 +2 +3 −3 −2 −1 −02’s compl. 0 +1 +2 +3 −4 −3 −2 −1

In what follows, let the notation [x ] stand for the encoding of x .

For example, in E1, we have [3] = 011.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 9 / 1

Example encodingsIn unsigned, we represent only non-negative integers x , without the sign.

In signed, we represent the absolute value of x by the w − 1 bits dw−2dw−3 . . . d0.The sign is represented by dw−1.

If the integer x is non-negative we let dw−1 = 0;if x is non-positive we let dw−1 = 1.

The representation of minus 0 is questionable (not a one-one correspondence).

In fixed point encoding, we again represent only non-negative numbers, but thistime fractions in the range 0 to 7/8 in steps of 1/8(regard the three bits as being after the “decimal” point,e.g., 0.1012 = 1× 2−1 + 0× 2−2 + 1× 2−3 = 5/8).

Notice that in the fixed point encoding, unlike in the floating point encodingintroduced in the next section, the decimal point is always in the same place.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 10 / 1

Example encodingsIn unsigned, we represent only non-negative integers x , without the sign.

In signed, we represent the absolute value of x by the w − 1 bits dw−2dw−3 . . . d0.The sign is represented by dw−1.

If the integer x is non-negative we let dw−1 = 0;if x is non-positive we let dw−1 = 1.

The representation of minus 0 is questionable (not a one-one correspondence).

In fixed point encoding, we again represent only non-negative numbers, but thistime fractions in the range 0 to 7/8 in steps of 1/8(regard the three bits as being after the “decimal” point,e.g., 0.1012 = 1× 2−1 + 0× 2−2 + 1× 2−3 = 5/8).

Notice that in the fixed point encoding, unlike in the floating point encodingintroduced in the next section, the decimal point is always in the same place.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 10 / 1

Example encodingsIn unsigned, we represent only non-negative integers x , without the sign.

In signed, we represent the absolute value of x by the w − 1 bits dw−2dw−3 . . . d0.The sign is represented by dw−1.

If the integer x is non-negative we let dw−1 = 0;if x is non-positive we let dw−1 = 1.

The representation of minus 0 is questionable (not a one-one correspondence).

In fixed point encoding, we again represent only non-negative numbers, but thistime fractions in the range 0 to 7/8 in steps of 1/8(regard the three bits as being after the “decimal” point,e.g., 0.1012 = 1× 2−1 + 0× 2−2 + 1× 2−3 = 5/8).

Notice that in the fixed point encoding, unlike in the floating point encodingintroduced in the next section, the decimal point is always in the same place.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 10 / 1

Example encodingsIn unsigned, we represent only non-negative integers x , without the sign.

In signed, we represent the absolute value of x by the w − 1 bits dw−2dw−3 . . . d0.The sign is represented by dw−1.

If the integer x is non-negative we let dw−1 = 0;if x is non-positive we let dw−1 = 1.

The representation of minus 0 is questionable (not a one-one correspondence).

In fixed point encoding, we again represent only non-negative numbers, but thistime fractions in the range 0 to 7/8 in steps of 1/8(regard the three bits as being after the “decimal” point,e.g., 0.1012 = 1× 2−1 + 0× 2−2 + 1× 2−3 = 5/8).

Notice that in the fixed point encoding, unlike in the floating point encodingintroduced in the next section, the decimal point is always in the same place.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 10 / 1

Example encodingsIn unsigned, we represent only non-negative integers x , without the sign.

In signed, we represent the absolute value of x by the w − 1 bits dw−2dw−3 . . . d0.The sign is represented by dw−1.

If the integer x is non-negative we let dw−1 = 0;if x is non-positive we let dw−1 = 1.

The representation of minus 0 is questionable (not a one-one correspondence).

In fixed point encoding, we again represent only non-negative numbers, but thistime fractions in the range 0 to 7/8 in steps of 1/8(regard the three bits as being after the “decimal” point,e.g., 0.1012 = 1× 2−1 + 0× 2−2 + 1× 2−3 = 5/8).

Notice that in the fixed point encoding, unlike in the floating point encodingintroduced in the next section, the decimal point is always in the same place.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 10 / 1

Example encodingsIn unsigned, we represent only non-negative integers x , without the sign.

In signed, we represent the absolute value of x by the w − 1 bits dw−2dw−3 . . . d0.The sign is represented by dw−1.

If the integer x is non-negative we let dw−1 = 0;if x is non-positive we let dw−1 = 1.

The representation of minus 0 is questionable (not a one-one correspondence).

In fixed point encoding, we again represent only non-negative numbers, but thistime fractions in the range 0 to 7/8 in steps of 1/8(regard the three bits as being after the “decimal” point,e.g., 0.1012 = 1× 2−1 + 0× 2−2 + 1× 2−3 = 5/8).

Notice that in the fixed point encoding, unlike in the floating point encodingintroduced in the next section, the decimal point is always in the same place.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 10 / 1

Example encodingsIn unsigned, we represent only non-negative integers x , without the sign.

In signed, we represent the absolute value of x by the w − 1 bits dw−2dw−3 . . . d0.The sign is represented by dw−1.

If the integer x is non-negative we let dw−1 = 0;if x is non-positive we let dw−1 = 1.

The representation of minus 0 is questionable (not a one-one correspondence).

In fixed point encoding, we again represent only non-negative numbers, but thistime fractions in the range 0 to 7/8 in steps of 1/8(regard the three bits as being after the “decimal” point,e.g., 0.1012 = 1× 2−1 + 0× 2−2 + 1× 2−3 = 5/8).

Notice that in the fixed point encoding, unlike in the floating point encodingintroduced in the next section, the decimal point is always in the same place.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 10 / 1

Example encodingsIn unsigned, we represent only non-negative integers x , without the sign.

In signed, we represent the absolute value of x by the w − 1 bits dw−2dw−3 . . . d0.The sign is represented by dw−1.

If the integer x is non-negative we let dw−1 = 0;if x is non-positive we let dw−1 = 1.

The representation of minus 0 is questionable (not a one-one correspondence).

In fixed point encoding, we again represent only non-negative numbers, but thistime fractions in the range 0 to 7/8 in steps of 1/8(regard the three bits as being after the “decimal” point,e.g., 0.1012 = 1× 2−1 + 0× 2−2 + 1× 2−3 = 5/8).

Notice that in the fixed point encoding, unlike in the floating point encodingintroduced in the next section, the decimal point is always in the same place.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 10 / 1

Example encodings: 1’s complementIn ones complement or representation modulo 2w − 1,given an integer x such that

−2w−1 + 1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

x if x > 0

x + 2w − 1 if x < 0

0 or 2w − 1 if x = 0

.

The existence of two zeros (incl. minus 0 encoded by 2w − 1) is questionable.Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros.

This is called the ones complement representation of negative integers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 11 / 1

Example encodings: 1’s complementIn ones complement or representation modulo 2w − 1,given an integer x such that

−2w−1 + 1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

x if x > 0

x + 2w − 1 if x < 0

0 or 2w − 1 if x = 0

.

The existence of two zeros (incl. minus 0 encoded by 2w − 1) is questionable.Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros.

This is called the ones complement representation of negative integers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 11 / 1

Example encodings: 1’s complementIn ones complement or representation modulo 2w − 1,given an integer x such that

−2w−1 + 1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

x if x > 0

x + 2w − 1 if x < 0

0 or 2w − 1 if x = 0

.

The existence of two zeros (incl. minus 0 encoded by 2w − 1) is questionable.Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros.

This is called the ones complement representation of negative integers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 11 / 1

Example encodings: 1’s complementIn ones complement or representation modulo 2w − 1,given an integer x such that

−2w−1 + 1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

x if x > 0

x + 2w − 1 if x < 0

0 or 2w − 1 if x = 0

.

The existence of two zeros (incl. minus 0 encoded by 2w − 1) is questionable.Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros.

This is called the ones complement representation of negative integers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 11 / 1

Example encodings: 1’s complementIn ones complement or representation modulo 2w − 1,given an integer x such that

−2w−1 + 1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

x if x > 0

x + 2w − 1 if x < 0

0 or 2w − 1 if x = 0

.

The existence of two zeros (incl. minus 0 encoded by 2w − 1) is questionable.Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros.

This is called the ones complement representation of negative integers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 11 / 1

Example encodingsIn twos complement or representation modulo 2w ,given an integer x such that

−2w−1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

{x if x ≥ 0

x + 2w if x < 0.

Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros(i.e., apply the logic operation not to each bit),then add 1.

That is, do a ones complement and add 1, carrying to the left if necessary.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 12 / 1

Example encodingsIn twos complement or representation modulo 2w ,given an integer x such that

−2w−1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

{x if x ≥ 0

x + 2w if x < 0.

Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros(i.e., apply the logic operation not to each bit),then add 1.

That is, do a ones complement and add 1, carrying to the left if necessary.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 12 / 1

Example encodingsIn twos complement or representation modulo 2w ,given an integer x such that

−2w−1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

{x if x ≥ 0

x + 2w if x < 0.

Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros(i.e., apply the logic operation not to each bit),then add 1.

That is, do a ones complement and add 1, carrying to the left if necessary.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 12 / 1

Example encodingsIn twos complement or representation modulo 2w ,given an integer x such that

−2w−1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

{x if x ≥ 0

x + 2w if x < 0.

Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros(i.e., apply the logic operation not to each bit),then add 1.

That is, do a ones complement and add 1, carrying to the left if necessary.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 12 / 1

Example encodingsIn twos complement or representation modulo 2w ,given an integer x such that

−2w−1 ≤ x ≤ 2w−1 − 1

we let

[x ] =

{x if x ≥ 0

x + 2w if x < 0.

Given a representation of a non-negative integer x ,one obtains a representation of −x as follows:replace all zeros by ones and all ones by zeros(i.e., apply the logic operation not to each bit),then add 1.

That is, do a ones complement and add 1, carrying to the left if necessary.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 12 / 1

Example encodings

The name twos complement for this representation of negative integers is notperfect:only the rightmost bits up to and including the least significant one arecomplemented with respect to 2 (i.e., left unchanged) andall the other bits are complemented with respect to 1.Note that there is one extra negative number, −4 in this case(it would be −128 if w = 8, or −231 if w = 32),for which there is no corresponding positive number.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 13 / 1

Example encodings

The name twos complement for this representation of negative integers is notperfect:only the rightmost bits up to and including the least significant one arecomplemented with respect to 2 (i.e., left unchanged) andall the other bits are complemented with respect to 1.Note that there is one extra negative number, −4 in this case(it would be −128 if w = 8, or −231 if w = 32),for which there is no corresponding positive number.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 13 / 1

Example encodings

The name twos complement for this representation of negative integers is notperfect:only the rightmost bits up to and including the least significant one arecomplemented with respect to 2 (i.e., left unchanged) andall the other bits are complemented with respect to 1.Note that there is one extra negative number, −4 in this case(it would be −128 if w = 8, or −231 if w = 32),for which there is no corresponding positive number.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 13 / 1

Twos complement encoding: 32 bit example

Here are the numbers −4,−3, . . . , 1, 2, 3with their representations on a 32-bit PC encoded as (signed) int).

This is a 32-bit twos complement encoding.

-4 : 11111111111111111111111111111100-3 : 11111111111111111111111111111101-2 : 11111111111111111111111111111110-1 : 111111111111111111111111111111110 : 000000000000000000000000000000001 : 000000000000000000000000000000012 : 000000000000000000000000000000103 : 00000000000000000000000000000011

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 14 / 1

Twos complement encoding: 32 bit example

Here are the numbers −4,−3, . . . , 1, 2, 3with their representations on a 32-bit PC encoded as (signed) int).

This is a 32-bit twos complement encoding.

-4 : 11111111111111111111111111111100-3 : 11111111111111111111111111111101-2 : 11111111111111111111111111111110-1 : 111111111111111111111111111111110 : 000000000000000000000000000000001 : 000000000000000000000000000000012 : 000000000000000000000000000000103 : 00000000000000000000000000000011

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 14 / 1

Twos complement encoding: representing −12

For example, to get the representation of −12, start with that of 12:

000 · · ·0001100 (Start with 12)−→ 111 · · ·1110011 (Bitwise not)−→ 111 · · ·1110100 (Add 1, carrying bit to left, giving −12)

It should be clear from this example of 5 encodings that we may store any type ofnumber in a w -bit word. All we need is the appropriate encoding.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 15 / 1

Twos complement encoding: representing −12

For example, to get the representation of −12, start with that of 12:

000 · · ·0001100 (Start with 12)−→ 111 · · ·1110011 (Bitwise not)−→ 111 · · ·1110100 (Add 1, carrying bit to left, giving −12)

It should be clear from this example of 5 encodings that we may store any type ofnumber in a w -bit word. All we need is the appropriate encoding.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 15 / 1

Twos complement encoding: representing −12

For example, to get the representation of −12, start with that of 12:

000 · · ·0001100 (Start with 12)−→ 111 · · ·1110011 (Bitwise not)−→ 111 · · ·1110100 (Add 1, carrying bit to left, giving −12)

It should be clear from this example of 5 encodings that we may store any type ofnumber in a w -bit word. All we need is the appropriate encoding.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 15 / 1

Twos complement encoding: representing −12

For example, to get the representation of −12, start with that of 12:

000 · · ·0001100 (Start with 12)−→ 111 · · ·1110011 (Bitwise not)−→ 111 · · ·1110100 (Add 1, carrying bit to left, giving −12)

It should be clear from this example of 5 encodings that we may store any type ofnumber in a w -bit word. All we need is the appropriate encoding.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 15 / 1

The Range and Distribution of Integers

Most computer use encoding E5 above (twos complement) with the word-lengthw = 32, by default.

The 232 integers are in the range

−231 ≤ i ≤ +231 − 1 or, approximately, − 109 ≤ i ≤ +109.

These integers are distributed uniformly across this range.

If the result of an arithmetic operation is an integer outside this range then aninteger overflow occurs.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 16 / 1

The Range and Distribution of Integers

Most computer use encoding E5 above (twos complement) with the word-lengthw = 32, by default.

The 232 integers are in the range

−231 ≤ i ≤ +231 − 1 or, approximately, − 109 ≤ i ≤ +109.

These integers are distributed uniformly across this range.

If the result of an arithmetic operation is an integer outside this range then aninteger overflow occurs.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 16 / 1

The Range and Distribution of Integers

Most computer use encoding E5 above (twos complement) with the word-lengthw = 32, by default.

The 232 integers are in the range

−231 ≤ i ≤ +231 − 1 or, approximately, − 109 ≤ i ≤ +109.

These integers are distributed uniformly across this range.

If the result of an arithmetic operation is an integer outside this range then aninteger overflow occurs.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 16 / 1

The Range and Distribution of Integers

Most computer use encoding E5 above (twos complement) with the word-lengthw = 32, by default.

The 232 integers are in the range

−231 ≤ i ≤ +231 − 1 or, approximately, − 109 ≤ i ≤ +109.

These integers are distributed uniformly across this range.

If the result of an arithmetic operation is an integer outside this range then aninteger overflow occurs.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 16 / 1

Real Numbers

Beyond integers, one often needs to approximate numbers of potentially infinitebinary and decimal representation (e.g., 1

3 = 0.333 · · · ,√

2) on a machine withfinite amount of memory.

This often requires very non-trivial algorithms or some loss of precision.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 17 / 1

Real Numbers

Beyond integers, one often needs to approximate numbers of potentially infinitebinary and decimal representation (e.g., 1

3 = 0.333 · · · ,√

2) on a machine withfinite amount of memory.

This often requires very non-trivial algorithms or some loss of precision.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 17 / 1

Rational Numbers

For rational numbers, which can be expressed as a fraction of two integers, suchas 1

3 , it is easy to store the two integers.

Quite surprisingly, few computer programs actually do this, due to the fact thearithmetic operations over such a representation does not come with hardwaresupport.

Let us illustrate this in Python:

1 from fractions import Fraction3*Fraction(1, 3)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 18 / 1

Rational Numbers

For rational numbers, which can be expressed as a fraction of two integers, suchas 1

3 , it is easy to store the two integers.

Quite surprisingly, few computer programs actually do this, due to the fact thearithmetic operations over such a representation does not come with hardwaresupport.

Let us illustrate this in Python:

from fractions import Fraction3*Fraction(1, 3)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 18 / 1

Rational Numbers

For rational numbers, which can be expressed as a fraction of two integers, suchas 1

3 , it is easy to store the two integers.

Quite surprisingly, few computer programs actually do this, due to the fact thearithmetic operations over such a representation does not come with hardwaresupport.

Let us illustrate this in Python:

from fractions import Fraction3*Fraction(1, 3)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 18 / 1

Fractions in Positional Number Systems

Alternatively, one can represent some rational numbers by extending positionalnumber systems.

There, we represent numbers with fractional parts by allowing the index i of thedigits di to be negative.

Thus N with a fractional part is represented as follows:

N =(dndn−1 . . . d0d−1 . . . d−m)10 =n∑

i=−m

dibi

=dn × 10n + · · ·+ d0 × 100 + d−1 × 10−1 + . . . + d−m × 10−m.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 19 / 1

Fractions in Positional Number Systems

Alternatively, one can represent some rational numbers by extending positionalnumber systems.

There, we represent numbers with fractional parts by allowing the index i of thedigits di to be negative.

Thus N with a fractional part is represented as follows:

N =(dndn−1 . . . d0d−1 . . . d−m)10 =n∑

i=−m

dibi

=dn × 10n + · · ·+ d0 × 100 + d−1 × 10−1 + . . . + d−m × 10−m.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 19 / 1

Fractions in Positional Number Systems

Alternatively, one can represent some rational numbers by extending positionalnumber systems.

There, we represent numbers with fractional parts by allowing the index i of thedigits di to be negative.

Thus N with a fractional part is represented as follows:

N =(dndn−1 . . . d0d−1 . . . d−m)10 =n∑

i=−m

dibi

=dn × 10n + · · ·+ d0 × 100 + d−1 × 10−1 + . . . + d−m × 10−m.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 19 / 1

Fractions in Positional Number Systems

Alternatively, one can represent some rational numbers by extending positionalnumber systems.

There, we represent numbers with fractional parts by allowing the index i of thedigits di to be negative.

Thus N with a fractional part is represented as follows:

N =(dndn−1 . . . d0d−1 . . . d−m)10 =n∑

i=−m

dibi

=dn × 10n + · · ·+ d0 × 100 + d−1 × 10−1 + . . . + d−m × 10−m.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 19 / 1

Binary Numbers

Example (Binary Numbers II)

The fractional binary number N = 1001.1012 in expanded form is

N = 1× 23 + 0× 22 + 0× 21 + 1× 20 + 1× 2−1 + 0× 2−2 + 1× 2−3

= 8 + 0 + 0 + 1 + 0.5 + 0 + 0.125

= 9.62510.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 20 / 1

Binary Numbers

Example (Binary Numbers II)

The fractional binary number N = 1001.1012 in expanded form is

N = 1× 23 + 0× 22 + 0× 21 + 1× 20 + 1× 2−1 + 0× 2−2 + 1× 2−3

= 8 + 0 + 0 + 1 + 0.5 + 0 + 0.125

= 9.62510.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 20 / 1

Binary Numbers

Example (Binary Numbers III)

The fractional decimal number N = 1.110 in its binary expansion:

N = 1× 20 + 0 · 2−1 + 0 · 2−2 + 0 · 2−3 + 2−4 + 2−5 + 0 · 2−6 + 2−7

≈ 1.00011001100110011001100110011001100110011001100110011001100110

≈ 1.0999999999999999999132638262011596452794037759304046630859375.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 21 / 1

Two Questions

print 1.1+2.2, (1.1+2.2) == 3.3print "%.16f %.16f %.16f" % (1.1, 2.2, 1.1+2.2)

Exercise

What positional number system would it take to encode 13 precisely?

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 22 / 1

Irrational Numbers

For irrational numbers, such as√

2, the situation is very different.

No matter how we pick the positional number system or alphabet, the digits are afinite alphabet.

If you allow for infinitely long strings over a finite alphabet, you still end up withonly countably many numbers.

There are, however, uncountably many irrational numbers.

Hence, there is no hope of encoding all real numbers exactly, at the same time, ona digital computer.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 23 / 1

Irrational Numbers

For irrational numbers, such as√

2, the situation is very different.

No matter how we pick the positional number system or alphabet, the digits are afinite alphabet.

If you allow for infinitely long strings over a finite alphabet, you still end up withonly countably many numbers.

There are, however, uncountably many irrational numbers.

Hence, there is no hope of encoding all real numbers exactly, at the same time, ona digital computer.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 23 / 1

Irrational Numbers

For irrational numbers, such as√

2, the situation is very different.

No matter how we pick the positional number system or alphabet, the digits are afinite alphabet.

If you allow for infinitely long strings over a finite alphabet, you still end up withonly countably many numbers.

There are, however, uncountably many irrational numbers.

Hence, there is no hope of encoding all real numbers exactly, at the same time, ona digital computer.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 23 / 1

Irrational Numbers

For irrational numbers, such as√

2, the situation is very different.

No matter how we pick the positional number system or alphabet, the digits are afinite alphabet.

If you allow for infinitely long strings over a finite alphabet, you still end up withonly countably many numbers.

There are, however, uncountably many irrational numbers.

Hence, there is no hope of encoding all real numbers exactly, at the same time, ona digital computer.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 23 / 1

Irrational Numbers

For irrational numbers, such as√

2, the situation is very different.

No matter how we pick the positional number system or alphabet, the digits are afinite alphabet.

If you allow for infinitely long strings over a finite alphabet, you still end up withonly countably many numbers.

There are, however, uncountably many irrational numbers.

Hence, there is no hope of encoding all real numbers exactly, at the same time, ona digital computer.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 23 / 1

Irrational Numbers

For irrational numbers, such as√

2, the situation is very different.

No matter how we pick the positional number system or alphabet, the digits are afinite alphabet.

If you allow for infinitely long strings over a finite alphabet, you still end up withonly countably many numbers.

There are, however, uncountably many irrational numbers.

Hence, there is no hope of encoding all real numbers exactly, at the same time, ona digital computer.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 23 / 1

Irrational Numbers

One option is to use symbolic representations thereof,

but these may slow the computation down considerably:

from sympy import *sympy.sqrt(8)

3 x = symbols(’x’)solve(x**2 - 2, x)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 24 / 1

Irrational Numbers

One option is to use symbolic representations thereof,

but these may slow the computation down considerably:

1 from sympy import *sympy.sqrt(8)x = symbols(’x’)solve(x**2 - 2, x)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 24 / 1

Approximations

Alternatively, one has to be content with an approximation.

In theory, one can use an approximation with arbitrary precision. This can beimplemented in Python using module decimal.

In practice, there is a widespread encoding, the floating-point number system,with well-defined error characteristics.

This encoding has been standardised by the Institute of Electrical and ElectronicEngineers (IEEE) and in-built into most modern computers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 25 / 1

Approximations

Alternatively, one has to be content with an approximation.

In theory, one can use an approximation with arbitrary precision. This can beimplemented in Python using module decimal.

In practice, there is a widespread encoding, the floating-point number system,with well-defined error characteristics.

This encoding has been standardised by the Institute of Electrical and ElectronicEngineers (IEEE) and in-built into most modern computers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 25 / 1

Approximations

Alternatively, one has to be content with an approximation.

In theory, one can use an approximation with arbitrary precision. This can beimplemented in Python using module decimal.

In practice, there is a widespread encoding, the floating-point number system,with well-defined error characteristics.

This encoding has been standardised by the Institute of Electrical and ElectronicEngineers (IEEE) and in-built into most modern computers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 25 / 1

Approximations

Alternatively, one has to be content with an approximation.

In theory, one can use an approximation with arbitrary precision. This can beimplemented in Python using module decimal.

In practice, there is a widespread encoding, the floating-point number system,with well-defined error characteristics.

This encoding has been standardised by the Institute of Electrical and ElectronicEngineers (IEEE) and in-built into most modern computers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 25 / 1

Floating Point NumbersA floating point number system tries to represent (part of) the real numbers R ina way that a computer can handle,and which is useful over as wide a range of numbers as possible.

All floating point number systems represent a real number x as

x = ±.m × be = ±(m1

b1+

m2

b2+ · · · mt

bt

)× be

= ±.(m1m2 . . .mt)b × be ,

where m is called the mantissa,e is called the exponent,t is the precision or number of digits in the mantissa, andb is the base.

For example, the decimal number 321.456 has the floating point representation+.321456× 10+3.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 26 / 1

Floating Point NumbersA floating point number system tries to represent (part of) the real numbers R ina way that a computer can handle,and which is useful over as wide a range of numbers as possible.

All floating point number systems represent a real number x as

x = ±.m × be = ±(m1

b1+

m2

b2+ · · · mt

bt

)× be

= ±.(m1m2 . . .mt)b × be ,

where m is called the mantissa,e is called the exponent,t is the precision or number of digits in the mantissa, andb is the base.

For example, the decimal number 321.456 has the floating point representation+.321456× 10+3.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 26 / 1

Floating Point NumbersA floating point number system tries to represent (part of) the real numbers R ina way that a computer can handle,and which is useful over as wide a range of numbers as possible.

All floating point number systems represent a real number x as

x = ±.m × be = ±(m1

b1+

m2

b2+ · · · mt

bt

)× be

= ±.(m1m2 . . .mt)b × be ,

where m is called the mantissa,e is called the exponent,t is the precision or number of digits in the mantissa, andb is the base.

For example, the decimal number 321.456 has the floating point representation+.321456× 10+3.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 26 / 1

Floating Point NumbersA floating point number system tries to represent (part of) the real numbers R ina way that a computer can handle,and which is useful over as wide a range of numbers as possible.

All floating point number systems represent a real number x as

x = ±.m × be = ±(m1

b1+

m2

b2+ · · · mt

bt

)× be

= ±.(m1m2 . . .mt)b × be ,

where m is called the mantissa,e is called the exponent,t is the precision or number of digits in the mantissa, andb is the base.

For example, the decimal number 321.456 has the floating point representation+.321456× 10+3.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 26 / 1

Floating Point NumbersA floating point number system tries to represent (part of) the real numbers R ina way that a computer can handle,and which is useful over as wide a range of numbers as possible.

All floating point number systems represent a real number x as

x = ±.m × be = ±(m1

b1+

m2

b2+ · · · mt

bt

)× be

= ±.(m1m2 . . .mt)b × be ,

where m is called the mantissa,e is called the exponent,t is the precision or number of digits in the mantissa, andb is the base.

For example, the decimal number 321.456 has the floating point representation+.321456× 10+3.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 26 / 1

Floating Point NumbersA floating point number system tries to represent (part of) the real numbers R ina way that a computer can handle,and which is useful over as wide a range of numbers as possible.

All floating point number systems represent a real number x as

x = ±.m × be = ±(m1

b1+

m2

b2+ · · · mt

bt

)× be

= ±.(m1m2 . . .mt)b × be ,

where m is called the mantissa,e is called the exponent,t is the precision or number of digits in the mantissa, andb is the base.

For example, the decimal number 321.456 has the floating point representation+.321456× 10+3.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 26 / 1

Floating Point NumbersA floating point number system tries to represent (part of) the real numbers R ina way that a computer can handle,and which is useful over as wide a range of numbers as possible.

All floating point number systems represent a real number x as

x = ±.m × be = ±(m1

b1+

m2

b2+ · · · mt

bt

)× be

= ±.(m1m2 . . .mt)b × be ,

where m is called the mantissa,e is called the exponent,t is the precision or number of digits in the mantissa, andb is the base.

For example, the decimal number 321.456 has the floating point representation+.321456× 10+3.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 26 / 1

Normalised Floating Point Numbers

Each digit mi of the mantissa is in the range 0 ≤ mi ≤ b − 1.

If m1 6= 0 then the number is called normalised;otherwise, it is called subnormal.

The exponent must lie in the range emin ≤ e ≤ emax.

Both the mantissa and the exponent are represented as integers.

The number 0.0 is represented as

0.0 = .(00 . . . 0)b × b0.

This is a subnormal number because m1 = 0.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 27 / 1

Normalised Floating Point Numbers

Each digit mi of the mantissa is in the range 0 ≤ mi ≤ b − 1.

If m1 6= 0 then the number is called normalised;otherwise, it is called subnormal.

The exponent must lie in the range emin ≤ e ≤ emax.

Both the mantissa and the exponent are represented as integers.

The number 0.0 is represented as

0.0 = .(00 . . . 0)b × b0.

This is a subnormal number because m1 = 0.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 27 / 1

Normalised Floating Point Numbers

Each digit mi of the mantissa is in the range 0 ≤ mi ≤ b − 1.

If m1 6= 0 then the number is called normalised;otherwise, it is called subnormal.

The exponent must lie in the range emin ≤ e ≤ emax.

Both the mantissa and the exponent are represented as integers.

The number 0.0 is represented as

0.0 = .(00 . . . 0)b × b0.

This is a subnormal number because m1 = 0.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 27 / 1

Normalised Floating Point Numbers

Each digit mi of the mantissa is in the range 0 ≤ mi ≤ b − 1.

If m1 6= 0 then the number is called normalised;otherwise, it is called subnormal.

The exponent must lie in the range emin ≤ e ≤ emax.

Both the mantissa and the exponent are represented as integers.

The number 0.0 is represented as

0.0 = .(00 . . . 0)b × b0.

This is a subnormal number because m1 = 0.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 27 / 1

Normalised Floating Point Numbers

Each digit mi of the mantissa is in the range 0 ≤ mi ≤ b − 1.

If m1 6= 0 then the number is called normalised;otherwise, it is called subnormal.

The exponent must lie in the range emin ≤ e ≤ emax.

Both the mantissa and the exponent are represented as integers.

The number 0.0 is represented as

0.0 = .(00 . . . 0)b × b0.

This is a subnormal number because m1 = 0.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 27 / 1

Normalised Floating Point Numbers

Each digit mi of the mantissa is in the range 0 ≤ mi ≤ b − 1.

If m1 6= 0 then the number is called normalised;otherwise, it is called subnormal.

The exponent must lie in the range emin ≤ e ≤ emax.

Both the mantissa and the exponent are represented as integers.

The number 0.0 is represented as

0.0 = .(00 . . . 0)b × b0.

This is a subnormal number because m1 = 0.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 27 / 1

Normalised Floating Point Numbers

Each digit mi of the mantissa is in the range 0 ≤ mi ≤ b − 1.

If m1 6= 0 then the number is called normalised;otherwise, it is called subnormal.

The exponent must lie in the range emin ≤ e ≤ emax.

Both the mantissa and the exponent are represented as integers.

The number 0.0 is represented as

0.0 = .(00 . . . 0)b × b0.

This is a subnormal number because m1 = 0.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 27 / 1

Storage of Floating Point Numbers

A floating point number has a sign,a mantissa,and an exponent (whose sign is not explicitly written).

Thus, the number x = (±,m, e) has three partsand these are stored in a w -bit word. The sign occupies 1 bit, the mantissa t bits,and the exponent w − t − 1 bits.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 28 / 1

Storage of Floating Point Numbers

A floating point number has a sign,a mantissa,and an exponent (whose sign is not explicitly written).

Thus, the number x = (±,m, e) has three partsand these are stored in a w -bit word. The sign occupies 1 bit, the mantissa t bits,and the exponent w − t − 1 bits.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 28 / 1

Storage of Floating Point Numbers

A floating point number has a sign,a mantissa,and an exponent (whose sign is not explicitly written).

Thus, the number x = (±,m, e) has three partsand these are stored in a w -bit word. The sign occupies 1 bit, the mantissa t bits,and the exponent w − t − 1 bits.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 28 / 1

Biasing the exponent

So that the exponent field can represent both positive and negative exponents,we add a bias to the actual exponent,to ensure that the stored exponent is always non-negative.

The actual (unbiased) exponent is needed for input and output only.

We allow for half the possible exponents to be positive, half to be non-positive.

Thus for an exponent of w − t − 1 bits,the bias is the largest integer representable with w − t − 2 bits,that is, 2w−t−2 − 1.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 29 / 1

Biasing the exponent

So that the exponent field can represent both positive and negative exponents,we add a bias to the actual exponent,to ensure that the stored exponent is always non-negative.

The actual (unbiased) exponent is needed for input and output only.

We allow for half the possible exponents to be positive, half to be non-positive.

Thus for an exponent of w − t − 1 bits,the bias is the largest integer representable with w − t − 2 bits,that is, 2w−t−2 − 1.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 29 / 1

Biasing the exponent

So that the exponent field can represent both positive and negative exponents,we add a bias to the actual exponent,to ensure that the stored exponent is always non-negative.

The actual (unbiased) exponent is needed for input and output only.

We allow for half the possible exponents to be positive, half to be non-positive.

Thus for an exponent of w − t − 1 bits,the bias is the largest integer representable with w − t − 2 bits,that is, 2w−t−2 − 1.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 29 / 1

Biasing the exponent

So that the exponent field can represent both positive and negative exponents,we add a bias to the actual exponent,to ensure that the stored exponent is always non-negative.

The actual (unbiased) exponent is needed for input and output only.

We allow for half the possible exponents to be positive, half to be non-positive.

Thus for an exponent of w − t − 1 bits,the bias is the largest integer representable with w − t − 2 bits,that is, 2w−t−2 − 1.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 29 / 1

Biasing the exponent

So that the exponent field can represent both positive and negative exponents,we add a bias to the actual exponent,to ensure that the stored exponent is always non-negative.

The actual (unbiased) exponent is needed for input and output only.

We allow for half the possible exponents to be positive, half to be non-positive.

Thus for an exponent of w − t − 1 bits,the bias is the largest integer representable with w − t − 2 bits,that is, 2w−t−2 − 1.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 29 / 1

Biasing the exponent

So that the exponent field can represent both positive and negative exponents,we add a bias to the actual exponent,to ensure that the stored exponent is always non-negative.

The actual (unbiased) exponent is needed for input and output only.

We allow for half the possible exponents to be positive, half to be non-positive.

Thus for an exponent of w − t − 1 bits,the bias is the largest integer representable with w − t − 2 bits,that is, 2w−t−2 − 1.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 29 / 1

Representable Numbers

A number that can be represented exactly in a floating point number system iscalled representable.

Because there is a finite number of these representable numbers,there are infinitely many numbers that cannot be represented exactly.

For example, the number x = 221.4922 = 0.2214922× 103 cannot be representedin the base-10 system with 4-digit mantissa and emin = −8, emax = 8 as x has a7-digit mantissa.To store this number the mantissa must be reduced to 4 digits.

The simplest way to do this is to chop off the last 3 digits and store the numberx = 0.2214× 103, which is representable.

The (more commonly used) alternative to chopping is to round the number up ordown to the nearest representable number, thus storing x as x = 0.2215× 103.

Representable Numbers

A number that can be represented exactly in a floating point number system iscalled representable.

Because there is a finite number of these representable numbers,there are infinitely many numbers that cannot be represented exactly.

For example, the number x = 221.4922 = 0.2214922× 103 cannot be representedin the base-10 system with 4-digit mantissa and emin = −8, emax = 8 as x has a7-digit mantissa.To store this number the mantissa must be reduced to 4 digits.

The simplest way to do this is to chop off the last 3 digits and store the numberx = 0.2214× 103, which is representable.

The (more commonly used) alternative to chopping is to round the number up ordown to the nearest representable number, thus storing x as x = 0.2215× 103.

Representable Numbers

A number that can be represented exactly in a floating point number system iscalled representable.

Because there is a finite number of these representable numbers,there are infinitely many numbers that cannot be represented exactly.

For example, the number x = 221.4922 = 0.2214922× 103 cannot be representedin the base-10 system with 4-digit mantissa and emin = −8, emax = 8 as x has a7-digit mantissa.To store this number the mantissa must be reduced to 4 digits.

The simplest way to do this is to chop off the last 3 digits and store the numberx = 0.2214× 103, which is representable.

The (more commonly used) alternative to chopping is to round the number up ordown to the nearest representable number, thus storing x as x = 0.2215× 103.

Representable Numbers

A number that can be represented exactly in a floating point number system iscalled representable.

Because there is a finite number of these representable numbers,there are infinitely many numbers that cannot be represented exactly.

For example, the number x = 221.4922 = 0.2214922× 103 cannot be representedin the base-10 system with 4-digit mantissa and emin = −8, emax = 8 as x has a7-digit mantissa.To store this number the mantissa must be reduced to 4 digits.

The simplest way to do this is to chop off the last 3 digits and store the numberx = 0.2214× 103, which is representable.

The (more commonly used) alternative to chopping is to round the number up ordown to the nearest representable number, thus storing x as x = 0.2215× 103.

Representable Numbers

A number that can be represented exactly in a floating point number system iscalled representable.

Because there is a finite number of these representable numbers,there are infinitely many numbers that cannot be represented exactly.

For example, the number x = 221.4922 = 0.2214922× 103 cannot be representedin the base-10 system with 4-digit mantissa and emin = −8, emax = 8 as x has a7-digit mantissa.To store this number the mantissa must be reduced to 4 digits.

The simplest way to do this is to chop off the last 3 digits and store the numberx = 0.2214× 103, which is representable.

The (more commonly used) alternative to chopping is to round the number up ordown to the nearest representable number, thus storing x as x = 0.2215× 103.

Representable Numbers

A number that can be represented exactly in a floating point number system iscalled representable.

Because there is a finite number of these representable numbers,there are infinitely many numbers that cannot be represented exactly.

For example, the number x = 221.4922 = 0.2214922× 103 cannot be representedin the base-10 system with 4-digit mantissa and emin = −8, emax = 8 as x has a7-digit mantissa.To store this number the mantissa must be reduced to 4 digits.

The simplest way to do this is to chop off the last 3 digits and store the numberx = 0.2214× 103, which is representable.

The (more commonly used) alternative to chopping is to round the number up ordown to the nearest representable number, thus storing x as x = 0.2215× 103.

The Range of FP NumbersThe magnitudes of the floating point numbers have the range

.min(m)× bemin < |x | < .max(m)× bemax .

Now, if x is normalised, then

.min(m) = .(10 . . . 0)b = b−1 and

.max(m) = .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t .

Hence the range of the floating point numbers is

bemin−1 ≤ |x | ≤ (1− b−t)bemax ,

or approximately (since 1− b−t ≈ 1)

bemin−1 < |x | < bemax .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 31 / 1

The Range of FP NumbersThe magnitudes of the floating point numbers have the range

.min(m)× bemin < |x | < .max(m)× bemax .

Now, if x is normalised, then

.min(m) = .(10 . . . 0)b = b−1 and

.max(m) = .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t .

Hence the range of the floating point numbers is

bemin−1 ≤ |x | ≤ (1− b−t)bemax ,

or approximately (since 1− b−t ≈ 1)

bemin−1 < |x | < bemax .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 31 / 1

The Range of FP NumbersThe magnitudes of the floating point numbers have the range

.min(m)× bemin < |x | < .max(m)× bemax .

Now, if x is normalised, then

.min(m) = .(10 . . . 0)b = b−1 and

.max(m) = .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t .

Hence the range of the floating point numbers is

bemin−1 ≤ |x | ≤ (1− b−t)bemax ,

or approximately (since 1− b−t ≈ 1)

bemin−1 < |x | < bemax .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 31 / 1

The Range of FP NumbersThe magnitudes of the floating point numbers have the range

.min(m)× bemin < |x | < .max(m)× bemax .

Now, if x is normalised, then

.min(m) = .(10 . . . 0)b = b−1 and

.max(m) = .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t .

Hence the range of the floating point numbers is

bemin−1 ≤ |x | ≤ (1− b−t)bemax ,

or approximately (since 1− b−t ≈ 1)

bemin−1 < |x | < bemax .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 31 / 1

The Range of FP NumbersThe magnitudes of the floating point numbers have the range

.min(m)× bemin < |x | < .max(m)× bemax .

Now, if x is normalised, then

.min(m) = .(10 . . . 0)b = b−1 and

.max(m) = .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t .

Hence the range of the floating point numbers is

bemin−1 ≤ |x | ≤ (1− b−t)bemax ,

or approximately (since 1− b−t ≈ 1)

bemin−1 < |x | < bemax .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 31 / 1

Range of FP Numbers: Underflow and Overflow

Of course, it is possible for the true result of an operation to be so large or sosmall that it falls outside the available floating-point range.

underflow: number too small

overflow: number too large

Underflow/overflow occurs, roughly speaking, whenthe result of an arithmetic operation is so small/largethat it cannot be stored in its intended destination formatwithout suffering a rounding error that is larger than usual.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 32 / 1

Range of FP Numbers: Underflow and Overflow

Of course, it is possible for the true result of an operation to be so large or sosmall that it falls outside the available floating-point range.

underflow: number too small

overflow: number too large

Underflow/overflow occurs, roughly speaking, whenthe result of an arithmetic operation is so small/largethat it cannot be stored in its intended destination formatwithout suffering a rounding error that is larger than usual.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 32 / 1

Range of FP Numbers: Underflow and Overflow

Of course, it is possible for the true result of an operation to be so large or sosmall that it falls outside the available floating-point range.

underflow: number too small

overflow: number too large

Underflow/overflow occurs, roughly speaking, whenthe result of an arithmetic operation is so small/largethat it cannot be stored in its intended destination formatwithout suffering a rounding error that is larger than usual.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 32 / 1

Range of FP Numbers: Underflow and Overflow

Of course, it is possible for the true result of an operation to be so large or sosmall that it falls outside the available floating-point range.

underflow: number too small

overflow: number too large

Underflow/overflow occurs, roughly speaking, whenthe result of an arithmetic operation is so small/largethat it cannot be stored in its intended destination formatwithout suffering a rounding error that is larger than usual.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 32 / 1

Range of FP Numbers: Underflow and Overflow

Of course, it is possible for the true result of an operation to be so large or sosmall that it falls outside the available floating-point range.

underflow: number too small

overflow: number too large

Underflow/overflow occurs, roughly speaking, whenthe result of an arithmetic operation is so small/largethat it cannot be stored in its intended destination formatwithout suffering a rounding error that is larger than usual.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 32 / 1

Range of FP Numbers: Underflow and Overflow

Of course, it is possible for the true result of an operation to be so large or sosmall that it falls outside the available floating-point range.

underflow: number too small

overflow: number too large

Underflow/overflow occurs, roughly speaking, whenthe result of an arithmetic operation is so small/largethat it cannot be stored in its intended destination formatwithout suffering a rounding error that is larger than usual.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 32 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

The Distribution of FP NumbersThe distribution of the floating point numbers in this range is not uniform.

To see this consider the simple base-10 floating point system with 3-digitsignificant, emin = 0 and emax = 4.

This has a range 10−1 < |x | < 104, i.e., 0.1 < |x | < 10000.

For a given exponent e the mantissa represents 10t − 10t−1 = 900 normalisednumbers uniformly distributed between.(10 . . . 0)b = b−1 = .1 and .((b − 1)(b − 1) . . . (b − 1))b = 1− b−t = .999.

Hence, for e = 0 we have the 900 numbers uniformly distributed from .100× 100

to .999× 100. The spacing between these numbers is 1/1000.

For e = 1 we have the 900 numbers uniformly distributed from .100× 101 to.999× 101, i.e., 1 to 10. The spacing between these numbers is 1/100.

For e = 2 we have the 900 numbers uniformly distributed from .100× 102 to.999× 102, i.e., 10 to 100. The spacing between these numbers is 1/10.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 33 / 1

FP Numbers are not uniformly spaced

We can see that, in general,when the exponent is small the spacing between floating point numbers is small;andwhen it is large the spacing is large.

0.1 1 10 100

-R

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 34 / 1

FP Numbers are not uniformly spaced

We can see that, in general,when the exponent is small the spacing between floating point numbers is small;andwhen it is large the spacing is large.

0.1 1 10 100

-R

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 34 / 1

FP Numbers are not uniformly spaced

We can see that, in general,when the exponent is small the spacing between floating point numbers is small;andwhen it is large the spacing is large.

0.1 1 10 100

-R

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 34 / 1

Machine Epsilon

The spacing between any pair of adjacent normalised numbers can be defined interms of the spacing about 1.0.

This spacing is called machine epsilon, εm defined as

1 εm = distance between 1.0 and the next higher floating point number, or

2 εm = the smallest number such that 1.0 + εm when chopped is distinct from1.0.

Using the first definition, we have

εm = .(00 . . . 1)b × b1 = b−t × b1 = b1−t .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 35 / 1

Machine Epsilon

The spacing between any pair of adjacent normalised numbers can be defined interms of the spacing about 1.0.

This spacing is called machine epsilon, εm defined as

1 εm = distance between 1.0 and the next higher floating point number, or

2 εm = the smallest number such that 1.0 + εm when chopped is distinct from1.0.

Using the first definition, we have

εm = .(00 . . . 1)b × b1 = b−t × b1 = b1−t .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 35 / 1

Machine Epsilon

The spacing between any pair of adjacent normalised numbers can be defined interms of the spacing about 1.0.

This spacing is called machine epsilon, εm defined as

1 εm = distance between 1.0 and the next higher floating point number, or

2 εm = the smallest number such that 1.0 + εm when chopped is distinct from1.0.

Using the first definition, we have

εm = .(00 . . . 1)b × b1 = b−t × b1 = b1−t .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 35 / 1

Machine Epsilon

The spacing between any pair of adjacent normalised numbers can be defined interms of the spacing about 1.0.

This spacing is called machine epsilon, εm defined as

1 εm = distance between 1.0 and the next higher floating point number, or

2 εm = the smallest number such that 1.0 + εm when chopped is distinct from1.0.

Using the first definition, we have

εm = .(00 . . . 1)b × b1 = b−t × b1 = b1−t .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 35 / 1

Machine Epsilon

The spacing between any pair of adjacent normalised numbers can be defined interms of the spacing about 1.0.

This spacing is called machine epsilon, εm defined as

1 εm = distance between 1.0 and the next higher floating point number, or

2 εm = the smallest number such that 1.0 + εm when chopped is distinct from1.0.

Using the first definition, we have

εm = .(00 . . . 1)b × b1 = b−t × b1 = b1−t .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 35 / 1

Spacing about a floating point number

and this can be used in the following lemma of [Higham, 2002, p. 41]:

Lemma (Floating Point Spacing)

The spacing between a normalised floating point number x and an adjacentnormalised number is at least b−1εm|x | and at most εm|x |, unless x or itsneighbour is 0.

This lemma clearly shows that the spacing between adjacent floating pointnumbers varies with the magnitude of x .

Thus the minimum spacing is bemin−t and the maximum spacing is bemax−t .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 36 / 1

Spacing about a floating point number

and this can be used in the following lemma of [Higham, 2002, p. 41]:

Lemma (Floating Point Spacing)

The spacing between a normalised floating point number x and an adjacentnormalised number is at least b−1εm|x | and at most εm|x |, unless x or itsneighbour is 0.

This lemma clearly shows that the spacing between adjacent floating pointnumbers varies with the magnitude of x .

Thus the minimum spacing is bemin−t and the maximum spacing is bemax−t .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 36 / 1

Spacing about a floating point number

and this can be used in the following lemma of [Higham, 2002, p. 41]:

Lemma (Floating Point Spacing)

The spacing between a normalised floating point number x and an adjacentnormalised number is at least b−1εm|x | and at most εm|x |, unless x or itsneighbour is 0.

This lemma clearly shows that the spacing between adjacent floating pointnumbers varies with the magnitude of x .

Thus the minimum spacing is bemin−t and the maximum spacing is bemax−t .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 36 / 1

Spacing about a floating point number

and this can be used in the following lemma of [Higham, 2002, p. 41]:

Lemma (Floating Point Spacing)

The spacing between a normalised floating point number x and an adjacentnormalised number is at least b−1εm|x | and at most εm|x |, unless x or itsneighbour is 0.

This lemma clearly shows that the spacing between adjacent floating pointnumbers varies with the magnitude of x .

Thus the minimum spacing is bemin−t and the maximum spacing is bemax−t .

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 36 / 1

IEEE Arithmetics

The Institute of Electrical and Electronic Engineers (IEEE) standard 745 waspublished in 1985.

It specifies how hardware and software combine to perform well-defined operationson standard floating point numbers.

Many processors have built-in floating point operations that include+,−,×, /,

√,

along with comparison and conversion operations, on one, two, or three “formats”of numbers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 37 / 1

IEEE Arithmetics

The Institute of Electrical and Electronic Engineers (IEEE) standard 745 waspublished in 1985.

It specifies how hardware and software combine to perform well-defined operationson standard floating point numbers.

Many processors have built-in floating point operations that include+,−,×, /,

√,

along with comparison and conversion operations, on one, two, or three “formats”of numbers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 37 / 1

IEEE Arithmetics

The Institute of Electrical and Electronic Engineers (IEEE) standard 745 waspublished in 1985.

It specifies how hardware and software combine to perform well-defined operationson standard floating point numbers.

Many processors have built-in floating point operations that include+,−,×, /,

√,

along with comparison and conversion operations, on one, two, or three “formats”of numbers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 37 / 1

IEEE Arithmetics

The Institute of Electrical and Electronic Engineers (IEEE) standard 745 waspublished in 1985.

It specifies how hardware and software combine to perform well-defined operationson standard floating point numbers.

Many processors have built-in floating point operations that include+,−,×, /,

√,

along with comparison and conversion operations, on one, two, or three “formats”of numbers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 37 / 1

IEEE Formats

Double 8-byteI 53 = 52 + 1 significant bits mantissa, 11-bit exponentI Default in Python, C/C++/Java double, Fortran REAL*8

Single 4-byteI 24 = 23 + 1 significant bits mantissa, 8-bit exponentI e.g., Python numpy.float32, C/C++/Java float, Fortran REAL*4,

Quad 16-byteI 113 significant bits mantissa, 15-bit exponentI The implementation varies. Python may have numpy.float128, C/C++

may have long double or not, depending on the platform.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 38 / 1

IEEE Formats

Double 8-byteI 53 = 52 + 1 significant bits mantissa, 11-bit exponentI Default in Python, C/C++/Java double, Fortran REAL*8

Single 4-byteI 24 = 23 + 1 significant bits mantissa, 8-bit exponentI e.g., Python numpy.float32, C/C++/Java float, Fortran REAL*4,

Quad 16-byteI 113 significant bits mantissa, 15-bit exponentI The implementation varies. Python may have numpy.float128, C/C++

may have long double or not, depending on the platform.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 38 / 1

IEEE Formats

Double 8-byteI 53 = 52 + 1 significant bits mantissa, 11-bit exponentI Default in Python, C/C++/Java double, Fortran REAL*8

Single 4-byteI 24 = 23 + 1 significant bits mantissa, 8-bit exponentI e.g., Python numpy.float32, C/C++/Java float, Fortran REAL*4,

Quad 16-byteI 113 significant bits mantissa, 15-bit exponentI The implementation varies. Python may have numpy.float128, C/C++

may have long double or not, depending on the platform.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 38 / 1

Let us illustrate this in Python:

1 x = 1.0y = numpy.float32(1.0)z = numpy.float128(1.0)print x, y, z

The focus will be on 8-byte floating-point numbers.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 39 / 1

Precisions and ranges for IEEE formats

Type Unit roundoff Smallest LargestSingle ur = 2−24 ≈ 5.96× 10−8 2−126 ≈ 1.18× 10−38 (2− 2−23)× 2127 ≈ 3.40× 1038

Double ur = 2−53 ≈ 1.11× 10−16 2−1022 ≈ 2.82× 10−308 (2− 2−52)× 21023 ≈ 1.41× 10308

Quad ur = 2−113 ≈ 9.63× 10−35 2−16382 ≈ 3.36× 10−4932 216384 − 216272 ≈ 1.19× 104932

Since the sign of floating point numbers is given by the leading bit,the range for negative numbers is just the negation of the above values.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 40 / 1

Ranges IEEE format can not represent

Although IEEE single precision floating point numbers are becoming less common(outside GPGPUs), let’s use them to demonstrate the limits of floating-pointnumbers. Single precision numbers cannot represent:

1 Negative numbers less than −(2− 2−23)× 2127 (negative overflow)

2 Negative numbers greater than −2−126 (negative underflow)

3 Zero (treated as a special case, all bits of both exponent and mantissa set to0: so you can have both +0 and −0)

4 Positive numbers less than 2−126 (positive underflow)

5 Positive numbers greater than (2− 2−23)× 2127 (positive overflow)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 41 / 1

Ranges IEEE format can not represent

Although IEEE single precision floating point numbers are becoming less common(outside GPGPUs), let’s use them to demonstrate the limits of floating-pointnumbers. Single precision numbers cannot represent:

1 Negative numbers less than −(2− 2−23)× 2127 (negative overflow)

2 Negative numbers greater than −2−126 (negative underflow)

3 Zero (treated as a special case, all bits of both exponent and mantissa set to0: so you can have both +0 and −0)

4 Positive numbers less than 2−126 (positive underflow)

5 Positive numbers greater than (2− 2−23)× 2127 (positive overflow)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 41 / 1

Ranges IEEE format can not represent

Although IEEE single precision floating point numbers are becoming less common(outside GPGPUs), let’s use them to demonstrate the limits of floating-pointnumbers. Single precision numbers cannot represent:

1 Negative numbers less than −(2− 2−23)× 2127 (negative overflow)

2 Negative numbers greater than −2−126 (negative underflow)

3 Zero (treated as a special case, all bits of both exponent and mantissa set to0: so you can have both +0 and −0)

4 Positive numbers less than 2−126 (positive underflow)

5 Positive numbers greater than (2− 2−23)× 2127 (positive overflow)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 41 / 1

Ranges IEEE format can not represent

Although IEEE single precision floating point numbers are becoming less common(outside GPGPUs), let’s use them to demonstrate the limits of floating-pointnumbers. Single precision numbers cannot represent:

1 Negative numbers less than −(2− 2−23)× 2127 (negative overflow)

2 Negative numbers greater than −2−126 (negative underflow)

3 Zero (treated as a special case, all bits of both exponent and mantissa set to0: so you can have both +0 and −0)

4 Positive numbers less than 2−126 (positive underflow)

5 Positive numbers greater than (2− 2−23)× 2127 (positive overflow)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 41 / 1

Ranges IEEE format can not represent

Although IEEE single precision floating point numbers are becoming less common(outside GPGPUs), let’s use them to demonstrate the limits of floating-pointnumbers. Single precision numbers cannot represent:

1 Negative numbers less than −(2− 2−23)× 2127 (negative overflow)

2 Negative numbers greater than −2−126 (negative underflow)

3 Zero (treated as a special case, all bits of both exponent and mantissa set to0: so you can have both +0 and −0)

4 Positive numbers less than 2−126 (positive underflow)

5 Positive numbers greater than (2− 2−23)× 2127 (positive overflow)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 41 / 1

Ranges IEEE format can not represent

Although IEEE single precision floating point numbers are becoming less common(outside GPGPUs), let’s use them to demonstrate the limits of floating-pointnumbers. Single precision numbers cannot represent:

1 Negative numbers less than −(2− 2−23)× 2127 (negative overflow)

2 Negative numbers greater than −2−126 (negative underflow)

3 Zero (treated as a special case, all bits of both exponent and mantissa set to0: so you can have both +0 and −0)

4 Positive numbers less than 2−126 (positive underflow)

5 Positive numbers greater than (2− 2−23)× 2127 (positive overflow)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 41 / 1

Ranges IEEE format can not represent

Although IEEE single precision floating point numbers are becoming less common(outside GPGPUs), let’s use them to demonstrate the limits of floating-pointnumbers. Single precision numbers cannot represent:

1 Negative numbers less than −(2− 2−23)× 2127 (negative overflow)

2 Negative numbers greater than −2−126 (negative underflow)

3 Zero (treated as a special case, all bits of both exponent and mantissa set to0: so you can have both +0 and −0)

4 Positive numbers less than 2−126 (positive underflow)

5 Positive numbers greater than (2− 2−23)× 2127 (positive overflow)

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 41 / 1

Biases in IEEE formats

Let us explain the reasoning. For IEEE single precision with 8-bit exponent, thebias is 127 = 28−1 − 1.

For IEEE double precision with 11-bit exponent the bias is 1023 = 211−1 − 1.

For example, in single precision,a stored value of 100 would mean a true exponent of 100− 127 = −27;a stored value of 150 means an exponent of 150− 127 = 23.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 42 / 1

Biases in IEEE formats

Let us explain the reasoning. For IEEE single precision with 8-bit exponent, thebias is 127 = 28−1 − 1.

For IEEE double precision with 11-bit exponent the bias is 1023 = 211−1 − 1.

For example, in single precision,a stored value of 100 would mean a true exponent of 100− 127 = −27;a stored value of 150 means an exponent of 150− 127 = 23.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 42 / 1

Biases in IEEE formats

Let us explain the reasoning. For IEEE single precision with 8-bit exponent, thebias is 127 = 28−1 − 1.

For IEEE double precision with 11-bit exponent the bias is 1023 = 211−1 − 1.

For example, in single precision,a stored value of 100 would mean a true exponent of 100− 127 = −27;a stored value of 150 means an exponent of 150− 127 = 23.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 42 / 1

Biases in IEEE formats

Let us explain the reasoning. For IEEE single precision with 8-bit exponent, thebias is 127 = 28−1 − 1.

For IEEE double precision with 11-bit exponent the bias is 1023 = 211−1 − 1.

For example, in single precision,a stored value of 100 would mean a true exponent of 100− 127 = −27;a stored value of 150 means an exponent of 150− 127 = 23.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 42 / 1

Hidden bit normalisation

The mantissa is composed of an implicit leading bit and the fraction bits.

As the only non-zero number in binary is 1,it follows that a normalised binary number must look like 1.d1d2 . . . dn.

Thus we do not need to store the leading 1;the 23 bits (or 52 in double precision) of the mantissa can all be used to store thepart of the number after the “decimal” point.

We call this hidden bit normalisation:thus single and double precision actually have 24 and 53 bits for the mantissa,respectively.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 43 / 1

Hidden bit normalisation

The mantissa is composed of an implicit leading bit and the fraction bits.

As the only non-zero number in binary is 1,it follows that a normalised binary number must look like 1.d1d2 . . . dn.

Thus we do not need to store the leading 1;the 23 bits (or 52 in double precision) of the mantissa can all be used to store thepart of the number after the “decimal” point.

We call this hidden bit normalisation:thus single and double precision actually have 24 and 53 bits for the mantissa,respectively.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 43 / 1

Hidden bit normalisation

The mantissa is composed of an implicit leading bit and the fraction bits.

As the only non-zero number in binary is 1,it follows that a normalised binary number must look like 1.d1d2 . . . dn.

Thus we do not need to store the leading 1;the 23 bits (or 52 in double precision) of the mantissa can all be used to store thepart of the number after the “decimal” point.

We call this hidden bit normalisation:thus single and double precision actually have 24 and 53 bits for the mantissa,respectively.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 43 / 1

Hidden bit normalisation

The mantissa is composed of an implicit leading bit and the fraction bits.

As the only non-zero number in binary is 1,it follows that a normalised binary number must look like 1.d1d2 . . . dn.

Thus we do not need to store the leading 1;the 23 bits (or 52 in double precision) of the mantissa can all be used to store thepart of the number after the “decimal” point.

We call this hidden bit normalisation:thus single and double precision actually have 24 and 53 bits for the mantissa,respectively.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 43 / 1

Hidden bit normalisation

The mantissa is composed of an implicit leading bit and the fraction bits.

As the only non-zero number in binary is 1,it follows that a normalised binary number must look like 1.d1d2 . . . dn.

Thus we do not need to store the leading 1;the 23 bits (or 52 in double precision) of the mantissa can all be used to store thepart of the number after the “decimal” point.

We call this hidden bit normalisation:thus single and double precision actually have 24 and 53 bits for the mantissa,respectively.

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 43 / 1

Exercise

Try to prove that√

2 is an irrational number, i.e., there exists no integers p, qsuch that the fraction p

q =√

2. How many different proofs can you find?

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 44 / 1

Jean-Michael Muller Example

Exercise

Implement the sequence x0 = 4, x1 = 4.25, xn = 108− (815− 1500/xn−2)/xn−1,which has been suggested by Muller et al. [2010], first using the defaultfloating-point numbers, and second using Decimal with increasing precision. Whatis the correct value?

Jakub Marecek and Sean McGarraghy (UCD) Numerical Analysis and Software September 23, 2015 45 / 1

Recommended