
Floating Point Issues and Alternatives


Published on 9th February 2010 07:22

Contributed by John Drew

Level: Intermediate

The following description applies to all platforms, although there is reference to the Proton Development System (PDS) when it is helpful to do so. Limitations in floating point arithmetic apply to ALL platforms; differences in the degree of accuracy are mostly a result of the number of bytes allocated for storage of the floating point number. In most microcontrollers, with their limited memory and speed, we may have up to 32 bit storage for floating point, while on a desktop there may be 64 bit storage, although this varies depending on the language. In PDS, floats are stored in 32 bits (4 bytes). Note: most modern desktop languages provide both single precision (32 bit) and double precision (64 bit) floating point types.

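Floating point numbers are the computer's way of approximating real numbers. These include irrational numbers such as pi as well as rational numbers such as 1/10 or 2/3, which have a numerator and a denominator (a ratio of two integers). Every stored floating point value is itself a rational number with a limited number of digits, so most real numbers can only be stored approximately. This page is about the storage and use of these approximations and the limitations of doing this.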

There are many ways to display a number using decimal notation. Floating point

numbers for humans use a sequence of numerals with a decimal point to indicate place

value. For many of us, the transition is shown with a “.” although some countries use a

“,”. Some examples of floating point numbers are 3.14 or 0.017 or 234.0696 and so on.

The position of the decimal point “floats” to inform the reader of the transition between

units and tenths.

Computers need to store numbers in a standard way so that they may be read by

different machines. To understand how this is done it is useful to show floating point

numbers alongside their scientific notation equivalent.

Floating point notation / Scientific notation

3.14 becomes 3.14 * 10^0 (note 10^0 equates to 1)

31.4 becomes 3.14 * 10^1

3140 becomes 3.14 * 10^3

0.0314 becomes 3.14 * 10^-2

Picking up clues from the scientific notation way of doing things it can be seen that it should be possible to store a number using just a series of numerals eg 314 (the significand or mantissa), then a further number eg -1 (the exponent including its sign) that tells where to place the decimal point to create the float of 31.4, since 314 * 10^-1 = 31.4. To cater for negative numbers we would also need to store whether the number is – or +.


Floating point notation / Possible computer storage

3.14 becomes +314 (significand) and -2 (exponent)

31.4 becomes +314 and -1

3140 becomes +314 and +1

0.0314 becomes +314 and -4
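The table above can be reproduced with a short sketch. It is written in Python rather than Proton BASIC purely so that it can be run on a desktop, and the helper name `split_decimal` is made up for this illustration. It splits a decimal string into a sign, a whole-number significand and a power-of-ten exponent, which is the decimal analogy only, not the binary format a compiler actually uses.

```python
from decimal import Decimal

def split_decimal(text):
    """Return (sign, significand, exponent) with value = significand * 10**exponent."""
    sign_bit, digits, exponent = Decimal(text).as_tuple()
    significand = int("".join(str(d) for d in digits))
    # Drop trailing zeros from the significand and bump the exponent to compensate,
    # so that 3140 is reported as 314 with exponent +1, as in the table.
    while significand != 0 and significand % 10 == 0:
        significand //= 10
        exponent += 1
    return ("-" if sign_bit else "+", significand, exponent)

for text in ("3.14", "31.4", "3140", "0.0314"):
    print(text, "becomes", split_decimal(text))
```

Each input gives the same significand, 314; only the exponent changes as the decimal point "floats".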

In the PDS help file, Les shows us how this is done in the system we use. Just 4 bytes

are used to store:

a) The sign (one bit of the 32 available)

b) The mantissa (or significand) without a decimal point (23 bits of the 32)

c) The exponent (8 bits that provide the information on where to put the decimal point)

For more detail read the Help file under Floating Point Numbers or refer to this

excellent reference in Wikipedia (http://en.wikipedia.org/wiki/Floating_point).
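To see the three fields on a desktop, the following Python sketch unpacks a 32 bit float into sign, exponent and mantissa. It assumes the IEEE 754 single precision layout described in the Wikipedia reference; the order in which PDS arranges the same fields in its 4 bytes may differ, so treat the Help file as the authority for Proton. The function name `float32_fields` is hypothetical.

```python
import struct

def float32_fields(value):
    """Split a value, stored as an IEEE 754 single, into (sign, exponent, mantissa)."""
    # Pack as a big-endian 32 bit float, then reread the same 4 bytes as an integer.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with an offset (bias) of 127
    mantissa = bits & 0x7FFFFF       # 23 bits, the leading 1 is implied
    return sign, exponent, mantissa

for number in (1.0, -2.0, 31.4):
    print(number, float32_fields(number))
```

Note that the exponent here is a power of two, not a power of ten, which is why a value such as 31.4 does not come out as a tidy 314.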

General comments

As you can see from above there are just 23 binary bits to store the significand. The maximum number that can be stored in 23 bits is 8388607 (2^23 - 1). Although this implies that accuracy may be as high as 7 significant digits this is not always so, as many numbers do not have an exact binary equivalent. Well known examples include 1/3 or 0.1 or pi. What looks simple for humans may not be so for the machine, for example the square of 0.1 should calculate as 0.01 but instead results in 0.009999999776 in a 4 byte system. If you test for equality to 0.01 the test would fail.
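The failed equality test can be checked on a desktop with the Python sketch below. The helper `f32` is an assumption of this example: it rounds a value to the nearest IEEE 754 single, which imitates a 4 byte float but will not necessarily match the last digit produced by the PDS library.

```python
import struct

def f32(x):
    """Round x to the nearest 32 bit float (IEEE 754 single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

tenth = f32(0.1)              # 0.1 has no exact binary form, so this is already rounded
square = f32(tenth * tenth)   # square it and store the result back in 4 bytes
hundredth = f32(0.01)         # what the constant 0.01 looks like in 4 bytes

print(square == hundredth)               # the equality test fails
print(abs(square - hundredth) < 1e-8)    # yet the two values are extremely close
```

The two values differ only in the last stored bit, which is enough to make a test for equality fail.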

In a 4 byte system, under most circumstances the result of a computation is rounded to 7 significant digits, so only 7 digit precision can be expected. Consider the following example:

123456.7 (This number is stored as accurately as possible in 4 bytes)

+ 101.7654 (so is this one)

123558.4654 (which is rounded off to 123558.5 and so loses the last 3 digits)
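The same sum can be tried in Python, as a sketch only. The `f32` helper is an assumption of this example and rounds each value to an IEEE 754 single, which is a reasonable stand-in for a 4 byte float but is not the PDS library itself.

```python
import struct

def f32(x):
    """Round x to the nearest 32 bit float (IEEE 754 single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

a = f32(123456.7)
b = f32(101.7654)
total = f32(a + b)            # the sum is stored back into 4 bytes

print(format(total, ".7g"))   # 7 significant digits: 123558.5
print(total - 123558.4654)    # the error hidden behind those 7 digits
```

The small addend contributes only its first few digits; everything beyond the seventh significant digit of the total is lost.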

Another problem is the difficulty in expressing some rational numbers as a float.

Consider 1/3 which you would expect as 0.333333333 (recurring). In real life, 32 bit

floats will tend to store a number closer to 0.333333310. This is because after the 7th

decimal place, we run out of bits to put the threes in, so the result is abruptly cut off,

leading to an incorrect answer.


Subtraction of close numbers can generate significant errors, similarly multiplication

and division calculations that lead to very large or very small results MAY be unusable.

Trigonometry functions of numbers that cannot be represented exactly such as pi will

not be accurate eg sine (pi) which should equal 0 will compute as a small negative

number. Tan (pi/2) will compute in single precision C as -22877332.0 instead of

infinity.
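The trigonometry results can be reproduced approximately with the following Python sketch. It makes one simplifying assumption: only the input angle is rounded to 32 bits (using the hypothetical helper `f32`), while the sine and tangent themselves are evaluated by Python's double precision `math` functions. A true single precision library may differ in the last digits.

```python
import math
import struct

def f32(x):
    """Round x to the nearest 32 bit float (IEEE 754 single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

pi32 = f32(math.pi)              # the closest 4 byte value to pi
half_pi32 = f32(math.pi / 2)

print(pi32 > math.pi)            # True: the stored value is slightly larger than pi
print(math.sin(pi32))            # a small negative number instead of 0
print(math.tan(half_pi32))       # a large negative number instead of infinity
```

Because the stored angle overshoots pi by a tiny amount, the sine lands just below zero and the tangent just past its asymptote.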

Even rules that we take to be a fundamental truth such as (a + b) + c = a + (b + c) may not be true in floating point practice because of the rounding that occurs and the need to represent numbers in a practical number of bytes.
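A short Python sketch shows the grouping problem. The values are chosen for illustration and `add32` is a made-up helper that adds two numbers and rounds the result to an IEEE 754 single, as a stand-in for a 4 byte float.

```python
import struct

def f32(x):
    """Round x to the nearest 32 bit float (IEEE 754 single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add32(x, y):
    """Add two values and store the result in 4 bytes."""
    return f32(f32(x) + f32(y))

a, b, c = 100000000.0, -100000000.0, 1.0

left = add32(add32(a, b), c)    # (a + b) + c
right = add32(a, add32(b, c))   # a + (b + c)

print(left, right)              # 1.0 and 0.0
```

In the second grouping the 1.0 is too small to survive being added to a number of magnitude 100000000, so it vanishes before the large values cancel.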

So what can we do about these inaccuracies?

Whenever possible use integer arithmetic. This is 100% accurate providing you work within the limits of the type. Choose an appropriate integer variable type that will accommodate the range you want to cover. Eg a Byte for values from 0 to 255, Word 0 to 65535, signed Dword from -2147483648 to +2147483647 or unsigned Dword 0 to 4294967295.
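What happens outside those limits can be modelled in Python, whose own integers never overflow. The two helpers below are hypothetical and assume ordinary two's complement wrap-around for 8, 16 and 32 bit variables; check the behaviour of your own compiler before relying on it.

```python
def wrap_unsigned(value, bits):
    """Keep only the lowest `bits` bits, as an unsigned variable of that size would."""
    return value & ((1 << bits) - 1)

def wrap_signed(value, bits):
    """Interpret the lowest `bits` bits as a two's complement signed number."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value

print(wrap_unsigned(255 + 1, 8))          # Byte: 0
print(wrap_unsigned(65535 + 1, 16))       # Word: 0
print(wrap_unsigned(2 - 3, 8))            # Byte: 255, not -1
print(wrap_signed(2147483647 + 1, 32))    # signed Dword: -2147483648
```

Integer arithmetic is exact only while every intermediate result stays inside the range of the variable that holds it.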

For example if you are sending data (with two numerals after the decimal point) over a

serial link you might choose to first multiply each number by 100, convert it to an

integer and send it. At the receiving end you may choose to manipulate the data as a

Word type in the PIC® and then convert it to a float for display by dividing by 100 to

print the result in its original 2 decimal places form.

Print At 1, 1, DEC2 Result

There are examples at the end of this document.

Alternatively, imagine the number to be sent was 28.32, firstly multiply it by 100. It

becomes 2832 when assigned to a Word variable. It is sent over the serial link as an

integer and then may be modified in the PIC®, perhaps it is averaged with a group of

readings. If the result after integer arithmetic was 3851 you could send this to a display

like this without ever converting it to a float:

Code: Dim PrintVar As Word

Dim PrintVar1 As Word

PrintVar = 3851 / 100 ' PrintVar now has a value of 38

PrintVar1 = 3851 // 100 ' PrintVar1 contains the modulus value of 51

Print At 1, 1, DEC PrintVar, ".", DEC2 PrintVar1 ' The display reads 38.51
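The same split into whole and fractional parts can be sketched in Python. The function name `format_scaled` is invented for this example, and it assumes at least one decimal place. It pads the fraction with leading zeros, which matters whenever the remainder is small: 3805 must print as 38.05, not 38.5.

```python
def format_scaled(value, decimals):
    """Format an integer that holds a number scaled by 10**decimals (decimals >= 1)."""
    scale = 10 ** decimals
    sign = "-" if value < 0 else ""
    whole, fraction = divmod(abs(value), scale)   # integer division and modulus
    return f"{sign}{whole}.{fraction:0{decimals}d}"

print(format_scaled(3851, 2))   # 38.51
print(format_scaled(3805, 2))   # 38.05
print(format_scaled(-5, 2))     # -0.05
```

No float is created at any point; the decimal point exists only in the printed text.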

Things to remember when using integer math:

1. Remember where your decimal place is!

2. This is primary school mathematics – do the sum on paper the way you know,

and then try to convert that to BASIC.

3. Watch the implied precision. For example 13.3 / 8 = 1.6625, but the extra digits are meaningless as the input numbers are only given to 3 and 1 significant figures.

4. When adding and subtracting in integer math, multiply both values by the same

amount so there is no truncation or rounding being performed, then add or

subtract as normal.


5. When multiplying in integer math, the output precision is equal to the sum of the two inputs' precisions. Scale each number by 10^Precision (its own number of decimal places) before executing the multiplication.

6. When dividing in integer math, the output precision is equal to the difference between the two inputs' precisions.

7. Make sure you keep track of what is positive and what is negative. By default,

DWords are signed, but this can be disabled using the code: Declare

UNSIGNED_DWORDS = On

8. You may be reading a temperature sensor. Do all your arithmetic in integer

types. Leave the conversion to a float until the last moment or never convert it,

just use the strategy above to print it to the display.

9. Remember that a Byte rolls over to 0 when you exceed 255, a Word rolls over to 0 when you exceed 65535, and an unsigned Dword rolls over to 0 when you exceed 4294967295. A signed Dword that exceeds 2147483647 wraps around to a negative number. With unsigned integer subtraction the results are accurate unless the number you are subtracting is larger than the one you started with. For example a Byte of value 2 which has 3 subtracted from it will give 255 not -1. And so on.

10. Keep operations one to a line so that Z = Sin A * Sin B appears as

Code:

X = Sin A

Y = Sin B

Z = X * Y

11. Test for shortcuts. For example the sine of an angle approximates the angle (in

radians) for small values.

Code:

If A <= 0.1 Then

X = A

Else

X = Sin A

EndIf

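12. Float tests may be inaccurate, so avoid testing for X = Y. Use X <= Y or X >= Y, or test for a small gap instead. For example, checking that the absolute value of Y - X is no more than 0.000001 is a reasonable test for equality. Testing Y - X <= 0.000001 on its own is not enough, because it also passes whenever Y is smaller than X.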

13. Understand that if you use floats the accuracy may be poor. You should

especially check for values at the limits where one number is large and the other

small. Make use of ISIS. Even the demo version is very useful.

14. Never use floating point maths for financial calculations. Always use integer

arithmetic.

15. Significant figures should normally be taken into account when doing

calculations. If data1 is accurate to 5 significant figures and you multiply it by

data2 that is accurate to just 2 significant figures, then the result is only valid to

2 significant figures. For example 2.3 * 18.234 shows as 41.9382 on a calculator

but should only be printed to the maximum of a rounded 2 significant figures,

that is, 42 (rounded).

16. If there is a possibility of divide by zero, check the divisor first using something like this: If X <= 0.000001 then Result = 999999.9, or whatever is acceptable to your program. If X can be negative, test its absolute value instead.


17. Don’t assume rounding works in a particular way. There are two common schemes of rounding, the major difference being for negative numbers. In Proton a float is rounded to the nearest integer. Eg 144.3 becomes 144, 0.6 becomes 1, 1.1 becomes 1, -0.6 becomes -1, -0.3 becomes 0, -2.3 becomes -2.

18. If you need the fractional part of a floating point number in Proton turn off

rounding, assign the value of the float to an integer large enough to

accommodate the likely range, and then subtract the integer from the float.

Code:

_FP_FLAGS = 0 ' Disable Rounding

WordVar = FloatVar

_FP_FLAGS = 64 ' Enable Rounding

Float_FractionalPart = FloatVar - WordVar
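Items 12 and 16 in the list above can be sketched as two small Python helpers. The names `nearly_equal` and `safe_divide`, the tolerance of 0.000001 and the fallback of 999999.9 are illustrative choices taken from the text, not fixed rules; pick values that suit the range of your own data.

```python
def nearly_equal(x, y, tolerance=0.000001):
    """Treat x and y as equal when they differ by no more than the tolerance."""
    # abs() makes the test symmetric, so it does not pass just because y < x.
    return abs(x - y) <= tolerance

def safe_divide(numerator, denominator, fallback=999999.9, tiny=0.000001):
    """Divide, but return a fallback when the divisor is too close to zero."""
    if abs(denominator) <= tiny:
        return fallback
    return numerator / denominator

print(0.1 * 0.1 == 0.01)               # False, even in double precision
print(nearly_equal(0.1 * 0.1, 0.01))   # True
print(nearly_equal(2.0, 1.0))          # False
print(safe_divide(1.0, 0.0))           # 999999.9
```

A fixed tolerance only makes sense when the values being compared are of a known size; for very large or very small numbers it needs to be scaled accordingly.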

Examples: Contributed by Wastrix

SUBTRACTION (SIMILAR FOR ADDITION): Take 87.9482135 from 112.1987345

With integer math:

Code: Dim DWord1 As DWord

Dim DWord2 As DWord

Dim Result As DWord

Dim Before As Byte

Dim After As DWord

DWord1 = 1121987345 ' Multiply both numbers by 10^7

DWord2 = 879482135

Result = DWord1 - DWord2

Before = Result / 10000000 ' Divide by 10^7 again

After = Result // 10000000

Print Dec Before, ".", DEC7 After

' Result is 24.250521 (correct)

With floating point:

Code: Dim Float1 As Float

Dim Float2 As Float

Dim ResultF As Float

Float1 = 112.1987345

Float2 = 87.9482135

ResultF = Float1 - Float2

Print $FE, $C0, DEC7 ResultF

' Result is 24.250564 (incorrect)

DIVISION: Divide 1 by 3

With integer math:

Code: Dim DWord1 As DWord

Dim DWord2 As DWord

Dim Result As DWord

Dim Before As DWord

Dim After As DWord


DWord1 = 1000000000 ' Set variables to correct initial values

DWord2 = 3

Result = DWord1 / DWord2 ' Do first operation (1/3)

Before = Result / 1000000000 ' Get numbers before decimal

After = Result // 1000000000 ' Get numbers after decimal place

Print Dec Before, ".", DEC9 After

' Result is 0.333333333

With floating point:

Code: Dim Float1 As Float

Dim Float2 As Float

Dim ResultF As Float

Float1 = 1

Float2 = 3

ResultF = Float1 / Float2

Print $FE, $C0, DEC8 ResultF

End

' Result is 0.333333310

MULTIPLICATION: Multiply $89.45 by 12.4

With integer math:

Code: Dim WordOne As Word ' We only need word size as we

Dim WordTwo As Word ' are working with small numbers

Dim Result As DWord

Dim Before As Word ' Likewise here...

Dim After As Word ' The remainder can be as large as 999

WordOne = 8945 ' Set the values

WordTwo = 124

Result = WordOne * WordTwo

Before = Result / 1000 ' 10^(2+1), because we multiplied

After = Result // 1000 ' the inputs by 10^2 and 10^1

Print Dec Before, ".", DEC3 After ' DEC3 keeps leading zeros in the fraction

' Result is 1109.180 (correct)

With floating point:

Code: Dim Float1 As Float

Dim Float2 As Float

Dim ResultF As Float

Float1 = 89.45

Float2 = 12.4

ResultF = Float1 * Float2

Print $FE, $C0, DEC2 ResultF

' Result is 1109.17 (incorrect)

ALL OF THE ABOVE: Convert 34.5189 degrees Celsius to Fahrenheit

With integer math:

Code: Dim Celsius As DWord

Dim Fahrenheit As DWord

Dim Before As DWord

Dim After As DWord

Celsius = 34518900 ' Multiplied by 10^6


Fahrenheit = Celsius * 9

Fahrenheit = Fahrenheit / 5

Fahrenheit = Fahrenheit + 32000000 ' Multiplied by 10^6

Before = Fahrenheit / 1000000 ' Divide by 10^6

After = Fahrenheit // 1000000

Print Dec Before, ".", DEC6 After ' Display to 6dp. Notice the 6?

' Result is 94.134020 (correct)

With floating point:

Code: Dim Celsius As Float

Dim Fahrenheit As Float

Celsius = 34.5189

Fahrenheit = Celsius * 9 / 5

Fahrenheit = Fahrenheit + 32

Print $FE, $C0, DEC6 Fahrenheit

' Result is 94.134017 (incorrect)
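For comparison, here is the same conversion as a desktop sketch in Python. The names `c_to_f_scaled`, `c_to_f_float32` and `f32` are invented for this example, and the float path assumes IEEE 754 single precision with rounding after every operation, so its last digits need not match the figure PDS prints.

```python
import struct

def f32(x):
    """Round x to the nearest 32 bit float (IEEE 754 single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def c_to_f_scaled(celsius_scaled, scale=1_000_000):
    """Convert a Celsius reading held as an integer scaled by `scale`."""
    # Multiply before dividing so no digits are thrown away early. On a
    # microcontroller the intermediate product must still fit in the variable.
    return celsius_scaled * 9 // 5 + 32 * scale

def c_to_f_float32(celsius):
    """The same conversion with every step stored in a 4 byte float."""
    step = f32(f32(celsius) * 9)
    step = f32(step / 5)
    return f32(step + 32)

scaled = c_to_f_scaled(34518900)
print(f"{scaled // 1_000_000}.{scaled % 1_000_000:06d}")   # 94.134020
print(f"{c_to_f_float32(34.5189):.6f}")                    # close, but only to about 7 digits
```

The integer version is exact to all six decimal places, while the float version can only be trusted to roughly seven significant digits in total.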

These inaccuracies may seem small, but added together they can create large errors. This could cause significant issues with small, sensitive data or with financial information.