Floating point issues and alternatives
Published on 9th February 2010 07:22
Contributed by John Drew
Level: Intermediate
The following description applies to all platforms although there is reference to Proton
Development System when it is helpful to do so. Limitations in floating point arithmetic
apply to ALL platforms; differences in the degree of accuracy are mostly a result of the
number of bytes allocated for storage of the floating point number. In most
microcontrollers with their limited memory and speed we may have up to 32 bit storage
for floating point while in a Desktop there may be 64 bit storage, although this varies
depending on the language. In PDS, floats are stored in 32 bits (4 bytes). Note: most
modern desktop languages offer both single precision (32 bit) and double precision
(64 bit) floating point types.
Floating point numbers are the computer's approximation of real numbers. Fractions such
as 1/10 or 2/3, with a numerator and a denominator, are rational numbers (a ratio of two
integers), while numbers such as pi are irrational. This page is about the storage and
use of such numbers as floats and the limitations of doing this.
There are many ways to display a number using decimal notation. Floating point
numbers for humans use a sequence of numerals with a decimal point to indicate place
value. For many of us, the transition is shown with a “.” although some countries use a
“,”. Some examples of floating point numbers are 3.14 or 0.017 or 234.0696 and so on.
The position of the decimal point “floats” to inform the reader of the transition between
units and tenths.
Computers need to store numbers in a standard way so that they may be read by
different machines. To understand how this is done it is useful to show floating point
numbers alongside their scientific notation equivalent.
Floating point notation / Scientific notation
3.14 becomes 3.14 * 10^0 (note 10^0 equates to 1)
31.4 becomes 3.14 * 10^1
3140 becomes 3.14 * 10^3
0.0314 becomes 3.14 * 10^-2
Picking up clues from the scientific notation way of doing things it can be seen that it
should be possible to store a number using just a series of numerals eg 314 (the
significand or mantissa), then a further number eg -1 (the exponent including its sign)
that tells where to place the decimal point to create the float of 31.4. To cater for
negative numbers we would also need to store whether the number is - or +.
Floating point notation / Possible computer storage
3.14 becomes +314 (significand) and -2 (exponent)
31.4 becomes +314 and -1
3140 becomes +314 and +1
0.0314 becomes +314 and -4
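The table above can be reproduced on a desktop with a short Python sketch. It is an illustration only, not how PDS stores floats (PDS works in binary, not decimal), and the helper name `split_decimal` is made up for this example.

```python
from decimal import Decimal

def split_decimal(text):
    """Return (sign, significand, exponent) for a decimal number given as text."""
    # normalize() strips trailing zeros so that 3140 is held as 314 * 10^1
    sign, digits, exponent = Decimal(text).normalize().as_tuple()
    significand = int("".join(str(d) for d in digits))
    return ("-" if sign else "+", significand, exponent)

for text in ("3.14", "31.4", "3140", "0.0314"):
    sign, significand, exponent = split_decimal(text)
    print(f"{text} becomes {sign}{significand} and {exponent:+d}")
```

Each printed line should match the corresponding row of the table, for example `31.4 becomes +314 and -1`.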
In the PDS help file, Les shows us how this is done in the system we use. Just 4 bytes
are used to store:
a) The sign (one bit of the 32 available)
b) The mantissa (or significand) without a decimal point ( 23 bits of the 32)
c) The exponent (8 bits that provide the information on where to put the decimal point)
For more detail read the Help file under Floating Point Numbers or refer to this
excellent reference in Wikipedia (http://en.wikipedia.org/wiki/Floating_point).
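To see the three fields on a desktop, the following Python sketch pulls them out of a 32 bit float. It assumes the IEEE 754 single precision layout (sign, then exponent, then mantissa); the byte arrangement PDS uses internally may differ, so treat this as an illustration of the idea rather than of the PDS format. The function name `float32_fields` is hypothetical.

```python
import struct

def float32_fields(x):
    """Split x, stored as an IEEE 754 single, into its three bit fields."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits, the leading 1 is implied
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(0.1)
# 0.1 is held as roughly 1.6 * 2^-4, so the unbiased exponent is -4
print(sign, exponent - 127, hex(mantissa))
```

Note that the exponent here is a power of 2, not a power of 10 as in the decimal tables above.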
General comments
As you can see from above there are just 23 binary bits to store the number. The
maximum number that can be stored in 23 bits is 8388607 (2^23 - 1). Although this implies
that accuracy may be as high as 7 significant digits this is not always so, as many
numbers do not have an exact binary equivalent. Well known examples include 1/3 or
0.1 or pi. What looks simple for humans may not be so for the machine, for example the
square of 0.1 should calculate as 0.01, but in a 4 byte system 0.01 itself is held as
roughly 0.009999999776 and the computed square need not land on exactly the same
value. If you test for equality to 0.01 the test may fail.
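The failed equality test can be tried on a desktop with the Python sketch below. It assumes IEEE 754 single precision with round to nearest, which is what `struct` provides; PDS may round differently, so the exact digits on a PIC could differ. `to_float32` is a helper written for this example.

```python
import struct

def to_float32(x):
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

tenth = to_float32(0.1)
square = to_float32(tenth * tenth)   # one multiply, rounded back to 4 bytes
hundredth = to_float32(0.01)         # the constant the test compares against

print(f"{square:.12f}")
print(f"{hundredth:.12f}")
print(square == hundredth)           # the equality test from the text
```

Under these assumptions the two printed values differ in their last stored bit, so the comparison reports that they are not equal.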
In a 4 byte system, under most circumstances the result of a computation is rounded to 7
digits.
With 4 byte floats, only 7 digit precision can be expected, so consider the following
example:
123456.7 (This number is stored as accurately as possible in 4 bytes)
+ 101.7654 (so is this one)
123558.4654 (which is rounded off to 123558.5 and so loses the last 3 digits)
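The same sum can be checked on a desktop. This Python sketch assumes IEEE 754 single precision (via `struct`) and uses a helper, `to_float32`, written for this example; it is not PDS code.

```python
import struct

def to_float32(x):
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = to_float32(123456.7)
b = to_float32(101.7654)
total = to_float32(a + b)        # single precision addition

print(f"{total:.7f}")            # what is actually held in 4 bytes
print(f"{total:.7g}")            # rounded to 7 significant digits
print(f"{total - 123558.4654:+.7f}")   # error against the exact sum
```

The digits after the first decimal place no longer agree with the exact sum 123558.4654, which is the loss described above.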
Another problem is the difficulty in expressing some rational numbers as a float.
Consider 1/3 which you would expect as 0.333333333 (recurring). In real life, 32 bit
floats will tend to store a number closer to 0.333333310. This is because after the 7th
decimal place, we run out of bits to put the threes in, so the result is abruptly cut off,
leading to an incorrect answer.
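The effect of cutting the bits off can be imitated on a desktop. The Python sketch below keeps 24 significant binary digits of 1/3 (23 stored bits plus the implied leading 1) in two ways: by truncating, and by rounding to nearest as IEEE 754 does. The figure quoted above is consistent with the truncated value; that is an inference from the number, not a statement about how PDS is implemented. Both helper names are made up for this example.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def truncate_to_24_bits(x):
    """Cut a positive value off after 24 significant binary digits."""
    fraction, exponent = math.frexp(x)   # x = fraction * 2**exponent
    kept = int(fraction * 2**24)         # discard the bits that do not fit
    return math.ldexp(kept, exponent - 24)

third = 1 / 3
truncated = truncate_to_24_bits(third)
rounded = to_float32(third)
print(f"{truncated:.9f}")   # cut off: slightly below 1/3
print(f"{rounded:.9f}")     # rounded: slightly above 1/3
```

Either way the stored value is wrong from about the 8th decimal place onward.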
Subtraction of close numbers can generate significant errors. Similarly, multiplication
and division calculations that lead to very large or very small results MAY be unusable.
Trigonometry functions of numbers that cannot be represented exactly, such as pi, will
not be accurate, eg sine (pi), which should equal 0, will compute as a small negative
number. Tan (pi/2) will compute in single precision C as -22877332.0 instead of
infinity.
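The trigonometry results can be reproduced on a desktop with the Python sketch below. It stores pi in 4 bytes (assuming IEEE 754 single precision) and then uses Python's double precision `math.sin` and `math.tan`, so the last digits will not match a pure single precision library exactly. `to_float32` is a helper written for this example.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

pi32 = to_float32(math.pi)       # the nearest 4 byte value to pi

print(pi32 > math.pi)            # the stored value is slightly too large
print(math.sin(pi32))            # small negative number instead of 0
print(math.tan(pi32 / 2))        # large negative number instead of infinity
```

The stored pi overshoots the true value by a tiny amount, which is enough to push the angle just past the point where sine changes sign and tangent jumps from very large positive to very large negative.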
Even rules that we take to be a fundamental truth such as (a + b)+c =a+(b+c) may not be
true in floating point practice because of the rounding that occurs and the need to
represent numbers in a practical number of bytes.
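A small desktop demonstration of this, in Python with every intermediate result rounded back to 4 bytes (IEEE 754 single precision assumed). The values are chosen for illustration and `to_float32` is a helper written for this example.

```python
import struct

def to_float32(x):
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = 16777216.0   # 2^24, the point where single precision stops counting in ones
b = 1.0
c = 1.0

left = to_float32(to_float32(a + b) + c)    # (a + b) + c
right = to_float32(a + to_float32(b + c))   # a + (b + c)

print(left)    # 16777216.0, each added 1 is lost to rounding
print(right)   # 16777218.0, the 2 is large enough to register
```

Adding the small numbers together first lets them survive the rounding, which is why the order of operations matters.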
So what can we do about these inaccuracies?
Whenever possible use integer arithmetic. This is 100% accurate providing you work
within the limits of the type. Choose an appropriate integer variable type that will
accommodate the range you want to cover. Eg a Byte for values from 0 to 255, Word 0
to 65535, signed Dword from -2147483648 to +2147483647 or unsigned Dword to
4294967295.
For example if you are sending data (with two numerals after the decimal point) over a
serial link you might choose to first multiply each number by 100, convert it to an
integer and send it. At the receiving end you may choose to manipulate the data as a
Word type in the PIC® and then convert it to a float for display by dividing by 100 to
print the result in its original 2 decimal places form.
Print At 1, 1, DEC2 Result
There are examples at the end of this document.
Alternatively, imagine the number to be sent was 28.32, firstly multiply it by 100. It
becomes 2832 when assigned to a Word variable. It is sent over the serial link as an
integer and then may be modified in the PIC®, perhaps it is averaged with a group of
readings. If the result after integer arithmetic was 3851 you could send this to a display
like this without ever converting it to a float:
Code:
Dim PrintVar As Word
Dim PrintVar1 As Word
PrintVar = 3851 / 100 ' PrintVar now has a value of 38
PrintVar1 = 3851 // 100 ' PrintVar1 contains the modulus value of 51
Print At 1, 1, DEC PrintVar, ".", DEC PrintVar1 ' The display reads 38.51
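The same scale, divide and modulus idea can be tried on a desktop. The Python sketch below is not PDS code; the function name `format_scaled` is made up for this example. It also pads the fractional part with zeros, which matters when the remainder is below 10 (3805 must print as 38.05, not 38.5).

```python
def format_scaled(value, decimals=2):
    """Format a non-negative integer that holds a reading * 10**decimals."""
    scale = 10 ** decimals
    whole, fraction = divmod(value, scale)   # integer divide and modulus
    return f"{whole}.{fraction:0{decimals}d}"

reading = round(28.32 * 100)    # scale once, at the sending end
print(reading)                  # 2832
print(format_scaled(3851))      # 38.51
print(format_scaled(3805))      # 38.05
```

`round` is used rather than plain truncation when scaling, because 28.32 * 100 computed in floating point may land a hair either side of 2832.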
Things to remember when using integer math:
1. Remember where your decimal place is!
2. This is primary school mathematics – do the sum on paper the way you know,
and then try to convert that to BASIC.
3. Watch the implied precision. For example 13.3 / 8 = 1.6625, but the extra digits
are meaningless as the input numbers are only to 3 and 1 significant figures.
4. When adding and subtracting in integer math, multiply both values by the same
amount so there is no truncation or rounding being performed, then add or
subtract as normal.
5. When multiplying in integer math, the output precision is equal to the sum of the
two inputs' precisions. Multiply both numbers by 10^Precision before executing the
multiplication.
6. When dividing in integer math, the output precision is equal to the difference
between the two inputs' precisions.
7. Make sure you keep track of what is positive and what is negative. By default,
DWords are signed, but this can be disabled using the code: Declare
UNSIGNED_DWORDS = On
8. You may be reading a temperature sensor. Do all your arithmetic in integer
types. Leave the conversion to a float until the last moment or never convert it,
just use the strategy above to print it to the display.
9. Remember that a Byte rolls over to 0 when you exceed 255, Words roll over to 0
when you exceed 65535, and unsigned Dwords roll over to zero when you exceed
4294967295 (a signed Dword wraps to a negative value once you exceed
2147483647). With integer subtraction the results are accurate unless the number
you are subtracting is larger than the one you started with. For example a Byte of
value 2 which has 3 subtracted from it will give 255 not -1. And so on.
10. Keep operations one to a line so that Z = Sin A * Sin B appears as
Code:
X = Sin A
Y = Sin B
Z = X * Y
11. Test for shortcuts. For example the sine of an angle approximates the angle (in
radians) for small values.
Code:
If A <= 0.1 Then
X = A
Else
X = Sin A
EndIf
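12. Float tests may be inaccurate so use X <= Y or X >= Y. Rather than test for X =
Y, test for a small gap: checking that the absolute difference between X and Y is no
more than 0.000001 is a reasonable test for equality. Testing only If Y - X <= 0.000001
would also pass whenever Y is much smaller than X.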
13. Understand that if you use floats the accuracy may be poor. You should
especially check for values at the limits where one number is large and the other
small. Make use of ISIS. Even the demo version is very useful.
14. Never use floating point maths for financial calculations. Always use integer
arithmetic.
15. Significant figures should normally be taken into account when doing
calculations. If data1 is accurate to 5 significant figures and you multiply it by
data2 that is accurate to just 2 significant figures, then the result is only valid to
2 significant figures. For example 2.3 * 18.234 shows as 41.9382 on a calculator
but should only be printed to the maximum of a rounded 2 significant figures,
that is, 42 (rounded).
16. If there is a possibility of divide by zero, check first using something like this:
If X <= 0.000001 then Result = 999999.9 or whatever is acceptable to your program.
If X can be negative, test its absolute value instead.
17. Don't assume rounding works in a particular way. There are two common
schemes of rounding, the major difference being for negative numbers. In Proton
a float is rounded to the nearest integer. Eg 144.3 -> 144, 0.6 -> 1, 1.1 -> 1,
-0.6 -> -1, -0.3 -> 0, -2.3 -> -2.
18. If you need the fractional part of a floating point number in Proton turn off
rounding, assign the value of the float to an integer large enough to
accommodate the likely range, and then subtract the integer from the float.
Code:
_FP_FLAGS = 0 ' Disable Rounding
WordVar = FloatVar
_FP_FLAGS = 64 ' Enable Rounding
Float_FractionalPart = FloatVar - WordVar
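Two of the points in the list above (the Byte rollover in point 9 and the small gap test in point 12) can be sketched on a desktop in Python. This is an illustration only; the function names are made up for this example, and the Byte behaviour is imitated by masking to 8 bits.

```python
def nearly_equal(x, y, tolerance=0.000001):
    """Treat x and y as equal when they differ by no more than tolerance."""
    # abs() makes the test symmetric, so the order of x and y does not matter
    return abs(y - x) <= tolerance

def byte_subtract(a, b):
    """Subtract as an unsigned 8 bit Byte would, wrapping below zero."""
    return (a - b) & 0xFF

print(0.1 + 0.2 == 0.3)              # False, even in 64 bit floats
print(nearly_equal(0.1 + 0.2, 0.3))  # True
print(nearly_equal(0.3, 0.31))       # False, the gap is too large
print(byte_subtract(2, 3))           # 255, not -1
```

The tolerance should be chosen to suit the size of the numbers being compared; a fixed 0.000001 is only sensible for values of roughly ordinary magnitude.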
Examples: Contributed by Wastrix
SUBTRACTION (SIMILAR FOR ADDITION): Take 87.9482135 from 112.1987345
With integer math:
Code:
Dim DWord1 As DWord
Dim DWord2 As DWord
Dim Result As DWord
Dim Before As Byte
Dim After As DWord
DWord1 = 1121987345 ' Multiply both numbers by 10^7
DWord2 = 879482135
Result = DWord1 - DWord2
Before = Result / 10000000 ' Divide by 10^7 again
After = Result // 10000000
Print Dec Before, ".", DEC7 After
' Result is 24.250521 (correct)
With floating point:
Code:
Dim Float1 As Float
Dim Float2 As Float
Dim ResultF As Float
Float1 = 112.1987345
Float2 = 87.9482135
ResultF = Float1 - Float2
Print $FE, $C0, DEC7 ResultF
' Result is 24.250564 (incorrect)
DIVISION: Divide 1 by 3
With integer math:
Code:
Dim DWord1 As DWord
Dim DWord2 As DWord
Dim Result As DWord
Dim Before As DWord
Dim After As DWord
DWord1 = 1000000000 ' Set variables to correct initial values
DWord2 = 3
Result = DWord1 / DWord2 ' Do first operation (1/3)
Before = Result / 1000000000 ' Get numbers before decimal
After = Result // 1000000000 ' Get numbers after decimal place
Print Dec Before, ".", DEC9 After
' Result is 0.333333333
With floating point:
Code:
Dim Float1 As Float
Dim Float2 As Float
Dim ResultF As Float
Float1 = 1
Float2 = 3
ResultF = Float1 / Float2
Print $FE, $C0, DEC8 ResultF
End
' Result is 0.333333310
MULTIPLICATION: Multiply $89.45 by 12.4
With integer math:
Code:
Dim WordOne As Word ' We only need word size as we
Dim WordTwo As Word ' are working with small numbers
Dim Result As DWord
Dim Before As Word ' Likewise here...
Dim After As Word ' Remainder can reach 999, too big for a Byte
WordOne = 8945 ' Set the values (89.45 * 10^2)
WordTwo = 124 ' (12.4 * 10^1)
Result = WordOne * WordTwo
Before = Result / 1000 ' 10^(2+1), because we multiplied
After = Result // 1000 ' the inputs by 10^2 and 10^1
Print Dec Before, ".", DEC3 After ' 3 digits after the point, as 2 + 1 = 3
' Result is 1109.180 (correct)
With floating point:
Code:
Dim Float1 As Float
Dim Float2 As Float
Dim ResultF As Float
Float1 = 89.45
Float2 = 12.4
ResultF = Float1 * Float2
Print $FE, $C0, DEC2 ResultF
' Result is 1109.17 (incorrect)
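The rule that the output precision is the sum of the two input precisions can be checked on a desktop. This Python sketch is an illustration, not PDS code, and `multiply_scaled` is a hypothetical helper name. Each input is passed as a scaled integer together with the number of decimal places it carries.

```python
def multiply_scaled(a, a_decimals, b, b_decimals):
    """Multiply two scaled integers and format the product as a decimal string."""
    product = a * b
    decimals = a_decimals + b_decimals       # precisions add on multiply
    whole, fraction = divmod(product, 10 ** decimals)
    return f"{whole}.{fraction:0{decimals}d}"

# 89.45 is held as 8945 (2 places), 12.4 as 124 (1 place)
print(multiply_scaled(8945, 2, 124, 1))      # 1109.180
# 1.05 * 0.5, to show the leading zero handling
print(multiply_scaled(105, 2, 5, 1))         # 0.525
```

Python integers do not overflow, so on a PIC you would still need to check that the product fits the variable type you chose.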
ALL OF THE ABOVE: Convert 34.5189 degrees Celsius to Fahrenheit
With integer math:
Code:
Dim Celsius As DWord
Dim Fahrenheit As DWord
Dim Before As DWord
Dim After As DWord
Celsius = 34518900 ' Multiplied by 10^6
Fahrenheit = Celsius * 9
Fahrenheit = Fahrenheit / 5
Fahrenheit = Fahrenheit + 32000000 ' Multiplied by 10^6
Before = Fahrenheit / 1000000 ' Divide by 10^6
After = Fahrenheit // 1000000
Print Dec Before, ".", DEC6 After ' Display to 6dp. Notice the 6?
' Result is 94.134020 (correct)
With floating point:
Code:
Dim Celsius As Float
Dim Fahrenheit As Float
Celsius = 34.5189
Fahrenheit = Celsius * 9 / 5
Fahrenheit = Fahrenheit + 32
Print $FE, $C0, DEC6 Fahrenheit
' Result is 94.134017 (incorrect)
These inaccuracies may seem small, but added together they can create large errors.
This could cause significant issues with small, sensitive data or with financial
information.
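How quickly small errors add up can be seen on a desktop by adding 0.1 ten thousand times. The Python sketch below rounds every intermediate sum back to 4 bytes (IEEE 754 single precision assumed, so a PIC may give slightly different digits) and compares it with counting in integer tenths. `to_float32` is a helper written for this example.

```python
import struct

def to_float32(x):
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

STEPS = 10000

# Floating point: add 0.1 each time, rounding to single precision
step = to_float32(0.1)
float_total = 0.0
for _ in range(STEPS):
    float_total = to_float32(float_total + step)

# Integer: count in tenths, place the decimal point only when printing
tenths_total = 0
for _ in range(STEPS):
    tenths_total += 1

print(float_total)                                   # noticeably short of 1000
print(f"{tenths_total // 10}.{tenths_total % 10}")   # 1000.0
```

The float total drifts away from 1000 by far more than a single rounding error, while the integer count is exact.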