A Methodology for Producing Identical Double Precision
Floating-Point Results
Frederick (Eric) McIntosh
(Honorary Group Member AB/ABP CERN)
A Quote
“Guaranteeing that two runs of a program will produce exactly the same results is extremely difficult and may be impossible in practice.”
From “Numerical Replications of Computer Simulations: Some Pitfalls and How To Avoid Them”
Theodore C. Belding (January, 2000)
Obligatory Reading
“What Every Computer Scientist Should Know About Floating-Point Arithmetic”
David Goldberg March, 1991
“Most programs will actually produce different results on different systems”
“Why do we need a floating-point arithmetic standard?”
Prof. W. Kahan, Berkeley, 1981
Why Differences?
• Binary floating-point arithmetic is inexact; rounding “error” accumulates, and analysis is impractical or impossible
• Mathematically equivalent expressions will often give different results, e.g. (a*b)/c != a*(b/c)
• Results depend on the order of evaluation of an expression, which may suffer catastrophic cancellation
• Hardware today “always” conforms to IEEE 754, but the standard has loopholes and is incomplete
• Software standards are evolving but are incomplete with respect to floating-point
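The first three points can be reproduced in a few lines. What follows is only a minimal sketch in Python, whose float is an IEEE 754 double on common platforms; the function names are illustrative and not taken from the talk.

```python
def reciprocal_mismatches(limit=200):
    """Integers n < limit for which (a*b)/c != a*(b/c) with a = c = n, b = 1."""
    bad = []
    for n in range(1, limit):
        a, b, c = float(n), 1.0, float(n)
        if (a * b) / c != a * (b / c):    # mathematically both sides equal 1
            bad.append(n)
    return bad


def sum_orders():
    """The same three numbers added in two different orders."""
    left = (0.1 + 0.2) + 0.3
    right = 0.1 + (0.2 + 0.3)
    return left, right


def cancellation():
    """The 1.0 is rounded away when added to 1e16; subtracting exposes the loss."""
    lost = (1e16 + 1.0) - 1e16
    kept = (1e16 - 1e16) + 1.0
    return lost, kept


if __name__ == "__main__":
    print(reciprocal_mismatches())
    print(sum_orders())
    print(cancellation())
```

Each pair of expressions is mathematically identical; only the order of the rounded operations differs, which is why the methodology fixes that order explicitly.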
IEEE 754 (rev 2008)
• Defines a unique, reproducible result for +, -, *, /, and sqrt: the correctly rounded result, being the floating-point number closest to the exact result
• It is incomplete and open to interpretation
• Needs to be combined with the language standard
• Strict compliance conflicts with performance
• Does NOT cover Elementary Functions
…more
• 4 rounding modes (up, down, nearest, toward 0)
• We consider only “round to nearest” (_rn)
• Double Precision
• ~15 (and a bit) decimal digits
• Range from ~-10**308 to 10**308, but also NaNs and +/- Infinity
• Covers gradual underflow and exceptions
• ULP is the Unit in the Last (binary) Place: the spacing between two adjacent representable numbers of a given magnitude
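To make the ULP and the "15 and a bit" digits concrete, here is a small sketch in Python. The helper `ulp` is a hypothetical name written for this illustration and is valid only for normal, positive doubles.

```python
import math
import sys


def ulp(x):
    """Spacing between x and the next larger double, for normal positive x."""
    _, exponent = math.frexp(x)            # x = m * 2**exponent, 0.5 <= m < 1
    return math.ldexp(1.0, exponent - 53)  # a double carries 53 significant bits


if __name__ == "__main__":
    print(ulp(1.0))                        # spacing just above 1.0
    print(ulp(1.0) == sys.float_info.epsilon)
    print(53 * math.log10(2.0))            # decimal digits carried by 53 bits
    print(sys.float_info.max)              # largest finite double
    print(1.0 + ulp(1.0) / 2 == 1.0)       # half an ULP is rounded away
```

The ULP scales with the magnitude of the number, so "1 ULP" is a relative measure of about one part in 10**16.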
Floating-Point Arithmetic
• Sign, Exponent, Mantissa (total bits)
• Single Precision (SP): 1, 8, 23 (32)
• Double Precision (DP): 1, 11, 52 (64)
• Extended Precision (EP) A mongrel?: 1, 15, 64 (80)
• Quadruple Precision: 1, 15, 112 (128)
• Arbitrary Precision (Maple, MPFR, etc.)
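The 1, 11, 52 layout of a double can be inspected directly. This is a minimal Python sketch; `fields` is an illustrative name, not a library function.

```python
import struct


def fields(x):
    """Split a double into (sign, biased exponent, 52-bit mantissa field)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)      # 52 stored bits, leading 1 is implied
    return sign, exponent, mantissa


if __name__ == "__main__":
    for value in (1.0, -2.0, 1.5, 0.1):
        print(value, fields(value))
```

Two results are identical in the sense of this talk only if all 64 bits agree.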
Programming Standards
• (CERN) FORTRAN 66
• Fortran 77, 90, 95, 2003
• C/C++ C99
• Fortran 2003 (when available) and C99 represent an immense step forward
• BUT benefits are often long-term, and the extra time required to conform is NOT appreciated by management and is seen as a burden by the programmer
Case Study (SixTrack)
• Used to study LHC Beam Dynamics and determine the Dynamic Aperture with and without weak-strong beam-beam effects
• Around 60,000 lines of (almost) Fortran 77 code, ported from IBM to Cray, DEC, and PCs, and part of SPECfp 2000 ($3000.- for CERN!)
• It is a DIVERGENT application in that even a 1 ULP difference will grow exponentially with time, giving significantly different results at the onset of chaotic motion (c.f. non-linear dynamics, Lorenz, the “butterfly” effect)
• Particle motion is deterministic but there is no theory for predicting the onset of chaotic motion
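The divergence can be illustrated with a toy chaotic map. The sketch below is an assumption made purely for illustration: the map x -> 3x mod 1 and the names `track` and `divergence` are hypothetical and have nothing to do with the SixTrack physics. They only show how two starting values 1 ULP apart drift apart.

```python
def track(x, turns):
    """Iterate the toy map x -> 3x mod 1 and return the whole trajectory."""
    path = [x]
    for _ in range(turns):
        x = (3.0 * x) % 1.0
        path.append(x)
    return path


def divergence(x0=0.6, turns=60):
    """Track two starts 1 ULP apart and return |difference| after each turn."""
    one_ulp = 2.0 ** -53                   # spacing of doubles in [0.5, 1)
    a = track(x0, turns)
    b = track(x0 + one_ulp, turns)
    return [abs(p - q) for p, q in zip(a, b)]


if __name__ == "__main__":
    for turn, gap in enumerate(divergence()):
        if turn % 10 == 0:
            print(turn, gap)
```

The gap starts at one part in roughly 10**16 and reaches the size of the values themselves within a few dozen turns, so a tracking run of a million turns cannot tolerate even a single differing bit.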
SixTrack (continued)
• A typical LHC Study might require running from 10 to 100,000 cases/jobs using 60 particles, one or more tunes, 10 initial amplitudes and 5 or more initial angles in phase space and a set of magnet errors for one million turns
• Even on a modern PC this takes of the order of 10 hours for each case
• Until now studies were limited by the available computer capacity
My Idea (not original)
• Use ~10,000 Windows desktops at CERN (PlayStations!) to run SixTrack: the so-called CERN Physics Screen Saver (CPSS) project
• At least double the tracking capacity and potentially provide an order of magnitude increase for zero financial investment
• Later switched to BOINC (the Berkeley Open Infrastructure for Network Computing)
• This required the ability to verify results from almost any kind of PC (later Mac) running the various Operating Systems (Linux all brands, Windows all brands)
• BOINC provides “Homogeneity” and “Epsilon” or “Own Brand” result verification
The Methodology
Copyright F. McIntosh and CERN
• Restricted to double precision (64 bits) and 32-bit executables; ignores precise exception handling and underflow (for GPUs and SSE2/SSE3)
• It should apply equally to C99 standard compliant C (C++) programs, where it is more easily applicable
• Requires source code availability (no libraries)
• Will reduce performance
• Produces identical (0 ULP difference) results on the three principal Operating Systems and levels, using any of four different Fortran compilers at “any” optimisation level
Methodology
1) For Fortran 77 add parentheses to ensure unique order of expression evaluation
2) Disable Extended Precision as required by 3) and 4) anyway. Not available for SSE2/SSE3
3) Use crlibm from ENS Lyon for the elementary functions (plus arcsine, arccosine and arctan2)
4) Use David M. Gay routines for formatted input (and output) available since 1990
5) Implement exponentiation a**b as exp(b*log(a)) NINT
6) Verify no difference between compile time and execute time evaluation of constants
7) Disable Fused Multiply Add (FMA)
8) (Persuade library providers to provide a portable version, using this methodology)
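Steps 1, 4 and 5 can be sketched as follows. This is an illustration in Python rather than the Fortran of the talk, and all function names are hypothetical. Note the assumption: Python's `math.exp` and `math.log` come from the platform libm and are not correctly rounded, so the sketch shows only the structure of step 5, not the bit-for-bit guarantee that crlibm would add.

```python
import math


def dot3(x, y):
    """Step 1: parentheses pin down one order of evaluation."""
    return ((x[0] * y[0]) + (x[1] * y[1])) + (x[2] * y[2])


def power(a, b):
    """Step 5: a**b for a > 0, routed through exp and log only."""
    return math.exp(b * math.log(a))


def roundtrips(x):
    """Step 4: 17 significant digits written out read back as the same double,
    provided the conversions in both directions are correctly rounded."""
    return float(format(x, ".17g")) == x


if __name__ == "__main__":
    print(dot3((1.0, 2.0, 3.0), (4.0, 5.0, 6.0)))
    print(power(2.0, 0.5), math.sqrt(2.0))
    print(all(roundtrips(v) for v in (0.1, 1.0 / 3.0, 1e-300)))
```

Routing exponentiation through exp and log means that only those two functions need a correctly rounded implementation for a**b to be reproducible, at the price of a result that may differ from the compiler's own power operator.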
LHC@home
• After applying the methodology to SixTrack, the BOINC project LHC@home is providing the equivalent computing power of some 25,000 PCs (100,000 processors, available 50% of the time, every case run twice). Unprecedented capacity for ABP.
• No $25,000,000 to buy them, no software licences, no cooling or power to pay
• Undetected error rate is around 1 in 10,000 (down from one in a hundred thousand one in a million ).
What next?
• Fix error rate (LHC service is a priority).
• Publish for peer review (Berkeley and Lyon)
• Extend to 64-bit executables C/C++
• The GRID and the CLOUD
• Other application(s) at CERN such as Geant or an Experiment or Theory
• Applications such as Climate Prediction NASA, Molecular Dynamics, NAG library and Compiler
• Computer Games
• ARM processors, Tablets, Smart Phones and Raspberry Pi ($35.-, 800MHz, 512MB)
• WARNING: this application heavily uses CPU and device components. If you observe too high device temperature, please stop computing or decrease CPU usage.
Acknowledgements
• ABP Group for hosting me for 7 years and providing a couple of PCs and a software licence. BTE Desktop Support.
• M. Giovannozzi and F. Schmidt, the SixTrack author, for long-time help and support, and R. De Maria and L. Deniau.
• EPFL and Prof. Rifkin and I. Zacharov for BOINC support
• ENS Lyon and F. de Dinechin for crlibm
• H. Renshall (IT/CERN retired) for the parentheses
• G. Erskine (CERN deceased) for outstanding help on numerical analysis and the Error Function of a Complex Number
• A. Wagner (CERN) for CPSS and B. Segal (CERN retired) for suggesting BOINC
• J. Boothroyd, formerly English Electric, for introducing me to Floating-Point Error Analysis (50 years ago)
• Prof. Kahan, Berkeley, for his work on IEEE 754 and his many publications
• Prof. Anderson, Berkeley, for BOINC
• and (last but not least) the BOINC volunteers, 60,000 worldwide, for LHC@home
English Electric Deuce Computer 1960
Mercury Delay Lines
Computing at CERN
• Dominated by the needs of the experiments
• Accelerator design, a small fraction of the various mainframes (1964 – 1998) and the “PARC” IBM workstation cluster
• In 1997 the LHC Machine Advisory Committee recommended more tracking
• The “Numerical Accelerator Project”, NAP
Why Standards?
• Should be evident: safety, quality, convenience for example, and they apply in all walks of life including computing
• Program portability ensures fair competition for hardware and software acquisition
• No monopolies (c.f. a giant corporation)
• Should NOT be sacrificed to performance
Some simple test results
• ULP: One Unit in the Last Place of the mantissa of a floating-point number (one part in roughly 10**16)
– libm/crlibm IA32: 304 differences of 1ULP
– libm IA32/IA64: 5 differences of 1ULP
– libm IA32/AMD64: 7 differences of 1ULP
– libm IA64/AMD64: 2 differences of 1ULP
– libm/libm NO EP: 134623 differences of 1ULP
• NO differences with exp_rn
• 1,000,000 exp calls with random arguments (0,1)
…and with lf95
– lahey/crlibm IA32: 134645 differences of 1ULP
– lahey IA32/IA64: 7 differences of 1ULP
– lahey IA32/AMD64: 7 differences of 1ULP
– lahey IA64/AMD64: 4 differences of 1ULP
• NO differences with exp_rn
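A test of this kind can be sketched in Python. The sketch rests on stated assumptions: crlibm's exp_rn is not available from Python, so a 40-digit decimal evaluation rounded to a double stands in as the reference (it is almost always, but not provably, the correctly rounded value), and the function names are hypothetical. The counts it prints depend on the platform libm and are not the figures quoted above.

```python
import math
import random
import struct
from decimal import Decimal, localcontext


def ordinal(x):
    """Bit pattern of a positive finite double, read as an integer."""
    (bits,) = struct.unpack(">q", struct.pack(">d", x))
    return bits


def ulp_distance(a, b):
    """Number of representable steps between two positive doubles."""
    return abs(ordinal(a) - ordinal(b))


def reference_exp(x):
    """exp(x) from 40-digit decimal arithmetic, then rounded to a double."""
    with localcontext() as ctx:
        ctx.prec = 40
        return float(Decimal(x).exp())


def count_differences(calls=2000, seed=1):
    """Histogram of ULP distances between math.exp and the reference."""
    rng = random.Random(seed)
    histogram = {}
    for _ in range(calls):
        x = rng.random()                   # random argument in [0, 1)
        d = ulp_distance(math.exp(x), reference_exp(x))
        histogram[d] = histogram.get(d, 0) + 1
    return histogram


if __name__ == "__main__":
    print(count_differences())
```

Comparing bit patterns rather than printed values is the important design choice: a 1 ULP difference is invisible at the usual output precision.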
When quadruple precision is not enough – The Table Maker’s Dilemma
• Rounding the approximation of f(x) is not always the same as rounding f(x)
• Worst case for exp(x), x=7.5417527749959590085206221e-10
• Binary example, where (n)b denotes the bit b repeated n times: x = 1.(52)1 * 2**-53, exp(x) = 1.(52)0 1 (104)1 010101…
• Quad (112 bit) approximations: 1.(51)0 1 (60)0 and 1.(51)0 0 (60)1 are both within 1 Quad ULP, but which rounded value is nearest?
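The binary example can be checked with arbitrary precision decimal arithmetic. This is a minimal sketch assuming Python's `decimal` module as the arbitrary precision tool; the names `X`, `BOUNDARY` and `below_boundary` are illustrative. It asks whether a given number of decimal digits is enough to see that exp(x) lies below the double 1 + 2**-52.

```python
from decimal import Context, Decimal

# x = 1.(52 ones) * 2**-53, which is exactly representable as a double
X = 2.0 ** -52 - 2.0 ** -105
# The double just above 1.0; exp(X) lies a hair below it
BOUNDARY = Decimal(1.0 + 2.0 ** -52)


def below_boundary(digits):
    """Can `digits` significant decimal digits tell exp(X) from BOUNDARY?"""
    ctx = Context(prec=digits)
    approx = ctx.exp(Decimal(X))           # exp rounded to `digits` digits
    return approx < ctx.plus(BOUNDARY)     # compare at the same precision


if __name__ == "__main__":
    for digits in (17, 34, 60):
        print(digits, below_boundary(digits))
```

Thirty-four digits is roughly what quadruple precision carries, and at that precision the approximation is indistinguishable from the boundary; only the much longer evaluation resolves on which side exp(x) falls.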
crlibm exp performance
Pentium 4 Xeon gcc 3.3
Library     Average     Min       Max
libm            365     236      5528
crlibm          432     316     41484
libultim        210      44   3105632
mpfr          23299   14636    204736