A Methodology for Producing Identical Double Precision Floating-Point Results Frederick (Eric) McIntosh (Honorary Group Member AB/ABP CERN) [email protected]

A Methodology for Producing Identical Double Precision

Floating-Point Results

Frederick (Eric) McIntosh

(Honorary Group Member AB/ABP CERN)

[email protected]

A Quote“Guaranteeing that two runs of a program

will produce exactly the same results is extremely difficult and may be impossible in practice.”

From “Numerical Replications of Computer Simulations: Some Pitfalls and How To Avoid Them”

Theodore C. Belding (January, 2000)

Obligatory Reading

“What Every Computer Scientist Should Know About Floating-Point Arithmetic”

David Goldberg March, 1991

“Most programs will actually produce different results on different systems”

“Why do we need a floating-point arithmetic standard?”

Prof. W. Kahan, Berkeley, 1981

3

Why Differences?• Binary floating-point arithmetic is inexact and

rounding “error” accumulates and analysis is impractical/impossible

• Mathematically equivalent expressions will often give different results e.g. (a*b)/c != a*(b/c)

• Results depend on the order of evaluation of an expression –may have catastrophic cancellation

• Hardware today “always” conforms to IEEE 754 but the standard has loopholes and is incomplete

• Software standards are evolving but are incomplete with respect to floating-point

4

5

• Defines unique reproducible result for +, -, *, /, and sqrt – the correctly rounded result being the floating-point number closest to the exact result

• It is incomplete and open to interpretation• Needs to be combined with the language

standard• Strict compliance conflicts with performance• Does NOT cover Elementary Functions

IEEE 754 (rev 2008)

…..more

• 4 rounding modes (up, down, nearest, -> 0)• We consider only “round to nearest” _rn• Double Precision• ~15 (and a bit) decimal digits• Range from ~-10**308 to 10**308 but also NaNs

and +/- Infinity• Covers gradual underflow and exceptions• ULP is Unit in the Last (binary) Place the smallest

representable difference between two numbers

Floating-Point Arithmetic• - Sign, Exponent, Mantissa ( bits)

• Single Precision (SP)– 1, 8, 23 (32)

• Double Precision (DP)– 1, 11, 52 (64)

• Extended Precision (EP) A mongrel?– 1, 15, 64 (80)

• Quadruple Precision– 1, 15, 112 (128)

• Arbitrary Precision (Maple, MPFR, etc)7

Programming Standards

• (CERN) FORTRAN 66• Fortran 77, 90, 95, 2003• C/C++ C99• Fortran 2003 (when available) and C99

represent an immense step forward• BUT benefits are often long-term and the

extra time required to conform is NOT appreciated by management and seen as a burden by the programmer

8

Case Study (SixTrack)

• Used to study LHC Beam Dynamics and determine the Dynamic Aperture with and without weak-strong beam-beam effects

• Around 60,000 lines of Fortran 77 (almost) code ported from IBM to Cray, Dec, and to PCs and part of

SPECfp 2000 ($3000.- for CERN!)

• It is a DIVERGENT application in that even a 1 ULP difference will grow exponentially with time giving significantly different results at the onset of chaotic motion (c.f. Non-linear, Lorentz, “butterfly” effect)

• Particle motion is deterministic but there is no theory for predicting the onset of chaotic motion

9

SixTrack (continued)

• A typical LHC Study might require running from 10 to 100,000 cases/jobs using 60 particles, one or more tunes, 10 initial amplitudes and 5 or more initial angles in phase space and a set of magnet errors for one million turns

• Even on a modern PC this takes of the order of 10 hours for each case

• Until now studies were limited by the available computer capacity

10

11

My Idea (not original)

• Use ~10000 Windows desktops at CERN to run SixTrack (PlayStations!) the so called CERN Physics Screen Saver (CPSS) project

• At least double the tracking capacity and potentially provide an order of magnitude increase for zero financial investment

• Later switched to BOINC (the Berkeley Open Infrastructure for Network Computing)

• This required the ability to verify results from almost any kind of PC (later MAC) running the various Operating Systems (Linux all brands, Windows all brands)

• BOINC provides “Homogeneity” and “Epsilon” or “Own Brand” result verification

The MethodologyCopyright

F. McIntosh and CERN• Restricted to double precision 64 bits, 32-bit

executables and ignores precise exception handling and underflow (for GPUs and SSE2/SSE3)

• It should apply equally to C (C++) C99 standard compliant programs but more easily applicable

• Requires source code availability (no libraries)• Will reduce performance• Produces identical (0 ULP difference) results on the

three principal Operating Systems and levels, using any of four different Fortran compilers at “any” optimisation level 12

Methodology

1) For Fortran 77 add parentheses to ensure unique order of expression evaluation

2) Disable Extended Precision as required by 3) and 4) anyway. Not available for SSE2/SSE3

3) Use crlibm from ENS Lyon for the elementary functions (plus arcsine, arccosine and arctan2)

4) Use David M. Gay routines for formatted input (and output) available since 1990

5. Implement exponentiation a**b as exp(b*log(a)) NINT6. Verify no difference between compile time and execute time

evaluation of constants7. Disable Fused Multiply Add (FMA)8. (Persuade library providers to provide a portable version, using this

methodology)

13

LHC@home• After applying the methodology to Sixtrack the BOINC

project LHC@home is providing the equivalent computing power of some 25,000 PCs (100,000 processors, available 50% of the time, every case run twice). Unprecedented capacity for ABP.

• No $25,000,000 to buy them, no software licences, no cooling or power to pay

• Undetected error rate is around 1 in 10,000 (down from one in a hundred thousand one in a million ).

14

What next?• Fix error rate (LHC service is a priority). • Publish for peer review (Berkeley and Lyon)• Extend to 64-bit executables C/C++• The GRID and the CLOUD• Other application(s) at CERN such as Geant or an Experiment

or Theory• Applications such as Climate Prediction NASA, Molecular

Dynamics, NAG library and Compiler• Computer Games• ARM processors, Tablets, Smart Phones and Raspberry Pi

($35.-,800MHz, 512MB)• WARNING: this application heavily uses CPU and device

components. If you observe too high device temperature, please stop computing or decrease CPU usage.

15

16

Acknowledgements

• ABP Group for hosting me for 7 years and providing a couple of PCs and a software licence. BTE Desktop Support .

• M.Giovannozzi and F. Schmidt the SixTrack author for long time help and support and to R. De Maria and L. Deniau.

• EPFL and Prof. Rifkin and I. Zacharov for BOINC support

• ENS Lyon and F. Denichin for crlibm

• H. Renshall (IT/CERN retired) for the parentheses

• G.Erskine (CERN deceased) for outstanding help on numerical analysis and the Error Function of a Complex Number

• A. Wagner (CERN) for CPSS and B. Segal (CERN retired) for suggesting BOINC

• J. Boothroyd, formerly English Electric, for introducing me to Floating-Point Error Analysis (50 years ago)

• Prof. Kahan Berkeley for his work on IEEE 754 and his many publications

• Prof . Anderson Berkeley for BOINC and

• (last but not least)

• The BOINC volunteers, 60,000 worldwide for LHC@home17

English Electric Deuce Computer 1960

Mercury Delay Lines

22

23

Computing at CERN

• Dominated by the needs of the experiments

• Accelerator design, a small fraction of the various mainframes (1964 – 1998) and the “PARC” IBM workstation cluster

• In 1997 the LHC Machine Advisory Committee recommended more tracking

• The “Numerical Accelerator Project”, NAP

Why Standards?

• Should be evident: safety, quality, convenience for example, and they apply in all walks of life including computing

• Program portability ensures fair competition for hardware and software acquisition

• No monopolies (c.f. a giant corporation)

• Should NOT be sacrificed to performance

24

25

Some simple test results• ULP – One Unit in the Last Place of the mantissa

of a floating-point number (one part in roughly 10**16)– libm/crlibm IA32: 304 differences of 1ULP– ibm IA32/IA64: 5 differences of 1ULP – libm IA32/AMD64: 7 differences of 1ULP– libm IA64/AMD64: 2 differences of 1ULP – libm/libm NO EP: 134623 differences of 1ULP

• NO differences with exp_rn

• 1,000,000 exp calls with random arguments (0,1)

26

…and with lf95

– lahey/crlibm IA32: 134645 differences of 1ULP

– lahey IA32/IA64: 7 differences of 1ULP – lahey IA32/AMD64: 7 differences of 1ULP– lahey IA64/AMD64: 4 differences of 1ULP

• NO differences with exp_rn

27

When quadruple precision is not enough – The Table Maker’s Dilemma

• Rounding the approximation of f(x) is not always the same as rounding f(x)

• Worst case for exp(x), x=7.5417527749959590085206221e-10

• Binary example x=1. (52)1 *2-53 exp(x)=1. (52)0 1 (104)1 010101…

• quad (112 bit) approximations : 1. (51)0 1 (60)0 and 1. (51)0 0 (60)1 are both within 1 Quad ULP but which rounded value is nearest?

28

crlibm exp performance

Pentium 4 Xeon gcc 3.3

Average Min Max

libm 365 236 5528

crlibm 432 316 41484

libultim 210 44 3105632

mpfr 23299 14636 204736

Documents

A Methodology for Producing Identical Double Precision Floating-Point Results Frederick (Eric) McIntosh (Honorary Group Member AB/ABP CERN) [email protected]