

Accelerating Scientific Discovery with Low Precision Floating-Point Arithmetic

Nick Higham
School of Mathematics, The University of Manchester
http://www.maths.manchester.ac.uk/~higham

Slides available at http://bit.ly/dcman19

Data-Centric Materials Science and Engineering: Microstructure Fingerprinting and Digital Twinning for Industry 4.0
University of Manchester, May 14-15, 2019


TOP500 #1: Summit, Oak Ridge

9,216 CPUs
27,648 NVIDIA V100 GPUs
Peak 200 petaflops


Floating-Point Arithmetics Up to 2009

Type          Bits   Range      u = 2^-t
fp32 single   32     10^±38     2^-24 ≈ 6.0 × 10^-8
fp64 double   64     10^±308    2^-53 ≈ 1.1 × 10^-16


Today’s Floating-Point Arithmetics

Type               Bits   Range       u = 2^-t
bfloat16 half      16     10^±38      2^-8   ≈ 3.9 × 10^-3
fp16     half      16     10^±5       2^-11  ≈ 4.9 × 10^-4
fp32     single    32     10^±38      2^-24  ≈ 6.0 × 10^-8
fp64     double    64     10^±308     2^-53  ≈ 1.1 × 10^-16
fp128    quadruple 128    10^±4932    2^-113 ≈ 9.6 × 10^-35

The fp* formats are all IEEE standard. bfloat16 is used by the Google TPU. bfloat16 will be in Intel Xeon Cooper Lake (2020).
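The u values in the table are easy to check in plain numpy, which supports fp16, fp32, and fp64 (but not bfloat16 or fp128). A minimal sketch, using the fact that numpy's machine epsilon is 2^(1-t), so u = eps/2:

```python
import numpy as np

# Derive the unit roundoff u = 2^(-t) from numpy's machine epsilon,
# which is eps = 2^(1-t) for each IEEE format.
for dtype, name in [(np.float16, "fp16"),
                    (np.float32, "fp32"),
                    (np.float64, "fp64")]:
    info = np.finfo(dtype)
    u = info.eps / 2
    print(f"{name}: bits={info.bits}, max={info.max:.3e}, u={u:.2e}")
```

Running this reproduces the fp16/fp32/fp64 rows: u ≈ 4.9 × 10^-4, 6.0 × 10^-8, and 1.1 × 10^-16 respectively.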


Why Use Lower Precision in Sci Comp?

Faster flops.
Less communication.
Lower energy consumption.

But it is less accurate, so we need to determine where we can safely use low precision (at what level of granularity?) and prove that algorithms using low precision give sufficient accuracy for all data.


NVIDIA Volta and Turing Architectures

                    TFLOPS
       Year   double   single   half/tensor
P100   2016   4.7      9.3      18.7
V100   2017   7        14       112

“Turing GPUs include a new version of the Tensor Core design that has been enhanced for inferencing.”


NVIDIA Tensor Cores

4 × 4 matrix multiplication in 1 clock cycle:

    X = A B + C,    with A, B in fp16 and C, X in fp16 or fp32.

This is a block fused multiply-add (FMA): one rounding error per element. Algorithms now become intrinsically multiprecision, and more complicated to analyze.
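A minimal numpy sketch of the numerical behavior: products of two fp16 numbers (11-bit significands) are exact in fp32, so casting up and accumulating in fp32 closely models, though does not exactly reproduce, the single rounding per element of the block FMA:

```python
import numpy as np

rng = np.random.default_rng(0)

# fp16 inputs to the block multiply-add.
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

# Tensor-core style: fp16 products are exact in fp32; accumulate in fp32.
X_tc = A.astype(np.float32) @ B.astype(np.float32) + C

# Pure fp16 computation for comparison: a rounding after every operation.
X_16 = A @ B + C.astype(np.float16)

print(np.max(np.abs(X_tc - X_16.astype(np.float32))))
```

The printed difference shows how much accuracy the fp32 accumulation recovers over an all-fp16 evaluation.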


Harmonic Series

What is the computed value of the harmonic sum 1 + 1/2 + 1/3 + · · ·?

Arithmetic   Computed sum   No. of terms
fp8          3.5000         16
bfloat16     5.0625         65
fp16         7.0859         513
fp32         15.404         2097152
fp64         34.122         2.81⋯ × 10^14
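The fp16 row is easy to reproduce in numpy: sum until the partial sum stagnates, i.e., until adding 1/n no longer changes it. (The reported term count may differ by one from the table depending on the counting convention.)

```python
import numpy as np

def stagnation_point(dtype):
    """Sum 1 + 1/2 + 1/3 + ... in the given precision until the
    partial sum stops changing; return the sum and the term index."""
    s = dtype(0)
    n = 0
    while True:
        n += 1
        t = s + dtype(1) / dtype(n)
        if t == s:          # adding 1/n no longer changes the sum
            return float(s), n
        s = t

print(stagnation_point(np.float16))   # roughly (7.0859, 513)
# fp32 stagnates too, but only after about 2 million terms.
```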


Example: Vector 2-Norm in fp16

Evaluate ‖x‖₂ = √(x₁² + x₂²) in fp16 for x = [α, α]ᵀ.
Recall u_h = 4.88 × 10^-4, r_min = 6.10 × 10^-5.

α            Relative error   Comment
10^-4        1                Underflow to 0
3.3 × 10^-4  4.7 × 10^-2      Subnormal range
5.5 × 10^-4  7.1 × 10^-3      Subnormal range
1.1 × 10^-2  1.4 × 10^-4      Perfect rel. err

Overflow also likely: x_max = 65504.
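A small numpy experiment illustrating the α = 10^-4 row, together with the standard fix of scaling by max|x_i| before squaring so that the intermediate quantities stay in range:

```python
import numpy as np

alpha = np.float16(1e-4)
x = np.array([alpha, alpha], dtype=np.float16)

# Naive evaluation: the squares (about 1e-8) underflow to zero in fp16,
# since the smallest fp16 subnormal is about 6e-8.
naive = np.sqrt(x[0] * x[0] + x[1] * x[1])

# Scaled evaluation: divide by max|x_i| first, square, then multiply
# the scale factor back at the end.
t = np.max(np.abs(x))
scaled = t * np.sqrt(np.sum((x / t) ** 2))

print(naive, scaled)   # naive is 0; scaled is close to sqrt(2) * alpha
```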


Accelerating the Solution of Ax = b

A ∈ R^{n×n} nonsingular.

Standard method for solving Ax = b: factorize A = LU, then solve

    LUx = b  ⟺  Ly = b,  Ux = y,

all at working precision.

Can we solve Ax = b faster and/or more accurately by exploiting multiprecision arithmetic?
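For reference, a minimal working-precision baseline in scipy, whose lu_factor/lu_solve wrap LAPACK's LU factorization with partial pivoting:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

lu, piv = lu_factor(A)        # one O(n^3) LU factorization (partial pivoting)
x = lu_solve((lu, piv), b)    # two O(n^2) triangular solves

print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # relative residual
```

The mixed-precision methods on the following slides keep this structure but move the O(n^3) factorization to lower precision.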


Performance on NVIDIA V100
Haidar, Tomov, Dongarra & Higham (2018), half–double–double

[Figure: Tflop/s versus matrix size (2k to 34k) for four solvers: FP16-TC->64 dhgesv (tensor cores), FP16->64 dhgesv, FP32->64 dsgesv, and FP64 dgesv. Left axis: Tflop/s (0 to 24); right axis: condition number of A (10^0 to 10^6).]


Iterative Refinement in Three Precisions

A, b given in double precision.

Solve Ax = b by LU factorization in half precision, then refine:
  r = b − Ax       (quad precision)
  Solve Ad = r     (half precision)
  y = x + d        (double precision)

Generalizes the classic iterative refinement programmed by Wilkinson (1948). Works for both dense and sparse A.
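A simulation sketch in numpy/scipy: neither supports half-precision LU nor quad-precision residuals, so here single stands in for half and double plays the roles of both the working and the residual precision. The structure of the algorithm is unchanged:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def ir3(A, b, n_iter=5):
    """Iterative refinement with the LU factorization computed in a
    lower precision than the working precision (single standing in
    for half, double for the working/residual precisions)."""
    lu, piv = lu_factor(A.astype(np.float32))           # low-precision LU
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(n_iter):
        r = b - A @ x                                    # residual, higher precision
        d = lu_solve((lu, piv), r.astype(np.float32))    # correction, low precision
        x = x + d.astype(np.float64)                     # update, working precision
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)
x = ir3(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

The O(n^3) factorization cost is paid in the cheap precision; each refinement step costs only O(n^2).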


GMRES-IR with Diagonal Scaling

1: Compute A_h = fl_h(µ(RAS)) by two-sided diagonal scaling.
2: b_h = fl_h(Rb)
3: Compute A_h ≈ LU (precision u_h).
4: Solve A_h y_0 = b_h (precision u_h).
5: x_0 = µ S y_0
6: for i = 0 : ∞ do
7:   r_i = b − A x_i (precision u_r)
8:   Solve M A d_i = M r_i by GMRES, where M = µ S U⁻¹ L⁻¹ R (apply M at precision u_r).
9:   x_{i+1} = x_i + d_i
10: end for

Supported by rigorous rounding error analysis.
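A simplified scipy sketch of the GMRES-IR idea, omitting the diagonal scaling (R, S, µ) and again using single in place of half for the factorization. The low-precision LU factors serve only as a preconditioner for GMRES on the correction equation, applied at the residual precision:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_ir(A, b, n_iter=5):
    """GMRES-based iterative refinement: solve each correction
    equation A d = r by GMRES preconditioned with the low-precision
    LU factors (single standing in for half)."""
    n = A.shape[0]
    lu, piv = lu_factor(A.astype(np.float32))    # low-precision LU
    lu64 = lu.astype(np.float64)                 # apply M in double precision
    M = LinearOperator((n, n), matvec=lambda v: lu_solve((lu64, piv), v),
                       dtype=np.float64)         # M ≈ A^{-1} from the factors
    x = lu_solve((lu64, piv), b)                 # initial solve
    for _ in range(n_iter):
        r = b - A @ x                            # residual in double
        d, info = gmres(A, r, M=M)               # preconditioned correction
        x = x + d
    return x
```

Because GMRES only needs M to cluster the spectrum of MA, even a factorization computed in very low precision, suitably scaled, is enough for fast convergence.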


Conclusions

Growing interest in novel and low precision floating-point formats.

The hardware is motivated by machine learning, but the formats are also valuable for scientific computing.

Low precision can incur overflow/underflow.

We must show that sufficient accuracy is obtained.

On an NVIDIA V100 we can solve Ax = b four times faster and with 80% less energy.

Slides available at http://bit.ly/dcman19.
Slides from the Royal Society Discussion Meeting at http://bit.ly/nahpc19.


References I

E. Carson and N. J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM J. Sci. Comput., 39(6):A2834–A2856, 2017.

E. Carson and N. J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput., 40(2):A817–A847, 2018.


References II

A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’18 (Dallas, TX), pages 47:1–47:11. IEEE Press, Piscataway, NJ, USA, 2018.

N. J. Higham and S. Pranesh. Simulating low precision floating-point arithmetic. MIMS EPrint 2019.4, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, Mar. 2019. 17 pp.


References III

N. J. Higham, S. Pranesh, and M. Zounon. Squeezing a matrix into half precision, with an application to solving linear systems. MIMS EPrint 2018.37, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, Nov. 2018. 15 pp. Revised March 2019.

D. Malone. To what does the harmonic series converge? Irish Math. Soc. Bulletin, 71:59–66, 2013.
