

Accelerating Scientific Discovery with Low Precision Floating-Point Arithmetic

Nick Higham
School of Mathematics, The University of Manchester
http://www.maths.manchester.ac.uk/~higham

Slides available at http://bit.ly/dcman19

Data-Centric Materials Science and Engineering: Microstructure Fingerprinting and Digital Twinning for Industry 4.0
University of Manchester, May 14-15, 2019


TOP500 #1: Summit, Oak Ridge

9,216 CPUs
27,648 NVIDIA V100 GPUs
Peak 200 petaflops


Floating-Point Arithmetics Up to 2009

Type          Bits   Range      u = 2^-t
fp32 single   32     10^±38     2^-24 ≈ 6.0 × 10^-8
fp64 double   64     10^±308    2^-53 ≈ 1.1 × 10^-16


Today’s Floating-Point Arithmetics

Type               Bits   Range       u = 2^-t
bfloat16 half      16     10^±38      2^-8   ≈ 3.9 × 10^-3
fp16     half      16     10^±5       2^-11  ≈ 4.9 × 10^-4
fp32     single    32     10^±38      2^-24  ≈ 6.0 × 10^-8
fp64     double    64     10^±308     2^-53  ≈ 1.1 × 10^-16
fp128    quadruple 128    10^±4932    2^-113 ≈ 9.6 × 10^-35

The fp* formats are all IEEE standard. bfloat16 is used by the Google TPU. bfloat16 will be in Intel Xeon Cooper Lake (2020).
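The u values in the table are easy to check in plain numpy, which supports fp16, fp32, and fp64 (but not bfloat16 or fp128). A minimal sketch, using the fact that numpy's machine epsilon is 2^(1-t), so u = eps/2:

```python
import numpy as np

# Derive the unit roundoff u = 2^(-t) from numpy's machine epsilon,
# which is eps = 2^(1-t) for each IEEE format.
for dtype, name in [(np.float16, "fp16"),
                    (np.float32, "fp32"),
                    (np.float64, "fp64")]:
    info = np.finfo(dtype)
    u = info.eps / 2
    print(f"{name}: bits={info.bits}, max={info.max:.3e}, u={u:.2e}")
```

Running this reproduces the fp16/fp32/fp64 rows: u ≈ 4.9 × 10^-4, 6.0 × 10^-8, and 1.1 × 10^-16 respectively.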


Why Use Lower Precision in Sci Comp?

Faster flops.
Less communication.
Lower energy consumption.

But it is less accurate, so we need to determine where we can safely use low precision (at what level of granularity?) and prove that algorithms using low precision give sufficient accuracy for all data.


NVIDIA Volta and Turing Architectures

                    TFLOPS
       Year   double   single   half/tensor
P100   2016   4.7      9.3      18.7
V100   2017   7        14       112

“Turing GPUs include a new version of the Tensor Core design that has been enhanced for inferencing.”


NVIDIA Tensor Cores

4 × 4 matrix multiplication in 1 clock cycle:

    X = A B + C,    with A, B in fp16 and C, X in fp16 or fp32.

This is a block fused multiply-add (FMA): one rounding error per element. Algorithms now become intrinsically multiprecision, and more complicated to analyze.
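A minimal numpy sketch of the numerical behavior: products of two fp16 numbers (11-bit significands) are exact in fp32, so casting up and accumulating in fp32 closely models, though does not exactly reproduce, the single rounding per element of the block FMA:

```python
import numpy as np

rng = np.random.default_rng(0)

# fp16 inputs to the block multiply-add.
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

# Tensor-core style: fp16 products are exact in fp32; accumulate in fp32.
X_tc = A.astype(np.float32) @ B.astype(np.float32) + C

# Pure fp16 computation for comparison: a rounding after every operation.
X_16 = A @ B + C.astype(np.float16)

print(np.max(np.abs(X_tc - X_16.astype(np.float32))))
```

The printed difference shows how much accuracy the fp32 accumulation recovers over an all-fp16 evaluation.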


Harmonic Series

What is the computed value of the harmonic sum 1 + 1/2 + 1/3 + · · ·?

Arithmetic   Computed sum   No. of terms
fp8          3.5000         16
bfloat16     5.0625         65
fp16         7.0859         513
fp32         15.404         2097152
fp64         34.122         2.81⋯ × 10^14
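The fp16 row is easy to reproduce in numpy: sum until the partial sum stagnates, i.e., until adding 1/n no longer changes it. (The reported term count may differ by one from the table depending on the counting convention.)

```python
import numpy as np

def stagnation_point(dtype):
    """Sum 1 + 1/2 + 1/3 + ... in the given precision until the
    partial sum stops changing; return the sum and the term index."""
    s = dtype(0)
    n = 0
    while True:
        n += 1
        t = s + dtype(1) / dtype(n)
        if t == s:          # adding 1/n no longer changes the sum
            return float(s), n
        s = t

print(stagnation_point(np.float16))   # roughly (7.0859, 513)
# fp32 stagnates too, but only after about 2 million terms.
```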


Example: Vector 2-Norm in fp16

Evaluate ‖x‖₂ = √(x₁² + x₂²) in fp16 for x = [α, α]ᵀ.
Recall u_h = 4.88 × 10^-4, r_min = 6.10 × 10^-5.

α            Relative error   Comment
10^-4        1                Underflow to 0
3.3 × 10^-4  4.7 × 10^-2      Subnormal range
5.5 × 10^-4  7.1 × 10^-3      Subnormal range
1.1 × 10^-2  1.4 × 10^-4      Perfect rel. err

Overflow also likely: x_max = 65504.
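A small numpy experiment illustrating the α = 10^-4 row, together with the standard fix of scaling by max|x_i| before squaring so that the intermediate quantities stay in range:

```python
import numpy as np

alpha = np.float16(1e-4)
x = np.array([alpha, alpha], dtype=np.float16)

# Naive evaluation: the squares (about 1e-8) underflow to zero in fp16,
# since the smallest fp16 subnormal is about 6e-8.
naive = np.sqrt(x[0] * x[0] + x[1] * x[1])

# Scaled evaluation: divide by max|x_i| first, square, then multiply
# the scale factor back at the end.
t = np.max(np.abs(x))
scaled = t * np.sqrt(np.sum((x / t) ** 2))

print(naive, scaled)   # naive is 0; scaled is close to sqrt(2) * alpha
```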


Accelerating the Solution of Ax = b

A ∈ R^{n×n} nonsingular.

Standard method for solving Ax = b: factorize A = LU, then solve

    LUx = b  ⟺  Ly = b,  Ux = y,

all at working precision.

Can we solve Ax = b faster and/or more accurately by exploiting multiprecision arithmetic?
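For reference, a minimal working-precision baseline in scipy, whose lu_factor/lu_solve wrap LAPACK's LU factorization with partial pivoting:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

lu, piv = lu_factor(A)        # one O(n^3) LU factorization (partial pivoting)
x = lu_solve((lu, piv), b)    # two O(n^2) triangular solves

print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # relative residual
```

The mixed-precision methods on the following slides keep this structure but move the O(n^3) factorization to lower precision.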


Performance on NVIDIA V100
Haidar, Tomov, Dongarra & Higham (2018), half–double–double

[Figure: Tflop/s versus matrix size (2k to 34k) for four solvers: FP16-TC->64 dhgesv (tensor cores), FP16->64 dhgesv, FP32->64 dsgesv, and FP64 dgesv. Left axis: Tflop/s (0 to 24); right axis: condition number of A (10^0 to 10^6).]


Iterative Refinement in Three Precisions

A, b given in double precision.

Solve Ax = b by LU factorization in half precision, then refine:
  r = b − Ax       (quad precision)
  Solve Ad = r     (half precision)
  y = x + d        (double precision)

Generalizes the classic iterative refinement programmed by Wilkinson (1948). Works for both dense and sparse A.
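A simulation sketch in numpy/scipy: neither supports half-precision LU nor quad-precision residuals, so here single stands in for half and double plays the roles of both the working and the residual precision. The structure of the algorithm is unchanged:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def ir3(A, b, n_iter=5):
    """Iterative refinement with the LU factorization computed in a
    lower precision than the working precision (single standing in
    for half, double for the working/residual precisions)."""
    lu, piv = lu_factor(A.astype(np.float32))           # low-precision LU
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(n_iter):
        r = b - A @ x                                    # residual, higher precision
        d = lu_solve((lu, piv), r.astype(np.float32))    # correction, low precision
        x = x + d.astype(np.float64)                     # update, working precision
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)
x = ir3(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

The O(n^3) factorization cost is paid in the cheap precision; each refinement step costs only O(n^2).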


GMRES-IR with Diagonal Scaling

1: Compute A_h = fl_h(µ(RAS)) by two-sided diagonal scaling.
2: b_h = fl_h(Rb)
3: Compute A_h ≈ LU (precision u_h).
4: Solve A_h y_0 = b_h (precision u_h).
5: x_0 = µ S y_0
6: for i = 0 : ∞ do
7:   r_i = b − A x_i (precision u_r)
8:   Solve M A d_i = M r_i by GMRES, where M = µ S U⁻¹ L⁻¹ R (apply M at precision u_r).
9:   x_{i+1} = x_i + d_i
10: end for

Supported by rigorous rounding error analysis.
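A simplified scipy sketch of the GMRES-IR idea, omitting the diagonal scaling (R, S, µ) and again using single in place of half for the factorization. The low-precision LU factors serve only as a preconditioner for GMRES on the correction equation, applied at the residual precision:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_ir(A, b, n_iter=5):
    """GMRES-based iterative refinement: solve each correction
    equation A d = r by GMRES preconditioned with the low-precision
    LU factors (single standing in for half)."""
    n = A.shape[0]
    lu, piv = lu_factor(A.astype(np.float32))    # low-precision LU
    lu64 = lu.astype(np.float64)                 # apply M in double precision
    M = LinearOperator((n, n), matvec=lambda v: lu_solve((lu64, piv), v),
                       dtype=np.float64)         # M ≈ A^{-1} from the factors
    x = lu_solve((lu64, piv), b)                 # initial solve
    for _ in range(n_iter):
        r = b - A @ x                            # residual in double
        d, info = gmres(A, r, M=M)               # preconditioned correction
        x = x + d
    return x
```

Because GMRES only needs M to cluster the spectrum of MA, even a factorization computed in very low precision, suitably scaled, is enough for fast convergence.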


Conclusions

Growing interest in novel and low precision floating-point formats.

The hardware is motivated by machine learning, but the formats are also valuable for scientific computing.

Low precision can incur overflow/underflow.

We must show that sufficient accuracy is obtained.

On an NVIDIA V100 we can solve Ax = b four times faster and with 80% less energy.

Slides available at http://bit.ly/dcman19.
Slides from the Royal Society Discussion Meeting at http://bit.ly/nahpc19.


References I

E. Carson and N. J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM J. Sci. Comput., 39(6):A2834–A2856, 2017.

E. Carson and N. J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput., 40(2):A817–A847, 2018.


References II

A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’18 (Dallas, TX), pages 47:1–47:11. IEEE Press, Piscataway, NJ, USA, 2018.

N. J. Higham and S. Pranesh. Simulating low precision floating-point arithmetic. MIMS EPrint 2019.4, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, Mar. 2019. 17 pp.


References III

N. J. Higham, S. Pranesh, and M. Zounon. Squeezing a matrix into half precision, with an application to solving linear systems. MIMS EPrint 2018.37, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, Nov. 2018. 15 pp. Revised March 2019.

D. Malone. To what does the harmonic series converge? Irish Math. Soc. Bulletin, 71:59–66, 2013.
