Accelerating Scientific Discovery with Low Precision Floating-Point Arithmetic

Nick Higham
School of Mathematics
The University of Manchester
http://www.maths.manchester.ac.uk/~higham

Slides available at http://bit.ly/dcman19

Data-Centric Materials Science and Engineering: Microstructure Fingerprinting and Digital Twinning for Industry 4.0
University of Manchester, May 14-15, 2019
TOP500 #1: Summit, Oak Ridge
9,216 CPUs
27,648 NVIDIA V100 GPUs
Peak 200 petaflops
Floating-Point Arithmetics Up to 2009
Type            Bits  Range      u = 2^(-t)
fp32 (single)   32    10^(±38)   2^(-24) ≈ 6.0 × 10^(-8)
fp64 (double)   64    10^(±308)  2^(-53) ≈ 1.1 × 10^(-16)
Today’s Floating-Point Arithmetics
Type               Bits  Range       u = 2^(-t)
bfloat16 (half)    16    10^(±38)    2^(-8)   ≈ 3.9 × 10^(-3)
fp16 (half)        16    10^(±5)     2^(-11)  ≈ 4.9 × 10^(-4)
fp32 (single)      32    10^(±38)    2^(-24)  ≈ 6.0 × 10^(-8)
fp64 (double)      64    10^(±308)   2^(-53)  ≈ 1.1 × 10^(-16)
fp128 (quadruple)  128   10^(±4932)  2^(-113) ≈ 9.6 × 10^(-35)
The fp* formats are all IEEE standard.
bfloat16 is used by the Google TPU.
bfloat16 will be in Intel Xeon Cooper Lake (2020).
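These parameters can be checked in software. A minimal NumPy sketch (bfloat16 and fp128 are not in stock NumPy, so only fp16/fp32/fp64 appear):

```python
import numpy as np

# Print bits, range, and unit roundoff for the IEEE formats NumPy has.
# finfo.eps is the spacing of floats just above 1, i.e. 2u, so u = eps/2 = 2^(-t).
for dtype in (np.float16, np.float32, np.float64):
    f = np.finfo(dtype)
    print(f"{f.dtype}: bits={f.bits}, max={f.max:.3e}, u={f.eps / 2:.2e}")
```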
Why Use Lower Precision in Scientific Computing?
Faster flops.
Less communication.
Lower energy consumption.

But it is less accurate, so we need to determine where we can safely use low precision (and at what level of granularity?), and to prove that algorithms using low precision give sufficient accuracy for all data.
NVIDIA Volta and Turing Architectures
                     TFLOPS
             double  single  half/tensor
P100 (2016)  4.7     9.3     18.7
V100 (2017)  7       14      112
“Turing GPUs include a new version of theTensor Core design that has been enhancedfor inferencing.”
NVIDIA Tensor Cores
4 × 4 matrix multiplication in 1 clock cycle:

X = A B + C,

where A and B are 4 × 4 matrices in fp16, and C and X are 4 × 4 matrices in fp16 or fp32.
This is a block fused multiply-add (FMA): one rounding error per element.
Algorithms now become intrinsically multiprecision, and more complicated to analyze.
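One way to see the effect in software (a model, not the actual tensor core data path; the function name is illustrative) is to compare an all-fp16 multiply-accumulate against fp16 inputs with fp32 accumulation:

```python
import numpy as np

def mma_fp16(A, B, C):
    """4x4 multiply-accumulate with every operation rounded to fp16."""
    D = C.astype(np.float16)
    for i in range(4):
        for j in range(4):
            for k in range(4):
                # Both the product and the running sum round to fp16.
                D[i, j] = D[i, j] + np.float16(A[i, k] * B[k, j])
    return D

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

# Tensor-core style: fp16 products are exact in fp32 (11-bit significands),
# and the accumulation runs in fp32.
D_tc = A.astype(np.float32) @ B.astype(np.float32) + C

D_all16 = mma_fp16(A, B, C)
print(np.max(np.abs(D_tc - D_all16.astype(np.float32))))
```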
Harmonic Series
What is the computed sum of the harmonic series 1 + 1/2 + 1/3 + ··· in each arithmetic? The sum stagnates once the next term is too small to change the partial sum.

Arithmetic   Computed sum   No. of terms
fp8          3.5000         16
bfloat16     5.0625         65
fp16         7.0859         513
fp32         15.404         2097152
fp64         34.122         2.81⋯ × 10^14
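The fp16 row can be reproduced directly (the other rows need nonstandard types). A NumPy sketch:

```python
import numpy as np

# Sum 1 + 1/2 + 1/3 + ... in fp16 until adding the next term
# no longer changes the partial sum.
s = np.float16(0)
n = 0
while True:
    t = s + np.float16(1.0) / np.float16(n + 1)
    if t == s:      # 1/(n+1) is below half an ulp of s: the sum stagnates
        break
    s = t
    n += 1

print(s, n)   # roughly 7.0859 after about 513 terms (cf. the table)
```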
Example: Vector 2-Norm in fp16

Evaluate ‖x‖_2 = √(x_1² + x_2²) in fp16 for x = [α, α]^T.

Recall u_h = 4.88 × 10^(-4) and r_min = 6.10 × 10^(-5) (the smallest normalized fp16 number).

α              Relative error   Comment
10^(-4)        1                Underflow to 0
3.3 × 10^(-4)  4.7 × 10^(-2)    Subnormal range
5.5 × 10^(-4)  7.1 × 10^(-3)    Subnormal range
1.1 × 10^(-2)  1.4 × 10^(-4)    Perfect rel. err

Overflow is also likely: x_max = 65504.
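A minimal NumPy sketch of both the failure and the standard fix (scale by the largest component before squaring); the function names are illustrative:

```python
import numpy as np

def norm2_naive(x):
    # Squaring underflows: e.g. (1e-4)**2 = 1e-8 is below the smallest
    # fp16 subnormal (about 6e-8) and rounds to zero.
    return np.sqrt(np.sum(np.square(x)))

def norm2_scaled(x):
    # Scale by max|x_i| so the squares stay well inside the fp16 range.
    m = np.max(np.abs(x))
    if m == 0:
        return np.float16(0)
    y = x / m
    return np.float16(m * np.sqrt(np.sum(np.square(y))))

alpha = np.float16(1e-4)
x = np.array([alpha, alpha], dtype=np.float16)
print(norm2_naive(x))    # 0.0: alpha**2 underflows to zero
print(norm2_scaled(x))   # about 1.414e-4, the correct value
```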
Accelerating the Solution of Ax = b
A ∈ R^(n×n) nonsingular.

Standard method for solving Ax = b: factorize A = LU, then solve

    LUx = b  ⇐⇒  Ly = b, Ux = y,

all at working precision.
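In SciPy the standard method looks like this (a sketch with random data):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Factorize once (O(n^3) flops), then solve by two triangular
# substitutions (O(n^2) flops), all at working (double) precision.
lu, piv = lu_factor(A)       # LU factorization with partial pivoting
x = lu_solve((lu, piv), b)   # forward and back substitution

print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```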
Can we solve Ax = b faster and/or more accuratelyby exploiting multiprecision arithmetic?
Performance on NVIDIA V100
Haidar, Tomov, Dongarra & Higham (2018), half–double–double
[Figure: Tflop/s vs. matrix size (2k–34k) on a V100 for FP16-TC->64 (dhgesv), FP16->64 (dhgesv), FP32->64 (dsgesv), and FP64 (dgesv).]
Iterative Refinement in Three Precisions
A, b given in double precision.

Solve Ax = b by LU factorization in half precision.
r = b − Ax        (quad precision)
Solve Ad = r      (half precision)
y = x + d         (double precision)

Generalizes classic IR programmed by Wilkinson (1948).
Works for both dense and sparse A.
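A minimal simulation of the scheme (NumPy/SciPy have no fp16 LU or true quad, so here fp32 stands in for half, fp64 is the working precision, and np.longdouble stands in for quad):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Factorize in the low precision (fp32 standing in for fp16).
lu, piv = lu_factor(A.astype(np.float32))
x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

for _ in range(5):
    # Residual in extra precision (np.longdouble standing in for quad).
    r = b.astype(np.longdouble) - A.astype(np.longdouble) @ x.astype(np.longdouble)
    # Correction from the low-precision factors.
    d = lu_solve((lu, piv), r.astype(np.float32))
    # Update in working precision.
    x = x + d.astype(np.float64)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```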
GMRES-IR with Diagonal Scaling
1: Compute A^(h) = fl_h(µ(RAS)) by two-sided diagonal scaling.
2: b^(h) = fl_h(Rb)
3: Compute A^(h) ≈ LU (prec u_h).
4: Solve A^(h) y_0 = b^(h) (prec u_h).
5: x_0 = µ S y_0
6: for i = 0 : ∞ do
7:   r_i = b − A x_i (prec u_r)
8:   Solve M A d_i = M r_i by GMRES, where M = µ S U^(−1) L^(−1) R (apply M at prec u_r).
9:   x_(i+1) = x_i + d_i
10: end for
Supported by rigorous rounding error analysis.
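A stripped-down sketch of the idea (ignoring the diagonal scaling, i.e. R = S = I and µ = 1, and again with fp32 standing in for half precision):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

lu, piv = lu_factor(A.astype(np.float32))   # low-precision LU
x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

# Preconditioner M ~ A^{-1} built from the low-precision factors.
M = LinearOperator(
    (n, n),
    matvec=lambda v: lu_solve((lu, piv), v.astype(np.float32)).astype(np.float64),
)

for _ in range(3):
    r = b - A @ x                # residual
    d, info = gmres(A, r, M=M)   # preconditioned correction solve
    x = x + d
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```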
Conclusions
Growing interest in novel and low precision floating-point formats.

The motivation for the hardware is machine learning, but it is valuable for scientific computing.
Low precision can incur overflow/underflow.
Must show sufficient accuracy obtained.
On an NVIDIA V100 we can solve Ax = b four times faster and with 80% less energy.
Slides available at http://bit.ly/dcman19
Slides from the Royal Society Discussion Meeting at http://bit.ly/nahpc19
References I
E. Carson and N. J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM J. Sci. Comput., 39(6):A2834–A2856, 2017.
E. Carson and N. J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput., 40(2):A817–A847, 2018.
References II
A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18 (Dallas, TX), pages 47:1–47:11. IEEE Press, Piscataway, NJ, USA, 2018.
N. J. Higham and S. Pranesh. Simulating low precision floating-point arithmetic. MIMS EPrint 2019.4, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, Mar. 2019. 17 pp.
References III
N. J. Higham, S. Pranesh, and M. Zounon. Squeezing a matrix into half precision, with an application to solving linear systems. MIMS EPrint 2018.37, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, Nov. 2018. 15 pp. Revised March 2019.
D. Malone. To what does the harmonic series converge? Irish Math. Soc. Bulletin, 71:59–66, 2013.