Page 1

Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed-Precision Solvers on GPUs

Ahmad Abdelfattah, Stan Tomov, and Jack Dongarra

Innovative Computing Laboratory, University of Tennessee, Knoxville

2019 ScalA Workshop, Denver, CO

Monday, November 18th, 2019

Page 2

Outline

1 Introduction

2 Matrix Multiplication with Half-complex Precision

3 Mixed-precision Factorization and Solve

4 Final Performance

5 Conclusion

Page 3

Outline

1 Introduction

2 Matrix Multiplication with Half-complex Precision

3 Mixed-precision Factorization and Solve

4 Final Performance

5 Conclusion

Page 4

What are we trying to solve?

Solve a linear system of equations (Ax = b)
A is a general square matrix in single-complex precision
Use half precision to accelerate the solution (mixed-precision solver)

Accuracy is recovered using Iterative Refinement (IR); the algorithm is similar to:

Carson and Higham: Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions, SIAM SISC, 2018
Haidar et al.: Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed Up Mixed-Precision Iterative Refinement Solvers, SC'18

No native support for half-complex computation

This work is part of the MAGMA library

Page 5

What are the steps?

1 Mixed-precision LU factorization with partial pivoting
2 Iterative refinement using preconditioned GMRES

Flow: if A is in FP64, convert A from FP64 to FP32. Then run the mixed-precision LU:
1. Panel factorization (FP32)
2. Rank-k updates (FP16 -> FP32): C_FP32 -= A_FP16 × B_FP16, a Tensor Core GEMM with FP32 accumulate

Two design choices:
1. GEMM layout (interleaved vs. planar)
2. Panel factorization (CPU vs. GPU)
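A host-side sketch of this blocked loop is shown below; the helper routines are hypothetical placeholders (not MAGMA entry points), used only to make the precision flow concrete.

#include <cuComplex.h>
#include <cuda_fp16.h>
#include <algorithm>

// Hypothetical helpers: FP32 panel factorization, pivot swaps + triangular
// solve, FP32 -> half-complex conversion, and the Tensor-Core rank-k update
// C_FP32 -= A_FP16 * B_FP16 with FP32 accumulation.
void panel_getrf_c32(int m, int n, cuComplex *A, int lda, int *ipiv);
void swaps_and_trsm_c32(int jb, int ncols, cuComplex *Ajj, cuComplex *Arow,
                        int lda, const int *ipiv);
void convert_c32_to_c16(int m, int n, const cuComplex *A, int lda,
                        __half2 *hA, int ldha);
void rankk_update_fp16(int m, int n, int k, const __half2 *hA, int ldha,
                       const __half2 *hB, int ldhb, cuComplex *C, int ldc);

void mixed_precision_getrf(int n, cuComplex *A, int lda, int *ipiv,
                           __half2 *hA, __half2 *hB, int nb)
{
    for (int j = 0; j < n; j += nb) {
        int jb = std::min(nb, n - j);
        cuComplex *Ajj = A + j + j * lda;          // column-major A(j, j)

        // 1. Panel factorization stays in single-complex (FP32) precision.
        panel_getrf_c32(n - j, jb, Ajj, lda, ipiv + j);

        // Pivot swaps and triangular solve on the block row, still in FP32.
        swaps_and_trsm_c32(jb, n - j - jb, Ajj, Ajj + jb * lda, lda, ipiv + j);

        // 2. Convert the panel and block row to half-complex, then apply the
        //    Tensor-Core rank-jb update with FP32 accumulation.
        convert_c32_to_c16(n - j - jb, jb, Ajj + jb, lda, hA, n);
        convert_c32_to_c16(jb, n - j - jb, Ajj + jb * lda, lda, hB, nb);
        rankk_update_fp16(n - j - jb, n - j - jb, jb, hA, n, hB, nb,
                          Ajj + jb + jb * lda, lda);
    }
}

The blocking size nb controls how much work stays in FP32 (the panel) versus FP16 (the update), which is exactly the trade-off examined later.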

Page 6

What are the steps?

1 Mixed-precision LU factorization with partial pivoting
2 Iterative refinement using preconditioned GMRES

(1) Compute the residual r = b - Ax
(2) Solve for c: Ac = r
(3) Update the solution: x_{i+1} = x_i + c
(4) Repeat (1)-(3) until converged

Classic IR: step (2) uses the L/U factors of A
IR + GMRES: step (2) uses preconditioned GMRES
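A minimal host-side sketch of this loop follows; the helpers are hypothetical placeholders (not MAGMA/cuBLAS calls), and only the correction solve differs between classic IR and IRGMRES.

#include <cuComplex.h>

// Hypothetical helpers: single-complex residual (r = b - A*x), a norm, the
// low-precision correction solve (classic IR: L/U triangular solves;
// IRGMRES: LU-preconditioned GMRES), and an axpy-style update.
void residual_c32(int n, const cuComplex *A, int lda, const cuComplex *x,
                  const cuComplex *b, cuComplex *r);
double norm2_c32(int n, const cuComplex *v);
void correction_solve(int n, const void *LU_lowprec, const int *ipiv,
                      const cuComplex *r, cuComplex *c);
void add_update(int n, const cuComplex *c, cuComplex *x);

bool iterative_refinement(int n, const cuComplex *A, int lda,
                          const cuComplex *b, cuComplex *x,
                          const void *LU_lowprec, const int *ipiv,
                          cuComplex *r, cuComplex *c,
                          int max_iter, double tol)
{
    double bnrm = norm2_c32(n, b);
    for (int it = 0; it < max_iter; ++it) {
        residual_c32(n, A, lda, x, b, r);               // (1) r = b - A*x
        if (norm2_c32(n, r) <= tol * bnrm) return true; // (4) converged
        correction_solve(n, LU_lowprec, ipiv, r, c);    // (2) solve A*c = r
        add_update(n, c, x);                            // (3) x_{i+1} = x_i + c
    }
    return false;
}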

Page 7

Outline

1 Introduction

2 Matrix Multiplication with Half-complex Precision

3 Mixed-precision Factorization and Solve

4 Final Performance

5 Conclusion

Page 8

Mixed-precision Matrix Multiplication

C_FP32 = C_FP32 − A_FP16 × B_FP16

A and B are originally in FP32, but are converted to half-complex
No native support for half-complex kernels:
cuBLAS provides only real-arithmetic FP16 kernels
cuBLASLt and CUTLASS suggest using split-complex kernels with a planar layout

But the interleaved layout is a better alternative:
Used by almost all linear algebra algorithms
Better operational intensity (i.e., flops/bytes ratio)
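For illustration only (not MAGMA's actual data structures), the two storage options for an m-by-n half-complex matrix can be pictured as:

#include <cuda_fp16.h>

// Interleaved layout: the layout LAPACK/MAGMA algorithms already use.
// Real and imaginary parts of each element sit next to each other, so one
// half2 load fetches a complete element and a GEMM tile needs fewer bytes
// per flop.
struct HalfComplexInterleaved {
    __half2 *data;   // data[i + j*ld] holds (re, im) of element (i, j)
    int ld;          // leading dimension (>= m)
};

// Planar (split) layout: the option suggested by cuBLASLt/CUTLASS.
// Two separate FP16 planes; real GEMM kernels can run on each plane, but an
// interleaved FP32 matrix must first be split (and later merged back).
struct HalfComplexPlanar {
    __half *real;    // real[i + j*ld]
    __half *imag;    // imag[i + j*ld]
    int ld;
};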

Page 9

Mixed-precision Matrix Multiplication

Goal: a CGEMM kernel with FP16 acceleration

Using cuBLAS (planar layout):
Four calls to cublasGemmEx; arithmetic intensity = mnk / (4mn + k(m+n))
Overhead of splitting and merging the real/imaginary components

Using a new MAGMA kernel (interleaved layout):
One kernel call; arithmetic intensity = 2mnk / (4mn + k(m+n))
No splitting/merging overhead
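A hedged sketch of the planar (cuBLAS) path: four real GEMMs on the split real/imaginary FP16 planes implement C -= A×B with FP32 accumulation. The enum names follow recent cuBLAS releases; on CUDA 10.x the compute type is passed as CUDA_R_32F rather than CUBLAS_COMPUTE_32F.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (FP32, split into Cr/Ci) is updated as C -= A*B, with A and B stored as
// separate real/imaginary FP16 planes (Ar/Ai, Br/Bi):
//   Cr = Cr - Ar*Br + Ai*Bi ,  Ci = Ci - Ar*Bi - Ai*Br .
static void cgemm_fp16_planar(cublasHandle_t h, int m, int n, int k,
                              const __half *Ar, const __half *Ai, int lda,
                              const __half *Br, const __half *Bi, int ldb,
                              float *Cr, float *Ci, int ldc)
{
    const float p1 = 1.0f, m1 = -1.0f;
    auto gemm = [&](const float *alpha, const __half *Ap, const __half *Bp,
                    float *Cp) {
        cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     alpha, Ap, CUDA_R_16F, lda, Bp, CUDA_R_16F, ldb,
                     &p1,   Cp, CUDA_R_32F, ldc,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    };
    gemm(&m1, Ar, Br, Cr);   // Cr -= Ar*Br
    gemm(&p1, Ai, Bi, Cr);   // Cr += Ai*Bi
    gemm(&m1, Ar, Bi, Ci);   // Ci -= Ar*Bi
    gemm(&m1, Ai, Br, Ci);   // Ci -= Ai*Br
}

The splitting of the interleaved FP32 matrices into these planes (and the merge back) is the extra overhead charged to this path.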

[Figures: Roofline for CGEMM-FP16 (compute-only) for rank-256 and rank-512 updates (BW = 847 GB/s). Tflop/s vs. matrix size (M = N; K fixed at 256 / 512), comparing cgemm-fp16-interleaved and cgemm-fp16-planar.]

Page 10

MAGMA’s New Kernel (CGEMM-FP16-Interleaved)

Input: A and B in half-complex precision (half2)
Input/Output: C in single-complex precision

An abstraction layer over the Tensor Cores (TCs):
Using WMMA device routines to manage the TCs
Split and merge in shared memory
Double-sided recursive blocking + auto-tuning

[Figure: double-sided blocking — a BLK_M × BLK_N block of C is handled by a DIM_X × DIM_Y thread configuration for read/write, and by TC_M × TC_N Tensor Core tiles for the compute.]
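As a minimal illustration of the Tensor Core building block (one WMMA tile with FP16 inputs and FP32 accumulation), not the actual MAGMA kernel, which adds the double-sided blocking, the shared-memory split/merge of the half2 data, and auto-tuning:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C (column-major, ld = 16).
// A complex tile product is assembled from four such real tile products:
//   Cr += Ar*Br - Ai*Bi ,  Ci += Ar*Bi + Ai*Br .
__global__ void wmma_tile_fp16_fp32(const __half *A, const __half *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::col_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // FP32 accumulator
    wmma::load_matrix_sync(a, A, 16);        // FP16 inputs
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);          // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_col_major);
}

Launched as wmma_tile_fp16_fp32<<<1, 32>>>(dA, dB, dC) for a single warp.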

Page 11

Performance of CGEMM-FP16

Tested on a Tesla V100-PCIe GPU with CUDA 10.1
Two advantages for MAGMA: higher arithmetic intensity and no splitting/merging overhead
70% better than cuBLAS for k = 256, 24% better for k = 512

[Figures: CGEMM-FP16 performance, Tflop/s vs. matrix size (M = N, ×1000) for K = 256 (left) and K = 512 (right), comparing cgemm-fp16 (magma) and cgemm-fp16 (cublas).]

Page 12

Performance of CGEMM-FP16 (cont.)

Tested on a Tesla V100-PCIe GPU with CUDA 10.1
MAGMA loses its advantage for k ≥ 800
Summary of results:
For blocking sizes ≤ 800, use MAGMA; otherwise, use cuBLAS

[Figure: CGEMM-FP16 performance, Tflop/s vs. matrix size (M = N, ×1000) for K = 1024, comparing cgemm-fp16 (magma) and cgemm-fp16 (cublas).]

Let’s test both!

Page 13

Outline

1 Introduction

2 Matrix Multiplication with Half-complex Precision

3 Mixed-precision Factorization and Solve

4 Final Performance

5 Conclusion

Page 14

Performance of the mixed-precision LU factorization

Panel factorization: CPU (hybrid) vs. GPU (native)?
A larger nb means more time spent in single-complex precision (i.e., in the panel)
Use the hybrid LU only if the matrix is larger than 22k (with MAGMA's CGEMM-FP16)
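Purely as an illustration of these empirical observations on this V100 setup (the thresholds are not a MAGMA API), the dispatch could be written as:

// Hypothetical configuration selector based on the plotted results.
struct MixedLUConfig { bool hybrid_panel; bool magma_cgemm_fp16; int nb; };

MixedLUConfig choose_config(int n, bool use_magma_gemm)
{
    MixedLUConfig cfg;
    cfg.magma_cgemm_fp16 = use_magma_gemm;   // MAGMA kernel wins for nb <= ~800
    cfg.nb = use_magma_gemm ? 512 : 1024;    // best observed blocking sizes
    cfg.hybrid_panel = (n > 22000);          // CPU panel pays off only for large n
    return cfg;
}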

[Figures: Tflop/s vs. matrix size (×1000) for cgetrf-native (left) and cgetrf-hybrid (right), each with magma-cgemm-fp16, cublas-cgemm-fp16, and the full single-complex baseline. Best blocking sizes: nb ≈ 256–512 with the MAGMA kernel, nb ≥ 1024 with cuBLAS.]

Performance of the mixed-precision LU factorization using native (left) and hybrid (right) execution. Results are shown on a 20-core Haswell CPU and a Tesla V100 GPU. MAGMA is built with CUDA 10.1 and MKL 2018.0.1.

Page 15

Classic IR vs. IR + GMRES

GMRES is more stable for solving the correction equation (Ac = r)
GMRES is preconditioned by the low-precision factors of A
Much faster convergence than classic IR
Based on the FGMRES implementation by Yousef Saad (ZITSOL)
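A hedged sketch of the preconditioner application inside IRGMRES; the helpers are hypothetical placeholders (not MAGMA or cuBLAS entry points) for the pivot-apply and triangular-solve kernels that reuse the low-precision LU factors.

#include <cuComplex.h>

// Hypothetical helpers operating on the low-precision factors of A.
void apply_row_pivots(int n, const int *ipiv, const cuComplex *v, cuComplex *z);
void trsv_lower_unit(int n, const void *LU_lowprec, int ld, cuComplex *z);
void trsv_upper(int n, const void *LU_lowprec, int ld, cuComplex *z);

// Each GMRES preconditioner application solves M z = v with M = P^T L U,
// i.e. z = U^{-1} L^{-1} P v, entirely from the factors produced by the
// mixed-precision factorization.
void apply_lu_preconditioner(int n, const void *LU_lowprec, int ld,
                             const int *ipiv, const cuComplex *v, cuComplex *z)
{
    apply_row_pivots(n, ipiv, v, z);        // z = P v
    trsv_lower_unit(n, LU_lowprec, ld, z);  // z = L^{-1} z
    trsv_upper(n, LU_lowprec, ld, z);       // z = U^{-1} z
}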

[Figure: relative error vs. number of iterations (up to 100) for classic IR and IRGMRES.]

Convergence history of both IR and IRGMRES on a matrix of size 20k. κ∞(A) = 10^5. Clustered distribution of singular values (σ_i = 1, 1, ..., 1/κ∞(A)).

Page 16

Outline

1 Introduction

2 Matrix Multiplication with Half-complex Precision

3 Mixed-precision Factorization and Solve

4 Final Performance

5 Conclusion

Page 17

Final Performance: Mixed-precision CGESV-IRGMRES

Performance is up to 2.5×, but depends on the matrix properties

[Figures: Tflop/s vs. matrix size (×1000, up to 32k) for MP cgesv-irgmres vs. classic cgesv; speedups of 1.2×–2.5× (left) and 1.1×–2.4× (right).]

System: 20-core Intel Haswell CPU, Tesla V100 GPU - using MKL 2018.0.1 and CUDA 10.1

Left: diagonally dominant matrices, κ∞(A) ≤ 10^2.
Right: matrices with positive eigenvalues and an arithmetic distribution of singular values (σ_i = 1 − ((i−1)/(n−1))(1 − 1/κ∞(A))), κ∞(A) ≈ 4.3e+5.

Page 18

Final Performance: Mixed-precision CGESV-IRGMRES

Performance is up to 2.5×, but depends on the matrix properties

[Figures: Tflop/s vs. matrix size (×1000, up to 32k) for MP cgesv-irgmres vs. classic cgesv; speedups of 0.9×–2.3× (left) and 1.1×–2.2× (right).]

System: 20-core Intel Haswell CPU, Tesla V100 GPU - using MKL 2018.0.1 and CUDA 10.1

Left: positive eigenvalues and a logarithmically uniform distribution of singular values (log(σ_i) uniform between log(1/κ∞(A)) and log(1)), κ∞(A) ≈ 1e+5.
Right: clustered singular values (σ_i = 1, 1, ..., 1/κ∞(A)), κ∞(A) ≈ 4.3e+4.

Page 19

Sometimes it does not pay off

The refinement steps can consume all the performance advantage of the factorization

[Figure: Tflop/s vs. matrix size (×1000, up to 32k) for MP cgesv-irgmres vs. classic cgesv; speedups of 0.8×–1.1×.]

System: 20-core Intel Haswell CPU, Tesla V100 GPU - using MKL 2018.0.1 and CUDA 10.1

Arithmetic distribution of singular values (σ_i = 1 − ((i−1)/(n−1))(1 − 1/κ∞(A))), κ∞(A) ≈ 4.3e+4.

Page 20

Outline

1 Introduction

2 Matrix Multiplication with Half-complex Precision

3 Mixed-precision Factorization and Solve

4 Final Performance

5 Conclusion

Page 21

Summary

1 FP16 arithmetic is not only for machine learning
2 A new family of mixed-precision linear solvers
3 Half-complex precision accelerates single-complex systems by factors of up to 2.5×
4 Next: double-complex systems, matrix scaling, other half-complex kernels

MAGMA uses a hybridization methodology where algorithms of interest are split into tasks of varying granularity and their execution scheduled over the available hardware components. Scheduling can be static or dynamic. In either case, small non-parallelizable tasks, often on the critical path, are scheduled on the CPU, and larger more parallelizable ones, often Level 3 BLAS, are scheduled on the GPU.

PERFORMANCE & ENERGY EFFICIENCY

[Figure: MAGMA LU factorization in double-precision arithmetic, Gflop/s and Gflops/Watt vs. matrix size, on an NVIDIA K40 (15 MP × 192 @ 0.88 GHz), an NVIDIA Pascal P100 (56 MP × 64 @ 1.19 GHz), an NVIDIA Volta V100 (80 MP × 64 @ 1.37 GHz), and an Intel Xeon E5-2650 v3 Haswell CPU (2 × 10 cores @ 2.30 GHz) with MKL.]

FEATURES AND SUPPORT

Linear system solvers

Eigenvalue problem solvers

Auxiliary BLAS

Batched LA

Sparse LA

CPU/GPU Interface

Multiple precision support

Non-GPU-resident factorizations

Multicore and multi-GPU support

MAGMA Analytics/DNN

LAPACK testing

Linux

Windows

Mac OS

HYBRID ALGORITHMS: MAGMA 2.3 for CUDA, MAGMA MIC 1.4 for Intel Xeon Phi, clMAGMA 1.4 for OpenCL

MAGMA (Matrix Algebra on GPU and Multicore Architectures) is a collection of next-generation linear algebra libraries for heterogeneous architectures. MAGMA is designed and implemented by the team that developed LAPACK and ScaLAPACK, incorporating the latest developments in hybrid synchronization- and communication-avoiding algorithms, as well as dynamic runtime systems. Interfaces for the current LAPACK and BLAS standards are supported to allow computational scientists to seamlessly port any linear algebra reliant software components to heterogeneous architectures. MAGMA allows applications to fully exploit the power of current heterogeneous systems of multi/many-core CPUs and multi-GPUs to deliver the fastest possible time to accurate solution within given energy constraints. FIND OUT MORE AT http://icl.utk.edu/magma

INDUSTRY COLLABORATION

Long-term collaboration and support on the development of MAGMA and of clMAGMA (the OpenCL port of MAGMA), plus an Intel Parallel Computing Center at ICL developing and optimizing numerical linear algebra libraries for heterogeneous Intel Xeon Phi coprocessor-based HPC. Sponsored by the U.S. Department of Defense and the National Science Foundation.


https://icl.utk.edu/magma/

Release expected by Spring 2020

Page 22

Thank You!
