

Page 1:

GPU accelerated Arnoldi solver for small batched matrix

15. 09. 22

Hyung-Jin Kim

Samsung Advanced Institute of Technology

Page 2:

Contents

- Eigenvalue problems

- Solution

- Arnoldi Algorithm

- Target

- CUDA optimization

Page 3:

Eigenvalue Problem

$Ax = \lambda x$

$A \equiv \begin{pmatrix} a_{00} & \cdots & a_{0n} \\ \vdots & \ddots & \vdots \\ a_{n0} & \cdots & a_{nn} \end{pmatrix}$ : an $n \times n$ complex-valued matrix

$x \equiv \begin{pmatrix} x_0 \\ \vdots \\ x_n \end{pmatrix}$ : a vector which satisfies the above linear equation

$\lambda$ : a scalar value which satisfies the above linear equation

Solution of the Schrödinger equation: $H\psi_E = E\psi_E$

Vibration analysis: $m\ddot{x} + kx = 0$

Principal Component Analysis: eigenfaces

Page 4:

How to solve eigenvalue problems?

- Find the roots of the characteristic equation:

$\det(A - \lambda I) = 0$

- For the 2x2 case, with $A \equiv \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ and $I \equiv \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$:

$\det\begin{pmatrix} a - \lambda & b \\ c & d - \lambda \end{pmatrix} = (a - \lambda)(d - \lambda) - bc = 0$

$\lambda = \dfrac{(a + d) \pm \sqrt{(a - d)^2 + 4bc}}{2}$

- There is no closed-form solution for the n roots of a general n-th order polynomial equation (for n ≥ 5, by the Abel-Ruffini theorem)

- Iterative methods: power method, QR, Arnoldi, Lanczos, ...
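For intuition, the simplest of these iterative methods, the power method, just applies A repeatedly and renormalizes; a standard statement of the recurrence (not taken from the slides, shown only to motivate the Krylov-subspace view used by Arnoldi):

```latex
x_{k+1} = \frac{A x_k}{\lVert A x_k \rVert}, \qquad
\lambda^{(k)} = \frac{x_k^{\dagger} A x_k}{x_k^{\dagger} x_k}
\ \longrightarrow\ \lambda_{\max} \quad (k \to \infty)
```

Arnoldi can be read as keeping the whole sequence $x, Ax, A^2x, \ldots$ (orthonormalized) instead of only the last iterate.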

Page 5:

Arnoldi Algorithm

v1 = v / |v|
for k = 1 to m-1 do
    vk+1 = A vk
    for j = 1 to k do
        hjk  = vj† vk+1
        vk+1 = vk+1 - vj hjk
    end for                      /* Gram-Schmidt orthogonalization */
    hk+1,k = |vk+1|
    if hk+1,k = 0 then
        return {v1, ..., vk}     /* is invariant under A */
    end if
    vk+1 = vk+1 / hk+1,k
end for

- The Arnoldi algorithm by itself only produces a k-dimensional orthonormal basis and the Hessenberg matrix H

- If the matrix A is Hermitian, the iteration reduces to a simpler form: the Lanczos algorithm

$H \equiv \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ 0 & a_{21} & a_{22} & a_{23} \\ 0 & 0 & a_{32} & a_{33} \end{pmatrix}$
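The Arnoldi loop above maps naturally onto level-1/level-2 BLAS calls. Below is a minimal sketch of how one Arnoldi step for a single double-complex matrix might look with cuBLAS; the function name arnoldi_step, the buffer layout of d_V, and the host-side Hessenberg column h are illustrative assumptions, not details taken from the slides.

```cuda
#include <cublas_v2.h>
#include <cuComplex.h>

// One Arnoldi step for a single n x n double-complex matrix A (column-major on the
// device). d_V holds the Krylov basis vectors v_1..v_{k+1} contiguously (n elements
// each); h is a host-side column of the Hessenberg matrix (length k+1). Sketch only.
void arnoldi_step(cublasHandle_t handle, int n, int k,
                  const cuDoubleComplex* d_A,   // n x n matrix
                  cuDoubleComplex* d_V,         // n x (k+1) basis vectors
                  cuDoubleComplex* h)           // host buffer, length k+1
{
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cuDoubleComplex* d_vk  = d_V + (size_t)(k - 1) * n;  // v_k
    cuDoubleComplex* d_vk1 = d_V + (size_t)k * n;        // v_{k+1}

    // v_{k+1} = A * v_k
    cublasZgemv(handle, CUBLAS_OP_N, n, n, &one, d_A, n, d_vk, 1, &zero, d_vk1, 1);

    // Gram-Schmidt: h_{jk} = v_j^dagger v_{k+1};  v_{k+1} -= v_j h_{jk}
    // (Zdotc with a host result pointer is blocking in the default pointer mode.)
    for (int j = 0; j < k; ++j) {
        cuDoubleComplex* d_vj = d_V + (size_t)j * n;
        cublasZdotc(handle, n, d_vj, 1, d_vk1, 1, &h[j]);
        cuDoubleComplex minus_h = make_cuDoubleComplex(-cuCreal(h[j]), -cuCimag(h[j]));
        cublasZaxpy(handle, n, &minus_h, d_vj, 1, d_vk1, 1);
    }

    // h_{k+1,k} = |v_{k+1}|; normalize v_{k+1}
    double nrm = 0.0;
    cublasDznrm2(handle, n, d_vk1, 1, &nrm);
    h[k] = make_cuDoubleComplex(nrm, 0.0);
    if (nrm > 0.0) {
        cuDoubleComplex inv = make_cuDoubleComplex(1.0 / nrm, 0.0);
        cublasZscal(handle, n, &inv, d_vk1, 1);
    }
}
```

In the batched setting discussed later, each such step would be issued on its own cudaStream so that steps for different matrices can overlap on the GPU.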

Page 6:

Further works…(1)

- QR algorithm (iterated QR factorizations): finds an upper triangular matrix T and a unitary matrix U s.t. A = UTU* (the Schur decomposition)

Set A0 = A and U0 = I.
for k = 1, 2, ... do
    Qk Rk = Ak-1        /* QR factorization */
    Ak    = Rk Qk
    Uk    = Uk-1 Qk     /* update transformation matrix */
end for
Set T = A∞ and U = U∞.

- In general, QR factorization of a dense matrix is an O(n³) algorithm → computationally expensive!

- QR (Givens) rotation: define a rotation transform G(i, j, θ),

$G(i,j,\theta) = \begin{pmatrix}
1      & \cdots & 0      & \cdots & 0      & \cdots & 0 \\
\vdots & \ddots & \vdots &        & \vdots &        & \vdots \\
0      & \cdots & c      & \cdots & s      & \cdots & 0 \\
\vdots &        & \vdots & \ddots & \vdots &        & \vdots \\
0      & \cdots & -s     & \cdots & c      & \cdots & 0 \\
\vdots &        & \vdots &        & \vdots & \ddots & \vdots \\
0      & \cdots & 0      & \cdots & 0      & \cdots & 1
\end{pmatrix}$ (the c, s entries sit in rows and columns i and j)

$c = \dfrac{x_i}{\sqrt{x_i^2 + x_j^2}}, \qquad s = \dfrac{-x_j}{\sqrt{x_i^2 + x_j^2}}$

Applied to the upper Hessenberg matrix H, successive rotations eliminate the subdiagonal entries one at a time:

$H \equiv \begin{pmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
\xrightarrow{G(0,1,\theta_0)}
\begin{pmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
\xrightarrow{G(1,2,\theta_1)}
\begin{pmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
\xrightarrow{G(2,3,\theta_2)}
\begin{pmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{pmatrix} = R$

- QR factorization of a Hessenberg matrix by Givens rotations is an O(n²) algorithm → computationally cheap!
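To make the O(n²) count concrete, here is a minimal host-side sketch (real arithmetic, column-major storage; the function name and the sign convention used to apply the rotation are my own choices, not from the slides) of one rotation zeroing the subdiagonal entry H(i+1, i). Each rotation touches only two rows, so a full sweep over the n-1 subdiagonal entries costs O(n²).

```cuda
#include <math.h>

// Apply one Givens rotation to rows i and i+1 of a real column-major n x n matrix H
// (leading dimension ldh), zeroing H(i+1, i). Illustrative sketch only.
static void apply_givens(double* H, int n, int ldh, int i)
{
    double xi = H[i     + i * ldh];   // H(i, i)
    double xj = H[(i+1) + i * ldh];   // H(i+1, i), the entry to eliminate
    double r  = sqrt(xi * xi + xj * xj);
    if (r == 0.0) return;             // nothing to rotate
    double c = xi / r;
    double s = -xj / r;               // c, s as defined on the slide

    // Only rows i and i+1, columns i..n-1, change.
    for (int col = i; col < n; ++col) {
        double hi = H[i     + col * ldh];
        double hj = H[(i+1) + col * ldh];
        H[i     + col * ldh] = c * hi - s * hj;
        H[(i+1) + col * ldh] = s * hi + c * hj;   // becomes 0 for col == i
    }
}
```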

Page 7:

Further works…(2)

- The diagonal elements of T are the eigenvalues of A: done!

(eigenvalues are preserved under similarity transforms)

- From $(H - \lambda_i I)\, s_i = 0$ we can calculate the n unknown elements of the i-th eigenvector $s_i$ of the Hessenberg matrix

- The eigenvector of A is then recovered as $x_i = V s_i$, where $V \equiv (v_1 \; v_2 \; \cdots \; v_n)$ is the matrix of Arnoldi basis vectors

- Overall, most of the computations are O(n²), except the $x_i = V s_i$ step (O(n³) when all eigenvectors are required)

Page 8:

Implementation strategy

- Matrix size is less than 1000x1000

- Complex-valued, non-symmetric matrices

- Parallel processing of more than ~100 matrices → simultaneous (concurrent) kernel runs

- The matrices are laid out sequentially in memory → batched data

- The cuBLAS library (ver. > 7.0) is good enough for most of the BLAS computations

- The structural sub-routine calls of the Arnoldi and QR algorithms are issued from the CPU side

- "cudaStream" is a good solution for parallel (simultaneous) kernel launching

[Diagram: current solver - the batched input matrices $(a_{00} \cdots a_{nn})_k, (a_{00} \cdots a_{nn})_{k+1}, \ldots$ are fed to a single GPU kernel one after another, which produces the diagonalized outputs $(d_0 \cdots d_n)_{k-1}, (d_0 \cdots d_n)_{k-2}, \ldots$ sequentially.]

[Diagram: cudaStream-threaded solver - Kernel 0, Kernel 1, Kernel 2 and Kernel 3 each take one input matrix of the batch and produce their own diagonal output concurrently.]

Page 9:

Implementation Detail

cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

kernel1<<<grid, block, 0, stream1>>>(data_1);
kernel2<<<grid, block, 0, stream2>>>(data_2);

※ Nsight profiler view: 8 concurrent kernel launches

- cuBLAS also supports "cudaStream": create streams with cudaStreamCreate() and attach them to the handle with cublasSetStream()

- For <t>gemm() and <t>trsm() operations, a batched mode is natively supported (cublas<t>gemmBatched(), cublas<t>trsmBatched()), as sketched below
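A minimal sketch combining the two mechanisms for the gemm case; the function name, the pointer-array arguments, and the batch size are illustrative assumptions, not taken from the slides:

```cuda
#include <cublas_v2.h>
#include <cuComplex.h>

// Process nbatch small n x n products C_i = A_i * B_i either by issuing one cuBLAS
// call per stream, or with the natively batched gemm interface. Sketch only.
void batched_example(cublasHandle_t handle, int n, int nbatch,
                     cuDoubleComplex* const* hA, cuDoubleComplex* const* hB,
                     cuDoubleComplex* const* hC,        // host arrays of device pointers
                     const cuDoubleComplex** dA, const cuDoubleComplex** dB,
                     cuDoubleComplex** dC,              // device arrays of the same pointers
                     cudaStream_t* streams)             // nbatch streams
{
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    // (a) One stream per matrix: small kernels from different streams can overlap.
    for (int i = 0; i < nbatch; ++i) {
        cublasSetStream(handle, streams[i]);
        cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, hA[i], n, hB[i], n, &zero, hC[i], n);
    }
    cublasSetStream(handle, 0);   // back to the default stream

    // (b) Native batch mode: one call covers all nbatch products. The pointer
    //     arrays dA, dB, dC must themselves reside in device memory.
    cublasZgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &one, dA, n, dB, n, &zero, dC, n, nbatch);
}
```

Note that the batched interface expects the pointer arrays themselves in device memory, which is why the sketch carries both host-side and device-side copies of the same pointers.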

Page 10:

Performance(1)

- Dgemm operation

※Tested on Xeon E5-2680v2, K80 GPUs

              MKL (GF)   Single kernel (GF)   10 stream kernels (GF)   10/Single ratio
  128x128         65             42                    192                   4.5
  256x256        115            262                    532                   2.0

- Dgemv operation

              MKL (GF)   Single kernel (GF)   10 stream kernels (GF)   10/Single ratio
  128x128       1.10           1.14                   6.35                   5.5
  256x256       2.58           4.00                   11.8                   3.0

- Dgemv performance for different numbers of concurrent kernels

              Single kernel (GF)   10 kernels (GF)   100 kernels (GF)   1000 kernels (GF)
  128x128            1.14               6.35               26.8                33.8
  256x256            4.00               11.8               32.5                34.9

Page 11:

Performance(2)

- The full eigenvalue evaluation sequence is still under development

- Only the Arnoldi iterations were tested, solving 10 matrices (preliminary!)

- Intel MKL and MAGMA library timings are given for comparison

※Tested on Xeon E5-2680v2, K80 GPUs

              MKL (sec)   MAGMA (sec)   Optimized solver (sec)
  256x256        1.0          1.1               0.37
  512x512        5.3          5.1               2.1

Page 12:

Potential problems?

- The maximum number of concurrently running cudaStreams is unknown (Help!)

- Fermi GPUs: 16, Kepler GPUs: 32 concurrent kernel runs

- Maxwell GPUs: unknown (Help!)

→ If the matrix size is too small, GPU utilization could still be low

⇒ Would the "per-thread" default-stream option (e.g. compiling with nvcc --default-stream per-thread) be useful?

- The four shaded elements (highlighted on the original slide) are not contiguously aligned in GPU memory

- For the QR (Givens) rotation we do not need to access the full Hessenberg matrix

- A custom data structure for the Hessenberg matrix might be useful!

$\begin{pmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}$

- Eigenvectors are calculated from $(H - \lambda_i I)\, s_i = 0$

- The eigenvector elements are calculated sequentially

- Concurrent eigenvector computation is also needed

$\begin{pmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
\begin{pmatrix} s_{i0} \\ s_{i1} \\ s_{i2} \\ s_{i3} \end{pmatrix} = 0
\quad\rightarrow\quad \times_{43}\, s_{i2} = -\times_{44}\, s_{i3}, \ \ldots$

Backward computation
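One way to realize this backward computation, sketched below, is plain back substitution once $(H - \lambda_i I)$ has been reduced to upper triangular form by the QR iterations; the function name, the use of real arithmetic, and the distinct-eigenvalue assumption are simplifications of mine, not details from the slides.

```cuda
// Given an n x n upper triangular matrix T (column-major, leading dimension ldt)
// whose diagonal entries are the eigenvalues, compute the eigenvector s for the
// i-th eigenvalue lambda = T(i,i) by backward substitution:
//   s[i] = 1,  s[k] = 0 for k > i,
//   s[k] = -( sum_{j=k+1..i} T(k,j) * s[j] ) / ( T(k,k) - lambda )  for k = i-1..0.
static void triangular_eigenvector(const double* T, int n, int ldt, int i, double* s)
{
    const double lambda = T[i + i * ldt];

    for (int k = i + 1; k < n; ++k) s[k] = 0.0;   // trailing components vanish
    s[i] = 1.0;                                    // free normalization choice

    for (int k = i - 1; k >= 0; --k) {
        double acc = 0.0;
        for (int j = k + 1; j <= i; ++j)
            acc += T[k + j * ldt] * s[j];
        s[k] = -acc / (T[k + k * ldt] - lambda);   // assumes distinct eigenvalues
    }
}
```

Each such call is independent of the others, which is what makes the concurrent per-eigenvector (and per-matrix) computation asked for above possible.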

Page 13:

Thank you!

Ref)
Arnoldi Algorithm: http://people.inf.ethz.ch/arbenz/ewp/Lnotes/lsevp2010.pdf
Kernel Streaming: CUDA C Programming Guide, CUDA SDK (concurrentKernels)
CUBLAS: cuBLAS Library documentation, CUDA SDK (batchCUBLAS)