Fine-grained Adoption of Jacobian Matrix Filling in INCOMP3D
July 20, 2015
Lixiang (Eric) Luo, Jack Edwards, Hong Luo
Department of Mechanical and Aerospace Engineering
Frank Mueller
Department of Computer Science
North Carolina State University
Recent Publications
• L. Luo, J. R. Edwards, H. Luo, F. Mueller, "A fine-grained block ILU scheme on regular structures for GPGPUs," Computers & Fluids, Vol. 119, pp. 149-161, 2015.
• L. Luo, J. R. Edwards, H. Luo, F. Mueller, “Fine-grained Optimizations of Implicit Algorithms in An Incompressible Navier-Stokes Solver on GPGPU,” AIAA Aviation and Aeronautics Forum and Exposition, Dallas, TX, 2015.
LHS Filling in INCOMP3D
A 2nd-order FVM with a 7-point stencil results in a block-sparse linear system; the LHS matrix A has seven block diagonals.
[Figure: the 7-point stencil (left) and the block-sparse LHS matrix A (right). Each block row (i,j,k) holds up to seven blocks: α(i,j,k-1), β(i,j-1,k), γ(i-1,j,k), δ(i,j,k) on the diagonal, ε(i+1,j,k), ζ(i,j+1,k), η(i,j,k+1).]
Challenges of LHS Filling in INCOMP3D
• Two primary components in LHS filling
– AFILL: inviscid flux Jacobian
– TSD: viscous flux Jacobian and time-derivative linearization
• LHS filling is heavily memory-bound
– Before optimization: one GPU thread per grid location
– Large amount of data per grid location: for RANS, each block is 6×6, with 3 blocks per grid location per spatial direction
[Figure: wall-clock time for the test case (URANS, 15M grid, 204 blocks, 2000 steps): 2484 s on 24 CPU cores vs. 595 s on 24 M2050 GPUs. Per-component GPU speedups: BILU CORR 21.2X, BILU FACT 17.1X, BILU SOLV 9.3X, RHS 6.7X, TSD 3.8X; remainder is MPI and other.]
Challenges of LHS Filling
FGBILU
• Output = input
• Memory-bound due to the overall data amount
• Homogeneous: fine-grained algorithms do not cause branching

LHS filling
• Output >> input
• Memory-bound due to the output data amount
• Inhomogeneous part: coefficient calculations cause branching
• Homogeneous part: matrix filling is highly homogeneous
Though memory-bound like FGBILU factorization, LHS filling poses unique challenges.
Optimization Strategy for FGBILU
[Diagram: coarse-grained vs. fully fine-grained organization of input data, computation, and output data in FGBILU.]
Two Steps of LHS Filling
• Step 1: calculation of common coefficients
– Inhomogeneous: different coefficients are determined by different mathematical expressions
– To ensure reasonable data locality, this step must be carried out at coarse grain: one thread per grid location
• Step 2: filling of submatrix blocks
– Highly homogeneous
– All elements are calculated from the common coefficients and geometry data
– This step can be carried out at fine grain
A Fully Fine-grained Scheme Is Not Suitable
[Diagram: coarse-grained vs. fully fine-grained pipelines from input data through computation to output data. The fully fine-grained version suffers too much branching in Step 1; both remain memory-bound.]
2-step Mixed-grained Approach

Ideally, granularity would change within one kernel.
– Dynamic Parallelism attempts to address this, but it is probably not efficient for LHS filling: too few child threads per grid location.

A two-step approach:
[Diagram: coarse-grained Step 1 computes the common coefficients from the input data; fine-grained Step 2 reads the coefficients in parallel (no read bottleneck) and fills the output data.]
Further Optimization Techniques
• Avoid unrolled private arrays
– Instead, use existing global arrays to store intermediate results
• Merge spatial directions
– Increases concurrent work by three times
– Improves data locality by reusing shared data within a grid location
– The odd-even coloring scheme is no longer necessary
• Replace long branches with short branches
– Short branches may be compiled into predicated operations on the GPU, which do not incur a branching penalty
• Replace mathematical branches with logical coefficients
– Avoids branching
– May also be compiled into predicated operations
Preliminary Results
• The new strategy significantly improves the performance of the LHS filling subroutines
– AFILL reaches a 14.5X speedup and TSD a 6.3X speedup (TSD is less memory-bound to begin with)
– Block sizes are small in this test case, so the speedup numbers are far from optimal
• Data transfer (not data packing) is now the bottleneck
[Figure: wall-clock time for the test case (URANS, 3M grid, 128 blocks, 200 steps): 1387 s on 6 CPU cores vs. 191.7 s on 6 M2050 GPUs. Per-component GPU speedups: BILU FACT 16.8X, AFILL 14.5X, BILU SOLV 12.2X, BILU CORR 9.2X, Data Packing 7.1X, TSD 6.3X, RHS 6.1X; remainder is MPI and other.]
Upcoming Tasks
• High-order extension of the RHS
– By adopting multi-step computation with intermediate storage, data dependencies can be avoided. Since the coloring scheme is no longer needed, high-order schemes become much more tractable.
• Improve performance of the L2-norm calculation
– A classic sum reduction; currently consumes 45% of the RHS run time
• Improve MPI data transfer performance
– Masking: overlap MPI transfers with computation
• Use CPU memory to store the LHS matrices
– Could potentially allow many more blocks per GPU
– Masking: overlap GPU-CPU transfers with kernel executions
• Run large-scale simulations with INCOMP3D