Fine-grained Adoption of Jacobian Matrix Filling in INCOMP3D
July 20, 2015
Lixiang (Eric) Luo, Jack Edwards, Hong Luo
Department of Mechanical and Aerospace Engineering
Frank Mueller
Department of Computer Science
North Carolina State University
Recent Publications
• L. Luo, J. R. Edwards, H. Luo, F. Mueller, "A fine-grained block ILU scheme on regular structures for GPGPUs," Computers & Fluids, Vol. 119, pp. 149-161, 2015.
• L. Luo, J. R. Edwards, H. Luo, F. Mueller, “Fine-grained Optimizations of Implicit Algorithms in An Incompressible Navier-Stokes Solver on GPGPU,” AIAA Aviation and Aeronautics Forum and Exposition, Dallas, TX, 2015.
LHS Filling in INCOMP3D
A 2nd-order FVM with a 7-point stencil results in a block-sparse linear system; the LHS matrix A has seven block diagonals.
[Figure: the 7-point stencil (left) and the block-sparse LHS matrix A (right). Each block row (i,j,k) holds up to seven blocks: α(i,j,k-1), β(i,j-1,k), γ(i-1,j,k), δ(i,j,k) on the diagonal, ε(i+1,j,k), ζ(i,j+1,k), η(i,j,k+1).]
Challenges of LHS Filling in INCOMP3D
• Two primary components in LHS filling
– AFILL: inviscid flux Jacobian
– TSD: viscous flux Jacobian and time-derivative linearization
• LHS filling is heavily memory-bound
– Before optimization: one GPU thread per grid location
– Large amount of data per grid location: for RANS, each block is 6×6, with 3 blocks per grid location per spatial direction
[Figure: wall-clock time for the test case (URANS, 15M grid, 204 blocks, 2000 steps): 2484 s on 24 CPU cores vs. 595 s on 24 M2050 GPUs. Per-component GPU speedups: BILU CORR 21.2X, BILU FACT 17.1X, BILU SOLV 9.3X, RHS 6.7X, TSD 3.8X; remainder is MPI and other.]
Challenges of LHS Filling
FGBILU
• Output = input
• Memory-bound due to the overall data amount
• Homogeneous: fine-grained algorithms do not cause branching

LHS filling
• Output >> input
• Memory-bound due to the output data amount
• Inhomogeneous part: coefficient calculations cause branching
• Homogeneous part: matrix filling is highly homogeneous
Though memory-bound like FGBILU factorization, LHS filling poses unique challenges.
Optimization Strategy for FGBILU
[Diagram: coarse-grained vs. fully fine-grained organization of input data, computation, and output data in FGBILU.]
Two Steps of LHS Filling
• Step 1: calculation of common coefficients
– Inhomogeneous: different coefficients are determined by different mathematical expressions
– To ensure reasonable data locality, this step must be carried out at coarse grain: one thread per grid location
• Step 2: filling of submatrix blocks
– Highly homogeneous
– All elements are calculated from the common coefficients and geometry data
– This step can be carried out at fine grain
A Fully Fine-grained Scheme Is Not Suitable
[Diagram: coarse-grained vs. fully fine-grained pipelines from input data through computation to output data. The fully fine-grained version suffers too much branching in Step 1; both remain memory-bound.]
2-step Mixed-grained Approach

Ideally, granularity would change within one kernel.
– Dynamic Parallelism attempts to address this, but it is probably not efficient for LHS filling: too few child threads per grid location.

A two-step approach:
[Diagram: coarse-grained Step 1 computes the common coefficients from the input data; fine-grained Step 2 reads the coefficients in parallel (no read bottleneck) and fills the output data.]
Further Optimization Techniques
• Avoid unrolled private arrays
– Instead, use existing global arrays to store intermediate results
• Merge spatial directions
– Increases concurrent work by three times
– Improves data locality by reusing shared data within a grid location
– The odd-even coloring scheme is no longer necessary
• Replace long branches with short branches
– Short branches may be compiled into predicated operations on the GPU, which do not incur a branching penalty
• Replace mathematical branches with logical coefficients
– Avoids branching
– May also be compiled into predicated operations
Preliminary Results
• The new strategy significantly improves the performance of the LHS filling subroutines
– AFILL reaches a 14.5X speedup and TSD a 6.3X speedup (TSD is less memory-bound to begin with)
– Block sizes are small in this test case, so the speedup numbers are far from optimal
• Data transfer (not data packing) is now the bottleneck
[Figure: wall-clock time for the test case (URANS, 3M grid, 128 blocks, 200 steps): 1387 s on 6 CPU cores vs. 191.7 s on 6 M2050 GPUs. Per-component GPU speedups: BILU FACT 16.8X, AFILL 14.5X, BILU SOLV 12.2X, BILU CORR 9.2X, Data Packing 7.1X, TSD 6.3X, RHS 6.1X; remainder is MPI and other.]
Upcoming Tasks
• High-order extension of the RHS
– By adopting multi-step computation with intermediate storage, data dependencies can be avoided. Since the coloring scheme is no longer needed, high-order schemes become much more tractable.
• Improve performance of the L2-norm calculation
– A classic sum reduction; currently consumes 45% of the RHS run time
• Improve MPI data transfer performance
– Masking: overlap MPI transfers with computation
• Use CPU memory to store the LHS matrices
– Could potentially allow many more blocks per GPU
– Masking: overlap GPU-CPU transfers with kernel executions
• Run large-scale simulations with INCOMP3D