Implementation of Deflated Preconditioned Conjugate Gradient
on the GPU
Rohit Gupta
Responsible Professor: Prof. Dr. Ir. Henk Sips
Supervisors: Prof. Dr. Ir. Kees Vuik and Ir. C.W.J. Lemmens
Two-Phase Fluid Flow
Model for Two Phase Computation
Boundary
Conditions
4 -1
-1 4 -1
-1 4 -1
-1 4 0
0 4 -1
-1 4 -1
-1 4 -1
-1 4
3 0
0 1.5015 -0.5005
-0.5005 2.002 -0.5005
-0.5005 2.002 -0.5005
-0.5005 2.002 -0.5005
-0.5005 2.002 -0.5005
-0.5005 2.002 -0.5005
-0.5005 2.002
GPU Architecture
Memory Bandwidth
Launch Configuration
•Coalescing
•Minimum Memory Transfers
•Minimum Divergence
•Caching
Key Optimizations
•SpMV Kernel
•Preconditioning Kernels
•Deflation Kernels
Related Work
Two-Phase Fluid Flow
•SpMV Kernel
•Preconditioning Kernels
•Deflation Kernels
Implementation GPU
•Meschach/ gotoBLAS
•Compiler Options
•Same Data Structures
Implementation CPU
•Conjugate Gradient (CG)
•CG with Preconditioning
•CG with Preconditioning and
Deflation
Flavors
•Diagonal
•Incomplete Cholesky
•Incomplete Poisson
Preconditioning
•Important Operations
•Subdomains
•Deflation Kernels
Deflation
•Step-by-Step Optimizations
•Profiler for Advice
•Modularity and Readability
Methodology
•DIA Format is best
•SpmV dominate GPU execution
•
Conjugate Gradient - Vanilla
5
2
23 10||||
||||10
k
kexact
X
XX
•Preconditioning dominates execution
•More Blocks – Better SpeedUp
•Shared Memory Use restricted
Preconditioning
Block Incomplete Cholesky
•Speed Up recovers
•Preconditioning gets company in
•More Deflation Vectors – More Speed Up
Deflation
bEAZ 1
•Storage of optimized
•Speed Up Suffers – CPU performs better
•Optimizations on CPU
Deflated Preconditioned
Conjugate Gradient
AZ
•Incomplete Poisson Preconditioning
•Speed Up favoring move
•
Deflated Preconditioned
Conjugate Gradient
2
2
||||
||||
k
kexact
X
XX
2
2
||||
||||
k
kexact
X
XX
•Matrix Vector Multiplication optimized
•Kernels utilize 85% of bandwidth
•Clubbing Kernels – Reducing Memory Traffic
Deflated Preconditioned
Conjugate Gradient
bE 1
•Density Contrast 1000:1
•
•False Stop
Two Phase Matrix
1
2
2 10||||
||||0
k
kexact
X
XX
Performance Timeline
0
5
10
15
20
25
30Sp
ee
d U
p
Iterative OptimizationsSpeedUp Factors
•Very close to possible peak performance
•Bandwidth bound Kernels
•Platform Utilization – Startling Facts
Analysis
•Multi-GPU Multi-CPU
•3D Domains, More Interfaces, Mixed Precision
•Different Preconditioners, Grid Types.
Future Work
Conclusions
•Deflation suits the many core platform
•Two Phase accuracy suffers
•Deflation with IP Preconditioning
Conclusions