Download pdf - Implementation Of Deflated Preconditioned Conjugate ...ta.twi.tudelft.nl/nw/users/vuik/numanal/gupta_presentation2.pdf · Implementation of Deflated Preconditioned Conjugate Gradient

Implementation of Deflated Preconditioned Conjugate Gradient

on the GPU

Rohit Gupta

Responsible Professor: Prof. Dr. Ir. Henk Sips

Supervisors: Prof. Dr. Ir. Kees Vuik and Ir. C.W.J. Lemmens

Two-Phase Fluid Flow

Model for Two Phase Computation

Boundary

Conditions

4 -1

-1 4 -1

-1 4 -1

-1 4 0

0 4 -1

-1 4 -1

-1 4 -1

-1 4

3 0

0 1.5015 -0.5005

-0.5005 2.002 -0.5005

-0.5005 2.002 -0.5005

-0.5005 2.002 -0.5005

-0.5005 2.002 -0.5005

-0.5005 2.002 -0.5005

-0.5005 2.002

GPU Architecture

Memory Bandwidth

Launch Configuration

•Coalescing

•Minimum Memory Transfers

•Minimum Divergence

•Caching

Key Optimizations

•SpMV Kernel

•Preconditioning Kernels

•Deflation Kernels

Related Work

Two-Phase Fluid Flow

•SpMV Kernel

•Preconditioning Kernels


Implementation GPU

•Meschach/ gotoBLAS

•Compiler Options

•Same Data Structures

Implementation CPU

•Conjugate Gradient (CG)

•CG with Preconditioning

•CG with Preconditioning and

Deflation

Flavors

•Diagonal

•Incomplete Cholesky

•Incomplete Poisson

Preconditioning

•Important Operations

•Subdomains


Deflation

•Step-by-Step Optimizations

•Profiler for Advice

•Modularity and Readability

Methodology

•DIA Format is best

•SpmV dominate GPU execution

•

Conjugate Gradient - Vanilla

5

2

23 10||||

||||10

k

kexact

X

XX

•Preconditioning dominates execution

•More Blocks – Better SpeedUp

•Shared Memory Use restricted

Preconditioning

Block Incomplete Cholesky

•Speed Up recovers

•Preconditioning gets company in

•More Deflation Vectors – More Speed Up

Deflation

bEAZ 1

•Storage of optimized

•Speed Up Suffers – CPU performs better

•Optimizations on CPU

Deflated Preconditioned

Conjugate Gradient

AZ

•Incomplete Poisson Preconditioning

•Speed Up favoring move

•


Conjugate Gradient

2

2

||||

||||

k

kexact

X

XX

2

2

||||

||||

k

kexact

X

XX

•Matrix Vector Multiplication optimized

•Kernels utilize 85% of bandwidth

•Clubbing Kernels – Reducing Memory Traffic


Conjugate Gradient

bE 1

•Density Contrast 1000:1

•

•False Stop

Two Phase Matrix

1

2

2 10||||

||||0

k

kexact

X

XX

Performance Timeline

0

5

10

15

20

25

30Sp

ee

d U

p

Iterative OptimizationsSpeedUp Factors

•Very close to possible peak performance

•Bandwidth bound Kernels

•Platform Utilization – Startling Facts

Analysis

•Multi-GPU Multi-CPU

•3D Domains, More Interfaces, Mixed Precision

•Different Preconditioners, Grid Types.

Future Work

Conclusions

•Deflation suits the many core platform

•Two Phase accuracy suffers

•Deflation with IP Preconditioning

Conclusions