Abdulrahman Manea, PhD Student
Hamdi Tchelepi, Associate Professor; Co-Director, Center for Computational Earth and Environmental Science
Energy Resources Engineering Department
School of Earth Sciences, Stanford University
Introduction
Background
2D Black Box Geometric MG (GMG)
3D Semicoarsening Multigrid
Future Work
Reservoir Simulation (Black Oil):
Mass conservation of component α:
$$\frac{\partial}{\partial t}\left(\phi \rho_\alpha S_\alpha\right) + \nabla \cdot \left(\rho_\alpha \mathbf{u}_\alpha\right) = q_\alpha$$
Incompressible: the densities $\rho_\alpha$ are constant.
Total balance: $\nabla \cdot \mathbf{u}_t = q_t$
Incompressible pressure equation:
$$\nabla \cdot \left(\lambda_t \nabla p\right) = q_t$$

The solver is the most computationally expensive component.
Unknowns have varying nature: pressure (elliptic) vs. saturation (hyperbolic).
Multistage preconditioning scheme: Constrained Pressure Residual (CPR)*. CPR with multigrid as the first stage is a very robust and widely used scheme.

* Wallis, J.R., et al., SPE 13536 (1985)
Aramco’s GigaPOWERS
Objective
Design and implement a massively parallel reservoir simulation multigrid solver on GPU architectures.

Plan
1. Implement an optimized serial version of multigrid to establish a reasonable serial performance baseline.
2. Design and implement a parallel version of multigrid that harnesses the power of massively parallel GPU architectures.
Background
The discretized equation is
$$A_f x_f = b_f$$

Basic 2-Level Multigrid Algorithm* (3 steps):

1. The pre-smoothing step:
$$x_f \leftarrow \mathrm{smooth}(A_f, b_f, x_0, \nu_1)$$

2. The coarse-grid correction step:
$$r_f = b_f - A_f x_f, \quad r_c = I_f^c r_f, \quad e_c = A_c^{-1} r_c, \quad e_f = I_c^f e_c, \quad x_f = x_f + e_f$$

3. The post-smoothing step:
$$x_f \leftarrow \mathrm{smooth}(A_f, b_f, x_f, \nu_2)$$

* Brandt, A. (1977)
[Figure: two-grid cycle schematic, with pre-smoothing and post-smoothing on the fine grid and the problem solved on the coarse grid]
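For concreteness, here is a minimal C++ sketch of the two-level cycle above. The `Level` interface and its member names are hypothetical placeholders for whatever stencil code supplies the operators; this is a structural sketch, not the presented solver:

```cpp
#include <vector>

using Vec = std::vector<double>;

// Hypothetical operator interface; a concrete stencil code would supply these.
// Declarations only -- a structural sketch of the cycle, not a full solver.
struct Level {
    int nf, nc;                                       // fine / coarse sizes
    void apply(const Vec& x, Vec& Ax) const;          // Ax = A_f * x
    void smooth(const Vec& b, Vec& x, int nu) const;  // nu smoothing sweeps
    void restrict_to_coarse(const Vec& rf, Vec& rc) const;  // r_c = I_f^c r_f
    void prolong_to_fine(const Vec& ec, Vec& ef) const;     // e_f = I_c^f e_c
    void coarse_solve(const Vec& rc, Vec& ec) const;        // e_c = A_c^{-1} r_c
};

// One two-level cycle, mirroring the three steps above.
void two_level_cycle(const Level& L, const Vec& b, Vec& x, int nu1, int nu2) {
    L.smooth(b, x, nu1);                              // 1. pre-smoothing

    Vec Ax(L.nf), rf(L.nf), rc(L.nc), ec(L.nc), ef(L.nf);
    L.apply(x, Ax);                                   // 2. coarse-grid correction
    for (int i = 0; i < L.nf; ++i) rf[i] = b[i] - Ax[i];   // r_f = b_f - A_f x_f
    L.restrict_to_coarse(rf, rc);
    L.coarse_solve(rc, ec);
    L.prolong_to_fine(ec, ef);
    for (int i = 0; i < L.nf; ++i) x[i] += ef[i];          // x_f += e_f

    L.smooth(b, x, nu2);                              // 3. post-smoothing
}
```

A multilevel V-cycle follows by replacing `coarse_solve` with a recursive call on the next coarser level.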
[Figure: fine-grid (i,j) and coarse-grid (I,J) stencils, with the nine transmissibilities $T_{i,j}^{nw}, T_{i,j}^{n}, T_{i,j}^{ne}, T_{i,j}^{w}, T_{i,j}^{e}, T_{i,j}^{sw}, T_{i,j}^{s}, T_{i,j}^{se}$ of the 9-point stencil]
The weights of the prolongation and restriction operators depend on the discontinuous coefficients of the PDE:
$$\nabla \cdot (\lambda \nabla p) = q$$
* Alcouffe, R.E., et al. (1981)
Coarse-grid operator, two options:
Manual: explicit handling of the PDE on each coarser level.
Automatic (Black Box Multigrid): using the grid transfer operators:
$$A_c = I_f^c A_f I_c^f = (I_c^f)^T A_f I_c^f$$
▪ No information about the coarser grid is needed
▪ Used in algebraic multigrid
▪ Preserves operator symmetry
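As an illustration of the Galerkin product above, here it is written out for small dense matrices (a sketch for clarity only; an actual Black Box MG setup phase computes this as a sparse triple product):

```cpp
#include <vector>

// Dense Galerkin triple product A_c = P^T * A_f * P, where P = I_c^f (nf x nc).
// Illustrative only; real setup stages keep every operator sparse.
void galerkin_product(const std::vector<double>& Af,  // nf x nf, row-major
                      const std::vector<double>& P,   // nf x nc, row-major
                      std::vector<double>& Ac,        // nc x nc, row-major (out)
                      int nf, int nc) {
    std::vector<double> AfP(nf * nc, 0.0);            // temp = A_f * P
    for (int i = 0; i < nf; ++i)
        for (int k = 0; k < nf; ++k)
            for (int j = 0; j < nc; ++j)
                AfP[i * nc + j] += Af[i * nf + k] * P[k * nc + j];

    Ac.assign(nc * nc, 0.0);                          // A_c = P^T * temp
    for (int i = 0; i < nc; ++i)
        for (int k = 0; k < nf; ++k)
            for (int j = 0; j < nc; ++j)
                Ac[i * nc + j] += P[k * nc + i] * AfP[k * nc + j];
}
```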
In Black Box Multigrid, two stages:
Setup stage: the interpolation, restriction, and coarse-grid operators are calculated.
Solution stage: carrying out the cycling process.

For anisotropic PDE coefficients:
Line relaxation (2D), plane relaxation (3D)
Semicoarsening

* Dendy, J.E. (1982), (1986); Schaffer, S. (1998)
To handle anisotropies in all three dimensions (x, y, z):
Alternating plane relaxation (too expensive)
Semicoarsening with plane relaxation (cheaper)
▪ One plane solve, and semicoarsening in the dimension orthogonal to that plane

When the semicoarsening approach is used with exact grid transfer operators, MG becomes a direct solver (i.e., a Schur complement). However, the exact grid transfer operators are not sparse.

A more efficient way is to approximate the exact grid transfer operators with a sparse (block-diagonal) operator. A 2D MG is used to define the components of the operator between every two planes (details can be found in *).

* Schaffer, S. (1998)
2D Black Box Geometric MG (GMG)
We need a multigrid solver capable of handling highly heterogeneous and anisotropic structured 2D reservoirs, thus:
2D Black Box Multigrid, with
Alternating line relaxation

Testing the solver's convergence behavior:
Test the convergence ratio for the same problem with varying sizes (using grid refinement)
Compare the performance with well-established and widely used multigrid solvers, e.g.:
▪ SAMG: algebraic multigrid solver from the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI)
▪ MGD9V, etc.

Test models:
▪ Geostatistically generated using the Stanford Geostatistical Modeling Software (SGeMS)
▪ Derived from the SPE10 Comparative Solution Project model
▪ Large permeability variations of 8–12 orders of magnitude
[Figure: SPE10, Layer 70]
$$\text{Residual Reduction Factor} = \frac{\|r_{k+1}\|_2}{\|r_k\|_2}$$
[Figure: residual reduction factors for the ¼-million-cell and the 1-million-cell problems]
Computational time comparison (SPE10 Layer 85, refined to 1 million cells):
• GMG: ~4.5 sec
• SAMG: ~7.0 sec
Parallelization of every component of the algorithm:
Both the setup stage and the solution stage
Does not sacrifice algorithmic scalability (convergence rate)

Smoother:
Alternating zebra-line relaxation
Effectively handles anisotropies

Coarsest solve:
4-color GS relaxation, to handle 9-point stencils (see the sketch below)
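A sketch of how one 4-color Gauss-Seidel sweep might look in CUDA (the band layout and kernel are illustrative assumptions, not the authors' code). With a 9-point stencil, two colors are not enough because diagonal neighbors couple; the four (i mod 2, j mod 2) classes, however, are mutually independent and can each be updated in parallel:

```cpp
// One color of a 4-color Gauss-Seidel sweep for a 9-point stencil on an
// nx-by-ny grid. Points sharing the (i%2, j%2) parity pair never appear in
// each other's stencils, so every thread of one color may update in parallel.
__global__ void gs_color_sweep(const double* __restrict__ st, // 9 bands, each nx*ny
                               const double* __restrict__ b,
                               double* x, int nx, int ny,
                               int ci, int cj)   // color: i%2 == ci, j%2 == cj
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + ci;
    int j = 2 * (blockIdx.y * blockDim.y + threadIdx.y) + cj;
    if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1) return; // interior only
    int id = j * nx + i, n = nx * ny;
    // Hypothetical band order 0..8: sw, s, se, w, center, e, nw, n, ne.
    double sum = b[id]
        - st[0 * n + id] * x[id - nx - 1] - st[1 * n + id] * x[id - nx]
        - st[2 * n + id] * x[id - nx + 1] - st[3 * n + id] * x[id - 1]
        - st[5 * n + id] * x[id + 1]      - st[6 * n + id] * x[id + nx - 1]
        - st[7 * n + id] * x[id + nx]     - st[8 * n + id] * x[id + nx + 1];
    x[id] = sum / st[4 * n + id];
}
// Host side: launch this kernel four times per sweep, once for each
// (ci, cj) in {0,1} x {0,1}.
```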
Shared-memory parallelization: OpenMP
Coarse threads, hence coarse-scale parallelization
Multiple cells (multiple lines) per thread

Sparse matrix format:
CSR, for cache coherence

Tridiagonal solver: Thomas algorithm
Serial within each line (i.e., thread), but several lines are handled in parallel (zebra coloring); see the sketch below

Architecture: 12 Intel Xeon X5650 2.66 GHz cores with 48 GB memory
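A minimal sketch of the Thomas algorithm under zebra coloring with OpenMP. The storage layout (one contiguous x-line per grid row) is an assumption for illustration; in the actual smoother, the right-hand side of each line would also carry the stencil coupling to the two neighboring lines, which is omitted here:

```cpp
#include <vector>
#include <omp.h>

// Thomas algorithm on one tridiagonal line: a = sub-, d = main, c = super-
// diagonal; the solution overwrites rhs. cp/dp are length-n scratch arrays.
static void thomas(const double* a, const double* d, const double* c,
                   double* rhs, double* cp, double* dp, int n) {
    cp[0] = c[0] / d[0];
    dp[0] = rhs[0] / d[0];
    for (int i = 1; i < n; ++i) {                 // forward elimination
        double m = d[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / m;
        dp[i] = (rhs[i] - a[i] * dp[i - 1]) / m;
    }
    rhs[n - 1] = dp[n - 1];
    for (int i = n - 2; i >= 0; --i)              // back substitution
        rhs[i] = dp[i] - cp[i] * rhs[i + 1];
}

// Zebra x-line relaxation: even-numbered lines first, then odd-numbered ones.
// Lines of one color are independent, so OpenMP spreads them over the cores.
void zebra_x_relax(const double* a, const double* d, const double* c,
                   double* rhs, int nx, int ny) {
    for (int color = 0; color < 2; ++color) {
        #pragma omp parallel
        {
            std::vector<double> cp(nx), dp(nx);   // per-thread scratch
            #pragma omp for schedule(static)
            for (int j = color; j < ny; j += 2) { // lines of one zebra color
                int off = j * nx;                 // line j is contiguous in x
                thomas(a + off, d + off, c + off, rhs + off,
                       cp.data(), dp.data(), nx);
            }
        }
    }
}
```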
Fine threads:
Fine-scale parallelization
Single cell per thread

Sparse matrix format:
Diagonal, with column-major ordering; ideal for structured problems (see the SpMV sketch below)
▪ Coalesces memory accesses
▪ Minimizes storage requirements
▪ Exploits the banded matrix structure for efficient data access

Minimize expensive communication with the host:
Fit the whole problem on the GPU (up to 16M cells in double precision)
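A sketch of a sparse matrix-vector product over this diagonal (DIA) layout; the kernel and names are illustrative, not taken from the presentation. Storing each band contiguously in row order means that thread `row` and thread `row+1` touch adjacent addresses, which is exactly what coalescing requires:

```cpp
// SpMV y = A*x for a banded matrix in diagonal (DIA) format. Band k has a
// constant offset off[k]; element (row, row + off[k]) is stored at
// data[k * n + row] (column-major over bands), so consecutive threads read
// consecutive addresses -- coalesced on structured-grid matrices.
__global__ void dia_spmv(const double* __restrict__ data,
                         const int* __restrict__ off,
                         const double* __restrict__ x,
                         double* y, int n, int nbands)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double sum = 0.0;
    for (int k = 0; k < nbands; ++k) {
        int col = row + off[k];
        if (col >= 0 && col < n)
            sum += data[k * n + row] * x[col];   // coalesced in 'row'
    }
    y[row] = sum;
}
```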
Tridiagonal solver: parallel cyclic reduction (PCR) in batch* (a batched-solve sketch follows below), to exploit:
▪ fine-scale parallelism within each line
▪ coarse-scale parallelism exposed by the zebra ordering of lines

Threads operate in two stages: a preparation stage and a solution stage.

For coalescing memory accesses during the preparation stage (note: grid points are numbered along the x-direction):
▪ In x-line relaxation: each x-line is assigned to a block of threads
▪ In y-line relaxation: points with the same x-coordinate are assigned to a block of threads

* Using the NVIDIA cuSPARSE library (https://developer.nvidia.com/cusparse)
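A hedged sketch of a batched tridiagonal solve through cuSPARSE. The presentation only cites the library; this sketch assumes the `gtsv2StridedBatch` routine from recent cuSPARSE releases, which solves many independent systems (one per zebra line) in a single call. Error checking is omitted for brevity:

```cpp
#include <cusparse.h>
#include <cuda_runtime.h>

// Solve 'batch' independent tridiagonal systems of size m, stored back to back
// with stride m (dl, d, du, x are device arrays of length batch * m; the first
// entry of each dl and the last of each du must be zero). Sketch only.
void batched_tridiag_solve(double* dl, double* d, double* du, double* x,
                           int m, int batch)
{
    cusparseHandle_t handle;
    cusparseCreate(&handle);

    size_t bytes = 0;                      // query workspace size first
    cusparseDgtsv2StridedBatch_bufferSizeExt(handle, m, dl, d, du, x,
                                             batch, m, &bytes);
    void* buf = nullptr;
    cudaMalloc(&buf, bytes);

    // One call solves every line of the current zebra color concurrently,
    // exposing parallelism both within and across the lines.
    cusparseDgtsv2StridedBatch(handle, m, dl, d, du, x, batch, m, buf);

    cudaFree(buf);
    cusparseDestroy(handle);
}
```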
[Figure: a 5×5 grid numbered row by row along the x-direction; reading along an x-line is coalesced, while reading down a y-line is non-coalesced]
Criteria | Multicore | GPU
Architecture specs | 12 Intel Xeon X5650 2.66 GHz cores with 48 GB memory | Nvidia Fermi-based C2070 with 448 CUDA cores and 6 GB memory
Matrix structure | CSR format, for cache coherence | Diagonal format with column-major ordering (for coalescing memory accesses)
Parallelization API | OpenMP | CUDA
Parallelization granularity | Multiple cells per thread (coarse) | One cell per thread (fine)
Tridiagonal solver algorithm (for line relaxation) | Thomas algorithm (serial within each line, but multiple lines are handled in parallel by zebra coloring) | Parallel cyclic reduction in batch (parallel within each line, and multiple lines are handled in parallel as well)
Homogeneous permeability case:
Solved with just one V(0,1) cycle
▪ Residual reduction by $10^9$
Focuses on the scalability of the setup stage

Heterogeneous permeability case:
Derived from the SPE10 85th layer by grid refinement
Solved with six V(0,1) cycles
▪ Residual reduction by $10^9$
Focuses on the scalability of the solution stage

Problem sizes: 1 million, 4 million, and 16 million cells
3D Semicoarsening Multigrid
In reservoir simulation, the z-direction exhibits:
Huge variations due to natural deposition
Severe anisotropy compared to the x/y directions
▪ An effect of discretization (pancake models)

Hence: semicoarsening in the z-direction, and plane relaxation in the x-y plane.

We can use 2D MG for both:
Setup stage: construction of the grid transfer operators
Solution stage: x-y plane relaxation
Parallelize the plane-solve kernel in both:
Setup stage: construction of the grid transfer operators
▪ Five V(0,1) cycles per plane for approximating an "exact solve"
Solution stage: red/black plane relaxation
▪ One V(0,1) cycle per plane for doing plane relaxation

Note that those 2D V(0,1) cycles are already parallelized (using the 2D GMG algorithm explained earlier).

Other kernels are amenable to parallelization on the GPU but are not tackled yet (work in progress).
[Figure: 3D grid semicoarsened in the z-direction, with 2D MG used for each plane solve]
Implementation:
▪ CPU: use OpenMP threads to distribute the plane solves across multiple cores
▪ GPU: use CUDA with OpenMP to distribute the plane solves across multiple GPUs (see the sketch after this list)

Platform:
▪ CPU: 24 Intel Xeon X5660 cores @ 2.80 GHz with HT and 180 GB memory
▪ GPU: 6 Nvidia Fermi-based M2090s

Test cases:
Homogeneous and heterogeneous (SPE10) with various sizes

Results:
Average time for the plane solves, for both the setup and solution stages
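A minimal sketch of the CUDA-plus-OpenMP distribution of plane solves, assuming one host thread per device. `solve_plane_on_device` is a hypothetical wrapper around the 2D GMG plane solve described earlier, not part of the presented code:

```cpp
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical per-plane solve: runs a 2D GMG V(0,1) cycle for plane p on the
// currently selected device. Stands in for the 2D solver described earlier.
void solve_plane_on_device(int p);

// Distribute the independent plane solves of one red/black color across all
// available GPUs, with one OpenMP host thread driving each device.
void solve_planes_multi_gpu(int num_planes, int color)
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    #pragma omp parallel num_threads(num_gpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                 // bind this host thread to one GPU
        // Planes of the same color do not couple, so any static partition
        // works; here the planes are dealt out round-robin by color index.
        int k = 0;
        for (int p = color; p < num_planes; p += 2, ++k)
            if (k % num_gpus == dev)
                solve_plane_on_device(p);
        cudaDeviceSynchronize();            // each thread waits for its device
    }
}
```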
[Figure: plane-solve speedup on 24 cores and on 1–6 GPUs, for grids of 16K×129 (~2M cells), 66K×33 (~2M cells), 1M×17 (~18M cells), and 4M×17 (~71M cells)]
[Figure: second plane-solve speedup chart, same configurations (24 cores, 1–6 GPUs) and the same four grid sizes]
Planes need to be sufficiently large (>1 million cells) for a noticeable advantage.
This is good for reservoir simulation, as grid refinement studies are usually made by refining the horizontal planes.
Beyond 2-3 GPUs, no performance is gained:
Could be due to the number of planes, or the plane size
Needs more investigation and profiling
Accelerate other kernels of the 3D semicoarsening multigrid using GPUs (such as coarse-operator construction, etc.)
An Algebraic Multiscale Solver on GPUs is next!
Thank you for listening.
Questions?