Daino: A High-level Framework for Parallel and Efficient AMR on … · 2017. 4. 30. · Daino: A High-level Framework for Parallel and Efficient AMR on GPUs Mohamed Wahib1, Naoya

Daino: A High-level Framework for Parallel and Efficient AMR on GPUs

Mohamed Wahib1, Naoya Maruyama1,2, Takayuki Aoki2

1 RIKEN Advanced Institute for Computational Science, Kobe, Japan

2 Tokyo Institute of Technology, GSIC, Tokyo, Japan

11th May 2017

GTC17

Motivation & Problem:

“AMR is one of the paths to multi-scale exascale applications”

Producing efficient AMR code is hard (especially for GPU)

Solution:

A framework for producing efficient AMR code (for GPUs)

Architecture-independent interface provided to the user

A speedup model for quantifying the efficiency of AMR code

Key results: We evaluate three AMR applications

Speedups & scalability comparable to hand-written code

(~3,642 K20x GPUs)

Summary


For meshes in some simulations using PDEs:

We only require high resolution for areas of interest

Resolution changes dynamically during simulation

Achieving efficient AMR is challenging

Managing an adaptive mesh can be complicated

Balancing compute load and communication costs


Adaptive Mesh Refinement (AMR)

Octree-based meshes: (a) Adaptive mesh (b) Tree representation

Structured Tree-based AMR


Many ways to represent the mesh

We focus on octree representation (quadtree in 2D)

Mesh divided into blocks, refine/coarsen if required

[Figure: (a) adaptive mesh and (b) its tree representation, with blocks assigned to processing elements PE1, PE2, PE3]

Operations applied on tree are distributed
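The block tree described on this slide can be sketched in C as a quadtree (the 2D case). This is a minimal illustration under assumed names, not Daino's data structure: QuadNode, node_new, node_refine, node_coarsen and count_leaves are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical quadtree node: a leaf owns one mesh block, an interior
 * node owns four children (an octree would have eight). */
typedef struct QuadNode {
    int level;                  /* refinement level, root = 0 */
    struct QuadNode *child[4];  /* all NULL for a leaf */
} QuadNode;

static QuadNode *node_new(int level) {
    QuadNode *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->level = level;
    return n;
}

static int node_is_leaf(const QuadNode *n) {
    return n->child[0] == NULL;
}

/* Refine: split a leaf block into four finer blocks. */
static void node_refine(QuadNode *n) {
    assert(node_is_leaf(n));
    for (int i = 0; i < 4; i++)
        n->child[i] = node_new(n->level + 1);
}

/* Free a whole subtree. */
static void node_free(QuadNode *n) {
    if (n == NULL) return;
    for (int i = 0; i < 4; i++) node_free(n->child[i]);
    free(n);
}

/* Coarsen: merge the children back into their parent block. */
static void node_coarsen(QuadNode *n) {
    for (int i = 0; i < 4; i++) {
        node_free(n->child[i]);
        n->child[i] = NULL;
    }
}

/* Number of leaf blocks, i.e. blocks that actually hold mesh data. */
static int count_leaves(const QuadNode *n) {
    if (node_is_leaf(n)) return 1;
    int total = 0;
    for (int i = 0; i < 4; i++) total += count_leaves(n->child[i]);
    return total;
}
```

Refining the root and then one of its children gives 7 leaf blocks; coarsening that child brings the count back to 4. Only leaves carry data, which is where the reduction in computation comes from.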

How AMR Works


Initialize the Mesh

FOR simulation time DO
    Execute stencil operations for all blocks
    Exchange ghost layers with neighbor nodes
    IF time to remesh
        Calculate remeshing criterion for all blocks
        Refine or consolidate blocks
        Balance the mesh
    ENDIF
    IF time to load balance
        Apply load balancing algorithm
    ENDIF
ENDFOR

Computation (stencil operations and ghost-layer exchange): reduced computation (less data in mesh)

Remeshing and load balancing: overhead
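The control flow of the AMR loop can be sketched in C as below. This is only a skeleton: the phases are replaced by counters, and AmrStats, amr_run and the two interval parameters are hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Hypothetical counters standing in for the real work of each phase. */
typedef struct {
    int stencil_steps;
    int ghost_exchanges;
    int remesh_steps;
    int balance_steps;
} AmrStats;

/* Run the AMR time loop for `steps` iterations. Remeshing happens every
 * `remesh_interval` steps and load balancing every `balance_interval`
 * steps; both intervals are illustrative parameters. */
static AmrStats amr_run(int steps, int remesh_interval, int balance_interval) {
    assert(remesh_interval > 0 && balance_interval > 0);
    AmrStats s = {0, 0, 0, 0};
    for (int t = 1; t <= steps; t++) {
        s.stencil_steps++;      /* execute stencil on all blocks */
        s.ghost_exchanges++;    /* exchange ghost layers with neighbors */
        if (t % remesh_interval == 0)
            s.remesh_steps++;   /* criterion, refine/consolidate, balance mesh */
        if (t % balance_interval == 0)
            s.balance_steps++;  /* redistribute blocks across processes */
    }
    return s;
}
```

With 100 steps, a remesh interval of 10 and a load-balance interval of 25, the stencil and ghost exchange run 100 times, remeshing 10 times and load balancing 4 times. The last two are the overhead that the reduced computation has to outweigh.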

AMR on GPUs


Hard to achieve efficient AMR with GPUs

Few existing AMR frameworks support GPU:

1. User must provide code optimized for GPU

2. Scalability problems due to CPU-GPU data movement

3. No speedup-bound model

The contributions of our framework address these three issues

Framework for Efficient AMR


A compiler and runtime

Input:

Serial code applying stencil on a uniform grid

User adds directives to identify relevant data arrays

Architecture-neutral

Output:

Executable binary for target architecture

Code is parallel and optimized for GPU (MPI+CUDA)

Architecture-neutral Interface (1 of 2)

Existing AMR frameworks: the user supplies architecture-specific code

CUDA code -> Framework -> GPU AMR executable

__global__ 3D_alloy(..) { … CUDA kernel code ... }

OpenMP code -> Framework -> CPU AMR executable

void 3D_alloy(..) { #pragma omp for … kernel code ... }

Our framework: uniform-mesh serial C code -> Framework -> GPU AMR executable or CPU AMR executable

#pragma daino kernel
void 3D_alloy(..) {
#pragma daino data (Nx,Ny,Nz) {p, u, dpt, no, o;}
  … kernel code ...
}

Two benefits:

- Productivity

- Ability to apply low-level GPU optimizations

#pragma dno kernel
void func(float ***a, float ***b, ..) {
#pragma dno data domName(i, j, k)
  a, b;
#pragma dno timeloop
  for (int t = 0; t < TIME_MAX; t++) {
    // interior points only; boundary handling omitted
    for (int i = 1; i < NX - 1; i++)
      for (int j = 1; j < NY - 1; j++) {
        ... // computation not related to a and b
        for (int k = 0; k < NZ; k++) {
          a[i][j][k] = c * (b[i-1][j][k]
              + b[i+1][j][k] + b[i][j][k]
              + b[i][j+1][k] + b[i][j-1][k]);
        }
      }
  }
}

Minimal example of using directives in our framework

Architecture-neutral Interface (2 of 2)

Directives in the example: a target kernel (#pragma dno kernel), data arrays + iterators (#pragma dno data), and the target loop (#pragma dno timeloop)

Scalable AMR: Data-centric Model (1 of 2)


A data-centric approach

Each computing element specializes on its data

Blocks on GPU, octree data structure on CPU

Migrate all operations touching block data to GPU

CPU only processes octree data structure


All kernels are data parallel (i.e. well-suited to GPU)
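The split between metadata and block data can be illustrated with a small host-only C sketch. Plain structs stand in for CPU and GPU memory here, and all names (DevicePool, HostMeta, device_scale_block, host_invoke_on_level) and sizes are assumptions made for illustration; a real implementation would launch CUDA kernels instead of calling a C function.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_CELLS 8    /* cells per block (illustrative) */
#define MAX_BLOCKS  16   /* capacity of the device-side pool (illustrative) */

/* Stand-in for GPU memory: block data arrays only. */
typedef struct {
    float cells[MAX_BLOCKS][BLOCK_CELLS];
} DevicePool;

/* Stand-in for CPU memory: AMR metadata only, no cell values. */
typedef struct {
    int level[MAX_BLOCKS];   /* refinement level of each block */
    int in_use[MAX_BLOCKS];  /* 1 if the slot holds a live block */
} HostMeta;

/* "Kernel": data-parallel update that touches block data, so it belongs
 * on the device side. Every cell is updated independently. */
static void device_scale_block(DevicePool *dev, int slot, float factor) {
    for (int c = 0; c < BLOCK_CELLS; c++)
        dev->cells[slot][c] *= factor;
}

/* Host side: walks the metadata and only decides which blocks to invoke
 * the kernel on. Returns the number of invocations issued. */
static int host_invoke_on_level(const HostMeta *meta, DevicePool *dev,
                                int level, float factor) {
    assert(meta != NULL && dev != NULL);
    int launched = 0;
    for (int s = 0; s < MAX_BLOCKS; s++) {
        if (meta->in_use[s] && meta->level[s] == level) {
            device_scale_block(dev, s, factor);
            launched++;
        }
    }
    return launched;
}
```

The host function never reads a cell value: it inspects levels and slot flags and issues invocations, which is the division of labor the slide describes.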

Scalable AMR: Data-centric Model (2 of 2)

[Figure: Conceptual overview of data-centric GPU AMR, numbered steps 1 to 7. CPU memory holds the octree (AMR metadata); the memory of each GPU (GPU0, GPU1, GPU2) holds the octants (data arrays). CPU-side flow: Initialize, then a loop of Stencil Kernel, Exchange Ghost Layers, and Update & Balance Octree, then Finalize. CPU-GPU interactions: Copy Initial Arrays, Invoke Compute Stencil, Invoke Post-Stencil (Correction), Copy Ghost Layers, Evaluate Error (Copy δ), Invoke Consolidate (< δ), Invoke Refine (> δ), Move Blocks, Copy Final Arrays. GPU kernels: Error Estim. Kernel, Refine Kernel, Consolidate Kernel, Correction Kernel.]

[1] Mohamed Wahib, Naoya Maruyama, Data-centric GPU-based Adaptive Mesh Refinement, IA^3'15, 5th Workshop on Irregular Applications Architectures and Algorithms, co-located with SC’15

AMR promises reduced computation

Problem: overhead in managing the hierarchical mesh

Project speedup bound

Informs framework designer of efficiency of AMR code

Compare achieved speedup vs. projected upper-bound speedup

Takes into account AMR overhead

If the projected speedup is far from the achieved speedup:

Some AMR overhead(s) are not properly accounted for
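The slides do not give the formula of the speedup model, so the C sketch below only illustrates the comparison it enables. It assumes, purely for illustration, that accounted overhead adds to the AMR compute time without overlap; projected_speedup, bound_efficiency and their parameters are hypothetical.

```c
#include <assert.h>

/* Hypothetical inputs, all in seconds:
 *   t_uniform  - runtime of the uniform-mesh run
 *   t_amr_comp - compute time of the AMR run (reduced work)
 *   t_overhead - AMR overhead accounted for (remeshing, load balancing, ...)
 * Returns a projected speedup bound under the simple assumption that
 * overhead adds to compute time without overlap. */
static double projected_speedup(double t_uniform, double t_amr_comp,
                                double t_overhead) {
    assert(t_uniform > 0.0 && t_amr_comp > 0.0 && t_overhead >= 0.0);
    return t_uniform / (t_amr_comp + t_overhead);
}

/* Fraction of the projected bound that was actually achieved. */
static double bound_efficiency(double measured_speedup, double projected) {
    assert(projected > 0.0);
    return measured_speedup / projected;
}
```

For example, a 100 s uniform run against 20 s of AMR compute plus 5 s of overhead projects a 4x bound; a measured 3.6x would then sit at about 90% of it. A much lower fraction would suggest an overhead the model has not accounted for.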

Speedup Model


Framework Implementation (1 of 2)

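[Figure: compilation flow. Annotated fixed-mesh code -> compiler front end -> LLVM-IR -> passes -> optimized LLVM-IR -> object files -> linker -> adapted-mesh executable. The generated code calls the Daino runtime (AMR library, Comm. library), which is linked in. Inset: the standard Clang/LLVM pipeline, C/C++ -> front end (Clang) -> IR -> pass -> IR -> pass -> IR -> pass -> IR -> back end -> machine code (LLVM proper).]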

Figure 1: Overview of framework implementation

Apply translations and optimizations as passes

The Daino framework overview. Application C code is transformed to an optimized executable. Daino components enclosed in red dotted line

Framework Implementation (2 of 2)

[Figure: code generation flow. Elements shown: application C code; AST; generated application LLVM IR and stencil IR; an IR pass producing the AMR driver IR; the translator emitting the stencil GPU kernel (NVVM IR), refine kernel (NVVM IR), coarsen kernel (NVVM IR) and error kernel (CUDA); NVPTX compilation to PTX; CUDA Driver API calls; API calls into the Daino runtime (AMR library, Comm. library); object files linked into the executable.]

Runtime Libraries


AMR Management

Maintain the octree

Orchestration of work

Memory manager

Especially important with GPU

Communication

MPI processes

Halo data exchange

Transparent access to blocks

Moving blocks (load balancing)
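Halo (ghost layer) exchange can be shown on two adjacent 1D blocks in plain C. This is a sketch under simplifying assumptions: both blocks sit in the same address space and at the same refinement level, whereas the runtime described here exchanges halos between MPI processes. Block1D, exchange_halo and stencil_step are hypothetical names.

```c
#include <assert.h>

#define NX 4   /* interior cells per block (illustrative) */

/* A 1D block with one ghost cell on each side:
 * index 0 and NX+1 are ghosts, 1..NX are interior. */
typedef struct {
    double u[NX + 2];
} Block1D;

/* Copy boundary interior cells into the neighbor's ghost cells.
 * `left` and `right` are adjacent blocks at the same refinement level. */
static void exchange_halo(Block1D *left, Block1D *right) {
    assert(left != right);
    left->u[NX + 1] = right->u[1];  /* right's first interior -> left's right ghost */
    right->u[0]     = left->u[NX];  /* left's last interior -> right's left ghost */
}

/* Three-point average over the interior; reads the ghosts at the edges. */
static void stencil_step(const Block1D *in, Block1D *out) {
    for (int i = 1; i <= NX; i++)
        out->u[i] = (in->u[i - 1] + in->u[i] + in->u[i + 1]) / 3.0;
}
```

After the exchange, the stencil on the last interior cell of the left block reads the right block's value through the ghost cell, so each block can be processed independently between exchanges.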

Evaluation

Application | Description

Hydrodynamics Solver | A 2nd order directionally split hyperbolic scheme to solve the Euler equations. [RTVD scheme modified from GAMER [1]]

Shallow-water Solver | We model shallow water simulations by depth-averaging the Navier–Stokes equations. [2nd order Runge-Kutta method]

Phase-field Simulation | 3D dendritic growth during binary alloy solidification [2]. [Time integration by Allen-Cahn equation]

[1] H.-Y. Schive, U.-H. Zhang, and T. Chiueh. Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR. Int. J. High Perform. Comput. Appl., 26(4):367–377, Nov. 2012

[2] T. Shimokawabe et al. Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer, SC’11

Results (1 of 4)


Weak scaling of uniform mesh, hand-written and automated AMR (GAMER-generated AMR included in hydrodynamic)

We use TSUBAME2.5 supercomputer (TokyoTech)

Up to 3,642 K20x GPUs

TSUBAME Grand Challenge Category A (full machine)

[Figure: weak scaling, three panels. Each plots runtime (seconds) against number of GPUs (16, 64, 256, 576, 1024, 1600, 2288, 2880, 3600).
HYDRODYNAMICS (mesh size per GPU: 4,096^3): Uniform Mesh, Auto AMR (Daino), Hand-written AMR, Auto AMR (GAMER); annotated speedups 9.4x and 8.5x.
PHASE-FIELD (mesh size per 16 GPUs: 4,096x512x512): Uniform Mesh, Auto AMR, Hand-written AMR; annotated speedups 1.78x and 1.66x.
SHALLOW-WATERS: Uniform Mesh, Auto AMR, Hand-written AMR; annotated speedups 3.8x and 2.9x.]

Results (2 of 4)


Strong scaling of uniform mesh, hand-written and automated AMR (GAMER-generated AMR included in hydrodynamic)

Notes:

Phase-field achieves 1.7x speedup

Original implementation is Gordon Bell 2011 winner

Daino is faster than GAMER AMR version

GAMER is a leading framework for AMR over GPUs

[Figure: strong scaling, three panels. Each plots runtime (seconds) against number of GPUs (16, 64, 256, 576, 1024, 1600, 2288, 2880, 3600).
PHASE-FIELD (mesh size 4,096^3): Uniform Mesh, Auto AMR, Hand-written AMR; annotated speedups 1.7x and 1.3x.
HYDRODYNAMICS: Uniform Mesh, Auto AMR (Daino), Hand-written AMR, Auto AMR (GAMER); annotated speedups 9.6x and 7.4x.
SHALLOW-WATERS: Uniform Mesh, Auto AMR, Hand-written AMR; annotated speedups 4.1x and 3.2x.]

Results (3 of 4)

Overhead of the AMR framework (weak scaling):

AMR overhead grows from 12% on 16 GPUs to 16% on 3,600 GPUs

Remeshing kernels are well-suited to GPU

Results (4 of 4)

Speedup: measured vs. projected. M is measured, P is the practical AMR speedup projection, and T is the theoretical AMR speedup projection.


Efficiency of transformation:

Achieved speedup > 86% of practical limit

[Figure: three panels (HYDRODYNAMICS, PHASE-FIELD, SHALLOW-WATERS), each plotting speedup against number of GPUs; legend entries M, L, A.]

Problem:

AMR is one of the paths to multi-scale exascale applications

Producing efficient AMR code is hard (especially for GPU)

Solution:

A framework for producing efficient AMR code (for GPUs)

Architecture-independent interface provided to the user

A speedup model for quantifying the efficiency of AMR code

Key results: We evaluate three AMR applications

Speedups & scalability comparable to hand-written code

(3,642 K20x GPUs)

Summary


Future Work

Expand Daino

Incorporate Daino’s GPU backend in other AMR frameworks

Work in progress on porting new applications (CFD)

Supporting user-specified boundary conditions, equations of state, and flux corrections

Extend support for Intel Xeon Phi (KNL)

We already introduced experimental support for OpenMP (not fully optimized)

Leverage the speedup model analysis

Auto-tuning

Daino will be publicly released at:

http://github.com/wahibium/Daino

Thank you for listening.

Questions?
