ENZO AND EXTREME SCALE AMR FOR HYDRODYNAMIC COSMOLOGY
Michael L. Norman, UC San Diego and [email protected]
WHAT IS ENZO?
- A parallel AMR application for astrophysics and cosmology simulations
  - Hybrid physics: fluid + particle + gravity + radiation
  - Block-structured AMR
  - MPI or hybrid parallelism
- Under continuous development since 1994
  - Started by Greg Bryan and Mike Norman @ NCSA
  - Shared memory, then distributed memory, then hierarchical memory parallelism
  - C++/C/Fortran, >185,000 LOC
- Community code in widespread use worldwide
  - Hundreds of users, dozens of developers
  - Version 2.0 @ http://enzo.googlecode.com
TWO PRIMARY APPLICATION DOMAINS
ASTROPHYSICAL FLUID DYNAMICS: supersonic turbulence
HYDRODYNAMIC COSMOLOGY: large-scale structure
ENZO PHYSICS

Physics | Equations | Math type | Algorithm(s) | Communication
Dark matter | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter
Gravity | Poisson | Elliptic | FFT, multigrid | Global
Gas dynamics | Euler | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor
Magnetic fields | Ideal MHD | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor
Radiation transport | Flux-limited radiation diffusion | Nonlinear parabolic | Implicit finite difference, multigrid solves | Global
Multispecies chemistry | Kinetic equations | Coupled stiff ODEs | Explicit BE, implicit | None
Inertial, tracer, source, and sink particles | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter
Physics modules can be used in any combination in 1D, 2D, and 3D, making ENZO a very powerful and versatile code.
ENZO MESHING
- Berger-Colella structured AMR
- Cartesian base grid and subgrids
- Hierarchical timestepping across Levels 0, 1, 2, ...
- AMR = collection of grids (patches); each grid is a C++ object
- Unigrid = collection of Level 0 grid patches
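The patch hierarchy and hierarchical timestepping described above can be sketched in a few lines (a hedged illustration with invented names; Enzo's actual C++ grid class is far richer). With a refinement factor of 2, a Level l patch takes 2^l substeps for every Level 0 step:

```python
# Minimal sketch of a Berger-Colella-style patch hierarchy with
# hierarchical timestepping (hypothetical names, refinement factor 2).
class Grid:
    """One AMR patch; Enzo represents each patch as a C++ object."""
    def __init__(self, level, parent=None):
        self.level = level
        self.parent = parent
        self.children = []
        self.steps_taken = 0

    def refine(self):
        """Add a child patch one level deeper."""
        child = Grid(self.level + 1, parent=self)
        self.children.append(child)
        return child

def advance(grid, dt):
    """Advance a patch by dt, then advance each child with two
    half-steps (hierarchical timestepping, refinement factor 2)."""
    grid.steps_taken += 1
    for child in grid.children:
        advance(child, dt / 2)
        advance(child, dt / 2)

root = Grid(level=0)      # a Level 0 base-grid patch
sub = root.refine()       # Level 1 subgrid
subsub = sub.refine()     # Level 2 subgrid
advance(root, dt=1.0)     # one Level 0 step
# Level l patches take 2**l substeps per root step: 1, 2, 4
```

The recursion makes the time hierarchy explicit: finer levels are advanced twice per parent step, so they stay synchronized with the coarse grid at the end of every Level 0 step.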
EVOLUTION OF ENZO PARALLELISM
- Shared memory (PowerC) parallel (1994-1998)
  - SMP and DSM architectures (SGI Origin 2000, Altix)
  - Parallel DO across grids at a given refinement level, including block-decomposed base grid
  - O(10,000) grids
- Distributed memory (MPI) parallel (1998-2008)
  - MPP and SMP cluster architectures (e.g., IBM PowerN)
  - Level 0 grid partitioned across processors
  - Level >0 grids within a processor executed sequentially
  - Dynamic load balancing by messaging grids to underloaded processors (greedy load balancing)
  - O(100,000) grids

[Figure: projection of refinement levels; 160,000 grid patches at 4 refinement levels]
- 1 MPI task per processor
- Task = a Level 0 grid patch and all associated subgrids, processed sequentially across and within levels
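The greedy load balancing mentioned above can be sketched as follows (a generic illustration, not Enzo's actual implementation): grids are taken largest-first and each is assigned to the currently least-loaded processor, tracked here with a min-heap.

```python
import heapq

def greedy_balance(grid_work, n_procs):
    """Greedy load balancing: assign each grid, largest work first,
    to the least-loaded processor (min-heap of processor loads)."""
    heap = [(0.0, p) for p in range(n_procs)]  # (load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for gid, work in sorted(grid_work.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(heap)       # least-loaded processor
        assignment[gid] = proc
        heapq.heappush(heap, (load + work, proc))
    return assignment

# Four grids of unequal work distributed over two processors:
work = {"g0": 8.0, "g1": 4.0, "g2": 3.0, "g3": 1.0}
assign = greedy_balance(work, n_procs=2)
```

In Enzo the "assignment" step is realized by messaging whole grid objects from overloaded to underloaded processors; the sketch only captures the placement decision.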
EVOLUTION OF ENZO PARALLELISM
- Hierarchical memory (MPI+OpenMP) parallel (2008-)
  - SMP and multicore cluster architectures (Sun Constellation, Cray XT4/5)
  - Level 0 grid partitioned across shared-memory nodes/multicore processors
  - Parallel DO across grids at a given refinement level within a node
  - Dynamic load balancing less critical because of larger MPI task granularity (statistical load balancing)
  - O(1,000,000) grids
- N MPI tasks per SMP, M OpenMP threads per task
- Task = a Level 0 grid patch and all associated subgrids, processed concurrently within levels and sequentially across levels
- Each grid is processed by an OpenMP thread
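The hybrid scheme, concurrent within a level but sequential across levels, can be sketched with a thread pool standing in for the OpenMP parallel DO (illustrative only; function and grid names are invented):

```python
from concurrent.futures import ThreadPoolExecutor

def evolve_levels(grids_by_level, n_threads=4):
    """Process refinement levels in order; within each level, handle
    grids concurrently (a stand-in for OpenMP's parallel DO)."""
    trace = []  # (level, grid) pairs in completion order
    for level in sorted(grids_by_level):
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            # Each grid at this level is dispatched to a worker thread.
            done = list(pool.map(lambda g: (level, g),
                                 grids_by_level[level]))
        # Leaving the `with` block is the level barrier: all grids at
        # this level finish before the next level starts.
        trace.extend(done)
    return trace

hierarchy = {0: ["g00"], 1: ["g10", "g11"], 2: ["g20", "g21", "g22"]}
trace = evolve_levels(hierarchy)
```

The per-level barrier mirrors why dynamic load balancing matters less here: each MPI task owns many grids, so thread-level scheduling smooths out imbalance statistically.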
ENZO ON PETASCALE PLATFORMS
ENZO ON CRAY XT5: 1% OF THE 6400^3 SIMULATION
- Non-AMR 6400^3, 80 Mpc box
- 15,625 (25^3) MPI tasks, 256^3 root grid tiles
- 6 OpenMP threads per task
- 93,750 cores
- 30 TB per checkpoint/restart/data dump
- >15 GB/sec read, >7 GB/sec write
- Benefit of threading: reduced MPI overhead and improved disk I/O
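The decomposition above is internally consistent and easy to verify: 25^3 MPI tasks, each owning a 256^3 root-grid tile, reproduce the 6400^3 mesh, and 15,625 tasks at 6 threads each give the quoted core count.

```python
tasks_per_dim = 25     # 25^3 = 15,625 MPI tasks
tile = 256             # 256^3 cells per root-grid tile
threads = 6            # OpenMP threads per task

n_tasks = tasks_per_dim ** 3        # total MPI tasks
mesh_per_dim = tasks_per_dim * tile # cells per dimension of full mesh
n_cores = n_tasks * threads         # total cores used
```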
ENZO ON PETASCALE PLATFORMS
ENZO ON CRAY XT5: 10^5 SPATIAL DYNAMIC RANGE
- AMR 1024^3, 50 Mpc box, 7 levels of refinement
- 4096 (16^3) MPI tasks, 64^3 root grid tiles
- 1 to 6 OpenMP threads per task: 4096 to 24,576 cores
- Benefit of threading: thread count increases with memory growth, reducing replication of grid hierarchy data
- Using MPI+threads to access more RAM as the AMR calculation grows in size
ENZO ON PETASCALE PLATFORMS
ENZO-RHD ON CRAY XT5: COSMIC REIONIZATION
- Including radiation transport: ~10x more expensive
- LLNL Hypre multigrid solver dominates run time; near-ideal scaling to at least 32K MPI tasks
- Non-AMR 1024^3, 8 and 16 Mpc boxes
- 4096 (16^3) MPI tasks, 64^3 root grid tiles
BLUE WATERS TARGET SIMULATION: RE-IONIZING THE UNIVERSE
- Cosmic reionization is a weak-scaling problem: large volumes at a fixed resolution to span the range of scales
- Non-AMR 4096^3 with ENZO-RHD
- Hybrid MPI and OpenMP; SMT and SIMD tuning
- 128^3 to 256^3 root grid tiles
- 4-8 OpenMP threads per task
- 4-8 TBytes per checkpoint/restart/data dump (HDF5)
- In-core intermediate checkpoints (?)
- 64-bit arithmetic, 64-bit integers and pointers
- Aiming for 64-128 K cores, 20-40 M hours (?)
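The quoted dump size is consistent with the 64-bit arithmetic noted above: one field on a 4096^3 mesh occupies exactly 0.5 TiB, so a 4-8 TByte checkpoint corresponds to roughly 8-16 field-sized arrays (the field count is an inference for illustration, not stated in the slide).

```python
cells = 4096 ** 3                    # = 2**36 cells, non-AMR mesh
bytes_per_field = cells * 8          # 64-bit values: 2**39 bytes
tib = 2 ** 40                        # one tebibyte

field_tib = bytes_per_field / tib            # TiB per field
n_fields_low = 4 * tib // bytes_per_field    # fields in a 4 TiB dump
n_fields_high = 8 * tib // bytes_per_field   # fields in an 8 TiB dump
```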
PETASCALE AND BEYOND
- ENZO's AMR infrastructure limits scalability to O(10^4) cores
- We are developing a new, extremely scalable AMR infrastructure called Cello: http://lca.ucsd.edu/projects/cello
- ENZO-P will be implemented on top of Cello to scale to the petascale and beyond
CURRENT CAPABILITIES: AMR VS TREECODE
CELLO EXTREME AMR FRAMEWORK: DESIGN PRINCIPLES
- Hierarchical parallelism and load balancing to improve localization
- Reduce global synchronization to a minimum
- Flexible mapping between data structures and concurrency
- Object-oriented design
- Build on the best available software for fault-tolerant, dynamically scheduled concurrent objects (Charm++)
CELLO EXTREME AMR FRAMEWORK: APPROACH AND SOLUTIONS
1. Hybrid replicated/distributed octree-based AMR approach, with novel modifications to improve AMR scaling in both size and depth
2. Patch-local adaptive time steps
3. Flexible hybrid parallelization strategies
4. Hierarchical load balancing approach based on actual performance measurements
5. Dynamic task scheduling and communication
6. Flexible reorganization of AMR data in memory to permit independent optimization of computation, communication, and storage
7. Variable AMR grid block sizes while keeping parallel task sizes fixed
8. Addressing numerical precision and range issues that arise in particularly deep AMR hierarchies
9. Detecting and handling hardware or software faults during run-time to improve software resilience and enable software self-management
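The octree organization behind item 1 can be illustrated with a minimal sketch (generic, not Cello's actual data structure): each refined block splits into 2^3 = 8 children, and the active blocks are the leaves of the tree.

```python
class OctreeNode:
    """One AMR block; a refined block holds 8 children (2x2x2)."""
    def __init__(self, level=0):
        self.level = level
        self.children = []

    def refine(self):
        """Split this block into 8 children one level deeper."""
        self.children = [OctreeNode(self.level + 1) for _ in range(8)]
        return self.children

    def leaves(self):
        """Active blocks are the leaves of the octree."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

root = OctreeNode()
kids = root.refine()   # 8 Level 1 blocks
kids[0].refine()       # refine one child: 8 Level 2 blocks
```

Scaling this structure in both breadth (many siblings per level) and depth (many levels) without replicating the whole tree on every processor is exactly the challenge the hybrid replicated/distributed design targets.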
IMPROVING THE AMR MESH: PATCH COALESCING
IMPROVING THE AMR MESH: TARGETED REFINEMENT
IMPROVING THE AMR MESH: TARGETED REFINEMENT WITH BACKFILL
CELLO SOFTWARE COMPONENTS
http://lca.ucsd.edu/projects/cello
ROADMAP

ENZO RESOURCES
- Enzo website (code, documentation): http://lca.ucsd.edu/projects/enzo
- 2010 Enzo User Workshop slides: http://lca.ucsd.edu/workshops/enzo2010
- yt website (analysis and vis.): http://yt.enzotools.org
- Jacques website (analysis and vis.): http://jacques.enzotools.org/doc/Jacques/Jacques.html
BACKUP SLIDES
GRID HIERARCHY DATA STRUCTURE
[Figure: mesh patches at Levels 0, 1, and 2, and the corresponding grid hierarchy tree with nodes labeled (level, sibling index); the tree grows in depth (level) and breadth (# siblings)]
Scaling the AMR grid hierarchy in depth and breadth
1024^3, 7-LEVEL AMR STATS

Level | Grids   | Memory (MB) | Work = Mem*(2^level)
0     |     512 |     179,029 |  179,029
1     | 223,275 |     114,629 |  229,258
2     |  51,522 |      21,226 |   84,904
3     |  17,448 |       6,085 |   48,680
4     |   7,216 |       1,975 |   31,600
5     |   3,370 |       1,006 |   32,192
6     |   1,674 |         599 |   38,336
7     |     794 |         311 |   39,808
Total | 305,881 |     324,860 |  683,807
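The Work column follows the weighting Work = Memory * 2^level, reflecting the 2^level substeps each level takes per root-grid timestep; the table's entries can be reproduced directly from the Memory column:

```python
# Memory (MB) per refinement level, from the AMR stats table.
memory_mb = {0: 179029, 1: 114629, 2: 21226, 3: 6085,
             4: 1975, 5: 1006, 6: 599, 7: 311}

def work(level, mem):
    """Work estimate: memory weighted by the 2**level substep count."""
    return mem * 2 ** level

work_by_level = {lv: work(lv, mem) for lv, mem in memory_mb.items()}
total_work = sum(work_by_level.values())
```

Note how hierarchical timestepping shifts the cost profile: Level 1 holds far less memory than Level 0 yet accounts for more work, because its grids are advanced twice as often.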
[Figure: current MPI implementation. A real grid object holds grid metadata plus physics data; a virtual grid object holds grid metadata only.]
SCALING AMR GRID HIERARCHY
- Flat MPI implementation is not scalable because grid hierarchy metadata is replicated in every processor
- For very large grid counts, this metadata (not the physics data!) dominates the memory requirement
- Hybrid parallel implementation helps a lot: hierarchy metadata is replicated only in every SMP node instead of every processor
- We would prefer fewer SMP nodes (8192-4096) with bigger core counts (32-64) (=262,144 cores)
- Communication burden is partially shifted from MPI to intra-node memory accesses
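The metadata-replication argument can be quantified with a toy model (the per-grid metadata size of 512 bytes is an assumed illustrative number, not Enzo's actual figure): replicating hierarchy metadata per MPI process scales with the process count, while per-node replication divides that cost by the cores per node.

```python
def replicated_metadata_gib(n_grids, bytes_per_grid, n_replicas):
    """Total memory consumed by replicated grid-hierarchy metadata."""
    return n_grids * bytes_per_grid * n_replicas / 2 ** 30

n_grids = 300_000        # grid count comparable to the AMR stats table
bytes_per_grid = 512     # assumed metadata bytes per grid (illustrative)

flat = replicated_metadata_gib(n_grids, bytes_per_grid, 4096)      # per rank
hybrid = replicated_metadata_gib(n_grids, bytes_per_grid, 4096 // 16)  # per node
```

With 4096 MPI ranks grouped 16 to a node, the hybrid scheme cuts aggregate metadata memory by exactly the cores-per-node factor, which is why bigger nodes with more cores are preferred.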
CELLO EXTREME AMR FRAMEWORK
- Targeted at fluid, particle, or hybrid (fluid + particle) simulations on millions of cores
- Generic AMR scaling issues:
  - Small AMR patches restrict available parallelism
  - Dynamic load balancing
  - Maintaining data locality for deep hierarchies
  - Re-meshing efficiency and scalability
  - Inherently global multilevel elliptic solves
  - Increased range and precision requirements for deep hierarchies