Petascale computing for studies of turbulent mixing and dispersion:
current status and future prospects

P. K. Yeung
Schools of AE, CSE, ME
Georgia Inst of Tech, Atlanta, USA

ICTS-TIFR Discussion Meeting "Extreme Turbulence Simulations"
Bangalore, India, December 2011
P.K. Yeung, Bangalore Dec 2011 – p.1/28
Acknowledgments
Contributors, HPC consultants, science collaborators, research grants and supercomputer resource allocations:

D. Pekurovsky, A. Majumdar (San Diego Supercomputer Center)
D.A. Donzis (Texas A&M University)
N.P. Malaya (Univ. Texas at Austin)
K.P. Iyer (Georgia Tech)
J. Kim, J. Li, R.A. Fiedler (NCSA at Univ. Illinois)

K.R. Sreenivasan (New York Univ)
S.B. Pope (Cornell), B.L. Sawford (Monash Univ, Australia)
R.D. Moser (Univ. Texas at Austin), J.J. Riley (Univ. Washington)

NSF Grants: Fluid Dynamics, Petascale Applications, Petascale Resource Allocations (access to "Blue Waters" at NCSA)

Approx 36 M CPU hours in 2012 allocated at NSF and DOE-supported supercomputer centers (TACC, NICS, NCCS, NERSC)
Turbulence: DNS and HPC
Exponential increase of computing power over the last 25 years
4096³ isotropic turbulence (Yokokawa et al. 2002)

top speed: ∼1 Petaflop/s in 2008; Exascale by 2018?
Many possible uses of massive computing power
always want higher Reynolds number (for a few canonical flows, now comparable to laboratory)
scalars and particles
better resolution of small scales
larger domain, longer integration time period
more complex physics and boundaries
The Scope of this Talk
1. Numerical Approach and Selected Results
current simulation database (up to 4096³)
2. Strategies in high-performance computing
the path towards the “next-largest” DNS
The basic numerical approach
Navier-Stokes equations with constant density (∇ · u = 0):

∂u/∂t + u · ∇u = −∇(p/ρ) + ν∇²u + f

In Fourier space, u(x) = Σ_k û(k) exp(ik · x):

k · û = 0

(∂/∂t + νk²) û = (−u · ∇u)⊥k + f⊥k

where (·)⊥k denotes the projection perpendicular to k, which enforces continuity.
Pseudo-spectral: forward and backward transform pairs, in3D
Aliasing control: truncation and phase shifting (Rogallo 1981)
Advance in time by 2nd or 4th order explicit Runge-Kutta:
∆t usually based on Courant no. (numerical stability)
integrating factor for viscous/diffusive terms
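The ingredients listed above (pseudo-spectral evaluation of the nonlinear term in physical space, 2/3-rule dealiasing by truncation, an integrating factor for the viscous term, explicit Runge-Kutta in time) can be sketched in one dimension for the viscous Burgers equation. This is an illustrative stand-in, not the DNS code itself; all names and the 1-D setting are assumptions:

```python
import numpy as np

def burgers_step(u, dt, nu):
    """One midpoint (2nd-order) Runge-Kutta step of u_t + u u_x = nu u_xx
    on a 2*pi-periodic grid: pseudo-spectral nonlinear term with 2/3-rule
    dealiasing, and an exact integrating factor for the viscous term."""
    N = u.size
    k = np.fft.fftfreq(N, d=1.0 / N)          # integer wavenumbers
    dealias = np.abs(k) < N / 3.0             # 2/3-rule truncation
    E = np.exp(-nu * k**2 * dt)               # integrating factor over dt
    Eh = np.exp(-nu * k**2 * dt / 2.0)        # ... over dt/2

    def nonlinear(uh):
        # form u and du/dx in physical space, multiply, transform back
        up = np.fft.ifft(uh).real
        ux = np.fft.ifft(1j * k * uh).real
        return -np.fft.fft(up * ux) * dealias

    uh = np.fft.fft(u)
    uh_mid = Eh * (uh + 0.5 * dt * nonlinear(uh))   # half step to t + dt/2
    uh_new = E * uh + dt * Eh * nonlinear(uh_mid)   # full step, viscous part exact
    return np.fft.ifft(uh_new).real

# usage: a decaying, steepening sine wave
N = 128
x = 2.0 * np.pi * np.arange(N) / N
u = np.sin(x)
for _ in range(50):
    u = burgers_step(u, dt=0.005, nu=0.1)
```

The integrating factor removes the viscous stiffness from the explicit scheme, so ∆t is set by the Courant condition alone, as on this slide.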
Turbulent mixing (Eulerian)
Advection-diffusion equation for passive scalar fluctuations
∂φ/∂t + u · ∇φ = −u · ∇Φ + D∇²φ

Uniform mean gradient ∇Φ: local isotropy?

Physics and range of scales depend strongly on Schmidt no. Sc = ν/D (which varies widely in applications)

Sc ≲ 1: inertial-convective scaling (Obukhov-Corrsin, 1949-51); Yeung et al. PoF 2005, consistent with expts (Sreenivasan 1996)

Sc ≫ 1: viscous-convective scaling (Batchelor 1959); resolve smaller scales, at the expense of Re (Donzis et al. FTC 2010)

Sc ≪ 1: inertial-diffusive scaling (Batchelor et al. 1959); larger domain to hold larger scales, smaller ∆t to capture fast molecular diffusion
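The grid-resolution consequence of these regimes follows from the smallest scalar length scale; the helper below encodes the classical estimates (Batchelor scale for Sc ≥ 1, Obukhov-Corrsin scale for Sc < 1). The function name and interface are illustrative only:

```python
def smallest_scalar_scale(eta, Sc):
    """Smallest dynamically important scalar length scale, relative to the
    Kolmogorov scale eta, from classical scaling arguments:
      Sc >= 1: Batchelor scale       eta_B  = eta * Sc**(-1/2)
      Sc <  1: Obukhov-Corrsin scale eta_OC = eta * Sc**(-3/4)
    High-Sc mixing demands finer grids (at the expense of reachable Re),
    while low-Sc scalars are smooth below eta_OC > eta."""
    if Sc >= 1.0:
        return eta * Sc ** -0.5
    return eta * Sc ** -0.75

# e.g. Sc = 4 halves the smallest scale; Sc = 1/16 raises it eightfold
scale_hi = smallest_scalar_scale(1.0, 4.0)
scale_lo = smallest_scalar_scale(1.0, 0.0625)
```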
Turbulent dispersion (Lagrangian)
Lagrangian fluid-particle tracking: (Yeung & Pope JFM 1989)
integrate dx+/dt = u+, where u+ = u(x+(t), t) is obtained by interpolation from the Eulerian velocity field
cubic splines: 4th-order accurate and twice differentiable
u+ = Σ_k Σ_j Σ_i b_i(x+) c_j(y+) d_k(z+) e_ijk

basis functions (b_i, c_j, d_k) over 4 adjacent grid intervals, and (N + 3)³ Eulerian basis-function coefficients (e_ijk)
One- to four-particle statistics, depending on initial positions (Sawford et al. PoF 2008, Hackl et al. PoF 2011)

Pursuit of Kolmogorov similarity at high Re, in close coupling with stochastic modeling (Sawford et al. PoF 2011). Connection to dissipation intermittency: Yeung et al. JFM 2007
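A one-dimensional sketch of the tracking step above (spline interpolation of a gridded velocity, then explicit integration of dx+/dt = u+) might look as follows. The frozen 1-D field, SciPy's periodic cubic spline, and RK2 time stepping are stand-ins for the 3-D tensor-product B-splines and time scheme of the actual code:

```python
import numpy as np
from scipy.interpolate import CubicSpline

TWO_PI = 2.0 * np.pi
N = 64
x_grid = np.linspace(0.0, TWO_PI, N + 1)   # periodic grid incl. endpoint
u_grid = np.sin(x_grid)                    # frozen Eulerian velocity field
u_grid[-1] = u_grid[0]                     # enforce exact periodicity
spline = CubicSpline(x_grid, u_grid, bc_type='periodic')  # C2 interpolant

def advance(xp, dt, nsteps):
    """Integrate dx+/dt = u+(x+) with midpoint RK2, interpolating the
    Eulerian velocity at the particle position at each stage."""
    for _ in range(nsteps):
        xmid = xp + 0.5 * dt * spline(xp % TWO_PI)
        xp = xp + dt * spline(xmid % TWO_PI)
    return xp

# a particle in u = sin(x) drifts toward the stagnation point at x = pi
xp = advance(np.array([np.pi / 2]), dt=0.01, nsteps=100)[0]
```

For dx/dt = sin(x) the exact solution is x(t) = 2 arctan(e^t tan(x0/2)), so the interpolation and time-stepping errors can be checked directly.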
Frequently Asked Questions
From both within our community and outside...
Why are large DNS calculations worthwhile? Will a bigger run really tell something new?

Some plausible answers:
1. immense detail inherent in "full-field measurements"
2. high Reynolds no.: never quite there (but can extrapolate?)
3. flow regimes or quantities not easily accessed in laboratory
4. facilitate access to data by others
Some examples from our DNS database...
Visualization: dissipation and enstrophy
[TACC visualization staff] 2048³, Rλ ≈ 650: intense enstrophy (red) has worm-like structure, while dissipation (blue) is more diffuse
PDFs of Dissipation and Enstrophy
Highest Re and best-resolved at moderate Re (both 4096³)

[Figure: PDFs of ǫ/〈ǫ〉 (dissipation) and Ω/〈Ω〉 (enstrophy) at Rλ ≈ 240 and Rλ ≈ 1000]
High Re: most intense events in both scale similarly
Higher-order moments also become closer
K41 (?) for One-Particle Statistics
Inertial range for structure function and acceleration spectrum:
D_L2(τ) ∼ C0〈ǫ〉τ,    φA(ω) ∼ (C0/π)〈ǫ〉

[Figure: (1/3)〈∆u_i(τ)²〉/(〈ǫ〉τ) vs τ/τη (left); φA(ω)/〈ǫ〉 vs ωTL (right)]
Requires τη ≪ τ ≪ TL and 1/TL ≪ ω ≪ 1/τη

Rλ up to ≈ 1000: TL/τη ≈ 80, C0 → O(7)

Stochastic modeling (dashed lines): Rλ ∼ 30000 for a plateau in D_L2(τ), but φA(ω) gives better scaling (Sawford & Yeung 2011)
Simulation Database
Rλ      N      kmaxη   Sc
140     256    1.38    0.125, 1
140     512    2.74    0.125, 1, 4
140     1024   5.48    1, 4
140     2048   11.2    4, 64
240     512    1.41    0.125, 1
240     2048   5.14    1, 8
240     4096   ∼11     32
390     1024   1.4     0.125, 1
650     2048   1.4     0.125, 1
650     4096   2.8     1, 4
1000    4096   1.4

(+ some recent runs; archived data O(200) Terabytes)
The Scope of this Talk
1. Numerical Approach and Selected Results
current simulation database (up to 4096³)
2. Strategies in high-performance computing
the path towards the “next-largest” DNS
Cyber Challenges: an overview
Largest computers in the world today have O(10⁵) cores:
massive parallelism dictates distributed-memory programming
communication overhead limits “scalability”
most user codes only at a few % of advertised peak
multi-core processors: use shared memory on the node for arithmetic and try to reduce communication costs
future hardware may require new programming models
Challenges beyond number-crunching
Big Data from big simulations: I/O performance
save, archive, retrieve, and share large datasets
compete for CPU hours (vs. other disciplines)
NSF: Petascale Model Problem
(One of a few for acceptance testing of 11-PF Blue Waters)
"A 12288³ simulation of fully developed homogeneous turbulence in a periodic domain for one eddy turnover time at a value of Rλ of O(2000)."

"The model problem should be solved using a dealiased, pseudo spectral algorithm, a fourth-order explicit Runge-Kutta time-stepping scheme, 64-bit floating point (or similar) arithmetic, and a time-step of 0.0001 eddy turnaround times."

"Full resolution snapshots of the three-dimensional vorticity, velocity and pressure fields should be saved to disk every 0.02 eddy turnaround times. The target wall-clock time is 40 hours."
2D Domain Decomposition
Partition the cube along two directions, into "pencils" of data

Up to N² cores for an N³ grid

MPI: 2-D processor grid, M1 (cols) × M2 (rows)

3D FFT from physical space to wavenumber space (starting with pencils in x):
Transform in x
Transpose to pencils in z
Transform in z
Transpose to pencils in y
Transform in y

Transposes by message-passing, collective communication
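On a single process the transpose sequence above can be mimicked with plain axis permutations, which is a useful sanity check that the three 1-D transform stages reproduce a direct 3-D FFT. In the parallel code each transpose is an MPI all-to-all; this serial sketch only verifies the algebra:

```python
import numpy as np

def fft3d_pencil(a):
    """3-D FFT as three 1-D transform stages with transposes between
    them, mirroring the pencil-decomposed sequence on this slide:
    transform in x -> transpose to z-pencils -> transform in z
    -> transpose to y-pencils -> transform in y."""
    b = np.fft.fft(a, axis=0)        # transform in x; axes (x^, y, z)
    b = b.transpose(2, 1, 0)         # to z-pencils;   axes (z, y, x^)
    b = np.fft.fft(b, axis=0)        # transform in z; axes (z^, y, x^)
    b = b.transpose(1, 0, 2)         # to y-pencils;   axes (y, z^, x^)
    b = np.fft.fft(b, axis=0)        # transform in y; axes (y^, z^, x^)
    return b.transpose(2, 0, 1)      # reorder back to (x^, y^, z^)

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8))
ok = np.allclose(fft3d_pencil(a), np.fft.fftn(a))
```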
Measures of Performance
Elapsed wall time, per time step, for given problem size
computation, communication, and memory copies
Strong scaling: for a fixed problem size,
is wall time ∝ (# cores)⁻¹? (want to compute faster)
ultimately, decrease in scalability is inevitable
Weak scaling: for fixed memory/workload per CPU core,
is wall time approx. the same? (want to compute bigger)
I/O — time taken to read/write large data sets
(∼1 TB each at 4096³, depending on number of scalars, single vs double precision, etc.)
All of these are influenced by many factors, and are system-dependent
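Strong-scaling behavior is easy to quantify from per-step timings; a small helper (names and numbers illustrative) reports speedup and parallel efficiency relative to the smallest core count:

```python
def strong_scaling(timings):
    """timings: {cores: wall seconds per step} at fixed problem size.
    Returns {cores: (speedup, parallel efficiency)} relative to the
    smallest core count; efficiency 1.0 means wall time ~ 1/cores."""
    p0 = min(timings)
    t0 = timings[p0]
    return {p: (t0 / t, (t0 / t) / (p / p0))
            for p, t in sorted(timings.items())}

# e.g. doubling cores but gaining only 1.4x speedup -> 70% efficiency
eff = strong_scaling({16384: 8.0, 32768: 8.0 / 1.4})
```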
Factors Affecting Performance
A partial list, besides the obvious...
Domain decomposition: the “processor grid geometry”
Load balancing: are all CPU cores equally busy?
Software libraries, compiler optimizations
Computation: cache size and memory bandwidth, per core
Communication: bandwidth and latency, per MPI task
Memory copies due to non-contiguous messages
I/O: filesystem speed and capacity; control of traffic jams
Environment variables, network topology
Practice: job turnaround, scheduler policies, and CPU-hour economics
DNS Code: Parallel Performance
Largest machine used is the 2-Petaflop Cray XT5 ("Jaguarpf" at ORNL)

4096³ (circles) and 8192³ (triangles), 4th-order RK:

[Figure: CPU time per step and relative speedup vs. # cores, over roughly 10⁴ to 10⁵ cores]

best processor grid, stride-1 arithmetic

dealiasing: can skip some (high-k) modes in Fourier space

better scaling when scalars added (blue, more work per core)
Multicore, shared memory computing
Trend towards multi-core designs: e.g. a Cray XT5 node has memory shared among 12 cores (24 for XE6, 16 for XK6)

OpenMP programming standard allows a number of parallel threads to perform the computation

Hybrid MPI-OpenMP: shared memory on each node, distributed memory across the nodes

less communication overhead; may scale better than pure MPI at large problem size and large core count

workload/memory management among threads can be tricky

two categories of experiments:
(i) fix number of MPI tasks, increase number of threads
(ii) constant number of cores (tasks × threads)

1-D decomposition + OpenMP: Mininni et al. (2011)
Hybrid MPI-OpenMP Timings
FFT kernel on 12-core Cray XT5 using 8 cores/node: benefits at 4096³ and larger (being implemented in full DNS code)

Better scalability via fewer MPI tasks in communication calls

N³      Tasks    Proc-Grid   Num_thr   Cores Used/Reqd     CPU     Scalability
4096³   16384    16x1024     -         16384 / 16392       8.942
4096³   65536    64x1024     -         65536 / 65544       5.603   39.9%
4096³   65536    64x1024     -         65536 / 98304       4.370
4096³   16384    16x1024     1         16384 / 24578       7.306
4096³   16384    16x1024     4         65536 / 98352       3.468   52.7%
8192³   65536    32x2048     -         65536 / 65544       25.70
8192³   131072   64x2048     -         131072 / 131076     18.32   70.2%
8192³   65536    32x2048     1         65536 / 98352       18.75
8192³   65536    32x2048     2         131072 / 196608     10.25   91.5%

16-core Cray XK6 ("Titan") just became available last week
More about FFT kernels
important tool for evaluation of new optimization strategies
detailed instrumentation of time taken for FFT and transposes
MPI vs hybrid for same number of cores
N      M1 × M2    thr   CPU        N      M1 × M2     thr   CPU
768    4 × 16     1     5.94       3072   16 × 256    1     10.81
768    4 × 8      2     8.15       3072   16 × 128    2     13.87
768    4 × 4      4     11.67      3072   16 × 64     4     14.80
More threads (fewer MPI tasks): more time spent in communication, but scalability improves (less MPI overhead)

Multi-threaded runs become more competitive for larger problems

Max. possible # threads is determined by node architecture
Still much room for improvement (work to do!)
I/O performance and portability
Size: restart files at 4096³ are O(1) TB per checkpoint/dataset
not easily transferable across systems/sites
Speed: severe traffic jam if all MPI tasks do this concurrently
use “relay” scheme, and sub-directories
Filesystem performance and retrieval of archived data
can be a bottleneck for post-processing
Portability, robustness and community usage
1 file per simulation vs 1 file per MPI task
standard Fortran I/O, MPI-IO, or Parallel HDF5
I/O kernel used to find the best strategy
Community-oriented database at extreme scales: How?
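The "relay" idea above, i.e. limiting how many tasks hit the filesystem at once and fanning files into sub-directories, amounts to simple rank bookkeeping. The sketch below is a hypothetical serial model of that coordination (not actual MPI code; names and the 1024-files-per-directory cap are assumptions):

```python
def relay_plan(nranks, width, files_per_dir=1024):
    """Group MPI ranks into write 'waves' of at most `width` concurrent
    writers (each wave completing before the next starts, relay-style),
    and assign each rank's checkpoint file to a sub-directory so no
    single directory holds too many files.
    Returns (waves, {rank: subdir})."""
    waves = [list(range(i, min(i + width, nranks)))
             for i in range(0, nranks, width)]
    subdir = {r: "ckpt/%04d" % (r // files_per_dir) for r in range(nranks)}
    return waves, subdir

# e.g. 65536 writers throttled to 512 at a time -> 128 waves
waves, subdir = relay_plan(nranks=65536, width=512)
```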
Blue Waters system info
(Public info: http://www.ncsa.illinois.edu/BlueWaters/stats.html)
Cray XE6 / XK6 cabinets              > 235 / > 30
Aggregate system memory              > 1.5 PB
Memory per core                      > 4 GB
Network switch                       Gemini
Peak performance                     > 11.5 PF
Number of AMD x86 cores              > 380,000
Number of NVIDIA GPUs                > 3,000
Integrated Near Line Environment     scaling to 500 PB
Bandwidth to near-line storage       100 GB/s
Availability timeline: 2012; Phase 1 and user workshop: this week
Future Optimization Strategies
— New Programming Models —
Major motivation: communication is completely dominant at large problem sizes, followed by memory bandwidth limitations

Advanced MPI: one-sided communication (let the sending task write directly into memory on the receiving task)

Overlap between computation and communication: not a new idea, but tricky to do, with little hardware support; not very effective if there is little to overlap

GPUs and accelerators: speed up computation and support very large thread counts, but need to copy data between GPU and CPU
Other crazier and riskier stuff on the way to Exascale (2018?)
How many time steps?
A crucial (and vexing) question in computer-time proposals
Increases with: desired length of simulation, Reynolds number,spatial and/or temporal resolution, numerical stability
A reasonable formula is as follows:
TE/∆t = (〈ǫ〉L/u′³) × (Rλ^(3/2)/15^(3/4)) × (kmaxη/2.96) × (Umax/(u′C))

TE: eddy-turnover time; C: Courant number; U = |u| + |v| + |w|
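Numerically the estimate above is a one-liner; the sample parameter values below are only placeholders (the slide does not fix them), and the dissipation-scale ratio 〈ǫ〉L/u′³ is passed in as a single dimensionless number:

```python
def steps_per_eddy_turnover(R_lambda, kmax_eta, C, Umax_over_up,
                            epsL_over_up3=1.0):
    """T_E/dt from the estimate on this slide:
    (<eps>*L/u'^3) * R_lambda^(3/2)/15^(3/4) * (kmax*eta)/2.96
    * Umax/(u'*C). All arguments are dimensionless ratios."""
    return (epsL_over_up3 * R_lambda ** 1.5 / 15.0 ** 0.75
            * kmax_eta / 2.96 * Umax_over_up / C)

# cost per eddy turnover grows like R_lambda^(3/2) and linearly in
# resolution kmax*eta; placeholder values for C and Umax/u'
n_hi = steps_per_eddy_turnover(1600.0, 1.4, 0.6, 6.0)
n_lo = steps_per_eddy_turnover(400.0, 1.4, 0.6, 6.0)
```

Quadrupling Rλ at fixed resolution multiplies the step count per eddy turnover by 4^(3/2) = 8, which is why the Reynolds-number dependence dominates such proposals.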
How many time steps will actually get done?
allocations received (what if 50% of request?)
can we make the code yet more efficient?
is the machine stable, with a good scheduler?
DNS at next highest resolution
Access to 11.5 PF/s Blue Waters (NSF)
8192³, with passive scalars and fluid particles (as challenging as 12288³ velocity-field only)

projected Reynolds no.: Rλ ∼ 1600

inhomogeneous flow problems at equivalent resolution

Much code development work still needed; kernels are necessary for understanding
code usability and data availability
Concluding Remarks
Successful extreme-scale DNS will require:
Deep engagement with top HPC experts and vendors’ staff
Communication, memory, and data handling matter more than raw speed
Insights about the science: what will be most useful to compute,that cannot be obtained otherwise?
Competing for CPU hours, which are in high demand from other disciplines
Final Question: What will we be doing in 2018?