Petascale computing for studies of turbulent mixing and dispersion:
current status and future prospects

P. K. Yeung
Schools of AE, CSE, ME
Georgia Inst of Tech, Atlanta, USA

ICTS-TIFR Discussion Meeting "Extreme Turbulence Simulations"
Bangalore, India, December 2011
P.K. Yeung, Bangalore Dec 2011 – p.1/28
Acknowledgments
Contributors, HPC consultants, science collaborators, research grants and supercomputer resource allocations:

D. Pekurovsky, A. Majumdar (San Diego Supercomputer Center)
D.A. Donzis (Texas A&M University)
N.P. Malaya (Univ. Texas at Austin)
K.P. Iyer (Georgia Tech)
J. Kim, J. Li, R.A. Fiedler (NCSA at Univ. Illinois)

K.R. Sreenivasan (New York Univ)
S.B. Pope (Cornell), B.L. Sawford (Monash Univ, Australia)
R.D. Moser (Univ. Texas at Austin), J.J. Riley (Univ. Washington)

NSF Grants: Fluid Dynamics, Petascale Applications, Petascale Resource Allocations (access to "Blue Waters" at NCSA)

Approx 36 M CPU hours in 2012 allocated at NSF and DOE-supported supercomputer centers (TACC, NICS, NCCS, NERSC)
Turbulence: DNS and HPC
Exponential increase of computing power over the last 25 years
4096³ isotropic turbulence (Yokokawa et al. 2002)

top speed: ∼1 Petaflop/s in 2008; Exascale by 2018?
Many possible uses of massive computing power
always want higher Reynolds number (for a few canonical flows, now comparable to laboratory)
scalars and particles
better resolution of small scales
larger domain, longer integration time period
more complex physics and boundaries
The Scope of this Talk
1. Numerical Approach and Selected Results
current simulation database (up to 4096³)
2. Strategies in high-performance computing
the path towards the “next-largest” DNS
The basic numerical approach
Navier-Stokes equations with constant density (∇ · u = 0):

∂u/∂t + u · ∇u = −∇(p/ρ) + ν∇²u + f

In Fourier space, u(x) = Σ_k û(k) exp(ik · x):

k · û = 0

(∂/∂t + νk²) û = (−u · ∇u)⊥k + f⊥k

where (·)⊥k denotes the projection perpendicular to k, which enforces continuity.
Pseudo-spectral: forward and backward transform pairs, in3D
Aliasing control: truncation and phase shifting (Rogallo 1981)
Advance in time by 2nd or 4th order explicit Runge-Kutta:
∆t usually based on Courant no. (numerical stability)
integrating factor for viscous/diffusive terms
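The ingredients listed above (pseudo-spectral evaluation of the nonlinear term in physical space, 2/3-rule dealiasing by truncation, an integrating factor for the viscous term, explicit Runge-Kutta in time) can be sketched in one dimension for the viscous Burgers equation. This is an illustrative stand-in, not the DNS code itself; all names and the 1-D setting are assumptions:

```python
import numpy as np

def burgers_step(u, dt, nu):
    """One midpoint (2nd-order) Runge-Kutta step of u_t + u u_x = nu u_xx
    on a 2*pi-periodic grid: pseudo-spectral nonlinear term with 2/3-rule
    dealiasing, and an exact integrating factor for the viscous term."""
    N = u.size
    k = np.fft.fftfreq(N, d=1.0 / N)          # integer wavenumbers
    dealias = np.abs(k) < N / 3.0             # 2/3-rule truncation
    E = np.exp(-nu * k**2 * dt)               # integrating factor over dt
    Eh = np.exp(-nu * k**2 * dt / 2.0)        # ... over dt/2

    def nonlinear(uh):
        # form u and du/dx in physical space, multiply, transform back
        up = np.fft.ifft(uh).real
        ux = np.fft.ifft(1j * k * uh).real
        return -np.fft.fft(up * ux) * dealias

    uh = np.fft.fft(u)
    uh_mid = Eh * (uh + 0.5 * dt * nonlinear(uh))   # half step to t + dt/2
    uh_new = E * uh + dt * Eh * nonlinear(uh_mid)   # full step, viscous part exact
    return np.fft.ifft(uh_new).real

# usage: a decaying, steepening sine wave
N = 128
x = 2.0 * np.pi * np.arange(N) / N
u = np.sin(x)
for _ in range(50):
    u = burgers_step(u, dt=0.005, nu=0.1)
```

The integrating factor removes the viscous stiffness from the explicit scheme, so ∆t is set by the Courant condition alone, as on this slide.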
Turbulent mixing (Eulerian)
Advection-diffusion equation for passive scalar fluctuations
∂φ/∂t + u · ∇φ = −u · ∇Φ + D∇²φ

Uniform mean gradient ∇Φ: local isotropy?

Physics and range of scales depend strongly on Schmidt no. Sc = ν/D (which varies widely in applications)

Sc ≲ 1: inertial-convective scaling (Obukhov-Corrsin, 1949-51); Yeung et al. PoF 2005, consistent with expts (Sreenivasan 1996)

Sc ≫ 1: viscous-convective scaling (Batchelor 1959); resolve smaller scales, at the expense of Re (Donzis et al. FTC 2010)

Sc ≪ 1: inertial-diffusive scaling (Batchelor et al. 1959); larger domain to hold larger scales, smaller ∆t to capture fast molecular diffusion
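The grid-resolution consequence of these regimes follows from the smallest scalar length scale; the helper below encodes the classical estimates (Batchelor scale for Sc ≥ 1, Obukhov-Corrsin scale for Sc < 1). The function name and interface are illustrative only:

```python
def smallest_scalar_scale(eta, Sc):
    """Smallest dynamically important scalar length scale, relative to the
    Kolmogorov scale eta, from classical scaling arguments:
      Sc >= 1: Batchelor scale       eta_B  = eta * Sc**(-1/2)
      Sc <  1: Obukhov-Corrsin scale eta_OC = eta * Sc**(-3/4)
    High-Sc mixing demands finer grids (at the expense of reachable Re),
    while low-Sc scalars are smooth below eta_OC > eta."""
    if Sc >= 1.0:
        return eta * Sc ** -0.5
    return eta * Sc ** -0.75

# e.g. Sc = 4 halves the smallest scale; Sc = 1/16 raises it eightfold
scale_hi = smallest_scalar_scale(1.0, 4.0)
scale_lo = smallest_scalar_scale(1.0, 0.0625)
```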
Turbulent dispersion (Lagrangian)
Lagrangian fluid-particle tracking: (Yeung & Pope JFM 1989)
integrate dx+/dt = u+, where u+ = u(x+(t), t) is obtained by interpolation from the Eulerian velocity field
cubic splines: 4th-order accurate and twice differentiable
u+ = Σ_k Σ_j Σ_i b_i(x+) c_j(y+) d_k(z+) e_ijk

basis functions (b_i, c_j, d_k) over 4 adjacent grid intervals, and (N + 3)³ Eulerian basis-function coefficients (e_ijk)
One- to four-particle statistics, depending on initial positions (Sawford et al. PoF 2008, Hackl et al. PoF 2011)

Pursuit of Kolmogorov similarity at high Re, in close coupling with stochastic modeling (Sawford et al. PoF 2011). Connection to dissipation intermittency: Yeung et al. JFM 2007
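A one-dimensional sketch of the tracking step above (spline interpolation of a gridded velocity, then explicit integration of dx+/dt = u+) might look as follows. The frozen 1-D field, SciPy's periodic cubic spline, and RK2 time stepping are stand-ins for the 3-D tensor-product B-splines and time scheme of the actual code:

```python
import numpy as np
from scipy.interpolate import CubicSpline

TWO_PI = 2.0 * np.pi
N = 64
x_grid = np.linspace(0.0, TWO_PI, N + 1)   # periodic grid incl. endpoint
u_grid = np.sin(x_grid)                    # frozen Eulerian velocity field
u_grid[-1] = u_grid[0]                     # enforce exact periodicity
spline = CubicSpline(x_grid, u_grid, bc_type='periodic')  # C2 interpolant

def advance(xp, dt, nsteps):
    """Integrate dx+/dt = u+(x+) with midpoint RK2, interpolating the
    Eulerian velocity at the particle position at each stage."""
    for _ in range(nsteps):
        xmid = xp + 0.5 * dt * spline(xp % TWO_PI)
        xp = xp + dt * spline(xmid % TWO_PI)
    return xp

# a particle in u = sin(x) drifts toward the stagnation point at x = pi
xp = advance(np.array([np.pi / 2]), dt=0.01, nsteps=100)[0]
```

For dx/dt = sin(x) the exact solution is x(t) = 2 arctan(e^t tan(x0/2)), so the interpolation and time-stepping errors can be checked directly.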
Frequently Asked Questions
From both within our community and outside...
Why are large DNS calculations worthwhile? Will a bigger run really tell something new?

Some plausible answers:
1. immense detail inherent in "full-field measurements"
2. high Reynolds no.: never quite there (but can extrapolate?)
3. flow regimes or quantities not easily accessed in laboratory
4. facilitate access to data by others
Some examples from our DNS database...
Visualization: dissipation and enstrophy
[TACC visualization staff] 2048³, Rλ ≈ 650: intense enstrophy (red) has worm-like structure, while dissipation (blue) is more diffuse
PDFs of Dissipation and Enstrophy
Highest Re and best-resolved at moderate Re (both 4096³)

[Figure: PDFs of ǫ/〈ǫ〉 (dissipation) and Ω/〈Ω〉 (enstrophy) at Rλ ≈ 240 and Rλ ≈ 1000]
High Re: most intense events in both scale similarly
Higher-order moments also become closer
K41 (?) for One-Particle Statistics
Inertial range for structure function and acceleration spectrum:
D_L2(τ) ∼ C0〈ǫ〉τ,    φA(ω) ∼ (C0/π)〈ǫ〉

[Figure: (1/3)〈∆u_i(τ)²〉/(〈ǫ〉τ) vs τ/τη (left); φA(ω)/〈ǫ〉 vs ωTL (right)]
Requires τη ≪ τ ≪ TL and 1/TL ≪ ω ≪ 1/τη

Rλ up to ≈ 1000: TL/τη ≈ 80, C0 → O(7)

Stochastic modeling (dashed lines): Rλ ∼ 30000 for a plateau in D_L2(τ), but φA(ω) gives better scaling (Sawford & Yeung 2011)
Simulation Database
Rλ      N      kmaxη   Sc
140     256    1.38    0.125, 1
140     512    2.74    0.125, 1, 4
140     1024   5.48    1, 4
140     2048   11.2    4, 64
240     512    1.41    0.125, 1
240     2048   5.14    1, 8
240     4096   ∼11     32
390     1024   1.4     0.125, 1
650     2048   1.4     0.125, 1
650     4096   2.8     1, 4
1000    4096   1.4

(+ some recent runs; archived data O(200) Terabytes)
The Scope of this Talk
1. Numerical Approach and Selected Results
current simulation database (up to 4096³)
2. Strategies in high-performance computing
the path towards the “next-largest” DNS
Cyber Challenges: an overview
Largest computers in the world today have O(10⁵) cores:
massive parallelism dictates distributed-memory programming
communication overhead limits “scalability”
most user codes only at a few % of advertised peak
multi-core processors: use shared memory on the node for arithmetic and try to reduce communication costs
future hardware may require new programming models
Challenges beyond number-crunching
Big Data from big simulations: I/O performance
save, archive, retrieve, and share large datasets
compete for CPU hours (vs. other disciplines)
NSF: Petascale Model Problem
(One of a few for acceptance testing of 11-PF Blue Waters)
"A 12288³ simulation of fully developed homogeneous turbulence in a periodic domain for one eddy turnover time at a value of Rλ of O(2000)."

"The model problem should be solved using a dealiased, pseudo spectral algorithm, a fourth-order explicit Runge-Kutta time-stepping scheme, 64-bit floating point (or similar) arithmetic, and a time-step of 0.0001 eddy turnaround times."

"Full resolution snapshots of the three-dimensional vorticity, velocity and pressure fields should be saved to disk every 0.02 eddy turnaround times. The target wall-clock time is 40 hours."
2D Domain Decomposition
Partition the cube along two directions, into "pencils" of data

Up to N² cores for an N³ grid

MPI: 2-D processor grid, M1 (cols) × M2 (rows)

3D FFT from physical space to wavenumber space (starting with pencils in x):
Transform in x
Transpose to pencils in z
Transform in z
Transpose to pencils in y
Transform in y

Transposes by message-passing, collective communication
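On a single process the transpose sequence above can be mimicked with plain axis permutations, which is a useful sanity check that the three 1-D transform stages reproduce a direct 3-D FFT. In the parallel code each transpose is an MPI all-to-all; this serial sketch only verifies the algebra:

```python
import numpy as np

def fft3d_pencil(a):
    """3-D FFT as three 1-D transform stages with transposes between
    them, mirroring the pencil-decomposed sequence on this slide:
    transform in x -> transpose to z-pencils -> transform in z
    -> transpose to y-pencils -> transform in y."""
    b = np.fft.fft(a, axis=0)        # transform in x; axes (x^, y, z)
    b = b.transpose(2, 1, 0)         # to z-pencils;   axes (z, y, x^)
    b = np.fft.fft(b, axis=0)        # transform in z; axes (z^, y, x^)
    b = b.transpose(1, 0, 2)         # to y-pencils;   axes (y, z^, x^)
    b = np.fft.fft(b, axis=0)        # transform in y; axes (y^, z^, x^)
    return b.transpose(2, 0, 1)      # reorder back to (x^, y^, z^)

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8))
ok = np.allclose(fft3d_pencil(a), np.fft.fftn(a))
```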
Measures of Performance
Elapsed wall time, per time step, for given problem size
computation, communication, and memory copies
Strong scaling: for a fixed problem size,
is wall time ∝ (# cores)⁻¹? (want to compute faster)
ultimately, decrease in scalability is inevitable
Weak scaling: for fixed memory/workload per CPU core,
is wall time approx. the same? (want to compute bigger)
I/O — time taken to read/write large data sets
(∼1 TB each at 4096³, depending on number of scalars, single vs double precision, etc.)
All of these are influenced by many factors, and are system-dependent
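Strong-scaling behavior is easy to quantify from per-step timings; a small helper (names and numbers illustrative) reports speedup and parallel efficiency relative to the smallest core count:

```python
def strong_scaling(timings):
    """timings: {cores: wall seconds per step} at fixed problem size.
    Returns {cores: (speedup, parallel efficiency)} relative to the
    smallest core count; efficiency 1.0 means wall time ~ 1/cores."""
    p0 = min(timings)
    t0 = timings[p0]
    return {p: (t0 / t, (t0 / t) / (p / p0))
            for p, t in sorted(timings.items())}

# e.g. doubling cores but gaining only 1.4x speedup -> 70% efficiency
eff = strong_scaling({16384: 8.0, 32768: 8.0 / 1.4})
```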
Factors Affecting Performance
A partial list, besides the obvious...
Domain decomposition: the “processor grid geometry”
Load balancing: are all CPU cores equally busy?
Software libraries, compiler optimizations
Computation: cache size and memory bandwidth, per core
Communication: bandwidth and latency, per MPI task
Memory copies due to non-contiguous messages
I/O: filesystem speed and capacity; control of traffic jams
Environment variables, network topology
Practice: job turnaround, scheduler policies, and CPU-hour economics
DNS Code: Parallel Performance
Largest machine used is the 2-Petaflop Cray XT5 ("Jaguarpf" at ORNL)

4096³ (circles) and 8192³ (triangles), 4th-order RK:

[Figure: CPU time per step and relative speedup vs. # cores, over roughly 10⁴ to 10⁵ cores]

best processor grid, stride-1 arithmetic

dealiasing: can skip some (high-k) modes in Fourier space

better scaling when scalars added (blue, more work per core)
Multicore, shared memory computing
Trend towards multi-core designs: e.g. a Cray XT5 node has memory shared among 12 cores (24 for XE6, 16 for XK6)

OpenMP programming standard allows a number of parallel threads to perform the computation

Hybrid MPI-OpenMP: shared memory on each node, distributed memory across the nodes

less communication overhead; may scale better than pure MPI at large problem size and large core count

workload/memory management among threads can be tricky

two categories of experiments:
(i) fix number of MPI tasks, increase number of threads
(ii) constant number of cores (tasks × threads)

1-D decomposition + OpenMP: Mininni et al. (2011)
Hybrid MPI-OpenMP Timings
FFT kernel on 12-core Cray XT5 using 8 cores/node: benefits at 4096³ and larger (being implemented in full DNS code)

Better scalability via fewer MPI tasks in communication calls

N³      Tasks    Proc-Grid   Num_thr   Cores Used/Reqd     CPU     Scalability
4096³   16384    16x1024     -         16384 / 16392       8.942
4096³   65536    64x1024     -         65536 / 65544       5.603   39.9%
4096³   65536    64x1024     -         65536 / 98304       4.370
4096³   16384    16x1024     1         16384 / 24578       7.306
4096³   16384    16x1024     4         65536 / 98352       3.468   52.7%
8192³   65536    32x2048     -         65536 / 65544       25.70
8192³   131072   64x2048     -         131072 / 131076     18.32   70.2%
8192³   65536    32x2048     1         65536 / 98352       18.75
8192³   65536    32x2048     2         131072 / 196608     10.25   91.5%

16-core Cray XK6 ("Titan") just became available last week
More about FFT kernels
important tool for evaluation of new optimization strategies
detailed instrumentation of time taken for FFT and transposes
MPI vs hybrid for same number of cores
N      M1 × M2    thr   CPU        N      M1 × M2     thr   CPU
768    4 × 16     1     5.94       3072   16 × 256    1     10.81
768    4 × 8      2     8.15       3072   16 × 128    2     13.87
768    4 × 4      4     11.67      3072   16 × 64     4     14.80
More threads (fewer MPI tasks): more time spent in communication, but scalability improves (less MPI overhead)

Multi-threaded runs become more competitive for larger problems

Max. possible # threads is determined by node architecture
Still much room for improvement (work to do!)
I/O performance and portability
Size: restart files at 4096³ are O(1) TB per checkpoint/dataset
not easily transferable across systems/sites
Speed: severe traffic jam if all MPI tasks do this concurrently
use “relay” scheme, and sub-directories
Filesystem performance and retrieval of archived data
can be a bottleneck for post-processing
Portability, robustness and community usage
1 file per simulation vs 1 file per MPI task
standard Fortran I/O, MPI-IO, or Parallel HDF5
I/O kernel used to find the best strategy
Community-oriented database at extreme scales: How?
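The "relay" idea above, i.e. limiting how many tasks hit the filesystem at once and fanning files into sub-directories, amounts to simple rank bookkeeping. The sketch below is a hypothetical serial model of that coordination (not actual MPI code; names and the 1024-files-per-directory cap are assumptions):

```python
def relay_plan(nranks, width, files_per_dir=1024):
    """Group MPI ranks into write 'waves' of at most `width` concurrent
    writers (each wave completing before the next starts, relay-style),
    and assign each rank's checkpoint file to a sub-directory so no
    single directory holds too many files.
    Returns (waves, {rank: subdir})."""
    waves = [list(range(i, min(i + width, nranks)))
             for i in range(0, nranks, width)]
    subdir = {r: "ckpt/%04d" % (r // files_per_dir) for r in range(nranks)}
    return waves, subdir

# e.g. 65536 writers throttled to 512 at a time -> 128 waves
waves, subdir = relay_plan(nranks=65536, width=512)
```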
Blue Waters system info
(Public info: http://www.ncsa.illinois.edu/BlueWaters/stats.html)
Cray XE6 / XK6 cabinets              > 235 / > 30
Aggregate system memory              > 1.5 PB
Memory per core                      > 4 GB
Network switch                       Gemini
Peak performance                     > 11.5 PF
Number of AMD x86 cores              > 380,000
Number of NVIDIA GPUs                > 3,000
Integrated Near Line Environment     scaling to 500 PB
Bandwidth to near-line storage       100 GB/s
Availability timeline: 2012; Phase 1 and user workshop: this week
Future Optimization Strategies
— New Programming Models —
Major motivation: communication is completely dominant at large problem sizes, followed by memory bandwidth limitations

Advanced MPI: one-sided communication (let the sending task write directly into memory on the receiving task)

Overlap between computation and communication: not a new idea, but tricky to do, with little hardware support; not very effective if there is little to overlap

GPUs and accelerators: speed up computation and support very large thread counts, but need to copy data between GPU and CPU
Other crazier and riskier stuff on the way to Exascale (2018?)
How many time steps?
A crucial (and vexing) question in computer-time proposals
Increases with: desired length of simulation, Reynolds number,spatial and/or temporal resolution, numerical stability
A reasonable formula is as follows:
TE/∆t = (〈ǫ〉L/u′³) × (Rλ^(3/2)/15^(3/4)) × (kmaxη/2.96) × (Umax/(u′C))

TE: eddy-turnover time; C: Courant number; U = |u| + |v| + |w|
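Numerically the estimate above is a one-liner; the sample parameter values below are only placeholders (the slide does not fix them), and the dissipation-scale ratio 〈ǫ〉L/u′³ is passed in as a single dimensionless number:

```python
def steps_per_eddy_turnover(R_lambda, kmax_eta, C, Umax_over_up,
                            epsL_over_up3=1.0):
    """T_E/dt from the estimate on this slide:
    (<eps>*L/u'^3) * R_lambda^(3/2)/15^(3/4) * (kmax*eta)/2.96
    * Umax/(u'*C). All arguments are dimensionless ratios."""
    return (epsL_over_up3 * R_lambda ** 1.5 / 15.0 ** 0.75
            * kmax_eta / 2.96 * Umax_over_up / C)

# cost per eddy turnover grows like R_lambda^(3/2) and linearly in
# resolution kmax*eta; placeholder values for C and Umax/u'
n_hi = steps_per_eddy_turnover(1600.0, 1.4, 0.6, 6.0)
n_lo = steps_per_eddy_turnover(400.0, 1.4, 0.6, 6.0)
```

Quadrupling Rλ at fixed resolution multiplies the step count per eddy turnover by 4^(3/2) = 8, which is why the Reynolds-number dependence dominates such proposals.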
How many time steps will actually get done?
allocations received (what if 50% of request?)
can we make the code yet more efficient?
is the machine stable, with a good scheduler?
DNS at next highest resolution
Access to 11.5 PF/s Blue Waters (NSF)
8192³, with passive scalars and fluid particles (as challenging as 12288³ velocity-field only)

projected Reynolds no.: Rλ ∼ 1600

inhomogeneous flow problems at equivalent resolution

Much code development work still needed; kernels are necessary for understanding
code usability and data availability
Concluding Remarks
Successful extreme-scale DNS will require:
Deep engagement with top HPC experts and vendors’ staff
Communication, memory, and data handling matter more than raw speed
Insights about the science: what will be most useful to compute,that cannot be obtained otherwise?
Competing for CPU hours, which are in high demand from other disciplines
Final Question: What will we be doing in 2018?