SC13 BoF INCITE and Leadership-class
Systems
Julia C. White, INCITE Program Manager [email protected]
TP Straatsma, Group Leader, Scientific Computing, Oak Ridge Leadership Computing Facility
Tim Williams, Catalyst, Argonne Leadership Computing Facility
2
Three primary ways to access the LCFs
Distribution of allocable hours:
• 60% INCITE (5.8 billion core-hours in CY2014)
• Up to 30% ASCR Leadership Computing Challenge
• 10% Director's Discretionary

[Diagram: leadership-class computing atop DOE/SC capability computing]

INCITE seeks computationally intensive, large-scale research and/or development projects with the potential to significantly advance key areas in science and engineering.
3
INCITE criteria
Access is awarded on a competitive, merit-reviewed basis*
1. Merit criterion: a research campaign with the potential for significant domain and/or community impact
2. Computational leadership criterion: computationally intensive runs that cannot be done anywhere else (capability, architectural needs)
3. Eligibility criterion:
• Allocations are granted regardless of funding source*
• Non-US-based researchers are welcome to apply

*DOE High-End Computing Revitalization Act of 2004: Public Law 108-423
4
INCITE is open to researchers around the world in a broad array of domains:
Accelerator physics, Astrophysics, Bioenergy, Chemical sciences, Climate research, Computer science, Engineering, Environmental science, Fusion energy, Life sciences, Materials science, Nuclear physics

Advancing the state of the art across a range of disciplines. No designated number of hours for a particular science area.
5
INCITE breakthroughs since inception
A few of the many science and engineering advances

Year             2004   2005   2006    2007   2008   2009   2010   2011   2012   2013   2014
Hours allocated  4.9 M  6.5 M  18.2 M  95 M   268 M  889 M  1.6 B  1.7 B  1.7 B  4.7 B  5.8 B
Projects         3      3      15      45     55     66     69     57     60     61     59

Hours requested vs. hours allocated: roughly 2x to 3x per year.

• Unprecedented simulation of a magnitude-8 earthquake over 125 square miles. Proceedings of SC10
• World's first continuous simulation of 21,000 years of Earth's climate history. Science (2009)
• Largest-ever LES of a full-sized commercial combustion chamber used in an existing helicopter turbine
• Largest simulation of a galaxy's worth of dark matter; showed for the first time the fractal-like appearance of dark-matter substructures. Nature (2008), Science (2009)
• OMEN breaks the petascale barrier using more than 220,000 cores. Proceedings of SC10
• NIST proposes new standard reference materials from LCF concrete simulations
• New method to rapidly determine protein structure with limited experimental data. Science (2010), Nature (2011)
• Researchers solved the 2D Hubbard model and presented evidence that it predicts HTSC behavior. Phys. Rev. Lett. (2005)
• Modeling of the molecular basis of Parkinson's disease named the #1 computational accomplishment in Breakthroughs 2008
• Calculation of the number of bound nuclei in nature. Nature (2012)
6
2014 INCITE award statistics
• Request for Information helped attract new projects
• Call closed June 28, 2013
• Total requests: ~14 billion core-hours
• Awards of 5.8 billion core-hours for CY2014
• 59 projects awarded, of which 21 are renewals
• Acceptance rates: 36% of nonrenewal submittals and 91% of renewals

[Chart: PIs by affiliation (awards)]

Contact: Julia C. White, INCITE Manager
7
2014 award statistics, by system

                                     Titan    Mira
Number of projects*                  29       40
Average project (core-hours)         77.6 M   88.2 M
Median project (core-hours)          75 M     77.5 M
Total awards (core-hours in CY2014)  2.25 B   3.53 B

*59 INCITE projects in total; many projects received time on both Mira and Titan.
8
How do you get there?
• Attend an LCF workshop, webinar, etc.
• Apply for Director's Discretionary time:
– www.alcf.anl.gov/getting-started/apply-for-dd
– www.olcf.ornl.gov/support/getting-started/
Each INCITE project is assigned a staff member to provide scientific support. This ‘science liaison’ at the OLCF or ‘catalyst’ at the ALCF will help you use the resources as effectively as possible and achieve your technical milestones.
9
Leadership-class systems
                              Argonne LCF        Oak Ridge LCF
Name                          Mira               Titan
System                        IBM Blue Gene/Q    Cray XK7
Compute nodes                 49,152             18,688
Node architecture             PowerPC, 16 cores  AMD Opteron, 16 cores + NVIDIA Kepler GPU
Processing cores              786,432            299,008
Memory per node (gigabytes)   16                 32 + 6
Peak performance (petaflops)  10                 27
10
Oak Ridge Leadership Computing Facility
• Titan – Cray XK7
– 18,688 nodes, AMD Opteron 6274 / NVIDIA K20X
– 688 TB of memory
– Peak flop rate: 27 PF
• Eos – Cray XC30
– 744 nodes, Intel Xeon E5-2670
– 48 TB of memory
– 248 TF
• Storage
– Spider Lustre filesystem: 40 PB, >1 TB/s bandwidth
– HPSS archival mass storage: 240 PB, 6 tape libraries
• Rhea – analytics and visualization cluster
• EVEREST – visualization laboratory
– Stereoscopic 6x3 1920x1080 display wall, 30.5' x 8.5'
– Planar 4x4 1920x1080 display wall
– Distributed-memory Linux cluster
11
Cray XK7 Compute Node
XK7 Compute Node Characteristics
• AMD Opteron 6274 "Interlagos": 16-core processor @ 2.2 GHz
• NVIDIA Tesla K20X "Kepler": 1.31 TF, 2688 CUDA cores
– 6 GB GDDR5 ECC memory, 250 GB/s memory bandwidth
• Host memory: 32 GB of 1600 MHz DDR3 ECC memory
• Gemini high-speed interconnect
• Four compute nodes per XK7 blade; 24 blades per rack
12
Bulldozer Module Architecture
• Two dedicated integer clusters:
– Four operations per clock per cluster
– 16 kB 4-way L1 per cluster
– Dedicated HW for two threads
• Shared resources:
– Two 128-bit FMAC FP pipelines
– 2 MB L2 cache
• Vector length:
– 8 for 32-bit operands
– 4 for 64-bit operands
13
Interlagos Die Architecture
• Four Bulldozer compute modules
• Shared resources:
– 8 MB L3 cache
– Two DDR3 memory channels
– Multiple HT3 links
14
Interlagos Processor Architecture
• Two Interlagos dies packaged in one multi-chip Interlagos processor:
– 8 processor modules = 16 cores
– 16 MB L3 cache
– 4 DDR3-1600 memory channels
15
Interlagos Cache and Memory
• L1 cache
– 16 kB, 4-way, way-predicted, parity protected
– Write-through and inclusive with respect to L2
– Load latency: 4 clock cycles
• L2 cache
– 2 MB, shared within a core module
– Load latency: 18-20 clock cycles
• L3 cache
– 8 MB; 4 sub-caches, one per module
– Data used by multiple modules remains in cache
– Load latency: 55-60 clock cycles
• Memory latency: >90-100 clock cycles
16
NVIDIA Tesla K20X Accelerators
• 1.31 Tflop/s peak DP (3.95 SP)
• 2688 CUDA cores
• 6 GB GDDR5 memory, 250 GB/s
• 16 GB/s PCIe Gen2
• Dynamic parallelism: GPU kernels can launch new work without returning to the CPU
• Hyper-Q: CUDA streams from multiple MPI processes
17
Cray XK7 blades
Each XK7 blade consists of 4 nodes:
• 4 NVIDIA Tesla K20X GPUs
• 4 16-core AMD Opterons
• 2 Cray Gemini interconnects
18
Two Processes per Compute Unit
• One MPI task pinned to each integer unit
• Exclusive access to:
– Integer scheduler
– Integer pipelines
– L1 cache
• Shared access to:
– 128-bit FP units
– L2 cache
• Use for codes that:
– scale to large numbers of MPI ranks
– use < 2 GB of memory per rank
– do not vectorize

[Diagram: MPI tasks 0 and 1 on one compute unit, sharing the FP units and L2]
19
One Process per Compute Unit
• One MPI task pinned to one integer unit
• Exclusive access to:
– Integer scheduler
– Integer pipelines
– L1 cache
– 256-bit FP unit
– Full L2 cache
• Use for codes that:
– are highly vectorized
– need > 2 GB of memory per rank and benefit from high memory bandwidth
• Requires pinning of processes

[Diagram: one MPI task per compute unit; one integer core active, the other inactive]
20
One Process and Two Threads per Compute Unit
• One MPI task pinned to the compute unit
• One thread pinned to each integer unit
• Exclusive access (per thread) to:
– Integer scheduler
– Integer pipelines
– L1 cache
• Shared access to:
– 128-bit FP units
– L2 cache
• Use for codes that:
– need > 2 GB of memory per rank
– can exploit thread parallelism
• Requires pinning of processes

[Diagram: one MPI task per compute unit with threads 0 and 1 sharing the FP units and L2]
21
NUMA Considerations
• Each XK7 Interlagos chip has 2 NUMA memory domains
• Each NUMA domain has 4 Bulldozer modules
• Memory access across NUMA domains:
– Lower bandwidth
– Increased latency
• OpenMP performance is best with threads kept within a NUMA domain (see the first-touch sketch below):
– 8 threads per MPI process in dual-stream mode
– 4 threads per MPI process in single-stream mode
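A minimal, hedged C/OpenMP sketch of one way to keep data in the local NUMA domain: pages are typically placed in the memory of the thread that first touches them, so initializing an array with the same loop schedule used later for computation keeps each thread's pages local. The array name and size are illustrative, not from the slides.

#include <stdlib.h>

/* First touch decides page placement: use the same static schedule
 * for initialization and for the later compute loop. */
void first_touch_init(double *a, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;                 /* this thread's pages land in its NUMA domain */
}

void scale(double *a, long n, double s)
{
    #pragma omp parallel for schedule(static)   /* same schedule: same pages, same threads */
    for (long i = 0; i < n; i++)
        a[i] *= s;
}

int main(void)
{
    long n = 1L << 26;
    double *a = malloc(n * sizeof(double));     /* pages not yet placed */
    first_touch_init(a, n);
    scale(a, n, 2.0);
    free(a);
    return 0;
}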
22
Hybrid Programming Model
• On Titan today, with 299,008 cores, we are seeing the limits of a single level of MPI scaling for most applications
• To take advantage of the vast parallelism in Titan, users need hierarchical parallelism in their codes (a minimal sketch follows below):
– Distributed memory: MPI, SHMEM, PGAS
– Node local: OpenMP, Pthreads, local MPI communicators
– Within threads: vector constructs on the GPU, libraries, CPU SIMD
• These are the same types of constructs needed on all multi-petaflops computers to scale to the full size of the systems!
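A minimal sketch of the first two levels of this hierarchy, assuming a C code with MPI across nodes and OpenMP threads within a node; the loop and problem size are placeholders, not any specific application.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* Request the thread support level needed when OpenMP runs inside MPI ranks */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1000000;              /* local (per-rank) problem size */
    double local_sum = 0.0, global_sum = 0.0;

    /* Node-local level: OpenMP threads (the innermost level would be SIMD or GPU work) */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < n; i++)
        local_sum += (double)i * rank;

    /* Distributed-memory level: MPI collective across all ranks */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%e\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}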
23
How do you program these nodes?
• Compilers
– OpenACC is a set of compiler directives for expressing hierarchical parallelism in the source code, letting the compiler generate parallel code for the target platform (GPU, MIC, vector SIMD); a short sketch follows below
– The Cray compiler supports XK7 nodes and is OpenACC compatible
– The CAPS HMPP compiler supports C, C++, and Fortran compilation for heterogeneous nodes and is adding OpenACC support
– The PGI compiler supports OpenACC and CUDA Fortran
• Tools
– The Allinea DDT debugger scales to full system size and, with ORNL support, will be able to debug heterogeneous (x86/GPU) applications
– ORNL has worked with the Vampir team at TU Dresden to add support for profiling codes on heterogeneous nodes
– CrayPAT and Cray Apprentice support XK7 programming
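A hedged illustration of the OpenACC directives mentioned above: the annotated loop is offloaded to the accelerator (or vectorized for the chosen target), with data clauses managing host-device movement. The routine and array names are illustrative only.

#include <stdlib.h>

void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* copyin/copy clauses describe the host<->device data movement */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 3.0f, x, y);
    free(x); free(y);
    return 0;
}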
24
Titan Tool Suite
• Compilers: Cray, PGI, CAPS HMPP, PathScale, NVIDIA CUDA, GNU, Intel
• Performance tools: CrayPAT, Apprentice, Vampir, VampirTrace, TAU, HPCToolkit, CUDA Profiler
• GPU libraries: MAGMA, CULA, Trilinos, libSci
• Debuggers: DDT, NVIDIA, gdb
• Source code: HMPP Wizard
25
OLCF Scientific Computing Group
23 staff and postdoctoral fellows
• Computational science liaisons
– Astrophysics
– Biophysics
– Chemical physics
– Climate sciences
– Fluid dynamics
– Materials science
– Mathematics
– Mechanical engineering
– Nuclear physics
• Visualization liaisons
– Data analytics & visualization
– EVEREST visualization laboratory
• Data liaisons
– ADIOS I/O
– Data management and workflow
• Open job requisitions
– Biophysics
– Materials science
– Visualization liaison and task lead
26
LAMMPS: Large-scale, massively parallel molecular dynamics

Code description
• Classical N-body atomistic modeling
• Force fields available for chemical, biological, and materials applications
• Long-range electrostatics evaluated using a PPPM solver; the 3D FFT in the particle-mesh solver limits scaling

Porting strategy
• For the PPPM solver, replace the 3D FFT with grid-based algorithms that reduce inter-process communication
• Parallelism through domain decomposition of the particle-mesh grid
• Accelerated code builds with OpenCL or CUDA

Insights into the molecular mechanism of membrane fusion from simulation. Stevens et al., PRL 91 (2003)
27
S3D: Direct Numerical Simulation of Turbulent Combustion

Code description
• Compressible Navier-Stokes equations
• 3D Cartesian grid, 8th-order finite differences
• Explicit 4th-order Runge-Kutta integration
• Fortran; 3D domain decomposition with non-blocking MPI

Porting strategy
• Hybrid MPI / OpenMP / OpenACC application
• All intensive calculations can be on the accelerator
• Redesign message passing to overlap communication and computation (a generic sketch of this pattern follows below)

DNS provides unique fundamental insight into the chemistry-turbulence interaction.
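A hedged, generic C sketch of the overlap pattern named above (not S3D source): post non-blocking halo exchanges, update interior points while the messages are in flight, then wait and update the boundary. Neighbor ranks, buffer names, and plane sizes are hypothetical.

#include <mpi.h>

void exchange_and_compute(double *field, double *halo_lo, double *halo_hi,
                          int plane, int nplanes, int lo_nbr, int hi_nbr,
                          MPI_Comm comm)
{
    MPI_Request req[4];

    /* 1. Start the halo exchange with both neighbors (non-blocking) */
    MPI_Irecv(halo_lo, plane, MPI_DOUBLE, lo_nbr, 0, comm, &req[0]);
    MPI_Irecv(halo_hi, plane, MPI_DOUBLE, hi_nbr, 1, comm, &req[1]);
    MPI_Isend(&field[0],                     plane, MPI_DOUBLE, lo_nbr, 1, comm, &req[2]);
    MPI_Isend(&field[(nplanes - 1) * plane], plane, MPI_DOUBLE, hi_nbr, 0, comm, &req[3]);

    /* 2. Update interior points that do not depend on halo data;
     *    this work overlaps with the communication above */
    /* ... interior stencil updates ... */

    /* 3. Complete the exchange, then update the boundary planes */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    /* ... boundary stencil updates ... */
}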
28
DENOVO: 3D Neutron Transport for Nuclear Reactor Design

Code description
• Linear Boltzmann radiation transport
• Discrete ordinates method
• Iterative eigenvalue solution
• Multigrid, preconditioned linear solvers
• C++ with F95 kernels

Porting strategy
• SWEEP kernel rewritten in C++ & CUDA; runs on CPU or GPU
• Scaling to over 200K cores, with opportunities for increased parallelism on GPUs
• Reintegrate SWEEP into DENOVO

DENOVO is a component of the DOE CASL Hub and is necessary to achieve the CASL challenge problems.
29
Wang-Landau LSMS: First-principles statistical mechanics of magnetic materials

Code description
• Combines classical statistical mechanics (W-L) for atomic magnetic moment distributions with first-principles calculations (LSMS) of the associated energies
• Main computational effort is dense linear algebra for complex numbers
• F77 with some F90, and C++ for the statistical mechanics driver

Porting strategy
• Leverage accelerated linear algebra libraries, e.g., cuBLAS + CULA, LibSci_acc (a generic cuBLAS sketch follows below)
• Parallelization over (1) W-L Monte Carlo walkers, (2) atoms through MPI processes, (3) OpenMP on CPU sections
• Restructure communications: moved outside the energy loop

Compute magnetic structure and thermodynamics of low-dimensional magnetic structures.
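A hedged sketch (not WL-LSMS source) of offloading a dense complex matrix-matrix multiply, the kind of kernel described above, to the GPU with cuBLAS ZGEMM. The matrix dimension and host arrays are illustrative; error checking is omitted for brevity.

#include <cuda_runtime.h>
#include <cublas_v2.h>

void zgemm_on_gpu(int n, const cuDoubleComplex *A, const cuDoubleComplex *B,
                  cuDoubleComplex *C)
{
    cuDoubleComplex *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(cuDoubleComplex);
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    /* C = 1.0 * A * B + 0.0 * C (column-major, no transpose) */
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dC, n);
    cublasDestroy(handle);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}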
30
Effectiveness of Heterogeneous Nodes?
OLCF-3 Early Science Codes: Performance on Titan (XK7)

Application                                             Cray XK7 vs. Cray XE6 performance ratio*
LAMMPS* (molecular dynamics)                            7.4
S3D (turbulent combustion)                              2.2
Denovo (3D neutron transport for nuclear reactors)      3.8
WL-LSMS (statistical mechanics of magnetic materials)   3.8

Titan: Cray XK7 (Kepler GPU plus AMD 16-core Opteron CPU). Cray XE6: 2x AMD 16-core Opteron CPUs.
*Performance depends strongly on the specific problem size chosen.
31
Applications from Community Efforts: Current Performance Measurements on Titan

Application                                             Cray XK7 vs. Cray XE6 performance ratio*
AWP-ODC (seismology)                                    2.1
DCA++ (condensed matter physics)                        4.4
QMCPACK (electronic structure)                          2.0
RMG (DFT: real-space, multigrid; electronic structure)  2.0
XGC1 (plasma physics for fusion energy R&D)             1.8

Titan: Cray XK7 (Kepler GPU plus AMD 16-core Opteron CPU). Cray XE6: 2x AMD 16-core Opteron CPUs.
*Performance depends strongly on the specific problem size chosen.
32
Argonne Leadership Computing Facility
• Mira – BG/Q system
– 49,152 nodes / 786,432 cores
– 786 TB of memory
– Peak flop rate: 10 PF
• Cetus – BG/Q system
– 1,024 nodes / 16,384 cores
– 16 TB of memory
– Peak flop rate: 209 TF
• Vesta – BG/Q system
– 2,048 nodes / 32,768 cores
– 32 TB of memory
– Peak flop rate: 419 TF
• Tukey – NVIDIA system
– 96 nodes / 1,536 AMD CPU cores
– 192 NVIDIA Tesla M2070 GPUs
– 6 TB memory / 1.1 TB GPU memory
– GPU peak flop rate: 99 TF
• Storage
– Scratch: 28.8 PB raw capacity, 240 GB/s bandwidth (GPFS)
– Home: 1.8 PB raw capacity
– Tape: 16 PB of archival storage, 15,906-volume tape archive (HPSS)
33
Mira Parallelism
• 48 racks of 1,024 nodes
• Node: PowerPC A2 CPU
– 16 cores
– 4 HW threads/core
• Max hardware concurrency: 3,145,728 (49,152 nodes x 16 cores x 4 threads)
• 5D torus interconnect
– Bisection bandwidth (32 racks): 13.1 TB/s
– Latency (MPI zero-length, nearest-neighbor node): 2.2 µs
34
Blue Gene/Q Chip
• Based on a simple, power-efficient PowerPC core
– 64-bit
– 1.6 GHz @ 0.8 V
• Unique BG/Q ASIC special features:
– 4-wide SIMD floating-point unit (QPX)
– Transactional memory & speculative execution
– Fast memory-based atomic operations
– Stream- and list-based prefetching
– Universal performance counters
35
Programming the ALCF BG/Qs
• Languages
– Fortran, C, C++, Python
– IBM XL and GNU compilers (and LLVM/Clang)
– Cross-compile on the login node
• Message Passing Interface (MPI)
– Derivative of MPICH (MPI 2.2)
• Threads
– OpenMP 2.5
– NPTL Pthreads
36
Programming the ALCF BG/Qs (cont'd)
• Linux development environment
– Compute Node Kernel gives a Linux look and feel
– POSIX routines (some restrictions: no fork() or system())
• QPX vector intrinsics: vec_ld, vec_add, vec_madd, ... (a short sketch follows after this list)
• Topology interfaces
– e.g., MPIX_* functions
• Run modes: combinations of
– MPI ranks/node = {1, 2, 4, ..., 64}
– Threads/node = {1, 2, 4, ..., 64}
• Run using the Cobalt batch-job system
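A hedged sketch of the QPX intrinsics named above, as they appear in IBM XL C on BG/Q (not portable code). It computes y[i] += a*x[i] four doubles at a time; the assumptions that n is a multiple of 4 and that the arrays are 32-byte aligned are mine, not from the slides.

/* Compile with the IBM XL C compiler on BG/Q; vector4double is an XL built-in type */
void qpx_axpy(long n, double a, const double *x, double *y)
{
    vector4double va = vec_splats(a);                   /* replicate a into all 4 slots */
    for (long i = 0; i < n; i += 4) {
        vector4double vx = vec_ld(0, (double *)&x[i]);  /* load 4 doubles               */
        vector4double vy = vec_ld(0, (double *)&y[i]);
        vy = vec_madd(va, vx, vy);                      /* fused multiply-add: a*x + y  */
        vec_st(vy, 0, &y[i]);                           /* store 4 doubles              */
    }
}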
37
Performance Measurement Tools
Tool name        Source        Provides
BGPM             IBM           HW counter API
PAPI             UTK           HW counter API
gprof            GNU/IBM       Timing (sample)
TAU              Univ. Oregon  Timing (inst, sample), MPI, HW counters (inst)
Rice HPCToolkit  Rice Univ.    Timing (sample), HW counters (sample)
Scalasca         Juelich       Timing (inst), MPI
OpenSpeedShop    Krell         Timing (sample), HW counters, MPI, IO
IBM HPCT         IBM           MPI, HW counters
mpiP             LLNL          MPI
Darshan          ANL           IO
Jumpshot         ANL           MPI
38
Simple Library for HW Perf. Access
LDRFLAGS = -L/soft/perftools/hpctw/lib -lmpihpm_r

call hpm_start('timesteps')
...
call hpm_stop('timesteps')

Output files produced:
hpm_job_summary.246094.0
hpm_process_summary.246094.0
hpm_process_summary.246094.9194
hpm_process_summary.246094.13207
hpm_process_summary.246094.15804
mpi_profile.246094.0
mpi_profile.246094.9194
mpi_profile.246094.13207
mpi_profile.246094.15804
39
Simple Library for HW Perf. Access (cont’d)
hpm_job_summary.246094.0:
======================================================================
Aggregate BG/Q counter data for jobid 246094
This report includes all 16384 processes in MPI_COMM_WORLD
and a total of 8192 cores
======================================================================
...
timesteps, call count = 1, avg cycles = 774242758762, max cycles = 775421013059 :
-- Counter values for processes in this reporting group ----
   min-value  min-rank     max-value  max-rank     avg-value  label
2.588019e+09     15836  2.617834e+09      8188  2.596955e+09  Committed Load Misses
2.242292e+10     15836  2.602456e+10      9194  2.364120e+10  Committed Cacheable Loads
4.671928e+08       490  4.964638e+08      8188  4.750988e+08  L1p miss
4.693814e+10     14812  5.904277e+10      9194  5.112951e+10  All XU Instruction Completions
1.096274e+11        43  1.102540e+11      7552  1.098924e+11  All AXU Instruction Completions
4.839854e+11      4096  4.852488e+11      2611  4.845228e+11  FP Operations Group 1
...
Derived metrics for code block "timesteps" averaged over process(es) in the reporting group
Instruction mix: FPU = 68.25 %, FXU = 31.75 %
Instructions per cycle completed per core = 0.4153
Per cent of max issue rate per core = 28.34 %
Total weighted TFlops = 16.380
40
Simple Library for HW Perf. Access (cont’d)
hpm_process_summary.246094.13207:
...
Derived metrics for code block "timesteps" averaged over process(es) on node <3,0,3,2,0>:
Instruction mix: FPU = 67.83 %, FXU = 32.17 %
Instructions per cycle completed per core = 0.4178
Per cent of max issue rate per core = 28.34 %
Total weighted GFlops for this node = 32.008
Loads that hit in L1 d-cache = 89.14 %
                  L1P buffer =  8.87 %
                  L2 cache   =  0.00 %
                  DDR        =  2.00 %
DDR traffic for the node: ld = 6.691, st = 5.807, total = 12.497 (Bytes/cycle)
41
Simple Library for HW Perf. Access (cont’d)
mpi_profile.246094.0:
Data for MPI rank 0 of 16384
Times and statistics from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------
MPI Routine                  #calls     avg. bytes      time(sec)
-----------------------------------------------------------------
MPI_Comm_size                     1            0.0          0.000
MPI_Comm_rank                     1            0.0          0.000
MPI_Send                       1260       512000.0         10.646
MPI_Irecv                      1260       512000.0          0.011
MPI_Wait                       1260            0.0          7.102
MPI_Reduce                       20           34.8          1.343
-----------------------------------------------------------------
total communication time = 19.101 seconds.
total elapsed time       = 500.470 seconds.
heap memory used         = 284.430 MBytes.
heap memory available    = 191.551 MBytes.
...
42
Optimize Your Application for ALCF
• Measure performance
– Time-based profile
– MPI profile
– Performance counter data for critical routines
• Improve performance (a small before/after sketch follows below)
– Threads: introduce/tune
– Cache, L1 prefetcher: reorganize/align data structures, restructure loops
– QPX: re-order, unroll, fuse loops (for the compiler); call intrinsics
– Communication: load balance, collectives, rank-to-torus mapping
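A hedged before/after C sketch of the "fuse loops, align data" advice above; the names, sizes, and GCC/XL alignment syntax are illustrative choices, not prescriptions from the slides. Fusing the two sweeps halves the traffic over a[], and the 32-byte alignment helps the compiler emit aligned vector (QPX) loads.

#define N 4096
static double a[N] __attribute__((aligned(32)));
static double b[N] __attribute__((aligned(32)));

/* Before: two separate passes over a[] */
void two_passes(double s)
{
    for (int i = 0; i < N; i++) a[i] *= s;
    for (int i = 0; i < N; i++) b[i] += a[i];
}

/* After: one fused pass; a[i] is loaded once and reused from registers */
void fused_pass(double s)
{
    for (int i = 0; i < N; i++) {
        a[i] *= s;
        b[i] += a[i];
    }
}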
43
Optimize Your Application for ALCF (cont’d)
• Work with us:
– Have an INCITE project? Use your Catalyst.
– Working up to an INCITE project? Come to a workshop.
– Work with our Performance Engineering team.
• Hands-on workshops
• Webinars
44
Case Study: Plasma Simulation
• Alfven turbulence as a solar coronal heating mechanism
• Reduced MHD
• At a hands-on workshop, the PI worked with a TAU developer and a Catalyst:
– Analyzed performance
– Improved performance 7x
– Revised the communication volley
– Used an optimized FFT library
45
Case Study: Cosmology Simulation
• Gravitational simulation of evolving mass structure in the universe
• Long-range force: particle-mesh
• Worked with an ALCF postdoc, performance engineer, and I/O expert:
– New RCB-tree algorithm for the short-range force
– 2D FFT decomposition
– I/O optimization
– QPX and cache optimization
– Gordon Bell finalist, SC13
• 2012: >69% of peak on Mira; largest cosmology simulation ever

[Image: simulation zooms spanning 1000 Mpc, 100 Mpc, 20 Mpc, and 2 Mpc scales]
46
Case Study: Molecular Dynamics
• Identified the source of antibiotic resistance in a major family of bacteria
– The NDM-1 enzyme cavity binds many antibiotics
– It cuts the carbapenem ring, destroying the antibiotic's effect
• NAMD classical MD code
– Built on the Charm++ framework
• ALCF integrated NAMD with the IBM PAMI version of Charm++
– 40% speedup
47
Case Study: Ignition in H2 O2 Mixture
• Compressible reactive-flow Navier-Stokes equations
• Dynamic adaptive mesh refinement
• Worked with an ALCF Catalyst:
– Introduced OpenMP threads
– Revamped the rebalance algorithm
• Sped up 100x via new algorithms with one-sided communications
– Overall speedup vs. BG/P: 2.5x/core, 9.2x/node
• First-ever simulation of weak ignition
– No theory exists; known experimentally since the 1960s
48
Case Study: Astrophysics
• FLASH: finite-volume Eulerian code & framework
– Multi-physics
– Simulates Type Ia supernovae (one application)
• Introduced OpenMP threads
• Analyzed and tuned
– Used the libmpihpm library
– Optimizations: thread granularity, MASS library, array reordering, ...
Graph from: Petascale Simulations of Turbulent Nuclear Combustion: ALCF-2 Early Science Program Technical Report, ANL/ALCF/ESP-13/10, May 2013
49
Contacts
For details about the INCITE program:
http://www.doeleadershipcomputing.org (general) http://proposals.doeleadershipcomputing.org (proposal site) [email protected]