SC13 BoF INCITE and Leadership-class
Systems
Julia C. White, INCITE Program Manager [email protected]
TP Straatsma, Group Leader, Scientific Computing, Oak Ridge Leadership Computing Facility
Tim Williams, Catalyst, Argonne Leadership Computing Facility
2
Three primary ways to access the LCFs
Distribution of allocable hours:
• 60% INCITE (5.8 billion core-hours in CY2014)
• Up to 30% ASCR Leadership Computing Challenge
• 10% Director's Discretionary

[Diagram: leadership-class computing atop DOE/SC capability computing]

INCITE seeks computationally intensive, large-scale research and/or development projects with the potential to significantly advance key areas in science and engineering.
3
INCITE criteria
Access is awarded on a competitive, merit-reviewed basis*
1. Merit criterion: a research campaign with the potential for significant domain and/or community impact
2. Computational leadership criterion: computationally intensive runs that cannot be done anywhere else (capability, architectural needs)
3. Eligibility criterion:
• Allocations are granted regardless of funding source*
• Non-US-based researchers are welcome to apply

*DOE High-End Computing Revitalization Act of 2004: Public Law 108-423
4
INCITE is open to researchers around the world in a broad array of domains:
Accelerator physics, Astrophysics, Bioenergy, Chemical sciences, Climate research, Computer science, Engineering, Environmental science, Fusion energy, Life sciences, Materials science, Nuclear physics

Advancing the state of the art across a range of disciplines. No designated number of hours for a particular science area.
5
INCITE breakthroughs since inception
A few of the many science and engineering advances

Year             2004   2005   2006    2007   2008   2009   2010   2011   2012   2013   2014
Hours allocated  4.9 M  6.5 M  18.2 M  95 M   268 M  889 M  1.6 B  1.7 B  1.7 B  4.7 B  5.8 B
Projects         3      3      15      45     55     66     69     57     60     61     59

Hours requested vs. hours allocated: roughly 2x to 3x per year.

• Unprecedented simulation of a magnitude-8 earthquake over 125 square miles. Proceedings of SC10
• World's first continuous simulation of 21,000 years of Earth's climate history. Science (2009)
• Largest-ever LES of a full-sized commercial combustion chamber used in an existing helicopter turbine
• Largest simulation of a galaxy's worth of dark matter; showed for the first time the fractal-like appearance of dark-matter substructures. Nature (2008), Science (2009)
• OMEN breaks the petascale barrier using more than 220,000 cores. Proceedings of SC10
• NIST proposes new standard reference materials from LCF concrete simulations
• New method to rapidly determine protein structure with limited experimental data. Science (2010), Nature (2011)
• Researchers solved the 2D Hubbard model and presented evidence that it predicts HTSC behavior. Phys. Rev. Lett. (2005)
• Modeling of the molecular basis of Parkinson's disease named the #1 computational accomplishment in Breakthroughs 2008
• Calculation of the number of bound nuclei in nature. Nature (2012)
6
2014 INCITE award statistics
• Request for Information helped attract new projects
• Call closed June 28, 2013
• Total requests: ~14 billion core-hours
• Awards of 5.8 billion core-hours for CY2014
• 59 projects awarded, of which 21 are renewals
• Acceptance rates: 36% of nonrenewal submittals and 91% of renewals

[Chart: PIs by affiliation (awards)]

Contact: Julia C. White, INCITE Manager
7
2014 award statistics, by system

                                     Titan    Mira
Number of projects*                  29       40
Average project (core-hours)         77.6 M   88.2 M
Median project (core-hours)          75 M     77.5 M
Total awards (core-hours in CY2014)  2.25 B   3.53 B

*59 INCITE projects in total; many projects received time on both Mira and Titan.
8
How do you get there?
• Attend an LCF workshop, webinar, etc.
• Apply for Director's Discretionary time:
– www.alcf.anl.gov/getting-started/apply-for-dd
– www.olcf.ornl.gov/support/getting-started/
Each INCITE project is assigned a staff member to provide scientific support. This ‘science liaison’ at the OLCF or ‘catalyst’ at the ALCF will help you use the resources as effectively as possible and achieve your technical milestones.
9
Leadership-class systems
                              Argonne LCF        Oak Ridge LCF
Name                          Mira               Titan
System                        IBM Blue Gene/Q    Cray XK7
Compute nodes                 49,152             18,688
Node architecture             PowerPC, 16 cores  AMD Opteron, 16 cores + NVIDIA Kepler GPU
Processing cores              786,432            299,008
Memory per node (gigabytes)   16                 32 + 6
Peak performance (petaflops)  10                 27
10
Oak Ridge Leadership Computing Facility
• Titan – Cray XK7
– 18,688 nodes, AMD Opteron 6274 / NVIDIA K20X
– 688 TB of memory
– Peak flop rate: 27 PF
• Eos – Cray XC30
– 744 nodes, Intel Xeon E5-2670
– 48 TB of memory
– 248 TF
• Storage
– Spider Lustre filesystem: 40 PB, >1 TB/s bandwidth
– HPSS archival mass storage: 240 PB, 6 tape libraries
• Rhea – analytics and visualization cluster
• EVEREST – visualization laboratory
– Stereoscopic 6x3 1920x1080 display wall, 30.5' x 8.5'
– Planar 4x4 1920x1080 display wall
– Distributed-memory Linux cluster
11
Cray XK7 Compute Node
XK7 Compute Node Characteristics
• AMD Opteron 6274 "Interlagos": 16-core processor @ 2.2 GHz
• NVIDIA Tesla K20X "Kepler": 1.31 TF, 2688 CUDA cores
– 6 GB GDDR5 ECC memory, 250 GB/s memory bandwidth
• Host memory: 32 GB of 1600 MHz DDR3 ECC memory
• Gemini high-speed interconnect
• Four compute nodes per XK7 blade; 24 blades per rack
12
Bulldozer Module Architecture
• Two dedicated integer clusters:
– Four operations per clock per cluster
– 16 kB 4-way L1 per cluster
– Dedicated HW for two threads
• Shared resources:
– Two 128-bit FMAC FP pipelines
– 2 MB L2 cache
• Vector length:
– 8 for 32-bit operands
– 4 for 64-bit operands
13
Interlagos Die Architecture
• Four Bulldozer compute modules
• Shared resources:
– 8 MB L3 cache
– Two DDR3 memory channels
– Multiple HT3 links
14
Interlagos Processor Architecture
• Two Interlagos dies packaged in one multi-chip Interlagos processor:
– 8 processor modules = 16 cores
– 16 MB L3 cache
– 4 DDR3-1600 memory channels
15
Interlagos Cache and Memory
• L1 cache
– 16 kB, 4-way, way-predicted, parity protected
– Write-through and inclusive with respect to L2
– Load latency: 4 clock cycles
• L2 cache
– 2 MB, shared within a core module
– Load latency: 18-20 clock cycles
• L3 cache
– 8 MB; 4 sub-caches, one per module
– Data used by multiple modules remains in cache
– Load latency: 55-60 clock cycles
• Memory latency: >90-100 clock cycles
16
NVIDIA Tesla K20X Accelerators
• 1.31 Tflop/s peak DP (3.95 SP)
• 2688 CUDA cores
• 6 GB GDDR5 memory, 250 GB/s
• 16 GB/s PCIe Gen2
• Dynamic parallelism: GPU kernels can launch new work without returning to the CPU
• Hyper-Q: CUDA streams from multiple MPI processes
17
Cray XK7 blades
Each XK7 blade consists of 4 nodes:
• 4 NVIDIA Tesla K20X GPUs
• 4 16-core AMD Opterons
• 2 Cray Gemini interconnects
18
Two Processes per Compute Unit
• One MPI task pinned to each integer unit
• Exclusive access to:
– Integer scheduler
– Integer pipelines
– L1 cache
• Shared access to:
– 128-bit FP units
– L2 cache
• Use for codes that:
– scale to large numbers of MPI ranks
– use < 2 GB of memory per rank
– do not vectorize

[Diagram: MPI tasks 0 and 1 on one compute unit, sharing the FP units and L2]
19
One Process per Compute Unit
• One MPI task pinned to one integer unit
• Exclusive access to:
– Integer scheduler
– Integer pipelines
– L1 cache
– 256-bit FP unit
– Full L2 cache
• Use for codes that:
– are highly vectorized
– need > 2 GB of memory per rank and benefit from high memory bandwidth
• Requires pinning of processes

[Diagram: one MPI task per compute unit; one integer core active, the other inactive]
20
One Process and Two Threads per Compute Unit
• One MPI task pinned to the compute unit
• One thread pinned to each integer unit
• Exclusive access (per thread) to:
– Integer scheduler
– Integer pipelines
– L1 cache
• Shared access to:
– 128-bit FP units
– L2 cache
• Use for codes that:
– need > 2 GB of memory per rank
– can exploit thread parallelism
• Requires pinning of processes

[Diagram: one MPI task per compute unit with threads 0 and 1 sharing the FP units and L2]
21
NUMA Considerations
• Each XK7 Interlagos chip has 2 NUMA memory domains
• Each NUMA domain has 4 Bulldozer modules
• Memory access across NUMA domains:
– Lower bandwidth
– Increased latency
• OpenMP performance is best with threads kept within a NUMA domain (see the first-touch sketch below):
– 8 threads per MPI process in dual-stream mode
– 4 threads per MPI process in single-stream mode
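A minimal, hedged C/OpenMP sketch of one way to keep data in the local NUMA domain: pages are typically placed in the memory of the thread that first touches them, so initializing an array with the same loop schedule used later for computation keeps each thread's pages local. The array name and size are illustrative, not from the slides.

#include <stdlib.h>

/* First touch decides page placement: use the same static schedule
 * for initialization and for the later compute loop. */
void first_touch_init(double *a, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;                 /* this thread's pages land in its NUMA domain */
}

void scale(double *a, long n, double s)
{
    #pragma omp parallel for schedule(static)   /* same schedule: same pages, same threads */
    for (long i = 0; i < n; i++)
        a[i] *= s;
}

int main(void)
{
    long n = 1L << 26;
    double *a = malloc(n * sizeof(double));     /* pages not yet placed */
    first_touch_init(a, n);
    scale(a, n, 2.0);
    free(a);
    return 0;
}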
22
Hybrid Programming Model
• On Titan today, with 299,008 cores, we are seeing the limits of a single level of MPI scaling for most applications
• To take advantage of the vast parallelism in Titan, users need hierarchical parallelism in their codes (a minimal sketch follows below):
– Distributed memory: MPI, SHMEM, PGAS
– Node local: OpenMP, Pthreads, local MPI communicators
– Within threads: vector constructs on the GPU, libraries, CPU SIMD
• These are the same types of constructs needed on all multi-petaflops computers to scale to the full size of the systems!
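A minimal sketch of the first two levels of this hierarchy, assuming a C code with MPI across nodes and OpenMP threads within a node; the loop and problem size are placeholders, not any specific application.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* Request the thread support level needed when OpenMP runs inside MPI ranks */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1000000;              /* local (per-rank) problem size */
    double local_sum = 0.0, global_sum = 0.0;

    /* Node-local level: OpenMP threads (the innermost level would be SIMD or GPU work) */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < n; i++)
        local_sum += (double)i * rank;

    /* Distributed-memory level: MPI collective across all ranks */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%e\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}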
23
How do you program these nodes?
• Compilers
– OpenACC is a set of compiler directives for expressing hierarchical parallelism in the source code, letting the compiler generate parallel code for the target platform (GPU, MIC, vector SIMD); a short sketch follows below
– The Cray compiler supports XK7 nodes and is OpenACC compatible
– The CAPS HMPP compiler supports C, C++, and Fortran compilation for heterogeneous nodes and is adding OpenACC support
– The PGI compiler supports OpenACC and CUDA Fortran
• Tools
– The Allinea DDT debugger scales to full system size and, with ORNL support, will be able to debug heterogeneous (x86/GPU) applications
– ORNL has worked with the Vampir team at TU Dresden to add support for profiling codes on heterogeneous nodes
– CrayPAT and Cray Apprentice support XK7 programming
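A hedged illustration of the OpenACC directives mentioned above: the annotated loop is offloaded to the accelerator (or vectorized for the chosen target), with data clauses managing host-device movement. The routine and array names are illustrative only.

#include <stdlib.h>

void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* copyin/copy clauses describe the host<->device data movement */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 3.0f, x, y);
    free(x); free(y);
    return 0;
}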
24
Titan Tool Suite
• Compilers: Cray, PGI, CAPS HMPP, PathScale, NVIDIA CUDA, GNU, Intel
• Performance tools: CrayPAT, Apprentice, Vampir, VampirTrace, TAU, HPCToolkit, CUDA Profiler
• GPU libraries: MAGMA, CULA, Trilinos, libSci
• Debuggers: DDT, NVIDIA, gdb
• Source code: HMPP Wizard
25
OLCF Scientific Computing Group
23 staff and postdoctoral fellows
• Computational science liaisons
– Astrophysics
– Biophysics
– Chemical physics
– Climate sciences
– Fluid dynamics
– Materials science
– Mathematics
– Mechanical engineering
– Nuclear physics
• Visualization liaisons
– Data analytics & visualization
– EVEREST visualization laboratory
• Data liaisons
– ADIOS I/O
– Data management and workflow
• Open job requisitions
– Biophysics
– Materials science
– Visualization liaison and task lead
26
LAMMPS: Large-scale, massively parallel molecular dynamics

Code description
• Classical N-body atomistic modeling
• Force fields available for chemical, biological, and materials applications
• Long-range electrostatics evaluated using a PPPM solver; the 3D FFT in the particle-mesh solver limits scaling

Porting strategy
• For the PPPM solver, replace the 3D FFT with grid-based algorithms that reduce inter-process communication
• Parallelism through domain decomposition of the particle-mesh grid
• Accelerated code builds with OpenCL or CUDA

Insights into the molecular mechanism of membrane fusion from simulation. Stevens et al., PRL 91 (2003)
27
S3D: Direct Numerical Simulation of Turbulent Combustion

Code description
• Compressible Navier-Stokes equations
• 3D Cartesian grid, 8th-order finite differences
• Explicit 4th-order Runge-Kutta integration
• Fortran; 3D domain decomposition with non-blocking MPI

Porting strategy
• Hybrid MPI / OpenMP / OpenACC application
• All intensive calculations can be on the accelerator
• Redesign message passing to overlap communication and computation (a generic sketch of this pattern follows below)

DNS provides unique fundamental insight into the chemistry-turbulence interaction.
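A hedged, generic C sketch of the overlap pattern named above (not S3D source): post non-blocking halo exchanges, update interior points while the messages are in flight, then wait and update the boundary. Neighbor ranks, buffer names, and plane sizes are hypothetical.

#include <mpi.h>

void exchange_and_compute(double *field, double *halo_lo, double *halo_hi,
                          int plane, int nplanes, int lo_nbr, int hi_nbr,
                          MPI_Comm comm)
{
    MPI_Request req[4];

    /* 1. Start the halo exchange with both neighbors (non-blocking) */
    MPI_Irecv(halo_lo, plane, MPI_DOUBLE, lo_nbr, 0, comm, &req[0]);
    MPI_Irecv(halo_hi, plane, MPI_DOUBLE, hi_nbr, 1, comm, &req[1]);
    MPI_Isend(&field[0],                     plane, MPI_DOUBLE, lo_nbr, 1, comm, &req[2]);
    MPI_Isend(&field[(nplanes - 1) * plane], plane, MPI_DOUBLE, hi_nbr, 0, comm, &req[3]);

    /* 2. Update interior points that do not depend on halo data;
     *    this work overlaps with the communication above */
    /* ... interior stencil updates ... */

    /* 3. Complete the exchange, then update the boundary planes */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    /* ... boundary stencil updates ... */
}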
28
DENOVO: 3D Neutron Transport for Nuclear Reactor Design

Code description
• Linear Boltzmann radiation transport
• Discrete ordinates method
• Iterative eigenvalue solution
• Multigrid, preconditioned linear solvers
• C++ with F95 kernels

Porting strategy
• SWEEP kernel rewritten in C++ & CUDA; runs on CPU or GPU
• Scaling to over 200K cores, with opportunities for increased parallelism on GPUs
• Reintegrate SWEEP into DENOVO

DENOVO is a component of the DOE CASL Hub and is necessary to achieve the CASL challenge problems.
29
Wang-Landau LSMS: First-principles statistical mechanics of magnetic materials

Code description
• Combines classical statistical mechanics (W-L) for atomic magnetic moment distributions with first-principles calculations (LSMS) of the associated energies
• Main computational effort is dense linear algebra for complex numbers
• F77 with some F90, and C++ for the statistical mechanics driver

Porting strategy
• Leverage accelerated linear algebra libraries, e.g., cuBLAS + CULA, LibSci_acc (a generic cuBLAS sketch follows below)
• Parallelization over (1) W-L Monte Carlo walkers, (2) atoms through MPI processes, (3) OpenMP on CPU sections
• Restructure communications: moved outside the energy loop

Compute magnetic structure and thermodynamics of low-dimensional magnetic structures.
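A hedged sketch (not WL-LSMS source) of offloading a dense complex matrix-matrix multiply, the kind of kernel described above, to the GPU with cuBLAS ZGEMM. The matrix dimension and host arrays are illustrative; error checking is omitted for brevity.

#include <cuda_runtime.h>
#include <cublas_v2.h>

void zgemm_on_gpu(int n, const cuDoubleComplex *A, const cuDoubleComplex *B,
                  cuDoubleComplex *C)
{
    cuDoubleComplex *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(cuDoubleComplex);
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    /* C = 1.0 * A * B + 0.0 * C (column-major, no transpose) */
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dC, n);
    cublasDestroy(handle);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}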
30
Effectiveness of Heterogeneous Nodes?
OLCF-3 Early Science Codes: Performance on Titan (XK7)

Application                                             Cray XK7 vs. Cray XE6 performance ratio*
LAMMPS* (molecular dynamics)                            7.4
S3D (turbulent combustion)                              2.2
Denovo (3D neutron transport for nuclear reactors)      3.8
WL-LSMS (statistical mechanics of magnetic materials)   3.8

Titan: Cray XK7 (Kepler GPU plus AMD 16-core Opteron CPU). Cray XE6: 2x AMD 16-core Opteron CPUs.
*Performance depends strongly on the specific problem size chosen.
31
Applications from Community Efforts: Current Performance Measurements on Titan

Application                                             Cray XK7 vs. Cray XE6 performance ratio*
AWP-ODC (seismology)                                    2.1
DCA++ (condensed matter physics)                        4.4
QMCPACK (electronic structure)                          2.0
RMG (DFT: real-space, multigrid; electronic structure)  2.0
XGC1 (plasma physics for fusion energy R&D)             1.8

Titan: Cray XK7 (Kepler GPU plus AMD 16-core Opteron CPU). Cray XE6: 2x AMD 16-core Opteron CPUs.
*Performance depends strongly on the specific problem size chosen.
32
Argonne Leadership Computing Facility
• Mira – BG/Q system
– 49,152 nodes / 786,432 cores
– 786 TB of memory
– Peak flop rate: 10 PF
• Cetus – BG/Q system
– 1,024 nodes / 16,384 cores
– 16 TB of memory
– Peak flop rate: 209 TF
• Vesta – BG/Q system
– 2,048 nodes / 32,768 cores
– 32 TB of memory
– Peak flop rate: 419 TF
• Tukey – NVIDIA system
– 96 nodes / 1,536 AMD CPU cores
– 192 NVIDIA Tesla M2070 GPUs
– 6 TB memory / 1.1 TB GPU memory
– GPU peak flop rate: 99 TF
• Storage
– Scratch: 28.8 PB raw capacity, 240 GB/s bandwidth (GPFS)
– Home: 1.8 PB raw capacity
– Tape: 16 PB of archival storage, 15,906-volume tape archive (HPSS)
33
Mira Parallelism
• 48 racks of 1,024 nodes
• Node: PowerPC A2 CPU
– 16 cores
– 4 HW threads/core
• Max hardware concurrency: 3,145,728 (49,152 nodes x 16 cores x 4 threads)
• 5D torus interconnect
– Bisection bandwidth (32 racks): 13.1 TB/s
– Latency (MPI zero-length, nearest-neighbor node): 2.2 µs
34
Blue Gene/Q Chip
• Based on a simple, power-efficient PowerPC core
– 64-bit
– 1.6 GHz @ 0.8 V
• Unique BG/Q ASIC special features:
– 4-wide SIMD floating-point unit (QPX)
– Transactional memory & speculative execution
– Fast memory-based atomic operations
– Stream- and list-based prefetching
– Universal performance counters
35
Programming the ALCF BG/Qs
• Languages
– Fortran, C, C++, Python
– IBM XL and GNU compilers (and LLVM/Clang)
– Cross-compile on the login node
• Message Passing Interface (MPI)
– Derivative of MPICH (MPI 2.2)
• Threads
– OpenMP 2.5
– NPTL Pthreads
36
Programming the ALCF BG/Qs (cont'd)
• Linux development environment
– Compute Node Kernel gives a Linux look and feel
– POSIX routines (some restrictions: no fork() or system())
• QPX vector intrinsics: vec_ld, vec_add, vec_madd, ... (a short sketch follows after this list)
• Topology interfaces
– e.g., MPIX_* functions
• Run modes: combinations of
– MPI ranks/node = {1, 2, 4, ..., 64}
– Threads/node = {1, 2, 4, ..., 64}
• Run using the Cobalt batch-job system
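A hedged sketch of the QPX intrinsics named above, as they appear in IBM XL C on BG/Q (not portable code). It computes y[i] += a*x[i] four doubles at a time; the assumptions that n is a multiple of 4 and that the arrays are 32-byte aligned are mine, not from the slides.

/* Compile with the IBM XL C compiler on BG/Q; vector4double is an XL built-in type */
void qpx_axpy(long n, double a, const double *x, double *y)
{
    vector4double va = vec_splats(a);                   /* replicate a into all 4 slots */
    for (long i = 0; i < n; i += 4) {
        vector4double vx = vec_ld(0, (double *)&x[i]);  /* load 4 doubles               */
        vector4double vy = vec_ld(0, (double *)&y[i]);
        vy = vec_madd(va, vx, vy);                      /* fused multiply-add: a*x + y  */
        vec_st(vy, 0, &y[i]);                           /* store 4 doubles              */
    }
}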
37
Performance Measurement Tools
Tool name        Source        Provides
BGPM             IBM           HW counter API
PAPI             UTK           HW counter API
gprof            GNU/IBM       Timing (sample)
TAU              Univ. Oregon  Timing (inst, sample), MPI, HW counters (inst)
Rice HPCToolkit  Rice Univ.    Timing (sample), HW counters (sample)
Scalasca         Juelich       Timing (inst), MPI
OpenSpeedShop    Krell         Timing (sample), HW counters, MPI, IO
IBM HPCT         IBM           MPI, HW counters
mpiP             LLNL          MPI
Darshan          ANL           IO
Jumpshot         ANL           MPI
38
Simple Library for HW Perf. Access
LDRFLAGS = -L/soft/perftools/hpctw/lib -lmpihpm_r

call hpm_start('timesteps')
...
call hpm_stop('timesteps')

Output files produced:
hpm_job_summary.246094.0
hpm_process_summary.246094.0
hpm_process_summary.246094.9194
hpm_process_summary.246094.13207
hpm_process_summary.246094.15804
mpi_profile.246094.0
mpi_profile.246094.9194
mpi_profile.246094.13207
mpi_profile.246094.15804
39
Simple Library for HW Perf. Access (cont’d)
hpm_job_summary.246094.0:
======================================================================
Aggregate BG/Q counter data for jobid 246094
This report includes all 16384 processes in MPI_COMM_WORLD
and a total of 8192 cores
======================================================================
...
timesteps, call count = 1, avg cycles = 774242758762, max cycles = 775421013059 :
-- Counter values for processes in this reporting group ----
   min-value  min-rank     max-value  max-rank     avg-value  label
2.588019e+09     15836  2.617834e+09      8188  2.596955e+09  Committed Load Misses
2.242292e+10     15836  2.602456e+10      9194  2.364120e+10  Committed Cacheable Loads
4.671928e+08       490  4.964638e+08      8188  4.750988e+08  L1p miss
4.693814e+10     14812  5.904277e+10      9194  5.112951e+10  All XU Instruction Completions
1.096274e+11        43  1.102540e+11      7552  1.098924e+11  All AXU Instruction Completions
4.839854e+11      4096  4.852488e+11      2611  4.845228e+11  FP Operations Group 1
...
Derived metrics for code block "timesteps" averaged over process(es) in the reporting group
Instruction mix: FPU = 68.25 %, FXU = 31.75 %
Instructions per cycle completed per core = 0.4153
Per cent of max issue rate per core = 28.34 %
Total weighted TFlops = 16.380
40
Simple Library for HW Perf. Access (cont’d)
hpm_process_summary.246094.13207:
...
Derived metrics for code block "timesteps" averaged over process(es) on node <3,0,3,2,0>:
Instruction mix: FPU = 67.83 %, FXU = 32.17 %
Instructions per cycle completed per core = 0.4178
Per cent of max issue rate per core = 28.34 %
Total weighted GFlops for this node = 32.008
Loads that hit in L1 d-cache = 89.14 %
                  L1P buffer =  8.87 %
                  L2 cache   =  0.00 %
                  DDR        =  2.00 %
DDR traffic for the node: ld = 6.691, st = 5.807, total = 12.497 (Bytes/cycle)
41
Simple Library for HW Perf. Access (cont’d)
mpi_profile.246094.0:
Data for MPI rank 0 of 16384
Times and statistics from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------
MPI Routine                  #calls     avg. bytes      time(sec)
-----------------------------------------------------------------
MPI_Comm_size                     1            0.0          0.000
MPI_Comm_rank                     1            0.0          0.000
MPI_Send                       1260       512000.0         10.646
MPI_Irecv                      1260       512000.0          0.011
MPI_Wait                       1260            0.0          7.102
MPI_Reduce                       20           34.8          1.343
-----------------------------------------------------------------
total communication time = 19.101 seconds.
total elapsed time       = 500.470 seconds.
heap memory used         = 284.430 MBytes.
heap memory available    = 191.551 MBytes.
...
42
Optimize Your Application for ALCF
• Measure performance
– Time-based profile
– MPI profile
– Performance counter data for critical routines
• Improve performance (a small before/after sketch follows below)
– Threads: introduce/tune
– Cache, L1 prefetcher: reorganize/align data structures, restructure loops
– QPX: re-order, unroll, fuse loops (for the compiler); call intrinsics
– Communication: load balance, collectives, rank-to-torus mapping
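A hedged before/after C sketch of the "fuse loops, align data" advice above; the names, sizes, and GCC/XL alignment syntax are illustrative choices, not prescriptions from the slides. Fusing the two sweeps halves the traffic over a[], and the 32-byte alignment helps the compiler emit aligned vector (QPX) loads.

#define N 4096
static double a[N] __attribute__((aligned(32)));
static double b[N] __attribute__((aligned(32)));

/* Before: two separate passes over a[] */
void two_passes(double s)
{
    for (int i = 0; i < N; i++) a[i] *= s;
    for (int i = 0; i < N; i++) b[i] += a[i];
}

/* After: one fused pass; a[i] is loaded once and reused from registers */
void fused_pass(double s)
{
    for (int i = 0; i < N; i++) {
        a[i] *= s;
        b[i] += a[i];
    }
}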
43
Optimize Your Application for ALCF (cont’d)
• Work with us:
– Have an INCITE project? Use your Catalyst.
– Working up to an INCITE project? Come to a workshop.
– Work with our Performance Engineering team.
• Hands-on workshops
• Webinars
44
Case Study: Plasma Simulation
• Alfven turbulence as a solar coronal heating mechanism
• Reduced MHD
• At a hands-on workshop, the PI worked with a TAU developer and a Catalyst:
– Analyzed performance
– Improved performance 7x
– Revised the communication volley
– Used an optimized FFT library
45
Case Study: Cosmology Simulation
• Gravitational simulation of evolving mass structure in the universe
• Long-range force: particle-mesh
• Worked with an ALCF postdoc, performance engineer, and I/O expert:
– New RCB-tree algorithm for the short-range force
– 2D FFT decomposition
– I/O optimization
– QPX and cache optimization
– Gordon Bell finalist, SC13
• 2012: >69% of peak on Mira; largest cosmology simulation ever

[Image: simulation zooms spanning 1000 Mpc, 100 Mpc, 20 Mpc, and 2 Mpc scales]
46
Case Study: Molecular Dynamics
• Identified the source of antibiotic resistance in a major family of bacteria
– The NDM-1 enzyme cavity binds many antibiotics
– It cuts the carbapenem ring, destroying the antibiotic's effect
• NAMD classical MD code
– Built on the Charm++ framework
• ALCF integrated NAMD with the IBM PAMI version of Charm++
– 40% speedup
47
Case Study: Ignition in H2 O2 Mixture
• Compressible reactive-flow Navier-Stokes equations
• Dynamic adaptive mesh refinement
• Worked with an ALCF Catalyst:
– Introduced OpenMP threads
– Revamped the rebalance algorithm
• Sped up 100x via new algorithms with one-sided communications
– Overall speedup vs. BG/P: 2.5x/core, 9.2x/node
• First-ever simulation of weak ignition
– No theory exists; known experimentally since the 1960s
48
Case Study: Astrophysics
• FLASH: finite-volume Eulerian code & framework
– Multi-physics
– Simulates Type Ia supernovae (one application)
• Introduced OpenMP threads
• Analyzed and tuned
– Used the libmpihpm library
– Optimizations: thread granularity, MASS library, array reordering, ...
Graph from: Petascale Simulations of Turbulent Nuclear Combustion: ALCF-2 Early Science Program Technical Report, ANL/ALCF/ESP-13/10, May 2013
49
Contacts
For details about the INCITE program:
http://www.doeleadershipcomputing.org (general) http://proposals.doeleadershipcomputing.org (proposal site) [email protected]