Beyond ppOpen-HPC: Applications and Algorithms in the Post-K/Post-Moore Era
Kengo Nakajima, Information Technology Center, The University of Tokyo
Accelerated Data and Computing Workshop (ADAC), June 12-14, 2016, Lugano, Switzerland
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
Supercomputers at ITC, U. of Tokyo (total users > 2,000)
Yayoi (Hitachi SR16000/M1), since November 2011
• Total peak performance: 54.9 TFLOPS
• Total number of nodes: 56
• Total memory: 11,200 GB
• Peak performance / node: 980.48 GFLOPS
• Main memory per node: 200 GB
• Disk capacity: 556 TB
• IBM POWER7, 3.83 GHz
Oakleaf-FX (Fujitsu PRIMEHPC FX10), since April 2012
• Total peak performance: 1.13 PFLOPS
• Total number of nodes: 4,800
• Total memory: 150 TB
• Peak performance / node: 236.5 GFLOPS
• Main memory per node: 32 GB
• Disk capacity: 1.1 PB + 2.1 PB
• SPARC64 IXfx, 1.84 GHz
Oakbridge-FX (Fujitsu PRIMEHPC FX10), since April 2014 (special system for long-term jobs up to 168 hours)
• Total peak performance: 136.2 TFLOPS
• Total number of nodes: 576
• Total memory: 18.4 TB
• Peak performance / node: 236.5 GFLOPS
• Main memory per node: 32 GB
• Disk capacity: 147 TB + 295 TB
• SPARC64 IXfx, 1.84 GHz
Supercomputers in ITC/U.Tokyo: two big systems, 6-year cycle (FY2008-2022)
• Hitachi SR11K/J2 (IBM POWER5+): 18.8 TFLOPS, 16.4 TB
• Hitachi HA8000 (T2K, AMD Opteron): 140 TFLOPS, 31.3 TB
• Yayoi: Hitachi SR16000/M1 (IBM POWER7): 54.9 TFLOPS, 11.2 TB
• Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx): 1.13 PFLOPS, 150 TB
• Oakbridge-FX: 136.2 TFLOPS, 18.4 TB
• Oakforest-PACS (JCAHPC: U. Tsukuba & U. Tokyo): Fujitsu, Intel KNL, 25 PFLOPS, 919.3 TB
• K computer / Post-K (national flagship systems, shown for reference)
Post T2K System: Oakforest-PACS (http://jcahpc.jp/pr/pr-en-20160510.html)
• 25 PFLOPS, Fall 2016, by Fujitsu
• 8,208 Intel Xeon Phi (KNL) nodes
  – Full operation starts on December 1, 2016
• Joint Center for Advanced High Performance Computing (JCAHPC, http://jcahpc.jp/)
  – University of Tsukuba
  – University of Tokyo
• The new system will be installed at the Kashiwa-no-Ha (Leaf of Oak) Campus of U.Tokyo, located between Tokyo and Tsukuba
Integrated Supercomputer System for Data Analyses & Scientific Simulations: MPT2K (Mini Post T2K)
• Post T2K (PT2K): Oakforest-PACS (OFP)
  – Operation starts Fall 2016 (delayed)
• MPT2K (Mini PT2K)
  – Additional computational resource until PT2K starts (the FX10 systems are heavily loaded)
  – Pilot system towards the Post-FX10 system after Fall 2018
    • Data analysis, deep learning, etc.
    • New areas of users to be targeted in Post FX10
  – Schedule
    • RFI: mid-August 2015; RFC: late October 2015
    • Final decision: March 22, 2016
    • Operations
      – Phase I for compute nodes (CPU only): July 1, 2016
      – Phase II for full system: March 1, 2017
Reedbush (Mini Post T2K) (1/2)
• SGI was awarded the contract (March 22, 2016)
• Compute nodes (CPU only): Reedbush-U
  – Intel Xeon E5-2695 v4 (Broadwell-EP, 2.1 GHz, 18 cores) x 2 sockets (1.210 TF), 256 GiB (153.6 GB/sec)
  – InfiniBand EDR, full-bisection fat-tree
  – Total system: 420 nodes, 508.0 TF
• Compute nodes (with accelerators): Reedbush-H
  – Intel Xeon E5-2695 v4 (Broadwell-EP, 2.1 GHz, 18 cores) x 2 sockets, 256 GiB (153.6 GB/sec)
  – NVIDIA Pascal GPU (Tesla P100): our first GPU system
    • (4.8-5.3 TF, 720 GB/sec, 16 GiB) x 2 per node
  – InfiniBand FDR x 2 channels (one per GPU), full-bisection fat-tree
  – 120 nodes, 145.2 TF (CPU) + 1.15-1.27 PF (GPU) = 1.30-1.42 PF
Supercomputers in ITC/U.Tokyo: two big systems, 6-year cycle (FY2008-2022), now including Reedbush
• Hitachi SR11K/J2 (IBM POWER5+): 18.8 TFLOPS, 16.4 TB
• Hitachi HA8000 (T2K, AMD Opteron): 140 TFLOPS, 31.3 TB
• Yayoi: Hitachi SR16000/M1 (IBM POWER7): 54.9 TFLOPS, 11.2 TB
• Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx): 1.13 PFLOPS, 150 TB
• Oakbridge-FX: 136.2 TFLOPS, 18.4 TB
• Reedbush (SGI, Broadwell + Pascal): 1.80-1.93 PFLOPS; CSE & Big Data; U.Tokyo's first system with GPUs
• Oakforest-PACS (JCAHPC: U. Tsukuba & U. Tokyo): Fujitsu, Intel KNL, 25 PFLOPS, 919.3 TB
• Post FX10: 50+ PFLOPS (?)
• K computer / Post-K (national flagship systems, shown for reference)
Configuration of Each Compute Node of Reedbush-H
[Diagram: two Intel Xeon E5-2695 v4 (Broadwell-EP) sockets connected by QPI, each with 128 GB DDR4 memory (76.8 GB/sec over 4 channels per socket); two NVIDIA Pascal GPUs connected to each other by NVLink (20 GB/sec) and attached through PCIe Gen3 x16 switches (16 GB/sec); two InfiniBand FDR HCAs, one per PCIe switch, connected to the EDR switch]
Why "Reedbush"?
• "L'homme est un roseau pensant." (Man is a thinking reed.)
• Pensées, Blaise Pascal (1623-1662)
Reedbush (Mini Post T2K) (2/2)
• Storage / file systems
  – Shared parallel file system (Lustre): 5.04 PB, 145.2 GB/sec
  – Fast file cache system: burst buffer (DDN IME, Infinite Memory Engine)
    • SSD: 209.5 TB, 450 GB/sec
• Power, cooling, space
  – Air cooling only; < 500 kVA (without A/C): 378 kVA
  – < 90 m2
• Software & toolkits for data analysis, deep learning, etc.
  – OpenCV, Theano, Anaconda, ROOT, TensorFlow
  – Torch, Caffe, Chainer, GEANT4
Reedbush system configuration (overview)
• Management servers; login nodes x 6 (connected to UTnet and users)
• Interconnect: InfiniBand EDR 4x, full-bisection fat-tree; Mellanox CS7500 634-port + SB7800/7890 36-port x 14
• Parallel file system (Lustre): 5.04 PB, 145.2 GB/sec, DDN SFA14KE x 3
• High-speed file cache system: 209 TB, 436.2 GB/sec, DDN IME14K x 6, dual-port InfiniBand FDR 4x
• Compute nodes: 1.925 PFLOPS in total
  – Reedbush-U (CPU only): 508.03 TFLOPS, 420 nodes, SGI Rackable C2112-4GP3, InfiniBand EDR 4x (100 Gbps/node); CPU: Intel Xeon E5-2695 v4 x 2 sockets (Broadwell-EP, 2.1 GHz, 18 cores, 45 MB L3 cache); memory: 256 GB (DDR4-2400, 153.6 GB/sec)
  – Reedbush-H (with accelerators): 1297.15-1417.15 TFLOPS, 120 nodes, SGI Rackable C1100 series, InfiniBand FDR (56 Gbps x 2/node); CPU: Intel Xeon E5-2695 v4 x 2 sockets; memory: 256 GB (DDR4-2400, 153.6 GB/sec); GPU: NVIDIA Tesla P100 x 2 (Pascal, SXM2, 4.8-5.3 TF, 16 GB, 720 GB/sec, PCIe Gen3 x16, NVLink (for GPU) 20 GB/sec x 2 bricks)
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
ppOpen-HPC: Summary
• ppOpen-HPC is an open-source infrastructure for development and execution of optimized, reliable simulation codes on post-peta-scale (pp) parallel computers based on many-core architectures. It provides automatic tuning (AT) and consists of various types of libraries covering general procedures for scientific computing.
• Application framework with automatic tuning (AT); "pp" stands for post-peta-scale
• Target: Post T2K system (original schedule: FY2015); can be extended to various types of platforms
• Team of 7 institutes, >50 people (5 PDs) from various fields (co-design): ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo, Hokkaido U., Kyoto U., JAMSTEC
Collaborations, Outreach
• International collaborations
  – Lawrence Berkeley National Lab.
  – National Taiwan University
  – National Central University (Taiwan)
  – ESSEX/SPPEXA/DFG, Germany
  – IPCC (Intel Parallel Computing Center)
• Outreach, applications
  – Large-scale simulations
    • Geologic CO2 storage
    • Astrophysics
    • Earthquake simulations, etc.
    • ppOpen-AT, ppOpen-MATH/VIS, ppOpen-MATH/MP, linear solvers
  – International workshops (2012, 2013, 2015)
  – Tutorials, classes
ppOpen-HPC covers:
• Application development frameworks
• Math libraries
• Automatic tuning (AT)
• System software
Schedule of Public Release (with English documents, MIT license)
http://ppopenhpc.cc.u-tokyo.ac.jp/
• New versions are released at each SC conference (and can be downloaded from the web site)
• Multicore/many-core cluster version (flat MPI, OpenMP/MPI hybrid) with documents in English
• We are now focusing on MIC/Xeon Phi
• Collaborations are welcome
• History
  – SC12, Nov 2012 (Ver. 0.1.0)
  – SC13, Nov 2013 (Ver. 0.2.0)
  – SC14, Nov 2014 (Ver. 0.3.0)
  – SC15, Nov 2015 (Ver. 1.0.0)
New Features in Ver. 1.0.0
http://ppopenhpc.cc.u-tokyo.ac.jp/
• HACApK library for H-matrix computations in ppOpen-APPL/BEM (OpenMP/MPI hybrid version)
  – First open-source H-matrix library with OpenMP/MPI hybrid parallelization
• ppOpen-MATH/MP (coupler for multiphysics simulations, loose coupling of FEM & FDM)
• Matrix assembly and linear solvers for ppOpen-APPL/FVM
Support
• FY2011-2015: Post-Peta CREST
  – JST/CREST
• FY2016-2018: ESSEX-II
  – JST/CREST & DFG/SPPEXA (Germany) collaboration
  – ESSEX: Equipping Sparse Solvers for Exascale (FY2013-2015)
    • http://blogs.fau.de/essex/
    • Leading PI: Prof. Gerhard Wellein (U. Erlangen)
  – ESSEX-II (FY2016-2018)
    • ESSEX + U. Tsukuba (Sakurai) + U. Tokyo (Nakajima)
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
[Figure c/o Satoshi Matsuoka (Tokyo Tech)]
Assumptions & Expectations towards the Post-K/Post-Moore Era
• Post-K (-2020)
  – Memory wall
  – Hierarchical memory (e.g. KNL: MCDRAM + DDR); see the allocation sketch after this list
• Post-Moore (-2025? -2029?)
  – Larger memory & cache
  – Higher bandwidth; large & heterogeneous latency
    • 3D stacked memory, optical networks
    • Both memory & network will become more hierarchical
  – Application-customized hardware, FPGA
• Common issues
  – Hierarchy and latency (memory, network, etc.)
  – Large number of nodes, large number of cores per node
    • under certain constraints (e.g. power, space, ...)
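As a concrete illustration of the MCDRAM + DDR hierarchy on KNL, the following is a minimal sketch of explicit data placement using the memkind/hbwmalloc interface. Which arrays deserve high-bandwidth memory is application-specific; the function names below and the choice to place one bandwidth-critical array are assumptions of the sketch, not part of ppOpen-HPC.

```c
/* Minimal sketch of explicit placement in a two-level (MCDRAM + DDR)
 * memory using the memkind/hbwmalloc interface available on KNL nodes.
 * Link with -lmemkind; falls back to ordinary DDR on other systems. */
#include <stdlib.h>
#include <hbwmalloc.h>

double *alloc_bandwidth_critical(size_t n)
{
    if (hbw_check_available() == 0)            /* 0: MCDRAM is present */
        return (double *)hbw_malloc(n * sizeof(double));
    return (double *)malloc(n * sizeof(double));   /* DDR fallback */
}

void free_bandwidth_critical(double *p)
{
    if (hbw_check_available() == 0) hbw_free(p);
    else                            free(p);
}
```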
CAE Applications in the Post-K Era
• Block-structured AMR, voxel-type FEM
  – Semi-structured / semi-unstructured meshes
  – The length of inner loops is fixed, or its variation is rather small
  – ELL, sliced ELL, and SELL-C-σ storage formats can be applied (see the SpMV sketch below)
  – Better than fully unstructured FEM
[Figure: sparse matrix storage formats: CRS, ELL, sliced ELL, SELL-C-σ (SELL-2-8)]
• Robust/scalable GMG/AMG preconditioning
  – Limited or fixed number of non-zero off-diagonals in the coefficient matrices
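To make the fixed inner-loop-length argument concrete, here is a minimal sketch of an SpMV kernel over a SELL-C-σ style chunked layout. The struct, field names, and chunk height are illustrative assumptions rather than the ppOpen-HPC/pK-Open-SOL data structures; padding entries are assumed to carry a zero value and a valid column index (e.g. 0).

```c
#include <stddef.h>

enum { CHUNK = 8 };   /* chunk height C; rows are pre-sorted by length
                         within a window of sigma rows before chunking */

typedef struct {
    int     n_rows;      /* number of matrix rows                        */
    int     n_chunks;    /* ceil(n_rows / CHUNK)                         */
    int    *chunk_len;   /* width (longest row) of each chunk            */
    size_t *chunk_ptr;   /* offset of each chunk in val[] / col[]        */
    double *val;         /* nonzeros, stored column-major inside a chunk */
    int    *col;         /* column indices, same layout; padding uses 0  */
} sell_c_sigma;

/* y = A*x.  The innermost loop runs over the CHUNK rows of one chunk,
 * is unit-stride, and has a fixed trip count, so it vectorizes well on
 * wide-SIMD many-core processors. */
void spmv_sell(const sell_c_sigma *A, const double *x, double *y)
{
    for (int c = 0; c < A->n_chunks; c++) {
        const double *v  = A->val + A->chunk_ptr[c];
        const int    *ci = A->col + A->chunk_ptr[c];
        double tmp[CHUNK] = {0.0};

        for (int j = 0; j < A->chunk_len[c]; j++)
            for (int r = 0; r < CHUNK; r++)
                tmp[r] += v[j * CHUNK + r] * x[ci[j * CHUNK + r]];

        for (int r = 0; r < CHUNK; r++) {
            int row = c * CHUNK + r;
            if (row < A->n_rows) y[row] = tmp[r];
        }
    }
}
```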
Simulation of Geologic CO2 Storage [Dr. Hajime Yamamoto, Taisei Corporation; © Taisei Corporation]
• Fujitsu FX10 (Oakleaf-FX), 30M DOF: 2x-3x improvement
• 30 million DOF (10 million grid nodes x 3 DOF/node)
[Figure: calculation time (sec) vs. number of processors; average time for solving the matrix for one time step, TOUGH2-MP on FX10; 2-3x speedup]
3D Multiphase Flow (Liquid/Gas) + 3D Mass Transfer
pK-Open-HPC: Framework for Exa-Feasible Applications
(pK: Post-K or similar many-core architectures)
• pK-Open-FVM
  – Framework for application development using FVM with block-structured AMR and related utilities
• pK-Open-SOL
  – ELL, sliced ELL, SELL-C-σ
  – Efficient parallel preconditioned iterative solvers for sparse coefficient matrices / H-matrices
    • AMG/GMG preconditioning
    • Low-rank approximation
• pK-Open-AT
  – Auto-tuning (AT) capabilities for selecting optimum computational functions/procedures with optimum combinations of numerical parameters for pK-Open-SOL (a runtime-selection sketch follows below)
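A minimal sketch of the runtime selection idea behind such AT capabilities: time each candidate kernel on the actual data and keep the fastest. The kernel/matrix types and function names are placeholders, not the pK-Open-AT interface, and MPI is assumed to be initialized.

```c
#include <mpi.h>
#include <stdio.h>

typedef struct matrix matrix;                     /* opaque, app-defined */
typedef void (*spmv_fn)(const matrix *, const double *, double *);

/* Benchmark the candidate kernels on the actual matrix and return the
 * index of the fastest one; use it for the remaining iterations. */
int select_fastest(spmv_fn cand[], int n_cand, const matrix *A,
                   const double *x, double *y, int reps)
{
    int best = 0;
    double best_t = 1.0e99;

    for (int k = 0; k < n_cand; k++) {
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++)      /* a few reps smooth out noise */
            cand[k](A, x, y);
        double t = MPI_Wtime() - t0;
        if (t < best_t) { best_t = t; best = k; }
    }
    printf("auto-tuner: kernel %d selected (%.3e s / %d SpMVs)\n",
           best, best_t, reps);
    return best;
}
```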
Works in ESSEX-II (FY2016-2018)
[Diagram: pK-Open-HPC consists of pK-Open-FVM, pK-Open-SOL, and pK-Open-AT, with interoperability among them]
• pK-Open-HPC: extension of ppOpen-HPC for block-structured FVM
  – pK-Open-FVM: framework for block-structured FVM
  – pK-Open-SOL: linear solvers
  – pK-Open-AT: automatic tuning
• Preconditioned iterative solvers for quantum science (pK-Open-SOL)
• Performance modeling of scientific applications / AT (pK-Open-AT)
• SELL-C-σ based preconditioned iterative solvers (pK-Open-SOL)
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
Assumptions & Expectations towards the Post-K/Post-Moore Era (recap)
• Post-K (-2020)
  – Memory wall
  – Hierarchical memory (e.g. KNL: MCDRAM + DDR)
• Post-Moore (-2025? -2029?)
  – Larger memory & cache
  – Higher bandwidth; larger & heterogeneous latency
    • 3D stacked memory, optical networks
    • Both memory & network will become more hierarchical
  – Application-customized hardware, FPGA
• Common issues
  – Hierarchy and latency (memory, network, etc.)
  – Large number of nodes / large number of cores per node
    • under certain constraints (e.g. power, space, ...)
Applications & Algorithms in the Post-Moore Era
• Shift of emphasis from compute intensity to data-movement intensity
• Implicit schemes strike back! (but not straightforwardly)
• Hierarchical methods for hiding latency
  – Hierarchical coarse grid aggregation (hCGA) in multigrid
  – Parallel in space/time (PiST)
• Communication/synchronization avoiding/reducing algorithms
  – Network latency is already a big bottleneck for parallel sparse linear solvers (SpMV, dot products)
• Utilization of many-core processors
• Power-aware methods
  – Approximate computing, power management, FPGA
Parallel multigrid (MG) is suitable for large-scale computations
CGA (Coarse Grid Aggregation)
• Groundwater flow simulation with up to 4,096 nodes on Fujitsu FX10 (GMG-CG), up to 17,179,869,184 meshes (64³ meshes/core) [KN, ICPADS 2014]
• Linear equations with 17.2 billion DOF can be solved in 8 seconds
[Figure: elapsed time (sec) vs. core count for CGA; multigrid levels from fine (Level=1, 2, ...) to coarse (Level=m-3, m-2), with the coarse grid solver on a single MPI process (multi-threaded, further multigrid)]
• Communication overhead can be reduced (a gather/scatter sketch of this idea follows below)
• The coarse grid solver is more expensive than in the original approach
• If the process count is larger, this effect may become significant
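The following is a minimal sketch of the aggregation step itself, assuming equal local sizes per process; the function names and the serial multigrid callback are illustrative, not the actual GMG-CG implementation. hCGA refines this idea by first aggregating onto a subset of processes before going to a single process.

```c
/* Sketch of coarse grid aggregation (CGA): below a given multigrid level,
 * the distributed coarse problem is gathered onto one MPI process, solved
 * there (possibly with a further serial multigrid), and the solution is
 * scattered back.  Equal local sizes are assumed (otherwise Gatherv). */
#include <mpi.h>
#include <stdlib.h>

void coarse_solve_aggregated(const double *rhs_local, double *sol_local,
                             int n_local, MPI_Comm comm,
                             void (*serial_mg)(const double *, double *, int))
{
    int rank, nproc;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);

    double *rhs_all = NULL, *sol_all = NULL;
    if (rank == 0) {
        rhs_all = malloc((size_t)n_local * nproc * sizeof(double));
        sol_all = malloc((size_t)n_local * nproc * sizeof(double));
    }

    /* one all-to-one collective instead of many tiny halo exchanges */
    MPI_Gather(rhs_local, n_local, MPI_DOUBLE,
               rhs_all,   n_local, MPI_DOUBLE, 0, comm);

    if (rank == 0)                      /* multi-threaded serial multigrid */
        serial_mg(rhs_all, sol_all, n_local * nproc);

    MPI_Scatter(sol_all,   n_local, MPI_DOUBLE,
                sol_local, n_local, MPI_DOUBLE, 0, comm);

    if (rank == 0) { free(rhs_all); free(sol_all); }
}
```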
[Figure: elapsed time (sec) vs. core count for CGA and hCGA; multigrid levels from fine (Level=1, 2, ...) to coarse (Level=m-3, m-2), with the coarse grid solver on a single MPI process (multi-threaded, further multigrid)]
Hierarchical methods will work well on Post-Moore systems
hCGA (Hierarchical Coarse Grid Aggregation)
• Groundwater flow simulation with up to 4,096 nodes on Fujitsu FX10 (GMG-CG), up to 17,179,869,184 meshes (64³ meshes/core) [KN, ICPADS 2014]
• Linear equations with 17.2 billion DOF can be solved in 8 seconds
• x1.61 speedup of hCGA over CGA
Parallel-in-Space/Time (PiST)
• MG is scalable, but the performance improvement is limited when parallelization is applied only in the space direction
• Time-dependent problems: concurrency in the time direction; multigrid in the (space + time) direction
• Traditional time-dependent approach: point-wise Gauss-Seidel (sequential in time)
• XBraid: Lawrence Livermore National Laboratory
• Application to nonlinear problems (transient Navier-Stokes equations)
• MS with 3 sessions at SIAM PP16 (April 2016)
• The PiST approach is suitable for Post-Moore systems with a complex and deeply hierarchical network that causes large latency (a parareal-style sketch of the idea follows below)
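As a simple stand-in for the parallel-in-time idea, the sketch below uses a parareal-style two-level iteration (XBraid itself implements multigrid-in-time, MGRIT, and has its own API; nothing below is XBraid code). The propagators and the pre-initialization of u[] by one serial coarse sweep are assumptions of the sketch.

```c
/* Parareal-style illustration of parallelism in the time direction.
 * u[0..n_slices] holds the solution at slice boundaries, t[0..n_slices]
 * the boundary times.  Update per iteration k:
 *   u^{k+1}[n+1] = G(u^{k+1}[n]) + F(u^k[n]) - G(u^k[n])                */
typedef double (*propagator)(double u0, double t0, double t1);

void parareal(double *u, int n_slices, const double *t,
              propagator G /* cheap, serial */, propagator F /* expensive */,
              int n_iter)
{
    for (int k = 0; k < n_iter; k++) {
        double u_old_n, u_old_np1 = u[0];
        for (int n = 0; n < n_slices; n++) {
            u_old_n   = u_old_np1;   /* u^k[n]   */
            u_old_np1 = u[n + 1];    /* u^k[n+1] */

            /* F(u^k[n]) depends only on the previous iterate, so in a real
             * PiST code every time slice evaluates it concurrently; this
             * is where the time-direction parallelism comes from.        */
            double fine       = F(u_old_n, t[n], t[n + 1]);
            double coarse_new = G(u[n],    t[n], t[n + 1]);
            double coarse_old = G(u_old_n, t[n], t[n + 1]);

            u[n + 1] = coarse_new + fine - coarse_old;
        }
    }
}
```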
Comparison between PiST and time stepping for transient Poisson equations: effective if the processor count is VERY large
• 2D: 129 (space) x 16,385 (time steps); 16 processors in the space direction for PiST
• 3D: 33 (space) x 4,097 (time steps); 8 processors in the space direction for PiST
R. D. Falgout, S. Friedhoff, T. V. Kolev, S. P. MacLachlan, and J. B. Schroder. Parallel time integration with multigrid. SIAM Journal on Scientific Computing, 36(6), C635-C661. 2014
Applications & Algorithms in the Post-Moore Era (recap)
• Shift of emphasis from compute intensity to data-movement intensity
• Implicit schemes strike back! (but not straightforwardly)
• Hierarchical methods for hiding latency
  – Hierarchical coarse grid aggregation (hCGA) in multigrid
  – Parallel in space/time (PiST)
• Communication/synchronization avoiding/reducing algorithms
  – Network latency is already a big bottleneck for parallel sparse linear solvers (SpMV, dot products)
• Utilization of many-core processors
• Power-aware methods
  – Approximate computing, power management, FPGA
3D FEM, solid mechanics, 96 x 80 x 64 nodes, strong scaling
[Figures: speed-up (20-1,280 cores) and relative performance (%) to Alg.1 (original) vs. core count, for Alg.1-Alg.4 and ideal scaling]
• Alg.1: original PCG
• Alg.2: Chronopoulos/Gear CG
• Alg.3: pipelined CG
• Alg.4: Gropp's CG
P. Ghysels et al., Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, Parallel Computing 40, 2014
Alg.4 (Gropp's asynchronous CG) overlaps the global reductions for the dot products with the preconditioner application and SpMV; a sketch of this overlap mechanism follows below.
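The sketch uses a non-blocking MPI_Iallreduce: the reduction for one dot product proceeds while a local SpMV runs. It shows only the overlap mechanism, not the full recurrences of Alg.2-Alg.4; spmv_local() and the vector names are placeholders.

```c
#include <mpi.h>

/* Start the global reduction for (r,z), do the local SpMV q = A*p while
 * the reduction is in flight, and only then wait for the result. */
double dot_overlapped_with_spmv(const double *r, const double *z,
                                const double *p, double *q, int n_local,
                                MPI_Comm comm,
                                void (*spmv_local)(const double *, double *, int))
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)         /* local part of (r,z) */
        local += r[i] * z[i];

    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    spmv_local(p, q, n_local);                /* hides the reduction latency */

    MPI_Wait(&req, MPI_STATUS_IGNORE);        /* (r,z) now valid everywhere */
    return global;
}
```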
Framework for Application Development in the Post-Moore Era
• ppOpen-HPC + α
• Pre/post processing for Parallel-in-Space/Time (PiST)
  – PiST is suitable for Post-Moore features (deep/complex memory and network hierarchy, higher latency): pre/post data movement
    • (a) nonlinear algorithms, (b) AMR, (c) visualization, (d) coupler for multiphysics
• Optimization of the coupler
  – Fault resiliency, strong coupling, AMR
• Co-design / collaboration
  – Linear solvers, AT framework, space/time tiling
  – Approximate computing, power management
Data movement is the challenging part!
Atmosphere-Ocean Coupling on the K Computer [c/o M. Satoh]
[Diagram: parallel visualization on the K computer with HIVE (c/o K. Ono). Components: KVS (Redis or MongoDB); WebSocket/REST (C++); HIVE Renderer (C++); SURFACE raytracer (C++, OpenGL/GLES, standalone mode); Lua (C) scene files; node.js server (JS) with socket.io and node-redis/hiredis; browser UI (JS/CSS); loaders/builders (libcio, libcpm, ...); OpenMP and MPI. Images, scene commands (Lua), and parameters (JSON) are exchanged among mobile/portable devices, desktop PCs, visualization clusters, and supercomputers.]
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
Summary
• Supercomputers in ITC/U.Tokyo
• ppOpen-HPC
  – Supporting GPUs by OpenACC on Reedbush-H
• pK-Open-HPC & ESSEX
• Post-Moore issues
• Co-design / collaborations towards the Post-Moore era
  – Utilization of FPGAs
    • OpenACC-based programming environment
  – Construction of performance models
    • Architecture
    • AT framework
16th SIAM Conference on Parallel Processing for Scientific Computing (PP18), March 7-10, 2018, Tokyo, Japan
• Venue
  – Nishiwaseda Campus, Waseda University (near Shinjuku)
• Organizing Committee Co-Chairs
  – Satoshi Matsuoka (Tokyo Institute of Technology, Japan)
  – Kengo Nakajima (The University of Tokyo, Japan)
  – Olaf Schenk (Università della Svizzera italiana, Switzerland)
• Contact
  – Kengo Nakajima, nakajima(at)cc.u-tokyo.ac.jp
  – http://nkl.cc.u-tokyo.ac.jp/SIAMPP18/