Beyond ppOpen-HPC: Applications and Algorithms in the Post-K/Post-Moore Era
Kengo Nakajima, Information Technology Center, The University of Tokyo
Accelerated Data and Computing Workshop (ADAC), June 12-14, 2016, Lugano, Switzerland
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
Supercomputers at ITC, U. of Tokyo (total users > 2,000)
Yayoi (Hitachi SR16000/M1), since November 2011
• Total peak performance: 54.9 TFLOPS
• Total number of nodes: 56
• Total memory: 11,200 GB
• Peak performance / node: 980.48 GFLOPS
• Main memory per node: 200 GB
• Disk capacity: 556 TB
• IBM POWER7, 3.83 GHz
Oakleaf-FX (Fujitsu PRIMEHPC FX10), since April 2012
• Total peak performance: 1.13 PFLOPS
• Total number of nodes: 4,800
• Total memory: 150 TB
• Peak performance / node: 236.5 GFLOPS
• Main memory per node: 32 GB
• Disk capacity: 1.1 PB + 2.1 PB
• SPARC64 IXfx, 1.84 GHz
Oakbridge-FX (Fujitsu PRIMEHPC FX10), since April 2014 (special system for long-term jobs up to 168 hours)
• Total peak performance: 136.2 TFLOPS
• Total number of nodes: 576
• Total memory: 18.4 TB
• Peak performance / node: 236.5 GFLOPS
• Main memory per node: 32 GB
• Disk capacity: 147 TB + 295 TB
• SPARC64 IXfx, 1.84 GHz
Supercomputers in ITC/U.Tokyo: two big systems, 6-year cycle (FY2008-2022)
• Hitachi SR11K/J2 (IBM POWER5+): 18.8 TFLOPS, 16.4 TB
• Hitachi HA8000 (T2K, AMD Opteron): 140 TFLOPS, 31.3 TB
• Yayoi: Hitachi SR16000/M1 (IBM POWER7): 54.9 TFLOPS, 11.2 TB
• Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx): 1.13 PFLOPS, 150 TB
• Oakbridge-FX: 136.2 TFLOPS, 18.4 TB
• Oakforest-PACS (JCAHPC: U. Tsukuba & U. Tokyo): Fujitsu, Intel KNL, 25 PFLOPS, 919.3 TB
• K computer / Post-K (national flagship systems, shown for reference)
Post T2K System: Oakforest-PACS (http://jcahpc.jp/pr/pr-en-20160510.html)
• 25 PFLOPS, Fall 2016, by Fujitsu
• 8,208 Intel Xeon Phi (KNL) nodes
  – Full operation starts on December 1, 2016
• Joint Center for Advanced High Performance Computing (JCAHPC, http://jcahpc.jp/)
  – University of Tsukuba
  – University of Tokyo
• The new system will be installed at the Kashiwa-no-Ha (Leaf of Oak) Campus of U.Tokyo, located between Tokyo and Tsukuba
Integrated Supercomputer System for Data Analyses & Scientific Simulations: MPT2K (Mini Post T2K)
• Post T2K (PT2K): Oakforest-PACS (OFP)
  – Operation starts Fall 2016 (delayed)
• MPT2K (Mini PT2K)
  – Additional computational resource until PT2K starts (the FX10 systems are heavily loaded)
  – Pilot system towards the Post-FX10 system after Fall 2018
    • Data analysis, deep learning, etc.
    • New areas of users to be targeted in Post FX10
  – Schedule
    • RFI: mid-August 2015; RFC: late October 2015
    • Final decision: March 22, 2016
    • Operations
      – Phase I for compute nodes (CPU only): July 1, 2016
      – Phase II for full system: March 1, 2017
Reedbush (Mini Post T2K) (1/2)
• SGI was awarded the contract (March 22, 2016)
• Compute nodes (CPU only): Reedbush-U
  – Intel Xeon E5-2695 v4 (Broadwell-EP, 2.1 GHz, 18 cores) x 2 sockets (1.210 TF), 256 GiB (153.6 GB/sec)
  – InfiniBand EDR, full-bisection fat-tree
  – Total system: 420 nodes, 508.0 TF
• Compute nodes (with accelerators): Reedbush-H
  – Intel Xeon E5-2695 v4 (Broadwell-EP, 2.1 GHz, 18 cores) x 2 sockets, 256 GiB (153.6 GB/sec)
  – NVIDIA Pascal GPU (Tesla P100): our first GPU system
    • (4.8-5.3 TF, 720 GB/sec, 16 GiB) x 2 per node
  – InfiniBand FDR x 2 channels (one per GPU), full-bisection fat-tree
  – 120 nodes, 145.2 TF (CPU) + 1.15-1.27 PF (GPU) = 1.30-1.42 PF
Supercomputers in ITC/U.Tokyo: two big systems, 6-year cycle (FY2008-2022), now including Reedbush
• Hitachi SR11K/J2 (IBM POWER5+): 18.8 TFLOPS, 16.4 TB
• Hitachi HA8000 (T2K, AMD Opteron): 140 TFLOPS, 31.3 TB
• Yayoi: Hitachi SR16000/M1 (IBM POWER7): 54.9 TFLOPS, 11.2 TB
• Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx): 1.13 PFLOPS, 150 TB
• Oakbridge-FX: 136.2 TFLOPS, 18.4 TB
• Reedbush (SGI, Broadwell + Pascal): 1.80-1.93 PFLOPS; CSE & Big Data; U.Tokyo's first system with GPUs
• Oakforest-PACS (JCAHPC: U. Tsukuba & U. Tokyo): Fujitsu, Intel KNL, 25 PFLOPS, 919.3 TB
• Post FX10: 50+ PFLOPS (?)
• K computer / Post-K (national flagship systems, shown for reference)
Configuration of Each Compute Node of Reedbush-H
[Diagram: two Intel Xeon E5-2695 v4 (Broadwell-EP) sockets connected by QPI, each with 128 GB DDR4 memory (76.8 GB/sec over 4 channels per socket); two NVIDIA Pascal GPUs connected to each other by NVLink (20 GB/sec) and attached through PCIe Gen3 x16 switches (16 GB/sec); two InfiniBand FDR HCAs, one per PCIe switch, connected to the EDR switch]
Why "Reedbush"?
• "L'homme est un roseau pensant." (Man is a thinking reed.)
• Pensées, Blaise Pascal (1623-1662)
Reedbush (Mini Post T2K) (2/2)
• Storage / file systems
  – Shared parallel file system (Lustre): 5.04 PB, 145.2 GB/sec
  – Fast file cache system: burst buffer (DDN IME, Infinite Memory Engine)
    • SSD: 209.5 TB, 450 GB/sec
• Power, cooling, space
  – Air cooling only; < 500 kVA (without A/C): 378 kVA
  – < 90 m2
• Software & toolkits for data analysis, deep learning, etc.
  – OpenCV, Theano, Anaconda, ROOT, TensorFlow
  – Torch, Caffe, Chainer, GEANT4
Reedbush system configuration (overview)
• Management servers; login nodes x 6 (connected to UTnet and users)
• Interconnect: InfiniBand EDR 4x, full-bisection fat-tree; Mellanox CS7500 634-port + SB7800/7890 36-port x 14
• Parallel file system (Lustre): 5.04 PB, 145.2 GB/sec, DDN SFA14KE x 3
• High-speed file cache system: 209 TB, 436.2 GB/sec, DDN IME14K x 6, dual-port InfiniBand FDR 4x
• Compute nodes: 1.925 PFLOPS in total
  – Reedbush-U (CPU only): 508.03 TFLOPS, 420 nodes, SGI Rackable C2112-4GP3, InfiniBand EDR 4x (100 Gbps/node); CPU: Intel Xeon E5-2695 v4 x 2 sockets (Broadwell-EP, 2.1 GHz, 18 cores, 45 MB L3 cache); memory: 256 GB (DDR4-2400, 153.6 GB/sec)
  – Reedbush-H (with accelerators): 1297.15-1417.15 TFLOPS, 120 nodes, SGI Rackable C1100 series, InfiniBand FDR (56 Gbps x 2/node); CPU: Intel Xeon E5-2695 v4 x 2 sockets; memory: 256 GB (DDR4-2400, 153.6 GB/sec); GPU: NVIDIA Tesla P100 x 2 (Pascal, SXM2, 4.8-5.3 TF, 16 GB, 720 GB/sec, PCIe Gen3 x16, NVLink (for GPU) 20 GB/sec x 2 bricks)
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
ppOpen-HPC: Summary
• ppOpen-HPC is an open-source infrastructure for development and execution of optimized, reliable simulation codes on post-peta-scale (pp) parallel computers based on many-core architectures. It provides automatic tuning (AT) and consists of various types of libraries covering general procedures for scientific computing.
• Application framework with automatic tuning (AT); "pp" stands for post-peta-scale
• Target: Post T2K system (original schedule: FY2015); can be extended to various types of platforms
• Team of 7 institutes, >50 people (5 PDs) from various fields (co-design): ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo, Hokkaido U., Kyoto U., JAMSTEC
Collaborations, Outreach
• International collaborations
  – Lawrence Berkeley National Lab.
  – National Taiwan University
  – National Central University (Taiwan)
  – ESSEX/SPPEXA/DFG, Germany
  – IPCC (Intel Parallel Computing Center)
• Outreach, applications
  – Large-scale simulations
    • Geologic CO2 storage
    • Astrophysics
    • Earthquake simulations, etc.
    • ppOpen-AT, ppOpen-MATH/VIS, ppOpen-MATH/MP, linear solvers
  – International workshops (2012, 2013, 2015)
  – Tutorials, classes
ppOpen-HPC covers:
• Application development frameworks
• Math libraries
• Automatic tuning (AT)
• System software
Schedule of Public Release (with English documents, MIT license)
http://ppopenhpc.cc.u-tokyo.ac.jp/
• New versions are released at each SC conference (and can be downloaded from the web site)
• Multicore/many-core cluster version (flat MPI, OpenMP/MPI hybrid) with documents in English
• We are now focusing on MIC/Xeon Phi
• Collaborations are welcome
• History
  – SC12, Nov 2012 (Ver. 0.1.0)
  – SC13, Nov 2013 (Ver. 0.2.0)
  – SC14, Nov 2014 (Ver. 0.3.0)
  – SC15, Nov 2015 (Ver. 1.0.0)
New Features in Ver. 1.0.0
http://ppopenhpc.cc.u-tokyo.ac.jp/
• HACApK library for H-matrix computations in ppOpen-APPL/BEM (OpenMP/MPI hybrid version)
  – First open-source H-matrix library with OpenMP/MPI hybrid parallelization
• ppOpen-MATH/MP (coupler for multiphysics simulations, loose coupling of FEM & FDM)
• Matrix assembly and linear solvers for ppOpen-APPL/FVM
Support
• FY2011-2015: Post-Peta CREST
  – JST/CREST
• FY2016-2018: ESSEX-II
  – JST/CREST & DFG/SPPEXA (Germany) collaboration
  – ESSEX: Equipping Sparse Solvers for Exascale (FY2013-2015)
    • http://blogs.fau.de/essex/
    • Leading PI: Prof. Gerhard Wellein (U. Erlangen)
  – ESSEX-II (FY2016-2018)
    • ESSEX + U. Tsukuba (Sakurai) + U. Tokyo (Nakajima)
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
[Figure c/o Satoshi Matsuoka (Tokyo Tech)]
Assumptions & Expectations towards the Post-K/Post-Moore Era
• Post-K (-2020)
  – Memory wall
  – Hierarchical memory (e.g. KNL: MCDRAM + DDR); see the allocation sketch after this list
• Post-Moore (-2025? -2029?)
  – Larger memory & cache
  – Higher bandwidth; large & heterogeneous latency
    • 3D stacked memory, optical networks
    • Both memory & network will become more hierarchical
  – Application-customized hardware, FPGA
• Common issues
  – Hierarchy and latency (memory, network, etc.)
  – Large number of nodes, large number of cores per node
    • under certain constraints (e.g. power, space, ...)
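As a concrete illustration of the MCDRAM + DDR hierarchy on KNL, the following is a minimal sketch of explicit data placement using the memkind/hbwmalloc interface. Which arrays deserve high-bandwidth memory is application-specific; the function names below and the choice to place one bandwidth-critical array are assumptions of the sketch, not part of ppOpen-HPC.

```c
/* Minimal sketch of explicit placement in a two-level (MCDRAM + DDR)
 * memory using the memkind/hbwmalloc interface available on KNL nodes.
 * Link with -lmemkind; falls back to ordinary DDR on other systems. */
#include <stdlib.h>
#include <hbwmalloc.h>

double *alloc_bandwidth_critical(size_t n)
{
    if (hbw_check_available() == 0)            /* 0: MCDRAM is present */
        return (double *)hbw_malloc(n * sizeof(double));
    return (double *)malloc(n * sizeof(double));   /* DDR fallback */
}

void free_bandwidth_critical(double *p)
{
    if (hbw_check_available() == 0) hbw_free(p);
    else                            free(p);
}
```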
CAE Applications in the Post-K Era
• Block-structured AMR, voxel-type FEM
  – Semi-structured / semi-unstructured meshes
  – The length of inner loops is fixed, or its variation is rather small
  – ELL, sliced ELL, and SELL-C-σ storage formats can be applied (see the SpMV sketch below)
  – Better than fully unstructured FEM
[Figure: sparse matrix storage formats: CRS, ELL, sliced ELL, SELL-C-σ (SELL-2-8)]
• Robust/scalable GMG/AMG preconditioning
  – Limited or fixed number of non-zero off-diagonals in the coefficient matrices
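To make the fixed inner-loop-length argument concrete, here is a minimal sketch of an SpMV kernel over a SELL-C-σ style chunked layout. The struct, field names, and chunk height are illustrative assumptions rather than the ppOpen-HPC/pK-Open-SOL data structures; padding entries are assumed to carry a zero value and a valid column index (e.g. 0).

```c
#include <stddef.h>

enum { CHUNK = 8 };   /* chunk height C; rows are pre-sorted by length
                         within a window of sigma rows before chunking */

typedef struct {
    int     n_rows;      /* number of matrix rows                        */
    int     n_chunks;    /* ceil(n_rows / CHUNK)                         */
    int    *chunk_len;   /* width (longest row) of each chunk            */
    size_t *chunk_ptr;   /* offset of each chunk in val[] / col[]        */
    double *val;         /* nonzeros, stored column-major inside a chunk */
    int    *col;         /* column indices, same layout; padding uses 0  */
} sell_c_sigma;

/* y = A*x.  The innermost loop runs over the CHUNK rows of one chunk,
 * is unit-stride, and has a fixed trip count, so it vectorizes well on
 * wide-SIMD many-core processors. */
void spmv_sell(const sell_c_sigma *A, const double *x, double *y)
{
    for (int c = 0; c < A->n_chunks; c++) {
        const double *v  = A->val + A->chunk_ptr[c];
        const int    *ci = A->col + A->chunk_ptr[c];
        double tmp[CHUNK] = {0.0};

        for (int j = 0; j < A->chunk_len[c]; j++)
            for (int r = 0; r < CHUNK; r++)
                tmp[r] += v[j * CHUNK + r] * x[ci[j * CHUNK + r]];

        for (int r = 0; r < CHUNK; r++) {
            int row = c * CHUNK + r;
            if (row < A->n_rows) y[row] = tmp[r];
        }
    }
}
```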
Simulation of Geologic CO2 Storage [Dr. Hajime Yamamoto, Taisei Corporation; © Taisei Corporation]
• Fujitsu FX10 (Oakleaf-FX), 30M DOF: 2x-3x improvement
• 30 million DOF (10 million grid nodes x 3 DOF/node)
[Figure: calculation time (sec) vs. number of processors; average time for solving the matrix for one time step, TOUGH2-MP on FX10; 2-3x speedup]
3D Multiphase Flow (Liquid/Gas) + 3D Mass Transfer
pK-Open-HPC: Framework for Exa-Feasible Applications
(pK: Post-K or similar many-core architectures)
• pK-Open-FVM
  – Framework for application development using FVM with block-structured AMR and related utilities
• pK-Open-SOL
  – ELL, sliced ELL, SELL-C-σ
  – Efficient parallel preconditioned iterative solvers for sparse coefficient matrices / H-matrices
    • AMG/GMG preconditioning
    • Low-rank approximation
• pK-Open-AT
  – Auto-tuning (AT) capabilities for selecting optimum computational functions/procedures with optimum combinations of numerical parameters for pK-Open-SOL (a runtime-selection sketch follows below)
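A minimal sketch of the runtime selection idea behind such AT capabilities: time each candidate kernel on the actual data and keep the fastest. The kernel/matrix types and function names are placeholders, not the pK-Open-AT interface, and MPI is assumed to be initialized.

```c
#include <mpi.h>
#include <stdio.h>

typedef struct matrix matrix;                     /* opaque, app-defined */
typedef void (*spmv_fn)(const matrix *, const double *, double *);

/* Benchmark the candidate kernels on the actual matrix and return the
 * index of the fastest one; use it for the remaining iterations. */
int select_fastest(spmv_fn cand[], int n_cand, const matrix *A,
                   const double *x, double *y, int reps)
{
    int best = 0;
    double best_t = 1.0e99;

    for (int k = 0; k < n_cand; k++) {
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++)      /* a few reps smooth out noise */
            cand[k](A, x, y);
        double t = MPI_Wtime() - t0;
        if (t < best_t) { best_t = t; best = k; }
    }
    printf("auto-tuner: kernel %d selected (%.3e s / %d SpMVs)\n",
           best, best_t, reps);
    return best;
}
```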
Works in ESSEX-II (FY2016-2018)
[Diagram: pK-Open-HPC consists of pK-Open-FVM, pK-Open-SOL, and pK-Open-AT, with interoperability among them]
• pK-Open-HPC: extension of ppOpen-HPC for block-structured FVM
  – pK-Open-FVM: framework for block-structured FVM
  – pK-Open-SOL: linear solvers
  – pK-Open-AT: automatic tuning
• Preconditioned iterative solvers for quantum science (pK-Open-SOL)
• Performance modeling of scientific applications / AT (pK-Open-AT)
• SELL-C-σ based preconditioned iterative solvers (pK-Open-SOL)
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
Assumptions & Expectations towards the Post-K/Post-Moore Era (recap)
• Post-K (-2020)
  – Memory wall
  – Hierarchical memory (e.g. KNL: MCDRAM + DDR)
• Post-Moore (-2025? -2029?)
  – Larger memory & cache
  – Higher bandwidth; larger & heterogeneous latency
    • 3D stacked memory, optical networks
    • Both memory & network will become more hierarchical
  – Application-customized hardware, FPGA
• Common issues
  – Hierarchy and latency (memory, network, etc.)
  – Large number of nodes / large number of cores per node
    • under certain constraints (e.g. power, space, ...)
Applications & Algorithms in the Post-Moore Era
• Shift of emphasis from compute intensity to data-movement intensity
• Implicit schemes strike back! (but not straightforwardly)
• Hierarchical methods for hiding latency
  – Hierarchical coarse grid aggregation (hCGA) in multigrid
  – Parallel in space/time (PiST)
• Communication/synchronization avoiding/reducing algorithms
  – Network latency is already a big bottleneck for parallel sparse linear solvers (SpMV, dot products)
• Utilization of many-core processors
• Power-aware methods
  – Approximate computing, power management, FPGA
Parallel multigrid (MG) is suitable for large-scale computations
CGA (Coarse Grid Aggregation)
• Groundwater flow simulation with up to 4,096 nodes on Fujitsu FX10 (GMG-CG), up to 17,179,869,184 meshes (64³ meshes/core) [KN, ICPADS 2014]
• Linear equations with 17.2 billion DOF can be solved in 8 seconds
[Figure: elapsed time (sec) vs. core count for CGA; multigrid levels from fine (Level=1, 2, ...) to coarse (Level=m-3, m-2), with the coarse grid solver on a single MPI process (multi-threaded, further multigrid)]
• Communication overhead can be reduced (a gather/scatter sketch of this idea follows below)
• The coarse grid solver is more expensive than in the original approach
• If the process count is larger, this effect may become significant
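The following is a minimal sketch of the aggregation step itself, assuming equal local sizes per process; the function names and the serial multigrid callback are illustrative, not the actual GMG-CG implementation. hCGA refines this idea by first aggregating onto a subset of processes before going to a single process.

```c
/* Sketch of coarse grid aggregation (CGA): below a given multigrid level,
 * the distributed coarse problem is gathered onto one MPI process, solved
 * there (possibly with a further serial multigrid), and the solution is
 * scattered back.  Equal local sizes are assumed (otherwise Gatherv). */
#include <mpi.h>
#include <stdlib.h>

void coarse_solve_aggregated(const double *rhs_local, double *sol_local,
                             int n_local, MPI_Comm comm,
                             void (*serial_mg)(const double *, double *, int))
{
    int rank, nproc;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);

    double *rhs_all = NULL, *sol_all = NULL;
    if (rank == 0) {
        rhs_all = malloc((size_t)n_local * nproc * sizeof(double));
        sol_all = malloc((size_t)n_local * nproc * sizeof(double));
    }

    /* one all-to-one collective instead of many tiny halo exchanges */
    MPI_Gather(rhs_local, n_local, MPI_DOUBLE,
               rhs_all,   n_local, MPI_DOUBLE, 0, comm);

    if (rank == 0)                      /* multi-threaded serial multigrid */
        serial_mg(rhs_all, sol_all, n_local * nproc);

    MPI_Scatter(sol_all,   n_local, MPI_DOUBLE,
                sol_local, n_local, MPI_DOUBLE, 0, comm);

    if (rank == 0) { free(rhs_all); free(sol_all); }
}
```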
[Figure: elapsed time (sec) vs. core count for CGA and hCGA; multigrid levels from fine (Level=1, 2, ...) to coarse (Level=m-3, m-2), with the coarse grid solver on a single MPI process (multi-threaded, further multigrid)]
Hierarchical methods will work well on Post-Moore systems
hCGA (Hierarchical Coarse Grid Aggregation)
• Groundwater flow simulation with up to 4,096 nodes on Fujitsu FX10 (GMG-CG), up to 17,179,869,184 meshes (64³ meshes/core) [KN, ICPADS 2014]
• Linear equations with 17.2 billion DOF can be solved in 8 seconds
• x1.61 speedup of hCGA over CGA
Parallel-in-Space/Time (PiST)
• MG is scalable, but the performance improvement is limited when parallelization is applied only in the space direction
• Time-dependent problems: concurrency in the time direction; multigrid in the (space + time) direction
• Traditional time-dependent approach: point-wise Gauss-Seidel (sequential in time)
• XBraid: Lawrence Livermore National Laboratory
• Application to nonlinear problems (transient Navier-Stokes equations)
• MS with 3 sessions at SIAM PP16 (April 2016)
• The PiST approach is suitable for Post-Moore systems with a complex and deeply hierarchical network that causes large latency (a parareal-style sketch of the idea follows below)
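As a simple stand-in for the parallel-in-time idea, the sketch below uses a parareal-style two-level iteration (XBraid itself implements multigrid-in-time, MGRIT, and has its own API; nothing below is XBraid code). The propagators and the pre-initialization of u[] by one serial coarse sweep are assumptions of the sketch.

```c
/* Parareal-style illustration of parallelism in the time direction.
 * u[0..n_slices] holds the solution at slice boundaries, t[0..n_slices]
 * the boundary times.  Update per iteration k:
 *   u^{k+1}[n+1] = G(u^{k+1}[n]) + F(u^k[n]) - G(u^k[n])                */
typedef double (*propagator)(double u0, double t0, double t1);

void parareal(double *u, int n_slices, const double *t,
              propagator G /* cheap, serial */, propagator F /* expensive */,
              int n_iter)
{
    for (int k = 0; k < n_iter; k++) {
        double u_old_n, u_old_np1 = u[0];
        for (int n = 0; n < n_slices; n++) {
            u_old_n   = u_old_np1;   /* u^k[n]   */
            u_old_np1 = u[n + 1];    /* u^k[n+1] */

            /* F(u^k[n]) depends only on the previous iterate, so in a real
             * PiST code every time slice evaluates it concurrently; this
             * is where the time-direction parallelism comes from.        */
            double fine       = F(u_old_n, t[n], t[n + 1]);
            double coarse_new = G(u[n],    t[n], t[n + 1]);
            double coarse_old = G(u_old_n, t[n], t[n + 1]);

            u[n + 1] = coarse_new + fine - coarse_old;
        }
    }
}
```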
Comparison between PiST and time stepping for transient Poisson equations: effective if the processor count is VERY large
• 2D: 129 (space) x 16,385 (time steps); 16 processors in the space direction for PiST
• 3D: 33 (space) x 4,097 (time steps); 8 processors in the space direction for PiST
R. D. Falgout, S. Friedhoff, T. V. Kolev, S. P. MacLachlan, and J. B. Schroder. Parallel time integration with multigrid. SIAM Journal on Scientific Computing, 36(6), C635-C661. 2014
Applications & Algorithms in the Post-Moore Era (recap)
• Shift of emphasis from compute intensity to data-movement intensity
• Implicit schemes strike back! (but not straightforwardly)
• Hierarchical methods for hiding latency
  – Hierarchical coarse grid aggregation (hCGA) in multigrid
  – Parallel in space/time (PiST)
• Communication/synchronization avoiding/reducing algorithms
  – Network latency is already a big bottleneck for parallel sparse linear solvers (SpMV, dot products)
• Utilization of many-core processors
• Power-aware methods
  – Approximate computing, power management, FPGA
3D FEM, solid mechanics, 96 x 80 x 64 nodes, strong scaling
[Figures: speed-up (20-1,280 cores) and relative performance (%) to Alg.1 (original) vs. core count, for Alg.1-Alg.4 and ideal scaling]
• Alg.1: original PCG
• Alg.2: Chronopoulos/Gear CG
• Alg.3: pipelined CG
• Alg.4: Gropp's CG
P. Ghysels et al., Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, Parallel Computing 40, 2014
Alg.4 (Gropp's asynchronous CG) overlaps the global reductions for the dot products with the preconditioner application and SpMV; a sketch of this overlap mechanism follows below.
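The sketch uses a non-blocking MPI_Iallreduce: the reduction for one dot product proceeds while a local SpMV runs. It shows only the overlap mechanism, not the full recurrences of Alg.2-Alg.4; spmv_local() and the vector names are placeholders.

```c
#include <mpi.h>

/* Start the global reduction for (r,z), do the local SpMV q = A*p while
 * the reduction is in flight, and only then wait for the result. */
double dot_overlapped_with_spmv(const double *r, const double *z,
                                const double *p, double *q, int n_local,
                                MPI_Comm comm,
                                void (*spmv_local)(const double *, double *, int))
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)         /* local part of (r,z) */
        local += r[i] * z[i];

    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    spmv_local(p, q, n_local);                /* hides the reduction latency */

    MPI_Wait(&req, MPI_STATUS_IGNORE);        /* (r,z) now valid everywhere */
    return global;
}
```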
Framework for Application Development in the Post-Moore Era
• ppOpen-HPC + α
• Pre/post processing for Parallel-in-Space/Time (PiST)
  – PiST is suitable for Post-Moore features (deep/complex memory and network hierarchy, higher latency): pre/post data movement
    • (a) nonlinear algorithms, (b) AMR, (c) visualization, (d) coupler for multiphysics
• Optimization of the coupler
  – Fault resiliency, strong coupling, AMR
• Co-design / collaboration
  – Linear solvers, AT framework, space/time tiling
  – Approximate computing, power management
Data movement is the challenging part!
Atmosphere-Ocean Coupling on the K Computer [c/o M. Satoh]
[Diagram: parallel visualization on the K computer with HIVE (c/o K. Ono). Components: KVS (Redis or MongoDB); WebSocket/REST (C++); HIVE Renderer (C++); SURFACE raytracer (C++, OpenGL/GLES, standalone mode); Lua (C) scene files; node.js server (JS) with socket.io and node-redis/hiredis; browser UI (JS/CSS); loaders/builders (libcio, libcpm, ...); OpenMP and MPI. Images, scene commands (Lua), and parameters (JSON) are exchanged among mobile/portable devices, desktop PCs, visualization clusters, and supercomputers.]
• Supercomputer Systems, Information Technology Center, The University of Tokyo
• ppOpen-HPC
• pK-Open-HPC
• Post-Moore Era
• Summary
Summary
• Supercomputers in ITC/U.Tokyo
• ppOpen-HPC
  – Supporting GPUs by OpenACC on Reedbush-H
• pK-Open-HPC & ESSEX
• Post-Moore issues
• Co-design / collaborations towards the Post-Moore era
  – Utilization of FPGAs
    • OpenACC-based programming environment
  – Construction of performance models
    • Architecture
    • AT framework
16th SIAM Conference on Parallel Processing for Scientific Computing (PP18), March 7-10, 2018, Tokyo, Japan
• Venue
  – Nishiwaseda Campus, Waseda University (near Shinjuku)
• Organizing Committee Co-Chairs
  – Satoshi Matsuoka (Tokyo Institute of Technology, Japan)
  – Kengo Nakajima (The University of Tokyo, Japan)
  – Olaf Schenk (Università della Svizzera italiana, Switzerland)
• Contact
  – Kengo Nakajima, nakajima(at)cc.u-tokyo.ac.jp
  – http://nkl.cc.u-tokyo.ac.jp/SIAMPP18/