Upload
volien
View
218
Download
0
Embed Size (px)
Citation preview
Copyright 2015 FUJITSU LIMITED
FUJITSU HPC The Next Step Toward Exascale
Toshiyuki Shimizu
0
November 17th, 2015
Past, PRIMEHPC FX100, and “Roadmap for Exascale”
2011 2012 2013 2014 2015 2016 2017 2018 2019
PRIMEHPC FX10
1.8x CPU perf. Easier installation
4x(DP) / 8x(SP) CPU & Tofu2 High-density pkg & lower energy
App. review
FS projects
HPCI strategic apps program
Operation of K computer Development
Japan’s National Projects
FUJITSU
FLAGSHIP 2020 Project (Post K computer development)
PRIMEHPC FX100
Copyright 2015 FUJITSU LIMITED
K Computer and PRIMEHPC FX10 in Operation
Many applications are currently running and being developed for science and various industries
PRIMEHPC FX100 in Operation The CPU and interconnect inherit the K computer architectural concept, featuring state-of-the-art technologies
Towards Exascale RIKEN selected Fujitsu as a partner for the basic design of the Post K computer
Technical Computing Suite (TCS)
System software TCS supports the FX100 with newly developed technologies
Handles millions of parallel jobs FEFS: super scalable file system
MPI: Ultra scalable collective communication libraries
OS: Lower OS jitter w/ assistant core
1
Toward Higher Performance beyond 100PF
Hybrid and hierarchical implementation must be chosen! Four and sixteen thread hybrid parallelization shows the best performance
for most of the applications
# of
th
read
s fo
r op
tim
al p
erfo
rman
ce
Hybrid Implementation on 32-core CPU Nodes
Variety of applications showing highly scalable performance
“sweet spot” of thread parallelization
Copyright 2015 FUJITSU LIMITED 2
Thin connection
Fujitsu’s Approach for FX100 and Beyond
Copyright 2015 FUJITSU LIMITED
Core Memory Group
Single process multi-threading
Tofu Scalable Architecture SMaC Scalable Architecture
12-node Tofu unit
6D mesh/torus Tofu interconnect
Scalable System Software Architecture: Resource-saving, Flexible, Reliable
HMC
Single chip CPU compute node single Linux on coherent memory
HMC HMC HMC
Tofu
AC
CE
CE
HMC HMC HMC HMC
HMC HMC HMC HMC
Shared L2 $
Compute Engine(CE)
Core Core Core
Core Core Core
Hardware barrier
Using State-of-the-Art SW/HW Technologies for Application Performance System software is also enhanced for long term evolution
A scalable, many-core micro architecture concept, “SMaC,” has been developed
Scalable interconnect “Tofu”
Dedicated memory controller
3
Scalable System Software: Resource-saving, Flexible, Reliable
For Higher Application Performance Thin and low overhead system design
Resource-saving Small system memory footprint Power-saving control
Usability and Compatibility OSS, ISV support, asset protection
Flexibility Supports computer science research and
data analytic science research in addition to computational science
Reliability Stable operation, immediate fault
detection, minimization of downtime through fast recovery
Maintainability System updates occur during operation,
minimizing maintenance time by enabling fault log sampling
Copyright 2015 FUJITSU LIMITED
K computer & PRIMEHPC
Applications
HPC Portal / Sys Mgmt Portal
Job Mgmt.
System Mgmt.
Linux-based OS
Highly Scalable
File System
FEFS Tools and
math. libraries
Parallel languages
and libraries
Automatic paralleliz-
ation compiler
Technical Computing Suite
4
core core
core core
core core
core core
core core
core core
core core
core core
Assistant
core Assistant
core
core core
core core
core core
core core
core core
core core
core core
core core
Tofu2 interface
Tofu2 controller H
MC in
terface H
MC
inte
rfac
e
L2 cache
L2 cache
PCI interface
MA
C M
AC M
AC
M
AC
PCI controller
SMaC (Scalable Many Core) Concept & Approach
Copyright 2015 FUJITSU LIMITED
Many core-oriented, long-lasting architecture
Scalable performance improvement by increasing the number of cores Increasing the number of cores would be safe, even in the post-Moore’s Law era
using 3D stacking and alternative newer technologies
FX100 CPU implementation
CMG (Shared L2 cache & Memories)
CMG
Core Memory Group (CMG), many core building block, ccNUMA integration
・Hierarchal structure for hybrid parallel model ・Optimized area and performance
CORE
Middle-sized, general purpose, out-of-order, superscalar processor core ・Good performance for variety of apps ・Low power
Assistant core
Assistant core ・OS jitter reduction and assistance for IO, async MPI ・Highly scalable nodes
5
Cores in the group share the same L2 cache Dedicated memory and memory controller for the CMG provide high BW and low
latency data access Loosely coupled CMGs using tagged coherent protocol share data with small
silicon overhead Hierarchical configuration promises good core/performance scalability
Thin connection
HMC HMC HMC HMC
Tofu
AC
CE
CE
HMC HMC HMC HMC
Core Memory Group (CMG) Structure
Copyright 2015 FUJITSU LIMITED
Compute Node Single Linux (coherent memory)
Dedicated memory controller
HMC Memory
Shared L2 $
Compute Engine(CE)
Core Core Core
Core Core Core
Hardware barrier
Core Memory Group Single chip, many core integration by a set of CMGs
Assistant cores
Interconnect controller for Tofu
High performance compute core
Reduced data traffic
6
High Performance Compute Core (SIMD extensions)
DP 3x, SP 6x faster than FX10 in basic kernels High BW memory & Improved L1 cache pipelines contribute to exceed a peak
performance increase of 2.3x
Copyright 2015 FUJITSU LIMITED
Nor
mal
ized
Per
form
ance
Basic Kernels Performance per Core
FX10 FX100 FX100
FX10
0 Si
ng
le
FX10
0 D
oub
le
FX10
Dou
ble
/Sin
gle
7
Copyright 2015 FUJITSU LIMITED
XFILL inst. marks the cache line filled with new data
Sector cache holds specific data loaded by attributed load inst.
Reducing Data Traffic (XFILL & Sector Cache)
Speedup by XFILL & Sector Cache (Problem size: SP 4098x130x130, parallelized by j ) !OCL CACHE_SECTOR_SIZE(18,6)
!OCL CACHE_SUBSECTOR_ASSIGN(…) do k=2,kmax-1 do j=2,jmax-1 do i=2,imax-1 s0=...p(i, j, k)... & ...p(i, j+1, k)... & ...p(i, J-1, k)... … wrk2(i, j, k)=… enddo enddo enddo !OCL END_CACHE_SECTOR_SIZE !OCL END_CACHE_SUBSECTOR
x1.1
x1.2 Specify vars except array p
Himeno’s Benchmark
8
CMG & SMaC Effect on OpenMP Microbenchmark
FX100 outperforms Haswell due to the optimal implementation of the OpenMP library and the hardware inter-core barrier in the CPU A single chip CPU of FX100 is effective (larger gap at 32 vs 28 threads)
Configuration of Haswell: 2chips/node, E5-2697 v3 @ 2.60GHz
FX10
0 is
fas
ter!
Single chip Two chips Single chip Single chip
1
2
4
8
16
32
64
128
PARALLEL DO PARALLEL_DO BARRIER REDUCTION
Haswell(14thr)/FX100(16thr) Haswell(28thr)/FX100(32thr)
FX100 speedup over Haswell at smaller # of threads (Sample size:1000)
Copyright 2015 FUJITSU LIMITED FUJITSU CONFIDENTIAL 9
Overlapping Execution of Non-blocking Comm.
An assistant core is used in the MPI library Boundary data transfer of stencil code
Copyright 2015 FUJITSU LIMITED
Data array
Exchange boundary data
Process i Process i-1 Process i+1
Compute cores Assistant core
Computation
Post send
Wait
Com
m. p
rocessin
g
Normalized execution time
0
0.5
1
1.5
Notused
Used Notused
Used Notused
Used
64KiB 256KiB 1MiB
Ra
tio
of
exe
cuti
on
tim
e
Msg. length & assistant core usage
Comm.
Compute
10
Application Evaluation
CCS QCD Miniapp Quantum chromodynamics, a linear equation solver with a large sparse
coefficient matrix appearing in a lattice QCD problem
NAS parallel benchmarks FT class C by OpenMP parallel Time integration of a 3D partial differential equation using FFT (512^3)
MHD Simulation of Jovian and Kronian magnetosphere and space weather
GT5D Gyrokinetic toroidal 5D Eulerian code
Copyright 2015 FUJITSU LIMITED 11
Copyright 2015 FUJITSU LIMITED
Compiler improves cache hit rate using hints of directive & sector cache, better than FX10
CCS QCD “Miniapp” (OpenMP single node)
!OCL CACHE_SECTOR_SIZE(19,5) !OCL CACHE_SUBSECTOR_ASSIGN(ue,uo,yde,fclinve) !$OMP PARALLEL DO SCHEDULE(STATIC,1) do ix=1,NX do iy=1,NY do iz=1,NZ … gy11=yo(…,iy+1,…)+… … gy11=yo(…,iy-1,…)+… … enddo enddo enddo !$OMP END PARALLEL DO !OCL END_CACHE_SUBSECTOR !OCL END_CACHE_SECTOR_SIZE
Typical implementation to enable sector cache
0
20
40
60
80
100
120
Sector Cachedisabled
Sector Cacheenabled
Sector Cachedisabled
Sector Cacheenabled
FX10 (16 cores) FX100 (32 cores)
Pe
rfo
rman
ce (
Gfl
op
s/n
od
e)
Node Performance of CCS QCD
+6%
+9%
Size: 324
https://github.com/fiber-miniapp/ccs-qcd
Reserve L2$ 2.5MB for sector1
Assign single use data to sector1 Reusable data would be stored in sector0
12
Copyright 2015 FUJITSU LIMITED
NAS Parallel Benchmark FT (OpenMP Single Node)
NAS Parallel Benchmarks Ver. 3.3.1 OpenMP Class C
3.5x Improvement per node over the FX10 256bit SIMD x SMaC high performance node
Scalable thread performance
0
10
20
30
40
50
16 cores 2 way SIMD,16 cores
4 way SIMD,16 cores
4 way SIMD,32 cores
FX10 FX100
Node Performance of NPB FT
91% scalability x1.6
x1.2
x1.8 Memory & $ BW
-43% inst. by recompile G
flop
s/n
ode
13
MHD Sim. of Jovian and Kronian Magnetosphere & Space Weather
Binary of FX10 runs on FX100 (1.8x)
Recompile for FX100 utilizes 4 way SIMD (1.3x)
Source optimization for FX100 attains 3.5x of FX10 at the same # of cores (1,024 cores)
Copyright 2015 FUJITSU LIMITED
x1.8
x1.3 x1.5
x3.5 speedup
Reports 6.02 Tflops
(Evaluated under the collaboration with Dr. Fukazawa @ Kyoto Univ.) 14
Copyright 2015 FUJITSU LIMITED
Assistant core enables overlapping of communication and calculations While computation progresses on compute cores using
local data, the assistant cores handle communication tasks, independently and in parallel, until 'MPI_Waitall'
Minimal code modification retains portability and maintainability, free from the headache of manual code optimization for the same level of performance improvement
GT5D: Gyrokinetic Toroidal 5D Eulerian Code[1]
MPI_Isend MPI_Irecv !$OMP PARALLEL DO do i=… Computation using local data (independent from the communication) enddo !$OMP END PARALLEL DO MPI_Waitall do i=… Computation using received data enddo
Typical implementation of the overlapping of communication and computation
Size :256x256x64x128, 16 threads x 64 process, Range 14dx
14% speedups
By courtesy of JAEA
[1] Y. Idomura et al., Nuclear Fusion 49, 065029 (2009) [2] Y. Idomura et al., Int. J. HPC Appl. 28, 73 (2014)
Same level
Calculation using local data
Calculation using received data
Message wait
[2]
15
Copyright 2015 FUJITSU LIMITED
C RIKEN
FX100 VISIMPACT HPC-ACE Direct network Tofu
SMaC Tofu interconnect 2 HMC & Optical connections
PRIMEHPC Series
VISIMPACT SIMD extension HPC-ACE Direct network Tofu
CY2012~ 236.5GF, 16-core/CPU
CY2015~ 1TF~, 32-core/CPU 128GF, 8-core/CPU
CY2010~
Summary, FX100, and Exascale…
FLAGSHIP 2020 Basic Design has been Completed ・High application performance efficiency by consistent and conclusive approach
Exascale
FX10 K computer
FX100 Detailed Evaluation Unveiled ・Refined architectural concept “SMaC” is presented and evaluated ・Great application compatibility and scalable performance
16
17