FUJITSU HPC: The Next Step Toward Exascale · FUJITSU HPC The Next Step Toward Exascale ... 2011 2012 2013 2014 2015 2016 2017 2018 ... High-density pkg & lower energy App. review

Copyright 2015 FUJITSU LIMITED

FUJITSU HPC The Next Step Toward Exascale

Toshiyuki Shimizu

0

November 17th, 2015

Past, PRIMEHPC FX100, and “Roadmap for Exascale”

2011 2012 2013 2014 2015 2016 2017 2018 2019

PRIMEHPC FX10

1.8x CPU perf. Easier installation

4x(DP) / 8x(SP) CPU & Tofu2 High-density pkg & lower energy

App. review

FS projects

HPCI strategic apps program

Operation of K computer Development

Japan’s National Projects

FUJITSU

FLAGSHIP 2020 Project (Post K computer development)

PRIMEHPC FX100


K Computer and PRIMEHPC FX10 in Operation

Many applications are currently running and being developed for science and various industries

PRIMEHPC FX100 in Operation The CPU and interconnect inherit the K computer architectural concept, featuring state-of-the-art technologies

Towards Exascale RIKEN selected Fujitsu as a partner for the basic design of the Post K computer

Technical Computing Suite (TCS)

System software TCS supports the FX100 with newly developed technologies

Handles millions of parallel jobs FEFS: super scalable file system

MPI: Ultra scalable collective communication libraries

OS: Lower OS jitter w/ assistant core

1

Toward Higher Performance beyond 100PF

Hybrid and hierarchical implementation must be chosen! Four and sixteen thread hybrid parallelization shows the best performance

for most of the applications

# of

th

read

s fo

r op

tim

al p

erfo

rman

ce

Hybrid Implementation on 32-core CPU Nodes

Variety of applications showing highly scalable performance

“sweet spot” of thread parallelization

Copyright 2015 FUJITSU LIMITED 2

Thin connection

Fujitsu’s Approach for FX100 and Beyond


Core Memory Group

Single process multi-threading

Tofu Scalable Architecture SMaC Scalable Architecture

12-node Tofu unit

6D mesh/torus Tofu interconnect

Scalable System Software Architecture: Resource-saving, Flexible, Reliable

HMC

Single chip CPU compute node single Linux on coherent memory

HMC HMC HMC

Tofu

AC

CE

CE

HMC HMC HMC HMC

HMC HMC HMC HMC

Shared L2 $

Compute Engine(CE)

Core Core Core

Core Core Core

Hardware barrier

Using State-of-the-Art SW/HW Technologies for Application Performance System software is also enhanced for long term evolution

A scalable, many-core micro architecture concept, “SMaC,” has been developed

Scalable interconnect “Tofu”

Dedicated memory controller

3

Scalable System Software: Resource-saving, Flexible, Reliable

For Higher Application Performance Thin and low overhead system design

Resource-saving Small system memory footprint Power-saving control

Usability and Compatibility OSS, ISV support, asset protection

Flexibility Supports computer science research and

data analytic science research in addition to computational science

Reliability Stable operation, immediate fault

detection, minimization of downtime through fast recovery

Maintainability System updates occur during operation,

minimizing maintenance time by enabling fault log sampling


K computer & PRIMEHPC

Applications

HPC Portal / Sys Mgmt Portal

Job Mgmt.

System Mgmt.

Linux-based OS

Highly Scalable

File System

FEFS Tools and

math. libraries

Parallel languages

and libraries

Automatic paralleliz-

ation compiler

Technical Computing Suite

4

core core

core core

core core

core core

core core

core core

core core

core core

Assistant

core Assistant

core

core core

core core

core core

core core

core core

core core

core core

core core

Tofu2 interface

Tofu2 controller H

MC in

terface H

MC

inte

rfac

e

L2 cache

L2 cache

PCI interface

MA

C M

AC M

AC

M

AC

PCI controller

SMaC (Scalable Many Core) Concept & Approach


Many core-oriented, long-lasting architecture

Scalable performance improvement by increasing the number of cores Increasing the number of cores would be safe, even in the post-Moore’s Law era

using 3D stacking and alternative newer technologies

FX100 CPU implementation

CMG (Shared L2 cache & Memories)

CMG

Core Memory Group (CMG), many core building block, ccNUMA integration

・Hierarchal structure for hybrid parallel model ・Optimized area and performance

CORE

Middle-sized, general purpose, out-of-order, superscalar processor core ・Good performance for variety of apps ・Low power

Assistant core

Assistant core ・OS jitter reduction and assistance for IO, async MPI ・Highly scalable nodes

5

Cores in the group share the same L2 cache Dedicated memory and memory controller for the CMG provide high BW and low

latency data access Loosely coupled CMGs using tagged coherent protocol share data with small

silicon overhead Hierarchical configuration promises good core/performance scalability

Thin connection

HMC HMC HMC HMC

Tofu

AC

CE

CE

HMC HMC HMC HMC

Core Memory Group (CMG) Structure


Compute Node Single Linux (coherent memory)

Dedicated memory controller

HMC Memory

Shared L2 $

Compute Engine(CE)

Core Core Core

Core Core Core

Hardware barrier

Core Memory Group Single chip, many core integration by a set of CMGs

Assistant cores

Interconnect controller for Tofu

High performance compute core

Reduced data traffic

6

High Performance Compute Core (SIMD extensions)

DP 3x, SP 6x faster than FX10 in basic kernels High BW memory & Improved L1 cache pipelines contribute to exceed a peak

performance increase of 2.3x


Nor

mal

ized

Per

form

ance

Basic Kernels Performance per Core

FX10 FX100 FX100

FX10

0 Si

ng

le

FX10

0 D

oub

le

FX10

Dou

ble

/Sin

gle

7


XFILL inst. marks the cache line filled with new data

Sector cache holds specific data loaded by attributed load inst.

Reducing Data Traffic (XFILL & Sector Cache)

Speedup by XFILL & Sector Cache (Problem size: SP 4098x130x130, parallelized by j ) !OCL CACHE_SECTOR_SIZE(18,6)

!OCL CACHE_SUBSECTOR_ASSIGN(…) do k=2,kmax-1 do j=2,jmax-1 do i=2,imax-1 s0=...p(i, j, k)... & ...p(i, j+1, k)... & ...p(i, J-1, k)... … wrk2(i, j, k)=… enddo enddo enddo !OCL END_CACHE_SECTOR_SIZE !OCL END_CACHE_SUBSECTOR

x1.1

x1.2 Specify vars except array p

Himeno’s Benchmark

8

CMG & SMaC Effect on OpenMP Microbenchmark

FX100 outperforms Haswell due to the optimal implementation of the OpenMP library and the hardware inter-core barrier in the CPU A single chip CPU of FX100 is effective (larger gap at 32 vs 28 threads)

Configuration of Haswell: 2chips/node, E5-2697 v3 @ 2.60GHz

FX10

0 is

fas

ter!

Single chip Two chips Single chip Single chip

1

2

4

8

16

32

64

128

PARALLEL DO PARALLEL_DO BARRIER REDUCTION

Haswell(14thr)/FX100(16thr) Haswell(28thr)/FX100(32thr)

FX100 speedup over Haswell at smaller # of threads (Sample size:1000)

Copyright 2015 FUJITSU LIMITED FUJITSU CONFIDENTIAL 9

Overlapping Execution of Non-blocking Comm.

An assistant core is used in the MPI library Boundary data transfer of stencil code


Data array

Exchange boundary data

Process i Process i-1 Process i+1

Compute cores Assistant core

Computation

Post send

Wait

Com

m. p

rocessin

g

Normalized execution time

0

0.5

1

1.5

Notused

Used Notused

Used Notused

Used

64KiB 256KiB 1MiB

Ra

tio

of

exe

cuti

on

tim

e

Msg. length & assistant core usage

Comm.

Compute

10

Application Evaluation

CCS QCD Miniapp Quantum chromodynamics, a linear equation solver with a large sparse

coefficient matrix appearing in a lattice QCD problem

NAS parallel benchmarks FT class C by OpenMP parallel Time integration of a 3D partial differential equation using FFT (512^3)

MHD Simulation of Jovian and Kronian magnetosphere and space weather

GT5D Gyrokinetic toroidal 5D Eulerian code

Copyright 2015 FUJITSU LIMITED 11


Compiler improves cache hit rate using hints of directive & sector cache, better than FX10

CCS QCD “Miniapp” (OpenMP single node)

!OCL CACHE_SECTOR_SIZE(19,5) !OCL CACHE_SUBSECTOR_ASSIGN(ue,uo,yde,fclinve) !$OMP PARALLEL DO SCHEDULE(STATIC,1) do ix=1,NX do iy=1,NY do iz=1,NZ … gy11=yo(…,iy+1,…)+… … gy11=yo(…,iy-1,…)+… … enddo enddo enddo !$OMP END PARALLEL DO !OCL END_CACHE_SUBSECTOR !OCL END_CACHE_SECTOR_SIZE

Typical implementation to enable sector cache

0

20

40

60

80

100

120

Sector Cachedisabled

Sector Cacheenabled

Sector Cachedisabled

Sector Cacheenabled

FX10 (16 cores) FX100 (32 cores)

Pe

rfo

rman

ce (

Gfl

op

s/n

od

e)

Node Performance of CCS QCD

+6%

+9%

Size: 324

https://github.com/fiber-miniapp/ccs-qcd

Reserve L2$ 2.5MB for sector1

Assign single use data to sector1 Reusable data would be stored in sector0

12


NAS Parallel Benchmark FT (OpenMP Single Node)

NAS Parallel Benchmarks Ver. 3.3.1 OpenMP Class C

3.5x Improvement per node over the FX10 256bit SIMD x SMaC high performance node

Scalable thread performance

0

10

20

30

40

50

16 cores 2 way SIMD,16 cores

4 way SIMD,16 cores

4 way SIMD,32 cores

FX10 FX100

Node Performance of NPB FT

91% scalability x1.6

x1.2

x1.8 Memory & $ BW

-43% inst. by recompile G

flop

s/n

ode

13

MHD Sim. of Jovian and Kronian Magnetosphere & Space Weather

Binary of FX10 runs on FX100 (1.8x)

Recompile for FX100 utilizes 4 way SIMD (1.3x)

Source optimization for FX100 attains 3.5x of FX10 at the same # of cores (1,024 cores)


x1.8

x1.3 x1.5

x3.5 speedup

Reports 6.02 Tflops

(Evaluated under the collaboration with Dr. Fukazawa @ Kyoto Univ.) 14


Assistant core enables overlapping of communication and calculations While computation progresses on compute cores using

local data, the assistant cores handle communication tasks, independently and in parallel, until 'MPI_Waitall'

Minimal code modification retains portability and maintainability, free from the headache of manual code optimization for the same level of performance improvement

GT5D: Gyrokinetic Toroidal 5D Eulerian Code[1]

MPI_Isend MPI_Irecv !$OMP PARALLEL DO do i=… Computation using local data (independent from the communication) enddo !$OMP END PARALLEL DO MPI_Waitall do i=… Computation using received data enddo

Typical implementation of the overlapping of communication and computation

Size :256x256x64x128, 16 threads x 64 process, Range 14dx

14% speedups

By courtesy of JAEA

[1] Y. Idomura et al., Nuclear Fusion 49, 065029 (2009) [2] Y. Idomura et al., Int. J. HPC Appl. 28, 73 (2014)

Same level

Calculation using local data

Calculation using received data

Message wait

[2]

15


C RIKEN

FX100 VISIMPACT HPC-ACE Direct network Tofu

SMaC Tofu interconnect 2 HMC & Optical connections

PRIMEHPC Series

VISIMPACT SIMD extension HPC-ACE Direct network Tofu

CY2012~ 236.5GF, 16-core/CPU

CY2015~ 1TF~, 32-core/CPU 128GF, 8-core/CPU

CY2010~

Summary, FX100, and Exascale…

FLAGSHIP 2020 Basic Design has been Completed ・High application performance efficiency by consistent and conclusive approach

Exascale

FX10 K computer

FX100 Detailed Evaluation Unveiled ・Refined architectural concept “SMaC” is presented and evaluated ・Great application compatibility and scalable performance

16

17

Documents

FUJITSU HPC: The Next Step Toward Exascale · FUJITSU HPC The Next Step Toward Exascale ... 2011 2012 2013 2014 2015 2016 2017 2018 ... High-density pkg & lower energy App. review