15
Copyright 2015 FUJITSU LIMITED FUJITSU Supercomputer PRIMEHPC FX100 0

FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Copyright 2015 FUJITSU LIMITED

FUJITSU Supercomputer PRIMEHPC FX100

0

Page 2: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Copyright 2015 FUJITSU LIMITED

The K computer and the evolution of PRIMEHPC

Fujitsu has been developing supercomputers over 30 years, and will continue its development to deliver the best application performance.

SPARC64 VIIIfx: 8 cores / 128 GF11 PF, 2010~

K computer PRIMEHPC FX10SPARC64 IXfx: 16 cores / 236.5 GF23 PF, 2012~

PRIMEHPC FX100SPARC64 XIfx: 32 cores / over 1 TFOver 100 PF, 2015~

C RIKEN

Post-K computer

1

Page 3: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

PRIMEHPC FX100 design concept

Copyright 2015 FUJITSU LIMITED

Designed for massively parallel supercomputer system

Enhance and inherit K computer features・Many-core CPU-based architecture for application productivity

Introduce new technologies to Exascale computing

・High performance for a wide range of real applications

・HPC-ACE2 : Wide SIMD enhancements・Assistant cores : Dedicated cores for non-calculation operation

・Enhanced VISIMPACT (hardware barrier synchronization, sector cache, etc.)

・HMC : Leading-edge memory technology

2

Page 4: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

SPARC64TM XIfx

Copyright 2015 FUJITSU LIMITED

Over 1 TF high performance processor・32 compute cores

HPC-ACE2: ISA enhancements・Two 256-bit wide SIMD units per core

Tofu2 interface

Tofu2 controller

L2 cache

L2 cache

HM

C interfa

ce

HM

C in

terf

ace

PCI interface

core core

core core

core core

core core

core core

core core

core core

core core

core

core

core

core

core

core

core

core

core

core

core

core

core

core

core

core

・2 assistant cores:

Assistant core

Assistant core

Daemons, IOs, non-blocking MPI functions, etc.

・Addressing mode (stride load/store, indirect load/store)・Cross lane operation (compress, permutation)

Offloading non-calculation operations

HMC support・480GB/s/node of theoretical memory throughput

3

Page 5: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Tofu interconnect 2

Copyright 2015 FUJITSU LIMITED

Enhanced Tofu interconnect

CPU-integrated interconnect controller・Reduced communication latency

Optical cable connection between chassis

・Improved packaging density and energy efficiency

・Highly scalable, 6-dimensional mesh/torus topology・Increased link bandwidth by 2.5 times to 12.5GB/s

・Enable flexible installation

4

Page 6: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Copyright 2015 FUJITSU LIMITED

Technical Computing Suite

Linux-based OS enhanced for FX100

Technical Computing Suite

Applications

PRIMEHPC FX100

Management software Programming Environment

High Performance File SystemFEFS

System management

Job management

Lustre-based distributed file system(enhanced for FX100)

MPI, OpenMP, XPFortran

Compilers(C,C++,Fortan)Mathematical libraries

Debugging and tuning tools

Enhanced software stack developed by FUJITSU

5

Page 7: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Performance highlights

Copyright 2015 FUJITSU LIMITED

Node performance is 2.9 to 4.7 times higher than previous generation.

Node performance normalized by FX10

0.01.02.03.04.05.0

FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100 FX10 FX100

HPL HPCG NPB FT NTChem Mini NICAM-DCMini

CCS QCD Mini UPACS-Lite

Benchmark programs Applications

Ratio of peak performance

3.1x2.9x3.3x4.0x

3.0x4.2x 4.7x

6

Page 8: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Results of major benchmarks - LINPACK

The first four FX100 systems have been delivered to customers, with >90% efficiency.

Copyright 2015 FUJITSU LIMITED

LINPACK performance and efficiency

Perf

orm

ance

3Pflops

2 Pflops

1 Pflops

0 Pflops

Effi

cien

cy

100 %

90%

80 %

RIKEN1080 nodes

MeteorologicalResearch Institute

1080 nodes

Japan AerospaceExploration Agency

1296 nodes

National Institutefor Fusion Science

2592 nodes

Performance (Pflops) Efficiency (%)

7

Page 9: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Results of major benchmarks - HPCG

Copyright 2015 FUJITSU LIMITED

Solves large sparse linear systems, using Conjugate Gradient method.

Enhanced memory bandwidth enables higher performance, keeping ease-of-use for programming.

Node Performance (Gflops/node)

0102030405060

Mira K computer FX10 FX100 Tianhe-2(MIC x3/node)

Titan(GPU x1/node)

TSUBAME 2.5(GPU x3/node)

3.0x

8

Page 10: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Results of major benchmarks – NPB FT

Copyright 2015 FUJITSU LIMITED

Node performance (Gflops/node) Breakdown of execution time

0.0

0.2

0.4

0.6

0.8

1.0

FX10 (16threads)

FX100 (32threads)

2-4 inst. commited1 inst. commitedwait (others)wait (instruction)wait (calculation)wait (cache)wait (memory)0

10

20

30

40

FX10 (16threads)

FX100 (32threads)

3.3x

Solves a 3D partial differential equation using FFT.

Improved by higher cache/memory throughput, as well as increased CPU cores and SIMD width.

9

Page 11: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Application performance - NTChem-MINI†

Copyright 2015 FUJITSU LIMITED

† https://github.com/fiber-miniapp/ntchem-mini

0

100

200

300

400

500

FX10(12 nodes)

FX100(6 nodes)

0.0

0.2

0.4

0.6

0.8

1.0

FX10(12 nodes)

FX100(6 nodes)

2-4 inst. commited1 inst. commitedwait (others)wait (instruction)wait (calculation)wait (cache)wait (memory)

4.0x

A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation

4.0x faster on FX100, close to the ratio of peak performance (4.4x)

Node performance (Gflops/node) Breakdown of execution time

10

Page 12: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Application performance - NICAM-DC-MINI†

Copyright 2015 FUJITSU LIMITED

† https://github.com/fiber-miniapp/nicam-dc-mini

0

10

20

30

40

FX10(8 nodes) FX100(4 nodes)0.0

0.2

0.4

0.6

0.8

1.0

FX10 (8 nodes) FX100(4 nodes)

2-4 inst. commited1 inst. commitedwait (others)wait (instruction)wait (calculation)wait (cache)wait (memory)

232.1 GB/s/node

74.2GB/s/node

2.9x

Node performance (Gflops/node) Breakdown of execution time ("vi_path2" routine)

A mini-app. derived from NICAM-DC, dynamic phase of NICAM (NonhydrostaticICosahedral Atmospheric Model)

3.1x higher memory throughput in “Vertical Implicit” calculation speeds up 1.5x with half the number of FX10 nodes.

1.5x

11

Page 13: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Application performance – CCS QCD-MINI†

Copyright 2015 FUJITSU LIMITED

† https://github.com/fiber-miniapp/ccs-qcd

0

30

60

90

120

Sector cachedisabled

Sector cacheenabled

Sector cachedisabled

Sector cacheenabled

Sector cachedisabled

Sector cacheenabled

FX10 (1 proc. 16 threads) FX100 (1 proc. X 16 threads) FX100 (1proc. X 32 threads)

3.1x

A mini-app. code for a lattice QCD problem (324 size)

Enhanced memory bandwidth boosts the performance and Sector Cache mechanism promotes data reuse on L2$.

Node performance (Gflops/node)

12

Page 14: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

Application performance – UPACS-Lite

Copyright 2015 FUJITSU LIMITED

Calc. time per time step (sec)

Calculation of compressible fluid dynamics, 1603 grid points per node

Enhanced memory bandwidth and other CPU improvementsboost the performance

13

0.0

0.5

1.0

1.5

2.0

2.5

FX10 FX100

4.6x

32.2 GB/s/node

150.9 GB/s/node

(Courtesy of Japan Aerospace Exploration Agency)

Example: simulation for acoustic design of a launch pad

Page 15: FUJITSU Supercomputer PRIMEHPC FX100...A mini-app. of the ab-initio quantum chemistry for the molecular electronic structure calculation 4.0x faster on FX100, close to the ratio of

14