Intel’s Xeon Scalable processor Family (Skylake) and AMD’s EPYC processor Application Performance in Chemistry and Materials Science: Martyn Guest , Christine Kitchen & Enguerrand Petit Cardiff University Performance Engineering Group, Atos

Application Performance in Chemistry and Materials Science193.62.125.70/CIUK2017/MartynGuest_Cardiff.pdf · 2018. 2. 6. · Dell Skylake Gold 6130 2.1GHz (T) OPA Dell Skylake Gold


Intel’s Xeon Scalable Processor Family (Skylake) and AMD’s EPYC Processor

Application Performance in Chemistry and Materials Science

Martyn Guest¹, Christine Kitchen¹ & Enguerrand Petit²

¹ Cardiff University ² Performance Engineering Group, Atos


Application Performance in Materials Science

Outline

• Measurement of parallel application performance, featuring synthetics and end-user applications across a variety of clusters
  ¤ Synthetic codes – STREAM, IMB (interconnect) and HPCC
  ¤ Variety of end-user codes – DL_POLY, GROMACS, NAMD, LAMMPS, GAMESS-UK, Quantum Espresso, VASP, CP2K, ONETEP & OpenFOAM
• Focus on Intel’s Xeon Scalable processors (“Skylake”), including
  ¤ Intel Xeon Gold 6150 Processor (18c, 24.75M Cache, 2.70 GHz)
  ¤ Intel Xeon Gold 6142 Processor (16c, 22M Cache, 2.60 GHz)
  ¤ Intel Xeon Gold 6148 Processor (20c, 27.5M Cache, 2.40 GHz)
  ¤ Intel Xeon Gold 6130 Processor (16c, 22M Cache, 2.10 GHz)
  ¤ Clusters with dual-socket nodes + Mellanox EDR & Intel’s OPA interconnects
• Comparison with a number of HPC systems based on earlier CPUs:
  ¤ Intel’s Broadwell (16-core E5-2697A v4 2.6 GHz & 14-core E5-2680 v4 2.4 GHz) and Sandy Bridge (E5-2690 & E5-2670) clusters from Bull and Fujitsu
  ¤ Mellanox IB EDR and Intel’s Omni-Path (OPA) interconnects
  ¤ IBM’s Power 8 cluster - S822LC
• Preliminary evaluation of the AMD EPYC 7601 Processor

14 December 2017

Contents

1. Performance Benchmarks and Cluster Systems
   a. Synthetic Code Performance: STREAM, IMB, and HPCC
   b. Application Code Performance: DL_POLY, GROMACS, LAMMPS, NAMD, GAMESS-UK, VASP, Quantum Espresso, CP2K, and OpenFOAM
2. Selecting Fabrics and Optimising Performance
   a. Interconnect Performance: Mellanox EDR InfiniBand and Intel’s Omni-Path (OPA)
3. Relative Code Performance: Processor Family and Interconnect – “core to core” and “node to node” benchmarks
4. Preliminary Evaluation of the AMD EPYC 7601 Processor
   a. Represents a fairly radical design departure from what Intel offers
5. Acknowledgements and Summary

The Xeon Skylake Architecture

• The architecture of Skylake is very different from that of the prior “Haswell” and “Broadwell” Xeon chips.
• Three basic variants now cover what was formerly the Xeon E5 and Xeon E7 product lines, with Intel converging the Xeon E5 and E7 chips into a single socket.
• Product segmentation – Platinum, Gold, Silver & Bronze – with 51 variants of the SP chip.
• There are also custom versions requested by hyperscale and OEM customers.
• These chips differ from each other in a number of ways, including number of cores, clock speed, L3 cache capacity, number and speed of UltraPath links between sockets, number of sockets supported, main memory capacity, width of the AVX vector units, etc.

Intel Xeon: Westmere - Skylake

Feature | Xeon 5600 (Westmere-EP) | Xeon E5-2600 (Sandy Bridge-EP) | Xeon E5-2600 v4 “Broadwell-EP” | Intel Xeon Scalable Processor “Skylake”
Cores / Threads | Up to 6 cores / 12 threads | Up to 8 cores / 16 threads | Up to 22 cores / 44 threads | Up to 28 cores / 56 threads
Last-level cache | 12 MB | Up to 20 MB | Up to 55 MB | Up to 38.5 MB (non-inclusive)
Max memory channels, speed / socket | 3 x DDR3 channels, 1333 | 4 x DDR3 channels, 1600 | 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2400 MHz | 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2666 MHz
New instructions | AES-NI | AVX 1.0, 8 DP Flops/Clock | AVX 2.0, 16 DP Flops/Clock | AVX 512, 32 DP Flops/Clock
QPI / UPI Speed (GT/s) | 1 QPI channel @ 6.4 GT/s | 2 QPI channels @ 8.0 GT/s | 2 x QPI channels @ 9.6 GT/s | Up to 3 x UPI @ 10.4 GT/s
PCIe Lanes / Controllers / Speed (GT/s) | 36 lanes PCIe 2.0 on chipset | 40 Lanes / Socket, Integrated PCIe 3.0 | 40 / 10 / PCIe* 3.0 (2.5, 5, 8 GT/s) | 48 / 12 / PCIe* 3.0 (2.5, 5, 8 GT/s)
Server / Workstation TDP | Server / Workstation: 130W | Up to 130W Server; 150W Workstation | 55 - 145W | 70 – 205W
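As a rough illustration of what the “DP Flops/Clock” row implies, nominal peak double-precision throughput per socket can be estimated as cores × clock × flops per clock. The helper below is a hypothetical sketch, not something from the slides. It uses the base clocks and core counts quoted elsewhere in this deck and ignores both turbo and the reduced clocks that typically apply under heavy AVX-512 use, so the results are upper bounds rather than measurements.

```python
def peak_dp_gflops(cores: int, clock_ghz: float, flops_per_clock: int) -> float:
    """Nominal peak DP GFLOP/s for one socket: cores x GHz x flops per clock."""
    return cores * clock_ghz * flops_per_clock


# (cores, base clock in GHz, DP flops per clock from the table)
SOCKETS = {
    "Broadwell E5-2697A v4": (16, 2.6, 16),
    "Skylake Gold 6142": (16, 2.6, 32),
    "Skylake Gold 6148": (20, 2.4, 32),
}

if __name__ == "__main__":
    for name, (cores, ghz, fpc) in SOCKETS.items():
        print(f"{name}: {peak_dp_gflops(cores, ghz, fpc):.1f} GFLOP/s per socket")
```

At equal core count and clock, the Gold 6142 has twice the nominal peak of the E5-2697A v4. The application results later in the deck show far smaller gains, which is the usual gap between peak and sustained performance.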

Baseline Cluster Systems

Cluster: Configuration

Intel Sandy Bridge Clusters
• “Raven”: 128 x Bull|ATOS b510 EP-nodes, each with 2 Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.
• Supercomputing Wales: 384 x Fujitsu CX250 EP-nodes, each with 2 Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.

Intel Broadwell Clusters
• Dell PE R730/R630, Broadwell E5-2697A v4 2.6 GHz 16C: HPC Advisory Council “Thor” cluster, Dell PowerEdge R730/R630 36-node cluster; 2 x Xeon E5-2697A v4 @ 2.6GHz, 16 Core, 145W TDP, 40MB Cache, 256GB DDR4 2400MHz; Interconnect: ConnectX-4 EDR.
• ATOS Broadwell E5-2680 v4 2.4 GHz 14C: 32-node cluster; node config: 2 x Xeon E5-2680 v4 @ 2.4GHz, 14 Core, 145W TDP, 40MB Cache, 128GB DDR4 2400MHz; Interconnect: Mellanox ConnectX-4 EDR and Intel OPA.

IBM Power 8 S822LC
• IBM Power 8 S822LC with Mellanox EDR: 20 cores, 3.49 GHz with performance CPU governor; 256 GB memory; 1 IB (EDR) port; 2 × NVIDIA K80 GPU; IBM PE (Parallel Environment); Operating System: RHEL 7.2 LE; Compilers: xlC 13.1.3, xlf 15.1.3, gcc 4.8.5 (Red Hat), gcc 5.2.1 (from IBM Advance Toolchain 9.0).

Skylake Cluster Systems

Cluster / Configuration

• 32-node Dell|EMC cluster running SLURM with separate partitions for each processor SKU; Mellanox EDR:
  ¤ Intel® Xeon® Gold 6150 Processor (24.75M Cache, 2.70 GHz), Max Turbo frequency 3.70 GHz; # cores 18; # threads 36; DDR4-2666; TDP 165W; # of UPI Links 3.
  ¤ Intel® Xeon® Gold 6142 Processor (22M Cache, 2.60 GHz), Max Turbo frequency 3.70 GHz; # cores 16; # threads 32; DDR4-2666; TDP 150W; # of UPI Links 3.
• 28-node Dell|EMC cluster running SLURM; Intel OPA:
  ¤ Intel® Xeon® Gold 6130 Processor (22M Cache, 2.10 GHz), Max Turbo frequency 3.70 GHz; # cores 16; # threads 32; DDR4-2666; TDP 125W; # of UPI Links 3.
  ¤ The 6130s are configured with 12 × 8GB 2666 DIMMs rather than 12 × 16GB, resulting in somewhat lower memory bandwidth (165 GB/s vs 195 GB/s STREAM Triad).
• 20-node Bull|ATOS cluster running SLURM; Mellanox EDR:
  ¤ Intel® Xeon® Gold 6150 Processor (24.75M Cache, 2.70 GHz), Max Turbo frequency 3.70 GHz; # cores 18; # threads 36; DDR4-2666; TDP 165W; # of UPI Links 3; SMT.
• 16-node Intel cluster running SLURM; Intel OPA:
  ¤ Intel® Xeon® Gold 6148 Processor (27.5M Cache, 2.40 GHz), Max Turbo frequency 3.70 GHz; # cores 20; # threads 40; DDR4-2666; TDP 150W; # of UPI Links 3; SMT.

The Performance Benchmarks

• The test suite comprises both synthetics & end-user applications. Synthetics include HPCC (http://icl.cs.utk.edu/hpcc/) & IMB benchmarks (http://software.intel.com/en-us/articles/intel-mpi-benchmarks), IOR and STREAM.
• Variety of “open source” & commercial end-user application codes:
  ¤ GROMACS, LAMMPS, NAMD, DL_POLY classic & DL_POLY-4 (molecular dynamics)
  ¤ Quantum Espresso, Siesta, CP2K, ONETEP, CASTEP and VASP (ab initio materials properties)
  ¤ NWChem, GAMESS-US and GAMESS-UK (molecular electronic structure)
• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed, e.g., memory bandwidth and latency, node floating point performance, interconnect performance (both latency and B/W) and sustained I/O performance.

Memory B/W – STREAM performance

TRIAD rate (MB/s); OMP_NUM_THREADS (KMP_AFFINITY=physical)

System | TRIAD (MB/s)
Bull b510 “Raven” Sandy Bridge e5-2670/2.6GHz IB-QDR | 74,309
ClusterVision e5-2650v2 2.6GHz | 93,486
Dell R730 Haswell e5-2697v3 2.6GHz (T) | 118,605
Dell OPA32 e5-2660v3 2.6GHz (T) OPA | 114,367
Thor Broadwell e5-2697A v4 2.6GHz (T) | 132,035
ATOS Broadwell e5-2680v4 2.4GHz (T) OPA | 128,083
Dell Skylake Gold 6130 2.1GHz (T) OPA | 165,974
Dell Skylake Gold 6142 2.6GHz (T) | 185,863
Dell Skylake Gold 6148 2.4GHz (T) | 195,122
IBM Power8 S822LC 2.92GHz IB/EDR | 184,087
AMD Epyc 7601 2.2 GHz | 279,562

Chart groupings: Ivy Bridge & Haswell (E5-26xx v2, v3); Broadwell (E5-26xx v4); Skylake Gold (6130, 6142, 6148).
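To put the Skylake TRIAD figures in context, they can be compared with a theoretical peak derived from the memory configuration quoted earlier (6 channels of DDR4-2666 per socket). The sketch below assumes the measurements are for a dual-socket node, 8 bytes per transfer per channel, and decimal megabytes. These assumptions and the function names are illustrative and do not come from the slides.

```python
def peak_bandwidth_mb_s(sockets: int, channels: int, mega_transfers: int) -> int:
    """Theoretical peak in MB/s, assuming 8 bytes per transfer per channel."""
    return sockets * channels * mega_transfers * 8


def fraction_of_peak(measured_mb_s: float, peak_mb_s: float) -> float:
    """Share of the theoretical peak that the benchmark sustained."""
    return measured_mb_s / peak_mb_s


# Dual-socket Skylake node: 2 sockets x 6 channels x DDR4-2666 (assumed layout)
SKYLAKE_PEAK = peak_bandwidth_mb_s(sockets=2, channels=6, mega_transfers=2666)

# STREAM TRIAD rates (MB/s) taken from the chart
TRIAD = {"Gold 6130": 165_974, "Gold 6142": 185_863, "Gold 6148": 195_122}

if __name__ == "__main__":
    print(f"Assumed peak: {SKYLAKE_PEAK} MB/s")
    for name, rate in TRIAD.items():
        print(f"{name}: {fraction_of_peak(rate, SKYLAKE_PEAK):.0%} of assumed peak")
```

Under these assumptions the Gold 6148 node sustains roughly three quarters of peak, while the Gold 6130 node with its smaller DIMMs sustains noticeably less.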

Memory B/W – STREAM / core performance

TRIAD rate (MB/s) per core; OMP_NUM_THREADS (KMP_AFFINITY=physical)

System | TRIAD per core (MB/s)
Bull b510 “Raven” Sandy Bridge e5-2670/2.6GHz IB-QDR | 4,644
ClusterVision e5-2650v2 2.6GHz | 5,843
Dell R730 Haswell e5-2697v3 2.6GHz (T) | 4,236
Dell OPA32 e5-2660v3 2.6GHz (T) OPA | 5,718
Thor Broadwell e5-2697A v4 2.6GHz (T) | 4,126
ATOS Broadwell e5-2680v4 2.4GHz (T) OPA | 4,574
Dell Skylake Gold 6130 2.1GHz (T) OPA | 5,187
Dell Skylake Gold 6142 2.6GHz (T) | 5,808
Dell Skylake Gold 6148 2.4GHz (T) | 4,878
IBM Power8 S822LC 2.92GHz IB/EDR | 9,204
AMD Epyc 7601 2.2 GHz | 4,368

Chart groupings: Ivy Bridge & Haswell (E5-26xx v2, v3); Broadwell (E5-26xx v4); Skylake Gold (6130, 6142, 6148).
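The per-core figures appear to be the node TRIAD rates from the previous slide divided by the number of cores per node. The short check below reproduces four of them. The cores-per-node values are assumptions based on dual-socket nodes and the core counts quoted in this deck (8-core E5-2670, 16-core Gold 6130, 20-core Gold 6148), plus the 32-core EPYC 7601.

```python
def per_core(triad_mb_s: float, cores: int) -> float:
    """STREAM TRIAD rate per core in MB/s."""
    return triad_mb_s / cores


# system: (node TRIAD in MB/s, assumed cores per dual-socket node)
SYSTEMS = {
    "Raven e5-2670": (74_309, 16),
    "Skylake Gold 6130": (165_974, 32),
    "Skylake Gold 6148": (195_122, 40),
    "AMD Epyc 7601": (279_562, 64),
}

if __name__ == "__main__":
    for name, (rate, cores) in SYSTEMS.items():
        print(f"{name}: {per_core(rate, cores):,.0f} MB/s per core")
```

The EPYC node has the highest aggregate bandwidth in the comparison, but spread over 64 cores its per-core rate is below that of the Skylake Gold 6130 and 6142.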

MPI Performance – PingPong

[Figure: IMB Benchmark (Intel) PingPong, 1 PE / node. Log-log plot of bandwidth (Mbytes/sec, 1 to 10,000) against message length (bytes, 1.E+00 to 1.E+07). Annotations: Latency; BETTER. Data labels shown on the plot: 3.8; 11,466; 5,957; 1.7; 3,694; 1,729.]

Systems plotted:
• ATOS AMD EPYC 7601 2.2 GHz (T) EDR
• Intel SKL Gold 6148 2.4GHz (T) OPA
• Dell Skylake Gold 6150 2.7GHz (T) EDR
• IBM Power8 S822LC 2.92GHz IB/EDR
• Thor BDW e5-2697A v4 2.6GHz (T) EDR
• Intel BDW e5-2690v4 2.6GHz (T) OPA
• Dell OPA32 e5-2660v3 2.6GHz (T) OPA
• Bull HSW E5-2680v3 2.5 GHz (T) Connect-IB
• Dell R720 e5-2680v2 2.8 GHz (T) Connect-IB
• Azure A9 WE (e5-2670 2.6 GHz) IB RDMA
• Merlin Xeon E5472 3.0 GHz QC + IB (mvapich2 1.4)

MPI Collectives – Alltoallv (128 PEs)

[Figure: IMB Benchmark (Intel), 128 PEs. Log-log plot of measured time (usec, 5.0E+01 to 5.0E+06) against message length (bytes, 1.E+00 to 1.E+07). Annotations: Latency; BETTER; data label 215.4; “Time-consuming messages called by Alltoall & Alltoallv (IPM)”.]

Systems plotted:
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR
• Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA
• Dell Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Dell Skylake Gold 6142 2.6GHz (T) EDR

HPC Challenge Benchmark (Source - http://icl.cs.utk.edu/hpcc/)

• HPL: the Linpack TPP benchmark; measures the floating point rate of execution for solving a linear system of equations.
• DGEMM: measures the floating point rate of execution of double precision real matrix-matrix multiplication.
• STREAM: measures sustainable memory B/W (in GB/s) and the corresponding computation rate for a simple vector kernel.
• PTRANS: parallel matrix transpose; exercises the communications where pairs of processors communicate with each other simultaneously. A useful test of the total communications capacity of the network.
• RandomAccess: measures the rate of integer random updates of memory (GUPS).
• FFTE: measures the floating point rate of execution of a double precision complex one-dimensional Discrete Fourier Transform. Performance is a combination of flops, memory and network bandwidth.
• Communication B/W and latency: tests to measure latency and B/W of a number of simultaneous communication patterns; based on b_eff (effective bandwidth benchmark).

CPU cores | N
128 | 83,000
256 | 117,000
512 | 166,000
1024 | 234,000
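The matrix sizes N in the table grow roughly as the square root of the core count, which is what one would expect if the memory taken by the HPL matrix per core is held about constant. The sketch below checks this under the assumption of a dense double-precision matrix (8 N² bytes) and decimal megabytes. The function names are hypothetical, and the rule is an inference from the table rather than something stated on the slide.

```python
import math


def scaled_n(base_n: int, base_cores: int, cores: int) -> float:
    """Matrix size that keeps N^2 / cores constant relative to a base run."""
    return base_n * math.sqrt(cores / base_cores)


def matrix_mb_per_core(n: int, cores: int) -> float:
    """Memory for an N x N double-precision matrix, shared evenly across cores."""
    return 8 * n * n / cores / 1e6


# (CPU cores, N) pairs from the table
TABLE = [(128, 83_000), (256, 117_000), (512, 166_000), (1024, 234_000)]

if __name__ == "__main__":
    for cores, n in TABLE:
        print(f"{cores:5d} cores: N={n:,} "
              f"(sqrt rule gives {scaled_n(83_000, 128, cores):,.0f}), "
              f"{matrix_mb_per_core(n, cores):.1f} MB per core")
```

Each row works out to about 430 MB of matrix per core, and the square-root rule lands within a few hundred of every tabulated N.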

HPCC - 128 Processing Elements [Matrix Size 83,000]

[Figure: results for each HPCC metric plotted on a 0.0 to 1.0 scale: HPL [Gflops], G-PTRANS [GB/s], G-RandomAccess [Gup/s], G-FFTE [Gflops], EP-STREAM Sys [GB/s], EP-STREAM Triad [GB/s], EP-DGEMM [Gflops], Random Ring Bandwidth [Gbytes], Random Ring Latency [usec].]

Systems plotted:
• Fujitsu CX250 e5-2670 2.6 GHz QDR
• Bull b510 (Raven) e5-2670 2.6GHz IB-QDR
• ATOS Genji e5-2680 v4 2.4GHz (T) OPA IMPI
• Thor Dell e5-2697A v4 2.6GHz (T) EDR HPCX
• Thor Dell e5-2697A v4 2.6GHz (T) OPA
• Dell Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Dell Skylake Gold 6142 2.6GHz (T) EDR
• Dell Skylake Gold 6150 2.7GHz (T) EDR

Application Code Performance in Materials Science, Chemistry and Nanoscience:

DLPOLY, GROMACS, NAMD, LAMMPS, GAMESS, NWChem, GAMESS-UK, ONETEP, VASP, SIESTA, CASTEP, Quantum Espresso, CP2K – on a variety of HPC systems.

IPM Performance Monitoring

http://ipm-hpc.sourceforge.net/userguide.html

IPM (version 2.0.6) is a profiling tool that helps analyse MPI programs.

• Very easy to use: it requires no code modifications (unless you want more information) and provides a lightweight profiling interface with very low overhead (<2%).
• Can create HTML output that includes a graphical representation of the data.

There are three ways to run a program with IPM profiling:

1. Set an environment variable before you run your program:

$ export LD_PRELOAD=/application/tools/ipm-2.0.6/install-impi/lib/libipm.so

2. Recompile your program with IPM enabled (the library is listed after the source file so that the linker resolves its symbols):

$ mpicc your_program.c -L/path/to/ipm/lib -lipm -o your-program

3. Use “export I_MPI_STATS=ipm” if using Intel’s mpirun or mpiexec.hydra.

When a program is executed with IPM, an XML file is created that can be parsed to text or HTML using ''ipm_parse -html xmlfile''.

Allinea Performance Reports

Allinea Performance Reports provides a mechanism to characterize and understand the performance of HPC application runs through a single-page HTML report.

• Based on Allinea MAP's adaptive sampling technology, which keeps the data volumes collected and the application overhead low.
• Modest application slowdown (ca. 5%) even with 1000s of MPI processes.
• Runs on existing codes: a single command added to execution scripts.
• If submitted through a batch queuing system, the submission script is modified to load the Allinea module and add the 'perf-report' command in front of the required mpiexec command:
  ¤ perf-report mpiexec -n 4 $code
• A Report Summary characterizes how the application's wallclock time was spent, broken down into CPU, MPI and I/O.
• All examples updated on the Broadwell Mellanox cluster (E5-2697A v4).

Molecular Simulation I. DL_POLY

Molecular Dynamics Codes: AMBER, DL_POLY, CHARMM, NAMD, LAMMPS, GROMACS etc.

DL_POLY
• Developed as the CCP5 parallel MD code by W. Smith, T.R. Forester and I. Todorov
• UK CCP5 + international user community
• DLPOLY_classic (replicated data) and DLPOLY_3 & _4 (distributed data – domain decomposition)
• Areas of application: liquids, solutions, spectroscopy, ionic solids, molecular crystals, polymers, glasses, membranes, proteins, metals, solid and liquid interfaces, catalysis, clathrates, liquid crystals, biopolymers, polymer electrolytes.

DL_POLY 4 – Distributed data

Domain Decomposition - Distributed data:

[Figure: domains labelled A, B, C and D.]

• Distribute atoms, forces across the nodes
  ¤ More memory efficient, can address much larger cases (10^5-10^7)
• SHAKE and short-range forces require only neighbour communication
  ¤ Communications scale linearly with number of nodes
• Coulombic energy remains global
  ¤ Adopt Smooth Particle Mesh Ewald scheme
• Includes Fourier transform smoothed charge density (reciprocal space grid typically 64x64x64 - 128x128x128)

http://www.scd.stfc.ac.uk//research/app/ccg/software/DL_POLY/44516.aspx
W. Smith and I. Todorov

Benchmarks
1. NaCl Simulation; 216,000 ions, 200 time steps, Cutoff=12Å
2. Gramicidin in water; rigid bonds + SHAKE: 792,960 atoms, 50 time steps
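The idea of distributing atoms across domains can be illustrated with a toy mapping from a position to the rank that owns it. This is only a sketch of the general technique under simple assumptions (an orthorhombic periodic box split into a regular grid of equal domains). It is not DL_POLY's implementation, and the function name and rank ordering are hypothetical.

```python
def domain_of(pos, box, grid):
    """Return the rank of the domain owning `pos`.

    pos  -- (x, y, z) coordinates of one atom
    box  -- (Lx, Ly, Lz) edge lengths of the periodic simulation cell
    grid -- (nx, ny, nz) number of domains along each axis
    """
    idx = []
    for x, length, n in zip(pos, box, grid):
        frac = (x % length) / length          # wrap into the box, scale to [0, 1)
        idx.append(min(int(frac * n), n - 1)) # guard against rounding up to n
    ix, iy, iz = idx
    _, ny, nz = grid
    return (ix * ny + iy) * nz + iz           # row-major rank numbering


if __name__ == "__main__":
    box = (10.0, 10.0, 10.0)
    grid = (2, 2, 1)  # four domains, as in an A/B/C/D picture
    for atom in [(1.0, 1.0, 1.0), (1.0, 6.0, 1.0), (6.0, 1.0, 1.0), (6.0, 6.0, 1.0)]:
        print(atom, "-> rank", domain_of(atom, box, grid))
```

Each rank then stores only its own atoms, and short-range forces need data only from ranks whose domains border it, which is why the slide describes that communication as neighbour-only.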

DL_POLY 4 – Gramicidin Simulation

[Figure: Performance against number of processing elements (64, 128, 192, 256), relative to the Fujitsu CX250 e5-2670 2.6 GHz 8-C (32 PEs); Gramicidin 792,960 atoms, 50 time steps; Performance Data (64-256 PEs). Annotation: BETTER. Data labels shown for three series: 1.8, 3.1, 4.0, 4.9; 2.7, 5.0, 6.5, 8.4; 2.9, 5.3, 7.0, 9.1.]

Systems plotted:
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR
• Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR

SKL 6142 2.6 GHz ~ 1.06 X e5-2697v4 2.6 GHz

DLPOLY4 – Gramicidin Simulation Performance Report

Performance Data (32-256 PEs); Smooth Particle Mesh Ewald Scheme

[Figure, two panels, each plotted for 32, 64, 128 and 256 PEs. “Total Wallclock Time Breakdown”: CPU (%) and MPI (%). “CPU Time Breakdown”: CPU Scalar numeric ops (%), CPU Vector numeric ops (%), CPU Memory accesses (%).]

“DL_POLY_4 and Xeon Phi: Lessons Learnt”, Alin Marin Elena, Christian Lalanne, Victor Gamayunov, Gilles Civario, Michael Lysaght, and Ilian Todorov

Molecular Simulation II. GROMACS

GROMACS v5.0 (GROningen MAchine for Chemical Simulations) is a molecular dynamics package designed for simulations of proteins, lipids and nucleic acids [University of Groningen].

Berk Hess et al. "GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation". Journal of Chemical Theory and Computation 4 (3): 435–447.

Ion channel system
• The 142k particle ion channel system is the membrane protein GluCl, a pentameric chloride channel embedded in a DOPC membrane and solvated in TIP3P water, using the Amber ff99SB-ILDN force field. This system is a challenging parallelization case due to its small size, but is one of the most wanted target sizes for biomolecular simulations.

Lignocellulose
• GROMACS Test Case B from the UEA Benchmark Suite. A model of cellulose and lignocellulosic biomass in an aqueous solution. This system of 3.3M atoms is inhomogeneous, and uses reaction-field electrostatics instead of PME and therefore should scale well.

GROMACS – Ion Channel Simulation

[Figure: Performance (ns/day) against number of processing elements (64, 128, 192, 256) for the 142k particle ion channel system; Performance Data (64-256 PEs). Annotation: BETTER. Data labels shown for three series: 20.6, 37.9, 54.5, 68.8; 32.5, 60.3, 75.5, 100.4; 36.4, 68.5, 98.8, 123.6.]

Systems plotted:
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR
• Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR

GROMACS – Ion-channel Performance Report

Performance Data (32-256 PEs)

[Figure, two panels, each plotted for 32, 64, 128 and 256 PEs. “Total Wallclock Time Breakdown”: CPU (%) and MPI (%). “CPU Time Breakdown”: CPU Scalar numeric ops (%), CPU Vector numeric ops (%), CPU Memory accesses (%).]

GROMACS – Lignocellulose

[Figure: Performance (ns/day) against number of processing elements (64, 128, 192, 256) for the 3,316,463 atom system using reaction-field electrostatics instead of PME; Performance Data (64-256 PEs). Annotation: BETTER. Data labels shown for five series: 0.9, 1.7, 2.5, 3.3; 1.3, 2.6, 3.8, 5.0; 1.3, 2.6, 3.8, 5.0; 1.4, 2.8, 3.9, 5.2; 1.6, 3.1, 4.6, 6.1.]

Systems plotted:
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR IMPI
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR
• Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR
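Scaling charts like this one are often summarised as a parallel efficiency relative to the smallest run. The sketch below applies that calculation to one of the labelled series from the lignocellulose chart (0.9, 1.7, 2.5 and 3.3 ns/day at 64 to 256 PEs). The extracted labels do not say which system that series belongs to, so treat it purely as a worked example, and note that the labels are rounded to one decimal place.

```python
def parallel_efficiency(perf: float, pes: int, base_perf: float, base_pes: int) -> float:
    """Observed speedup divided by the ideal speedup over the base run."""
    return (perf / base_perf) / (pes / base_pes)


PES = [64, 128, 192, 256]
NS_PER_DAY = [0.9, 1.7, 2.5, 3.3]  # one labelled series from the chart

if __name__ == "__main__":
    for pes, perf in zip(PES, NS_PER_DAY):
        eff = parallel_efficiency(perf, pes, NS_PER_DAY[0], PES[0])
        print(f"{pes:3d} PEs: {perf:.1f} ns/day, efficiency {eff:.0%}")
```

For this series the efficiency stays above 90% out to 256 PEs, in line with the expectation that the reaction-field lignocellulose case should scale well.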

GROMACS - IPM Reports

Performance Data (256 PEs)

Ion-channel
wallclock : 26.8 secs
# mpi_tasks : 256 on 8 nodes
%comm : 34.90%
[Figure: MPI routines as % of Total Time; data labels 13%, 10%, 7%.]

Ligno-cellulose
wallclock : 352.1 secs
# mpi_tasks : 256 on 8 nodes
%comm : 5.81%
[Figure: MPI routines as % of Total Time; data labels 4%, 1%, 1%.]
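The %comm figure in these reports can be turned into an approximate time spent in MPI by multiplying it by the wallclock time. The sketch below does this for the two 256-PE runs reported here (26.8 secs at 34.90% and 352.1 secs at 5.81%). It assumes %comm is the share of wallclock time spent in MPI calls, and the helper name is hypothetical.

```python
def mpi_seconds(wallclock_s: float, comm_percent: float) -> float:
    """Approximate time spent in MPI, given wallclock time and %comm."""
    return wallclock_s * comm_percent / 100.0


# (wallclock in seconds, %comm) for the two runs in the IPM reports
RUNS = [(26.8, 34.90), (352.1, 5.81)]

if __name__ == "__main__":
    for wallclock, comm in RUNS:
        print(f"{wallclock:6.1f} s at {comm:5.2f}% comm -> "
              f"{mpi_seconds(wallclock, comm):.1f} s in MPI")
```

The shorter run spends about 9 of its 27 seconds communicating, while the longer run spends about 20 of 352, so the much larger share in the first case reflects how little computation there is per task.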

GAMESS-UK - Moving to Distributed Data

The MPI/ScaLAPACK Implementation of the GAMESS-UK SCF/DFT module

• Alternative pragmatic approach in which
  ¤ MPI-based tools (such as ScaLAPACK) are used in place of Global Arrays
  ¤ All data structures except those required for the Fock matrix build are fully distributed (F, P)
• Partially distributed model chosen because, in the absence of efficient one-sided communications, it is difficult to efficiently load balance a distributed Fock matrix build.
• Obvious drawback: some large replicated data structures are required.
  ¤ These are kept to a minimum. For a closed shell HF or DFT calculation only 2 replicated matrices are required, 1 × Fock and 1 × Density (doubled for UHF).

“The GAMESS-UK electronic structure package: algorithms, developments and applications”, M.F. Guest, I. J. Bush, H.J.J. van Dam, P. Sherwood, J.M.H. Thomas, J.H. van Lenthe, R.W.A Havenith, J. Kendrick, Mol. Phys. 103, No. 6-8, 2005, 719-747.
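The cost of the replicated Fock and Density matrices can be estimated from the number of basis functions. The sketch below assumes each matrix is stored as a full N × N array of 8-byte reals on every MPI process. The storage layout is an assumption (a packed triangular form would roughly halve it), and the helper name is hypothetical. The basis sizes are those of the two benchmark cases quoted in this deck (3975 GTOs for the zeolite Y cluster, 1855 GTOs for cyclosporin).

```python
def replicated_mb(n_basis: int, n_matrices: int = 2, uhf: bool = False) -> float:
    """Per-process memory (MB) for replicated N x N double-precision matrices.

    n_matrices defaults to 2 (1 x Fock, 1 x Density); UHF doubles the count.
    """
    count = n_matrices * (2 if uhf else 1)
    return count * n_basis * n_basis * 8 / 1e6


if __name__ == "__main__":
    for label, n in [("Zeolite Y cluster", 3975), ("Cyclosporin", 1855)]:
        print(f"{label} ({n} GTOs): {replicated_mb(n):.1f} MB closed shell, "
              f"{replicated_mb(n, uhf=True):.1f} MB UHF")
```

Under these assumptions the zeolite case replicates roughly 250 MB per process, which is modest on nodes of this size but grows quadratically with the basis set.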

GAMESS-UK Performance - Zeolite Y cluster

[Figure: Performance against number of processing elements (128, 256), relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs); Zeolite Y cluster SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP (3975 GTOs). Annotation: BETTER. Data labels shown on the plot: 1.2; 1.5, 2.7; 1.6, 2.8; 1.2, 2.1; 1.6, 2.9; 1.5, 2.7; 1.8, 3.2; 1.6, 2.8; 1.8, 3.3; 1.9, 3.5; 1.1, 2.0.]

Systems plotted:
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• ClusterVision e5-2650v2 2.6GHz Truescale QDR
• Intel Ivy Bridge e5-2690v2 3.0GHz (T) True Scale QDR
• Bull Haswell e5-2695v3 2.3GHz Connect-IB
• Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR
• Huawei Fusion CH140 e5-2683 v4 2.1GHz (T) EDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) EDR
• Thor Broadwell e5-2697A v4 2.6GHz (T) EDR
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR
• Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR
• IBM Power8 S822LC 2.92GHz IB/EDR

SKL 6142 2.6 GHz ~ 1.05 X e5-2697v4 2.6 GHz

GAMESS-UK.MPI DFT – DFT Performance Report

Performance Data (32-256 PEs); Cyclosporin 6-31G** basis (1855 GTOs); DFT B3LYP

[Figure, two panels, each plotted for 32, 64, 128 and 256 PEs. “Total Wallclock Time Breakdown”: CPU (%) and MPI (%). “CPU Time Breakdown”: CPU Scalar numeric ops (%), CPU Vector numeric ops (%), CPU Memory accesses (%).]

Advanced Materials Software

Computational Materials

• VASP – performs ab-initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane wave basis set.
• Quantum Espresso – an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modelling at the nanoscale. It is based on density-functional theory (DFT), plane waves, and pseudopotentials.
• SIESTA – an O(N) DFT code for electronic structure calculations and ab initio molecular dynamics simulations for molecules and solids. It uses norm-conserving pseudopotentials and a linear combination of numerical atomic orbitals (LCAO) basis set.
• CP2K – a program to perform atomistic and molecular simulations of solid state, liquid, molecular, and biological systems. It provides a framework for different methods such as DFT using a mixed Gaussian & plane waves approach (GPW) and classical pair and many-body potentials.
• ONETEP (Order-N Electronic Total Energy Package) – a linear-scaling code for quantum-mechanical calculations based on DFT.

Quantum Espresso

Quantum Espresso is an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modelling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials.

• Ground-state calculations.
• Structural Optimization.
• Transition states & minimum energy paths.
• Ab-initio molecular dynamics.
• Response properties (DFPT).
• Spectroscopic properties.
• Quantum Transport.

Benchmark Details
• DEISA AU112: Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288)
• PRACE GRIR443: Carbon-Iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions: (180, 180, 192)
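The FFT dimensions quoted for each benchmark give a feel for the size of the real-space grid that has to be transformed and redistributed between processes. The sketch below counts the grid points and estimates the storage for one grid, assuming one double-precision complex value (16 bytes) per point and decimal megabytes. That storage assumption and the helper names are illustrative and not taken from the slides.

```python
from math import prod


def grid_points(dims) -> int:
    """Total number of points in an FFT grid with the given dimensions."""
    return prod(dims)


def grid_mb(dims, bytes_per_point: int = 16) -> float:
    """Memory (MB) for one grid, assuming complex double precision values."""
    return grid_points(dims) * bytes_per_point / 1e6


# FFT dimensions from the benchmark details
BENCHMARKS = {"AU112": (180, 90, 288), "GRIR443": (180, 180, 192)}

if __name__ == "__main__":
    for name, dims in BENCHMARKS.items():
        print(f"{name}: {grid_points(dims):,} points, {grid_mb(dims):.1f} MB per grid")
```

Spread over 256 processes, each AU112 grid amounts to only a few hundred kilobytes per process, so the exchanges in a distributed 3D FFT involve many small messages.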

Quantum Espresso – Au112

[Figure: Performance against number of processing elements (axis 0 to 320), relative to the Fujitsu e5-2670 2.6 GHz 8-C (32 PEs); Performance Data (32 - 320 PEs). Annotation: BETTER. Data labels shown on the plot: 1.0, 1.3, 2.0, 2.0, 2.5, 2.9, 3.3, 3.4, 3.3, 5.6, 5.0, 5.7, 7.6, 7.8; 1.5, 2.2, 3.2, 5.1, 4.3, 4.9, 5.9; 1.9, 2.7, 4.0, 6.7, 5.7, 6.4, 8.3, 8.8.]

Systems plotted:
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI DAPL
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR
• Intel Skylake Gold 6148 2.4GHz (T) OPA

Quantum Espresso – Au112 Performance Report

Performance Data (32-256 PEs)

Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288)

[Figure: two charts at 32, 64, 128 and 256 PEs. "Total Wallclock Time Breakdown" shows CPU (%) and MPI (%). "CPU Time Breakdown" shows CPU scalar numeric ops (%), CPU vector numeric ops (%) and CPU memory accesses (%).]

Quantum Espresso – Au112 IPM Report

Performance Data (256 PEs)

wallclock : 108.3 secs
# mpi_tasks : 256 on 8 nodes
%comm : 72.52%

Time-consuming messages called by Alltoall are modest sized (<1KB)

[Figure: IPM breakdown of MPI calls as % of total time. MPI_Alltoall is labelled; the marked segments are 36%, 17% and 8%.]
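The IPM summary above gives the communication share as a percentage of wallclock time; converting it to seconds makes the cost concrete. The following is a minimal sketch of that arithmetic. The function name `split_wallclock` is illustrative and is not part of IPM.

```python
def split_wallclock(wallclock_s: float, comm_percent: float) -> tuple[float, float]:
    """Split a wallclock time into (MPI seconds, non-MPI seconds)."""
    mpi_s = wallclock_s * comm_percent / 100.0
    return mpi_s, wallclock_s - mpi_s


# Values quoted in the IPM report for Au112 on 256 PEs.
mpi_s, other_s = split_wallclock(108.3, 72.52)
print(f"MPI: {mpi_s:.1f} s, other: {other_s:.1f} s")
```

With the reported figures, roughly 78.5 s of the 108.3 s run is spent in MPI.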

VASP – Vienna Ab-initio Simulation Package

VASP (5.4.1) performs ab-initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane wave basis set.

Zeolite Benchmark

• Zeolite with the MFI structure unit cell running a single point calculation with a planewave cut-off of 400 eV using the PBE functional
• 2 k-points; maximum number of plane-waves: 96,834
• FFT grid: NGX=65, NGY=65, NGZ=43, giving a total of 181,675 points

Pd-O Benchmark

• Pd-O complex – Pd75O12, 5x4 3-layer supercell running a single point calculation with a planewave cut-off of 400 eV. Uses the RMM-DIIS algorithm for the SCF and is calculated in real space.
• 10 k-points; maximum number of plane-waves: 34,470
• FFT grid: NGX=31, NGY=49, NGZ=45, giving a total of 68,355 points

Benchmark Details

MFI Zeolite: Zeolite (Si96O192), 2 k-points, FFT grid: (65, 65, 43); 181,675 points
Pd-O complex: Palladium-Oxygen complex (Pd75O12), 10 k-points, FFT grid: (31, 49, 45); 68,355 points
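The point totals quoted for the two FFT grids are simply the product NGX × NGY × NGZ. A short sketch that reproduces them (the dictionary and function names are illustrative only):

```python
from math import prod

# FFT grid dimensions (NGX, NGY, NGZ) quoted for the two VASP benchmarks.
FFT_GRIDS = {
    "MFI Zeolite": (65, 65, 43),
    "Pd-O complex": (31, 49, 45),
}


def grid_points(dims):
    """Total number of FFT grid points for a (NGX, NGY, NGZ) triple."""
    return prod(dims)


for name, dims in FFT_GRIDS.items():
    print(f"{name}: {grid_points(dims):,} points")
```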

VASP 5.4.1 – Pd-O Benchmark

Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points

[Figure: performance at 64, 128 and 256 PEs, relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (32 PEs); higher is better. Systems plotted:]
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR
• Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR

VASP 5.4.1 – Pd-O Benchmark - Parallelisation on k-points

Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points

[Figure: performance at 64, 128 and 256 PEs, relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (32 PEs); higher is better. Systems plotted:]
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA [KPAR=2]
• Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR
• Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR

NPEs  KPAR  NPAR
64    2     2
128   2     4
256   2     8
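The KPAR and NPAR settings in the table determine how the processing elements are divided up. The sketch below assumes the usual VASP convention, in which KPAR splits the cores into k-point groups and NPAR splits each group into band groups, so that the cores per band group equal NPEs / (KPAR × NPAR). The function name `vasp_layout` is illustrative and not part of VASP.

```python
def vasp_layout(npes: int, kpar: int, npar: int) -> tuple[int, int]:
    """Return (cores per k-point group, cores per band group).

    Assumes KPAR divides the cores into k-point groups and NPAR divides
    each k-point group into band groups.
    """
    if npes % kpar != 0:
        raise ValueError("NPEs must be divisible by KPAR")
    per_kpoint_group = npes // kpar
    if per_kpoint_group % npar != 0:
        raise ValueError("cores per k-point group must be divisible by NPAR")
    return per_kpoint_group, per_kpoint_group // npar


# (NPEs, KPAR, NPAR) rows from the table above.
for npes, kpar, npar in [(64, 2, 2), (128, 2, 4), (256, 2, 8)]:
    per_k, per_band = vasp_layout(npes, kpar, npar)
    print(f"{npes} PEs: {per_k} cores per k-point group, {per_band} per band group")
```

Under that assumption, every row of the table gives 16 cores per band group, with only the size of the k-point groups growing with the core count.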

VASP – Pd-O Benchmark Performance Report

Performance Data (32-256 PEs)

Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points

[Figure: two charts at 32, 64, 128 and 256 PEs. "CPU Time Breakdown" shows CPU scalar numeric ops (%), CPU vector numeric ops (%) and CPU memory accesses (%). "Total Wallclock Time Breakdown" shows CPU (%) and MPI (%).]

VASP 5.4.1 – Zeolite Benchmark

Zeolite (Si96O192) with MFI structure unit cell running a single point calculation with a 400 eV planewave cut-off using the PBE functional. Maximum number of plane-waves: 96,834; 2 k-points; FFT grid: (65, 65, 43); 181,675 points

[Figure: performance at 64, 128 and 256 PEs, relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (64 PEs); higher is better. An annotation gives IBM Power8 S822LC: 4.3. Systems plotted:]
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR
• Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR

OpenFOAM - The open source CFD toolbox

The OpenFOAM® (Open Field Operation and Manipulation) CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd.

http://www.openfoam.com/

Features

• OpenFOAM has an extensive range of features to solve anything from complex fluid flows involving chemical reactions, turbulence and heat transfer, to solid dynamics and electromagnetics.

Applications

• Includes over 80 solver applications that simulate specific problems in engineering mechanics and over 170 utility applications that perform pre- and post-processing tasks, e.g. meshing, data visualisation, etc.

Lid-driven cavity flow (Cavity 3d)

• Isothermal, incompressible flow in a 2D square domain. All the boundaries of the square are walls. The top wall moves in the x-direction at 1 m/s while the other 3 are stationary. Initially, the flow is assumed laminar and is solved on a uniform mesh using the icoFoam solver.

[Figure: geometry of the lid driven cavity]

http://www.openfoam.org/docs/user/cavity.php#x5-170002.1.5

OpenFOAM – Cavity 3d-3M

Performance Data (32-512 PEs)

OpenFOAM with lid-driven cavity flow 3d-3M data set

[Figure: performance vs. number of processing elements, relative to the Fujitsu CX250 e5-2670 8-C (32 PEs); higher is better. Systems plotted:]
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR
• Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR
• Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR
• Intel Skylake Gold 6148 2.4GHz (T) OPA

OpenFOAM – Cavity 3d-3M Performance Report

Performance Data (32-256 PEs)

OpenFOAM with lid-driven cavity flow 3d-3M data set

[Figure: two charts at 32, 64, 128 and 256 PEs. "Total Wallclock Time Breakdown" shows CPU (%) and MPI (%). "CPU Time Breakdown" shows CPU scalar numeric ops (%), CPU vector numeric ops (%) and CPU memory accesses (%).]

Application Code Performance in Materials Science, Chemistry and Nanoscience:

2. Selecting Fabrics and Optimising Performance: Intel MPI, Mellanox HPCX, IPM and Allinea Performance Report.

Selecting Fabrics – MPI Optimisation

• Intel MPI Library - can select a communication fabric at runtime without having to recompile the application. By default, it automatically selects the most appropriate fabric based on both S/W and H/W configuration, i.e. in most cases you do not have to worry about manually selecting a fabric.

• Specifying a particular fabric can boost performance. Fabrics can be specified for both intra-node and inter-node communications. The following fabrics are available:

Fabric   Network hardware and software used
shm      Shared memory (for intra-node communication only).
dapl     Direct Access Programming Library (DAPL) fabrics, such as InfiniBand (IB) and iWarp (through DAPL).
ofa      OpenFabrics Alliance (OFA) fabrics, e.g. InfiniBand (through OFED verbs).
tcp      TCP/IP network fabrics, such as Ethernet and InfiniBand (through IPoIB).
tmi      Tag Matching Interface (TMI) fabrics, such as Intel True Scale Fabric, Intel Omni Path Architecture and Myrinet (through TMI).
ofi      OpenFabrics Interfaces* (OFI)-capable fabrics, such as Intel True Scale Fabric, Intel Omni Path Architecture, IB and Ethernet (through OFI API).

• For inter-node communication, it uses the first available fabric from the default fabric list. This list is defined automatically for each H/W and S/W configuration (see I_MPI_FABRICS_LIST).

• For most configurations, this list is as follows: dapl, ofa, tcp, tmi, ofi
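The selection rule described above (take the first available fabric from the default list, then pair it with an intra-node fabric) can be sketched as follows. This is only an illustration of the logic, not Intel MPI code: the function names and the `available` set are hypothetical, and the real library probes the hardware and software itself. The resulting string uses the `intra:inter` form that `I_MPI_FABRICS` accepts, e.g. `shm:dapl`.

```python
# Fabric names from the table above.
KNOWN_FABRICS = ("shm", "dapl", "ofa", "tcp", "tmi", "ofi")
# Default inter-node list quoted for most configurations.
DEFAULT_INTER_NODE_LIST = ("dapl", "ofa", "tcp", "tmi", "ofi")


def first_available(available, fabric_list=DEFAULT_INTER_NODE_LIST):
    """Return the first fabric in fabric_list that is present in `available`."""
    for fabric in fabric_list:
        if fabric in available:
            return fabric
    raise LookupError("no fabric from the list is available")


def fabrics_value(inter_node, intra_node="shm"):
    """Build an '<intra>:<inter>' string of the kind passed to I_MPI_FABRICS."""
    for fabric in (intra_node, inter_node):
        if fabric not in KNOWN_FABRICS:
            raise ValueError(f"unknown fabric: {fabric}")
    if inter_node == "shm":
        raise ValueError("shm is for intra-node communication only")
    return f"{intra_node}:{inter_node}"


# Hypothetical node on which only the OFA and TCP fabrics are usable.
available = {"tcp", "ofa"}
print(fabrics_value(first_available(available)))  # shm:ofa
```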

Selecting Fabrics – MPI Optimisation

Intel MPI Library

# Adapter, port and transport used by the settings below
HCA=mlx5_0
HCAPORT=1
TRANSPORT=DAPL

MPIFLAGS+="-genv DAT_OVERRIDE /etc/dat.conf "
MPIFLAGS+="-genv I_MPI_DAT_LIBRARY /usr/lib64/libdat2.so "

# Transport-specific settings
if [[ "$TRANSPORT" == "DAPL" ]]; then
    MPIFLAGS+="-DAPL "
    MPIFLAGS+="-genv I_MPI_FABRICS shm:dapl "
    MPIFLAGS+="-genv I_MPI_DAPL_UD off "
    MPIFLAGS+="-genv I_MPI_DAPL_PROVIDER ofa-v2-$HCA-${HCAPORT}u "
    MPIFLAGS+="-genv DAPL_MAX_INLINE 256 "
    MPIFLAGS+="-genv I_MPI_DAPL_RDMA_RNDV_WRITE on "
    MPIFLAGS+="-genv DAPL_IB_MTU 4096 "
elif [[ "$TRANSPORT" == "OFA" ]]; then
    MPIFLAGS+="-IB "
    MPIFLAGS+="-genv MV2_USE_APM 0 "
    MPIFLAGS+="-genv I_MPI_FABRICS shm:ofa "
    MPIFLAGS+="-genv I_MPI_OFA_USE_XRC 1 "
    MPIFLAGS+="-genv I_MPI_OFA_NUM_ADAPTERS 1 "
    MPIFLAGS+="-genv I_MPI_OFA_ADAPTER_NAME $HCA "
    MPIFLAGS+="-genv I_MPI_OFA_NUM_PORTS 1 "
fi

# On Omni-Path, "=" (not "+=") replaces every flag set above
if [[ "$NET" == "OPA" ]]; then
    MPIFLAGS="-PSM2 "
fi

# Settings common to all fabrics
MPIFLAGS+="-genv I_MPI_PIN on "
MPIFLAGS+="-genv I_MPI_DEBUG 4 "
MPIFLAGS+="-genv MALLOC_MMAP_MAX_ 0 -genv MALLOC_TRIM_THRESHOLD_ -1 "

mpirun -np $NP $MPIFLAGS

Mellanox HPC-X Toolkit and Intel MPI

The Mellanox HPC-X Toolkit provides an MPI, SHMEM and UPC software suite for HPC environments. It delivers “enhancements to significantly increase the scalability & performance of message communications in the network”. It includes:

¤ Complete MPI, SHMEM, UPC package, including Mellanox MXM and FCA acceleration engines
¤ Offload of collectives communication from the MPI process onto Mellanox interconnect hardware
¤ Maximize application performance with underlying hardware architecture. Optimized for Mellanox InfiniBand and VPI interconnects
¤ Increase application scalability and resource efficiency
¤ Multiple transport support including RC, DC and UD
¤ Intra-node shared memory communication

• Performance comparison conducted on the Mellanox HP Proliant E5-2697A v4 EDR based cluster

http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf

Application Performance & Interconnect

Two comparison exercises undertaken:

¤ For each application (and associated data sets), analyse the performance as a function of interconnect (Mellanox EDR and Intel OPA) and of increasing core count.
• DLPOLY4 & GROMACS – 128-1024 cores
• VASP PdO (128-384 cores) & Zeolite (128-512 cores)
• Quantum ESPRESSO (Au112, 64-512; GRIR443, 128-1024)
• OpenFOAM (64-512 cores)

¤ On the Mellanox HP Proliant E5-2697A v4 EDR based Thor cluster, compare for each application (and associated data sets) the relative performance obtained using Intel MPI and Mellanox HPCX, i.e. T HPCX / T Intel-MPI

http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf
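The ratio T HPCX / T Intel-MPI can be expressed as a percentage. The sketch below assumes that T is an elapsed time, so a value above 100% means the HPCX run took longer and Intel MPI performed better. The function name and the timings are hypothetical and are not measurements from this study.

```python
def intel_mpi_vs_hpcx(t_hpcx_s: float, t_intel_mpi_s: float) -> float:
    """Return T_HPCX / T_IntelMPI as a percentage.

    Assumes both arguments are elapsed times in seconds, so values above
    100 indicate that Intel MPI was the faster of the two.
    """
    if t_hpcx_s <= 0 or t_intel_mpi_s <= 0:
        raise ValueError("timings must be positive")
    return 100.0 * t_hpcx_s / t_intel_mpi_s


# Hypothetical timings: HPCX run 110 s, Intel MPI run 100 s.
print(f"{intel_mpi_vs_hpcx(110.0, 100.0):.0f}%")  # 110%
```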

DL_POLY 4 – Gramicidin

Performance Data (128-1024 PEs)

Gramicidin 792,960 atoms; 50 time steps

[Figure: performance at 128, 256, 512 and 1024 PEs, relative to the Fujitsu CX250 e5-2670/2.6 GHz 8-C (32 PEs); higher is better. Systems plotted:]
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR

GROMACS – Lignocellulose

Performance Data (128-1024 PEs)

3,316,463 atom system using reaction-field electrostatics instead of PME

[Figure: performance (ns/day) at 128, 256, 512 and 1024 PEs; higher is better. Systems plotted:]
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR IMPI
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR

VASP 5.4.1 – Zeolite Benchmark

Zeolite (Si96O192) with MFI structure unit cell running a single point calculation with a 400 eV planewave cut-off using the PBE functional. Maximum number of plane-waves: 96,834; 2 k-points; FFT grid: (65, 65, 43); 181,675 points

[Figure: performance at 128, 256 and 512 PEs, relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (64 PEs); higher is better. Systems plotted:]
• Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR

Quantum Espresso – GRIR443

Performance Data (128-1024 PEs)

[Figure: performance at 128, 256, 512 and 1024 PEs, relative to the Fujitsu e5-2670 2.6 GHz 8-C (96 PEs); higher is better. Systems plotted:]
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI DAPL
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR

OpenFOAM – Cavity 3d-3M

Performance Data (64-512 PEs)

OpenFOAM with lid-driven cavity flow 3d-3M data set

[Figure: performance at 64, 128, 256 and 512 PEs, relative to the Fujitsu CX250 e5-2670 8-C (32 PEs); higher is better. Systems plotted:]
• Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR IMPI
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR HPCX
• Thor Dell|EMC PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA
• Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA
• Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR

DL_POLY 4 – Intel MPI vs. HPCX

Intel MPI is seen to outperform HPC-X for the DLPOLY 4 NaCl test case at all core counts, and at lower core counts for Gramicidin

[Figure: % Intel MPI performance vs. HPCX (axis 90% to 120%) against processor count (up to 1024) for DL_POLY4 - NaCl and DL_POLY4 - Gramicidin.]

GROMACS – Intel MPI vs. HPCX

At no point does the HPC-X implementation of Gromacs outperform that using Intel MPI

[Figure: % Intel MPI performance vs. HPCX (axis 95% to 130%) against processor count (up to 1024) for GROMACS - ion channel and GROMACS - lignocellulose.]

VASP 5.4.1 – Intel MPI vs. HPCX

Significantly different to the classical MD codes – now HPCX is seen to outperform Intel MPI for the Zeolite cluster at all core counts, and at larger core counts for the Palladium complex

[Figure: % Intel MPI performance vs. HPCX (axis 50% to 110%) against processor count (up to 512) for VASP - Palladium Complex and VASP - Zeolite Cluster.]

Quantum Espresso – Intel MPI vs. HPCX

Significantly different to the classical MD codes – as with VASP, HPCX is seen to outperform Intel MPI for the larger core counts

[Figure: % Intel MPI performance vs. HPCX (axis 65% to 105%) against processor count (up to 768) for Quantum Espresso - GRIR443 and Quantum Espresso - Au112.]

Application Performance in Chemistry and Materials Science:

3. Relative Performance as a Function of Processor Family and Interconnect.

Target Codes and Data Sets – 128 PEs

128 PE Performance [Applications]

[Figure: relative performance (scale 0.00 to 1.00) at 128 PEs for each code and data set, by system.]

Codes and data sets:
• DLPOLY-4 Gramicidin
• DLPOLY-4 NaCl
• GROMACS ion-channel
• GROMACS lignocellulose
• OpenFoam - 3d3M
• QE Au112
• QE GRIR443
• VASP Pd-O complex
• VASP Zeolite complex
• BSMBench Balance

Systems:
• Bull b510 "Raven" Sandy Bridge e5-2670/2.6 GHz IB-QDR
• Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR
• ATOS Broadwell e5-2680v4 2.4GHz (T) OPA
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI
• Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX
• Dell Skylake Gold 6130 2.1GHz (T) OPA
• Intel Skylake Gold 6148 2.4GHz (T) OPA
• Dell Skylake Gold 6142 2.6GHz (T) EDR
• Dell Skylake Gold 6150 2.7GHz (T) EDR

SKL “Gold” 6130 2.1 GHz OPA vs. SB e5-2670 2.6 GHz QDR

Improved Performance of Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR

256 cores

Average Factor = 1.54

[Figure: horizontal bar chart of the improvement factor per benchmark, in ascending order.]

DLPOLYclassic Bench7: 1.08
OpenFoam - 3d3M: 1.19
GAMESS-UK (hf12z): 1.32
GAMESS-UK (Siosi7): 1.34
GROMACS ion-channel: 1.38
NAMD (F1atpase): 1.39
GAMESS-UK (valino.A2): 1.40
GAMESS-UK (cyc-sporin): 1.42
GROMACS…: 1.44
LAMMPS (3d LJ melt): 1.44
NAMD (apoa1): 1.49
NAMD (stmv): 1.52
DLPOLY-4 NaCl: 1.55
DLPOLY-4 Gramicidin: 1.64
CP2K H2O.256: 1.76
VASP Zeolite complex: 2.07
QE GRIR443: 2.12
QE Au112: 2.29
VASP Pd-O complex: 2.64

SKL “Gold” 6150 2.7 GHz EDR vs. SB e5-2670 2.6 GHz QDR

Improved Performance of Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR

256 cores

Average Factor = 1.80

[Figure: horizontal bar chart of the improvement factor per benchmark, in ascending order.]

DLPOLYclassic Bench7: 1.08
OpenFoam - 3d3M: 1.19
GAMESS-UK (hf12z): 1.32
GAMESS-UK (Siosi7): 1.34
GROMACS ion-channel: 1.38
NAMD (F1atpase): 1.39
GAMESS-UK (valino.A2): 1.40
GAMESS-UK (cyc-sporin): 1.42
GROMACS…: 1.44
LAMMPS (3d LJ melt): 1.44
NAMD (apoa1): 1.49
NAMD (stmv): 1.52
DLPOLY-4 NaCl: 1.55
DLPOLY-4 Gramicidin: 1.64
CP2K H2O.256: 1.76
VASP Zeolite complex: 2.07
QE GRIR443: 2.12
QE Au112: 2.29
VASP Pd-O complex: 2.64

SKL “Gold” 6148 2.4 GHz OPA vs. BDW e5-2680v4 2.4 GHz OPA

Improved Performance of Intel Skylake Gold 6148 2.4GHz (T) OPA vs. ATOS Broadwell e5-2680v4 2.4GHz (T) OPA

256 cores

Average Factor = 1.16

[Figure: horizontal bar chart of the improvement factor per benchmark, in ascending order.]

OpenFoam - 3d3M: 0.87
VASP Zeolite complex: 0.98
DLPOLY-4 NaCl: 1.00
GAMESS-UK (hf12z): 1.00
LAMMPS (3d LJ melt): 1.01
GROMACS ion-channel: 1.02
GAMESS-UK (valino.A2): 1.02
DLPOLY-4 Gramicidin: 1.04
GAMESS-UK (Siosi7): 1.04
GAMESS-UK (cyc-sporin): 1.04
NAMD (stmv): 1.11
NAMD (apoa1): 1.11
GROMACS…: 1.11
NAMD (F1atpase): 1.12
VASP Pd-O complex: 1.12
QE Au112: 1.22
DLPOLYclassic Bench4: 1.33
DLPOLYclassic Bench5: 1.45
QE GRIR443: 1.58
DLPOLYclassic Bench7: 1.59

Performance Benchmarks – Node to Node

• Analysis of performance metrics across a variety of data sets
¤ “Core to core” and “node to node” workload comparisons
• Previous charts are based on a core to core comparison, i.e. performance for jobs with a fixed number of cores
• A node to node comparison is typical of the performance when running a workload (real life production). It is expected to reveal the major benefits of increasing core count per socket
¤ Focus on a 6-node “node to node” comparison of the following:

1. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores] vs. Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR [216 cores]
2. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores] vs. Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR [192 cores]
3. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores] vs. Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [192 cores]
4. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores] vs. “Thor” Broadwell cluster e5-2697A v4 2.6GHz (T) IB EDR [192 cores]
5. “Thor” Broadwell cluster e5-2697A v4 2.6GHz (T) IB EDR [192 cores] vs. Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR [216 cores]

¤ Benchmarks based on a set of 10 applications & 19 data sets.
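The bracketed core counts in a 6-node comparison follow from the cores per socket, assuming dual-socket nodes throughout. The sketch below reproduces the 96, 192 and 216 core totals; the per-socket counts are the published core counts of each processor, and the dictionary keys and function name are illustrative.

```python
# Cores per socket for the processors compared (dual-socket nodes assumed).
CORES_PER_SOCKET = {
    "Sandy Bridge e5-2670": 8,
    "Broadwell e5-2697A v4": 16,
    "Skylake Gold 6130": 16,
    "Skylake Gold 6142": 16,
    "Skylake Gold 6150": 18,
}


def cores_for_nodes(cpu: str, nodes: int = 6, sockets_per_node: int = 2) -> int:
    """Total cores available on `nodes` nodes built from the given CPU."""
    return nodes * sockets_per_node * CORES_PER_SOCKET[cpu]


for cpu in CORES_PER_SOCKET:
    print(f"{cpu}: {cores_for_nodes(cpu)} cores on 6 nodes")
```

The same job on six nodes therefore runs on more than twice as many cores on the Skylake Gold 6150 system as on the Sandy Bridge system, which is why node to node factors are larger than core to core factors.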

SKL “Gold” 6130 2.1 GHz OPA vs. SB e5-2670 2.6 GHz QDR

Improved Performance of Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [192 cores] vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores]

6 Node Comparison

Average Factor = 2.54

[Figure: horizontal bar chart of the improvement factor per benchmark, in ascending order.]

DLPOLYclassic Bench4: 1.49
GAMESS-UK (cyc-sporin): 2.15
GAMESS-UK (hf12z): 2.22
GAMESS-UK (Siosi7): 2.42
VASP Pd-O complex: 2.46
GAMESS-UK (valino.A2): 2.46
DLPOLY-4 NaCl: 2.47
CP2K H2O.256: 2.51
DLPOLY-4 Gramicidin: 2.58
GROMACS ion-channel: 2.58
VASP Zeolite complex: 2.62
NAMD (stmv): 2.67
LAMMPS (3d LJ melt): 2.72
GROMACS lignocellulose: 2.75
NAMD (F1atpase): 2.79
NAMD (apoa1): 2.86
QE Au112: 2.86
OpenFoam - 3d3M: 3.98
QE GRIR443: 4.01

SKL “Gold” 6150 2.7 GHz EDR vs. SB e5-2670 2.6 GHz QDR

Improved Performance of Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR [216 cores] vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores]

6 Node Comparison

Average Factor = 3.15

[Figure: horizontal bar chart of the improvement factor per benchmark, in ascending order.]

DLPOLYclassic Bench5: 1.76
QE Au112: 2.83
DLPOLY-4 NaCl: 3.08
GROMACS ion-channel: 3.17
DLPOLY-4 Gramicidin: 3.26
VASP Zeolite complex: 3.28
LAMMPS (3d LJ melt): 3.31
VASP Pd-O complex: 3.32
CP2K H2O.256: 3.34
GROMACS lignocellulose: 3.83
QE GRIR443: 4.50
OpenFoam - 3d3M: 5.23

SKL “Gold” 6142 2.6 GHz EDR vs. BDW e5-2697Av4 2.6 GHz EDR

Improved Performance of Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR [192 cores] vs. Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR HPCX [192 cores]

6 Node Comparison

Average Factor = 1.03

[Figure: horizontal bar chart of the improvement factor per benchmark, in ascending order.]

VASP Pd-O complex: 0.91
OpenFoam - 3d3M: 0.91
DLPOLYclassic Bench7: 0.93
QE Au112: 0.94
QE GRIR443: 1.00
DLPOLY-4 NaCl: 1.01
CP2K H2O.256: 1.02
GAMESS-UK (hf12z): 1.02
DLPOLY-4 Gramicidin: 1.04
GAMESS-UK (valino.A2): 1.04
GAMESS-UK (Siosi7): 1.05
GAMESS-UK (cyc-sporin): 1.05
VASP Zeolite complex: 1.06
LAMMPS (3d LJ melt): 1.08
GROMACS ion-channel: 1.09
NAMD (apoa1): 1.10
NAMD (F1atpase): 1.11
NAMD (stmv): 1.12
GROMACS lignocellulose: 1.14
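The slide does not say how the "Average Factor" is computed. Assuming it is the arithmetic mean of the per-benchmark factors, the 19 values charted for the Skylake Gold 6142 vs. e5-2697A v4 comparison reproduce the quoted 1.03, as this short sketch shows:

```python
from statistics import mean

# The 19 improvement factors shown on the Gold 6142 vs. e5-2697A v4 chart.
factors = [
    0.91, 0.91, 0.93, 0.94, 1.00, 1.01, 1.02, 1.02, 1.04, 1.04,
    1.05, 1.05, 1.06, 1.08, 1.09, 1.10, 1.11, 1.12, 1.14,
]

# Assumption: "Average Factor" is the arithmetic mean of the factors.
average_factor = mean(factors)
print(f"Average factor over {len(factors)} data sets: {average_factor:.2f}")
```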

Application Performance in Chemistry and Materials Science:

4. Preliminary Evaluation of the AMD EPYC 7601 Processor.

Page 67: Application Performance in Chemistry and Materials Science193.62.125.70/CIUK2017/MartynGuest_Cardiff.pdf · 2018. 2. 6. · Dell Skylake Gold 6130 2.1GHz (T) OPA Dell Skylake Gold

EPYC Architecture - Naples, Zeppelin & CCX

• Zen cores
¤ Private L1/L2 cache
• CCX
¤ 4 Zen cores (or less)
¤ 8MB L3 shared cache
• Zeppelin
¤ 2 CCX (or less)
¤ 2 DDR4 channels
¤ 2 PCIe 16x
• Naples
¤ 4 Zeppelin SoC dies fully connected by Infinity Fabric.
¤ 4 NUMA nodes!

[Figure: block diagram of the Naples package. Four Zeppelin dies are joined by coherent links; each die carries two CCXs (four Zen cores with private L2 and an 8MB L3 per CCX), 2x DDR4 channels and 2x16 PCIe.]

• Delivers 32 cores / 64 threads, 16MB L2 cache and 64MB L3 cache per socket.
• The design also means that there are four NUMA nodes per socket, or eight NUMA nodes in a dual-socket system, i.e. memory latency differs depending on whether the data a die needs sits in memory attached to that die or to another die on the fabric.
• The key difference from Intel’s Skylake SP architecture is that AMD needs to go off die within the same socket, whereas Intel stays on a single piece of silicon.
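The per-socket totals quoted above follow directly from the building blocks (cores per CCX, CCXs per die, dies per socket). The following is a minimal sketch of that arithmetic for the full 32-core part; the 512KB per-core L2 size is taken from the system description later in the deck, one NUMA node per die is the mapping the slide states, and the function name `socket_totals` is hypothetical.

```python
# Building blocks as listed on the slide (full 32-core part).
CORES_PER_CCX = 4
CCX_PER_DIE = 2
DIES_PER_SOCKET = 4
L3_PER_CCX_MB = 8
L2_PER_CORE_KB = 512  # per-core L2 size quoted on the system slide


def socket_totals(sockets=1):
    """Aggregate core, cache and NUMA-node counts for `sockets` sockets."""
    cores = CORES_PER_CCX * CCX_PER_DIE * DIES_PER_SOCKET * sockets
    return {
        "cores": cores,
        "threads": 2 * cores,  # two hardware threads per core
        "l2_mb": cores * L2_PER_CORE_KB // 1024,
        "l3_mb": L3_PER_CCX_MB * CCX_PER_DIE * DIES_PER_SOCKET * sockets,
        "numa_nodes": DIES_PER_SOCKET * sockets,  # one NUMA node per die
    }


print(socket_totals(1))
print(socket_totals(2))
```

For one socket this gives 32 cores, 64 threads, 16MB of L2, 64MB of L3 and four NUMA nodes; a dual-socket node has eight NUMA nodes.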


Infinity Fabric and Inter-Socket Connectivity

Inter-socket
• 4 links (2B) between sockets in 2-processor configurations.
• Each die connects to its peer die.
• 2-hop maximum system diameter.
• 38GB/s bi-directional bandwidth per link.
• 152GB/s between the two sockets.
• Infinity Control Fabric connected between sockets.

Intra-socket
• Fully connected with 4B links.
• 42.6GB/s per link.
• 170GB/s

Infinity Fabric is a feat of engineering, but it does mean that there are significant performance variations as you move off die and onto the fabric.
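The link counts and aggregate figures above can be sanity-checked with a few lines of arithmetic. This is a sketch under stated assumptions, not a description of how the slide's numbers were produced: it assumes the inter-socket total is simply links times per-link bandwidth, and the slide does not say how its 170GB/s intra-socket figure is derived, so the last value is shown only for comparison. `fully_connected_links` is a hypothetical helper.

```python
from itertools import combinations

DIES_PER_SOCKET = 4
INTER_SOCKET_LINKS = 4
INTER_SOCKET_GBS_PER_LINK = 38.0  # bi-directional, per link
INTRA_SOCKET_GBS_PER_LINK = 42.6


def fully_connected_links(n):
    """Number of point-to-point links needed to connect n dies all-to-all."""
    return len(list(combinations(range(n), 2)))


inter_socket_total = INTER_SOCKET_LINKS * INTER_SOCKET_GBS_PER_LINK
links_per_die = DIES_PER_SOCKET - 1  # each die links to the other three
# Four links' worth of intra-socket bandwidth, for comparison with 170GB/s.
four_intra_links = 4 * INTRA_SOCKET_GBS_PER_LINK

print(inter_socket_total, fully_connected_links(DIES_PER_SOCKET),
      links_per_die, four_intra_links)
```

Four 38GB/s links reproduce the 152GB/s inter-socket figure, and a fully connected set of four dies needs six links, three per die.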


AMD® EPYC™ 7000 Series - SKU Map and FLOP/cycle

SKU                              7601   7551   7501      7451   7401      7351      7301
Freq (base, GHz)                 2.2    2.0    2.0       2.3    2.0       2.4       2.2
Turboboost, all cores active     2.7    2.6    2.6       2.9    2.8       2.9       2.7
Turboboost, one core active      3.2    3.0    3.0       3.2    3.0       2.9       2.7
Cores/socket                     32     32     32        24     24        16        16
L3 cache size                    64 MB
Memory channels                  8
Memory freq                      2667 MT/s
TDP (W)                          180    180    155/170   180    155/170   155/170   155/170

Architecture                     Sandy Bridge       Haswell     Skylake     EPYC
ISA*                             AVX                AVX2        AVX-512     AVX2
op/cycle                         2 (1 ADD, 1 MUL)   4 (2 FMA)   4 (2 FMA)   4 (2 ADD, 2 MUL)
Vector size (DP = 64-bits)       4                  4           8           2
FLOP/cycle                       8                  16          32          8

* Instruction Set Architecture
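The FLOP/cycle row is the product of the op/cycle and vector-size rows, with each FMA counted as two operations. The sketch below reproduces that row and then derives a theoretical double-precision peak per socket for the EPYC 7601 at its base clock. The function names are hypothetical, and using the base frequency (rather than a turbo value) is an assumption made for illustration.

```python
# (operations per cycle, DP vector width) per architecture, from the table.
ARCH = {
    "Sandy Bridge": (2, 4),
    "Haswell": (4, 4),
    "Skylake": (4, 8),
    "EPYC": (4, 2),
}


def flop_per_cycle(arch):
    """Double-precision FLOP per cycle per core."""
    ops, width = ARCH[arch]
    return ops * width


def peak_gflops(arch, cores, ghz):
    """Theoretical DP peak in GFLOP/s for `cores` cores at `ghz` GHz."""
    return flop_per_cycle(arch) * cores * ghz


for name in ARCH:
    print(name, flop_per_cycle(name))
print(peak_gflops("EPYC", cores=32, ghz=2.2))  # EPYC 7601, base clock
```

With 32 cores at 2.2 GHz and 8 FLOP/cycle, the EPYC 7601 comes out at 563.2 GFLOP/s per socket.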


AMD EPYC 7601 System – 4 x SuperMicro AS-1123US-TR4

▶ 1U 1-node form factor
▶ Bi-socket AMD® EPYC 7601
  • x86_64 architecture (supports up to the AVX2 ISA)
  • Base frequency 2.2GHz; Turbo Core: 3.2GHz (one core active), 2.7GHz (all cores active)
  • Per socket: 32 Zen cores / 64 threads
    – 563.2 GFLOPS in DP64 per socket
    – L1-D 32KB, L1-I 64KB, L2 512KB, L3 8MB per CCX (64MB per socket)
    – Hyperthreading On, Turbo Core On
    – TDP: 180 Watt
▶ Memory sub-system
  – 8 memory channels per socket, up to 2666MT/s
    » Theoretical peak: 170.4GBytes/s per socket
  – 16 DIMMs, 16GB SR (1 per channel) @ 2666MT/s
    » Samsung M393A2K40BB2-CTD
▶ 4 disks ST1000NM0008-2F2
  – Seagate Capacity 1TB (3.5") SATA, 7200rpm
  – 1 disk for OS (Ext4)
  – 3 disks for /scratch (not configured)
▶ OS: RHEL7.4, kernel 3.10.0-693.el7.x86_64
▶ Interconnected through Mellanox® EDR-4x fabric
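The memory figures on this slide can be reproduced from the channel count and transfer rate. The sketch below assumes a 64-bit (8-byte) DDR4 data path per channel. The slide's 170.4GBytes/s appears to correspond to rounding each channel to 21.3GBytes/s before multiplying by eight, which is an inference rather than something the slide states. The function name is hypothetical.

```python
CHANNELS_PER_SOCKET = 8
TRANSFER_RATE_MTS = 2666   # mega-transfers per second
BYTES_PER_TRANSFER = 8     # 64-bit DDR4 channel
DIMMS_PER_NODE = 16
DIMM_GB = 16


def peak_bandwidth_gbs(channels, mts, bytes_per_transfer=BYTES_PER_TRANSFER):
    """Theoretical peak memory bandwidth in GBytes/s (decimal)."""
    return channels * mts * bytes_per_transfer / 1000.0


exact_total = peak_bandwidth_gbs(CHANNELS_PER_SOCKET, TRANSFER_RATE_MTS)
per_channel = peak_bandwidth_gbs(1, TRANSFER_RATE_MTS)
# Rounding each channel to one decimal first gives the slide's figure.
rounded_per_channel_total = round(per_channel, 1) * CHANNELS_PER_SOCKET
node_memory_gb = DIMMS_PER_NODE * DIMM_GB

print(exact_total, rounded_per_channel_total, node_memory_gb)
```

The same DIMM population gives 256GB of memory per dual-socket node.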


EPYC - Compiler and Run-time Options

Compilation:
INTEL COMPILERS 2018, IntelMPI 2017 Update 3, FFTW-3.3.5
AMD EPYC: -O3 -xAVX2
INTEL SKYLAKE: -O3 -xCORE-AVX2

    #
    # Preload the amd-cputype library to navigate
    # the Intel Genuine cpu test
    module use /opt/amd/modulefiles
    module load AMD/amd-cputype/1.0
    export LD_PRELOAD=$AMD_CPUTYPE_LIB
    export OMP_PROC_BIND=true
    # export KMP_AFFINITY=granularity=fine
    export I_MPI_DEBUG=5
    export MKL_DEBUG_CPU_TYPE=5

STREAM:

    source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh intel64
    module load AMD/amd-cputype/1.0
    icc -o stream.x stream.c -DSTATIC -Ofast -xCORE-AVX2 -qopenmp -DSTREAM_ARRAY_SIZE=800000000 \
        -mcmodel=large -shared-intel
    export OMP_NUM_THREADS=16
    export OMP_PROC_BIND=true
    export OMP_PLACES="{0:4:1}:16:4"   # 1 thread per CCX
    export OMP_DISPLAY_ENV=true
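The `OMP_PLACES` setting uses the OpenMP interval notation, which is easy to misread. The sketch below expands this one form of the syntax, `{lower:length:stride}:count:step`, into explicit CPU-id lists and also estimates the memory footprint of the STREAM arrays. It is not a general `OMP_PLACES` parser. Whether each place coincides with a CCX depends on how the OS numbers the cores, which the slide's comment implies but which is assumed here. `expand_places` and `stream_footprint_bytes` are hypothetical helper names.

```python
import re


def expand_places(spec):
    """Expand '{lower:length:stride}:count:step' into lists of CPU ids."""
    m = re.fullmatch(r"\{(\d+):(\d+)(?::(\d+))?\}:(\d+):(\d+)", spec)
    if m is None:
        raise ValueError(f"unsupported OMP_PLACES form: {spec!r}")
    # A missing inner stride defaults to 1.
    lower, length, stride, count, step = (int(g) if g else 1 for g in m.groups())
    first = [lower + i * stride for i in range(length)]
    return [[cpu + p * step for cpu in first] for p in range(count)]


def stream_footprint_bytes(array_size, arrays=3, bytes_per_element=8):
    """Total size of STREAM's three double-precision arrays."""
    return array_size * arrays * bytes_per_element


places = expand_places("{0:4:1}:16:4")
print(len(places), places[0], places[-1])
print(stream_footprint_bytes(800_000_000) / 1e9, "GB")
```

The setting yields 16 places of four consecutive CPU ids each, so with `OMP_NUM_THREADS=16` each thread is bound to its own place. The three arrays total about 19.2GB, which is why the static build needs `-mcmodel=large`.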

MPI Performance – PingPong

[Figure: IMB Benchmark (Intel), PingPong, 1 PE / node. Bandwidth in MBytes/sec (log scale, 1 to 10,000) against message length in bytes (1.0E+00 to 1.0E+07), with latency annotated. Data labels on the chart: 2.0, 2,999 and 11,567.
Systems: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR; Dell Skylake Gold 6142 2.6GHz (T) EDR; Dell Skylake Gold 6150 2.7GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR.]

MPI Collectives – Alltoallv (128 PEs)

[Figure: IMB Benchmark (Intel), Alltoallv on 128 PEs. Measured time in usec (log scale, 1.0E+02 to 1.0E+07) against message length in bytes (1.E+00 to 1.E+07), with latency annotated.
Systems: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) EDR; Dell PE R730 Broadwell e5-2697Av4 2.6GHz (T) OPA; Dell Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6142 2.6GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR.]

EPYC performance with Intel MPI is ~4-6× worse than that with SKL processors!

DL_POLY Classic – NaCl Simulation

NaCl 27,000 atoms; 500 time steps

[Figure: bar chart, Performance Data (32-256 PEs). Performance relative to the Fujitsu HTC X5650 2.67 GHz 6-C (16 PEs) against number of processing elements (32, 64, 128, 256); axis from 0.0 to 15.0.
Data labels, in extraction order: 2.9, 5.8, 9.9, 13.4, 3.5, 6.8, 11.1, 15.7, 4.1, 7.7, 12.4, 16.0, 4.2, 7.7, 12.4, 16.7, 4.3, 6.4, 10.2, 13.6.
Systems: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ClusterVision Ivy Bridge e5-2650v2 2.6GHz True Scale QDR; Bull Haswell e5-2680v3 2.5GHz (T) Connect-IB; Intel Broadwell2 e5-2690v4 2.6GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6142 2.6GHz (T) EDR; Dell Skylake Gold 6150 2.7GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR.]

GAMESS-UK MPI/ScaLAPACK code

Zeolite Y cluster SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP (3975 GTOs)

[Figure: bar chart. Performance relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs) against number of processing elements (128, 256); axis from 0.0 to 4.0.
Data labels, in extraction order: 1.2, 2.1, 1.5, 2.7, 1.2, 2.1, 1.6, 2.9, 1.5, 2.7, 1.8, 3.2, 1.6, 2.8, 1.7, 3.1, 1.8, 3.3, 1.9, 3.5, 1.1, 2.0, 1.5, 2.7.
Systems: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; ClusterVision e5-2650v2 2.6GHz Truescale QDR; Bull Haswell e5-2695v3 2.3GHz Connect-IB; Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR; Huawei Fusion CH140 e5-2683 v4 2.1GHz (T) EDR; Thor Broadwell e5-2697A v4 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR; IBM Power8 S822LC 2.92GHz IB/EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR.]

DL_POLY 4 – Gramicidin Simulation

Gramicidin 792,960 atoms; 50 time steps

[Figure: bar chart. Performance relative to the Fujitsu CX250 e5-2670/2.6 GHz 8-C (32 PEs) against number of processing elements (64, 128, 256); axis from 0.0 to 10.0.
Data labels, in extraction order: 1.8, 3.1, 4.9, 2.9, 5.1, 8.5, 2.7, 5.0, 8.4, 3.0, 5.4, 9.0, 2.4, 3.6, 4.9.
Systems: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR.]

VASP 5.4.1 – Zeolite Benchmark

Zeolite (Si96O192) with the MFI structure unit cell, running a single-point calculation with a 400eV plane-wave cut-off and the PBE functional. Maximum number of plane-waves: 96,834; 2 k-points; FFT grid: (65, 65, 43), 181,675 points.

[Figure: bar chart. Performance relative to the Fujitsu CX250 Sandy Bridge e5-2670 2.6 GHz (64 PEs) against number of processing elements (64, 96, 128); axis from 0.0 to 3.0.
Data labels, in extraction order: 1.4, 2.0, 2.6, 1.5, 2.1, 2.7, 0.9, 1.4, 1.6, 1.4, 1.8, 2.1.
Systems: Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; ATOS AMD EPYC 7601 2.2 GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR (16c/socket).]

Quantum Espresso – GRIR443

[Figure: bar chart, Performance Data (96-160 PEs). Performance relative to the Fujitsu e5-2670 2.6 GHz 8-C (96 PEs) against number of processing elements (96, 128, 160); axis from 0.0 to 6.0.
Data labels, in extraction order: 1.3, 1.9, 2.4, 2.3, 3.2, 3.8, 2.8, 4.0, 5.2, 2.8, 4.0, 4.9, 1.4, 1.8, 2.0.
Systems: Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz IB-QDR; Bull|ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6GHz (T) EDR IMPI DAPL; Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR.]

EPYC - Target Codes and Data Sets – 128 PEs

[Figure: 128 PE Performance [Applications]; chart scale from 0.00 to 1.00.
Data sets: DLPOLYclassic Bench4; DLPOLY-4 Gramicidin; DLPOLY-4 NaCl; GROMACS ion-channel; GROMACS lignocellulose; GAMESS-UK (cyc-sporin); GAMESS-UK (Siosi7); QE Au112; QE GRIR443; VASP Pd-O complex; VASP Zeolite complex.
Systems: Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4GHz (T) OPA; Thor Dell|EMC e5-2697Av4 2.6GHz (T) EDR IMPI; Dell Skylake Gold 6130 2.1GHz (T) OPA; Intel Skylake Gold 6148 2.4GHz (T) OPA; Dell Skylake Gold 6142 2.6GHz (T) EDR; Dell Skylake Gold 6150 2.7GHz (T) EDR; Bull|ATOS Skylake Gold 6150 2.7GHz (T) EDR; ATOS AMD EPYC 7601 2.2GHz (T) EDR.]

Performance Benchmarks – Node to Node

• Analysis of performance metrics across a variety of data sets
¤ “Core to core” and “node to node” workload comparisons
• The previous EPYC charts are based on a core-to-core comparison, i.e. performance for jobs with a fixed number of cores.
• A node-to-node comparison is typical of the performance when running a workload (real-life production). It is expected to reveal the major benefits of increasing core count per socket.
¤ Focus on a “node to node” comparison of the following:
  1. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores] vs. ATOS AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]
  2. Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [128 cores] vs. ATOS AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]
¤ Benchmarks based on a set of 6 applications & 15 data sets.

AMD EPYC 7601 2.2 GHz (T) EDR vs. SB e5-2670 2.6 GHz QDR

4 Node Comparison

[Figure: horizontal bar chart of the relative performance of the ATOS AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. the Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores]; axis from 1.0 to 4.0.
Bar values, in extraction order: 1.48, 1.93, 2.09, 2.09, 2.46, 2.49, 2.67, 2.92, 3.37, 3.43, 3.98, 4.05.
Data sets, in extraction order: DLPOLYclassic Bench7; VASP Zeolite complex; DLPOLYclassic Bench5; VASP Pd-O complex; DLPOLY-4 NaCl; DLPOLYclassic Bench4; DLPOLY-4 Gramicidin; QE Au112; GROMACS ion-channel; GAMESS-UK (cyc-sporin); GROMACS lignocellulose; GAMESS-UK (valino.A2).]

Average Factor = 2.75

AMD EPYC 7601 2.2 GHz (T) EDR vs. SKL “Gold” 6130 2.1 GHz OPA

4 Node Comparison

[Figure: horizontal bar chart of the relative performance of the ATOS AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. the Dell|EMC Skylake Gold 6130 2.1GHz (T) OPA [128 cores]; axis from 0.5 to 1.9.
Bar values, in extraction order: 0.70, 0.72, 0.74, 0.76, 0.95, 1.00, 1.06, 1.19, 1.29, 1.43, 1.49, 1.59, 1.70, 1.71, 1.78.
Data sets, in extraction order: QE Au112; QE GRIR443; VASP Pd-O complex; VASP Zeolite complex; DLPOLYclassic Bench7; DLPOLY-4 NaCl; DLPOLY-4 Gramicidin; DLPOLYclassic Bench5; DLPOLYclassic Bench4; GAMESS-UK (cyc-sporin); GROMACS ion-channel; GAMESS-UK (valino.A2); GAMESS-UK (Siosi7); GROMACS lignocellulose; GAMESS-UK (hf12z).]

Average Factor = 1.21
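Because the 4-node comparisons pit very different core counts against each other, it can help to scale the node-to-node factors back to a per-core basis. The sketch below does this for the two average factors quoted for the EPYC system (2.75 against Sandy Bridge, 1.21 against the Skylake Gold 6130). It assumes dual-socket nodes throughout and treats the simple core-count ratio as a crude normalisation; it ignores clock speed, memory bandwidth and interconnect differences. The function names are hypothetical.

```python
NODES = 4
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = {"SB e5-2670": 8, "SKL Gold 6130": 16, "EPYC 7601": 32}


def cores(system, nodes=NODES):
    """Total cores used by `nodes` dual-socket nodes of the given system."""
    return nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET[system]


def per_core_factor(node_factor, system, baseline):
    """Scale a node-to-node factor by the baseline/system core-count ratio."""
    return node_factor * cores(baseline) / cores(system)


for name in CORES_PER_SOCKET:
    print(name, cores(name))
print(per_core_factor(2.75, "EPYC 7601", "SB e5-2670"))
print(per_core_factor(1.21, "EPYC 7601", "SKL Gold 6130"))
```

On this crude measure, a node-level advantage from four times or twice as many cores corresponds to a per-core ratio below 1.0 in both comparisons.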

5. Acknowledgements

• Ludovic Sauge, Enguerrand Petit and Patrick Berghaeger (Bull/ATOS) for informative performance discussions and access to the Skylake & EPYC clusters at the Bull HPC Competency Centre.
• Pak Lui, David Cho, Gilad Shainer, Colin Bridger & Steve Davey for access to the “Thor” cluster at the HPC Advisory Council and “Hercules” partition in Mellanox.
• Doug Mark, Farid Parpia, John Simpson, Ludovic Enault, Xinghong He, James Kuchler & Luke Willett for access to and assistance on the IBM Power8 S822LC cluster in Poughkeepsie.
• David Power for access to two Skylake nodes in Boston /BIOSIT.
• Jamie Wilcox, Bogdan Pop, Toby Smith and Andrew Richardson (Intel) for past access to and help with a host of processors and for access to the Swindon clusters.
• Joshua Weage, Martin Hilgeman, Dave Coughlin, Gilles Civario and Christopher Huggins for access to, and assistance with, the variety of Skylake SKUs at the Dell Benchmarking Centre.

Summary

• Focus on performance benchmarks and clusters featuring Intel’s Skylake “Gold” processors (6130, 2.1 GHz [16c]; 6148, 2.4 GHz [20c]; 6142, 2.6 GHz [16c]; and 6150, 2.7 GHz [18c]).
• Performance comparison with current Sandy Bridge systems and those based on dual Intel Broadwell processor EP nodes (16-core, 14-core) with Mellanox EDR and Intel’s Omni-Path (OPA) interconnects.
• Measurements of parallel application performance based on synthetic (STREAM, IMB, HPCC and IOR) and end-user applications: DLPOLY, Gromacs, NAMD, LAMMPS, GAMESS-UK, Quantum ESPRESSO, VASP and CP2K.
¤ Use of IPM and Allinea Performance Reports, and comparison of Mellanox’s HPC-X and Intel MPI on EDR-based systems.
• The performance gain of the Skylake-based clusters is at first sight modest, particularly when compared with optimised runs (HPCX) on previous-generation Intel Broadwell clusters.

Summary II

• A core-to-core comparison across 19 data sets (10 applications) suggests modest speedups (ca. 1.08) when comparing a Skylake “Gold” 6150 cluster (EDR) to the Broadwell-based “Thor” e5-2697A v4 2.6GHz (T) cluster with EDR and the HPCX environment.
¤ Little difference in application performance between Mellanox’s IB EDR interconnect and Intel’s Omni-Path (OPA) interconnect, at least at modest (< 512 core) counts.
¤ A comparison against the Fujitsu CX250 Sandy Bridge 2.6GHz IB-QDR shows average performance increase factors of
  • 1.54 (256 cores) for clusters based on the Gold 6130 [OPA]
  • 1.64 (256 cores) for clusters based on the Gold 6148 [OPA]
  • 1.71 (256 cores) for clusters based on the Gold 6142 [EDR]
  • 1.88 (256 cores) for clusters based on the Gold 6150 [EDR]
¤ Some applications, however, show much higher factors, e.g. Quantum Espresso and VASP.
¤ A node-to-node comparison, typical of the performance when running a workload, is more encouraging.

Summary III

• A 6-node benchmark based on examples from 10 applications and 19 data sets shows the following improvement factors against 6-node runs on the Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz cluster:
¤ Skylake “Gold” 6130 cluster (16c) with EDR interconnect: 2.54
¤ Skylake “Gold” 6142 cluster (16c) with EDR interconnect: 2.83
¤ Skylake “Gold” 6150 cluster (18c) with EDR interconnect: 3.15
• Optimum interconnect performance is a function of both application and core count.
¤ With the materials-based codes & OpenFOAM, and at high core count (> 512 cores), EDR exhibits a clear performance advantage over OPA.
¤ This is not the case for the classical MD codes, where OPA shows a distinct advantage at all but the highest core counts.
• Preliminary studies on the EPYC 7601 show a complex performance dependency on the EPYC architecture.
¤ Codes with high usage of vector instructions (Gromacs, VASP and Quantum Espresso) perform modestly at best.