
Page 1

Performance of Intel’s Haswell Processor

Application Performance in Chemistry,

Materials Science and Nanoscience:

25th Machine Evaluation Workshop

Martyn Guest and Christine Kitchen

ARCCA & HPC Wales

and

Hideaki Kuraishi

Fujitsu Laboratories of Europe

Page 2

Outline

• Primary focus on Intel's new Haswell processor, including:
  ¤ 16-core E5-2698 v3 2.3 GHz;
  ¤ 14-core E5-2697 v3 2.6 GHz and E5-2695 v3 2.3 GHz (HCC);
  ¤ 12-core E5-2680 v3 2.5 GHz and E5-2690 v3 2.6 GHz (MCC);
  ¤ Clusters with dual-processor EP nodes featuring Mellanox IB QDR and FDR, and Intel's True Scale QDR interconnects;
  ¤ Measurements of parallel application performance.

• Comparison with a number of HPC systems based on Intel processors:
  ¤ Intel's Sandy Bridge-based (E5-2690 and E5-2670) clusters;
  ¤ Intel's Ivy Bridge E5-2697 v2 (12-core), E5-2680 v2 and E5-2690 v2 (10-core) and E5-2650 v2 (8-core);
  ¤ Mellanox IB QDR and FDR, and Intel's True Scale QDR interconnects.

Page 3

Outline – Performance Benchmarks

• Variety of parallel benchmarks:
  ¤ Synthetics (STREAM, IMB and HPCC) and application codes;
  ¤ Thirteen applications:
    • MD - DL_POLY, GROMACS, NAMD and LAMMPS;
    • QC - NWChem, GAMESS and GAMESS-UK;
    • CP - Quantum ESPRESSO, ONETEP, SIESTA, CP2K, VASP and CASTEP.

• Performance metrics across a variety of data sets:
  ¤ "Core to core" and "node to node" workload comparisons;
  ¤ Analysis using IPM and Allinea's Performance Reports;
  ¤ Two comparisons will be considered:
    • Core-to-core comparison, i.e. performance for jobs with a fixed number of cores;
    • Node-to-node comparison, typical of the performance when running a workload (real-life production).

Page 4

Background: Sandy Bridge and Ivy Bridge

• Sandy Bridge-EP and Ivy Bridge-EP processors share the same microarchitecture (codenamed Sandy Bridge):
  ¤ Core-to-core "floating point performance" is expected to be the same (8 FP64 operations/cycle/core).
• The processors are pin compatible.
• Ivy Bridge-EP is the "tick" step of the Sandy Bridge microarchitecture within Intel's production model ("CPU refresh"):
  ¤ 22 nm manufacturing process technology (32 nm for Sandy Bridge).
• Thus, for a similar TDP, more cores are possible on the same die area:
  ¤ 10 cores for the advanced family (MCC), 12 cores for the optimized one (HCC).
• For the same core-to-core performance we expect a benefit in throughput ranging from 25 to 50% (for 10- and 12-core processors respectively). Little throughput improvement is expected for the 8-core part.
• Other major improvements:
  ¤ Increased memory speed (1866 MT/s vs. 1600 MT/s);
  ¤ Larger Last Level Cache (L3), but the same size per core, 2.5 MB/core.

Page 5

Haswell-EP Platform Highlights

                    Xeon E5-2600 v2 "Ivy Bridge-EP"         Xeon E5-2600 v3 "Haswell-EP"
Cores               Up to 12 cores                          Up to 18 cores
AVX support         AVX 1.0, 8 DP flops/clock               AVX 2.0, 16 DP flops/clock
Memory              4 x DDR3 channels:                      4 x DDR4 channels:
                    1866 (1DPC), 1600, 1333, 1033           2133 (1DPC), 1866 (2DPC), 1600, 1333
QPI                 8.0 GT/s                                9.6 GT/s
TDP                 Up to 130 W server;                     Up to 145 W server;
                    150 W workstation                       160 W workstation
Power management    Same P-states for all cores;            Per-core P-states;
                    same core and uncore frequency          independent uncore frequency scaling
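As a quick sanity check on the AVX column (my own arithmetic, not from the slides), the nominal double-precision peak of a socket is cores x base clock x DP flops/clock; for the 14-core E5-2697 v3 at 2.6 GHz under AVX 2.0:

$$P_\text{peak} = 14 \times 2.6\,\text{GHz} \times 16\ \tfrac{\text{DP flops}}{\text{cycle}} \approx 582\ \text{Gflop/s per socket}$$

The sustainable figure will be lower, not least because of AVX frequency reductions under heavy vector load.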

Page 6

Haswell Xeon E5 Variants - HCC, MCC and LCC

The high core count (HCC) variant has from 14 to 18 cores and has two switch interconnects linking the rings that in turn bind the cores and L3 cache segments together with the QPI buses, PCI-Express buses, and memory controllers that also hang off the rings.

The medium core count (MCC) variant has from 10 to 12 cores and basically cuts off six cores and bends one of the rings back on itself to link to the second memory controller on the die.

The low core count (LCC) version has from 4 to 8 cores and is basically the left half of the MCC variant with the two interconnect switches, four of the cores, and the bent ring removed.

Page 7

Haswell Cluster Systems Benchmarked

Cluster / Configuration:

• Intel Haswell EP-2697 v3 2.6 GHz 14C - Intel Swindon "Wildcat1", 16 nodes. Node configuration: 2 x Xeon E5-2697 v3 @ 2.6 GHz, 14-core (C1 pre-production processors), 145 W TDP, 35 MB cache, 128 GB DDR4 (8 x 16 GB) at 2133 MHz. Interconnect: Intel True Scale, 2 x True Scale HCAs per node.

• Dell Haswell EP-2697 v3 2.6 GHz 14C - "Thor" at the HPC Advisory Council. Dell PowerEdge R730 32-node cluster; dual-socket Intel Xeon 14-core E5-2697 v3 @ 2.6 GHz; memory: 64 GB DDR4 2133 MHz RDIMMs per node. Interconnect: Mellanox ConnectX 56 Gb/s FDR InfiniBand adapters; Mellanox SwitchX SX6036 36-port 56 Gb/s FDR IB switch.

• Bull Haswell EP-2695 v3 2.3 GHz 14C - Bull "Robin", 32 nodes. Node configuration: 2 x Xeon E5-2695 v3 @ 2.3 GHz, 14-core, 120 W TDP, 35 MB cache, 128 GB DDR4 (8 x 16 GB) at 2133 MHz. Interconnect: Mellanox Connect-IB FDR.

• Bull Haswell EP-2680 v3 (2.5 GHz) and EP-2690 v3 (2.6 GHz) 12C - Bull "Robin" (32 nodes) and "SID" (85 nodes). Node configurations: Robin: 2 x Xeon E5-2690 v3 @ 2.6 GHz, 12-core, 135 W TDP. SID: B720, 2 x Xeon E5-2680 v3 @ 2.5 GHz, 12-core, 120 W TDP, 30 MB cache, 128 GB DDR4 at 2133 MHz. Interconnect: Mellanox Connect-IB FDR.

Page 8

Memory Bandwidth - Total STREAM performance

[Bar chart: total TRIAD rate (MB/s) per node for each system - Fujitsu BX922 Westmere X5650 2.67 GHz; Bullx B510 Sandy Bridge E5-2680/2.7 GHz; Fujitsu CX250 Sandy Bridge E5-2670/2.6 GHz, E5-2690/2.9 GHz and E5-2690/2.9 GHz [T]; Intel Ivy Bridge E5-2697 v2 2.7 GHz (two systems); Cray XC30 E5-2697 v2 2.7 GHz; ClusterVision E5-2650 v2 2.6 GHz; Bull Haswell E5-2695 v3 2.3 GHz; Intel Haswell E5-2697 v3 2.6 GHz; Dell R730 Haswell E5-2697 v3 2.6 GHz (T). Measured rates, in chart order: 39,479; 76,979; 73,490; 75,224; 76,499; 96,278; 94,760; 87,791; 93,486; 114,543; 118,413; 118,605 MB/s, with the DDR4-based Haswell E5-26xx v3 systems delivering the highest per-node totals.]
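For reference (the standard STREAM definition, not specific to these slides), the TRIAD kernel updates a(i) = b(i) + q*c(i) over arrays much larger than cache; with 8-byte elements each iteration moves 24 bytes, so the reported rate is

$$\text{TRIAD rate} = \frac{24\,N}{t}\ \text{bytes/s}$$

for array length N and kernel time t. The per-core figures on the next slide are simply these per-node totals divided by the number of cores per node (e.g. 39,479 / 12 ≈ 3,290 MB/s for the Westmere system).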

Page 9

Memory Bandwidth - STREAM / core performance

[Bar chart: TRIAD rate (MB/s) per core for the same twelve systems as the previous slide. Per-core rates, in chart order: 3,290; 4,811; 4,593; 4,702; 4,781; 4,012; 3,948; 3,658; 5,843; 4,091; 4,229; 4,236 MB/s.]

Page 10

MPI Performance - PingPong (IMB Benchmark, Intel; 1 PE / node)

[Log-log plot of PingPong bandwidth (Mbytes/sec) against message length (bytes, 1 B to 10 MB), with latency callouts, for: Cray XC30 E5-2697 v2 2.7 GHz Dragonfly/ARIES [Archer]; Intel Haswell E5-2697 v3 2.6 GHz True Scale QDR; Dell R720 Ivy Bridge E5-2680 v2 2.8 GHz Connect-IB; Intel Ivy Bridge E5-2697 v2 2.7 GHz Mellanox FDR; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Fujitsu CX250 E5-2690/2.9 GHz IB-QDR; Fujitsu CX250 E5-2670/2.6 GHz IB-QDR; Fujitsu BX922 X5650 2.66 GHz + IB; Merlin Xeon E5472 3.0 GHz QC + IB (MVAPICH2 1.4); Intel Xeon E5570 2.93 GHz QC + 10GbitE (Intel MPI 3.2); Fujitsu BX922 X5650 2.66 GHz + GbitE (Tier2A). Higher is better; annotated bandwidths range from about 112 Mbytes/sec (GbitE) up to 9,281 Mbytes/sec on the fastest interconnects.]
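A common way to read PingPong curves like these (a generic model, not taken from the slides) is

$$t(m) \approx t_\text{lat} + \frac{m}{BW_\text{asym}}$$

so the small-message limit gives the latency and the large-message limit the asymptotic bandwidth; the figures annotated on the chart presumably correspond to these two extremes for selected systems.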

Page 11

MPI Collectives - Alltoallv (128 PEs), IMB Benchmark (Intel)

[Log-log plot of measured Alltoallv time (usec) against message length (bytes, 1 B to 10 MB) on 128 PEs, for fifteen systems spanning Westmere (QLogic NDC X5670 2.93 GHz 6-C + QDR with QLogic MPI, MVAPICH2 1.6 and Intel MPI 4.0.0; Fujitsu BX922 X5650 2.67 GHz IB-QDR), Sandy Bridge (Fujitsu CX250 E5-2670/2.6 GHz, E5-2690/2.9 GHz and E5-2690/2.9 GHz [T] IB-QDR), Ivy Bridge (Intel E5-2697 v2 Mellanox FDR and True Scale PSM; Bull B710 E5-2697 v2 Mellanox FDR; Dell R720 E5-2680 v2 Connect-IB; Intel E5-2690 v2 True Scale QDR; ClusterVision E5-2650 v2 True Scale QDR; Cray XC30 E5-2697 v2 Dragonfly/ARIES [Archer]) and Haswell (Intel E5-2697 v3 True Scale QDR). Lower is better; annotated times include 96.8, 226.5, 136.6 and 426.8 usec. Alltoallv performance determines CASTEP performance.]

Page 12

HPC Challenge Benchmark (Source: http://icl.cs.utk.edu/hpcc/)

• HPL - Linpack TPP benchmark: measures the floating-point rate of execution for solving a linear system of equations.
• DGEMM - measures the floating-point rate of execution of double-precision real matrix-matrix multiplication.
• STREAM - measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for a simple vector kernel.
• PTRANS - parallel matrix transpose: exercises the communications where pairs of processors communicate with each other simultaneously. A useful test of the total communications capacity of the network.
• RandomAccess - measures the rate of integer random updates of memory (GUPS).
• FFTE - measures the floating-point rate of execution of a double-precision complex one-dimensional Discrete Fourier Transform. Performance is a combination of flops, memory and network bandwidth.
• Communication bandwidth and latency - tests to measure latency and bandwidth of a number of simultaneous communication patterns; based on b_eff (effective bandwidth benchmark).

HPL matrix size N used at each core count:
  CPU cores    N
  128          83,000
  256          117,000
  512          166,000
  1024         234,000
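The chosen matrix sizes look consistent with keeping the memory footprint per core roughly constant (my inference, not stated on the slide), i.e.

$$N_p \approx N_{128}\sqrt{p/128}, \qquad \text{e.g. } 83{,}000\times\sqrt{2} \approx 117{,}000,\quad 83{,}000\times 2 = 166{,}000.$$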

Page 13

HPCC - 128 Processing Elements [Matrix Size 83,000]

[Radar chart of normalised (0.0-1.0) HPCC results for HPL [Gflops], G-PTRANS [GB/s], G-RandomAccess [Gup/s], G-FFTE [Gflops], EP-STREAM Sys [GB/s], EP-STREAM Triad [GB/s], EP-DGEMM [Gflops], RandomRing Bandwidth [Gbytes] and RandomRing Latency [usec], comparing: Fujitsu BX922 Westmere X5650 2.67 GHz IB-QDR; Fujitsu CX250 Sandy Bridge E5-2670/2.6 GHz, E5-2690/2.9 GHz and E5-2690/2.9 GHz (T) IB-QDR; Dell R720 Ivy Bridge E5-2680 v2 2.8 GHz (T) Connect-IB; Intel Ivy Bridge E5-2690 v2 3.0 GHz (T) True Scale QDR; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Bull Haswell E5-2695 v3 2.3 GHz (T), E5-2690 v3 2.6 GHz (T) and E5-2680 v3 2.5 GHz (T) Connect-IB; Intel Haswell E5-2697 v3 2.6 GHz (T) True Scale QDR.]

Haswell clusters are only dominant in HPL and DGEMM (AVX2); they are outperformed in many of the other benchmarks, with weak EP-STREAM performance - a notable issue with memory bandwidth.

Page 14

HPCC - 256 Processing Elements [Matrix Size 117,000]

[Radar chart of normalised (0.0-1.0) HPCC results for the same nine metrics and the same eleven clusters as the 128-PE chart.]

As at 128 PEs, the Haswell clusters are only dominant in HPL and DGEMM (AVX2); they are outperformed in many of the other benchmarks, with weak EP-STREAM performance - a notable issue with memory bandwidth.

Page 15

The Test Suite

• The test suite comprises both synthetics and end-user applications. The synthetics include the HPCC (http://icl.cs.utk.edu/hpcc/) and IMB (http://software.intel.com/en-us/articles/intel-mpi-benchmarks) benchmarks.
• A variety of "open source" and commercial end-user application codes has been selected:
    GROMACS, LAMMPS, NAMD, DL_POLY Classic and DL_POLY 4 (molecular dynamics)
    Quantum ESPRESSO, SIESTA, CP2K, ONETEP, CASTEP and VASP (ab initio materials properties)
    NWChem, GAMESS-US and GAMESS-UK (molecular electronic structure)
• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed, e.g. memory bandwidth and latency, node floating-point performance, interconnect performance (both latency and bandwidth) and sustained I/O performance.

Page 16

Allinea Performance Reports

Allinea Performance Reports provides a mechanism to characterize and understand the performance of HPC application runs through a single-page HTML report.

• Based on Allinea MAP's low-overhead adaptive sampling technology, which keeps the data volumes collected and the application overhead low.
• Modest application slowdown (typically 5%), even with thousands of MPI processes.
• Runs transparently on existing codes by adding a single command to execution scripts.
• If submitted through a batch queuing system, the submission script is modified to load the Allinea module and add the 'perf-report' command in front of the required mpiexec command:

    perf-report mpiexec -n 256 $code
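As an illustration of that last point, a minimal sketch of a modified submission script is shown below; the scheduler directives, module name and executable are assumptions, not taken from the slides:

    #!/bin/bash
    #PBS -N dlpoly4_nacl
    #PBS -l select=16:ncpus=16:mpiprocs=16
    #PBS -l walltime=01:00:00

    cd $PBS_O_WORKDIR

    # Load the Allinea tools (module name is site-specific)
    module load allinea-reports

    # Prefix the usual MPI launch line with perf-report; the single-page
    # report is written alongside the normal job output.
    perf-report mpiexec -n 256 ./DLPOLY.Z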

Page 17

Allinea Performance Reports

A Report Summary characterizes how the application's wallclock time was spent, broken down into CPU, MPI and I/O:

CPU - Time spent computing: % of wall-clock time spent in application and library code, excluding time spent in MPI calls and I/O calls.
MPI - Time spent communicating: % of wall-clock time spent in MPI calls such as MPI_Send, MPI_Reduce and MPI_Barrier.
I/O - Time spent reading from and writing to the filesystem: % of wall-clock time spent in system library calls such as read, write and close.

CPU Breakdown - breaks the CPU time down further by analyzing the kinds of instructions it was spent on:

Scalar numeric ops - % of time spent executing arithmetic operations; this does not include time spent in the more efficient vectorized versions of those operations.
Vector numeric ops - % of time spent executing vectorized arithmetic operations such as Intel's SSE / AVX extensions (2x-4x performance improvements).
Memory accesses - % of time spent in memory access operations. A high figure shows the application is memory-bound and is not able to make full use of the CPU's resources.

Page 18

I. DL_POLY 4 - NaCl Simulation (NaCl, 216,000 atoms; 200 time steps)

[Bar chart: performance at 64, 128 and 256 processing elements, relative to the Fujitsu HTC X5650 2.67 GHz 6-C on 16 PEs (higher is better), for: Fujitsu BX922 Westmere X5650 2.67 GHz IB-QDR; Fujitsu CX250 Sandy Bridge E5-2670/2.6 GHz and E5-2690/2.9 GHz [T] IB-QDR; Cray XC30 E5-2697 v2 2.7 GHz ARIES [Archer]; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Dell R720 Ivy Bridge E5-2680 v2 2.8 GHz (T) Connect-IB; Intel Ivy Bridge E5-2697 v2 3.0 GHz (T) True Scale QDR; Bull Haswell E5-2695 v3 2.3 GHz Connect-IB; Bull Haswell E5-2680 v3 2.5 GHz (T) Connect-IB; Intel Haswell E5-2697 v3 2.6 GHz (T) True Scale QDR. Annotated values range from 2.4 at 64 PEs up to 17.6 at 256 PEs.]
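In these application charts, "performance relative to" a baseline appears to be the usual ratio of elapsed times (my reading of the labels, not stated explicitly on the slides):

$$\text{Performance}(\text{system},\,p) = \frac{T(\text{Fujitsu HTC X5650, 16 PEs})}{T(\text{system},\,p\ \text{PEs})}$$

so a value of 16 at 256 PEs means that run finished sixteen times faster than the 16-PE Westmere reference job.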

Page 19

II. GROMACS and the DPPC Benchmarks

DPPC Benchmarks
¤ DPPC: a phospholipid membrane consisting of 1024 dipalmitoylphosphatidylcholine (DPPC) lipids in a bilayer configuration with 23 water molecules per lipid, for a total of 121,856 atoms. Two data sets: DPPC-cutoff and DPPC-pme.
  • Cut-off: twin-range cut-offs with neighbour-list cut-off rlist and Coulomb cut-off rcoulomb, where rcoulomb ≥ rlist. Integration time step = 0.002 ps (2 fs). Pair lists updated every 10 steps.
  • PME: fast Particle-Mesh Ewald electrostatics. Direct space is treated as in the Ewald sum, while the reciprocal part is performed with FFTs.
All test cases run 5,000 steps x 2 fs = 10 ps of simulation time.

GROMACS v4.6.1 (GROningen MAchine for Chemical Simulations) is a molecular dynamics package designed for simulations of proteins, lipids and nucleic acids [University of Groningen].
Berk Hess et al., "GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation", Journal of Chemical Theory and Computation 4 (3): 435-447.

Page 20

GROMACS - DPPC-pme (121,856 atoms; 5,000 time steps; Coulomb type = PME)

[Performance at 16-256 processing elements, relative to the Fujitsu HTC X5650 2.67 GHz 6-C on 16 PEs (higher is better), for: Fujitsu BX922 Westmere X5650 2.67 GHz IB-QDR; Fujitsu CX250 E5-2670/2.6 GHz, E5-2690/2.9 GHz and E5-2690/2.9 GHz [T] IB-QDR; Cray XC30 E5-2697 v2 2.7 GHz ARIES [Archer]; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Dell R720 Ivy Bridge E5-2680 v2 2.8 GHz (T) Connect-IB; Intel Ivy Bridge E5-2697 v2 3.0 GHz (T) True Scale QDR; Bull Haswell E5-2695 v3 2.3 GHz, E5-2680 v3 2.5 GHz (T) and E5-2690 v3 2.6 GHz Connect-IB; Intel Haswell E5-2697 v3 2.6 GHz (T) True Scale QDR. Annotated values: 2.3, 4.5, 8.7, 16.4 and 30.6.]

Page 21

GROMACS - DPPC-pme Performance Report (PME: fast Particle-Mesh Ewald electrostatics), Performance Data (32-256 PEs)

[Two bar charts at 32, 64, 128 and 256 PEs: (i) CPU time breakdown into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%); (ii) total wallclock time breakdown into CPU (%) and MPI (%).]

Page 22

II. GROMACS - Ion Channel Benchmark

• The membrane protein GluCl, a pentameric chloride channel embedded in a POPC lipid bilayer. This system contains a modest 141,677 atoms, one of the key target sizes for biomolecular simulations given the importance of these proteins for pharmaceutical applications.
• It is particularly challenging due to the highly inhomogeneous and anisotropic environment in the membrane, which poses hard challenges for load balancing with domain decomposition.
• The simulation uses (i) a 2.5 femtosecond time step and constrained bonds, typical for many biomolecular systems, and (ii) PME electrostatics, with 3D FFTs that are difficult to parallelize efficiently.

Run options: -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000
  -maxh      : terminate after 0.99 times this time (hours), i.e. after ~30 minutes
  -resethway : reset timing counters halfway through the run, so the reported walltime and performance refer to the second half of the simulation
  -noconfout : do not save output coordinates/velocities at the end
  -nsteps    : run this number of steps, no matter what is requested in the input file
(An example launch line is sketched below.)
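Purely for illustration, the options above would be combined with an ordinary MPI launch along these lines; the launcher, binary name and process count are assumptions, not taken from the slides:

    mpiexec -n 256 mdrun_mpi -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000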

Page 23

II. GROMACS - Ion Channel Benchmark (141,677 atoms; 10,000 time steps)

[Performance at 64-512 processing elements, relative to the Fujitsu CX250 E5-2670 8-C on 64 PEs (higher is better), for: Fujitsu CX250 E5-2670/2.6 GHz IB-QDR; Fujitsu CX250 E5-2690/2.9 GHz and E5-2690/2.9 GHz [T] IB-QDR; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Bull Haswell E5-2680 v3 2.5 GHz (T) Connect-IB. The two annotated series rise from 1.0 to 3.9 and from 1.3 to 7.4 between 64 and 512 PEs.]

Page 24

II. GROMACS - Lignocellulose Benchmark

Model of cellulose and lignocellulosic biomass in an aqueous solution.
• This system of 3.3M atoms is inhomogeneous, but since it uses reaction-field electrostatics instead of PME it scales reasonably well.
• Reference: "Scaling of Multimillion-Atom Biological Molecular Dynamics Simulation on a Petascale Supercomputer", Roland Schulz, Benjamin Lindner, Loukas Petridis and Jeremy C. Smith, J. Chem. Theory Comput. 2009, 5, 2798-2808.

Run options as for the Ion Channel benchmark: -maxh 0.50 -resethway -noconfout -nsteps 10000.

Page 25

II. GROMACS - Lignocellulose Benchmark (3,316,463 atoms; 10,000 time steps)

[Performance at 64-512 processing elements, relative to the Fujitsu CX250 E5-2670 8-C on 64 PEs (higher is better), for the same five systems as the Ion Channel chart. The two annotated series rise from 1.0 to 6.0 and from 1.1 to 8.1 between 64 and 512 PEs.]

Page 26

III. LAMMPS

• LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator.
• LAMMPS has potentials for soft materials (biomolecules, polymers), solid-state materials (metals, semiconductors) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso or continuum scale.
• LAMMPS runs on single processors or in parallel using message-passing techniques and a spatial decomposition of the simulation domain. The code is designed to be easy to modify or extend with new functionality.

http://lammps.sandia.gov/index.html
S. Plimpton, "Fast Parallel Algorithms for Short-Range Molecular Dynamics", J. Comp. Phys., 117, 1-19 (1995).

Page 27

LAMMPS - Atomic fluid with Lennard-Jones potential (256,000 atoms; 5,000 time steps)

[Relative performance at 16-256 processing elements (higher is better) for: Fujitsu BX922 Westmere X5650 2.67 GHz IB-QDR; Fujitsu CX250 E5-2670/2.6 GHz, E5-2690/2.9 GHz and E5-2690/2.9 GHz (T) IB-QDR; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Cray XC30 E5-2697 v2 2.7 GHz ARIES [Archer]; Bull Haswell E5-2695 v3 2.3 GHz Connect-IB; Bull Haswell E5-2680 v3 2.5 GHz (T) Connect-IB; Dell R730 Haswell E5-2697 v3 2.6 GHz (T) Connect-IB; Intel Haswell E5-2697 v3 2.6 GHz (T) True Scale QDR. Annotated values range from 1.0 at 16 PEs up to 21.0 at 256 PEs.]

Page 28

LAMMPS - Lennard-Jones Fluid - Performance Report (256,000 atoms; 5,000 time steps), Performance Data (32-256 PEs)

[Two bar charts at 32, 64, 128 and 256 PEs: (i) CPU time breakdown into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%); (ii) total wallclock time breakdown into CPU (%) and MPI (%).]

Page 29

IV. NAMD

• NAMD is a parallel molecular dynamics code designed for high-performance simulation of large bio-molecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 200,000 cores for the largest simulations.
• NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM and X-PLOR. NAMD is distributed free of charge with source code.
• Tutorials show how to use NAMD and VMD for biomolecular modeling.
• The 2005 reference paper has over 3000 citations as of July 2013.
• The Force Field Toolkit (ffTK) greatly reduces these limitations by facilitating the development of parameters directly from first principles.

http://www.ks.uiuc.edu/Research/namd/
James C. Phillips et al., "Scalable molecular dynamics with NAMD", J. Comp. Chem., 26, 1781-1792 (2005).

Page 30

NAMD - STMV (virus) Benchmark: Satellite Tobacco Mosaic Virus, 1,066,628 atoms, periodic, PME, 500 time steps

[Performance at 16-256 processing elements, relative to the Fujitsu CX250 E5-2670/2.6 GHz on 16 PEs (higher is better), for: Fujitsu CX250 E5-2670/2.6 GHz, E5-2690/2.9 GHz and E5-2690/2.9 GHz [T] IB-QDR; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Cray XC30 E5-2697 v2 2.7 GHz ARIES [Archer]; Dell R730 Haswell E5-2697 v3 2.6 GHz (T) Connect-IB; Bull Haswell E5-2680 v3 2.5 GHz (T) Connect-IB. The two annotated series rise from 1.0 to 4.9 and from 1.3 to 6.4 between 16 and 256 PEs.]

Page 31

NAMD - STMV (virus) Benchmark - days/ns (1,066,628 atoms, periodic, PME, 500 time steps)

[Performance at 16-256 processing elements, relative to the Fujitsu CX250 E5-2670/2.6 GHz on 16 PEs (higher is better), for the same seven systems as the previous slide. Annotated values range from 1.0 at 16 PEs up to 19.3 at 256 PEs.]

Performance is measured in "days/ns": the number of compute days required to simulate 1 nanosecond of real time, so the lower the days/ns, the better.
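In terms of the wallclock time per MD step the conversion is straightforward (standard arithmetic, not from the slides): with an MD time step of Δt femtoseconds there are 10^6/Δt steps per nanosecond, so

$$\frac{\text{days}}{\text{ns}} = \frac{t_\text{step}\,[\mathrm{s}] \times 10^{6}/\Delta t\,[\mathrm{fs}]}{86{,}400}$$

e.g. 0.1 s per step with an illustrative 2 fs time step gives 0.1 x 500,000 / 86,400 ≈ 0.58 days/ns.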

Page 32

NAMD - STMV (virus) Performance Report (1,066,628 atoms, periodic, PME, 500 time steps), Performance Data (32-256 PEs)

[Two bar charts at 32, 64, 128 and 256 PEs: (i) CPU time breakdown into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%); (ii) total wallclock time breakdown into CPU (%) and MPI (%).]

Page 33

The GAMESS-US Software - Capabilities

• GAMESS is a program for ab initio molecular quantum chemistry. GAMESS can compute SCF wavefunctions of RHF, ROHF, UHF, GVB and MCSCF type.
• Correlation corrections include Configuration Interaction, second-order perturbation theory and Coupled-Cluster approaches, as well as the Density Functional Theory approximation.
• Excited states can be computed by CI, EOM or TD-DFT procedures.
• Nuclear gradients are available for automatic geometry optimization, transition state searches and reaction path following.
• Computation of the energy Hessian permits prediction of vibrational frequencies, with IR or Raman intensities.
• Solvent effects may be modelled by discrete Effective Fragment Potentials or continuum models, e.g. the Polarizable Continuum Model.

"Advances in electronic structure theory: GAMESS a decade later", M. S. Gordon and M. W. Schmidt, pp. 1167-1189, in "Theory and Applications of Computational Chemistry: the first forty years", C. E. Dykstra, G. Frenking, K. S. Kim, G. E. Scuseria (editors), Elsevier, Amsterdam, 2005.

Page 34

GAMESS-US Performance: C2H2S2 MP2, CCQ basis (370 GTOs), geometry optimisation [7 gradient calculations]; Distributed Data Interface (DDI)

[Performance at 64, 128 and 256 processing elements, relative to the Fujitsu HTC X5650 2.67 GHz 6-C on 32 PEs (higher is better), for: Fujitsu BX922 Westmere X5650 2.67 GHz IB-QDR; Fujitsu CX250 Sandy Bridge E5-2670/2.6 GHz, E5-2690/2.9 GHz and E5-2690/2.9 GHz (T) IB-QDR; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Cray XC30 E5-2697 v2 2.7 GHz ARIES [Archer]; Intel Ivy Bridge E5-2697 v2 3.0 GHz (T) True Scale QDR; Bull Haswell E5-2695 v3 2.3 GHz Connect-IB; Bull Haswell E5-2680 v3 2.5 GHz (T) Connect-IB; Intel Haswell E5-2697 v3 2.6 GHz (T) True Scale QDR. Annotated values range from 1.9 at 64 PEs up to 13.4 at 256 PEs.]

Page 35

GAMESS-US MP2 Performance Report: C2H2S2 MP2, CCQ basis (370 GTOs), geometry optimisation [7 gradient calculations], Performance Data (32-256 PEs)

[Two bar charts at 32, 64, 128 and 256 PEs: (i) CPU time breakdown into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%); (ii) total wallclock time breakdown into CPU (%) and MPI (%).]

Page 36

NWChem Architecture

[Block diagram: Generic Tasks (energy, structure, ...) sit above the Molecular Calculation Modules (SCF energy and gradient; DFT energy and gradient; MD, NMR, solvation; optimize, dynamics, ...), which build on the Molecular Modeling Toolkit (Integral API, Geometry Object, Basis Set Object, PeIGS, ...) and a run-time database, all layered over the Molecular Software Development Toolkit (Global Arrays, Memory Allocator, Parallel IO).]

• Developed as part of the construction of the Environmental Molecular Sciences Laboratory (EMSL).
• Envisioned to be used as an integrated component in solving DOE's Grand Challenge environmental restoration problems.
• Designed and developed to be a highly efficient and portable MPP computational chemistry package.
• Provides computational chemistry solutions that are scalable with respect to chemical system size as well as MPP hardware size.

Page 37

NWChem - Performance of the DFT Code: Zeolite Y cluster SiOSi7, DFT / DZVP (Si,O,H), charge-density fitting, S-VWN (3,554 GTOs)

[Performance at 64, 128 and 256 processing elements, relative to the Fujitsu HTC X5650 2.67 GHz 6-C on 64 PEs (higher is better), for: Fujitsu BX922 Westmere X5650 2.67 GHz IB-QDR; Fujitsu CX250 Sandy Bridge E5-2670/2.6 GHz, E5-2690/2.9 GHz and E5-2690/2.9 GHz [T] IB-QDR; Cray XC30 E5-2697 v2 2.7 GHz Dragonfly/ARIES [Archer]; Dell R720 Ivy Bridge E5-2680 v2 2.8 GHz Connect-IB; Bull Haswell E5-2680 v3 2.5 GHz (T) Connect-IB. Annotated values range from 1.0 at 64 PEs up to 3.1 at 256 PEs.]

Page 38

NWChem DFT Performance Report: Zeolite Y cluster SiOSi7, DFT / DZVP (Si,O,H), charge-density fitting, S-VWN (3,554 GTOs), Performance Data (64-512 PEs)

[Two bar charts at 64, 128, 256 and 512 PEs: (i) CPU time breakdown into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%); (ii) total wallclock time breakdown into CPU (%) and MPI (%).]

Page 39

Computational Materials - Advanced Materials Software

• VASP - performs ab-initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set.
• Quantum ESPRESSO - an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale. It is based on density-functional theory (DFT), plane waves and pseudopotentials.
• SIESTA - an O(N) DFT code for electronic structure calculations and ab initio molecular dynamics simulations of molecules and solids. It uses norm-conserving pseudopotentials and a linear combination of numerical atomic orbitals (LCAO) basis set.
• CP2K - a program to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems. It provides a framework for different methods such as DFT using a mixed Gaussian and plane-waves approach (GPW), and classical pair and many-body potentials.
• ONETEP (Order-N Electronic Total Energy Package) - a linear-scaling code for quantum-mechanical calculations based on DFT.

Page 40

VASP 5.3

VASP (5.3) performs ab-initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method and a plane-wave basis set. The approach is based on the (finite-temperature) local-density approximation with the free energy as the variational quantity, and an exact evaluation of the instantaneous electronic ground state at each MD time step.

VASP uses matrix diagonalisation schemes and an efficient Pulay/Broyden charge-density mixing. The interaction between ions and electrons is described by ultra-soft Vanderbilt pseudopotentials (US-PP) or by the projector-augmented wave (PAW) method, allowing a considerable reduction in the number of plane waves per atom for transition metals and first-row elements. Forces and the full stress tensor can be calculated with VASP and used to relax atoms into their instantaneous ground state.

http://cms.mpi.univie.ac.at/vasp/vasp/vasp.html

Page 41

VASP 5.3 - Pd-O Benchmark: Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid (31, 49, 45), 68,355 points

[Performance at 64, 128 and 256 processing elements, relative to the Fujitsu HTC X5650 2.67 GHz 6-C on 32 PEs (higher is better), for: Fujitsu Westmere X5650 2.67 GHz IB-QDR; Fujitsu CX250 E5-2670 2.6 GHz and E5-2690 2.9 GHz QDR; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Bull Haswell E5-2695 v3 2.3 GHz, E5-2680 v3 2.5 GHz (T) and E5-2690 v3 2.6 GHz Connect-IB; Intel Haswell E5-2697 v3 2.6 GHz (T) True Scale QDR. Annotated values range from 1.9 at 64 PEs up to 7.8 at 256 PEs.]

Page 42

VASP 5.3 - Pd-O Benchmark Performance Report: Palladium-Oxygen complex (Pd75O12), 8 k-points, FFT grid (31, 49, 45), 68,355 points, Performance Data (32-256 PEs)

[Two bar charts at 32, 64, 128 and 256 PEs: (i) CPU time breakdown into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%); (ii) total wallclock time breakdown into CPU (%) and MPI (%).]

Page 43

Quantum ESPRESSO

Quantum ESPRESSO is an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale. It is based on density-functional theory, plane waves and pseudopotentials. Capabilities include: ground-state calculations; structural optimization; transition states and minimum-energy paths; ab-initio molecular dynamics; response properties (DFPT); spectroscopic properties; quantum transport.

Benchmark details:
  DEISA AU112     Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions (180, 90, 288)
  PRACE GRIR443   Carbon-Iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions (180, 180, 192)

Page 44

Quantum ESPRESSO - GRIR443, Performance Data (128-512 PEs)

[Performance at 128-512 processing elements, relative to the Fujitsu X5650 2.67 GHz 6-C on 128 PEs (higher is better), for: HTC Westmere X5650 2.67 GHz QDR; Kay Bullx B510 E5-2680 2.7 GHz QDR; Fujitsu CX250 E5-2670 2.6 GHz and E5-2690 2.9 GHz QDR; Dell R720 Ivy Bridge E5-2680 v2 2.8 GHz Connect-IB; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Cray XC30 E5-2697 v2 2.7 GHz ARIES [Archer]; Bull Haswell E5-2695 v3 2.3 GHz, E5-2680 v3 2.5 GHz (T) and E5-2690 v3 2.6 GHz Connect-IB; Intel Haswell E5-2697 v3 2.6 GHz True Scale QDR (T). Annotated values range from 1.0 up to 6.7; note the strong performance of the Bull E5-2680 v3 system.]

Page 45

Quantum ESPRESSO - C200Ir243 Performance Report: Carbon-Iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions (180, 180, 192), Performance Data (64-512 PEs)

[Two bar charts at 64, 128, 256 and 512 PEs: (i) CPU time breakdown into scalar numeric ops (%), vector numeric ops (%) and memory accesses (%); (ii) total wallclock time breakdown into CPU (%) and MPI (%).]

Page 46

Target Codes and Data Sets - 128 PEs: 128 PE Performance [Applications]

[Normalised (0.0-1.0) 128-PE performance for eight application data sets - GROMACS DPPC/cutoff.mdp, GROMACS DPPC/pme.mdp, DL_POLY Classic NaCl, DL_POLY 4 NaCl, DL_POLY 4 Ar, Quantum ESPRESSO Au112, CP2K 512 H2O and GAMESS-UK DFT valino.A2.DZVP2 - across: Fujitsu BX922 Westmere X5650 2.67 GHz IB-QDR; Fujitsu CX250 E5-2690/2.9 GHz, E5-2690/2.9 GHz [T] and E5-2670/2.6 GHz IB-QDR; Intel Ivy Bridge E5-2690 v2 3.0 GHz (T) True Scale QDR; Dell R720 Ivy Bridge E5-2680 v2 2.8 GHz (T) Connect-IB; ClusterVision E5-2650 v2 2.6 GHz True Scale QDR; Cray XC30 E5-2697 v2 2.7 GHz ARIES [Archer]; Bull Haswell E5-2695 v3 2.3 GHz Connect-IB; Intel Haswell E5-2697 v3 2.6 GHz (T) True Scale QDR; Bull Haswell E5-2680 v3 2.5 GHz (T) Connect-IB.]

Page 47

Target Codes and Data Sets - 256 PEs
Improved performance of Intel Haswell E5-2697 v3 2.6 GHz True Scale QDR vs. Ivy Bridge E5-2650 v2 2.6 GHz True Scale QDR (core-to-core speedup factors at 256 PEs):

  QE Au112                        1.50
  QE GRIR443                      1.42
  SIESTA                          1.29
  VASP Example3                   1.15
  CP2K H2O.256                    1.15
  LAMMPS - LJ atoms               1.14
  DLPOLY-4 Ar                     1.13
  VASP Example4                   1.11
  GROMACS DPPC / cutoff.mdp       1.10
  OpenFoam - 3d3M                 1.10
  GAMESS-UK - cyclo.6-31G-dp      1.09
  GAMESS-UK - siosi7.3975         1.09
  GAMESS-UK - valino.A2.DZVP2     1.09
  CASTEP MnO2                     1.08
  GROMACS DPPC pme.mdp            1.07
  DLPOLY classic NaCl             1.05
  GAMESS-UK - hf12z-shell4        1.04
  DLPOLY-4 Nacl                   1.03
  ONETEP                          1.01
  CASTEP IDZ                      0.98
  NAMD stmv                       0.87

  Average Factor = 1.12

Page 48

Target Codes and Data Sets - 256 PEs
Improved performance of Bull Haswell E5-2680 v3 2.5 GHz (T) Connect-IB vs. Ivy Bridge E5-2650 v2 2.6 GHz True Scale QDR (core-to-core speedup factors at 256 PEs):

  QE Au112                        1.65
  SIESTA                          1.49
  QE GRIR443                      1.41
  CP2K H2O.256                    1.33
  VASP Example4                   1.11
  GROMACS DPPC / cutoff.mdp       1.10
  DLPOLY classic NaCl             1.09
  OpenFoam - 3d3M                 1.08
  GROMACS DPPC pme.mdp            1.07
  ONETEP                          1.06
  LAMMPS - LJ atoms               1.03
  GAMESS-UK - cyclo.6-31G-dp      1.03
  GAMESS-UK - valino.A2.DZVP2     1.03
  GAMESS-UK - siosi7.3975         1.02
  DLPOLY-4 Nacl                   1.00
  GAMESS-UK - hf12z-shell4        1.00
  CASTEP IDZ                      0.99
  NAMD stmv                       0.96
  VASP Example3                   0.96
  DLPOLY-4 Ar                     0.95
  CASTEP MnO2                     0.88

  Average Factor = 1.11

Page 49

Node to Node Comparison - 6-node Performance I.
Ratio = T(96 cores of the NOC cluster - Ivy Bridge E5-2650 v2 2.6 GHz True Scale QDR) / T(168 cores of the Intel cluster - Intel Haswell E5-2697 v3 2.6 GHz True Scale QDR):

  OpenFOAM [Cavity 3D-3M]       2.67
  DLPOLY4 [Ar LJ]               1.89
  Gromacs 4.6.1 [dppc-pme]      1.84
  GAMESS-UK (DZVP2)             1.77
  GAMESS-UK (sios7)             1.68
  GAMESS-UK (DFT-6-31G dp)      1.66
  Gromacs 4.6.1 [dppc-cutoff]   1.59
  CASTEP [MnO2]                 1.58
  GAMESS-UK (hf12z)             1.53
  DLPOLY Classic [NaCl]         1.51
  DLPOLY4 [NaCl]                1.47
  CASTEP [IDZ]                  1.38
  VASP                          1.35
  SIESTA                        0.91

Average = 1.63
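Since these are ratios of elapsed times for the same six-node jobs on each system, an average factor of 1.63 means the six-node Haswell runs complete, on average, in roughly 1/1.63 ≈ 61% of the Ivy Bridge time.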

Page 50

Node to Node Comparison - 6-node Performance II.
Ratio = T(96 cores of the NOC cluster - Ivy Bridge E5-2650 v2 2.6 GHz True Scale QDR) / T(144 cores of the Bull cluster - B720 nodes with Haswell E5-2680 v3 2.5 GHz (T) Connect-IB):

  OpenFOAM [Cavity 3D-3M]       1.70
  Gromacs 4.6.1 [dppc-pme]      1.65
  Gromacs 4.6.1 [dppc-cutoff]   1.53
  GAMESS-UK (DZVP2)             1.49
  DLPOLY4 [Ar LJ]               1.46
  GAMESS-UK (sios7)             1.44
  GAMESS-UK (DFT-6-31G dp)      1.43
  VASP                          1.43
  GAMESS-UK (hf12z)             1.43
  DLPOLY4 [NaCl]                1.42
  DLPOLY Classic [NaCl]         1.41
  CASTEP [MnO2]                 1.31
  CASTEP [IDZ]                  1.26
  SIESTA                        1.22

Average = 1.44

Page 51

Node to Node Comparison - Six-node Performance
Ratio = T(96 cores of the Fujitsu CX250 Sandy Bridge E5-2670/2.6 GHz IB-QDR) / T(168 cores of the Intel cluster - Intel Haswell E5-2697 v3 2.6 GHz True Scale QDR)

[Bar chart of this ratio for SIESTA, DLPOLY Classic [NaCl], DLPOLY4 [Ar LJ], DLPOLY4 [NaCl], Gromacs 4.6.1 [dppc-cutoff], Gromacs 4.6.1 [dppc-pme], OpenFOAM [Cavity 3D-3M], CASTEP [IDZ] and CASTEP [MnO2]; the individual ratios are 3.49, 3.42, 2.81, 2.27, 2.25, 1.97, 1.92, 1.72 and 1.11.]

Average = 2.33

Page 52

Performance Report Summary

Application Code (Data Set)                   Wallclock %     CPU Time Breakdown %        MPI Breakdown %
                                              CPU   MPI       Scalar  Vector  Memory      Collective  Point-to-point
GROMACS (DPPC cutoff.mdp)                     66    34        38      22      40          10          90
GROMACS (DPPC pme.mdp)                        47    53        38      24      38          26          74
LAMMPS (LJ atoms)                             92     8        63       0      37           5          96
NAMD (stmv)                                   43    56        30       0      70           0          100
GAMESS-UK, GAs (DFT valino DZVP2)             90    11        50       3      48          70          30
GAMESS-UK, GAs (DFT 2nd Derivs.)              82    18        38      15      47          99           1
GAMESS-UK, ScaLAPACK/MPI (cyclo.6-31G-dp)     72    28        41       5      54          93           8
GAMESS-UK, ScaLAPACK/MPI (siosi7, 3975 GTOs)  86    14        45       4      51          94           6
GAMESS-US (HF)                                34    66        38       2      59          26          74
GAMESS-US (MP2)                               41    59        33       5      63          10          90
NWChem (siosi7)                               55    45        46       8      47          59          41
CASTEP V5 (IDZ)                               46    53        34      31      34          87          13
CASTEP V5 (MnO2)                              17    83        28      25      47          99           1
CASTEP V7 (MnO2)                              18    82        28      25      46          100          0
CP2K (H2O 256)                                46    52        40      18      42          47          53
Quantum Espresso (Au112)                      41    59        40      35      25          89          11
Quantum Espresso (GRIR443)                    61    32        44      42      15          96           4
SIESTA (AU-HNC)                               19    81        34       6      61          59          41
VASP (PdO)                                    53    47        32      29      39          92           8
VASP (Zeolite)                                54    46        29      27      44          99           1

Page 53

Acknowledgements

• Intel - Jamie and Victor, for access to and help with a host of processors, and for access to the Wildcat clusters.
• Ludovic Sauge, Johann Peyrard and Peter Ingram (Bull), for informative performance discussions and access to the Ivy Bridge and Haswell clusters "SID" and "Robin" at the Bull Benchmarking Centre.
• Pak Lui and Gilad Shainer, for access to the "Jupiter" and "Thor" clusters at the HPC Advisory Council.
• Allinea - Avtar, Jacques and Patrick, and Sid Kashyap (HPC Wales), for access to MAP and the Performance Reports tools.
• Staff at Fujitsu, for providing Hideaki Kuraishi with access to Ivy Bridge and Haswell clusters.

Page 54

Summary

• Primary focus on Intel's new Haswell processor. The cluster benchmarks used the Intel Swindon 14-core E5-2697 v3 2.6 GHz dual EP-node cluster (True Scale interconnect) and three systems at Bull [the 12-core E5-2690 v3 2.6 GHz and E5-2680 v3 2.5 GHz, and the 14-core E5-2695 v3 2.3 GHz, all with Mellanox IB FDR interconnect]. Measurements are reported for parallel performance.
• Comparison with a number of HPC systems based on Intel's Ivy Bridge (E5-2697 v2, E5-2680 v2 and E5-2650 v2) and Sandy Bridge-based (E5-2690 and E5-2670) clusters.
  ¤ Variety of parallel benchmarks - synthetics (IMB and HPCC) and thirteen application codes: DL_POLY, GROMACS, NAMD, LAMMPS, GAMESS, GAMESS-UK, NWChem, Quantum ESPRESSO, ONETEP, SIESTA, CP2K, VASP and CASTEP.
  ¤ Performance metrics across a variety of data sets, featuring "core to core" and "node to node" workload comparisons, with analysis using IPM and Allinea's Performance Reports.

Page 55

Summary II.

• The enhanced performance of the Haswell clusters is at first sight rather modest.
  ¤ A core-to-core comparison of the Intel Haswell E5-2697 v3 2.6 GHz True Scale cluster across 21 data sets (and 13 applications) suggests only modest speedups (< 1.2) in 18 of these comparisons against the Ivy Bridge E5-2650 v2 2.6 GHz True Scale cluster. Only the Quantum ESPRESSO application shows speed-ups > 1.4.
  ¤ A node-to-node comparison, typical of the performance when running a workload, is more encouraging:
    6-node Intel Haswell E5-2697 v3 2.6 GHz benchmarks show a performance enhancement by factors of
    • 1.63 x the Ivy Bridge E5-2650 v2 2.6 GHz cluster;
    • 2.33 x the Fujitsu CX250 Sandy Bridge E5-2670/2.6 GHz cluster.
• Valuable insight is provided by Allinea's Performance Reports.