Performance of Intel’s Haswell Processor
Application Performance in Chemistry,
Materials Science and Nanoscience:
25th Machine Evaluation Workshop
Martyn Guest and Christine Kitchen
ARCCA & HPC Wales
and
Hideaki Kuraishi
Fujitsu Laboratories of Europe
3 December 2014
Outline
• Primary focus on Intel’s new Haswell processor, including
¤ 16-core E5-2698 v3 2.3 GHz;
¤ 14-core E5-2697 v3 2.6 GHz and E5-2695 v3 2.3 GHz (HCC)
¤ 12-core E5-2680 v3 2.5 GHz & E5-2690 v3 2.6 GHz (MCC).
¤ Clusters with dual-processor EP nodes featuring Mellanox IB
QDR and FDR, and Intel’s True Scale QDR Interconnects.
¤ Measurements of parallel application performance.
• Comparison with a number of HPC systems based on Intel
processors:
¤ Intel’s Sandy Bridge-based clusters (E5-2690 and E5-2670);
¤ Intel’s Ivy Bridge clusters: E5-2697 v2 (12-core), E5-2680 v2 and
E5-2690 v2 (10-core), and E5-2650 v2 (8-core);
¤ Mellanox IB QDR and FDR, and Intel’s True Scale QDR
Interconnects.
Outline – Performance Benchmarks
• Variety of Parallel Benchmarks
¤ Synthetic (Stream, IMB and HPCC) and application codes
¤ Thirteen applications:
• MD - DLPOLY, Gromacs, NAMD and LAMMPS;
• QC - NWChem, GAMESS and GAMESS-UK;
• CP – QUANTUM ESPRESSO, ONETEP, SIESTA, CP2K,
VASP, and CASTEP.
• Performance Metrics across a variety of data sets
¤ “Core to core” and “node to node” workload comparisons
¤ Analysis using IPM and Allinea’s Performance Report
¤ Two comparisons will be considered:
• Core to core comparison i.e. performance for jobs with a
fixed number of cores
• Node to Node comparison typical of the performance when
running a workload (real life production).
Background: Sandy Bridge and Ivy Bridge
• Sandy Bridge-EP and Ivy Bridge-EP processors share the same
microarchitecture (codenamed Sandy Bridge)
¤ Core-to-core floating-point performance is therefore expected to be the same
(8 FP64 operations/cycle/core)
• The processors are pin compatible
• Ivy Bridge-EP is the “Tick” step of the Sandy Bridge microarchitecture
within Intel’s production model (“CPU Refresh”)
¤ 22nm manufacturing process technology (32nm for Sandy Bridge)
• Thus, for a similar TDP, more cores are possible on the same die area
¤ 10 cores for the advanced family (MCC), 12 cores for the optimized
one (HCC)
• For the same core-to-core performance we therefore expect a throughput
benefit of 25-50% (the 10/8 and 12/8 core-count ratios of the 10- and
12-core processors respectively). Little throughput improvement is expected
for the 8-core parts.
• Other major improvements
¤ Increased memory speed (1866 MT/s vs. 1600 MT/s)
¤ Larger Last Level Cache (L3), but the same size per core (2.5 MB/core)
Haswell-EP Platform Highlights
Feature: Xeon E5-2600 v2 “Ivy Bridge-EP” vs. Xeon E5-2600 v3 “Haswell-EP”
Cores: Up to 12 vs. Up to 18
AVX support: AVX 1.0, 8 DP Flops/clock vs. AVX 2.0, 16 DP Flops/clock (see the worked example after this table)
Memory: 4 x DDR3 channels, 1866 (1DPC), 1600, 1333, 1066 vs. 4 x DDR4 channels, 2133 (1DPC), 1866 (2DPC), 1600, 1333
QPI: 8.0 GT/s vs. 9.6 GT/s
TDP: Up to 130W server / 150W workstation vs. Up to 145W server / 160W workstation
Power management: Same P-states for all cores, same core & uncore frequency vs. Per-core P-states, independent uncore frequency scaling
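As a rough illustration of what the AVX 2.0 headline figure means (a back-of-the-envelope calculation using the nominal base clocks quoted in this talk, ignoring the lower AVX base frequencies real parts run at): a 12-core Ivy Bridge E5-2697 v2 at 2.7 GHz peaks at 12 x 2.7 x 8 ≈ 259 DP Gflop/s per socket, while a 14-core Haswell E5-2697 v3 at 2.6 GHz peaks at 14 x 2.6 x 16 ≈ 582 DP Gflop/s, roughly 2.25x on paper. As the application results later in the talk show, sustained core-for-core gains are far smaller.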
Haswell Xeon E5 Variants - HCC, MCC and LCC
The high core count (HCC) variant has from 14 to 18 cores and has two
switch interconnects linking the rings that in turn bind the cores and L3
cache segments together with the QPI buses, PCI-Express buses, and
memory controllers that also hang off the rings.
The medium core count (MCC) variant has from 10 to 12 cores and
basically cuts off six cores and bends one of the rings back on itself to
link to the second memory controller on the die.
The low core count (LCC) version has from 4 to 8 cores and is basically
the left half of the MCC variant with the two interconnect switches, four of
the cores, and the bent ring removed.
Haswell Cluster Systems Benchmarked
Cluster / Configuration

Intel Haswell EP-2697 v3 2.6 GHz 14C: Intel Swindon “Wildcat1”, 16 nodes. Node configuration: 2 x Xeon E5-2697 v3 @ 2.6 GHz, 14-core (C1 pre-production processors), 145W TDP, 35MB cache, 128GB DDR4 (8 x 16GB) at 2133MHz. Interconnect: Intel True Scale, 2 x True Scale HCAs per node.

Dell Haswell EP-2697 v3 2.6 GHz 14C: “Thor” at the HPC Advisory Council. Dell PowerEdge R730 32-node cluster; dual-socket Intel Xeon 14-core E5-2697 v3 @ 2.6 GHz; memory: 64GB DDR4 2133MHz RDIMMs per node. Interconnect: Mellanox ConnectX 56Gb/s FDR InfiniBand adapters; Mellanox SwitchX SX6036 36-port 56Gb/s FDR IB switch.

Bull Haswell EP-2695 v3 2.3 GHz 14C: Bull “Robin”, 32 nodes. Node configuration: 2 x Xeon E5-2695 v3 @ 2.3 GHz, 14-core, 120W TDP, 35MB cache, 128GB DDR4 (8 x 16GB) at 2133MHz. Interconnect: Mellanox Connect-IB FDR.

Bull Haswell EP-2680 v3 (2.5 GHz) and EP-2690 v3 (2.6 GHz) 12C: Bull “Robin” (32 nodes) and “SID” (85 nodes). Node configurations: Robin: 2 x Xeon E5-2690 v3 @ 2.6 GHz, 12-core, 135W TDP; SID: B720, 2 x Xeon E5-2680 v3 @ 2.5 GHz, 12-core, 120W TDP, 30MB cache, 128GB DDR4 at 2133MHz. Interconnect: Mellanox Connect-IB FDR.
Memory Bandwidth – Total STREAM performance

[Chart: STREAM TRIAD rate (MB/s) per node. The Haswell E5-26xx v3 systems (Bull e5-2695 v3, Intel e5-2697 v3, Dell R730 e5-2697 v3) sustain roughly 114,000-119,000 MB/s, the Ivy Bridge E5-26xx v2 systems roughly 88,000-96,000 MB/s, the Sandy Bridge E5-26xx systems roughly 73,000-77,000 MB/s, and the Westmere X5650 baseline about 39,500 MB/s.]
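For reference, the TRIAD kernel at the heart of these figures is essentially the loop below (a minimal single-threaded sketch; the standard STREAM benchmark also times Copy, Scale and Add, repeats each kernel several times, and the per-node figures above presumably come from a multi-threaded build run across all cores of a node):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000   /* large enough to overflow the last-level cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;

    for (long i = 0; i < N; i++) { b[i] = 2.0; c[i] = 1.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)          /* TRIAD: a = b + scalar*c */
        a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    /* three 8-byte arrays are streamed per iteration (2 reads + 1 write) */
    printf("a[0]=%.1f  TRIAD rate: %.0f MB/s\n", a[0], 3.0 * 8.0 * N / secs / 1e6);

    free(a); free(b); free(c);
    return 0;
}
```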
Memory Bandwidth – STREAM / core performance

[Chart: STREAM TRIAD rate (MB/s) per core. The Sandy Bridge systems achieve roughly 4,600-4,800 MB/s per core, the 12-core Ivy Bridge systems roughly 3,650-4,000 MB/s per core (the 8-core e5-2650 v2 reaches about 5,840 MB/s per core), the Haswell systems roughly 4,090-4,240 MB/s per core, and the Westmere X5650 about 3,290 MB/s per core.]
MPI Performance – PingPong

IMB Benchmark (Intel), 1 PE / node.

[Chart: PingPong bandwidth (Mbytes/s, higher is better) and latency as a function of message length (Bytes) for: Cray XC30 e5-2697v2 2.7 GHz Dragonfly/ARIES [Archer]; Intel Haswell e5-2697v3 2.6 GHz True Scale QDR; Dell R720 Ivy Bridge e5-2680v2 2.8 GHz Connect-IB; Intel Ivy Bridge e5-2697v2 2.7 GHz Mellanox FDR; ClusterVision e5-2650v2 2.6 GHz True Scale QDR; Fujitsu CX250 e5-2690/2.9 GHz IB-QDR; Fujitsu CX250 e5-2670/2.6 GHz IB-QDR; Fujitsu BX922 X5650 2.66 GHz + IB; Merlin Xeon E5472 3.0 GHz QC + IB (mvapich2 1.4); Intel Xeon E5570 2.93 GHz QC + 10GbitE (Intel MPI 3.2); Fujitsu BX922 X5650 2.66 GHz + GbitE (Tier2A).]
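The IMB PingPong kernel behind these curves is, in essence, a timed message exchange between two ranks on different nodes. A minimal sketch (the real benchmark sweeps message sizes and repetition counts, reporting latency at small messages and bandwidth at large ones):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1 << 20;           /* 1 MB message */
    const int reps = 100;
    char *buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {                   /* rank 0: send, then wait for the echo */
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {            /* rank 1: receive, then send back */
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0) {
        double one_way = t / (2.0 * reps); /* time per one-way transfer */
        printf("time %.2f usec, bandwidth %.0f Mbytes/s\n",
               one_way * 1e6, nbytes / one_way / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run with two ranks placed on different nodes (e.g. `mpirun -np 2 --map-by node ./pingpong`) to match the "1 PE / node" configuration above.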
MPI Collectives – Alltoallv (128 PEs)

IMB Benchmark (Intel), 128 PEs.

[Chart: Alltoallv measured time (usec, lower is better) as a function of message length (Bytes) for: QLogic NDC X5670 2.93GHz 6-C + QDR (QLogic MPI, mvapich2-1.6 and impi-4.0.0); Fujitsu BX922 Westmere X5650 2.67GHz + IB-QDR (HTC); Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz, e5-2690/2.9 GHz and e5-2690/2.9 GHz [T] IB-QDR; Intel Ivy Bridge e5-2697v2 2.7 GHz (Mellanox FDR and True Scale PSM); Bull B710 Ivy Bridge e5-2697v2 2.7 GHz Mellanox FDR; Dell R720 Ivy Bridge e5-2680v2 2.8 GHz Connect-IB; Intel Ivy Bridge e5-2690v2 3.0 GHz True Scale QDR; ClusterVision e5-2650v2 2.6 GHz True Scale QDR; Cray XC30 e5-2697v2 2.7 GHz Dragonfly/ARIES [Archer]; Intel Haswell e5-2697v3 2.6 GHz True Scale QDR.]

Alltoallv performance largely determines CASTEP performance.
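MPI_Alltoallv lets every rank send a differently sized block of data to every other rank; plane-wave codes such as CASTEP typically rely on this kind of collective to transpose distributed 3D FFT grids, which is why the slide notes it largely determines CASTEP performance. A minimal sketch of the call, assuming nothing beyond standard MPI:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal Alltoallv sketch: every rank sends (rank+1) ints to every other rank. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendcounts = malloc(size * sizeof(int));
    int *recvcounts = malloc(size * sizeof(int));
    int *sdispls    = malloc(size * sizeof(int));
    int *rdispls    = malloc(size * sizeof(int));

    int stotal = 0, rtotal = 0;
    for (int p = 0; p < size; p++) {
        sendcounts[p] = rank + 1;          /* what this rank sends to rank p  */
        recvcounts[p] = p + 1;             /* what this rank receives from p  */
        sdispls[p] = stotal;  stotal += sendcounts[p];
        rdispls[p] = rtotal;  rtotal += recvcounts[p];
    }

    int *sendbuf = malloc(stotal * sizeof(int));
    int *recvbuf = malloc(rtotal * sizeof(int));
    for (int i = 0; i < stotal; i++) sendbuf[i] = rank;

    double t0 = MPI_Wtime();
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("Alltoallv on %d ranks took %.1f usec\n", size, t * 1e6);

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```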
HPC Challenge Benchmark (Source: http://icl.cs.utk.edu/hpcc/)

• HPL - the Linpack TPP benchmark; measures the floating-point rate of execution for solving a linear system of equations.
• DGEMM - measures the floating-point rate of execution of double-precision real matrix-matrix multiplication.
• STREAM - measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for simple vector kernels.
• PTRANS - parallel matrix transpose; exercises the communications where pairs of processors communicate with each other simultaneously. A useful test of the total communications capacity of the network.
• RandomAccess - measures the rate of integer random updates of memory (GUPS).
• FFTE - measures the floating-point rate of execution of a double-precision complex one-dimensional Discrete Fourier Transform. Performance is a combination of flops, memory and network bandwidth.
• Communication bandwidth and latency - tests to measure the latency and bandwidth of a number of simultaneous communication patterns; based on b_eff (the effective bandwidth benchmark).

HPL problem sizes used here (see the note after this table):

CPU cores   N (HPL matrix size)
128         83,000
256         117,000
512         166,000
1024        234,000
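Note on the matrix sizes: they appear to have been chosen to keep the memory footprint per core roughly constant. The HPL matrix needs 8N² bytes in double precision, i.e. about 55 GB at N = 83,000 (≈0.43 GB per core over 128 cores) and about 438 GB at N = 234,000 (again ≈0.43 GB per core over 1024 cores), so N grows roughly with the square root of the core count.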
HPCC - 128 Processing Elements [Matrix Size 83,000]

[Chart: HPCC results, normalised (0-1), for HPL (Gflops), G-PTRANS (GB/s), G-RandomAccess (Gup/s), G-FFTE (Gflops), EP-STREAM Sys (GB/s), EP-STREAM Triad (GB/s), EP-DGEMM (Gflops), RandomRing Bandwidth (GBytes) and RandomRing Latency (usec), for: Fujitsu BX922 Westmere X5650 2.67GHz + IB-QDR; Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz, e5-2690/2.9GHz and e5-2690/2.9GHz (T) IB-QDR; Dell R720 Ivy Bridge e5-2680v2 2.8 GHz (T) Connect-IB; Intel Ivy Bridge e5-2690v2 3.0 GHz (T) True Scale QDR; ClusterVision e5-2650v2 2.6 GHz True Scale QDR; Bull Haswell e5-2695v3 2.3 GHz (T), e5-2690v3 2.6 GHz (T) and e5-2680v3 2.5 GHz (T) Connect-IB; Intel Haswell e5-2697v3 2.6 GHz (T) True Scale QDR.]

The Haswell clusters are dominant only in HPL and DGEMM (AVX-2); they are outperformed in many of the other benchmarks, with weak STREAM performance - a notable issue with memory bandwidth.
HPCC - 256 Processing Elements [Matrix Size 117,000]

[Chart: the same nine HPCC metrics, normalised (0-1), for the same Westmere, Sandy Bridge, Ivy Bridge and Haswell systems as the 128-PE comparison.]

As at 128 PEs, the Haswell clusters are dominant only in HPL and DGEMM (AVX-2) and are outperformed in many of the other benchmarks, with weak STREAM performance - a notable issue with memory bandwidth.
The Test Suite
• The Test suite comprises both synthetics & end-user applications.
Synthetics include HPCC (http://icl.cs.utk.edu/hpcc/) & IMB benchmarks
(http://software.intel.com/en-us/articles/intel-mpi-benchmarks):
• A variety of “open source” and commercial end-user application codes has
been selected:
• These stress various aspects of the architectures under consideration
and should provide a level of insight into why particular levels of
performance are observed e.g., memory bandwidth and latency, node
floating point performance and interconnect performance (both latency
and B/W) and sustained I/O performance.
GROMACS, LAMMPS, NAMD, DL_POLY classic & DL_POLY-4
(molecular dynamics)
Quantum Espresso, Siesta, CP2K, ONETEP, CASTEP and VASP
(ab initio Materials properties)
NWChem, GAMESS-US and GAMESS-UK
(molecular electronic structure)
Allinea Performance Reports
Allinea Performance Reports provides a
mechanism to characterize and understand
the performance of HPC application runs
through a single-page HTML report.
• Based on Allinea MAP's low-overhead adaptive sampling
technology that keeps data volumes collected and application
overhead low.
• Modest application slowdown (typically 5%) even with thousands of
MPI processes.
• Runs transparently on existing codes by adding a single command
to execution scripts.
• If submitted through a batch queuing system, then the submission
script is modified to load the Allinea module and add the 'perf-report'
command in front of the required mpiexec command.
• perf-report mpiexec -n 256 $code
Allinea Performance Reports
A Report Summary: this characterizes how the application's wallclock
time was spent, broken down into CPU, MPI and I/O, e.g.:
CPU - Time spent computing. % of wall-clock time spent in application and in
library code, excluding time spent in MPI calls and I/O calls.
MPI - Time spent communicating. % of wall-clock time spent in MPI calls such as
MPI_Send, MPI_Reduce and MPI_Barrier.
I/O - Time spent reading from and writing to the filesystem. % of wall-clock time
spent in system library calls such as read, write and close.
CPU Breakdown
Breaks down the time spent further by analyzing the kinds of instructions that this
time was spent on.
Scalar numeric ops - %-time spent executing arithmetic operations - does not
include time spent using the more efficient vectorized versions of operations.
Vector numeric ops - %-time spent executing vectorized arithmetic operations
such as Intel's SSE / AVX extensions (2x - 4x performance improvements; see the sketch after this list).
Memory accesses - %-time spent in memory access operations. A high figure
shows the application is memory-bound and is not able to make full use of the
CPU's resources.
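To make the scalar/vector distinction concrete, here is a minimal, hypothetical C illustration (not taken from any of the benchmark codes) of the kind of loop the report classifies one way or the other depending on how it is compiled:

```c
#include <stdio.h>
#define N 1000000
static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { b[i] = 1.5; c[i] = 2.5; }

    /* Built with e.g. gcc -O3 -mavx2, this loop compiles to packed AVX2
       multiply/add instructions and would show up under "vector numeric
       ops"; built with -O0 (or -fno-tree-vectorize) the same arithmetic
       is issued one element at a time and counts as "scalar numeric ops". */
    for (int i = 0; i < N; i++)
        a[i] = 0.5 * b[i] + 0.5 * c[i];

    printf("a[0] = %.2f, a[N-1] = %.2f\n", a[0], a[N - 1]);
    return 0;
}
```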
I. DL_POLY 4 – NaCl Simulation

NaCl, 216,000 atoms; 200 time steps.

[Chart: performance (relative to the Fujitsu HTC X5650 2.67 GHz 6-C on 16 PEs, higher is better) against number of processing elements (64-256 PEs) for: Fujitsu BX922 Westmere X5650 2.67GHz + IB-QDR; Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz and e5-2690/2.9GHz [T] IB-QDR; Cray XC30 e5-2697v2 2.7GHz ARIES [Archer]; ClusterVision e5-2650v2 2.6GHz Truescale QDR; Dell R720 Ivy Bridge e5-2680v2 2.8GHz (T) Connect-IB; Intel Ivy Bridge e5-2697v2 3.0GHz (T) True Scale QDR; Bull Haswell e5-2695v3 2.3GHz Connect-IB; Bull Haswell e5-2680v3 2.5GHz (T) Connect-IB; Intel Haswell e5-2697v3 2.6GHz (T) Truescale QDR.]
II. GROMACS and the DPPC Benchmarks
DPPC Benchmarks
¤ DPPC: A phospholipid membrane, consisting of 1024
dipalmitoylphosphatidylcholine (DPPC) lipids in a bilayer
configuration with 23 water molecules per lipid, for a total of
121,856 atoms. Two data sets - DPPC-cutoff and DPPC-pme
• Cut-off: twin-range cut-offs with neighborlist cut-off rlist and
Coulomb cut-off rcoulomb, where rcoulomb ≥ rlist. Time step
for integration = 0.002 ps (2 fs). Pair lists updated every 10 steps.
• PME; Fast Particle-Mesh Ewald electrostatics. Direct space is similar
to the Ewald sum, while the reciprocal part is performed with FFTs.
All test cases run 5,000 steps * 2fs = 10ps simulation time.
GROMACS v 4.6.1 (GROningen MAchine for Chemical
Simulations) is a molecular dynamics package designed for simulations
of proteins, lipids and nucleic acids [University of Groningen] .
Berk Hess et al. "GROMACS 4: Algorithms for Highly Efficient, Load-
Balanced, and Scalable Molecular Simulation". Journal of Chemical Theory
and Computation 4 (3): 435–447.
GROMACS – DPPC-pme

DPPC, 121,856 atoms; 5,000 time steps; Coulomb type = PME.

[Chart: performance (relative to the Fujitsu HTC X5650 2.67 GHz 6-C on 16 PEs, higher is better) against number of processing elements (16-256 PEs) for: Fujitsu BX922 Westmere X5650 2.67GHz + IB-QDR; Fujitsu CX250 e5-2670/2.6GHz, e5-2690/2.9GHz and e5-2690/2.9GHz [T] IB-QDR; Cray XC30 e5-2697v2 2.7GHz ARIES [Archer]; ClusterVision e5-2650v2 2.6GHz True Scale QDR; Dell R720 Ivy Bridge e5-2680v2 2.8GHz (T) Connect-IB; Intel Ivy Bridge e5-2697v2 3.0GHz (T) True Scale QDR; Bull Haswell E5-2695v3 2.3GHz, E5-2690v3 2.6GHz and E5-2680v3 2.5GHz (T) Connect-IB; Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR.]
GROMACS – DPPC-pme Performance Report

PME: Fast Particle-Mesh Ewald electrostatics. Performance data (32-256 PEs).

[Charts: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %) at 32, 64, 128 and 256 PEs.]
II. GROMACS - Ion Channel Benchmark
• The membrane protein GluCl, a pentameric chloride channel
embedded in a POPC lipid bilayer. This system contains a modest
141,677 atoms, one of the key target sizes for biomolecular simulations
given the importance of these proteins for pharmaceutical applications.
• It is particularly challenging due to a highly inhomogeneous and
anisotropic environment in the membrane, which poses hard
challenges for load balancing with domain decomposition.
• The simulation uses (i) a 2.5 femtosecond time step and constrained
bonds, typical for many biomolecular systems, and (ii) PME
electrostatics, with 3D-FFTs that are difficult to parallelize efficiently.
-s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000
maxh : Terminate after 0.99 times this time (hours) i.e. terminate after ~30 min
resethway : Reset timer counters halfway through the run, i.e. the reported
walltime and performance refer to the second half of the simulation.
noconfout : Do not save output coordinates/velocities at the end.
nsteps : Run this number of steps, no matter what is requested in the input file
II. GROMACS - Ion Channel Benchmark

Ion channel, 141,677 atoms; 10,000 time steps.

[Chart: performance (relative to the Fujitsu CX250 e5-2670 8-C on 64 PEs, higher is better) against number of processing elements (64-512 PEs) for: Fujitsu CX250 e5-2670/2.6GHz, e5-2690/2.9GHz and e5-2690/2.9GHz [T] IB-QDR; ClusterVision e5-2650v2 2.6GHz True Scale QDR; Bull Haswell E5-2680v3 2.5GHz (T) Connect-IB. The two annotated series rise from 1.0 to 3.9 and from 1.3 to 7.4 across 64-512 PEs.]
II. GROMACS - Lignocellulose Benchmark
Model of cellulose and lignocellulosic biomass in an aqueous solution.
• This system of 3.3M atoms is inhomogeneous, but since it uses
reaction-field electrostatics instead of PME it scales reasonably well.
• Reference: Scaling of Multimillion-Atom Biological Molecular
Dynamics Simulation on a Petascale Supercomputer Roland
Schulz, Benjamin Lindner, Loukas Petridis and Jeremy C. Smith,
J. Chem. Theory Comput. 2009, 5, 2798-2808
-s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000
maxh : Terminate after 0.99 times this time (hours) i.e. terminate after ~30 min
resethway : Reset timer counters halfway through the run, i.e. the reported
walltime and performance refer to the second half of the simulation.
noconfout : Do not save output coordinates/velocities at the end.
nsteps : Run this number of steps, no matter what is requested in the input file
II. GROMACS - Lignocellulose Benchmark

Lignocellulose, 3,316,463 atoms; 10,000 time steps.

[Chart: performance (relative to the Fujitsu CX250 e5-2670 8-C on 64 PEs, higher is better) against number of processing elements (64-512 PEs) for: Fujitsu CX250 e5-2670/2.6GHz, e5-2690/2.9GHz and e5-2690/2.9GHz [T] IB-QDR; ClusterVision e5-2650v2 2.6GHz True Scale QDR; Bull Haswell E5-2680v3 2.5GHz (T) Connect-IB. The two annotated series rise from 1.0 to 6.0 and from 1.1 to 8.1 across 64-512 PEs.]
• LAMMPS is a classical molecular dynamics code, and an acronym for
Large-scale Atomic/Molecular Massively Parallel Simulator.
• LAMMPS has potentials for soft materials (biomolecules, polymers) and
solid-state materials (metals, semiconductors) and coarse-grained or
mesoscopic systems. It can be used to model atoms or, more generically, as
a parallel particle simulator at the atomic, meso, or continuum scale.
• LAMMPS runs on single processors or in parallel using message-passing
techniques and a spatial-decomposition of the simulation domain. The code
is designed to be easy to modify or extend with new functionality.
http://lammps.sandia.gov/index.html
S. Plimpton, Fast Parallel
Algorithms for Short-Range
Molecular Dynamics, J
Comp Phys, 117, 1-19
(1995).
III. LAMMPS
LAMMPS – Atomic fluid with Lennard-Jones Potential

256,000 atoms; 5,000 time steps.

[Chart: performance (higher is better) against number of processing elements (16-256 PEs) for: Fujitsu BX922 Westmere X5650 2.67GHz IB-QDR; Fujitsu CX250 e5-2670/2.6GHz, e5-2690/2.9GHz and e5-2690/2.9GHz (T) IB-QDR; ClusterVision e5-2650v2 2.6GHz Truescale QDR; Cray XC30 e5-2697v2 2.7GHz ARIES [Archer]; Bull Haswell e5-2695v3 2.3GHz Connect-IB; Bull Haswell e5-2680v3 2.5GHz (T) Connect-IB; Dell R730 Haswell e5-2697v3 2.6GHz (T) Connect-IB; Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR.]
LAMMPS – Lennard-Jones Fluid - Performance Report

256,000 atoms; 5,000 time steps. Performance data (32-256 PEs).

[Charts: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %) at 32, 64, 128 and 256 PEs.]
• NAMD is a parallel molecular dynamics code designed for high-
performance simulation of large bio-molecular systems. Based on Charm++
parallel objects, NAMD scales to hundreds of cores for typical simulations
and beyond 200,000 cores for the largest simulations.
• NAMD uses the popular molecular graphics program VMD for simulation
setup and trajectory analysis, but is also file-compatible with AMBER,
CHARMM, and X-PLOR. NAMD is distributed free of charge with source
code.
• Tutorials show you how to use NAMD and VMD for biomolecular modeling.
• The 2005 reference paper has over 3000 citations as of July 2013.
http://www.ks.uiuc.edu/Research/namd/
James C. Phillips et al.,
Scalable molecular dynamics
with NAMD , J Comp Chem, 26,
1781-1792 (2005).
The Force Field Toolkit (ffTK)
facilitates the development of
force-field parameters directly
from first principles.
IV. NAMD
NAMD – STMV (virus) Benchmark

Satellite Tobacco Mosaic Virus (STMV) benchmark (1,066,628 atoms, periodic, PME), 500 time steps.

[Chart: performance (relative to the Fujitsu CX250 e5-2670/2.6 GHz on 16 PEs, higher is better) against number of processing elements (16-256 PEs) for: Fujitsu CX250 e5-2670/2.6GHz, e5-2690/2.9GHz and e5-2690/2.9GHz [T] IB-QDR; ClusterVision e5-2650v2 2.6GHz Truescale QDR; Cray XC30 e5-2697v2 2.7GHz ARIES [Archer]; Dell R730 Haswell e5-2697v3 2.6GHz (T) Connect-IB; Bull Haswell e5-2680v3 2.5GHz (T) Connect-IB. The two annotated series rise from 1.0 to 4.9 and from 1.3 to 6.4 across 16-256 PEs.]
NAMD – STMV (virus) Benchmark – days/ns

STMV (virus) benchmark (1,066,628 atoms, periodic, PME), 500 time steps.

[Chart: performance (relative to the Fujitsu CX250 e5-2670/2.6 GHz on 16 PEs) against number of processing elements (16-256 PEs) for the same systems as the previous slide.]

Performance here is measured in "days/ns": the number of compute days required to simulate 1 nanosecond of simulation time, so the lower the days/ns, the better.
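As a concrete illustration of the metric (not a number from these runs): a code that needs 86.4 s of wall time per simulated picosecond takes 1000 x 86.4 s = 86,400 s, i.e. exactly one day, per nanosecond, and so runs at 1 day/ns; halving the time to solution halves the days/ns figure.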
NAMD – STMV (virus) Performance Report

STMV (virus) benchmark (1,066,628 atoms, periodic, PME), 500 time steps. Performance data (32-256 PEs).

[Charts: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %) at 32, 64, 128 and 256 PEs.]
• GAMESS is a program for ab initio molecular quantum chemistry.
GAMESS can compute SCF wavefunctions including RHF, ROHF,
UHF, GVB and MCSCF.
• Correlation corrections include Configuration Interaction, second order
perturbation Theory, and Coupled-Cluster approaches, as well as the
Density Functional Theory approximation.
• Excited states can be computed by CI, EOM, or TD-DFT procedures.
• Nuclear gradients are available, for automatic geometry optimization,
transition state searches, or reaction path following.
• Computation of the energy hessian permits prediction of vibrational
frequencies, with IR or Raman intensities.
• Solvent effects may be modelled by the discrete Effective Fragment
potentials, or continuum models e.g. the Polarizable Continuum Model.
The GAMESS-US Software - Capabilities
"Advances in electronic structure theory: GAMESS a decade later" M.S.Gordon,
M.W.Schmidt pp. 1167-1189, in "Theory and Applications of Computational
Chemistry: the first forty years" C.E.Dykstra, G.Frenking, K.S.Kim, G.E.Scuseria
(editors), Elsevier, Amsterdam, 2005.
GAMESS-US Performance

C2H2S2 MP2, CCQ basis (370 GTOs), geometry optimisation [7 gradient calculations]. Parallelised via the Distributed Data Interface (DDI).

[Chart: performance (relative to the Fujitsu HTC X5650 2.67GHz 6-C on 32 PEs, higher is better) at 64, 128 and 256 PEs for: Fujitsu BX922 Westmere X5650 2.67GHz IB-QDR; Fujitsu CX250 Sandy Bridge e5-2670/2.6GHz, e5-2690/2.9GHz and e5-2690/2.9GHz (T) IB-QDR; ClusterVision e5-2650v2 2.6GHz Truescale QDR; Cray XC30 e5-2697v2 2.7GHz ARIES [Archer]; Intel Ivy Bridge e5-2697v2 3.0GHz (T) True Scale QDR; Bull Haswell E5-2695v3 2.3GHz Connect-IB; Bull Haswell e5-2680v3 2.5GHz (T) Connect-IB; Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR.]
GAMESS-US MP2 Performance Report

C2H2S2 MP2, CCQ basis (370 GTOs), geometry optimisation [7 gradient calculations]. Performance data (32-256 PEs).

[Charts: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %) at 32, 64, 128 and 256 PEs.]
NWChem Architecture

[Diagram: layered NWChem architecture. Generic tasks (energy, structure, ...) sit above the molecular calculation modules (SCF and DFT energies and gradients; MD, NMR, solvation; optimisation and dynamics), which are built on a molecular modelling toolkit (integral API, geometry object, basis set object, PeIGS, ...) and a molecular software development toolkit (run-time database, Global Arrays, Memory Allocator, Parallel IO).]
• Developed as part of the
construction of the
Environmental Molecular
Sciences Laboratory
(EMSL)
• Envisioned to be used as
an integrated component in
solving DOE’s Grand
Challenge environmental
restoration problems
• Designed and developed
to be a highly efficient and
portable MPP
computational chemistry
package
• Provides computational
chemistry solutions that are
scalable with respect to
chemical system size as
well as MPP hardware size
NWChem - Performance of the DFT Code

Zeolite Y cluster SiOSi7, DFT, DZVP basis (Si, O, H); charge-density fitting, S-VWN (3,554 GTOs).

[Chart: performance (relative to the Fujitsu HTC X5650 2.67GHz 6-C on 64 PEs, higher is better) at 64, 128 and 256 PEs for: Fujitsu BX922 Westmere X5650 2.67GHz + IB-QDR; Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz, e5-2690/2.9 GHz and e5-2690/2.9 GHz [T] IB-QDR; Cray XC30 e5-2697v2 2.7 GHz Dragonfly/ARIES [Archer]; Dell R720 Ivy Bridge e5-2680v2 2.8 GHz Connect-IB; Bull Haswell E5-2680v3 2.50GHz (T) Connect-IB. Annotated values range from 1.0 at 64 PEs to 3.1 at 256 PEs for the fastest systems.]
NWChem DFT Performance Report

Zeolite Y cluster SiOSi7, DFT, DZVP basis (Si, O, H); charge-density fitting, S-VWN (3,554 GTOs).

[Charts: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %) at 64, 128, 256 and 512 PEs.]
Computational Materials: Advanced Materials Software

• VASP – performs ab-initio QM molecular dynamics (MD) simulations using
pseudopotentials or the projector-augmented wave method and a plane
wave basis set.
• Quantum Espresso – an integrated suite of Open-Source computer codes
for electronic-structure calculations and materials modelling at the
nanoscale. It is based on density-functional theory (DFT), plane waves,
and pseudopotentials
• SIESTA - an O(N) DFT code for electronic structure calculations and ab
initio molecular dynamics simulations for molecules and solids. It uses
norm-conserving pseudopotentials and linear combination of numerical
atomic orbitals (LCAO) basis set.
• CP2K is a program to perform atomistic and molecular simulations of solid
state, liquid, molecular, and biological systems. It provides a framework for
different methods, e.g. DFT using a mixed Gaussian & plane waves
approach (GPW), and classical pair and many-body potentials.
• ONETEP (Order-N Electronic Total Energy Package) is a linear-scaling
code for quantum-mechanical calculations based on DFT.
VASP (5.3) performs ab-initio QM molecular dynamics (MD)
simulations using pseudopotentials or the projector-augmented
wave method and a plane wave basis set. Approach is based
on the (finite-temperature) local-density approximation with the
free energy as variational quantity & an exact evaluation of the
instantaneous electronic ground state at each MD time step.
VASP uses matrix diagonalisation schemes and an
efficient Pulay/Broyden charge density mixing.
The interaction between ions and electrons is
described by ultra-soft Vanderbilt
pseudopotentials (US-PP) or by the projector-
augmented wave (PAW) method (allowing for a
considerable reduction of the number of plane-
waves per atom for transition metals and first row
elements). Forces and the full stress tensor can be
calculated with VASP and used to relax atoms into
their instantaneous ground-state.
http://cms.mpi.univie.ac.at/vasp/vasp/vasp.html
VASP 5.3
VASP 5.3 – Pd-O Benchmark

Palladium-oxygen complex (Pd75O12), 8 k-points, FFT grid (31, 49, 45), 68,355 points.

[Chart: performance (relative to the Fujitsu HTC X5650 2.67 GHz 6-C on 32 PEs, higher is better) at 64, 128 and 256 PEs for: Fujitsu Westmere X5650 2.67GHz IB-QDR; Fujitsu CX250 e5-2670 2.6GHz and e5-2690 2.9GHz QDR; ClusterVision e5-2650v2 2.6GHz True Scale QDR; Bull Haswell e5-2695v3 2.3GHz, e5-2680v3 2.5GHz (T) and e5-2690v3 2.6GHz Connect-IB; Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR. Annotated values range from 1.9 at 64 PEs to 7.8 at 256 PEs.]
VASP 5.3 – Pd-O Benchmark Performance Report

Palladium-oxygen complex (Pd75O12), 8 k-points, FFT grid (31, 49, 45), 68,355 points. Performance data (32-256 PEs).

[Charts: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %) at 32, 64, 128 and 256 PEs.]
Quantum Espresso is an
integrated suite of Open-
Source computer codes
for electronic-structure
calculations and
materials modelling at the
nanoscale. It is based on
density-functional theory,
plane waves, and
pseudopotentials.
Ground-state calculations.
Structural Optimization.
Transition states & minimum energy paths.
Ab-initio molecular dynamics.
Response properties (DFPT).
Spectroscopic properties.
Quantum Transport.
Benchmark Details:
• DEISA AU112: Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions (180, 90, 288)
• PRACE GRIR443: Carbon-Iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions (180, 180, 192)
Quantum Espresso
Quantum Espresso – GRIR443. Performance Data (128-512 PEs)

[Chart: performance (relative to the Fujitsu X5650 2.67 GHz 6-C on 128 PEs, higher is better) against number of processing elements for: HTC Westmere X5650 2.67GHz QDR; Kay Bullx B510 e5-2680 2.7GHz QDR; Fujitsu CX250 e5-2670 2.6GHz and e5-2690 2.9GHz QDR; Dell R720 Ivy Bridge e5-2680v2 2.8GHz Connect-IB; ClusterVision e5-2650v2 2.6GHz Truescale QDR; Cray XC30 e5-2697v2 2.7GHz ARIES [Archer]; Bull Haswell e5-2695v3 2.3GHz, E5-2680v3 2.5GHz (T) and e5-2690v3 2.6GHz Connect-IB; Intel Haswell e5-2697v3 2.6GHz Truescale QDR (T). Note the strong performance of the Bull e5-2680v3.]
Quantum Espresso – C200Ir243 Performance Report

Carbon-Iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions (180, 180, 192).

[Charts: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %) at 64, 128, 256 and 512 PEs.]
Target Codes and Data Sets – 128 PEs

[Chart: 128-PE application performance, normalised (0-1), for GROMACS DPPC/cutoff.mdp, GROMACS DPPC pme.mdp, DLPOLY classic NaCl, DLPOLY-4 NaCl, DLPOLY-4 Ar, Quantum Espresso Au112, CP2K 512 H2O and GAMESS-UK DFT.valino.A2.DZVP2, across: Fujitsu BX922 Westmere X5650 2.67GHz + IB-QDR; Fujitsu CX250 e5-2690/2.9GHz, e5-2690/2.9GHz [T] and e5-2670/2.6GHz IB-QDR; Intel Ivy Bridge e5-2690v2 3.0GHz (T) True Scale QDR; Dell R720 Ivy Bridge e5-2680v2 2.8GHz (T) Connect-IB; ClusterVision e5-2650v2 2.6GHz Truescale QDR; Cray XC30 e5-2697v2 2.7GHz ARIES [Archer]; Bull Haswell e5-2695v3 2.3GHz Connect-IB; Intel Haswell e5-2697v3 2.6GHz (T) True Scale QDR; Bull Haswell E5-2680v3 2.5GHz (T) Connect-IB.]
Target Codes and Data Sets – 256 PEs

Improved performance of the Intel Haswell e5-2697v3 2.6 GHz True Scale QDR cluster vs. the Ivy Bridge e5-2650v2 2.6 GHz True Scale QDR cluster (speed-up factor per data set, average factor = 1.12):

QE Au112: 1.50
QE GRIR443: 1.42
SIESTA: 1.29
VASP Example3: 1.15
CP2K H2O.256: 1.15
LAMMPS - LJ atoms: 1.14
DLPOLY-4 Ar: 1.13
VASP Example4: 1.11
GROMACS DPPC / cutoff.mdp: 1.10
OpenFoam - 3d3M: 1.10
GAMESS-UK - cyclo.6-31G-dp: 1.09
GAMESS-UK - siosi7.3975: 1.09
GAMESS-UK - valino.A2.DZVP2: 1.09
CASTEP MnO2: 1.08
GROMACS DPPC pme.mdp: 1.07
DLPOLY classic NaCl: 1.05
GAMESS-UK - hf12z-shell4: 1.04
DLPOLY-4 NaCl: 1.03
ONETEP: 1.01
CASTEP IDZ: 0.98
NAMD stmv: 0.87
Target Codes and Data Sets – 256 PEs

Improved performance of the Bull Haswell E5-2680v3 2.5GHz (T) Connect-IB cluster vs. the Ivy Bridge e5-2650v2 2.6 GHz True Scale QDR cluster (speed-up factor per data set, average factor = 1.11):

QE Au112: 1.65
SIESTA: 1.49
QE GRIR443: 1.41
CP2K H2O.256: 1.33
VASP Example4: 1.11
GROMACS DPPC / cutoff.mdp: 1.10
DLPOLY classic NaCl: 1.09
OpenFoam - 3d3M: 1.08
GROMACS DPPC pme.mdp: 1.07
ONETEP: 1.06
LAMMPS - LJ atoms: 1.03
GAMESS-UK - cyclo.6-31G-dp: 1.03
GAMESS-UK - valino.A2.DZVP2: 1.03
GAMESS-UK - siosi7.3975: 1.02
DLPOLY-4 NaCl: 1.00
GAMESS-UK - hf12z-shell4: 1.00
CASTEP IDZ: 0.99
NAMD stmv: 0.96
VASP Example3: 0.96
DLPOLY-4 Ar: 0.95
CASTEP MnO2: 0.88
Node to Node Comparison – 6-node Performance I.

Ratio T (96 cores of the NOC cluster - Ivy Bridge e5-2650v2 2.6 GHz True Scale QDR) / T (168 cores of the Intel cluster - Intel Haswell e5-2697v3 2.6 GHz True Scale QDR); average = 1.63:

OpenFOAM [Cavity 3D-3M]: 2.67
DLPOLY4 [Ar LJ]: 1.89
Gromacs 4.6.1 [dppc-pme]: 1.84
GAMESS-UK (DZVP2): 1.77
GAMESS-UK (sios7): 1.68
GAMESS-UK (DFT-6-31G dp): 1.66
Gromacs 4.6.1 [dppc-cutoff]: 1.59
CASTEP [MnO2]: 1.58
GAMESS-UK (hf12z): 1.53
DLPOLY Classic [NaCl]: 1.51
DLPOLY4 [NaCl]: 1.47
CASTEP [IDZ]: 1.38
VASP: 1.35
SIESTA: 0.91
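For context on these node-to-node factors: at six nodes the Haswell system brings 168 cores to bear against 96 on the Ivy Bridge e5-2650 v2 system, a core-count ratio of 168/96 = 1.75, so the measured average factor of 1.63 corresponds to roughly 93% of naive core-count scaling on top of broadly comparable per-core speed (cf. the ~1.1 core-to-core factors above).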
Node to Node Comparison – 6-node Performance II.

Ratio T (96 cores of the NOC cluster - Ivy Bridge e5-2650v2 2.6 GHz True Scale QDR) / T (144 cores of the Bull cluster – B720 nodes with Haswell E5-2680v3 2.5GHz (T) Connect-IB); average = 1.44:

OpenFOAM [Cavity 3D-3M]: 1.70
Gromacs 4.6.1 [dppc-pme]: 1.65
Gromacs 4.6.1 [dppc-cutoff]: 1.53
GAMESS-UK (DZVP2): 1.49
DLPOLY4 [Ar LJ]: 1.46
GAMESS-UK (sios7): 1.44
GAMESS-UK (DFT-6-31G dp): 1.43
VASP: 1.43
GAMESS-UK (hf12z): 1.43
DLPOLY4 [NaCl]: 1.42
DLPOLY Classic [NaCl]: 1.41
CASTEP [MnO2]: 1.31
CASTEP [IDZ]: 1.26
SIESTA: 1.22
Node to Node Comparison – Six-node Performance

Ratio T (96 cores of Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR) / T (168 cores of the Intel cluster - Intel Haswell e5-2697v3 2.6 GHz True Scale QDR); average = 2.33:

SIESTA: 3.49
DLPOLY Classic [NaCl]: 2.27
DLPOLY4 [Ar LJ]: 3.42
DLPOLY4 [NaCl]: 2.25
Gromacs 4.6.1 [dppc-cutoff]: 1.92
Gromacs 4.6.1 [dppc-pme]: 1.97
OpenFOAM [Cavity 3D-3M]: 2.81
CASTEP [IDZ]: 1.72
CASTEP [MnO2]: 1.11
Performance Report Summary

Application Code | Data Set | Wallclock CPU % | Wallclock MPI % | CPU Scalar numeric ops % | CPU Vector numeric ops % | CPU Memory accesses % | MPI Collective calls % | MPI Point-to-point calls %
GROMACS | DPPC / cutoff.mdp | 66 | 34 | 38 | 22 | 40 | 10 | 90
GROMACS | DPPC pme.mdp | 47 | 53 | 38 | 24 | 38 | 26 | 74
LAMMPS | LJ atoms | 92 | 8 | 63 | 0 | 37 | 5 | 96
NAMD | stmv | 43 | 56 | 30 | 0 | 70 | 0 | 100
GAMESS-UK (GAs) | DFT valino DZVP2 | 90 | 11 | 50 | 3 | 48 | 70 | 30
GAMESS-UK (GAs) | DFT 2nd Derivs. | 82 | 18 | 38 | 15 | 47 | 99 | 1
GAMESS-UK (ScaLAPACK/MPI) | cyclo.6-31G-dp | 72 | 28 | 41 | 5 | 54 | 93 | 8
GAMESS-UK (ScaLAPACK/MPI) | Siosi7 - 3975 GTOs | 86 | 14 | 45 | 4 | 51 | 94 | 6
GAMESS-US | HF | 34 | 66 | 38 | 2 | 59 | 26 | 74
GAMESS-US | MP2 | 41 | 59 | 33 | 5 | 63 | 10 | 90
NWChem | siosi7 | 55 | 45 | 46 | 8 | 47 | 59 | 41
CASTEP (V5) | IDZ | 46 | 53 | 34 | 31 | 34 | 87 | 13
CASTEP (V5) | MnO2 | 17 | 83 | 28 | 25 | 47 | 99 | 1
CASTEP (V7) | MnO2 | 18 | 82 | 28 | 25 | 46 | 100 | 0
CP2K | (H2O) 256 | 46 | 52 | 40 | 18 | 42 | 47 | 53
Quantum Espresso | Au112 | 41 | 59 | 40 | 35 | 25 | 89 | 11
Quantum Espresso | GRIR443 | 61 | 32 | 44 | 42 | 15 | 96 | 4
SIESTA | AU-HNC | 19 | 81 | 34 | 6 | 61 | 59 | 41
VASP | PdO | 53 | 47 | 32 | 29 | 39 | 92 | 8
VASP | Zeolite | 54 | 46 | 29 | 27 | 44 | 99 | 1
Acknowledgements
• Intel – Jamie and Victor for access to and help with a host of
processors and for access to the Wildcat clusters.
• Ludovic Sauge, Johann Peyrard and Peter Ingram (Bull) for
informative performance discussions and access to the Ivy
Bridge and Haswell clusters “SID” and “Robin” at the Bull
Benchmarking Centre.
• Pak Lui and Gilad Shainer for access to the “Jupiter” and
“Thor” clusters at the HPC Advisory Council.
• Allinea – Avtar, Jacques and Patrick and Sid Kashyap (HPC
Wales) for access to MAP and the Performance Report tools.
• Staff at Fujitsu for providing Hideaki Kuraishi with access to Ivy
Bridge and Haswell clusters.
Summary
• Primary focus on Intel’s new Haswell processor. Cluster benchmarks
used the Intel Swindon 14-core e5-2697 v3 2.6 GHz dual-processor EP-node
cluster (True Scale interconnect) and three systems at Bull [the 12-core
e5-2690 v3 2.6 GHz and e5-2680 v3 2.5 GHz, and the 14-core e5-2695 v3
2.3 GHz, all with Mellanox IB FDR interconnect]. Measurements are
reported for parallel performance.
• Comparison with a number of HPC systems based on Intel’s Ivy Bridge
(E5-2697v2 , E5-2680v2 and E5-2650v2) and Sandy Bridge-based (E5-
2690 and E5-2670) Clusters.
¤ Variety of parallel benchmarks - synthetic (IMB and HPCC) and 13
application codes: DLPOLY, Gromacs, NAMD, LAMMPS,
GAMESS, GAMESS-UK, NWChem, Quantum Espresso, ONETEP,
Siesta, CP2K, VASP and CASTEP.
¤ Performance metrics across a variety of data sets, featuring “core
to core” and “node to node” workload comparisons, with the
analysis using IPM and Allinea’s Performance Report.
Summary II.
• The enhanced performance of the Haswell clusters is at first sight rather
modest.
¤ A Core-to-Core comparison of the Intel Haswell e5-2697v3 2.6
GHz Truescale cluster across 21 data sets (and 13 applications)
suggests only modest speedups (< 1.2) in 18 of these comparisons
compared to the Ivy Bridge e5-2650v2 2.6 GHz True Scale
cluster. Only the Quantum Espresso application shows speed-ups >
1.4.
¤ A Node-to-Node comparison, typical of the performance when
running a workload, is more encouraging:
¤ 6-node Intel Haswell e5-2697v3 2.6 GHz benchmarks show a
performance enhancement by factors of
• 1.63 × the Ivy Bridge e5-2650v2 2.6 GHz cluster
• 2.33 × the Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz
cluster.
• Valuable insight provided by Allinea’s Performance Report