LAMMPS Performance Benchmark and Profiling November 2020



LAMMPS

Performance Benchmark and Profiling

November 2020


Note

• The following research was performed under the HPC Advisory Council activities

– HPC-AI AC – Iris cluster

– Dell – Zenith cluster

• The following was done to provide best practices

– LAMMPS performance overview over Intel based platforms

– Understanding LAMMPS MPI communication patterns

• More info on LAMMPS

– https://lammps.sandia.gov/


LAMMPS

• Large-scale Atomic/Molecular Massively Parallel Simulator

– Classical molecular dynamics code which can model:

– Atomic, Polymeric, Biological, Metallic, Granular, and coarse-grained systems

• LAMMPS-KOKKOS package contains

– Versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library

• LAMMPS runs efficiently in parallel using message-passing techniques

– Developed at Sandia National Laboratories

– An open-source code, distributed under the GNU General Public License (GPL)

• More information on LAMMPS can be found at the LAMMPS web site:

http://lammps.sandia.gov


Cluster Configuration

• HPC-AI AC Cluster Center – Iris cluster

– Dual Socket Intel Xeon Gold 6148 CPU @ 2.40GHz

– ConnectX-6 HDR100 InfiniBand

– Quantum Switch HDR InfiniBand

– Memory: 192GB DDR4 2666MHz RDIMMs per node

• Software

– OS: RHEL 7.8

– MLNX_OFED 4.9

– MPI: HPC-X 2.7.0

– LAMMPS: v10-29-2020

– Compiler: Intel 2020.4.304

• Dell Cluster Center – Zenith cluster

– Dual Socket Intel Xeon Gold 6248 CPU @ 2.50GHz

– ConnectX-6 HDR100 InfiniBand

– Quantum Switch HDR InfiniBand

– Memory: 192GB DDR4 2666MHz RDIMMs per node

• Software

– OS: RHEL 7.8

– MLNX_OFED 4.9

– MPI: HPC-X 2.7.0

– LAMMPS: v10-29-2020

– Compiler: Intel 2020.4.304


LAMMPS Inputs

• AF_lennard-jones_2.5

– Problem: https://lammps.sandia.gov/bench/in.lj.txt

– region: box block 0 200 0 200 0 200

– neigh_modify: delay 0 every 20 check no

– Iterations: 1000

• EAM

– Problem:

https://github.com/lammps/lammps/blob/master/bench/POTENTIALS/in.eam

– region: box block 0 200 0 200 0 200

– neigh_modify: delay 1 every 5 check yes

– Iterations: 1000

– thermo 100


• Tersoff

– Problem: https://lammps.sandia.gov/bench/in.tersoff.txt

– region: box block 0 200 0 200 0 200

– Iterations: 1000

• Gay-Berne

– Problem: https://lammps.sandia.gov/bench/in.gb.txt

– region: box block 0 320 0 320 0 320

– set type 1 mass 1.5

– set type 1 shape 1 1.5 2

– neigh_modify: delay 1 every 5 check yes

– Iterations: 1000

– thermo 100

• Rhodopsin

– Problem:

https://github.com/lammps/lammps/blob/master/bench/in.rhodo

– replicate: 1 1 1

– atom_modify map array

– Iterations: 1000

• SNAP

– Problem:

– region: box block 0 5 0 8 0 32

– Iterations: 1000
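The region sizes above determine how many atoms each benchmark simulates. A minimal sketch of that arithmetic follows, assuming each region is filled completely with one atom per lattice site. The lattice types (fcc for Lennard-Jones and EAM, diamond for Tersoff, simple cubic for Gay-Berne) are assumptions based on the stock LAMMPS bench inputs and are not stated on the slide; the helper name `atoms_in_block` is hypothetical.

```python
# Atoms per conventional unit cell for common lattice types.
ATOMS_PER_CELL = {"sc": 1, "bcc": 2, "fcc": 4, "diamond": 8}


def atoms_in_block(lattice, nx, ny, nz):
    """Atom count for a block of nx * ny * nz unit cells, fully filled."""
    return ATOMS_PER_CELL[lattice] * nx * ny * nz


# Region sizes come from the slide; lattice types are ASSUMED from the
# stock bench inputs and should be checked against the actual input files.
inputs = {
    "Lennard-Jones": ("fcc", 200, 200, 200),
    "EAM": ("fcc", 200, 200, 200),
    "Tersoff": ("diamond", 200, 200, 200),
    "Gay-Berne": ("sc", 320, 320, 320),
}

for name, (lattice, nx, ny, nz) in inputs.items():
    print(f"{name}: {atoms_in_block(lattice, nx, ny, nz):,} atoms")
```

Under these assumptions the enlarged regions correspond to tens of millions of atoms per run, which is consistent with the slide deck's emphasis on sizing the problem to the cluster.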


LAMMPS Performance – Scalability

Higher is better

[Figure: scalability chart; efficiency labels shown on the chart: 100%, 100%, 92%, 92%, 100%, 97%]

* Bigger problem size
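The percentages on this slide read as scaling efficiencies. A minimal sketch of how such a figure is commonly computed is given here, assuming strong scaling measured as a rate (for example timesteps per second) against a baseline run. The function name and the measurements in `runs` are hypothetical and are not taken from the slides.

```python
def scaling_efficiency(base_nodes, base_perf, nodes, perf):
    """Scaling efficiency of a run relative to a baseline run.

    perf values are rates where higher is better, e.g. timesteps/second.
    A result of 1.0 means ideal (linear) scaling.
    """
    speedup = perf / base_perf      # measured speedup over the baseline
    ideal = nodes / base_nodes      # speedup expected from linear scaling
    return speedup / ideal


# HYPOTHETICAL measurements: (node count, performance).
runs = [(1, 10.0), (2, 20.0), (4, 39.0), (8, 76.0), (16, 150.0), (32, 294.4)]

base_nodes, base_perf = runs[0]
for nodes, perf in runs:
    eff = scaling_efficiency(base_nodes, base_perf, nodes, perf)
    print(f"{nodes:>2} nodes: {eff:.0%} efficiency")
```

With these made-up numbers the 32-node run lands at 92% efficiency, the lowest value reported on the slide.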


LAMMPS Performance – AVX2/AVX512

Higher is better


LAMMPS Performance – CPU

Higher is better


LAMMPS MPI Profiles on 32 nodes

• Lennard-Jones 2.5: 30% MPI

• EAM: 15% MPI

• Gay-Berne: 12% MPI

• Rhodopsin: 14% MPI

• SNAP: 4% MPI

• Tersoff: 6% MPI
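One way to read these MPI shares is as a bound on what faster communication could buy. The sketch below uses the percentages from the slide and a simple assumption that is not made in the source: if all MPI time were eliminated and compute time stayed the same, the run would speed up by at most 1 / (1 - share). The function name is hypothetical.

```python
# MPI share of runtime on 32 nodes, as reported on the slide.
mpi_share = {
    "Lennard-Jones 2.5": 0.30,
    "EAM": 0.15,
    "Gay-Berne": 0.12,
    "Rhodopsin": 0.14,
    "SNAP": 0.04,
    "Tersoff": 0.06,
}


def max_speedup_without_mpi(share):
    """Upper bound on speedup if MPI time dropped to zero (ASSUMED model)."""
    return 1.0 / (1.0 - share)


# List inputs from most to least communication-bound.
for name, share in sorted(mpi_share.items(), key=lambda kv: kv[1], reverse=True):
    bound = max_speedup_without_mpi(share)
    print(f"{name}: {share:.0%} MPI, compute {1 - share:.0%}, bound {bound:.2f}x")
```

Under this model Lennard-Jones has the most to gain from interconnect or MPI tuning, while SNAP and Tersoff are almost entirely compute-bound.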



Summary

• LAMMPS scalability depends on the problem size, which should suit the CPU architecture and cluster size. With InfiniBand, scalability is above 92% for the demonstrated cases

• AVX512 helps five of the six input benchmarks, with up to 2x improvement

• Intel Xeon Gold 6248 @ 2.5GHz (40 cores per node) demonstrated up to 38% higher performance compared to Intel Xeon Gold 6148 @ 2.4GHz (40 cores per node)

• MPI profiles show up to 30% communication time, mostly in point-to-point and MPI_Allreduce operations. The Rhodopsin input also shows MPI_Alltoallv


All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC-AI Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.

Thank You