Performance and Scalability of ... (HPC Advisory Council)

• The Hybrid version scales better than the pure MPI version

– The Hybrid version delivers better scalability beyond 4 nodes

– Each MPI process spawns OpenMP threads for computation on the CPU cores

– This streamlines and reduces the number of communication endpoints, which improves scalability

[Chart: Hybrid vs. pure MPI; Intel E5-2680 V2, FDR InfiniBand; annotated gain: 327%. Performance Rating = Jobs/Day]
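The Performance Rating used in these charts can be made concrete with a little arithmetic. The sketch below assumes that Jobs/Day is 86,400 seconds divided by the wall-clock time of one job, which the slides do not spell out, and that "X% better" means the relative gain of one rating over another. The function names and the example runtimes are hypothetical, not measurements from the study.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def jobs_per_day(elapsed_seconds: float) -> float:
    """Rating for one job, assuming Jobs/Day = seconds per day / runtime."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return SECONDS_PER_DAY / elapsed_seconds


def percent_gain(rating: float, baseline: float) -> float:
    """Relative gain of `rating` over `baseline`, in percent."""
    return (rating / baseline - 1.0) * 100.0


def speedup_from_gain(gain_percent: float) -> float:
    """Convert an 'X% better' figure into a plain speedup factor."""
    return 1.0 + gain_percent / 100.0


# Hypothetical runtimes (not from the study): a 1,000 s job vs. a 4,270 s job.
fast = jobs_per_day(1000.0)
slow = jobs_per_day(4270.0)
print(round(percent_gain(fast, slow)))  # 327
print(speedup_from_gain(327.0))
```

Read this way, a gain of 327% corresponds to a rating a little more than four times the baseline, not 3.27 times.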

FLOW-3D/MP Performance – Hybrid Mode

• The default for FLOW-3D/MP Hybrid mode is 1 PPN (process per node) with 16 threads

– This uses “-genv I_MPI_PIN_DOMAIN node”, as specified in the runhyd_par script

• For best performance on 2P (two-socket) platforms:

– Use “-genv I_MPI_PIN_DOMAIN socket” in the runhyd_par script

– 2 PPN with 8 or 10 threads each can outperform 1 PPN with 16 or 20 threads

– The flag lets each MPI process spawn its threads within its own socket

– Otherwise both MPI processes would share the same socket
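As a rough illustration of the socket-pinned layout described on this slide, the following sketch assembles an Intel MPI launch line for a run with one MPI process per socket. It assumes the Hydra launcher's `-n`, `-ppn`, and `-genv` options and uses `OMP_NUM_THREADS` to set the thread count. The solver path `./solver_binary` is a placeholder, and the actual contents of runhyd_par are not shown in the slides, so this approximates the idea and is not the script's real command.

```python
import shlex


def hybrid_launch_command(nodes, sockets_per_node, cores_per_socket,
                          solver="./solver_binary"):
    """Build an mpirun argument list with one MPI rank per socket.

    `solver` is a placeholder; the real executable is whatever
    runhyd_par launches.
    """
    ppn = sockets_per_node        # one rank per socket (2 PPN on 2P nodes)
    threads = cores_per_socket    # OpenMP threads stay inside that socket
    return [
        "mpirun",
        "-n", str(nodes * ppn),   # total MPI ranks
        "-ppn", str(ppn),         # ranks per node
        "-genv", "I_MPI_PIN_DOMAIN", "socket",
        "-genv", "OMP_NUM_THREADS", str(threads),
        solver,
    ]


# 16 two-socket nodes with 10-core sockets: 2 PPN of 10 threads.
cmd = hybrid_launch_command(nodes=16, sockets_per_node=2, cores_per_socket=10)
print(shlex.join(cmd))
print("cores used:", 16 * 2 * 10)  # 320
```

Calling the same function with `cores_per_socket=8` gives the 2 PPN of 8 threads layout for the 8-core parts.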

[Chart: Hybrid mode performance; FLOW-3D/MP 5.0, FDR InfiniBand; annotated gains: 110%, 178%]

FLOW-3D/MP Performance – Hybrid Mode

• 2 PPN with 8 threads each can outperform 1 PPN with 16 threads

– Each MPI process spawns its threads within a single socket

– This requires “I_MPI_PIN_DOMAIN=socket” to be specified in the runhyd_par script

• The default for FLOW-3D/MP Hybrid mode is 1 PPN with 16 threads

– This uses “I_MPI_PIN_DOMAIN=node”, as specified in the runhyd_par script

• The flag is changed to “socket” so that threads are spawned within a socket

– This applies to the 2 PPN with 8 threads case

[Chart: Hybrid mode performance; FLOW-3D/MP 4.2, FDR InfiniBand; annotated gains: 9%, 35%]

FLOW-3D/MP Performance – Processors

• The Intel E5-2680 (Sandy Bridge) cluster outperforms the Intel Xeon X5670 (Westmere) cluster

– Performs 70% better than X5670 cluster at 16 nodes

• System components used:

– Sandy Bridge: 2-socket Intel E5-2680 @ 2.7GHz, 1600MHz DIMMs, FDR IB, 24 disks

– Westmere: 2-socket Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR IB, 1 disk

[Chart: Sandy Bridge vs. Westmere; FLOW-3D/MP 4.2; label: FDR InfiniBand; annotated gain: 70%]

FLOW-3D/MP Performance – Processors

• The Intel E5-2680 v2 (Ivy Bridge) cluster outperforms the Intel E5-2680 (Sandy Bridge) cluster

– Performs 8% better than the E5-2680 cluster at 16 nodes

• System components used:

– Sandy Bridge: 2-socket 8-core Intel E5-2680 @ 2.7GHz

– Ivy Bridge: 2-socket 10-core Intel E5-2680 V2 @ 2.8GHz

[Chart: Ivy Bridge vs. Sandy Bridge; FLOW-3D/MP 5.0, FDR InfiniBand; annotated gains: 8%, 5%]

FLOW-3D/MP Performance – Software Versions

• FLOW-3D/MP v5.0 scales better than v4.2 in Hybrid mode

– Up to 37% faster in Hybrid mode at 16 nodes

• In pure MPI mode, v5.0 shows a slightly longer runtime than the previous version

– This appears to be caused by a change in the communication algorithm

– More MPI collective operations are used than in the prior version

[Chart: v5.0 vs. v4.2; label: 16 MPI Processes/Node; annotated gain: 37%]

FLOW-3D/MP Performance – Network

• InfiniBand FDR scales better than Ethernet

– The scalability gap widens as more nodes are involved in the simulation

– FDR InfiniBand provides up to 246% better performance than 1GbE

– FDR InfiniBand delivers up to 39% better performance than 10GbE

– Hybrid mode is shown

[Chart: FDR InfiniBand vs. 10GbE and 1GbE; label: 16 MPI Processes/Node; annotated gains: 39% (over 10GbE), 246% (over 1GbE)]

FLOW-3D/MP Profiling – Time Ratio

• InfiniBand FDR reduces the communication time at scale

– InfiniBand FDR consumes about 17% of total runtime on 16-node Hybrid job

– 10GbE consumes 39% of total time, while 1GbE consumes about 75%

• InfiniBand RDMA technology allows communication to bypass the CPU

– This reduces CPU overhead in handling communication

– It also leaves more time for application processing
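The communication shares quoted above imply how much of each run remains for computation. A small sketch of that arithmetic follows, using only the percentages from this slide; the dictionary and function names are illustrative.

```python
# Share of total runtime spent in communication on the 16-node Hybrid job,
# as quoted on the slide (approximate values).
COMM_SHARE = {"FDR InfiniBand": 0.17, "10GbE": 0.39, "1GbE": 0.75}


def compute_share(comm_share: float) -> float:
    """Fraction of runtime left for application processing."""
    return 1.0 - comm_share


for network, share in COMM_SHARE.items():
    print(f"{network}: {compute_share(share):.0%} of runtime left for computation")
```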


FLOW-3D/MP Profiling – # of MPI Calls

• Overall runtime decreases as more nodes take part in the MPI job

– More compute nodes reduce runtime by spreading out the workload

• Computation time drops while communication time stays flat

– As the cluster scales, MPI time stays at roughly the same level
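One way to picture this behavior is a simple model in which computation divides evenly across nodes while MPI time is a fixed cost. This is an idealization for illustration, not the profiled data; the 1,600 s and 100 s inputs are made up.

```python
def modeled_runtime(nodes, compute_seconds_one_node, mpi_seconds):
    """Runtime assuming only the computation part shrinks with node count."""
    return compute_seconds_one_node / nodes + mpi_seconds


def mpi_fraction(nodes, compute_seconds_one_node, mpi_seconds):
    """Share of the modeled runtime spent in MPI."""
    total = modeled_runtime(nodes, compute_seconds_one_node, mpi_seconds)
    return mpi_seconds / total


# Hypothetical inputs: 1,600 s of computation on one node, 100 s of MPI time.
for n in (1, 2, 4, 8, 16):
    total = modeled_runtime(n, 1600.0, 100.0)
    share = mpi_fraction(n, 1600.0, 100.0)
    print(f"{n:2d} nodes: {total:7.1f} s total, {share:.0%} in MPI")
```

In this model the runtime keeps falling, but the flat MPI term becomes a growing share of the total, which is why communication efficiency matters more at higher node counts.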

[Chart labels: 16 MPI Processes/Node, Pure MPI Mode]

FLOW-3D/MP Profiling – Communication Time

• The most time-consuming MPI functions are:

– FDR: MPI_Allreduce (51%), MPI_Bcast (24%), MPI_Waitall (16%)

• InfiniBand spends less time in collective operations than Ethernet

– Collective communications are the most heavily used in FLOW-3D/MP v5.0

– They account for the largest share of communication time


FLOW-3D/MP Profiling – # of MPI Calls

• A wide range of message sizes is seen:

– MPI_Allreduce: concentrated between 4B and 16B

– MPI_Waitall: around 4MB

– MPI_Bcast: around 1-4MB
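To see why the very small MPI_Allreduce messages make the application latency sensitive, a common first-order cost model is time = latency + size / bandwidth. The sketch below applies it with round link parameters that are assumptions chosen for illustration, not figures from this study.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order transfer cost: fixed latency plus serialization time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_share(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the modeled transfer time that is pure latency."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Assumed link for illustration only: 1 microsecond latency, 5 GB/s bandwidth.
LATENCY_S = 1e-6
BANDWIDTH = 5e9

SIZES = (
    ("8 B (Allreduce-sized)", 8),
    ("4 MB (Waitall/Bcast-sized)", 4 * 1024 * 1024),
)
for label, size in SIZES:
    share = latency_share(size, LATENCY_S, BANDWIDTH)
    print(f"{label}: {share:.1%} of modeled time is latency")
```

Under these assumptions the 8-byte message is almost entirely latency, while the 4MB message is almost entirely bandwidth, so a workload dominated by small collectives benefits most from a low-latency interconnect.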


FLOW-3D/MP Performance – File System

• Storing data files on a local file system or tmpfs improves performance

– NFS limits scalability beyond 8 nodes

– The NFS used in this case runs over a 1GbE network

[Chart: file system comparison, higher is better; InfiniBand FDR; annotated gain: 25%]

FLOW-3D/MP Profiling – Disk IO Time

• File I/O occurs during certain periods of the MPI solver run

– Large spikes appear when the restart and spatial data are written

– Files are written to local storage instead of NFS to avoid an I/O bottleneck
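A minimal sketch of the idea, assuming the job can be pointed at an arbitrary working directory: pick the first writable node-local location for solver output and fall back to the system temporary directory. The candidate paths (`/dev/shm`, `/tmp`) are common Linux defaults, and the function name and the `flow3d_run_` prefix are hypothetical. How FLOW-3D/MP itself is configured to write there is not covered in the slides.

```python
import os
import tempfile


def pick_local_scratch(candidates=("/dev/shm", "/tmp")):
    """Return a fresh working directory under the first writable candidate.

    /dev/shm is a tmpfs mount on most Linux systems and /tmp is usually
    local disk. Both are assumptions about the node, not facts from the study.
    """
    for base in candidates:
        if os.path.isdir(base) and os.access(base, os.W_OK):
            return tempfile.mkdtemp(prefix="flow3d_run_", dir=base)
    # Fall back to whatever the system considers its temp directory.
    return tempfile.mkdtemp(prefix="flow3d_run_")


workdir = pick_local_scratch()
print("writing restart and spatial data under", workdir)
```

Results that need to persist would still have to be copied back to shared storage after the run, since tmpfs and node-local disks are not visible to other nodes.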


FLOW-3D/MP – Summary

• Scalability

– FLOW-3D/MP v5.0 Hybrid mode scales better than the pure MPI version

• The Hybrid version scales well to 16 nodes (320 cores) and is more than 37% faster than v4.2

• Performance

– Intel Ivy Bridge-EP and FDR InfiniBand enable FLOW-3D/MP to scale to 320 cores

– Allocating each MPI process to its own socket in Hybrid mode raises performance by 178%

– At 16 nodes, Hybrid mode delivers up to 327% higher performance than pure MPI mode

• Network

– InfiniBand FDR, with its 56Gbps rate, delivers the best scalability

• Outperforms 1GbE by 246% at 16 nodes (320 cores)

• Outperforms 10GbE by 39% at 16 nodes (320 cores)

– RDMA technology in InfiniBand allows network transfers to bypass the CPU

• This offload reduces the CPU overhead of handling communication, so the CPU can focus on the application

• Profiling

– MPI communication time is spent mostly in MPI_Allreduce, at 51% of overall MPI time

– InfiniBand can process collective operations in the network faster than Ethernet

– Messages are heavily concentrated at small sizes, typical of latency-sensitive HPC applications


All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.

Thank You

• Special thanks to Anup Gokarn of Flow Science

• Questions?

– Pak Lui

– pak@hpcadvisorycouncil.com

