
Page 1: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Performance Evaluation,

Scalability Analysis, and

Optimization Tuning of Altair HyperWorks

on a Modern HPC Compute Cluster

Pak Lui

[email protected]

May 7, 2015

Page 2: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

Agenda

• Introduction to HPC Advisory Council

• Benchmark Configuration

• Performance Benchmark Testing and Results

• MPI Profiling

• Summary

• Q&A / For More Information

Page 3: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• World-wide HPC organization (400+ members)

• Bridges the gap between HPC usage and its full potential

• Provides best practices and a support/development center

• Explores future technologies and future developments

• Working Groups – HPC|Cloud, HPC|Scale, HPC|GPU, HPC|Storage

• Leading edge solutions and technology demonstrations

The HPC Advisory Council

Page 4: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

HPC Advisory Council Members

Page 5: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• HPC Advisory Council Chairman

Gilad Shainer - [email protected]

• HPC Advisory Council Media Relations and Events Director

Brian Sparks - [email protected]

• HPC Advisory Council China Events Manager

Blade Meng - [email protected]

• Director of the HPC Advisory Council, Asia

Tong Liu - [email protected]

• HPC Advisory Council HPC|Works SIG Chair

and Cluster Center Manager

Pak Lui - [email protected]

• HPC Advisory Council Director of Educational Outreach

Scot Schultz – [email protected]

• HPC Advisory Council Programming Advisor

Tarick Bedeir - [email protected]

• HPC Advisory Council HPC|Scale SIG Chair

Richard Graham – [email protected]

• HPC Advisory Council HPC|Cloud SIG Chair

William Lu – [email protected]

• HPC Advisory Council HPC|GPU SIG Chair

Sadaf Alam – [email protected]

• HPC Advisory Council India Outreach

Goldi Misra – [email protected]

• Director of the HPC Advisory Council Switzerland Center of

Excellence and HPC|Storage SIG Chair

Hussein Harake – [email protected]

• HPC Advisory Council Workshop Program Director

Eric Lantz – [email protected]

• HPC Advisory Council Research Steering Committee

Director

Cydney Stevens - [email protected]

HPC Council Board

Page 6: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

HPC Advisory Council HPC CenterInfiniBand-based Storage (Lustre) Juniper Heimdall

Plutus Janus Athena

VestaThor Mala

Lustre FS 640 cores

456 cores 192 cores

704 cores 896 cores

280 cores

16 GPUs

80 cores

Page 7: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• HPC|Scale

• To explore usage of commodity HPC as a replacement for multi-million dollar mainframes and proprietary

based

supercomputers with networks and clusters of microcomputers acting in unison to deliver high-end computing

services.

• HPC|Cloud

• To explore usage of HPC components as part of the creation of external/public/internal/private cloud

computing environments.

• HPC|Works

• To provide best practices for building balanced and scalable HPC systems, performance tuning and

application guidelines.

• HPC|Storage

• To demonstrate how to build high-performance storage solutions and their affect on application performance

and productivity. One of the main interests of the HPC|Storage subgroup is to explore Lustre based solutions,

and to expose more users to the potential of Lustre over high-speed networks.

• HPC|GPU

• To explore usage models of GPU components as part of next generation compute environments and potential

optimizations for GPU based computing.

• HPC|FSI

• To explore the usage of high-performance computing solutions for low latency trading,

more productive simulations (such as Monte Carlo) and overall more efficient financial services.

Special Interest Subgroups Missions

Page 8: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• HPC Advisory Council (HPCAC)

• 400+ members

• http://www.hpcadvisorycouncil.com/

• Application best practices, case studies (Over 150)

• Benchmarking center with remote access for users

• World-wide workshops

• Value add for your customers to stay up to date

and in tune to HPC market

• 2015 Workshops

• USA (Stanford University) – February 2015

• Switzerland – March 2015

• Brazil – August 2015

• Spain – September 2015

• China (HPC China) – Oct 2015

• For more information

• www.hpcadvisorycouncil.com

[email protected]

HPC Advisory Council

Page 9: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• University-based teams to compete and demonstrate the incredible

capabilities of state-of- the-art HPC systems and applications on the

2015 ISC High Performance Conference show-floor

• The Student Cluster Competition is designed to introduce the next

generation of students to the high performance computing world and

community

2015 ISC High Performance Conference

– Student Cluster Competition

Page 10: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Research performed under the HPC Advisory Council activities

• Participating vendors: Intel, Dell, Mellanox

• Compute resource - HPC Advisory Council Cluster Center

• Objectives

• Give overview of RADIOSS performance

• Compare different MPI libraries, network interconnects and others

• Understand RADIOSS communication patterns

• Provide best practices to increase RADIOSS productivity

RADIOSS Performance Study

Page 11: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Compute-intensive simulation software for Manufacturing

• For 20+ years an established standard for automotive crash and impact

• Differentiated by its high scalability, quality and robustness

• Supports multiphysics simulation and advanced materials

• Used across all industries to improve safety and manufacturability

• Companies use RADIOSS to simulate real-world scenarios (crash tests,

climate effects, etc.) to test the performance of a product

About RADIOSS

Page 12: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Dell™ PowerEdge™ R730 32-node (896-core) “Thor” cluster

• Dual-Socket 14-core Intel E5-2697v3 @ 2.60 GHz CPUs (Static max Perf in BIOS)

• OS: RHEL 6.5, OFED MLNX_OFED_LINUX-2.4-1.0.5 InfiniBand SW stack

• Memory: 64GB memory, DDR3 2133 MHz

• Hard Drives: 1TB 7.2 RPM SATA 2.5”

• Mellanox Switch-IB SB7700 100Gb/s InfiniBand VPI switch

• Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters

• Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters

• Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand VPI switch

• MPI: Intel MPI 5.0.2, Mellanox HPC-X v1.2.0

• Application: Altair RADIOSS 13.0

• Benchmark datasets:

• Neon benchmarks: 1 million elements (8ms, Double Precision), unless otherwise stated

Test Cluster Configuration

Page 13: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Intel® Cluster Ready systems make it practical to use a cluster to

increase your simulation and modeling productivity

• Simplifies selection, deployment, and operation of a cluster

• A single architecture platform supported by many OEMs, ISVs, cluster

provisioning vendors, and interconnect providers

• Focus on your work productivity, spend less management time on the cluster

• Select Intel Cluster Ready

• Where the cluster is delivered ready to run

• Hardware and software are integrated and configured together

• Applications are registered, validating execution on the Intel Cluster Ready

architecture

• Includes Intel® Cluster Checker tool, to verify functionality and periodically

check cluster health

• RADIOSS is Intel Cluster Ready

About Intel® Cluster Ready

Page 14: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Performance and efficiency

• Intelligent hardware-driven systems management

with extensive power management features

• Innovative tools including automation for

parts replacement and lifecycle manageability

• Broad choice of networking technologies from GbE to IB

• Built in redundancy with hot plug and swappable PSU, HDDs and fans

• Benefits

• Designed for performance workloads

• from big data analytics, distributed storage or distributed computing

where local storage is key to classic HPC and large scale hosting environments

• High performance scale-out compute and low cost dense storage in one package

• Hardware Capabilities

• Flexible compute platform with dense storage capacity

• 2S/2U server, 6 PCIe slots

• Large memory footprint (Up to 768GB / 24 DIMMs)

• High I/O performance and optional storage configurations

• HDD options: 12 x 3.5” - or - 24 x 2.5 + 2x 2.5 HDDs in rear of server

• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch

PowerEdge R730Massive flexibility for data intensive operations

Page 15: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

RADIOSS Performance – Interconnect (MPP)

28 Processes/Node

4.8x

Intel MPI

• EDR InfiniBand provides better scalability performance than Ethernet

• 70 times better performance than 1GbE at 16 nodes / 448 cores

• 4.8x better performance than 10GbE at 16 nodes / cores

• Ethernet solutions does not scale beyond 4 nodes with pure MPI

Higher is better

70x
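To make the comparison concrete, the sketch below shows one way such interconnect runs might be launched with Intel MPI, selecting the fabric through I_MPI_FABRICS. The fabric values follow Intel MPI 5.x conventions; the hostfile name, node/core counts and the $RADIOSS_ENGINE / $ENGINE_INPUT placeholders are illustrative assumptions, not the exact command lines used in this study.

    #!/bin/bash
    # Illustrative pure-MPP launches of the Neon benchmark over different fabrics.
    # $RADIOSS_ENGINE and $ENGINE_INPUT are placeholders for the solver binary
    # and input deck; "hosts" lists the 16 compute nodes.
    NODES=16
    PPN=28
    NP=$((NODES * PPN))

    # EDR InfiniBand (DAPL provider):
    mpirun -np $NP -ppn $PPN -f hosts \
           -genv I_MPI_FABRICS shm:dapl \
           $RADIOSS_ENGINE $ENGINE_INPUT

    # Same job over TCP for the 1GbE / 10GbE baselines:
    mpirun -np $NP -ppn $PPN -f hosts \
           -genv I_MPI_FABRICS shm:tcp \
           $RADIOSS_ENGINE $ENGINE_INPUT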

Page 16: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

RADIOSS Performance – Interconnect (MPP)

28 Processes/Node

25%28%

Intel MPI

• EDR InfiniBand provides better scalability performance

• EDR InfiniBand improves over QDR IB by 28% at 16 nodes / 488 cores

• Similarly, EDR InfiniBand outperforms FDR InfiniBand by 25% at 16 nodes

Higher is better

Page 17: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Running more cores per node generally improves overall performance

• Seen improvement of 18% from 20 to 28 cores per node at 8 nodes

• Improvement seems not as consistent at higher node counts

RADIOSS Performance – CPU Cores

6%

Intel MPIHigher is better

18%
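A minimal sketch of how the per-node core count might be swept with Intel MPI's -ppn option follows; the 8-node allocation, hostfile name and $RADIOSS_ENGINE / $ENGINE_INPUT placeholders are assumptions for illustration.

    #!/bin/bash
    # Illustrative sweep of MPI ranks per node (20 vs. 28 cores used per node).
    NODES=8
    for PPN in 20 28; do
        echo "=== $PPN processes per node ==="
        mpirun -np $((NODES * PPN)) -ppn $PPN -f hosts \
               $RADIOSS_ENGINE $ENGINE_INPUT
    done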

Page 18: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Increasing simulation time can increase the run time

• Increasing a 8ms simulation to 80ms can result in much longer runtime

• 10x longer simulation run can result in a 14x in the runtime

RADIOSS Performance – Simulation Time

Intel MPIHigher is better

14x

14x

13x

Page 19: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Tuning MPI collective algorithm can improve performance

• MPI profile shows about 20% of runtime spent on MPI_Allreduce communications

• Default algorithm in Intel MPI is Recursive Doubling

• The default algorithm is the best among all tested for MPP

RADIOSS Performance – Intel MPI Tuning for MPP

8 Threads/MPI proc

Intel MPI

Higher is better
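A sweep over the Intel MPI MPI_Allreduce algorithm selection might look like the sketch below. I_MPI_ADJUST_ALLREDUCE is the documented Intel MPI control for this; the range of algorithm numbers swept, the rank counts and the $RADIOSS_ENGINE / $ENGINE_INPUT placeholders are assumptions.

    #!/bin/bash
    # Illustrative sweep of MPI_Allreduce algorithms for the pure-MPP case.
    # Leaving I_MPI_ADJUST_ALLREDUCE unset keeps Intel MPI's default choice.
    for ALG in 1 2 3 4 5; do
        echo "=== I_MPI_ADJUST_ALLREDUCE=$ALG ==="
        mpirun -np 448 -ppn 28 -f hosts \
               -genv I_MPI_ADJUST_ALLREDUCE $ALG \
               $RADIOSS_ENGINE $ENGINE_INPUT
    done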

Page 20: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Highly parallel code

• Multi-level parallelization

• Domain decomposition MPI parallelization

• Multithreading OpenMP

• Enhanced performance

• Best scalability in the marketplace

• High efficiency on large HPC clusters

• Unique, proven method for rich scalability over thousands of cores for FEA

• Flexibility -- easy tuning of MPI & OpenMP

• Robustness -- parallel arithmetic allows perfect repeatability in parallel

RADIOSS Hybrid MPP Parallelization

Page 21: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Enabling Hybrid MPP mode unlocks the RADIOSS scalability

• At larger scale, productivity improves as more threads involves

• As more threads involved, amount of communications by processes are reduced

• At 32 nodes (or 896 cores), the best configuration is 2 PPN with 14 threads each

• The following environment setting and tuned flags are used:

• Intel MPI flags: I_MPI_PIN_DOMAIN auto

• I_MPI_ADJUST_BCAST=1

• I_MPI_ADJUST_ALLREDUCE=5

• KMP_AFFINITY=compact

• KMP_STACKSIZE=400m

• User: ulimit -s unlimited

RADIOSS Performance – Hybrid MPP version

EDR InfiniBand

Intel MPI Higher is better

3.7x

32%70%
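Putting the settings above together, a Hybrid MPP launch of the 32-node / 896-core configuration (2 ranks per node, 14 OpenMP threads each) might look like the following sketch. The hostfile name and the $RADIOSS_ENGINE / $ENGINE_INPUT placeholders are assumptions, and OMP_NUM_THREADS=14 is added here to fix the thread count even though it is not explicitly listed on the slide; the remaining variables are the ones listed above.

    #!/bin/bash
    # Illustrative Hybrid MPP launch: 32 nodes x 2 MPI ranks x 14 OpenMP threads.
    ulimit -s unlimited    # unlimited stack, as recommended above
                           # (remote ranks may also need the limit raised,
                           #  e.g. via limits.conf on the compute nodes)

    mpirun -np 64 -ppn 2 -f hosts \
           -genv I_MPI_PIN_DOMAIN auto \
           -genv I_MPI_ADJUST_BCAST 1 \
           -genv I_MPI_ADJUST_ALLREDUCE 5 \
           -genv OMP_NUM_THREADS 14 \
           -genv KMP_AFFINITY compact \
           -genv KMP_STACKSIZE 400m \
           $RADIOSS_ENGINE $ENGINE_INPUT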

Page 22: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Single precision job runs faster double precision

• SP provides 47% speedup than DP

• Similar scalability is seen for double precision tests

RADIOSS Performance – Floating Point Precision

47%

Intel MPI

Higher is better 2 PPN / 14 OpenMP

Page 23: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Increasing CPU core frequency enables higher job efficiency

• 18% of performance jump from 2.3GHz to 2.6GHz (13% increase in clock speed)

• 29% of performance jump from 2.0GHz to 2.6GHz (30% increase in clock speed)

• Increase in performance gain exceeds the increase CPU frequencies

• CPU bound application see higher benefit of using CPU with higher frequencies

RADIOSS Performance – CPU Frequency

29%

Intel MPI

Higher is better 2 PPN / 14 OpenMP

18%

Page 24: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• RADIOSS utilizes point-to-point communications in most data transfers

• The most time MPI consuming calls is MPI_Waitany() and MPI_Wait()

• MPI_Recv(55%), MPI_Waitany(23%), MPI_Allreduce(13%)

RADIOSS Profiling – % Time Spent on MPI

16 Processes/NodePure MPP
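One way to gather this kind of breakdown is Intel MPI's built-in statistics gathering, sketched below. I_MPI_STATS is a documented Intel MPI control; the "ipm" summary format, the output file name and the $RADIOSS_ENGINE / $ENGINE_INPUT placeholders are stated as assumptions rather than the exact profiling setup used for these charts.

    #!/bin/bash
    # Illustrative collection of MPI time/call statistics with Intel MPI.
    mpirun -np 448 -ppn 28 -f hosts \
           -genv I_MPI_STATS ipm \
           $RADIOSS_ENGINE $ENGINE_INPUT
    # The IPM-style summary (typically written to stats.ipm) reports wall time
    # and call counts per MPI routine, e.g. MPI_Recv, MPI_Waitany, MPI_Allreduce.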

Page 25: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• For Hybrid MPP DP, tuning MPI_Allreduce shows more gain than MPP

• For DAPL provider, Binomial gather+scatter #5 improved perf by 27% over default

• For OFA provider, tuned MPI_Allreduce algorithm improves by 44% over default

• Both OFA and DAPL improved by tuning I_MPI_ADJUST_ALLREDUCE=5

RADIOSS Performance – Intel MPI Tuning (DP)

27%

Intel MPI

Higher is better 2 PPN / 14 OpenMP

44%
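The provider comparison could be driven by switching the Intel MPI fabric while keeping the tuned collective setting, as in the sketch below. The shm:dapl and shm:ofa values are the Intel MPI 5.x fabric names for the DAPL and OFA providers; rank counts, hostfile and the $RADIOSS_ENGINE / $ENGINE_INPUT placeholders are again assumptions.

    #!/bin/bash
    # Illustrative Hybrid MPP DP runs over the DAPL and OFA providers,
    # both with the tuned MPI_Allreduce algorithm (#5).
    for FABRIC in shm:dapl shm:ofa; do
        echo "=== $FABRIC with I_MPI_ADJUST_ALLREDUCE=5 ==="
        mpirun -np 64 -ppn 2 -f hosts \
               -genv I_MPI_FABRICS $FABRIC \
               -genv I_MPI_ADJUST_ALLREDUCE 5 \
               -genv OMP_NUM_THREADS 14 \
               $RADIOSS_ENGINE $ENGINE_INPUT
    done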

Page 26: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• The most time consuming MPI communications are:

• MPI_Recv: Messages concentrated at 640B, 1KB, 320B, 1280B

• MPI_Waitany: Messages are: 48B, 8B, 384B

• MPI_Allreduce: Most message sizes appears at 80B

RADIOSS Profiling – MPI Message Sizes

28 Processes/NodePure MPP

Page 27: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• EDR InfiniBand provides better scalability performance than Ethernet

• 214% better performance than 1GbE at 16 nodes

• 104% better performance than 10GbE at 16 nodes

• InfiniBand typically outperforms other interconnect in collective operations

RADIOSS Performance – Interconnect (HMPP)

214%

104%

2 PPN / 14 OpenMP

Intel MPI

Higher is better

Page 28: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• EDR InfiniBand provides better scalability performance than FDR IB

• EDR IB outperforms FDR IB by 27% at 32 nodes

• Improvement for EDR InfiniBand occurs at high node count

RADIOSS Performance – Interconnect (HMPP)

27%

2 PPN / 14 OpenMP

Intel MPI

Higher is better

Page 29: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• RADIOSS 12.0 utilizes most non-blocking calls for communications

• MPI_Wait, MPI_Waitany, MPI_Irecv and MPI_Isend are almost used exclusively

• RADIOSS 13.0 appears to use the same calls for most part

• The MPI_Bcast calls seem to be replaced by MPI_Allreduce calls instead

RADIOSS Profiling – Number of MPI Calls

Pure MPP At 32 Nodes

RADIOSS 13.0RADIOSS 12.0

Page 30: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• Intel E5-2680v3 (Haswell) cluster outperforms prior generations

• Performs faster by 100% vs Jupiter, by 238% vs Janus at 16 nodes

• System components used:

• Thor: 2-socket Intel [email protected], 2133MHz DIMMs, EDR IB, v13.0

• Jupiter: 2-socket Intel [email protected], 1600MHz DIMMs, FDR IB, v12.0

• Janus: 2-socket Intel [email protected], 1333MHz DIMMs, QDR IB, v12.0

RADIOSS Performance – System Generations

Single Precision

100%

238%

Page 31: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• The memory required to run this workload is around 5GB per node

• Considered as a small workload but good enough to observe application behavior

RADIOSS Profiling – Memory Required

28 Processes/NodePure MPP

Page 32: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

• RADIOSS is designed to perform at large scale HPC environment

• Shows excellent scalability over 896 cores/32 nodes and beyond with Hybrid MPP

• Hybrid MPP version enhanced RADIOSS scalability

• 2 MPI processes per socket, 14 threads each

• Additional CPU cores generally accelerating time to solution performance

• Intel E5-2680v3 (Haswell) cluster outperforms prior generations

• Performs faster by 100% vs “Sandy Bridge”, by 238% vs “Westmere “ at 16 nodes

• Network and MPI Tuning

• EDR InfiniBand outperforms other Ethernet-based interconnects in scalability

• EDR InfiniBand delivers higher scalability performance than FDR and QDR IB

• Tuning environment parameters is important to maximize performance

• Tuning MPI collective ops helps RADIOSS to achieve even better scalability

RADIOSS – Summary

Page 33: "Performance Evaluation,  Scalability Analysis, and  Optimization Tuning of Altair HyperWorks on a Modern HPC Compute Cluster "

Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.

33

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and

completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein

Thank You

• Questions?

Pak Lui

[email protected]