Performance and Scalability of FLOW-3D/MP 5.0 in HPC Cluster Environment
Pak Lui, Application Performance Manager
HPC Advisory Council
2
Agenda
• Introduction to HPC Advisory Council
• Benchmark Configuration
• Performance Benchmark Testing and Results
• Summary
• Q&A / For More Information
3
The HPC Advisory Council
• World-wide HPC organization (360+ members)
• Bridges the gap between HPC usage and its full potential
• Provides best practices and a support/development center
• Explores future technologies and future developments
• Working Groups – HPC|Cloud, HPC|Scale, HPC|GPU, HPC|Storage
• Leading edge solutions and technology demonstrations
4
HPC Advisory Council Members
5
HPC Council Board
• HPC Advisory Council Chairman
Gilad Shainer - gilad@hpcadvisorycouncil.com
• HPC Advisory Council Media Relations and Events
Director
Brian Sparks - brian@hpcadvisorycouncil.com
• HPC Advisory Council China Events Manager
Blade Meng - blade@hpcadvisorycouncil.com
• Director of the HPC Advisory Council, Asia
Tong Liu - tong@hpcadvisorycouncil.com
• HPC Advisory Council HPC|Works SIG Chair
and Cluster Center Manager
Pak Lui - pak@hpcadvisorycouncil.com
• HPC Advisory Council Director of Educational
Outreach
Scot Schultz – scot@hpcadvisorycouncil.com
• HPC Advisory Council Programming Advisor
Tarick Bedeir - Tarick@hpcadvisorycouncil.com
• HPC Advisory Council HPC|Scale SIG Chair
Richard Graham – richard@hpcadvisorycouncil.com
• HPC Advisory Council HPC|Cloud SIG Chair
William Lu – william@hpcadvisorycouncil.com
• HPC Advisory Council HPC|GPU SIG Chair
Sadaf Alam – sadaf@hpcadvisorycouncil.com
• HPC Advisory Council India Outreach
Goldi Misra – goldi@hpcadvisorycouncil.com
• Director of the HPC Advisory Council Switzerland
Center of Excellence and HPC|Storage SIG Chair
Hussein Harake – hussein@hpcadvisorycouncil.com
• HPC Advisory Council Workshop Program Director
Eric Lantz – eric@hpcadvisorycouncil.com
• HPC Advisory Council Research Steering Committee
Director
Cydney Stevens - cydney@hpcadvisorycouncil.com
6
HPC Advisory Council HPC Center
[Diagram: cluster center systems Juniper, Dodecas, Plutus, Janus, Athena, Vesta, Mercury and Mala, with InfiniBand-based Lustre storage. Sizes shown: 512, 456, 192, 704, 384, 256 and 64 cores, plus 16 GPUs.]
7
Special Interest Subgroups Missions
• HPC|Scale
– To explore usage of commodity HPC as a replacement for multi-million dollar mainframes and
proprietary based supercomputers with networks and clusters of microcomputers acting in unison to
deliver high-end computing services.
• HPC|Cloud
– To explore usage of HPC components as part of the creation of external/public/internal/private cloud
computing environments.
• HPC|Works
– To provide best practices for building balanced and scalable HPC systems, performance tuning and
application guidelines.
• HPC|Storage
– To demonstrate how to build high-performance storage solutions and their effect on application
performance and productivity. One of the main interests of the HPC|Storage subgroup is to explore
Lustre based solutions, and to expose more users to the potential of Lustre over high-speed
networks.
• HPC|GPU
– To explore usage models of GPU components as part of next generation compute environments
and potential optimizations for GPU based computing.
• HPC|FSI
– To explore the usage of high-performance computing solutions for low latency trading,
more productive simulations (such as Monte Carlo) and overall more efficient financial services.
8
HPC Advisory Council
• HPC Advisory Council (HPCAC)
– 360+ members
– http://www.hpcadvisorycouncil.com/
– Application best practices, case studies (Over 150)
– Benchmarking center with remote access for users
– World-wide workshops
– Value add for your customers to stay up to date
and in tune with the HPC market
• 2013 Workshops
– USA (Stanford University) – January 2013
– Switzerland – March 2013
– Germany (ISC’13) – June 2013
– Spain – Sep 2013
– China (HPC China) – Oct 2013
• For more information
– www.hpcadvisorycouncil.com
– info@hpcadvisorycouncil.com
9
ISC’14 – Student Cluster Competition
• University-based teams to compete and demonstrate the
incredible capabilities of state-of-the-art HPC systems and
applications on the ISC’14 show-floor
• The Student Cluster Competition is designed to introduce the next
generation of students to the high performance computing world
and community
10
FLOW-3D/MP Performance Study
• Research performed under the HPC Advisory Council
activities
– Participating vendors: Intel, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• Objectives
– Give overview of FLOW-3D/MP performance
– Compare different MPI libraries, network interconnects and
others
– Understand FLOW-3D/MP communication patterns
– Provide best practices to increase FLOW-3D/MP productivity
11
FLOW-3D/MP
• FLOW-3D/MP is a powerful and highly-accurate CFD software
– Provides engineers valuable insight into many physical flow processes
• FLOW-3D/MP is the ideal computational fluid dynamics software
– To use in the design phase as well as in improving production processes
– Provides special capabilities for accurately predicting free-surface flows
• FLOW-3D/MP is a standalone, all-inclusive CFD package
– Includes an integrated GUI that ties components together, from problem setup to post-processing
12
Test Cluster Configuration
• Dell™ PowerEdge™ R720xd 32-node “Jupiter” cluster
– 16-node Dual-Socket Eight-Core Intel E5-2680 @ 2.70 GHz CPUs
– 16-node Dual-Socket Ten-Core Intel E5-2680 V2 @ 2.80 GHz CPUs
– Memory: 64GB memory, DDR3 1600 MHz
– OS: RHEL 6.2, OFED 1.5.3-3.1.0 (for v4.2) and 2.0-3.0.0 (for v5.0) InfiniBand SW stack
– Hard Drives: 24x 250GB 7.2K RPM SATA 2.5” on RAID 0
• Mellanox Connect-IB FDR InfiniBand adapters and ConnectX-3 Ethernet adapters
• Mellanox SwitchX SX6036 InfiniBand switch
• MPI (vendor provided): Intel MPI 3.2.0.011 (v4.2) and 4.0.0.025 (v5.0)
• Application: FLOW-3D/MP 4.2 and 5.0
• Benchmarks:
– Lid Driven Cavity Flow - Completely filled with fluid (in this case air)
– P2 Engine Block - A casting application with free-surface (one-fluid simulation with sharp
interface tracking)
13
About Intel® Cluster Ready
• Intel® Cluster Ready systems make it practical to use a cluster to increase
your simulation and modeling productivity
– Simplifies selection, deployment, and operation of a cluster
• A single architecture platform supported by many OEMs, ISVs, cluster
provisioning vendors, and interconnect providers
– Focus on your work productivity, spend less management time on the cluster
• Select Intel Cluster Ready
– Where the cluster is delivered ready to run
– Hardware and software are integrated and configured together
– Applications are registered, validating execution on the Intel Cluster Ready
architecture
– Includes Intel® Cluster Checker tool, to verify functionality and periodically check
cluster health
• FLOW-3D/MP is Intel Cluster Ready certified
14
PowerEdge R720xd Massive flexibility for data intensive operations
• Performance and efficiency
– Intelligent hardware-driven systems management
with extensive power management features
– Innovative tools including automation for
parts replacement and lifecycle manageability
– Broad choice of networking technologies from 1GbE to IB
– Built in redundancy with hot plug and swappable PSU, HDDs and fans
• Benefits
– Designed for performance workloads
• from big data analytics, distributed storage or distributed computing where local storage is key,
to classic HPC and large-scale hosting environments
• High performance scale-out compute and low cost dense storage in one package
• Hardware Capabilities
– Flexible compute platform with dense storage capacity
• 2S/2U server, 6 PCIe slots
– Large memory footprint (Up to 768GB / 24 DIMMs)
– High I/O performance and optional storage configurations
• HDD options: 12 x 3.5” - or - 24 x 2.5” + 2 x 2.5” HDDs in rear of server
• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch
15
FLOW-3D/MP Performance – MPI vs Hybrid
• FLOW-3D/MP supports OpenMP Hybrid mode
– Up to 327% higher performance than MPI mode at 16 nodes
• Hybrid version enables higher scalability versus pure MPI version
– Hybrid version delivers better scalability after 4 nodes
– MPI processes spawn OpenMP threads for computation on CPU cores
– Streamlines and reduces communication endpoints to improve scalability
[Chart: MPI vs. Hybrid mode on Intel E5-2680 V2 with FDR InfiniBand; callout: 327%]
*Performance Rating = Jobs/Day
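The Jobs/Day rating can be sketched as the number of back-to-back runs that fit in 24 hours. The snippet below is a minimal illustration, assuming that "X% higher" means (Hybrid rating / MPI rating - 1) x 100; the two runtimes are hypothetical values chosen only to show the arithmetic, not measurements from this study.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def jobs_per_day(runtime_seconds: float) -> float:
    """Number of back-to-back runs of this length that fit in 24 hours."""
    if runtime_seconds <= 0:
        raise ValueError("runtime must be positive")
    return SECONDS_PER_DAY / runtime_seconds


def percent_higher(rating: float, baseline: float) -> float:
    """How much higher `rating` is than `baseline`, in percent."""
    return (rating / baseline - 1.0) * 100.0


# Hypothetical runtimes in seconds (assumptions for illustration only).
mpi_rating = jobs_per_day(4270.0)
hybrid_rating = jobs_per_day(1000.0)
gain = percent_higher(hybrid_rating, mpi_rating)
print(f"MPI: {mpi_rating:.1f} jobs/day, Hybrid: {hybrid_rating:.1f} jobs/day")
print(f"Hybrid is {gain:.0f}% higher")
```

Under this reading, a 327% higher rating corresponds to roughly 4.27 times as many jobs per day.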
16
FLOW-3D/MP Performance – Hybrid Mode
• Default for FLOW-3D/MP Hybrid mode is to run 1 PPN of 16 threads
– Uses “-genv I_MPI_PIN_DOMAIN node” as specified in the runhyd_par script
• For best performance on 2P platforms:
– Use “-genv I_MPI_PIN_DOMAIN socket” in the runhyd_par script
– 2 PPN of 8/10 threads can yield better performance than 1 PPN of 16/20 threads
– The flag allows each MPI process to spawn threads within its own socket, instead of both MPI processes sharing the same socket
[Chart: FLOW-3D/MP 5.0, FDR InfiniBand; callouts: 110%, 178%]
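The one-process-per-socket layout can be sketched as a small helper that derives PPN and thread counts from the node topology and assembles an Intel MPI style command line. This is an illustration only: the contents of the runhyd_par script are not shown in these slides, the solver name `./solver` is a placeholder, and setting OMP_NUM_THREADS through `-genv` is an assumption about how the thread count would be passed.

```python
def hybrid_layout(nodes: int, sockets_per_node: int, cores_per_socket: int) -> dict:
    """One MPI process per socket, one OpenMP thread per core in that socket."""
    ranks_per_node = sockets_per_node
    threads_per_rank = cores_per_socket
    return {
        "ppn": ranks_per_node,
        "threads": threads_per_rank,
        "total_ranks": nodes * ranks_per_node,
        "total_cores": nodes * ranks_per_node * threads_per_rank,
    }


def launch_command(nodes: int, sockets_per_node: int, cores_per_socket: int,
                   solver: str = "./solver") -> list:
    """Build an argument list; `solver` is a hypothetical executable name."""
    layout = hybrid_layout(nodes, sockets_per_node, cores_per_socket)
    return [
        "mpirun",
        "-genv", "I_MPI_PIN_DOMAIN", "socket",   # pin each rank to its own socket
        "-genv", "OMP_NUM_THREADS", str(layout["threads"]),
        "-ppn", str(layout["ppn"]),
        "-np", str(layout["total_ranks"]),
        solver,
    ]


# 16 dual-socket, ten-core nodes, as in the Ivy Bridge half of the test cluster.
layout = hybrid_layout(nodes=16, sockets_per_node=2, cores_per_socket=10)
command = launch_command(nodes=16, sockets_per_node=2, cores_per_socket=10)
print(layout)
print(" ".join(command))
```

For the eight-core Sandy Bridge nodes the same helper gives 2 PPN of 8 threads, matching the 8/10-thread cases named on the slide.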
17
FLOW-3D/MP Performance – Hybrid Mode
• 2 PPN of 8 threads can provide better performance than 1 PPN of 16 threads
– Each MPI process spawns its threads within a single socket
– Requires “I_MPI_PIN_DOMAIN=socket” to be specified in the runhyd_par script
• Default for FLOW-3D/MP hybrid is to run 1 PPN of 16 threads
– With “I_MPI_PIN_DOMAIN=node” specified in the runhyd_par script
• The flag is changed to “socket” so that threads are spawned within a socket
– For the case of 2 PPN of 8 threads
[Chart: FLOW-3D/MP 4.2, FDR InfiniBand; callouts: 9%, 35%]
18
FLOW-3D/MP Performance – Processors
• Intel E5-2680 (Sandy Bridge) cluster outperforms Intel Xeon X5670 (Westmere) cluster
– Performs 70% better than the X5670 cluster at 16 nodes
• System components used:
– Sandy Bridge: 2-socket Intel E5-2680 @ 2.7GHz, 1600MHz DIMMs, FDR IB, 24 disks
– Westmere: 2-socket Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR IB, 1 disk
[Chart: FLOW-3D/MP 4.2, FDR InfiniBand; callout: 70%]
19
FLOW-3D/MP Performance – Processors
• Intel E5-2680 v2 (Ivy Bridge) cluster outperforms the Intel E5-2680 (Sandy Bridge) cluster
– Performs 8% better than the E5-2680 cluster at 16 nodes
• System components used:
– Sandy Bridge: 2-socket 8-core Intel E5-2680 @ 2.7GHz
– Ivy Bridge: 2-socket 10-core Intel E5-2680 V2 @ 2.8GHz
[Chart: FLOW-3D/MP 5.0, FDR InfiniBand; callouts: 8%, 5%]
20
FLOW-3D/MP Performance – Software Versions
• FLOW-3D/MP v5.0 scales better than v4.2 in Hybrid mode
– Up to 37% faster in Hybrid mode at 16 nodes
• Running in MPI mode shows slightly longer runtime than previous version
– Appears to be caused by a change in communication algorithm
– More MPI collective operations are being used compared to prior version
[Chart: v4.2 vs. v5.0, 16 MPI processes/node; callout: 37%]
*Performance Rating = Jobs/Day
21
FLOW-3D/MP Performance – Network
• FDR InfiniBand scales better than Ethernet
– The scalability gap widens as more nodes are involved in the simulation
– FDR InfiniBand provides up to 246% better performance than 1GbE
– FDR InfiniBand delivers up to 39% better performance than 10GbE
– Hybrid mode is shown
[Chart: FDR InfiniBand vs. Ethernet, 16 MPI processes/node; callouts: 39%, 246%]
*Performance Rating = Jobs/Day
22
FLOW-3D/MP Profiling – Time Ratio
• FDR InfiniBand reduces the communication time at scale
– With FDR InfiniBand, communication takes about 17% of total runtime on a 16-node Hybrid job
– With 10GbE it takes 39% of total time, and with 1GbE about 75%
• IB RDMA technology allows communication to bypass CPU involvement
– Reduces CPU overhead in handling communication
– Which leaves more time for application processing
23
FLOW-3D/MP Profiling – # of MPI Calls
• Overall runtime decreases as more nodes take part in the MPI job
– More compute nodes reduce runtime by spreading out the workload
• Computation time drops while the communication time stays flat
– As the cluster scales, MPI time stays at roughly the same level
16 MPI Processes/Node Pure MPI Mode
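The pattern described here (computation shrinking with node count while MPI time stays flat) can be sketched with an idealized model. The model assumes perfectly divisible computation and a constant communication cost, and both constants are hypothetical values for illustration, not figures from the profile.

```python
def runtime(nodes: int, compute_one_node: float, comm_time: float) -> float:
    """Idealized runtime: computation divides evenly, MPI time stays constant."""
    return compute_one_node / nodes + comm_time


def comm_fraction(nodes: int, compute_one_node: float, comm_time: float) -> float:
    """Share of the total runtime spent in communication."""
    return comm_time / runtime(nodes, compute_one_node, comm_time)


# Hypothetical values in seconds (assumptions, not measured data).
COMPUTE_ONE_NODE = 1600.0
COMM_TIME = 100.0

for nodes in (1, 2, 4, 8, 16):
    total = runtime(nodes, COMPUTE_ONE_NODE, COMM_TIME)
    share = comm_fraction(nodes, COMPUTE_ONE_NODE, COMM_TIME)
    print(f"{nodes:2d} nodes: {total:7.1f} s total, {share:5.1%} in MPI")
```

In this model the total runtime keeps falling, but the MPI share of it grows with every doubling of nodes, which is why the interconnect matters more at scale.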
24
FLOW-3D/MP Profiling – Communication Time
• The most time-consuming MPI functions are:
– FDR: MPI_Allreduce(51%), MPI_Bcast(24%), MPI_Waitall(16%)
• InfiniBand spends less time in collective operations than Ethernet
– Collective communications are the most-used MPI operations in FLOW-3D/MP v5.0
– They account for the largest share of communication time
25
FLOW-3D/MP Profiling – # of MPI Calls
• There is a wide range of message sizes seen:
– MPI_Allreduce: Concentration between 4B and 16B
– MPI_Waitall: Around 4MB
– MPI_Bcast: Around 1-4MB
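Why the small MPI_Allreduce messages make the application latency sensitive can be sketched with a simple latency-plus-bandwidth cost model. The 56Gbps figure is the FDR rate quoted in the summary and is treated here as fully usable bandwidth, which is optimistic; the 1 microsecond per-message latency is a hypothetical value, not a measurement.

```python
FDR_GBPS = 56.0            # link rate quoted in the summary slide
ASSUMED_LATENCY_US = 1.0   # hypothetical per-message latency (assumption)


def transfer_time_us(message_bytes: int,
                     latency_us: float = ASSUMED_LATENCY_US,
                     bandwidth_gbps: float = FDR_GBPS) -> float:
    """Fixed latency plus time on the wire, in microseconds."""
    wire_time_us = message_bytes * 8 / (bandwidth_gbps * 1e3)
    return latency_us + wire_time_us


def latency_share(message_bytes: int,
                  latency_us: float = ASSUMED_LATENCY_US,
                  bandwidth_gbps: float = FDR_GBPS) -> float:
    """Fraction of the transfer time that is pure latency."""
    return latency_us / transfer_time_us(message_bytes, latency_us, bandwidth_gbps)


for label, size in (("16 B (MPI_Allreduce range)", 16),
                    ("4 MB (MPI_Waitall range)", 4 * 1024 * 1024)):
    print(f"{label}: {transfer_time_us(size):8.2f} us, "
          f"{latency_share(size):6.1%} latency")
```

Under these assumptions the 16-byte messages are almost entirely latency bound, while the megabyte-sized transfers are bandwidth bound.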
26
FLOW-3D/MP Performance – File System
• Storing data files on a local FS or tmpfs improves performance
– Scalability is limited by NFS when running beyond 8 nodes
– The NFS used in this case runs over a 1GbE network
[Chart: higher is better, FDR InfiniBand; callout: 25%]
27
FLOW-3D/MP Profiling – Disk IO Time
• File I/O occurs during certain periods of the MPI solver run
– Large spikes for writing the restart and spatial data
– Files are directed to write to local instead of NFS to avoid IO bottleneck
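Directing output to local storage can be sketched as a wrapper that lets the solver write into a node-local scratch directory and copies the results to shared storage once the run finishes. The slides only state that files are written locally; the copy-back step, the callback interface and the file names `restart.dat` and `spatial.dat` are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_local_scratch(write_outputs, shared_dir, scratch_root=None):
    """Let `write_outputs(path)` write locally, then copy the files to shared_dir."""
    shared = Path(shared_dir)
    shared.mkdir(parents=True, exist_ok=True)
    copied = []
    # scratch_root could point at a local disk or tmpfs mount (assumption).
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        scratch_path = Path(scratch)
        write_outputs(scratch_path)          # all solver I/O hits local storage
        for item in sorted(scratch_path.iterdir()):
            if item.is_file():
                shutil.copy2(item, shared / item.name)   # one bulk copy at the end
                copied.append(item.name)
    return copied


def fake_solver(out_dir: Path) -> None:
    """Stand-in for the solver; the file names are hypothetical."""
    (out_dir / "restart.dat").write_bytes(b"restart data")
    (out_dir / "spatial.dat").write_bytes(b"spatial data")


shared_dir = Path(tempfile.mkdtemp())   # stands in for an NFS-mounted directory
copied = run_with_local_scratch(fake_solver, shared_dir)
print(copied)
```

The design choice is to pay the shared-filesystem cost once after the run rather than during the I/O spikes inside the solver.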
28
FLOW-3D/MP – Summary
• Scalability
– FLOW-3D/MP v5.0 Hybrid mode enables higher scalability versus pure MPI version
• Hybrid version delivers good scalability to 16 nodes (320 cores); up to 37% faster than v4.2
• Performance
– Intel Ivy Bridge-EP and FDR InfiniBand enable FLOW-3D/MP to scale to 320 cores
– Pinning each MPI process to its own socket in Hybrid mode improves performance by up to 178%
– Hybrid mode allows FLOW-3D/MP to scale to 16 nodes, with up to 327% higher performance than MPI mode
• Network
– FDR InfiniBand provides the best scalability, with a 56Gbps rate
• Outperforms 1GbE by up to 246% at 16 nodes (320 cores)
• Outperforms 10GbE by up to 39% at 16 nodes (320 cores)
– RDMA technology in InfiniBand allows bypassing CPU for network transfer
• This offload reduces CPU overhead in handling communication; thus CPU can focus on application
• Profiling
– MPI Communication time is spent mostly on MPI_Allreduce at 51% of overall MPI time
– InfiniBand can process Collective Operations in network faster than Ethernet
– Large concentration on small messages, typical for latency sensitive HPC applications
29
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and
completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
Thank You
• Special thanks to Anup Gokarn of Flow Science
• Questions?
– Pak Lui
– pak@hpcadvisorycouncil.com