
  • Altair OptiStruct 13.0 Performance Benchmark and Profiling

    May 2015

  • 2

    Note

    The following research was performed under the HPC Advisory Council activities

    Participating vendors: Intel, Dell, Mellanox

    Compute resource - HPC Advisory Council Cluster Center

    The following was done to provide best practices:

    OptiStruct performance overview

    Understanding OptiStruct communication patterns

    Ways to increase OptiStruct productivity

    MPI libraries comparisons

    For more info please refer to:

    http://www.altair.com

    http://www.dell.com

    http://www.intel.com

    http://www.mellanox.com


  • 3

    Objectives

    The following was done to provide best practices:

    OptiStruct performance benchmarking

    Interconnect performance comparisons

    MPI performance comparison

    Understanding OptiStruct communication patterns

    The presented results will demonstrate:

    The scalability of the compute environment to provide nearly linear application scalability

    The capability of OptiStruct to achieve scalable productivity

  • 4

    OptiStruct by Altair

    Altair OptiStruct

    OptiStruct is an industry proven, modern structural analysis solver

    Solves linear and non-linear structural problems under static and dynamic loadings

    Market-leading solution for structural design and optimization

    Helps designers and engineers to analyze and optimize structures

    Optimize for strength, durability and NVH (Noise, Vibration, Harshness) characteristics

    Helps to rapidly develop innovative, lightweight and structurally efficient designs

    Based on finite-element and multi-body dynamics technology

  • 5

    Test Cluster Configuration

    Dell PowerEdge R730 32-node (896-core) Thor cluster

    Dual-Socket 14-core Intel E5-2697v3 @ 2.60 GHz CPUs (Turbo on, Early Snoop, Max Perf in BIOS)

    OS: RHEL 6.5, OFED MLNX_OFED_LINUX-2.4-1.0.5 InfiniBand SW stack

    Memory: 64GB DDR4 2133 MHz

    Hard Drives: 1TB 7.2K RPM SATA 2.5"

    Mellanox Switch-IB SB7700 100Gb/s InfiniBand VPI switch

    Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand VPI switch

    Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters

    Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters

    MPI: Intel MPI 5.0.2, Mellanox HPC-X v1.2.0

    Application: Altair OptiStruct 13.0

    Benchmark datasets:

    Engine Assembly

  • 6

    PowerEdge R730 - Massive flexibility for data-intensive operations

    Performance and efficiency

    Intelligent hardware-driven systems management with extensive power management features

    Innovative tools including automation for parts replacement and lifecycle manageability

    Broad choice of networking technologies from GbE to IB

    Built in redundancy with hot plug and swappable PSU, HDDs and fans

    Benefits

    Designed for performance workloads from big data analytics, distributed storage or distributed computing, where local storage is key to classic HPC and large-scale hosting environments

    High performance scale-out compute and low cost dense storage in one package

    Hardware Capabilities

    Flexible compute platform with dense storage capacity

    2S/2U server, 6 PCIe slots

    Large memory footprint (Up to 768GB / 24 DIMMs)

    High I/O performance and optional storage configurations

    HDD options: 12 x 3.5" - or - 24 x 2.5" + 2 x 2.5" HDDs in rear of server

    Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch

  • 7

    OptiStruct Performance - CPU Cores

    Running more cores per node generally improves overall performance

    The -nproc parameter specifies the number of threads spawned per MPI process

    Guideline: 6 threads per MPI process yields the best performance

    Spawning 6 threads per MPI process performed best among all tested configurations, at either 2 or 4 PPN (a minimal hybrid sketch follows below)

    Higher is better
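
    A minimal hybrid MPI + OpenMP sketch in C (an illustration only, not OptiStruct source code) of the layout described above: a few MPI ranks per node, each spawning several worker threads, as with 2-4 PPN and 6 threads per rank.

      /* Hypothetical hybrid MPI + OpenMP sketch (not OptiStruct source code):
         a few MPI ranks per node, each spawning several worker threads,
         mirroring the 2-4 PPN x 6-thread layout discussed above. */
      #include <mpi.h>
      #include <omp.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int provided, rank, nranks;

          /* MPI_THREAD_FUNNELED: only the main thread issues MPI calls */
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nranks);

          /* Threads per rank are taken from OMP_NUM_THREADS here (e.g. 6);
             in OptiStruct the count is governed by the -nproc option. */
          #pragma omp parallel
          {
              #pragma omp single
              printf("rank %d of %d running %d threads\n",
                     rank, nranks, omp_get_num_threads());
          }

          /* ... per-rank work runs threaded; MPI traffic stays one stream per rank ... */
          MPI_Finalize();
          return 0;
      }

    Running 2 or 4 ranks per node with 6 OpenMP threads each reproduces the 2/4 PPN x 6-thread layout compared in the chart.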

  • 8

    OptiStruct Performance - Interconnect

    EDR InfiniBand provides superior scalability performance over Ethernet

    11 times better performance than 1GbE at 24 nodes

    90% better performance than 10GbE at 24 nodes

    Ethernet solutions do not scale beyond 4 nodes

    2 PPN / 6 Threads; higher is better


  • 9

    OptiStruct Profiling - Number of MPI Calls

    For 1GbE, communication time is mostly spent on point-to-point transfer

    MPI_Iprobe and MPI_Test poll for completion of non-blocking transfers (see the sketch below)

    Overall runtime is significantly longer compared to faster interconnects

    For 10GbE, communication time is consumed by data transfer

    Amount of time for non-blocking transfers still significant

    Overall runtime reduces compared to 1GbE

    While the time for data transfer is reduced, collective operations take a higher share of the overall MPI time

    For InfiniBand, overall runtime reduces

    Time consumed by MPI_Allreduce is more significant compared to data transfer

    Overall runtime reduces significantly compared to Ethernet

    (Charts: MPI call distribution for 1GbE, 10GbE and EDR IB)
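
    A minimal C sketch (an assumption about the general pattern, not OptiStruct's actual code) of the non-blocking point-to-point exchange that generates large MPI_Iprobe and MPI_Test call counts in the profile:

      /* Sketch of the non-blocking pattern behind the MPI_Iprobe / MPI_Test
         counts (illustration only, not OptiStruct code; run with >= 2 ranks):
         the sender posts MPI_Isend and polls MPI_Test, while the receiver
         polls MPI_Iprobe before posting the matching receive. */
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int rank, flag = 0;
          double buf[1024] = {0};

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 0) {
              MPI_Request req;
              MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD, &req);
              while (!flag)                    /* many MPI_Test calls in the profile */
                  MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Status st;
              while (!flag)                    /* many MPI_Iprobe calls in the profile */
                  MPI_Iprobe(0, 99, MPI_COMM_WORLD, &flag, &st);
              MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          }

          MPI_Finalize();
          return 0;
      }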


  • 11

    OptiStruct Profiling - MPI Message Sizes

    The most time consuming MPI communications are:

    MPI_Allreduce: messages concentrated at 8B (see the sketch below)

    MPI_Iprobe and MPI_Test account for a large volume of calls that test for completion of messages

    2 PPN / 6 Threads
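
    The 8-byte concentration is consistent with reducing a single double-precision scalar across all ranks. A minimal C illustration (not taken from OptiStruct) of such an 8-byte MPI_Allreduce:

      /* An 8-byte MPI_Allreduce: reducing a single double-precision scalar
         (e.g. a residual or dot product) across all ranks produces the 8 B
         message size seen in the profile. Illustration only, not OptiStruct code. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank;
          double local, global;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          local = (double)rank;                /* per-rank partial value */
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

          if (rank == 0)
              printf("global sum = %f\n", global);  /* one 8-byte payload per call */

          MPI_Finalize();
          return 0;
      }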

  • 12

    OptiStruct Performance - Interconnect

    EDR IB delivers superior scalability performance over previous InfiniBand

    EDR InfiniBand improves over FDR IB by 40% at 24 nodes

    EDR InfiniBand outperforms FDR InfiniBand by 9% at 16 nodes

    The new EDR IB architecture supersedes the previous FDR IB generation in scalability

    4 PPN / 6 Threads


    Higher is better

  • 13

    OptiStruct Performance - Processes Per Node

    OptiStruct reduces communication by deploying hybrid MPI mode

    Hybrid MPI processes can spawn threads, which helps reduce communication on the network

    Enabling more MPI processes per node helps unlock additional performance (a placement-check sketch follows below)

    The following environment settings and tuned flags are used:

    I_MPI_PIN_DOMAIN auto, I_MPI_ADJUST_ALLREDUCE 2, I_MPI_ADJUST_BCAST 1, I_MPI_ADJUST_REDUCE 2

    ulimit -s unlimited

    Higher is Better

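
    A small placement-check sketch in C (a generic diagnostic, not part of OptiStruct; assumes Linux and OpenMP) that can be used to confirm where ranks and threads land when varying PPN and I_MPI_PIN_DOMAIN:

      /* Placement check (generic diagnostic, not part of OptiStruct; assumes
         Linux + OpenMP): prints which core each MPI rank / OpenMP thread runs
         on, to confirm bindings when varying PPN and I_MPI_PIN_DOMAIN. */
      #define _GNU_SOURCE
      #include <mpi.h>
      #include <omp.h>
      #include <sched.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int provided, rank;

          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          #pragma omp parallel
          printf("rank %d thread %d -> cpu %d\n",
                 rank, omp_get_thread_num(), sched_getcpu());

          MPI_Finalize();
          return 0;
      }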

  • 14

    OptiStruct Performance - Intel MPI Tuning

    Tuning Intel MPI collective algorithm can improve performance

    The MPI profile shows ~30% of runtime is spent on MPI_Allreduce over IB communications

    Default algorithm in Intel MPI is Recursive Doubling (I_MPI_ADJUST_ALLREDUCE=1)

    Rabenseifner's algorithm for Allreduce appears to be the best on 24 nodes (a timing sketch for comparing algorithms follows below)

    Intel MPI

    Higher is better; 4 PPN / 6 Threads
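
    A micro-benchmark sketch in C (an illustration, not OptiStruct code) that times small MPI_Allreduce operations; running it under different I_MPI_ADJUST_ALLREDUCE settings gives a quick comparison of the collective algorithms discussed above:

      /* Micro-benchmark sketch (not OptiStruct code): times many 8-byte
         MPI_Allreduce calls; rerun under different I_MPI_ADJUST_ALLREDUCE
         settings (1 = recursive doubling, 2 = Rabenseifner's) to compare
         the collective algorithms at scale. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank, i, iters = 10000;
          double local = 1.0, global, t0, t1;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          MPI_Barrier(MPI_COMM_WORLD);
          t0 = MPI_Wtime();
          for (i = 0; i < iters; i++)
              MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          t1 = MPI_Wtime();

          if (rank == 0)
              printf("average Allreduce latency: %.3f us\n",
                     (t1 - t0) / iters * 1e6);

          MPI_Finalize();
          return 0;
      }

    Exporting I_MPI_ADJUST_ALLREDUCE=1 and then 2 before launching the same binary makes the algorithm difference directly measurable.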

  • 15

    OptiStruct Performance - CPU Frequency

    Increase in CPU clock speed allows higher job efficiency

    Up to 11% higher productivity by increasing the clock speed from 2300MHz to 2600MHz

    Turbo Mode boosts job efficiency beyond the gain from the clock speed increase alone

    Up to a 31% performance jump by enabling Turbo Mode at 2600MHz

    Performance gain by turbo mode depends on environment factors, e.g. temperature


    Higher is better; 4 PPN / 6 Threads


  • 16

    OptiStruct Profiling - Disk I/O

    OptiStruct makes use of distributed I/O on the local scratch storage of the compute nodes (see the sketch below)

    Heavy disk I/O appears to take place throughout the run on each compute node

    The high I/O usage causes system memory to also be utilized for I/O caching

    Disk I/O is distributed across all compute nodes, thus providing higher I/O performance

    The workload completes faster as more nodes take part in the distributed I/O

    Higher is better; 4 PPN / 6 Threads
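
    A per-node scratch I/O sketch in C (an illustration only, not OptiStruct code; the /tmp path is an assumption standing in for the node-local scratch disk) showing how each rank can keep its scratch traffic on local storage so the I/O load spreads across nodes:

      /* Per-node scratch I/O sketch (illustration only, not OptiStruct code;
         /tmp stands in for the node-local scratch disk): each rank writes its
         scratch data to local storage, spreading the I/O load across nodes. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank;
          char path[256];
          static double block[1 << 16];        /* 512 KB dummy scratch block */

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          snprintf(path, sizeof(path), "/tmp/scratch_rank%d.bin", rank);
          FILE *f = fopen(path, "wb");
          if (f) {
              fwrite(block, sizeof(double), 1 << 16, f);
              fclose(f);
          }

          MPI_Finalize();
          return 0;
      }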

  • 17

    OptiStruct Profiling - MPI Message Sizes

    The majority of data transfer takes place from rank 0 to the rest of the ranks

    The non-blocking communications appear to be data transfers used to hide network latency

    The collective operations appear to be much smaller in size

    (Charts at 16 and 32 nodes; 2 PPN / 6 Threads)

  • 18

    OptiStruct Summary

    OptiStruct is designed to perform structural analysis at large scale

    OptiStruct designed hybrid MPI mode to perform at scale

    EDR InfiniBand is shown to outperform Ethernet in scalability performance

    ~70 times better performance than 1GbE at 24 nodes

    4.8x better performance than 10GbE at 24 nodes

    EDR InfiniBand improves over FDR IB by 40% at 24 nodes

    Hybrid MPI processes can spawn threads, which helps reduce communication on the network

    By enabling more MPI processes per node, additional performance can be unlocked
