
1 © 2019 ANSYS, Inc. October 24, 2019

Understanding Hardware Selection to Speed Up Your Simulations

October 2019

Wim Slagter, PhD

ANSYS, Inc.


Major Barrier- Turnaround Time Limitations

Source: Intel-ANSYS Simulation Survey 2014


Problem Statement

“I am not achieving the performance and throughput I was expecting from my hardware & software”

Image courtesy of Intel Corporation


Building A Balanced System Is The Key To Improving Your Experience

If Your System Is Slow, So Are Your Engineers & Analysts

Processors

Memory

Storage

Networks

Image courtesy of Intel Corporation


What Hardware Configuration to Select?

HDD vs. SSD

SMP vs. DMP

Interconnects?

Clusters?

CPUs? GPUs?


Agenda

• HPC Terminology

• Hardware Considerations

• Solution Reference Architecture

• Supporting “HPC Resources Anywhere”

• HPC Parallel & Parametric Licensing


Agenda

• HPC Terminology






HPC Hardware Terminology

Machine 1 (or Node 1)

GPU

Processor 1 (or Socket 1)


Interconnect (GigE or InfiniBand)

Machine N (or Node N)

GPU




Shared Memory Parallel

• Shared memory parallel (SMP) systems share a single global memory image that may be physically distributed across multiple cores, but is globally addressable.

• OpenMP is the industry standard.




Distributed Memory Parallel

• Distributed memory parallel processing (DMP) assumes that physical memory for each process is separate from all other processes.

• Parallel processing on such a system requires some form of message passing software to exchange data between the cores.

• MPI (Message Passing Interface) is the industry standard for this.
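The distributed memory model described above can be illustrated with a small, self-contained sketch. It is not MPI: the `Rank` class and the `send` helper are hypothetical stand-ins that only mimic the idea that each process owns its data and exchanges results through explicit messages.

```python
from dataclasses import dataclass, field


@dataclass
class Rank:
    """One simulated process: it owns its data and an inbox, nothing else."""
    rank_id: int
    local_data: list
    inbox: list = field(default_factory=list)


def send(dest, message):
    # In the DMP model this is the only way data crosses a process boundary.
    dest.inbox.append(message)


def distributed_sum(values, n_ranks):
    # Partition the data so that each rank holds a private slice.
    ranks = [Rank(i, values[i::n_ranks]) for i in range(n_ranks)]
    root = ranks[0]
    # Each rank reduces its own slice, then sends one message to the root.
    for r in ranks:
        send(root, sum(r.local_data))
    # The root combines the partial results it received.
    return sum(root.inbox)


print(distributed_sum(list(range(1, 101)), n_ranks=4))  # 5050
```

In a real DMP run the slices would live in separate address spaces, possibly on different nodes, and the `send` step would travel over the interconnect.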






Scalability on Workstations - ANSYS Fluent 2019 R2

• HP Z8 G4 Workstation
• 2x Intel Xeon Platinum 8160 (2.1-3.7GHz, 24 cores) CPUs
• 192GB (2600MHz, 8GBx24 DIMMs)
• 1TB HP Z Turbo Drive G2 (NVMe SSD)

Speed Up Ratio vs. Number of CPU cores

Number of CPU cores   4     8     12    16    32    36    48
aircraft              1,00  1,82  2,67  3,43  5,47  5,59  5,82
Landing gear          1,00  1,81  2,61  3,32  5,46  5,73  6,00
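As a hedged illustration of how to read these speedup ratios, the sketch below converts them into parallel efficiency. It assumes that the 4-core run is the baseline (speedup 1,00), so the ideal speedup at N cores is N / 4; the function name is illustrative only.

```python
# Speedup ratios taken from the chart labels above (baseline: 4 cores).
CORES = [4, 8, 12, 16, 32, 36, 48]
SPEEDUP = {
    "aircraft": [1.00, 1.82, 2.67, 3.43, 5.47, 5.59, 5.82],
    "landing gear": [1.00, 1.81, 2.61, 3.32, 5.46, 5.73, 6.00],
}


def parallel_efficiency(speedup, cores, base_cores=4):
    """Measured speedup divided by the ideal speedup cores / base_cores."""
    return speedup / (cores / base_cores)


for model, values in SPEEDUP.items():
    for cores, s in zip(CORES, values):
        eff = parallel_efficiency(s, cores)
        print(f"{model:12s} {cores:2d} cores: efficiency {eff:.2f}")
```

With these numbers the efficiency stays near 0.9 up to 12 cores and falls to roughly 0.5 at 48 cores, which is the flattening visible in the charts.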


Performance Comparison of an Old with New Workstation - ANSYS Fluent 2019 R2

Contents

Workstations: Z820 Workstation, Dual Intel Xeon E5-2697v2 (2.7-3.5GHz, 12 cores), RAM: 1866MHz, 4 channels
              vs.
              Z8 G4 Workstation, Dual Intel Xeon Platinum 8160 (2.1-3.7GHz, 24 cores), RAM: 2666MHz, 6 channels

MPI: IBM-MPI

Number of CPU cores tested: 4 / 8 / 12 / 16 / 24 / 32 / 36

Benchmark Models: 2 cases: aircraft and landing gear


Performance Comparison of an Old with New Workstation - ANSYS Fluent 2019 R2

Speed Up Ratio vs. Number of CPU cores

aircraft
Number of CPU cores   4     8     12    16    24    32      36
Z820                  1,00  1,72  1,78  1,97  2,04  NoData  NoData
Z8G4                  1,07  1,94  2,82  3,65  4,95  5,80    5,91

Landing gear
Number of CPU cores   4     8     12    16    24    32      36
Z820                  1,00  1,77  2,29  2,64  2,81  NoData  NoData
Z8G4                  1,08  2,08  2,97  3,83  5,20  6,05    6,30

85% Speed Up at 24 cores for a middle size CFD simulation

142% Speed Up at 24 cores for a small size CFD simulation
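The two callouts can be approximately reproduced from the 24-core chart labels. This is a sketch under the assumption that both workstations are plotted against the same baseline (Z820 at 4 cores = 1,00); because the labels are rounded, the result only matches the quoted percentages to within about one point.

```python
# Speedup ratios at 24 cores, read from the chart labels above.
AT_24_CORES = {
    "aircraft": {"Z820": 2.04, "Z8G4": 4.95},
    "landing gear": {"Z820": 2.81, "Z8G4": 5.20},
}


def gain_percent(old, new):
    """Relative gain of the new machine over the old one, in percent."""
    return (new / old - 1.0) * 100.0


for model, s in AT_24_CORES.items():
    print(f"{model}: {gain_percent(s['Z820'], s['Z8G4']):.1f}% faster at 24 cores")
```

The aircraft case comes out at roughly 143% and the landing gear case at roughly 85%, so the aircraft model corresponds to the small simulation and the landing gear model to the middle-size one.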

The memory differs between the two machines: 2666MHz vs. 1866MHz. Their memory bandwidth may differ as well. At 4 cores, the two machines perform very similarly. As more cores are used, the performance starts to diverge, which indicates a memory bandwidth difference. In short, both memory speed and memory bandwidth may play a role.
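To make the bandwidth argument concrete, here is a rough estimate of theoretical peak memory bandwidth per socket. It assumes the quoted "MHz" figures are transfer rates in MT/s and that each DDR channel is 64 bits (8 bytes) wide; sustained bandwidth in practice is lower, and the function name is illustrative.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak: channels x transfer rate x bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Per-socket figures from the comparison table (channels, memory speed, cores).
machines = {
    "Z820 (E5-2697v2)": (4, 1866, 12),
    "Z8 G4 (Platinum 8160)": (6, 2666, 24),
}

for name, (channels, speed, cores) in machines.items():
    bw = peak_bandwidth_gb_s(channels, speed)
    print(f"{name}: {bw:.1f} GB/s per socket, {bw / cores:.2f} GB/s per core")
```

Under these assumptions the newer machine has a little over twice the per-socket peak bandwidth, which is consistent with the gap opening up only once many cores compete for memory.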


Optimized for the Latest HPC Architectures - ANSYS Mechanical 19.x

ANSYS Features & Capabilities

Optimized for Intel Xeon Gold processors:
• Upgraded to the Intel MKL 2017 update 2 libraries on Linux and Windows
• Provides access to the AVX-512 instruction set
• Biggest speedup gains achieved in the sparse direct solver

         Iterative Solver Benchmarks   Direct Solver Benchmarks
R19.0    572 sec                       425 sec
R19.1    539 sec                       404 sec

• R19 Benchmark set (DMP)
• Used geometric mean values for each class of benchmarks
• Used 1, 2, 4, 8, 16, & 32 cores
• 2 Intel Xeon Gold 6148 (2.4 GHz, 40 cores total), 192 GB RAM, Linux CentOS 7.3

R19.1 performs ~10% faster than R19.0 on Skylake systems


Optimized for the Latest HPC Architectures - ANSYS Mechanical 2019 R2

Chart: Core Solver Rating on 32 cores, Normalized to Haswell
Benchmarks: V19cg-1, V19cg-2, V19cg-3, V19ln-1, V19ln-2, V19sp-1, V19sp-2, V19sp-3, V19sp-4, V19sp-5
Series: Intel® Xeon® E5-2699 v3 (Haswell), Intel® Xeon® E5-2697 v4 (Broadwell), Intel® Xeon® Gold 6254 (Cascade Lake)
Data labels: 1,65 1,19 1,55 1,53 1,78 1,55 1,56 1,71 1,82 1,76


Optimized for the Latest HPC Architectures - ANSYS Fluent 2019 Rx

Chart: ANSYS Fluent 2019 Aircraft Wing 14M Benchmark on Intel® Gold 6142 2.60 GHz / Intel® Gold 6242 2.8 GHz
Y axis: Rating (Jobs/day), 0 to 12000, higher is better
X axis: MPI Tasks (Cores): 32, 64, 128, 256, 512
Series:
• Gold 6142 processors (16c/2.6GHz/150W), 2019 R1 (19.3.0)
• Gold 6242 processors (16c/2.8GHz/150W), 2019 R3 (19.5.0)

ANSYS Fluent Standard Benchmark: Aircraft Wing 14M Cells


Optimized for the Latest HPC Architectures - ANSYS CFX 2019 R3

Chart: Intel® Xeon Processors
Y axis: Rating (Jobs Per Day), 0 to 21000, higher is better
X axis: Number of Compute Nodes: 1, 2, 4, 8
Series: Xeon E5-2690v4, Xeon Gold 6154, Platinum 8268


Performance Comparison of Intel Xeon Processors - ANSYS CFD 18.1

Charts: ANSYS CFX 18.1 and ANSYS Fluent 18.1
Y axis: SpeedUp, 0 to 22
X axis: Number of Cores, 0 to 32
Series: E5-2680v2_HDD, E5-2697v4_HDD, Gold6150_HDD
Callouts: 35% up, 43% up

※ Each series is the average value of model_1 and _2, and Turbo Boost On and Off in each CPU.


Performance Comparison of Intel Xeon Processors - ANSYS Fluent 2019 R2

Speed Up Ratio vs. Number of CPU cores

aircraft
Number of CPU cores   4     8     12    16    32    36    48
Xeon Platinum 8160    1,00  1,82  2,67  3,43  5,47  5,59  5,82
Xeon Gold 6154        1,03  1,85  2,71  3,41  5,04  5,09  NoData

Landing gear
Number of CPU cores   4     8     12    16    32    36    48
Xeon Platinum 8160    1,00  1,81  2,61  3,32  5,46  5,73  6,00
Xeon Gold 6154        1,09  1,98  2,86  3,54  5,17  5,09  NoData

Beyond 32 parallel processes, the Xeon Gold 6154 tends to stall in performance compared with the Xeon Platinum 8160.


Performance Comparison of Intel Xeon Processors - ANSYS 2019 R1

• A newer generation (in this case: Cascade Lake) might show a significant performance gain over the previous generation, but may need more cores to achieve it.

• Hardware vendors often do not provide an apples-to-apples comparison in the terms that we would expect.

• This is something to be aware of when comparing one processor with another.

1.2 and 2.4 relative performance based on increased core count, resp.



Processor Performance Comparisons

ANSYS® Fluent 18.1 increased performance with the Intel® Xeon® Gold 6148 processor

Fluent workload: sedan_4m.

Chart series:
• 2S Intel® Xeon® processor E5-2698 v3
• 2S Intel® Xeon® processor E5-2697 v4
• Intel® Xeon® Gold 6148 processor

Callouts: Up to 13% faster; Up to 60% faster @ 13% more cores


Processor Performance Comparisons - ANSYS Mechanical 2019 R2

• For Mechanical, the situation is different because of AVX-512 support from E5-v4 and Gold processor generation.

Chart: Core Solver Rating on 32 cores, Normalized to Haswell
Benchmark: V19cg-1 (data label: 1,65)
Series: Intel® Xeon® E5-2699 v3 (Haswell), Intel® Xeon® E5-2697 v4 (Broadwell), Intel® Xeon® Gold 6254 (Cascade Lake)


Processor Performance Comparisons - Intel Xeon Skylake vs. AMD EPYC (Naples)


Processor Performance Comparisons - Intel Xeon Cascade Lake vs. AMD EPYC (Naples)

Hardware Specifics

‐ EPYC 7601
  ➢ AMD EPYC 7601 with 2 sockets, 32 cores per socket, Mellanox EDR interconnect

‐ CLX-9242
  ➢ Intel Xeon Platinum 9242 with 2 sockets, 48 cores per socket, Intel OPA interconnect

‐ CLX 8260L
  ➢ Intel Xeon Platinum 8260L (CLX-SP) with 2 sockets, 24 cores per socket, Intel OPA interconnect (single rail)



• About 1.5X performance compared to EPYC

• Better with larger number of cores/nodes

• About 1.3X performance compared to EPYC

• Better with larger number of cores/nodes




Processor Performance Comparisons - Intel Cascade Lake vs. AMD EPYC (Rome)


Understanding the effect of clock speed - ANSYS CFD

ANSYS Fluent 19.2

Intel Xeon Gold 6144
• Normal: 3.5 GHz
• Overclock: 4.09 GHz

Speedup from overclocking (higher is better): 5% on CentOS, 10% on Windows

Overclocking can increase speedup by 10% on a Windows system, and by about half of that on Linux.
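A quick back-of-the-envelope check puts these figures in perspective. The sketch below compares the relative clock increase with the reported speedup; the "fraction realized" metric is an illustrative assumption, not something defined on the slide.

```python
BASE_GHZ = 3.5        # Intel Xeon Gold 6144, normal clock
OVERCLOCK_GHZ = 4.09  # overclocked


def clock_gain(base_ghz, boosted_ghz):
    """Relative frequency increase, e.g. 0.17 for +17%."""
    return boosted_ghz / base_ghz - 1.0


def fraction_realized(speedup_gain, base_ghz, boosted_ghz):
    """Share of the clock increase that shows up as solver speedup."""
    return speedup_gain / clock_gain(base_ghz, boosted_ghz)


print(f"clock increase: {clock_gain(BASE_GHZ, OVERCLOCK_GHZ):.1%}")
for os_name, gain in {"Windows": 0.10, "CentOS": 0.05}.items():
    share = fraction_realized(gain, BASE_GHZ, OVERCLOCK_GHZ)
    print(f"{os_name}: {gain:.0%} speedup, {share:.0%} of the clock increase")
```

A clock increase of about 17% yields only 5 to 10% speedup here, which suggests that part of the run time is bound by something other than core frequency.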




Understanding the effect of clock speed- ANSYS Mechanical

• Effect of increased core operating frequencies on the DMP benchmarks running on 12 cores

• Influence is highest for sparse solver benchmarks

Using higher clock speed is always helpful to realize productivity gains


Turbo Boost (Intel) - ANSYS Mechanical

• Relative to 1 core, Turbo Boost on the E5 processor family gives good performance gains in many cases.

Turbo Boost (Intel) - ANSYS CFD

Using Turbo Boost / Core can be helpful to realize productivity gains


Turbo Boost (Intel) - ANSYS CFD

Charts: SpeedUp (0,5 to 1,25) per CPU (E5-2697v4, Gold6150), one panel each for 32 cores, 16 cores, 10 cores and 4 cores
Series: TurboBoostOff_3M, TurboBoostOn_3M, TurboBoostOff_30M, TurboBoostOn_30M
Models: 3M cell single phase, 30M cell single phase
Data labels:
1.14 1.13 1.11 1.08
1.05 1.07 1.03 1.03
1.14 1.10 1.10 1.07
1.07 1.08 1.04 1.04

Using Turbo Boost / Core can be helpful to realize productivity gains, particularly at lower core counts.


Hyper-threading

Chart: Evaluation of Hyperthreading on ANSYS/FLUENT Performance
iDataplex M3 (Intel Xeon x5670, 2.93 GHz), TURBO: ON
(measurement is improvement relative to Hyperthreading OFF)
Y axis: Improvement due to Hyperthreading, 0.90 to 1.10, higher is better
X axis: ANSYS/FLUENT Model: eddy_417K, turbo_500K, aircraft_2M, sedan_4M, truck_14M
Series: HT OFF (12 threads on 12 physical cores), HT ON (24 threads on 12 physical cores)

Hyper-threading is NOT recommended


Understanding the effect of memory bandwidth - Is 24 Cores Equal to 24 Cores?

3 x (8) = 24 cores    2 x (12) = 24 cores

22% up

Understanding the effect of memory bandwidth

A 10-core processor has higher performance per core than a 12-core processor. Consider memory per core!

Using fewer cores per node can be helpful to realize productivity gains

Understanding the effect of memory channels


Understanding the effect of memory speed - ANSYS CFD

Chart: Impact of DIMM speed on ANSYS/FLUENT Application Performance
(Intel Xeon x5670, 2.93 GHz), Hyper Threading: OFF, TURBO: ON, Active threads per node: 12
(performance measure improvement is relative to memory speed of 1066 MHz)
Y axis: Impact of Memory Speed, 80% to 130%
X axis: ANSYS/FLUENT Model: eddy_417K, turbo_500K, aircraft_2M, sedan_4M, truck_14M
Series: 1066 MHz, 1333 MHz

• Some processor types have slower memory speeds by default

• In the past, memory speed was more influential

• With current processors, we can see a minimal effect of memory speed

• On other processors, non-optimal filling of the memory channels can slow the memory speed

Memory speed is shown to have a measurable but small effect of approximately 2%


SMP vs. DMP

Distributed Memory Parallel is Outperforming Shared Memory Parallel computing

Charts: Speedup Factor vs. Number of Cores for ANSYS Mechanical
• SMP: 0 to 16 cores, speedup factor axis 0.0 to 5.0
• DMP: 0 to 256 cores, speedup factor axis 0.0 to 50.0


Recap

• Faster cores mean faster solution

• Faster memory means slightly faster solution

• Memory bandwidth is an important factor for (linear) scalability

• Turbo Boost/Turbo Core modes do give some benefit, especially at low core counts per node.

• In general, hyper-threading should not be used because of licensing implications.

• Be careful when looking at comparisons! Make sure you are comparing like with like!




Take Advantage of the Latest HPC Architectures - ANSYS Mechanical 19.0 with Nvidia GPU

ANSYS Application Examples

Take Advantage of the Latest HPC Architectures - ANSYS Mechanical 19.x with Nvidia GPU

• R19 benchmark suite run on Linux server with 2 Intel Xeon E5-2695v3 processors
• 256 GB RAM, SSD, 1 NVIDIA P100 16 GB PCIe card, CentOS 7.2

Comparison of entire simulation solution time (not just the calculations that are accelerated on the GPU)

• 4.2 million DOF; sparse solver, nonlinear static analysis involving contact, plasticity and gasket elements
• Linux cluster; each compute node contains 2 Intel Xeon E5-2690 v4 (2.1GHz, 8c) processors, 256 GB RAM, 2 NVIDIA Tesla P100, CentOS 7.4


Chart: DMP Performance w/ 16 cores
Y axis: Relative Speedup, 0,0 to 4,0
X axis: 0 GPU, 1 GPU, 2 GPU
Series: R19.0, R19.1




• R18 benchmark suite run on PNY workstation with 2 Intel Xeon E5-2620 v4 (2.1GHz, 8c) processors
• 128 GB RAM, 2 x SSD Raid 0, 1 NVIDIA Quadro GP100, Windows 10

Chart: Elapsed Time Speed Up from "10 CPU cores" to "9 CPU cores with 1 GP100" for 1 HPC Pack
Benchmarks: V18cg-1, V18cg-2, V18cg-3, V18ln-1, V18ln-2, V18ln-3 (DSLP), V18ln-4 (Impeller), V18sp-1, V18sp-2, V18sp-3, V18sp-4, V18sp-5, V18sp-6 (Coupling)
Data labels: 1,89 1,39 1,33 1,30 1,40 1,65 1,61 1,13 1,03 1,54 2,26 1,59 1,18




When a GPU accelerator is used, the job speeds up by 2.5 times with 2 cores, by 2.1 times with 4 cores and by 1.7 times with 8 cores (higher is better).

When a GPU accelerator is used with 16 cores, the job speeds up by 6.33 times.

Hardware Configuration:
• HP Z840 workstation with dual E5-2699v4 (2.2 GHz), 128GBs 2400MHz memory
• Optional NVIDIA card: Tesla K40c or Quadro GP100














NVIDIA-GPU Solution Fit for ANSYS Mechanical

GPUs accelerate the solver part of analysis, consequently problems with high solver workloads benefit the most from GPUs
• Characterized by both high DOF and high factorization requirements
• Models with solid elements (such as castings) and >500K DOF experience good speedups

Better performance when run in DMP mode than in SMP mode

GPU and system memories both play important roles in performance
• Sparse solver:
  – Bulkier and/or higher-order FE models are good and will be accelerated
  – If the model exceeds 5M DOF, then use a single GPU with 12 GB memory (Tesla K40, Quadro K6000 / GP100, P100).
• PCG/JCG solver:
  – Memory saving (MSAVE) option should be turned off for enabling GPUs
  – Models with lower Level of Difficulty value (Lev_Diff) are better suited for GPUs


Optimized for the Latest HPC Architectures - ANSYS Fluent 19.0

ANSYS Application Example

Case Details:
• Boeing Landing Gear Analysis; 15 million mixed cells, 100 iterations
Hardware Configuration:
• Dual Intel Xeon E5-2698v3 (2.3GHz), 256GB, Tesla P100

Case Details:
• F1 race car model; 140 million hexa-core cells; pseudo transient solver is off; 100 iterations
Hardware Configuration:
• Dual Intel Xeon E5-2698v3 (2.3GHz), 256GB, Tesla P100

Case Details:
• External flow over truck; 14 million mixed cells; until convergence
Hardware Configuration:
• Dual Intel Xeon E5-2698v3 (2.3GHz), 256GB, Tesla P100

Case Details:
• 9.6 million cell pipe benchmark
Hardware Configuration:
• Cluster of XL250 Gen9s with E5-2690v4, 128GBs 2400MHz memory and 2 NVIDIA K80s/node

• 2x Intel Xeon Broadwell-EP (Xeon E5-2690 v4 2.6 GHz) 16-core CPU, Quadro GP100, Windows 10, 256 GB RAM
• Customer model: automotive water-cooled engine jacket (5.5M cells)

Comparison of time until convergence (not just the calculations that are accelerated on the GPU)


NVIDIA-GPU Solution Fit for ANSYS Fluent - Supported Hardware Configurations

Diagram: node layouts (CPU + GPU) that violate the requirements:
• Some nodes with 16 processes and some with 12 processes
• Some nodes with 2 GPUs, some with 1 GPU
• 15 processes not divisible by 2 GPUs

Requirements:
● Homogeneous process distribution
● Homogeneous GPU selection
● Number of processes must be an exact multiple of the number of GPUs


NVIDIA-GPU Solution Fit for ANSYS Fluent - Power Consumption Study

• Adding GPUs to a CPU-only node resulted in 2.1x speed up while reducing energy consumption by 38%

NVIDIA-GPU Solution Fit for ANSYS Fluent

GPUs accelerate the AMG solver portion of the CFD analysis, and thus benefit problems with relatively high %AMG
• Coupled solvers have high %AMG in the range of 60-70%
• Fine meshes and low-dissipation problems have high %AMG

In some cases, pressure-based coupled solvers offer faster convergence compared to segregated solvers (problem-dependent)

The whole problem must fit on GPUs for the calculations to proceed
• In the pressure-based coupled solver, each million cells needs approx. 4 GB of GPU memory
• High-memory cards such as Tesla K80, Quadro K6000 / GP100 or P100 are ideal

Moving scalar equations such as turbulence may not benefit much because of low workloads (using ‘scalar yes’ option in ‘amg-options’)

Better performance on lower CPU core counts
• A ratio of 3 or 4 CPU cores to 1 GPU is recommended
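The sizing and layout rules from the Fluent GPU slides can be turned into a simple pre-flight check. This is only a sketch: the function names are hypothetical, the 4 GB per million cells figure is the approximate value quoted above for the pressure-based coupled solver, and nothing here calls into Fluent.

```python
GB_PER_MILLION_CELLS = 4.0  # approximate figure quoted on the slide


def gpu_memory_needed_gb(million_cells):
    """Rough GPU memory estimate for a case of the given size."""
    return million_cells * GB_PER_MILLION_CELLS


def check_layout(processes_per_node, gpus_per_node):
    """Return the rule violations for per-node process and GPU counts."""
    problems = []
    if len(set(processes_per_node)) > 1:
        problems.append("process distribution is not homogeneous")
    if len(set(gpus_per_node)) > 1:
        problems.append("GPU selection is not homogeneous")
    for procs, gpus in zip(processes_per_node, gpus_per_node):
        if gpus > 0 and procs % gpus != 0:
            problems.append(f"{procs} processes not divisible by {gpus} GPUs")
            break
    return problems


print(gpu_memory_needed_gb(14))          # 56.0
print(check_layout([16, 12], [2, 2]))    # one violation
print(check_layout([15, 15], [2, 2]))    # one violation
print(check_layout([8, 8], [2, 2]))      # []
```

The last layout also respects the recommended ratio of about 3 or 4 CPU cores per GPU.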


Optimized for the Latest HPC Architectures - ANSYS HFSS Transient 18.1

Chart: Speedup (0,00 to 9,00) per benchmark
Benchmarks: cauer, DifferntialVia, Dipole_PML, F35_800Mhz, GSM_Antenna, PECMine
Series: Xeon E5-2687W, Tesla K40, Quadro GP100

• 2x Intel Xeon [Xeon E5-2687W 3.1GHz] 8-core CPU.
• Tesla K40. Tesla GP100. 256 GB RAM. Windows 7 x64.


Optimized for the Latest HPC Architectures - ANSYS HFSS Transient 18.1

• Automatic job assignment for parametric sweeps or network analyses with multiple excitations

• Speedup scales linearly with respect to the number of GPUs

• Automatic detection of GPUs attached to displays, which are excluded from GPU acceleration

GPU monitoring by nvidia-smi

CPU monitoring by Windows Task Manager


Optimized for the Latest HPC Architectures - ANSYS HFSS 18.1

• 2x Intel Xeon Haswell-EP [Xeon E5-2695 v3 2.3 GHz] 14-core CPU.
• Tesla K80. Tesla P100. 256 GB RAM. CentOS 7.2 64-bit.

Chart: horizontal bars, axis 0 to 4
Benchmarks: F35, HitachiCar, EBG_Ground_plane
Series: 8 CPU Cores, 8 CPU Cores + 1x K80, 8 CPU Cores + 1x P100


Optimized for the Latest HPC Architectures - ANSYS Maxwell3D 18.1

• Benchmark Model: T.E.A.M. Problem 21. Eddy Current Loss in Power Transformer.
• 2x Intel Xeon Haswell-EP [Xeon E5-2695 v3 2.3 GHz] 14-core CPU.
• Tesla K80. Tesla P100. 256 GB RAM. CentOS 7.2 64-bit.

Chart: horizontal bars, axis 0 to 1,4
Series: 8 CPU Cores, 8 CPU Cores + 1x K80



NVIDIA-GPU Solution Fit for ANSYS HFSS & Maxwell3D

GPUs accelerate the hybrid solver in HFSS Transient
• The GPU-accelerated hybrid solver clearly outperforms the implicit solver.
• High-frequency problems with uniform meshes and high operating frequency benefit the most.
• Usually good speedups can be achieved starting from 140K DOFs.
• The hybrid solver in HFSS Transient detects projects not suitable to run on GPUs (speedup < 1x) and falls back to CPUs automatically.

GPUs accelerate the multi-frontal sparse direct solver in HFSS and Maxwell3D
• Usually good speedups can be achieved starting from 2M DOFs.
• Double precision only.
• These solvers also use GPUs only if there is a potential speedup.




Understanding the effect of the interconnect

• Need fast interconnects to feed fast processors
  – Two main characteristics for each interconnect: latency and bandwidth
  – Distributed ANSYS is highly bandwidth bound

+--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+

Release: 14.5    Build: UP20120802    Platform: LINUX x64
Date Run: 08/09/2012    Time: 23:07

Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

Total number of cores available    : 32
Number of physical cores available : 32
Number of cores requested          : 4 (Distributed Memory Parallel)
MPI Type: INTELMPI

Core   Machine Name   Working Directory
----------------------------------------------------
0      hpclnxsmc00    /data1/ansyswork
1      hpclnxsmc00    /data1/ansyswork
2      hpclnxsmc01    /data1/ansyswork
3      hpclnxsmc01    /data1/ansyswork

Latency time from master to core 1 = 1.171 microseconds
Latency time from master to core 2 = 2.251 microseconds
Latency time from master to core 3 = 2.225 microseconds

Communication speed from master to core 1 = 7934.49 MB/sec   Same machine
Communication speed from master to core 2 = 3011.09 MB/sec   QDR Infiniband
Communication speed from master to core 3 = 3235.00 MB/sec   QDR Infiniband
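The latency and bandwidth lines of the statistics block are easy to pull out programmatically. The sketch below parses them with regular expressions and feeds them into a simple "latency + size / bandwidth" transfer-time model; the model and the helper names are illustrative assumptions, not part of Distributed ANSYS.

```python
import re

STATS = """\
Latency time from master to core 1 = 1.171 microseconds
Latency time from master to core 2 = 2.251 microseconds
Latency time from master to core 3 = 2.225 microseconds
Communication speed from master to core 1 = 7934.49 MB/sec
Communication speed from master to core 2 = 3011.09 MB/sec
Communication speed from master to core 3 = 3235.00 MB/sec
"""

LATENCY = re.compile(
    r"Latency time from master to core\s+(\d+)\s*=\s*([\d.]+) microseconds")
SPEED = re.compile(
    r"Communication speed from master to core\s+(\d+)\s*=\s*([\d.]+) MB/sec")


def parse_stats(text):
    """Map core number -> (latency in microseconds, bandwidth in MB/s)."""
    latency = {int(c): float(v) for c, v in LATENCY.findall(text)}
    speed = {int(c): float(v) for c, v in SPEED.findall(text)}
    return {c: (latency[c], speed[c]) for c in latency if c in speed}


def transfer_time_us(message_mb, latency_us, bandwidth_mb_s):
    """Estimated time to send one message: latency + size / bandwidth."""
    return latency_us + message_mb / bandwidth_mb_s * 1e6


for core, (lat, bw) in parse_stats(STATS).items():
    print(f"core {core}: 1 MB message takes about "
          f"{transfer_time_us(1.0, lat, bw):.0f} microseconds")
```

For a 1 MB message the bandwidth term dominates, so the same-machine link (core 1) comes out well over twice as fast as the QDR InfiniBand links in this model.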


Understanding the effect of the interconnect - ANSYS Fluent

Exhaust Model
• 7.6M cells
• Transient simulation with explicit time stepping for engine startup cycle
• Fujitsu PRIMERGY CX250 HPC systems (E5-2690v2 with 20 and E5-2697v2 with 24 cores per node, resp.)

For CFD we can see the performance of IB vs. GigE: GigE starts to drop off after 2 nodes



• Fluent 18.0 performance measured using benchmark sets ranging from 2 to 14 million cells.

• Intel Xeon E5 v4 processor family – up to 96 nodes (3456 cores).

• At lower core counts (~576 cores) the performance between Intel Omni-Path vs EDR InfiniBand is comparable and at higher core counts Omni-Path outperforms by ~25-47%.



• For the Combustor 12 million cell model, OPA is ~33% better in performance compared to EDR InfiniBand (using 36 nodes, 3456 cores).

• For the Open Racecar 280 million cell case, OPA maintains nearly linear scalability up to ~7000 core count run.


Understanding the effect of the interconnect - ANSYS Fluent 2019 R2

In the case of IBM-MPI and MSMPI, saturation occurred at 48 parallel processes in the small size CFD simulation.

MSMPI is the fastest MPI in the middle size CFD simulation.

Speed Up Ratio vs. Number of CPU cores

aircraft
Number of CPU cores   4     8     12    16    32    36    48
IntelMPI              1,00  1,82  2,67  3,43  5,47  5,59  5,82
IBM-MPI               1,01  1,82  2,65  3,43  5,45  5,56  4,15
MSMPI                 0,99  1,80  2,65  3,36  5,09  4,96  4,51

Landing gear
Number of CPU cores   4     8     12    16    32    36    48
IntelMPI              1,00  1,81  2,61  3,32  5,46  5,73  6,00
IBM-MPI               1,01  1,94  2,78  3,58  5,65  5,88  4,57
MSMPI                 1,03  1,94  2,80  3,60  5,67  5,87  6,21

• HP Z8 G4 Workstation
• 2x Intel Xeon Platinum 8160 (2.1-3.7GHz, 24 cores) CPUs
• 192GB (2600MHz, 8GBx24 DIMMs)
• 1TB HP Z Turbo Drive G2 (NVMe SSD)

For several MPI benchmarks, HPC-X exhibits higher performance and better scalability




Understanding the effect of the interconnect - ANSYS Mechanical

For ANSYS Mechanical, GigE does not scale to more than 1 node!

V13sp-5 Model
• Turbine geometry
• 2,100 K DOF
• SOLID187 FEs
• Static, nonlinear
• One iteration
• Direct sparse
• Linux cluster (8 cores per node)

Chart: Interconnect Performance
Y axis: Rating (runs/day), 0 to 60
X axis: 8 cores, 16 cores, 32 cores, 64 cores, 128 cores
Series: Gigabit Ethernet, DDR Infiniband

Using faster interconnects can be helpful to realize productivity gains, particularly at higher core/node counts


Recap

• 10GigE and InfiniBand are recommended for HPC clusters.
  o Currently only InfiniBand is recommended for large clusters
  o QDR should be more than adequate for small to medium clusters. FDR for large clusters.

• For more than 1 node you will see a performance decrease using GigE.
  o Mechanical users should not use GigE at all if their jobs span more than one node.




Understanding the effect of the disks/storage - ANSYS Mechanical

• Need fast hard drives to feed fast processors
  – Check the bandwidth specs
• ANSYS Mechanical can be highly I/O bandwidth bound
  – Sparse solver in the out-of-core memory mode does lots of I/O
• Distributed ANSYS can be highly I/O latency bound
  – Seek time to read/write each set of files causes overhead
• Consider SSDs
  – High bandwidth and extremely low seek times
• Consider RAID configurations
  – RAID 0: for speed
  – RAID 1, 5: for redundancy
  – RAID 10: for speed and redundancy


Understanding the effect of the disks/storage - ANSYS Mechanical 18.1

When the working directory is assigned to the Z Turbo Drive G2 and BMT models for the CG solver are used with more than 16 cores, the job speeds up by 1.4 times (chart labels: 1.4x, 1.4x, 1.4x; higher is better).

When the working directory is assigned to the Z Turbo Drive G2 and BMT models for the SPARSE solver are used with more than 16 cores, the job speeds up by 1.8-2.6 times (chart labels: 1.8x, 2.6x, 2.1x; higher is better).

Hardware Configuration:
• HP Z840 workstation with dual E5-2699v4 (2.2 GHz), 128GBs 2400MHz memory
• Optional Storage: Micron SATA SSD No RAID or HP Z Turbo Drive G2 512GB No RAID






Understanding the effect of the disks/storage - ANSYS Mechanical

Chart: Rating vs. Number of Cores

Using faster disks can be helpful to realize productivity gains, particularly at higher core/node counts


Understanding the effect of the disks/storage - ANSYS Fluent

Landing Gear Noise Predictions using Scale-Resolving Simulations (180M cell model using pressure based segregated solver)

Mesh   File   Location   Async I/O   Time
15M    Cas    NFS        OFF         217s
15M    Cas    NFS        ON          62s
15M    Dat    NFS        OFF         113s
15M    Dat    NFS        ON          8s
30M    Cas    NFS        OFF         207s
30M    Cas    NFS        ON          75s
30M    Dat    NFS        OFF         144s
30M    Dat    NFS        ON          10s

Asynchronous I/O for Linux Fluent
• Total write time 3-5x quicker over NFS
• Even larger speed-ups on bigger cases and local disk (up to 10x)
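The "3-5x" figure can be checked against the table by summing the case and data write times per mesh. This is a small sketch using only the timings listed above; the data layout and function name are illustrative.

```python
# Write times in seconds over NFS: (async I/O off, async I/O on).
WRITE_TIME_S = {
    "15M": {"Cas": (217, 62), "Dat": (113, 8)},
    "30M": {"Cas": (207, 75), "Dat": (144, 10)},
}


def total_speedup(files):
    """Total write time without async I/O divided by total time with it."""
    off = sum(t_off for t_off, _ in files.values())
    on = sum(t_on for _, t_on in files.values())
    return off / on


for mesh, files in WRITE_TIME_S.items():
    print(f"{mesh}: total write time {total_speedup(files):.1f}x quicker")
```

Both meshes land between 4x and 5x for the total write time, while the data files on their own improve by more than 10x.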



Recap

• I/O is very important for the Mechanical solver
  o RAID 0 mandatory for multiple disks
  o SSDs recommended for speed, 15k SAS drives

• For most customers and most types of job, Fluent and CFX won’t require fast local disk access

• Parallel file systems can meet the requirements of both types of solver








ANSYS CFX/Fluent Starter Cluster


ANSYS Mechanical Starter Cluster








Supporting “HPC Resources Anywhere”

● ANSYS supports and certifies the leading remote display software solutions (VNC, DCV, Exceed onDemand, and Microsoft Remote Desktop).

● ANSYS supports and certifies the leading job schedulers (LSF, PBS Professional, UGE/SGE, MOAB/Torque, Microsoft HPC).

● ANSYS supports cloud portal workflows (IBM Platform Application Center, Altair Compute Manager, NICE EnginFrame, ANSYS EKM).


Customer Benefits

● Easy access to more powerful HPC resources, to simulate models that were simply impossible in the past.

● Collaborate virtually from anywhere with any client device.

● Increase HPC resource utilization while lowering IT support overhead.

● Reduce network overload and security concerns by eliminating the need to move big simulation data sets around!


Supporting “HPC Resources Anywhere” - Remote Display Support

ANSYS 2019 R3 supports the following remote display solutions:
• Nice Desktop Cloud Visualization (DCV) 2017.4
  o Linux server + Linux/Windows client
• OpenText Exceed onDemand 8 SP11
  o Linux server + Linux/Windows client
• OpenText Exceed TurboX 12.0
  o Linux server + Linux/Windows client
• VNC Connect 6.4 (with VirtualGL 2.6)
  o Linux server + Linux/Windows client
• Microsoft Remote Desktop (on Windows cluster)

Hardware requirements for remote visualization servers:
• GPU capable video cards
• Large amounts of RAM accessible for multiple user availability when running ANSYS applications and pre/post processing



Supporting “HPC Resources Anywhere” - Virtual Desktop (VDI) Support

Support for virtual GPU
• For less graphically intensive work: GPU to be shared between multiple virtual machines (VMs)

GPU pass-through still for best performance
• One GPU per VM, up to 8 VMs per machine (K1, K2 cards); memory constraints will limit in any case

Supported at ANSYS 2019 R3:








HPC Parallel & Parametric Licensing

❏ Standalone HPC licenses: Basic option to license individual cores for low-level HPC (e.g. 5-6 cores)

❏ HPC Pack: HPC product rewarding volume parallel processing for high-fidelity individual simulations

❏ HPC Workgroup: HPC product rewarding volume parallel processing for increased simulation throughput shared among engineers throughout a single location or the world

❏ HPC Parametric Pack: HPC product enabling simultaneous execution of multiple design points while consuming just one set of licenses

HPC Packs per Simulation         1     2      3       4       5        6        7
Parallel Enabled (Total Cores)   12    36     132     516     2052     8196     32772
Breakdown                        8+4   32+4   128+4   512+4   2048+4   8192+4   32768+4
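The core counts above follow a regular pattern: each additional HPC Pack multiplies the pack-enabled cores by four, on top of the 4 built-in cores. The closed form below is inferred from the slide's numbers and is an assumption, not an official ANSYS formula.

```python
BUILT_IN_CORES = 4  # built-in HPC cores that the packs add to


def cores_enabled(hpc_packs):
    """Total cores for a given number of HPC Packs (inferred pattern)."""
    if hpc_packs < 1:
        raise ValueError("at least one HPC Pack is required")
    return 2 * 4 ** hpc_packs + BUILT_IN_CORES


for packs in range(1, 8):
    print(f"{packs} HPC Pack(s): {cores_enabled(packs)} total cores")
```

The loop reproduces the "Parallel Enabled (Total Cores)" values from 12 cores for one pack up to 32772 cores for seven packs.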


What’s new in ANSYS 19.0?

More products are now using ANSYS HPC
• Standalone HPC licenses, HPC Packs and HPC Workgroup become more flexible and work across physics with all ANSYS Mechanical, Fluids and Electronics products*

4 Built-in HPCs now across all physics
• 4 built-in HPCs are now included in Mechanical, Fluids and Electronics products, including ANSYS AIM and ANSYS Chemkin Enterprise.

HPC Packs are now additive
• HPC Packs become additive in nature to the 4 built-in HPCs (e.g. 1 HPC Pack licenses 8 + 4 = 12 total cores, 2 HPC Packs license 32 + 4 = 36 total cores, etc.)

* Impacted products:
ANSYS Mechanical Pro, Premium, Enterprise
ANSYS CFD Premium and Enterprise
ANSYS Mechanical CFD
ANSYS HFSS
ANSYS AIM
ANSYS Q3D Extractor
ANSYS Maxwell
ANSYS Icepak
ANSYS Mechanical CFD Maxwell 3D
ANSYS Chemkin-Pro and Enterprise
ANSYS Mechanical Maxwell 3D
ANSYS SIwave

Note: R19.0 license manager is required. For ANSYS Mechanical and Fluids products changes are backward compatible; for ANSYS Electronics products changes are compatible with version 19.0 and forward

Note: built-in HPCs are linked to a solver seat and cannot be shared with other solver seats!

Note: the single, standalone HPCs are not additive to the Packs


ANSYS HPC Parametric Pack License

HPC license for running parametric FEA or CFD simulations on multiple CPU cores simultaneously, and more cost effectively

Key Benefits
• Ability to automatically and simultaneously execute design points while consuming just one set of application licenses
• Scalable because number of simultaneous design points enabled increases quickly with added packs
• Amplifies complete workflow because design points can include execution of multiple applications (pre, meshing, solve, HPC, post)

Number of HPC Parametric Pack Licenses          1   2   3    4    5
Number of Simultaneous Design Points Enabled    4   8   16   32   64
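The chart values suggest that the number of simultaneous design points doubles with every added HPC Parametric Pack. The formula below is inferred from those values (4, 8, 16, 32, 64 for 1 to 5 packs) and should be read as an assumption rather than a licensing rule.

```python
def design_points_enabled(parametric_packs):
    """Simultaneous design points for a number of packs (inferred pattern)."""
    if parametric_packs < 1:
        raise ValueError("at least one HPC Parametric Pack is required")
    return 2 ** (parametric_packs + 1)


for packs in range(1, 6):
    print(f"{packs} pack(s): {design_points_enabled(packs)} design points")
```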


HPC Parametric Packs to Reduce Time for Design Variations

Diagram labels: Sequential series of design points (dp1, dp2, dp3, dp4); Unused Cores; One solver key; without HPC; One solver key and one HPC Parametric Pack; OR Four solver keys + 4 HPC keys

94% Reduced Time to Innovation

HPC Parametric Packs amplify both solver licenses and HPC licenses, allowing you to drastically reduce time to innovation, without the cost of additional solver or HPC licenses…


Licensing GPUs for Computing

Electronics products: 1 GPU unlocked by every 8 HPC tasks
● 4 HPC licenses enable 1 GPU through the available 8 HPC tasks
● 1 HPC Pack enables up to 12 CPU cores + 1 GPU through the available 12 HPC tasks
● 2 HPC Packs enable up to 36 CPU cores + 4 GPUs through the available 36 HPC tasks

Fluids / Structural products: 1 GPU requires 1 HPC task as long as GPUs ≤ CPU cores
Examples:
● 2 HPC licenses enable up to 3 CPU cores + 3 GPUs through the available 6 HPC tasks
● 1 HPC Pack enables up to 6 CPU cores + 6 GPUs through the available 12 HPC tasks
● 2 HPC Packs enable up to 18 CPU cores + 18 GPUs through the available 36 HPC tasks

GPU acceleration can be enabled through all ANSYS HPC product licenses: ANSYS HPC, ANSYS HPC Pack and ANSYS HPC Workgroup.
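
The two GPU rules can be expressed as small functions that reproduce the examples on this slide. This is a sketch inferred from those examples, not an official licensing calculator: the function names are hypothetical, and the Electronics function assumes that GPUs do not reduce the CPU cores available, which is what the "12 CPU cores + 1 GPU" and "36 CPU cores + 4 GPUs" examples suggest.

```python
from typing import Tuple


def electronics_limits(hpc_tasks: int) -> Tuple[int, int]:
    """(max CPU cores, max GPUs) for Electronics products.

    Assumption: every 8 HPC tasks unlock 1 GPU, and the GPUs do not
    reduce the CPU cores available, as in the slide's examples.
    """
    return hpc_tasks, hpc_tasks // 8


def fluids_structural_limits(hpc_tasks: int) -> Tuple[int, int]:
    """(CPU cores, GPUs) with the most GPUs for Fluids / Structural products.

    Each GPU consumes 1 HPC task and GPUs may not exceed CPU cores,
    so at most half of the tasks can go to GPUs.
    """
    gpus = hpc_tasks // 2
    return hpc_tasks - gpus, gpus


if __name__ == "__main__":
    # HPC task counts used in the slide's examples.
    for tasks in (12, 36):
        print(tasks, "tasks:",
              "Electronics", electronics_limits(tasks),
              "Fluids/Structural", fluids_structural_limits(tasks))
```

The HPC task counts passed in already include the 4 built-in HPCs, matching the totals quoted in the examples (6, 8, 12 and 36 tasks).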


Pay-per-use HPC at ANSYS 2019 R2

HPC Packs and HPC Parametric Packs are supported by usage-based ANSYS Elastic Units (AEUs) for on-premise and off-premise (i.e. cloud) deployments. Optimal for intermittent use and/or peak demands in HPC!

Consumption Rate by Product Category
• Rates: 1 AEU/h, 2 AEU/h, 4 AEU/h, 8 AEU/h
• Product categories: Geometry Interfaces; ECAD Translators; Pre/Post; Solver; DSO; Optimization; HPC Pack; HPC Parametric Pack; AIM, Maxwell 2D, Simplorer


Which Type of Licensing is Right for Me?

HPC license cost decreases as more are purchased, either as HPC Packs or as HPC Workgroups.

ANSYS HPC and ANSYS HPC Workgroup give flexible use of a pool of licenses.

ANSYS HPC Pack gives “quick” scale-up but is more restrictive in how users can use it.

This flexibility is why the HPC Workgroup options cost more than the HPC Packs.

HPC Parametric Pack enables more cost-effective licensing for design exploration and optimization.


Wrap-up - Licensing

Multiple licensing options to fit different requirements.

HPC Packs for quick scale-up.

HPC Workgroup for flexibility.

GPUs are treated the same as cores in the licensing model.

As you scale up, license cost per core decreases.

Per-core pricing becomes less of an issue.

Running on 2,000 cores instead of 20 cores at 1.5X the cost, not 100X

Filling up a 1024-core instead of a 128-core cluster with 32-core jobs will cut the price per job in half!

Enabling 64 instead of 4 simultaneous design points at ~3X the cost, not 16X
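
The scale-up claims can be turned into a relative cost per core or per design point. The sketch below assumes that the "X" figures on the slide are multiples of license cost; `relative_unit_cost` is a hypothetical helper, and no actual prices are implied.

```python
def relative_unit_cost(cost_multiple: float, capacity_multiple: float) -> float:
    """Cost per core (or per design point) relative to the smaller setup."""
    if capacity_multiple <= 0:
        raise ValueError("capacity_multiple must be positive")
    return cost_multiple / capacity_multiple


# 2,000 cores instead of 20 is 100X the cores at 1.5X the cost.
per_core = relative_unit_cost(1.5, 2000 / 20)
# 64 design points instead of 4 is 16X the design points at about 3X the cost.
per_design_point = relative_unit_cost(3.0, 64 / 4)

print(f"relative cost per core: {per_core:.3f}")
print(f"relative cost per design point: {per_design_point:.4f}")
```

Under these assumptions each core at the larger scale costs a small fraction of what it costs at the smaller scale, which is the point behind "per-core pricing becomes less of an issue".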


Additional Resources- Tools to Check Out!

• ROI Estimator: www.ansys.com/ws-roi-estimator
• For hardware advice: www.ansys.com/support/platform-support, www.ansys.com/hpc-webinars, www.ansys.com/hpc-cluster-appliance
• Get a Free HPC Benchmark: www.ansys.com/free-hpc-benchmark


Additional Resources- IT White Papers

White Papers:
• Speed Simulation & Innovation: https://www.ansys.com/resource-library/white-paper/speed-simulation-innovation
• The Value of High-Performance Computing for Simulation: http://www.ansys.com/Campaigns/value-of-hpc-for-simulation
• Debunking Six Myths of High-Performance Computing: http://www.ansys.com/Campaigns/debunking-hpc-myths
• Higher Performance Across a Wide Range of ANSYS Fluent Simulations with the Intel Xeon Gold 6148 Processor: http://www.ansys.com/en-gb/resource-library/white-paper/performance-ansys-fluent-intel-xeon-processor
• Value of HPC for Ensuring Product Integrity: http://www.ansys.com/Resource-Library/white-paper/value-of-hpc-for-ensuring-product-integrity
• Optimizing Business Value in High-Performance Engineering Computing: http://www.ansys.com/resource-library/white-paper/optimizing-business-value-in-high-performance-engineering-computing
• ANSYS Fluent with Fujitsu PRIMERGY HPC: HVAC for Built Environment: http://www.ansys.com/resource-library/white-paper/ansys-fluent-with-fujitsu-primergy-hpc-hvac-for-built-environment
• ANSYS® Application Benchmarking on Dell PowerEdge VRTX: http://www.ansys.com/resource-library/white-paper/ansys-application-benchmarking-on-dell-poweredge-vrtx
• HPE Reference Architecture for Small and Medium Enterprises: http://www.ansys.com/resource-library/white-paper/hpe-reference-architecture-for-small-and-medium-enterprises
• ANSYS Fluent Brings CFD Performance with Intel Processors and Fabrics: http://www.ansys.com/resource-library/white-paper/ansys-fluent-brings-cfd-performance-with-intel-processors-and-fabrics


Additional Resources- IT White Papers & Technical Briefs

Technical Briefs:
• Dell EMC HPC System for Manufacturing: ANSYS Application Performance: http://www.ansys.com/Resource-Library/brochure/dell-emc-hpc-system-for-manufacturing-ansys-application-performance-technical-brief
• SGI Technology Guide for ANSYS Mechanical Analysts: http://www.ansys.com/resource-library/application-brief/sgi-technology-guide-for-ansys-mechanical-analysts
• SGI Technology Guide for ANSYS Fluent Analysts: http://www.ansys.com/resource-library/application-brief/sgi-technology-guide-for-ansys-fluent-analysts
• Workstations for FEA Simulation: http://www.ansys.com/Resource-Library/application-brief/workstations-for-fea-simulation
• HP Reference Architecture for Small and Medium Enterprises: http://www.ansys.com/resource-library/application-brief/hp-reference-architecture-for-small-and-medium-enterprises

White Papers:
• Mechanical Engineer Productivity Boosted by Higher-Core CPUs: http://www.ansys.com/en-gb/resource-library/white-paper/intel-productivity-hpc
• Focus on Faster Mechanical Simulation: http://www.ansys.com/resource-library/white-paper/focus-on-faster-mechanical-simulation
• Workstations for FEA Simulation: http://www.ansys.com/Resource-Library/application-brief/workstations-for-fea-simulation
• Intel Solid-State Drives Increase Productivity of Product Design and Simulation: http://www.ansys.com/resource-library/white-paper/intel-solid-state-drives-increase-productivity-of-product-design-and-simulation


Additional Resources- IT Webinars

Recorded webinars:
• Understand How High-Performance Compute Can Accelerate Your Simulation Throughput: http://www.ansys.com/resource-library/webinar/understand-how-high-performance-compute-can-accelerate-your-simulation-throughput
• Speed-up Your Desktop ANSYS FEA Simulations with ANSYS HPC: http://www.ansys.com/resource-library/webinar/speed-up-your-desktop-ansys-fea-simulations
• Getting the Most from Your ANSYS Simulation Applications: http://www.ansys.com/resource-library/webinar/getting-the-most-from-your-ansys-simulation-applications
• How to Evaluate and Improve the Performance of ANSYS Mechanical: http://www.ansys.com/Resource-Library/webinar/how-to-evaluate-and-improve-the-performance-of-ansys-mechanical
• Extreme Scalability for High-Fidelity CFD Simulations: http://www.ansys.com/Support/resource-library/webinar/extreme-scalability-for-high-fidelity-cfd-simulations (also http://www.ansys.com/resource-library/webinar/extreme-scalability-for-high-fidelity-cfd-simulations)
• Industry Perspectives on Extreme Scalability for High-Fidelity CFD Simulations: http://www.ansys.com/resource-library/webinar/industry-perspectives-on-extreme-scalability-for-high-fidelity-cfd-simulations

For more and upcoming webinars related to HPC/IT: http://www.ansys.com/hpc-webinars


Thank You!

• Connect with Me
  – [email protected]
• Connect with ANSYS, Inc.
  – LinkedIn ANSYSInc
  – Twitter @ANSYS
  – Facebook ANSYSInc
• Follow our Blog
  – ansys-blog.com
