1 © 2019 ANSYS, Inc. October 24, 2019 Understanding Hardware Selection to Speed Up Your Simulations October 2019 Wim Slagter, PhD ANSYS, Inc.


1 © 2019 ANSYS, Inc. October 24, 2019

Understanding Hardware Selection to Speed Up Your Simulations

October 2019

Wim Slagter, PhD

ANSYS, Inc.

Major Barrier: Turnaround Time Limitations

Source: Intel-ANSYS Simulation Survey 2014


Problem Statement

“I am not achieving the performance and throughput I was expecting from my hardware & software”

Image courtesy of Intel Corporation


Building A Balanced System Is The Key To Improving Your Experience

If Your System Is Slow, So Are Your Engineers & Analysts

• Processors

• Memory

• Storage

• Networks

Image courtesy of Intel Corporation

What Hardware Configuration to Select?

• CPUs? GPUs?

• HDD vs. SSD

• SMP vs. DMP

• Interconnects? Clusters?


Agenda

• HPC Terminology

• Hardware Considerations

• Solution Reference Architecture

• Supporting “HPC Resources Anywhere”

• HPC Parallel & Parametric Licensing



HPC Hardware Terminology

Machine 1 (or Node 1): GPU, Processor 1 (or Socket 1), Processor 2 (or Socket 2)

Interconnect (GigE or InfiniBand)

Machine N (or Node N): GPU, Processor 1 (or Socket 1), Processor 2 (or Socket 2)


Shared Memory Parallel

• Shared memory parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable.

• OpenMP is the industry standard for this.

Machine 1 (or Node 1)

Processor 1 (or Socket 1)


Distributed Memory Parallel

• Distributed memory parallel processing (DMP) assumes that physical memory for each process is separate from all other processes.

• Parallel processing on such a system requires some form of message passing software to exchange data between the cores.

• MPI (Message Passing Interface) is the industry standard for this.
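The distributed-memory idea can be illustrated with a small single-process simulation. This is only a sketch: it is not MPI, and the function names (`smooth_serial`, `split`, `smooth_distributed`) are hypothetical. Each "rank" owns a private chunk of a 1D field and can see a neighbour's edge value only if that value is explicitly sent to it, which is the role a message-passing library plays on a real cluster.

```python
def smooth_serial(u):
    """One smoothing sweep on the whole field; the two end points stay fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def split(u, n_ranks):
    """Partition u into n_ranks contiguous chunks, one private chunk per rank."""
    base, extra = divmod(len(u), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        chunks.append(list(u[start:start + size]))
        start += size
    return chunks


def smooth_distributed(u, n_ranks):
    """Same sweep, but each rank works only on its chunk plus received halo values."""
    if not 1 <= n_ranks <= len(u):
        raise ValueError("need between 1 and len(u) ranks")
    chunks = split(u, n_ranks)

    # "Send" phase: every rank posts its edge values into its neighbours' inboxes.
    inbox = [{} for _ in range(n_ranks)]
    for rank, chunk in enumerate(chunks):
        if rank > 0:
            inbox[rank - 1]["right"] = chunk[0]
        if rank < n_ranks - 1:
            inbox[rank + 1]["left"] = chunk[-1]

    # Compute phase: each rank touches only its own data and its inbox.
    result = []
    for rank, chunk in enumerate(chunks):
        padded, offset = list(chunk), 0
        if "left" in inbox[rank]:
            padded.insert(0, inbox[rank]["left"])
            offset = 1
        if "right" in inbox[rank]:
            padded.append(inbox[rank]["right"])
        for j in range(len(chunk)):
            p = j + offset
            if p == 0 or p == len(padded) - 1:
                result.append(padded[p])  # global boundary point, kept fixed
            else:
                result.append((padded[p - 1] + padded[p] + padded[p + 1]) / 3.0)
    return result


field = [float(i * i) for i in range(10)]
print(smooth_distributed(field, 3) == smooth_serial(field))
```

The amount of data exchanged per sweep is small compared with the work inside each chunk, which is why interconnect latency and bandwidth start to matter mainly as the chunks shrink at high core counts.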

Machine 1 (or Node 1)

Processor 1 (or Socket 1)


Agenda

• HPC Terminology

• Hardware Considerations

• Solution Reference Architecture

• Supporting “HPC Resources Anywhere”

• HPC Parallel & Parametric Licensing

What Hardware Configuration to Select?

• CPUs? GPUs?

• HDD vs. SSD

• SMP vs. DMP

• Interconnects? Clusters?

Scalability on Workstations: ANSYS Fluent 2019 R2

• HP Z8 G4 Workstation

• 2x Intel Xeon Platinum 8160 (2.1-3.7 GHz, 24 cores) CPUs

• 192 GB (2666 MHz, 8 GB x 24 DIMMs)

• 1 TB HP Z Turbo Drive G2 (NVMe SSD)

Speed-up ratio vs. number of CPU cores (4 cores = 1.00):

Number of CPU cores   4      8      12     16     32     36     48
Aircraft              1.00   1.82   2.67   3.43   5.47   5.59   5.82
Landing gear          1.00   1.81   2.61   3.32   5.46   5.73   6.00
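The speed-up ratios on this slide are normalized to the 4-core run, so they can be converted into a parallel efficiency (speed-up divided by the increase in core count). The sketch below does that with the values read from the two charts; the function name `parallel_efficiency` is illustrative and not part of any ANSYS tool.

```python
CORES = [4, 8, 12, 16, 32, 36, 48]
SPEEDUP = {
    "aircraft": [1.00, 1.82, 2.67, 3.43, 5.47, 5.59, 5.82],
    "landing gear": [1.00, 1.81, 2.61, 3.32, 5.46, 5.73, 6.00],
}


def parallel_efficiency(cores, speedup, base_cores=4):
    """Efficiency = measured speed-up / ideal speed-up relative to base_cores."""
    return [s / (c / base_cores) for c, s in zip(cores, speedup)]


efficiency = {case: parallel_efficiency(CORES, values) for case, values in SPEEDUP.items()}

for case, values in efficiency.items():
    print(case, [f"{e:.0%}" for e in values])
```

With these inputs the efficiency falls from 100% at 4 cores to roughly half at 48 cores, which is one way to quantify how far a single workstation is from linear scaling.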

Performance Comparison of an Old and a New Workstation: ANSYS Fluent 2019 R2

Workstations:
  Z820 Workstation: Dual Intel Xeon E5-2697 v2 (2.7-3.5 GHz, 12 cores); RAM: 1866 MHz, 4 channels
  vs.
  Z8 G4 Workstation: Dual Intel Xeon Platinum 8160 (2.1-3.7 GHz, 24 cores); RAM: 2666 MHz, 6 channels

MPI: IBM-MPI

Number of CPU cores tested: 4 / 8 / 12 / 16 / 24 / 32 / 36

Benchmark models: 2 cases (aircraft and landing gear)

Performance Comparison of an Old and a New Workstation: ANSYS Fluent 2019 R2

Speed-up ratio vs. number of CPU cores:

Number of CPU cores    4      8      12     16     24     32        36
Aircraft, Z820         1.00   1.72   1.78   1.97   2.04   no data   no data
Aircraft, Z8 G4        1.07   1.94   2.82   3.65   4.95   5.80      5.91
Landing gear, Z820     1.00   1.77   2.29   2.64   2.81   no data   no data
Landing gear, Z8 G4    1.08   2.08   2.97   3.83   5.20   6.05      6.30

85% speed-up at 24 cores for a mid-size CFD simulation

142% speed-up at 24 cores for a small CFD simulation
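The two percentages can be checked against the 24-core chart values for the Z820 and the Z8 G4. The sketch below is a plain ratio calculation; pairing each percentage with a benchmark case is an inference from the arithmetic, not something the slide states, and `percent_gain` is an illustrative name.

```python
# Speed-up ratios at 24 cores, as read from the two charts.
Z820_AT_24 = {"aircraft": 2.04, "landing gear": 2.81}
Z8G4_AT_24 = {"aircraft": 4.95, "landing gear": 5.20}


def percent_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


gains = {case: percent_gain(Z8G4_AT_24[case], Z820_AT_24[case]) for case in Z820_AT_24}

for case, gain in gains.items():
    print(f"{case}: {gain:.1f}% faster on the Z8 G4 at 24 cores")
```

The landing gear ratio comes out at about 85% and the aircraft ratio at just under 143%, which lines up with the two figures quoted on the slide.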

The memory differs between the two machines: 2666 MHz vs. 1866 MHz, and 6 vs. 4 memory channels, so they are likely to differ in memory bandwidth as well. At 4 cores the two machines perform very similarly. As more cores are used, the results diverge, which points to a memory bandwidth difference. In short, both memory speed and memory bandwidth may play a role.
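A rough way to see the bandwidth gap is to estimate the theoretical peak memory bandwidth from the channel count and memory speed. The sketch below rests on assumptions that are not on the slide: the quoted MHz figures are treated as transfer rates in MT/s, each channel is taken to be 64 bits (8 bytes) wide, the channel counts are per socket on dual-socket machines, and the 24 processes are spread evenly over both sockets. Sustained bandwidth in practice is lower than these peaks.

```python
BYTES_PER_TRANSFER = 8  # assumed 64-bit DDR channel


def peak_bandwidth_gbs(channels_per_socket, mega_transfers_per_s, sockets=2):
    """Theoretical peak memory bandwidth of the whole machine in GB/s."""
    return sockets * channels_per_socket * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000.0


z820 = peak_bandwidth_gbs(channels_per_socket=4, mega_transfers_per_s=1866)
z8g4 = peak_bandwidth_gbs(channels_per_socket=6, mega_transfers_per_s=2666)

active_cores = 24  # the core count at which the slide compares both machines
print(f"Z820 : {z820:6.1f} GB/s total, {z820 / active_cores:5.2f} GB/s per active core")
print(f"Z8 G4: {z8g4:6.1f} GB/s total, {z8g4 / active_cores:5.2f} GB/s per active core")
```

Under these assumptions the newer machine offers a bit more than twice the peak bandwidth per active core at 24 cores, which is consistent with the curves separating as the core count grows.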

Optimized for the Latest HPC Architectures: ANSYS Mechanical 19.x

ANSYS Features & Capabilities

Optimized for Intel Xeon Gold processors:

• Upgraded to the Intel MKL 2017 update 2 libraries on Linux and Windows

• Provides access to the AVX-512 instruction set

• Biggest speedup gains achieved in the sparse direct solver

Release   Iterative Solver Benchmarks   Direct Solver Benchmarks
R19.0     572 sec                       425 sec
R19.1     539 sec                       404 sec

• R19 Benchmark set (DMP)

• Used geometric mean values for each class of benchmarks

• Used 1, 2, 4, 8, 16, & 32 cores

• 2 Intel Xeon Gold 6148 (2.4 GHz, 40 cores total), 192 GB RAM, Linux CentOS 7.3

R19.1 performs ~10% faster than R19.0 on Skylake systems
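The aggregate timings in the table can be converted into a relative gain as sketched below. This is a plain time ratio with an illustrative function name; the "~10%" headline is the slide's own figure and is not derived here, and it may rest on results that are not reproduced in the table.

```python
# Geometric-mean elapsed times in seconds, as listed in the table.
TIMES = {
    "iterative": {"R19.0": 572.0, "R19.1": 539.0},
    "direct": {"R19.0": 425.0, "R19.1": 404.0},
}


def percent_faster(old_seconds, new_seconds):
    """Throughput gain of the new release over the old one, in percent."""
    return (old_seconds / new_seconds - 1.0) * 100.0


table_gain = {
    solver: percent_faster(t["R19.0"], t["R19.1"]) for solver, t in TIMES.items()
}

for solver, gain in table_gain.items():
    print(f"{solver} solver class: R19.1 is {gain:.1f}% faster than R19.0")
```

For the values in the table this gives roughly 5 to 6 percent per solver class.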

Optimized for the Latest HPC Architectures: ANSYS Mechanical 2019 R2

Core Solver Rating on 32 cores, normalized to Haswell

Processors compared: Intel® Xeon® E5-2699 v3 (Haswell), Intel® Xeon® E5-2697 v4 (Broadwell), Intel® Xeon® Gold 6254 (Cascade Lake)

Data labels, in chart order:

Benchmark   V19cg-1  V19cg-2  V19cg-3  V19ln-1  V19ln-2  V19sp-1  V19sp-2  V19sp-3  V19sp-4  V19sp-5
Rating      1.65     1.19     1.55     1.53     1.78     1.55     1.56     1.71     1.82     1.76

Optimized for the Latest HPC Architectures: ANSYS Fluent 2019 Rx

ANSYS Fluent Standard Benchmark: Aircraft Wing, 14M cells, on Intel® Xeon® Gold 6142 (2.60 GHz) and Intel® Xeon® Gold 6242 (2.8 GHz)

[Chart: Rating (jobs/day, 0 to 12000) vs. MPI tasks (cores) at 32, 64, 128, 256, and 512; higher is better]

• Gold 6142 processors (16c/2.6GHz/150W), 2019 R1 (19.3.0)

• Gold 6242 processors (16c/2.8GHz/150W), 2019 R3 (19.5.0)
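The chart's "Rating (jobs/day)" axis can be related to wall-clock time as sketched below. This assumes the usual definition of a benchmark rating as the number of jobs that fit into 24 hours, that is, 86,400 seconds divided by the elapsed seconds for one benchmark run; the slide itself does not spell the definition out, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def rating(wall_clock_seconds):
    """Jobs per day for a benchmark that takes wall_clock_seconds to run once."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall-clock time must be positive")
    return SECONDS_PER_DAY / wall_clock_seconds


def wall_clock_seconds(jobs_per_day):
    """Inverse conversion: elapsed seconds per job for a given rating."""
    if jobs_per_day <= 0:
        raise ValueError("rating must be positive")
    return SECONDS_PER_DAY / jobs_per_day


# Hypothetical example values, not taken from the chart.
print(rating(43.2))              # a 43.2 s run corresponds to about 2000 jobs/day
print(wall_clock_seconds(8000))  # 8000 jobs/day corresponds to 10.8 s per run
```

Because the rating is inversely proportional to run time, doubling the rating means halving the time to solution.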

Optimized for the Latest HPC Architectures: ANSYS CFX 2019 R3

[Chart: Rating (jobs per day, 0 to 21000) vs. number of compute nodes at 1, 2, 4, and 8; higher is better]

Intel® Xeon® processors compared: Xeon E5-2690 v4, Xeon Gold 6154, Xeon Platinum 8268

Performance Comparison of Intel Xeon Processors: ANSYS CFD 18.1

[Charts: speed-up (0 to 22) vs. number of cores (0 to 32), one panel for ANSYS CFX 18.1 and one for ANSYS Fluent 18.1]

Series in both panels: E5-2680v2_HDD, E5-2697v4_HDD, Gold6150_HDD

※ Each series is the average of model_1 and model_2, with Turbo Boost on and off, for each CPU.

Chart annotations: 35% up; 43% up

Performance Comparison of Intel Xeon Processors: ANSYS Fluent 2019 R2

Speed-up ratio vs. number of CPU cores:

Number of CPU cores                4      8      12     16     32     36     48
Aircraft, Xeon Platinum 8160       1.00   1.82   2.67   3.43   5.47   5.59   5.82
Aircraft, Xeon Gold 6154           1.03   1.85   2.71   3.41   5.04   5.09   no data
Landing gear, Xeon Platinum 8160   1.00   1.81   2.61   3.32   5.46   5.73   6.00
Landing gear, Xeon Gold 6154       1.09   1.98   2.86   3.54   5.17   5.09   no data

Beyond 32 parallel processes, the Xeon Gold 6154 tends to stall compared with the Xeon Platinum 8160.

Performance Comparison of Intel Xeon Processors: ANSYS 2019 R1

• A newer generation (in this case, Cascade Lake) may deliver a significant performance gain over the previous generation, but may require more cores to do so.

• Hardware vendors often do not provide an apples-to-apples comparison in the terms we would expect.

• This is simply something to be aware of when comparing one processor with another.

Chart annotation: 1.2 and 2.4 relative performance, respectively, based on increased core count.

Processor Performance Comparisons

• A newer generation may deliver a significant performance gain over the previous generation, but may require more cores to do so.

• Hardware vendors often do not provide an apples-to-apples comparison in the terms we would expect.

• This is simply something to be aware of when comparing one processor with another.

ANSYS® Fluent 18.1 increased performance with the Intel® Xeon® Gold 6148 processor (Fluent workload: sedan_4m)

Processors compared: 2S Intel® Xeon® processor E5-2698 v3, 2S Intel® Xeon® processor E5-2697 v4, Intel® Xeon® Gold 6148 processor

Chart annotations: up to 13% faster; up to 60% faster at 13% more cores
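One way to make such a comparison closer to apples-to-apples is to divide the quoted gain by the increase in core count. The sketch below applies that to the "60% faster at 13% more cores" annotation; `per_core_gain` is an illustrative helper, and the result is only as reliable as the vendor figures fed into it.

```python
def per_core_gain(speedup, core_ratio):
    """Speed-up that remains after normalizing for the extra cores used."""
    if core_ratio <= 0:
        raise ValueError("core ratio must be positive")
    return speedup / core_ratio


quoted_speedup = 1.60  # "up to 60% faster"
core_ratio = 1.13      # "13% more cores"

normalized = per_core_gain(quoted_speedup, core_ratio)
print(f"Gain per core: {(normalized - 1.0) * 100.0:.0f}%")
```

Normalized this way, the headline 60% corresponds to a per-core improvement of a little over 40%, which matters when software licensing is tied to the number of cores in use.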


• For Mechanical, the situation is different because of AVX-512 support, which is available from the Gold processor generation onward (E5 v4 supports AVX2).

Processor Performance Comparisons: ANSYS Mechanical 2019 R2

Core Solver Rating on 32 cores, normalized to Haswell, for benchmark V19cg-1 (data label: 1.65)

Processors compared: Intel® Xeon® E5-2699 v3 (Haswell), Intel® Xeon® E5-2697 v4 (Broadwell), Intel® Xeon® Gold 6254 (Cascade Lake)

Processor Performance Comparisons: Intel Xeon Skylake vs. AMD EPYC (Naples)

Processor Performance Comparisons: Intel Xeon Cascade Lake vs. AMD EPYC (Naples)

Hardware Specifics

‐ EPYC 7601: AMD EPYC 7601 with 2 sockets, 32 cores per socket, Mellanox EDR interconnect

‐ CLX-9242: Intel Xeon Platinum 9242 with 2 sockets, 48 cores per socket, Intel OPA interconnect

‐ CLX 8260L: Intel Xeon Platinum 8260L (CLX-SP) with 2 sockets, 24 cores per socket, Intel OPA interconnect (single rail)

Processor Performance Comparisons: Intel Xeon Cascade Lake vs. AMD EPYC (Naples)

• About 1.5X performance compared to EPYC

• Better with larger number of cores/nodes

• About 1.3X performance compared to EPYC

• Better with larger number of cores/nodes


Processor Performance Comparisons: Intel Cascade Lake vs. AMD EPYC (Rome)

Understanding the effect of clock speed: ANSYS CFD

ANSYS Fluent 19.2 (higher is better)

Intel Xeon Gold 6144

• Normal: 3.5 GHz

• Overclocked: 4.09 GHz

Chart annotations: 5% on CentOS; 10% on Windows

Overclocking can increase speedup by 10% on a Windows system, and by about half of that on Linux.
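It can help to compare the measured gain with the clock increase itself. The sketch below uses only the figures on this slide (3.5 GHz to 4.09 GHz, 10% on Windows, 5% on CentOS); the helper name `realized_fraction` is hypothetical.

```python
def realized_fraction(base_ghz, boosted_ghz, measured_gain):
    """Share of the clock-frequency increase that shows up as solver speed-up."""
    clock_gain = boosted_ghz / base_ghz - 1.0
    return measured_gain / clock_gain


clock_gain = 4.09 / 3.5 - 1.0
on_windows = realized_fraction(3.5, 4.09, 0.10)
on_centos = realized_fraction(3.5, 4.09, 0.05)

print(f"Clock increase: {clock_gain:.1%}")
print(f"Realized on Windows: {on_windows:.0%} of the clock increase")
print(f"Realized on CentOS:  {on_centos:.0%} of the clock increase")
```

A clock increase of about 17% yields well under 17% in solver speed-up on both systems, which suggests that factors other than core frequency also limit this workload.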


Understanding the effect of clock speed: ANSYS Mechanical

• Effect of increased core operating frequencies on the DMP benchmarks running on 12 cores

• Influence is highest for sparse solver benchmarks

Using a higher clock speed is always helpful to realize productivity gains

Turbo Boost (Intel): ANSYS Mechanical

• Relative to 1 core, we see good performance gains in many cases from using Turbo Boost on the E5 processor family.

Turbo Boost (Intel): ANSYS CFD

Using Turbo Boost / Turbo Core can be helpful to realize productivity gains

Turbo Boost (Intel): ANSYS CFD

[Charts: speed-up (0.5 to 1.25) with Turbo Boost off vs. on for E5-2697v4 and Gold6150, one panel each for 4, 10, 16, and 32 cores]

Series: TurboBoostOff_3M, TurboBoostOn_3M, TurboBoostOff_30M, TurboBoostOn_30M (3M = 3M cell single phase; 30M = 30M cell single phase)

Speed-up data labels, as shown on the slide:

1.14   1.13   1.11   1.08
1.05   1.07   1.03   1.03
1.14   1.10   1.10   1.07
1.07   1.08   1.04   1.04

Using Turbo Boost / Turbo Core can be helpful to realize productivity gains, particularly at lower core counts.

Hyper-threading: Evaluation of Hyper-threading on ANSYS/FLUENT Performance

iDataplex M3 (Intel Xeon X5670, 2.93 GHz), TURBO: ON

(measurement is improvement relative to Hyper-threading OFF)

[Chart: improvement due to Hyper-threading (0.90 to 1.10; higher is better) for ANSYS/FLUENT models eddy_417K, turbo_500K, aircraft_2M, sedan_4M, truck_14M]

Series: HT OFF (12 threads on 12 physical cores), HT ON (24 threads on 12 physical cores)

Hyper-threading is NOT recommended

Understanding the effect of memory bandwidth: Is 24 Cores Equal to 24 Cores?

3 x (8) = 24 cores vs. 2 x (12) = 24 cores

Understanding the effect of memory bandwidth: Is 24 Cores Equal to 24 Cores?

Chart annotation: 22% up

Understanding the effect of memory bandwidth

The 10-core processor has higher performance per core than the 12-core processor. Consider memory per core!

Understanding the effect of memory bandwidth: Is 20 Cores Equal to 20 Cores?

Using fewer cores per node can be helpful to realize productivity gains

Understanding the effect of memory bandwidth

Using fewer cores per node can be helpful to realize productivity gains

Understanding the effect of memory channels

The 10-core processor has higher performance per core than the 12-core processor. Consider memory per core!

Understanding the effect of memory speed: ANSYS CFD

Impact of DIMM speed on ANSYS/FLUENT Application Performance

(Intel Xeon X5670, 2.93 GHz) Hyper-Threading: OFF, TURBO: ON

Active threads per node: 12 (performance measure improvement is relative to memory speed of 1066 MHz)

[Chart: impact of memory speed (80% to 130%) for ANSYS/FLUENT models eddy_417K, turbo_500K, aircraft_2M, sedan_4M, truck_14M; series: 1066 MHz, 1333 MHz]

• Some processor types have slower memory speeds by default

• In the past, memory speed was more influential

• With current processors, we see a minimal effect of memory speed

• On other processors, non-optimal filling of the memory channels can slow the memory speed

Memory speed is shown to have a measurable but small effect of approximately 2%

SMP vs. DMP

Distributed Memory Parallel is Outperforming Shared Memory Parallel Computing

[Charts: speedup factor vs. number of cores for ANSYS Mechanical. SMP panel: 0 to 16 cores, speedup axis 0.0 to 5.0. DMP panel: 0 to 256 cores, speedup axis 0.0 to 50.0]

Recap

• Faster cores mean a faster solution

• Faster memory means a slightly faster solution

• Memory bandwidth is an important factor for (linear) scalability

• Turbo Boost/Turbo Core modes do give some benefit, especially at low core counts per node

• In general, hyper-threading should not be used because of licensing implications

• Be careful when looking at comparisons! Make sure you are comparing like with like!

What Hardware Configuration to Select?

• CPUs? GPUs?

• HDD vs. SSD

• SMP vs. DMP

• Interconnects? Clusters?

Take Advantage of the Latest HPC Architectures: ANSYS Mechanical 19.0 with NVIDIA GPU

ANSYS Application Examples

Take Advantage of the Latest HPC Architectures: ANSYS Mechanical 19.x with NVIDIA GPU

ANSYS Application Examples

• R19 benchmark suite run on Linux server with 2 Intel Xeon E5-2695v3 processors

• 256 GB RAM, SSD, 1 NVIDIA P100 16 GB PCIe card, CentOS 7.2

Comparison of entire simulation solution time (not just the calculations that are accelerated on the GPU)

Take Advantage of the Latest HPC Architectures: ANSYS Mechanical 19.x with NVIDIA GPU

ANSYS Application Examples

• 4.2 million DOF; sparse solver, nonlinear static analysis involving contact, plasticity and gasket elements

• Linux cluster; each compute node contains 2 Intel Xeon E5-2690 v4 (2.1GHz, 8c) processors, 256 GB RAM, 2 NVIDIA Tesla P100, CentOS 7.4

Comparison of entire simulation solution time (not just the calculations that are accelerated on the GPU)

[Chart: relative speedup (0.0 to 4.0) with 0, 1, and 2 GPUs; DMP performance with 16 cores; series: R19.0, R19.1]

Take Advantage of the Latest HPC Architectures: ANSYS Mechanical 18.1 with NVIDIA GPU

ANSYS Application Examples

• R18 benchmark suite run on PNY workstation with 2 Intel Xeon E5-2620 v4 (2.1GHz, 8c) processors

• 128 GB RAM, 2 x SSD RAID 0, 1 NVIDIA Quadro GP100, Windows 10

Comparison of entire simulation solution time (not just the calculations that are accelerated on the GPU)

Elapsed Time Speed Up from "10 CPU cores" to "9 CPU cores with 1 GP100" for 1 HPC Pack (data labels in chart order):

Benchmark             Speed up
V18cg-1               1.89
V18cg-2               1.39
V18cg-3               1.33
V18ln-1               1.30
V18ln-2               1.40
V18ln-3 (DSLP)        1.65
V18ln-4 (Impeller)    1.61
V18sp-1               1.13
V18sp-2               1.03
V18sp-3               1.54
V18sp-4               2.26
V18sp-5               1.59
V18sp-6 (Coupling)    1.18
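A single summary figure for the thirteen speed-up labels on this slide can be obtained with a geometric mean, the same kind of average the deck mentions for its benchmark classes. The sketch below uses Python's standard `statistics.geometric_mean` (available from Python 3.8); summarizing this particular slide that way is a choice made here, not something the slide does.

```python
from statistics import geometric_mean, mean

# Speed-up data labels from the chart (10 CPU cores vs. 9 CPU cores + 1 GP100).
GPU_SPEEDUPS = [1.89, 1.39, 1.33, 1.30, 1.40, 1.65, 1.61,
                1.13, 1.03, 1.54, 2.26, 1.59, 1.18]

overall = geometric_mean(GPU_SPEEDUPS)
arithmetic = mean(GPU_SPEEDUPS)

print(f"Geometric mean speed-up:  {overall:.2f}")
print(f"Arithmetic mean speed-up: {arithmetic:.2f}")
print(f"Range: {min(GPU_SPEEDUPS):.2f} to {max(GPU_SPEEDUPS):.2f}")
```

The geometric mean is the usual choice for averaging ratios because a single large speed-up does not pull it up as strongly as it does the arithmetic mean. The overall value does not depend on which label belongs to which benchmark.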

Take Advantage of the Latest HPC Architectures: ANSYS Mechanical 18.1 with NVIDIA GPU

ANSYS Application Examples

When the GPU accelerator is used, the job speeds up by 2.5 times with 2 cores, by 2.1 times with 4 cores, and by 1.7 times with 8 cores.

When the GPU accelerator is used with 16 cores, the job speeds up by 6.33 times.

Chart annotations: 2.5x, 2.1x, 1.7x (higher is better)

Hardware Configuration:

• HP Z840 workstation with dual E5-2699v4 (2.2 GHz), 128 GB of 2400 MHz memory

• Optional NVIDIA card: Tesla K40c or Quadro GP100


Take Advantage of the Latest HPC Architectures: ANSYS Mechanical 17.0 with NVIDIA GPU

ANSYS Application Examples

NVIDIA-GPU Solution Fit for ANSYS Mechanical

GPUs accelerate the solver part of the analysis; consequently, problems with high solver workloads benefit the most from GPUs

• Characterized by both high DOF and high factorization requirements

• Models with solid elements (such as castings) and more than 500K DOF experience good speedups

Better performance when run in DMP mode than in SMP mode

GPU and system memories both play important roles in performance

• Sparse solver:

– Bulkier and/or higher-order FE models are good and will be accelerated

– If the model exceeds 5M DOF, then use a single GPU with 12 GB memory (Tesla K40, Quadro K6000 / GP100, P100)

• PCG/JCG solver:

– Memory saving (MSAVE) option should be turned off for enabling GPUs

– Models with lower Level of Difficulty value (Lev_Diff) are better suited for GPUs
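The guidance on this slide can be condensed into a small pre-flight checklist. The function below is purely illustrative: `gpu_fit_notes` and its parameters are hypothetical names, it is not part of any ANSYS API, and it only encodes the thresholds quoted on the slide (500K DOF, 5M DOF, MSAVE off, DMP preferred over SMP).

```python
def gpu_fit_notes(dof, solver, parallel_mode="DMP", msave_on=False):
    """Return rule-of-thumb notes for a planned GPU-accelerated Mechanical run."""
    notes = []
    solver = solver.lower()

    if dof > 500_000:
        notes.append("More than 500K DOF: good speedups are reported for solid-element models.")
    else:
        notes.append("500K DOF or fewer: the solver workload may be too small to benefit much.")

    if parallel_mode.upper() != "DMP":
        notes.append("Prefer DMP mode: it performs better than SMP mode with GPUs.")

    if solver == "sparse" and dof > 5_000_000:
        notes.append("Sparse solver above 5M DOF: use a single GPU with 12 GB memory.")

    if solver in ("pcg", "jcg") and msave_on:
        notes.append("Turn the MSAVE option off: it must be disabled to enable GPUs.")

    return notes


for line in gpu_fit_notes(dof=6_000_000, solver="sparse", parallel_mode="SMP"):
    print("-", line)
```

A checklist like this only flags the conditions named on the slide; actual speedups still depend on the model, the solver settings, and the GPU in use.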

Optimized for the Latest HPC Architectures: ANSYS Fluent 19.0

ANSYS Application Example

Optimized for the Latest HPC Architectures: ANSYS Fluent 18.1

ANSYS Application Example

Case Details:

• Boeing Landing Gear Analysis; 15 million mixed cells, 100 iterations

Hardware Configuration:

• Dual Intel Xeon E5-2698v3 (2.3GHz), 256GB, Tesla P100


Optimized for the Latest HPC Architectures - ANSYS Fluent 18.1

ANSYS Application Example

Case Details:
• F1 race car model; 140 million hexa-core cells; pseudo transient solver is off; 100 iterations
Hardware Configuration:
• Dual Intel Xeon E5-2698v3 (2.3GHz), 256GB, Tesla P100


Optimized for the Latest HPC Architectures - ANSYS Fluent 18.1

ANSYS Application Example

Case Details:
• External flow over truck; 14 million mixed cells; until convergence
Hardware Configuration:
• Dual Intel Xeon E5-2698v3 (2.3GHz), 256GB, Tesla P100


Optimized for the Latest HPC Architectures - ANSYS Fluent 18.0

ANSYS Application Example

Case Details:
• 9.6 million cell pipe benchmark
Hardware Configuration:
• Cluster of XL250 Gen9s with E5-2690v4, 128 GB of 2400 MHz memory and 2 NVIDIA K80s per node


Optimized for the Latest HPC Architectures - ANSYS Fluent 17.0

ANSYS Application Example

• 2x Intel Xeon Broadwell-EP (Xeon E5-2690 v4 2.6 GHz) 16-core CPU, Quadro GP100, Windows 10, 256 GB RAM
• Customer model: automotive water-cooled engine jacket (5.5M cells)

Comparison of time until convergence (not just the calculations that are accelerated on the GPU)


NVIDIA-GPU Solution Fit for ANSYS Fluent


NVIDIA-GPU Solution Fit for ANSYS Fluent - Supported Hardware Configurations

[Figure: diagrams of CPU/GPU node layouts]

Supported configurations require:
● Homogeneous process distribution
● Homogeneous GPU selection
● Number of processes must be an exact multiple of the number of GPUs

Configurations that violate these requirements:
● Some nodes with 16 processes and some with 12 processes
● Some nodes with 2 GPUs and some with 1 GPU
● 15 processes not divisible by 2 GPUs
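
The three layout rules on this slide can be checked mechanically before a job is submitted. The following is a minimal sketch, not part of Fluent; the function name and the per-node list inputs are hypothetical.

```python
def check_gpu_layout(procs_per_node, gpus_per_node):
    """Return a list of violated layout rules (empty list means the layout passes).

    procs_per_node and gpus_per_node are hypothetical inputs: one entry per node.
    """
    problems = []
    # Rule 1: every node runs the same number of processes.
    if len(set(procs_per_node)) > 1:
        problems.append("process distribution is not homogeneous")
    # Rule 2: every node uses the same number of GPUs.
    if len(set(gpus_per_node)) > 1:
        problems.append("GPU selection is not homogeneous")
    # Rule 3: processes on a node must be an exact multiple of its GPUs.
    for procs, gpus in zip(procs_per_node, gpus_per_node):
        if gpus > 0 and procs % gpus != 0:
            problems.append(f"{procs} processes not divisible by {gpus} GPUs")
            break
    return problems
```

For example, `check_gpu_layout([16, 16], [2, 2])` returns an empty list, while each of the three counterexamples on the slide returns one violation.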


NVIDIA-GPU Solution Fit for ANSYS Fluent - Power Consumption Study

• Adding GPUs to a CPU-only node resulted in a 2.1x speedup while reducing energy consumption by 38%
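
The speedup and the energy saving together say something about power draw. Assuming energy is average power multiplied by run time for the same job (an assumption, since the slide does not state how energy was measured), the implied power ratio of the GPU node can be worked out as below. The function name is illustrative.

```python
def implied_power_ratio(speedup, energy_reduction):
    """Average power of the accelerated node relative to the CPU-only node.

    energy = power * time, so
    energy_ratio = power_ratio * time_ratio = power_ratio / speedup.
    """
    energy_ratio = 1.0 - energy_reduction
    return energy_ratio * speedup


ratio = implied_power_ratio(speedup=2.1, energy_reduction=0.38)
print(f"{ratio:.2f}")  # 1.30
```

Under that assumption the GPU-equipped node draws roughly 30% more power but finishes so much sooner that total energy per job still drops.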


NVIDIA-GPU Solution Fit for ANSYS Fluent

GPUs accelerate the AMG solver portion of the CFD analysis, and thus benefit problems with relatively high %AMG
• Coupled solvers have high %AMG, in the range of 60-70%
• Fine meshes and low-dissipation problems have high %AMG

In some cases, pressure-based coupled solvers offer faster convergence compared to segregated solvers (problem-dependent)

The whole problem must fit on GPUs for the calculations to proceed
• In the pressure-based coupled solver, each million cells needs approx. 4 GB of GPU memory
• High-memory cards such as Tesla K80, Quadro K6000 / GP100 or P100 are ideal

Moving scalar equations such as turbulence to the GPU may not benefit much because of low workloads (using ‘scalar yes’ option in ‘amg-options’)

Better performance on lower CPU core counts
• A ratio of 3 or 4 CPU cores to 1 GPU is recommended
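
The memory and core-ratio guidance above can be turned into a rough sizing estimate. This is a minimal sketch using only the slide's rules of thumb (about 4 GB of GPU memory per million cells for the pressure-based coupled solver, and 3 or 4 CPU cores per GPU). The function name is hypothetical, and the per-GPU memory passed in is an assumption you must take from your own card's specification.

```python
import math

GB_PER_MILLION_CELLS = 4.0  # slide's estimate for the pressure-based coupled solver


def size_gpu_job(cells_millions, gpu_mem_gb, cores_per_gpu=4):
    """Return (GPU memory needed in GB, GPUs needed, CPU cores to request)."""
    needed_gb = cells_millions * GB_PER_MILLION_CELLS
    # The whole problem must fit on the GPUs, so round up.
    gpus = math.ceil(needed_gb / gpu_mem_gb)
    # Cores are a whole multiple of GPUs, as the layout rules require.
    return needed_gb, gpus, gpus * cores_per_gpu
```

For a 14 million cell case on cards assumed to have 16 GB each, `size_gpu_job(14, 16)` gives 56 GB, 4 GPUs and 16 CPU cores; with `cores_per_gpu=3` the core count drops to 12.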


Optimized for the Latest HPC Architectures - ANSYS HFSS Transient 18.1

ANSYS Application Examples

[Figure: bar chart of speedup (axis 0 to 9) for the benchmarks cauer, DifferntialVia, Dipole_PML, F35_800Mhz, GSM_Antenna and PECMine; series: Xeon E5-2687W, Tesla K40, Quadro GP100]

Comparison of entire simulation solution time (not just the calculations that are accelerated on the GPU)

• 2x Intel Xeon E5-2687W (3.1 GHz) 8-core CPU.
• Tesla K40. Tesla GP100. 256 GB RAM. Windows 7 x64.


Optimized for the Latest HPC Architectures - ANSYS HFSS Transient 18.1

ANSYS Features & Capabilities

• Automatic job assignment for parametric sweeps or network analyses with multiple excitations

• Speedup scales linearly with respect to the number of GPUs

• Automatic detection of GPUs attached to displays; these are excluded from GPU acceleration

GPU monitoring by nvidia-smi

CPU monitoring by Windows Task Manager
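
GPU utilization can also be sampled from a script rather than watched interactively. The sketch below shells out to `nvidia-smi` with its CSV query options and parses the result; the function names are illustrative, and the parser assumes every field is numeric (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import subprocess

# nvidia-smi query: one CSV row per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse the CSV rows produced by QUERY into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `sample_gpus()` in a loop during a solve gives a simple utilization trace; it requires the NVIDIA driver tools to be installed on the machine.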


Optimized for the Latest HPC Architectures - ANSYS HFSS 18.1

ANSYS Application Examples

• 2x Intel Xeon Haswell-EP [Xeon E5-2695 v3 2.3 GHz] 14-core CPU.
• Tesla K80. Tesla P100. 256 GB RAM. CentOS 7.2 64-bit.

Comparison of entire simulation solution time (not just the calculations that are accelerated on the GPU)

[Figure: horizontal bar chart (axis 0 to 4) for the F35, HitachiCar and EBG_Ground_plane models; series: 8 CPU Cores, 8 CPU Cores + 1x K80, 8 CPU Cores + 1x P100]


Optimized for the Latest HPC Architectures - ANSYS Maxwell3D 18.1

ANSYS Application Examples

• Benchmark Model: T.E.A.M. Problem 21. Eddy Current Loss in Power Transformer.
• 2x Intel Xeon Haswell-EP [Xeon E5-2695 v3 2.3 GHz] 14-core CPU.
• Tesla K80. Tesla P100. 256 GB RAM. CentOS 7.2 64-bit.

Comparison of entire simulation solution time (not just the calculations that are accelerated on the GPU)

[Figure: bar chart (axis 0 to 1.4); series: 8 CPU Cores, 8 CPU Cores + 1x K80, 8 CPU Cores + 1x P100]


NVIDIA-GPU Solution Fit for ANSYS HFSS & Maxwell3D

GPUs accelerate the hybrid solver in HFSS Transient
• The GPU-accelerated hybrid solver clearly outperforms the implicit solver.
• Problems with uniform meshes and a high operating frequency benefit the most.
• Usually good speedups can be achieved starting from 140K DOFs.
• The hybrid solver in HFSS Transient detects projects not suitable to run on GPUs (speedup < 1x) and falls back to CPUs automatically.

GPUs accelerate the multi-frontal sparse direct solver in HFSS and Maxwell3D
• Usually good speedups can be achieved starting from 2M DOFs.
• Double precision only.
• These solvers also use GPUs only if there is a potential speedup.


What Hardware Configuration to Select?

CPUs? GPUs?
SMP vs. DMP
Interconnects? Clusters?
HDD vs. SSD


Understanding the effect of the interconnect

• Need fast interconnects to feed fast processors
– Two main characteristics for each interconnect: latency and bandwidth
– Distributed ANSYS is highly bandwidth bound

+--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+

Release: 14.5    Build: UP20120802    Platform: LINUX x64
Date Run: 08/09/2012    Time: 23:07

Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

Total number of cores available    : 32
Number of physical cores available : 32
Number of cores requested          : 4 (Distributed Memory Parallel)
MPI Type: INTELMPI

Core   Machine Name   Working Directory
----------------------------------------------------
0      hpclnxsmc00    /data1/ansyswork
1      hpclnxsmc00    /data1/ansyswork
2      hpclnxsmc01    /data1/ansyswork
3      hpclnxsmc01    /data1/ansyswork

Latency time from master to core 1 = 1.171 microseconds
Latency time from master to core 2 = 2.251 microseconds
Latency time from master to core 3 = 2.225 microseconds

Communication speed from master to core 1 = 7934.49 MB/sec   Same machine
Communication speed from master to core 2 = 3011.09 MB/sec   QDR Infiniband
Communication speed from master to core 3 = 3235.00 MB/sec   QDR Infiniband
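
The latency and bandwidth figures reported in the statistics above can be combined into a first-order estimate of message transfer time. The sketch below uses the common simplification time = latency + size / bandwidth, which is a modelling assumption rather than anything ANSYS reports, and it treats 1 MB as 10^6 bytes.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_s):
    """First-order transfer time in microseconds: latency + size / bandwidth.

    With 1 MB taken as 10**6 bytes, 1 MB/s equals 1 byte per microsecond,
    so bytes / (MB/s) is already in microseconds.
    """
    return latency_us + message_bytes / bandwidth_mb_s


# Figures taken from the statistics output above.
same_machine = {"latency_us": 1.171, "bandwidth_mb_s": 7934.49}
qdr_infiniband = {"latency_us": 2.251, "bandwidth_mb_s": 3011.09}

for size in (1_000, 10_000_000):
    local = transfer_time_us(size, **same_machine)
    remote = transfer_time_us(size, **qdr_infiniband)
    print(f"{size} bytes: same machine {local:.1f} us, QDR Infiniband {remote:.1f} us")
```

For small messages the latency term dominates, while for large messages the ratio of the two times approaches the ratio of the bandwidths, which is why a bandwidth-bound solver is sensitive to the interconnect choice.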


Understanding the effect of the interconnect - ANSYS Fluent

Exhaust Model

• 7.6M cells
• Transient simulation with explicit time stepping for engine startup cycle
• Fujitsu PRIMERGY CX250 HPC systems (E5-2690v2 with 20 and E5-2697v2 with 24 cores per node, resp.)

For CFD we can compare the performance of InfiniBand (IB) vs Gigabit Ethernet (GiGE): GiGE starts to drop off after 2 nodes


Understanding the effect of the interconnect - ANSYS Fluent

• Fluent 18.0 performance measured using benchmark sets ranging from 2 to 14 million cells.

• Intel Xeon E5 v4 processor family, up to 96 nodes (3456 cores).

• At lower core counts (~576 cores), Intel Omni-Path and EDR InfiniBand perform comparably; at higher core counts, Omni-Path outperforms EDR InfiniBand by ~25-47%.


Understanding the effect of the interconnect - ANSYS Fluent

• For the Combustor 12 million cell model, Omni-Path (OPA) performs ~33% better than EDR InfiniBand (using 36 nodes, 3456 cores).

• For the Open Racecar 280 million cell case, OPA maintains nearly linear scalability up to a ~7000-core run.


Understanding the effect of the interconnect - ANSYS Fluent 2019 R2

With IBM-MPI and MSMPI, the speedup saturated at 48 parallel processes in the small CFD simulation.

MSMPI is the fastest MPI in the medium-size CFD simulation.

Speed-up ratio vs. number of CPU cores: aircraft

CPU cores   4     8     12    16    32    36    48
IntelMPI    1.00  1.82  2.67  3.43  5.47  5.59  5.82
IBM-MPI     1.01  1.82  2.65  3.43  5.45  5.56  4.15
MSMPI       0.99  1.80  2.65  3.36  5.09  4.96  4.51

Speed-up ratio vs. number of CPU cores: landing gear

CPU cores   4     8     12    16    32    36    48
IntelMPI    1.00  1.81  2.61  3.32  5.46  5.73  6.00
IBM-MPI     1.01  1.94  2.78  3.58  5.65  5.88  4.57
MSMPI       1.03  1.94  2.80  3.60  5.67  5.87  6.21

• HP Z8 G4 Workstation.
• 2x Intel Xeon Platinum 8160 (2.1-3.7GHz, 24 cores) CPUs.
• 192GB (2600MHz, 8GBx24 DIMMs).
• 1TB HP Z Turbo Drive G2 (NVMe SSD)
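
Speed-up ratios are easier to judge when converted to parallel efficiency. The sketch below does this for the landing gear chart values, assuming the ratios are normalized to the 4-core IntelMPI run and that the first data series belongs to IntelMPI (taken from the legend order; neither is stated explicitly on the slide).

```python
CORES = [4, 8, 12, 16, 32, 36, 48]
# Landing gear chart values, assumed to be the IntelMPI series.
LANDING_GEAR_INTELMPI = [1.00, 1.81, 2.61, 3.32, 5.46, 5.73, 6.00]


def parallel_efficiency(cores, speedups, base_cores=4):
    """Efficiency relative to the baseline run: speedup / (cores / base_cores)."""
    return [s * base_cores / c for c, s in zip(cores, speedups)]


for cores, eff in zip(CORES, parallel_efficiency(CORES, LANDING_GEAR_INTELMPI)):
    print(f"{cores:2d} cores: {eff:.0%}")
```

Under these assumptions efficiency falls from 100% at 4 cores to 50% at 48 cores, which shows how much of the workstation's added core count is lost to scaling overhead.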


Understanding the effect of the interconnect - ANSYS Fluent

Across several MPI benchmarks, HPC-X exhibits higher performance and better scalability


Understanding the effect of the interconnect - ANSYS Mechanical

For ANSYS Mechanical, GiGE does not scale beyond 1 node!


Understanding the effect of the interconnect - ANSYS Mechanical

V13sp-5 Model
• Turbine geometry
• 2,100 K DOF
• SOLID187 FEs
• Static, nonlinear
• One iteration
• Direct sparse
• Linux cluster (8 cores per node)

[Figure: Interconnect Performance; rating (runs/day, axis 0 to 60) at 8, 16, 32, 64 and 128 cores; series: Gigabit Ethernet, DDR Infiniband]

Using faster interconnects can be helpful to realize productivity gains, particularly at higher core/node counts


Recap

• 10GiGE and InfiniBand are recommended for HPC clusters.
o Currently, only InfiniBand is recommended for large clusters.
o QDR should be more than adequate for small to medium clusters; FDR for large clusters.

• For more than 1 node, you will see a performance decrease using GiGE.
o Mechanical users should not use GiGE at all if their jobs span more than one node.


Understanding the effect of the disks/storage - ANSYS Mechanical

• Need fast hard drives to feed fast processors
– Check the bandwidth specs
– ANSYS Mechanical can be highly I/O bandwidth bound: the sparse solver in the out-of-core memory mode does lots of I/O
– Distributed ANSYS can be highly I/O latency bound: seek time to read/write each set of files causes overhead
– Consider SSDs: high bandwidth and extremely low seek times
– Consider RAID configurations:
   RAID 0 for speed
   RAID 1 or 5 for redundancy
   RAID 10 for speed and redundancy
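
The trade-off between the RAID levels listed above also shows up in usable capacity and fault tolerance. The sketch below uses the standard definitions of RAID 0, 1, 5 and 10 with equal-size disks; the function name is illustrative and real controllers may impose additional limits.

```python
def raid_summary(level, disks, disk_tb):
    """Return (usable capacity in TB, disk failures that are always survivable)."""
    if level == 0:      # striping only: all capacity, no redundancy
        minimum, usable, survivable = 2, disks * disk_tb, 0
    elif level == 1:    # mirroring: one disk of capacity, all but one may fail
        minimum, usable, survivable = 2, disk_tb, disks - 1
    elif level == 5:    # striping with single parity
        minimum, usable, survivable = 3, (disks - 1) * disk_tb, 1
    elif level == 10:   # striped mirror pairs; one failure is always safe
        minimum, usable, survivable = 4, (disks // 2) * disk_tb, 1
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum or (level == 10 and disks % 2):
        raise ValueError(f"RAID {level} cannot be built from {disks} disks")
    return usable, survivable
```

With four 1 TB disks, RAID 0 yields 4 TB and no protection, RAID 5 yields 3 TB, and RAID 10 yields 2 TB, each of the latter two surviving at least one disk failure.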


Understanding the effect of the disks/storage - ANSYS Mechanical 18.1

When the working directory is assigned to the Z Turbo Drive G2 and the BMT models for the CG solver are run on more than 16 cores, jobs speed up by 1.4 times.

When the working directory is assigned to the Z Turbo Drive G2 and the BMT models for the SPARSE solver are run on more than 16 cores, jobs speed up by 1.8-2.6 times.

[Figure: two bar charts (higher is better); CG solver speedups of 1.4x, 1.4x and 1.4x; SPARSE solver speedups of 1.8x, 2.6x and 2.1x]

Hardware Configuration:
• HP Z840 workstation with dual E5-2699v4 (2.2 GHz), 128 GB of 2400 MHz memory
• Optional storage: Micron SATA SSD (no RAID) or HP Z Turbo Drive G2 512GB (no RAID)


Understanding the effect of the disks/storage - ANSYS Mechanical

[Figure: rating vs. number of cores]

Using faster disks can be helpful to realize productivity gains, particularly at higher core/node counts


Understanding the effect of the disks/storage - ANSYS Fluent

Landing Gear Noise Predictions using Scale-Resolving Simulations (180M cell model using the pressure-based segregated solver)


Understanding the effect of the disks/storage - ANSYS Fluent

Asynchronous I/O for Linux Fluent
• Total write time 3-5x quicker over NFS
• Even larger speed-ups on bigger cases and local disk (up to 10x)

Mesh   File   Location   Async I/O   Time
15M    Cas    NFS        OFF         217s
15M    Cas    NFS        ON          62s
15M    Dat    NFS        OFF         113s
15M    Dat    NFS        ON          8s
30M    Cas    NFS        OFF         207s
30M    Cas    NFS        ON          75s
30M    Dat    NFS        OFF         144s
30M    Dat    NFS        ON          10s
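
The "3-5x" claim can be checked directly against the write times listed above by summing the case and data file times per mesh. This is a small worked calculation; the dictionary layout and function name are illustrative.

```python
# (mesh, file type) -> (seconds with async I/O OFF, seconds with async I/O ON)
WRITE_TIMES_S = {
    ("15M", "Cas"): (217, 62),
    ("15M", "Dat"): (113, 8),
    ("30M", "Cas"): (207, 75),
    ("30M", "Dat"): (144, 10),
}


def total_write_speedup(mesh):
    """Speedup of total (case + data) write time when async I/O is enabled."""
    off = sum(t[0] for (m, _), t in WRITE_TIMES_S.items() if m == mesh)
    on = sum(t[1] for (m, _), t in WRITE_TIMES_S.items() if m == mesh)
    return off / on


for mesh in ("15M", "30M"):
    print(f"{mesh}: {total_write_speedup(mesh):.1f}x")
# 15M: 4.7x
# 30M: 4.1x
```

Both totals fall inside the quoted 3-5x range; the data files alone speed up far more than the case files.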


Recap

• I/O is very important for the Mechanical solver
o RAID 0 mandatory for multiple disks
o SSDs recommended for speed, 15k SAS drives

• Fluent and CFX won’t require fast local disk access for most customers (for most types of job)

• Parallel file systems can meet the requirements of both types of solver


Agenda

• HPC Terminology

• Hardware Considerations

• Solution Reference Architecture

• Supporting “HPC Resources Anywhere”

• HPC Parallel & Parametric Licensing


ANSYS CFX/Fluent Starter Cluster


ANSYS Mechanical Starter Cluster


Supporting “HPC Resources Anywhere”

ANSYS Features & Capabilities
● ANSYS supports and certifies the leading remote display software solutions (VNC, DCV, Exceed onDemand, and Microsoft Remote Desktop).
● ANSYS supports and certifies the leading job schedulers (LSF, PBS Professional, UGE/SGE, MOAB/Torque, Microsoft HPC).
● ANSYS supports cloud portal workflows (IBM Platform Application Center, Altair Compute Manager, NICE EnginFrame, ANSYS EKM).

Customer Benefits
● Easy access to more powerful HPC resources, to simulate models that were simply impossible in the past.
● Collaborate virtually from anywhere with any client device.
● Increase HPC resource utilization while lowering IT support overhead.
● Reduce network overload and security concerns by eliminating the need to move big simulation data sets around!


Supporting “HPC Resources Anywhere” - Remote Display Support

ANSYS Features & Capabilities

ANSYS 2019 R3 supports the following remote display solutions:
• Nice Desktop Cloud Visualization (DCV) 2017.4
o Linux server + Linux/Windows client
• OpenText Exceed onDemand 8 SP11
o Linux server + Linux/Windows client
• OpenText Exceed TurboX 12.0
o Linux server + Linux/Windows client
• VNC Connect 6.4 (with VirtualGL 2.6)
o Linux server + Linux/Windows client
• Microsoft Remote Desktop (on Windows cluster)

Hardware requirements for remote visualization servers:
• GPU-capable video cards
• Large amounts of RAM, so that multiple users can run ANSYS applications and pre/post processing at the same time


Supporting “HPC Resources Anywhere” - Virtual Desktop (VDI) Support

ANSYS Features & Capabilities

Support for virtual GPU
• For less graphically intensive work: GPU to be shared between multiple virtual machines (VMs)

GPU pass-through still for best performance
• One GPU per VM, up to 8 VMs per machine (K1, K2 cards); memory constraints will limit in any case

Supported at ANSYS 2019 R3:


HPC Parallel & Parametric Licensing

❏ Standalone HPC licenses
Basic option to license individual cores for low-level HPC (e.g. 5-6 cores)

❏ HPC Pack
HPC product rewarding volume parallel processing for high-fidelity individual simulations

❏ HPC Workgroup
HPC product rewarding volume parallel processing for increased simulation throughput shared among engineers throughout a single location or the world

❏ HPC Parametric Pack
HPC product enabling simultaneous execution of multiple design points while consuming just one set of licenses

HPC Packs per Simulation   Parallel Enabled (Total Cores)
1                          8 + 4 = 12
2                          32 + 4 = 36
3                          128 + 4 = 132
4                          512 + 4 = 516
5                          2048 + 4 = 2052
6                          8192 + 4 = 8196
7                          32768 + 4 = 32772
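
The core counts enabled by HPC Packs follow a regular pattern: each additional pack multiplies the pack-enabled cores by four, on top of the 4 built-in cores. The closed form below is inferred from the chart values (12, 36, 132, 516, 2052, 8196, 32772) and is not an official ANSYS formula; the function name is illustrative.

```python
BUILT_IN_CORES = 4  # cores included with the solver license


def cores_enabled(hpc_packs):
    """Total cores enabled by a number of HPC Packs on one simulation.

    Pattern inferred from the chart: 8, 32, 128, ... pack cores plus 4 built-in.
    """
    if hpc_packs < 1:
        raise ValueError("at least one HPC Pack is required")
    return 2 * 4 ** hpc_packs + BUILT_IN_CORES


print([cores_enabled(n) for n in range(1, 8)])
# [12, 36, 132, 516, 2052, 8196, 32772]
```

The seventh value, 32768 + 4 = 32772, matches the total shown on the chart.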


What’s new in ANSYS 19.0?

More products are now using ANSYS HPC
• Standalone HPC licenses, HPC Packs and HPC Workgroup become more flexible and work across physics with all ANSYS Mechanical, Fluids and Electronics products*

4 Built-in HPCs now across all physics
• 4 built-in HPCs are now included in Mechanical, Fluids and Electronics products, including ANSYS AIM and ANSYS Chemkin Enterprise.

HPC Packs are now additive
• HPC Packs become additive in nature to the 4 built-in HPCs (e.g. 1 HPC Pack licenses 8 + 4 = 12 total cores, 2 HPC Packs license 32 + 4 = 36 total cores, etc.)

* Impacted products:
• ANSYS Mechanical Pro, Premium, Enterprise
• ANSYS CFD Premium and Enterprise
• ANSYS Mechanical CFD
• ANSYS HFSS
• ANSYS AIM
• ANSYS Q3D Extractor
• ANSYS Maxwell
• ANSYS Icepak
• ANSYS Mechanical CFD Maxwell 3D
• ANSYS Chemkin-Pro and Enterprise
• ANSYS Mechanical Maxwell 3D
• ANSYS SIwave

Note: R19.0 license manager is required. For ANSYS Mechanical and Fluids products, changes are backward compatible; for ANSYS Electronics products, changes are compatible with version 19.0 and forward

Note: built-in HPCs are linked to a solver seat and cannot be shared with other solver seats!

Note: the single, standalone HPCs are not additive to the Packs


ANSYS HPC Parametric Pack License

HPC license for running parametric FEA or CFD simulations on multiple CPU cores simultaneously, and more cost effectively

Key Benefits
• Ability to automatically and simultaneously execute design points while consuming just one set of application licenses
• Scalable because number of simultaneous design points enabled increases quickly with added packs
• Amplifies complete workflow because design points can include execution of multiple applications (pre, meshing, solve, HPC, post)

Number of HPC Parametric Pack Licenses   Number of Simultaneous Design Points Enabled
1                                        4
2                                        8
3                                        16
4                                        32
5                                        64
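
The chart values for HPC Parametric Packs double with each added pack, starting from 4 design points for one pack. The closed form below is fitted to the five values shown (4, 8, 16, 32, 64) and is an assumption beyond five packs; the function name is illustrative.

```python
def design_points_enabled(parametric_packs):
    """Simultaneous design points enabled, fitted to the chart for 1 to 5 packs."""
    if parametric_packs < 1:
        raise ValueError("at least one HPC Parametric Pack is required")
    return 2 ** (parametric_packs + 1)


print([design_points_enabled(n) for n in range(1, 6)])
# [4, 8, 16, 32, 64]
```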


HPC Parametric Packs to Reduce Time for Design Variations

HPC Parametric Packs amplify both solver licenses and HPC licenses, allowing you to drastically reduce time to innovation, without the cost of additional solver or HPC licenses…

[Figure: sequential series of design points (dp1 to dp4) with unused cores; labels: One solver key without HPC; One solver key and one HPC Parametric Pack; OR Four solver keys + 4 HPC keys; 94% Reduced Time to Innovation]


Licensing GPUs for Computing

GPU acceleration can be enabled through all ANSYS HPC product licenses: ANSYS HPC, ANSYS HPC Pack and ANSYS HPC Workgroup.

Fluids / Structural products
1 GPU requires 1 HPC task as long as GPUs ≤ CPU cores

Examples:
● 2 HPC licenses enable up to 3 CPU cores + 3 GPUs through the available 6 HPC tasks
● 1 HPC Pack enables up to 6 CPU cores + 6 GPUs through the available 12 HPC tasks
● 2 HPC Packs enable up to 18 CPU cores + 18 GPUs through the available 36 HPC tasks

Electronics products
1 GPU unlocked by every 8 HPC tasks

Examples:
● 4 HPC licenses enable 1 GPU through the available 8 HPC tasks
● 1 HPC Pack enables up to 12 CPU cores + 1 GPU through the available 12 HPC tasks
● 2 HPC Packs enable up to 36 CPU cores + 4 GPUs through the available 36 HPC tasks
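
The two GPU licensing rules on this slide reduce to simple integer arithmetic on the number of available HPC tasks. The sketch below reproduces the slide's examples; the function names are illustrative, and the Fluids/Structural function assumes you want the maximum number of GPUs the rule allows.

```python
def fluids_structural_split(hpc_tasks):
    """Max-GPU split for Fluids/Structural: each GPU uses one task, GPUs <= CPU cores.

    Returns (cpu_cores, gpus).
    """
    gpus = hpc_tasks // 2
    return hpc_tasks - gpus, gpus


def electronics_gpus(hpc_tasks):
    """GPUs unlocked for Electronics products: one per 8 HPC tasks."""
    return hpc_tasks // 8


# HPC task counts used in the slide's examples.
for tasks in (6, 12, 36):
    print(tasks, fluids_structural_split(tasks))
for tasks in (8, 12, 36):
    print(tasks, electronics_gpus(tasks))
```

With 36 tasks, for instance, Fluids/Structural products can run 18 CPU cores plus 18 GPUs, while Electronics products can run 36 CPU cores plus 4 GPUs.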


Pay-per-use HPC at ANSYS 2019 R2

HPC Packs and HPC Parametric Packs are supported by usage-based ANSYS Elastic Units (AEUs) for on-premise and off-premise (i.e. cloud) deployments. Optimal for intermittent use and/or peak demands in HPC!

Consumption Rate by Product Category

1 AEU/h 2 AEU/h 4 AEU/h 8 AEU/h

Geometry Interfaces ECAD Translators Pre/Post Solver

DSO Optimization HPC Pack HPC Parametric Pack

AIM, Maxwell 2D, Simplorer


Which Type of Licensing is Right for Me?

HPC license cost decreases as more are purchased, either as HPC Packs or as HPC Workgroups.

ANSYS HPC and ANSYS HPC Workgroup give flexible use of a pool of licenses.

ANSYS HPC Pack gives “quick” scale-up but is more restrictive in how users can use it.

The ability to be more flexible is why the HPC Workgroup options cost more than the HPC Packs.

HPC Parametric Pack enables more cost-effective licensing for design exploration and optimization.


Wrap-up - Licensing

Multiple licensing options to fit different requirements.

HPC Packs for quick scale-up.

HPC Workgroup for flexibility.

GPUs are treated the same as cores in the licensing model.

As you scale up, license cost per core decreases.

Per-core pricing becomes less of an issue:
• Running on 2,000 cores instead of 20 cores comes at 1.5X, not 100X
• Filling up a 1024-core instead of a 128-core cluster with 32-core jobs will cut the price per job in half!
• Enabling 64 instead of 4 simultaneous design points comes at ~3X, not 16X


Additional Resources - Tools to Check Out!

ROI Estimator!
www.ansys.com/ws-roi-estimator

For hardware advice!
www.ansys.com/support/platform-support
www.ansys.com/hpc-webinars
www.ansys.com/hpc-cluster-appliance

Get a Free HPC Benchmark!
www.ansys.com/free-hpc-benchmark


White Papers by clicking below:

• Speed Simulation & Innovation

• The Value of High-Performance Computing for Simulation

• Debunking Six Myths of High-Performance Computing

• Higher Performance Across a Wide Range of ANSYS Fluent Simulations with the Intel Xeon Gold 6148 Processor

• Value of HPC for Ensuring Product Integrity

• Optimizing Business Value in High-Performance Engineering Computing

• ANSYS Fluent with Fujitsu PRIMERGY HPC: HVAC for Built Environment

• ANSYS® Application Benchmarking on Dell PowerEdge VRTX

• HPE Reference Architecture for Small and Medium Enterprises

• ANSYS Fluent Brings CFD Performance with Intel Processors and Fabrics

Additional Resources- IT White Papers

Additional Resources- IT White Papers & Technical Briefs

Technical Briefs by clicking below:

• Dell EMC HPC System for Manufacturing: ANSYS Application Performance

• SGI Technology Guide for ANSYS Mechanical Analysts

• SGI Technology Guide for ANSYS Fluent Analysts

• Workstations for FEA Simulation

• HP Reference Architecture for Small and Medium Enterprises

White Papers by clicking below:

• Mechanical Engineer Productivity Boosted by Higher-Core CPUs

• Focus on Faster Mechanical Simulation

• Workstations for FEA Simulation

• Intel Solid-State Drives Increase Productivity of Product Design and Simulation

Click on webinars related to HPC/IT to find more recorded and upcoming ones!

Additional Resources- IT Webinars

Watch recorded webinars by clicking below:

• Understand How High-Performance Compute Can Accelerate Your Simulation Throughput

• Speed-up Your Desktop ANSYS FEA Simulations with ANSYS HPC

• Getting the Most from Your ANSYS Simulation Applications

• How to Evaluate and Improve the Performance of ANSYS Mechanical

• Extreme Scalability for High-Fidelity CFD Simulations

• Industry Perspectives on Extreme Scalability for High-Fidelity CFD Simulations

• Connect with Me

– [email protected]

• Connect with ANSYS, Inc.

– LinkedIn ANSYSInc

– Twitter @ANSYS

– Facebook ANSYSInc

• Follow our Blog

– ansys-blog.com

Thank You!