STAR-CCM+ and STAR-CD Performance Benchmark and Profiling, March 2009

Page 1

STAR-CCM+ and STAR-CD Performance Benchmark and Profiling

March 2009

Page 2

Note

• The following research was performed under the HPC Advisory Council activities

– Participating vendors: AMD, Dell, Mellanox

– Compute resource - HPC Advisory Council Cluster Center

• The participating members would like to thank CD-adapco for their support and guidance

• For more information, please refer to:

– www.mellanox.com, www.dell.com/hpc, www.amd.com

Page 3

STAR-CCM+ and STAR-CD

• STAR-CCM+

– An engineering process-oriented CFD tool

– Client-server architecture, object-oriented programming

– Delivers the entire CFD process in a single integrated software environment

• STAR-CD

– An integrated platform for multi-physics simulations

– A long-established platform for industrial CFD simulation

– Bridging the gap between CFD and structural mechanics

• Developed by CD-adapco

Page 4

Objectives

• The research presented here was conducted to provide best practices for:

– STAR-CCM+ and STAR-CD performance benchmarking

– Interconnect performance comparisons

– Ways to increase STAR-CCM+ and STAR-CD productivity

– Understanding STAR-CD communication patterns

– MPI library comparisons

Page 5

Test Cluster Configuration

• Dell™ PowerEdge™ SC 1435 24-node cluster

• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs

• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs

• Mellanox® InfiniBand DDR Switch

• Memory: 16GB DDR2 800MHz per node

• OS: RHEL5U2, OFED 1.4 InfiniBand SW stack

• MPI: Platform MPI 5.6.5, HP-MPI 2.3

• Application: STAR-CCM+ Version 3.06, STAR-CD Version 4.08

• Benchmark Workload

– STAR-CD: A-Class (Turbulent Flow around A-Class Car)

– STAR-CCM+: Auto Aerodynamics test

Page 6

Mellanox InfiniBand Solutions

• Industry Standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect

• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry

• Reliable with congestion management

• Efficient
– RDMA and transport offload
– Kernel bypass
– CPU focuses on application processing

• Scalable for Petascale computing and beyond

• End-to-end quality of service

• Virtualization acceleration

• I/O consolidation including storage

[Chart: The InfiniBand Performance Gap is Increasing; InfiniBand delivers the lowest latency. Roadmap bandwidths: InfiniBand 20Gb/s to 40Gb/s, 60Gb/s, 120Gb/s (80Gb/s 4X, 240Gb/s 12X) vs. Fibre Channel and Ethernet]

Page 7

Quad-Core AMD Opteron™ Processor

• Performance
– Quad-Core
• Enhanced CPU IPC
• 4x 512K L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating Point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core
– Integrated Memory Controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz

• Scalability
– 48-bit Physical Addressing

• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processors

[Diagram: Quad-Core AMD Opteron™ processor with Direct Connect Architecture: dual-channel registered DDR2 memory, 8 GB/s HyperTransport links between processors, PCI-E® bridges, and the I/O hub (USB, PCI)]

Page 8

Dell PowerEdge Servers helping Simplify IT

• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 servers

– Servers optimized for High Performance Computing environments

– Building Block Foundations for best price/performance and performance/watt

• Dell HPC Solutions
– Scalable Architectures for High Performance and Productivity

– Dell's comprehensive HPC services help manage the lifecycle requirements.

– Integrated, Tested and Validated Architectures

• Workload Modeling
– Optimized System Size, Configuration and Workloads

– Test-bed Benchmarks

– ISV Applications Characterization

– Best Practices & Usage Analysis

Page 9

STAR-CCM+ Benchmark Results - Interconnect

• Input dataset
– Auto Aerodynamics test

• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 136% faster run time

Lower is better; HP MPI

[Chart: STAR-CCM+ Benchmark Results (Auto Aerodynamics Test); y-axis: Elapsed Time/Iteration (0-18); x-axis: Number of Nodes (1, 2, 4, 8, 16, 24); series: 10GigE, InfiniBand DDR]

Page 10

STAR-CCM+ Productivity Results

• Two cases are presented
– A single job over the entire system
– Four jobs, each on two cores per server

• Productivity increases by allowing multiple jobs to run simultaneously
– Up to 30% increase in system productivity

Higher is better; HP MPI over InfiniBand

[Chart: STAR-CCM+ Productivity (Auto Aerodynamics Test); y-axis: Iterations/Day (0-120,000); x-axis: Number of Nodes (4, 8, 12, 16, 20, 24); series: 1 Job, 4 Jobs]

Page 11

STAR-CCM+ Productivity Results - Interconnect

• Test case
– Four jobs, each on two cores per server

• InfiniBand DDR provides higher productivity compared to 10GigE
– Up to 25% more iterations per day

• InfiniBand maintains consistent scalability as cluster size increases

Higher is better; HP MPI

[Chart: STAR-CCM+ Productivity (Auto Aerodynamics Test); y-axis: Iterations/Day (0-120,000); x-axis: Number of Nodes (4, 8, 12, 16, 24); series: 10GigE, InfiniBand DDR]

Page 12

STAR-CD Benchmark Results - Interconnect

• Test case
– A single job over the entire system
– Input dataset: A-Class

• InfiniBand DDR enhances performance and scalability
– Up to 34% and 10% more jobs/day compared to GigE and 10GigE, respectively
– The performance advantage of InfiniBand increases as cluster size scales

Higher is better; Platform MPI

[Chart: STAR-CD Performance Advantage of InfiniBand and 10GigE over GigE (A-Class); y-axis: Percentage (0-35%); x-axis: Number of Nodes (4, 8, 16, 20, 24); series: 10GigE, InfiniBand DDR]

Page 13

Maximize STAR-CD Performance per Core

[Chart: STAR-CD Benchmark Results (A-Class); y-axis: Total Elapsed Time (s) (0-20,000); x-axis: cores used per processor (4-cores/proc, 2-cores/proc, 1-core/proc); series: 16 Cores, 32 Cores, 48 Cores. Lower is better]

• Test case
– A single job over the entire system
– Using one, two, or four cores in each quad-core AMD processor
• Remaining cores are kept idle

• Using only some of the cores per processor improves single-job run time

• Running multiple simulations simultaneously is recommended
– For maximum productivity (see slide 10)

Platform MPI

Page 14

STAR-CCM+ Profiling – MPI Functions

• MPI_Testall, MPI_Bcast, and MPI_Recv are the most heavily used MPI functions (see the sketch below the chart)

[Chart: STAR-CCM+ MPI Profiling; y-axis: Number of Messages (log scale, 1 to 10^10); x-axis: MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv); series: 4, 8, 16, 22 Nodes]
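The prominence of MPI_Testall suggests that STAR-CCM+ progresses nonblocking communication while it computes, polling outstanding requests rather than blocking on them. Below is a minimal sketch of that general pattern; the peer count, message length, and do_local_work placeholder are assumptions for illustration, not STAR-CCM+'s actual code.

```c
/* Illustrative sketch only: overlap computation with nonblocking
 * communication, polling progress with MPI_Testall.
 * Build (assumed): mpicc testall_sketch.c -o testall_sketch */
#include <mpi.h>

#define NPEERS  4    /* assumed neighbor count */
#define MSG_LEN 512  /* assumed message length in doubles */

static void do_local_work(void) { /* placeholder for solver work */ }

int main(int argc, char **argv)
{
    int rank, size, done = 0;
    double sendbuf[MSG_LEN] = {0};
    double recvbuf[NPEERS][MSG_LEN];
    MPI_Request reqs[2 * NPEERS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Post nonblocking receives and sends to a set of peers */
    for (int i = 0; i < NPEERS; i++) {
        int peer = (rank + i + 1) % size;
        MPI_Irecv(recvbuf[i], MSG_LEN, MPI_DOUBLE, peer, 0,
                  MPI_COMM_WORLD, &reqs[i]);
        MPI_Isend(sendbuf, MSG_LEN, MPI_DOUBLE, peer, 0,
                  MPI_COMM_WORLD, &reqs[NPEERS + i]);
    }

    /* Poll all outstanding requests; keep computing until complete */
    while (!done) {
        MPI_Testall(2 * NPEERS, reqs, &done, MPI_STATUSES_IGNORE);
        if (!done)
            do_local_work();  /* overlap computation and communication */
    }

    MPI_Finalize();
    return 0;
}
```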

Page 15

STAR-CCM+ Profiling – Data Transferred

[Chart: STAR-CCM+ MPI Profiling; y-axis: Number of Messages (Millions, 0-50); x-axis: Message Size ([0..64B], [65..256B], [257B..1KB], [1..4KB], [4..16KB], [16..64KB], [64..256KB], [256KB..1M], [1..4M], [4M..infinity]); series: 4, 8, 16, 22 Nodes]

• Most MPI messages are within 4KB in size
• The number of messages increases with cluster size

Page 16

STAR-CCM+ Profiling – Timing

• MPI_Testall, MPI_Bcast, and MPI_Recv have relatively large overhead

• Overhead increases with cluster size

[Chart: STAR-CCM+ MPI Profiling; y-axis: Total Overhead (s) (0-7,000); x-axis: MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv); series: 4, 8, 16, 22 Nodes]

Page 17

STAR-CCM+ Profiling Summary

• STAR-CCM+ was profiled to determine its networking dependency

• Most used message sizes
– Messages under 4KB
– The number of messages increases with cluster size

• Interconnect effect on STAR-CCM+ performance
– Both interconnect latency and throughput influence STAR-CCM+ performance due to its messaging pattern

Page 18

STAR-CD Profiling – MPI Functions

• MPI_Sendrecv and MPI_Allreduce are the most heavily used MPI functions (a sketch of this pattern follows the chart)

[Chart: STAR-CD MPI Profiling; y-axis: Number of Messages (Millions, 0-12); x-axis: MPI function (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv); series: 4, 8, 12, 16, 20, 24 Nodes]
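This mix of MPI_Sendrecv and MPI_Allreduce is characteristic of domain-decomposed CFD: neighbor-to-neighbor halo (ghost-cell) exchanges carry the medium-size messages, while a small-message global reduction checks the residual each iteration. The sketch below illustrates that general pattern under an assumed 1-D ring decomposition; it is not STAR-CD's actual code.

```c
/* Illustrative sketch only: STAR-CD-like communication pattern with
 * MPI_Sendrecv halo exchanges and an MPI_Allreduce residual check.
 * Assumes a 1-D ring decomposition; not STAR-CD source code. */
#include <mpi.h>

#define NCELLS 1024  /* assumed local cell count */

int main(int argc, char **argv)
{
    int rank, size;
    double field[NCELLS + 2] = {0};  /* interior cells + two ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    for (int iter = 0; iter < 100; iter++) {
        /* Medium-size point-to-point traffic: exchange ghost cells */
        MPI_Sendrecv(&field[NCELLS], 1, MPI_DOUBLE, right, 0,
                     &field[0],      1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&field[1],          1, MPI_DOUBLE, left,  1,
                     &field[NCELLS + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... local solver update on field[1..NCELLS] goes here ... */

        /* Small-message, latency-sensitive collective: global residual */
        double local_res = 1.0;  /* placeholder residual */
        double global_res = 0.0;
        MPI_Allreduce(&local_res, &global_res, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        if (global_res < 1e-8)
            break;  /* converged */
    }

    MPI_Finalize();
    return 0;
}
```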

Page 19

STAR-CD Profiling – Data Transferred

• Most point-to-point MPI messages are between 1KB and 8KB in size
• The number of messages increases with cluster size

[Chart: STAR-CD MPI Profiling (MPI_Sendrecv); y-axis: Number of Messages (Millions, 0-7); x-axis: Message Size ([0-128B), [128B-1K), [1K-8K), [8K-256K), [256K-1M)); series: 4, 8, 12, 16, 20, 22 Nodes]

Page 20

STAR-CD Profiling – Data Transferred

• Most MPI collective messages are smaller than 128 bytes
• The number of messages increases with cluster size

[Chart: STAR-CD MPI Profiling (MPI_Allreduce); y-axis: Number of Messages (Millions, 0-2.5); x-axis: Message Size ([0-128B), [128B-1K), [1K-8K), [8K-256K), [256K-1M)); series: 4, 8, 12, 16, 20, 24 Nodes]

Page 21

STAR-CD Profiling – Timing

[Chart: STAR-CD MPI Profiling; y-axis: Total Overhead (s) (0-35,000); x-axis: MPI function (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv); series: 4, 8, 12, 16, 20, 24 Nodes]

• MPI_Allreduce and MPI_Sendrecv have large overhead
• Overhead increases with cluster size

Page 22

MPI Performance Comparison – MPI_Sendrecv

Higher is better

• HP-MPI demonstrates better performance for large messages (a sketch of this kind of measurement follows the charts)

InfiniBand DDR

[Charts: MPI_Sendrecv bandwidth at 192 and 32 processes; y-axis: Bandwidth (MB/s) (0-1,400); x-axis: Message Size (128-8192 bytes); series: Platform MPI, HP-MPI]
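Bandwidth curves like these are typically produced by timing many repeated paired exchanges at each message size. A minimal sketch of such a microbenchmark, assuming an even process count and an arbitrary iteration count (not necessarily the tool used for the results above):

```c
/* Illustrative sketch only: MPI_Sendrecv bandwidth microbenchmark.
 * Ranks are paired (assumes an even process count); each pair exchanges
 * fixed-size messages many times and reports one-way bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int iters = 10000;  /* assumed repetition count */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;  /* pair up ranks */

    for (int bytes = 128; bytes <= 8192; bytes *= 2) {
        char *sbuf = calloc(bytes, 1);
        char *rbuf = calloc(bytes, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Sendrecv(sbuf, bytes, MPI_CHAR, peer, 0,
                         rbuf, bytes, MPI_CHAR, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double dt = MPI_Wtime() - t0;

        /* Each iteration moves `bytes` in each direction; report one way */
        if (rank == 0)
            printf("%5d bytes: %8.1f MB/s\n",
                   bytes, (double)bytes * iters / dt / 1e6);

        free(sbuf);
        free(rbuf);
    }

    MPI_Finalize();
    return 0;
}
```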

Page 23

MPI Performance Comparison – MPI_Allreduce

Lower is better

• HP-MPI demonstrates lower MPI_Allreduce runtime for small messages (see the sketch below the charts)

InfiniBand DDR

[Charts: MPI_Allreduce at 32 and 192 processes; y-axis: Total Time (usec); x-axis: Message Size (Bytes); series: Platform MPI, HP-MPI]
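Collective latency at these sizes is usually measured the same way: many timed repetitions of a small-message MPI_Allreduce, reporting the average. A minimal illustrative sketch (an assumption about methodology, not the actual benchmark used here):

```c
/* Illustrative sketch only: small-message MPI_Allreduce latency,
 * reported as the average over many timed repetitions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    const int iters = 10000;  /* assumed repetition count */
    double in = 1.0, out;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Warm up once, then synchronize before timing */
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double avg_usec = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("MPI_Allreduce, 8-byte payload: %.2f usec average\n", avg_usec);

    MPI_Finalize();
    return 0;
}
```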

Page 24

STAR-CD Benchmark Results - MPI

• Test case
– A single job over the entire system
– Input dataset: A-Class

• HP-MPI has slightly better performance with CPU affinity enabled (the mechanism is sketched below the chart)

Lower is better

[Chart: STAR-CD Benchmark Results (A-Class); y-axis: Total Elapsed Time (s) (0-10,000); x-axis: Number of Nodes (4, 8, 16, 20, 24); series: Platform MPI, HP-MPI]
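CPU affinity pins each MPI rank to a fixed core so the process does not migrate and lose cache locality. HP-MPI exposes this through launcher options; the sketch below instead illustrates the underlying mechanism with the standard Linux API, under an assumed rank-to-core mapping (it is not how HP-MPI implements the feature).

```c
/* Illustrative sketch only: what "CPU affinity" does under the hood.
 * Pins each MPI rank to a fixed core via the Linux sched_setaffinity
 * API, as an MPI launcher might do internally. The rank-to-core
 * mapping below (rank % 8 on 8-core nodes) is an assumption. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(rank % 8, &mask);  /* assumed 8 cores per node, ranks packed */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    /* ... solver runs here; ranks no longer migrate between cores ... */

    MPI_Finalize();
    return 0;
}
```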

Page 25

STAR-CD Profiling Summary

• STAR-CD was profiled to determine its networking dependency

• Majority of data transferred between compute nodes
– Medium-size messages
– Data transferred increases with cluster size

• Most used message sizes
– Messages under 128B: MPI_Allreduce
– 1K-8K messages: MPI_Sendrecv

• The total number of messages increases with cluster size

• Interconnect effect on STAR-CD performance
– Both interconnect latency (MPI_Allreduce) and throughput (MPI_Sendrecv) influence application performance

Page 26

STAR-CD Performance with Power Management

InfiniBand DDR; lower is better

• Test scenario
– 24 servers, 4 cores per processor

• Nearly identical performance with power management enabled or disabled

[Chart: STAR-CD Benchmark Results (A-Class); y-axis: Total Elapsed Time (s) (1,000-1,800); bars: Power Management Enabled vs. Power Management Disabled]

Page 27

STAR-CD Benchmark – Power Consumption

• Power management reduces total system power consumption by 2%

InfiniBand DDR

[Chart: STAR-CD Benchmark Results (A-Class); y-axis: Power Consumption (Watts) (5,000-7,500); bars: Power Management Enabled vs. Power Management Disabled]

Lower is better

Page 28

Power Cost Savings with Power Management

• Power management saves $248/year for the 24-node cluster
• As cluster size increases, bigger savings are expected

InfiniBand DDR; 24-node cluster
$/year = total power consumption per year (kWh) * $0.20/kWh
For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
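As a rough consistency check, assuming the roughly 7 kW full-cluster draw shown on the previous slide, the 2% saving works out to approximately the quoted figure:

\[
0.02 \times 7000\,\mathrm{W} = 140\,\mathrm{W}
\]
\[
140\,\mathrm{W} \times 8760\,\mathrm{h/year} \approx 1226\,\mathrm{kWh/year}
\]
\[
1226\,\mathrm{kWh} \times \$0.20/\mathrm{kWh} \approx \$245/\mathrm{year}
\]

which is in line with the $248/year savings reported above.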

[Chart: STAR-CD Benchmark Results, Power Cost Comparison; y-axis: $/Year (10,000-13,000); bars: Power Management Enabled vs. Power Management Disabled]

Page 29

Power Cost Savings with Different Interconnect

• InfiniBand saves ~$1,200 and ~$4,000 in power to finish the same number of STAR-CD jobs, compared to 10GigE and GigE respectively
– Per year, for the 24-node cluster

• As cluster size increases, more power can be saved

[Chart: Power Cost Savings of InfiniBand vs. 10GigE and GigE; y-axis: Power Cost Savings ($) (0-5,000); bars: 10GigE, GigE]

24-node cluster; $/year = kWh/year * $0.20/kWh
For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf

Power Management enabled

Page 30

Conclusions

• STAR-CD and STAR-CCM+ are widely used CFD simulation packages

• Performance and productivity rely on
– Scalable HPC systems and interconnect solutions
– Low-latency, high-throughput interconnect technology
– Sensible process distribution, which can dramatically improve performance per core

• Interconnect comparison shows
– InfiniBand delivers superior performance at every cluster size
– Low-latency InfiniBand enables unparalleled scalability

• Power management provides a 2% saving in power consumption
– Per 24-node system with InfiniBand
– $248 power savings per year for the 24-node cluster
– Power savings increase with cluster size

• InfiniBand saves power cost
– Based on the number of jobs that can be finished per year
– InfiniBand enables $1,200 and $4,000 power savings compared to 10GigE and GigE, respectively

Page 31

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.