MPI and OFA
Divergent interests?
Dan Caldwell, VP WW Channel Sales, Scali, Inc.
April 2008 OFA Presentation
The role of OFA in HPC
The traditional role of Open Fabrics focuses on the cables that connect motherboards, switches, and storage in clusters
Processor growth in HPC is expected to be 30% in 2010 (per IDC), while cores per processor are growing at an estimated 55% and increasing
Therefore the bulk of the “HPC Interconnect Fabric” is moving from cables to the server motherboard and its various local buses
MPI Performance in HPC systems
Scali has always been the performance MPI leader
– http://www.supercomputingonline.com/article.php?sid=15357
However, we suggest that OFA adopt a recommended performance measurement methodology
– Productivity, or “jobs per day”, on a cluster
– How do you know how well OFA is doing?
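The “jobs per day” metric suggested above can be made concrete with a small sketch. This is a hypothetical illustration, not part of the presentation: the function name, the example runtimes, and the utilization parameter are all assumptions.

```python
# Hypothetical "jobs per day" productivity metric for a cluster, one way to
# realize the measurement methodology suggested above. All names and numbers
# here are illustrative assumptions.

def jobs_per_day(wall_time_per_job_s: float, utilization: float = 1.0) -> float:
    """Number of jobs a cluster completes in 24 hours at a given utilization."""
    return 86_400.0 * utilization / wall_time_per_job_s

# A 10% reduction in per-job wall time translates directly into more jobs/day.
baseline = jobs_per_day(wall_time_per_job_s=3600.0)        # 24.0 jobs/day
improved = jobs_per_day(wall_time_per_job_s=3600.0 * 0.9)  # ~26.7 jobs/day
print(baseline, improved)
```

The point of such a metric is that it captures whole-system throughput, including MPI and fabric overheads, rather than a single micro-benchmark number.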
OFA must examine the performance within a multi-core node
– A 16-core “Personal Supercomputer” will still run legacy MPI applications
[Chart: Single node, 8 cores, osu_bw, 8-byte message, Intel Xeon 3.00 GHz (X5365)]
MPI enables other HPC functionality
InfiniBand Trunking
– Combining IB channels for greater throughput
– Done for Sun / Tsukuba University in Japan
Suspend / Resume and Checkpoint – Restart
– Generic functionality, not application specific
– Demonstrated job migration with HPC4U in Brussels, February 11, 2008
And… MPI-based Power Management
– Today: AMD / Barcelona only
“Waiting as fast as it can” (CPU spinning) – examples (AMD Barcelona Quad Core):
[Charts: “Star-cd Alltoallv count vs. latency” (number of calls per segment) and “Star-cd Alltoallv time spent vs. latency” (time in seconds per segment), both bucketed by latency from >1 ms to >1000 ms; each chart marks a “Power Saving Potential Zone” at the high-latency end.]
Considering that it takes 20 microseconds to change power states on a quad-core Opteron, latency above 10 milliseconds in an MPI collective can trigger a “throttle down”. In a call with 10 milliseconds of latency, 20 microseconds to throttle down plus 200 to re-establish full speed would use only about 2.2% of the time of the call.
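The break-even reasoning above can be sketched as a simple policy check: throttle down only when the power-state transition overhead is a small fraction of the expected wait. The transition costs are the slide’s figures (20 µs down, 200 µs back to full speed); the 5% overhead budget and all function names are assumptions for illustration.

```python
# Sketch of the throttle-down break-even check described above. Transition
# costs are the slide's figures; the overhead budget is an assumed policy knob.

THROTTLE_DOWN_US = 20.0   # time to drop to the low-power state (slide figure)
THROTTLE_UP_US = 200.0    # time to re-establish full clock speed (slide figure)

def transition_overhead(expected_wait_us: float) -> float:
    """Fraction of the expected wait spent changing power states."""
    return (THROTTLE_DOWN_US + THROTTLE_UP_US) / expected_wait_us

def should_throttle(expected_wait_us: float, max_overhead: float = 0.05) -> bool:
    """Throttle down only if transition overhead stays within the budget."""
    return transition_overhead(expected_wait_us) <= max_overhead

# A 10 ms collective: 220 us of transitions is 2.2% of the call -> worth it.
print(transition_overhead(10_000.0))  # 0.022
print(should_throttle(10_000.0))      # True
print(should_throttle(1_000.0))       # False: 22% overhead at 1 ms
```

In practice the expected wait would come from observed per-call latency histograms like the Star-CD ones charted above, not from a fixed constant.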
Real World Initial Tests - SPEC MPI2007
Test          Base   APM    Perf hit   Power-down events (#)   Power-down core seconds   APM core sec %   Perf hit vs power savings
107.leslie3d  7717   8471   9.8%       126585                  20148.46                  29.7%            3.04
113.GemsFDTD  5038   5137   2.0%       44108                   6765.36                   16.5%            8.38
129.tera_tf   3624   3651   0.7%       261171                  4023.04                   13.8%            18.49

Power-down core seconds: core-seconds powered down from 2 GHz to 1 GHz
APM core sec %: low-power core-seconds divided by total full-power core seconds (APM runtime × 8 cores); i.e., total power savings %
Perf hit vs power savings: ratio of power savings to performance hit; higher is better

leslie3d: Fluid Dynamics
GemsFDTD: Computational Electromagnetics
tera_tf: Physics / Hydrodynamics
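The derived columns of the table can be recomputed from its raw numbers using the formulas in the notes. This sketch assumes Base and APM are runtimes in seconds on 8 cores (implied by the “APM runtime × 8” core-seconds formula); the dictionary layout and rounding are mine.

```python
# Recompute the derived columns of the SPEC MPI2007 table from its raw values,
# using the formulas in the notes. Assumes Base/APM are runtimes in seconds on
# an 8-core node, as the core-seconds formula implies.

ROWS = {
    # test: (base_s, apm_s, powerdown_core_s)
    "107.leslie3d": (7717, 8471, 20148.46),
    "113.GemsFDTD": (5038, 5137, 6765.36),
    "129.tera_tf":  (3624, 3651, 4023.04),
}
CORES = 8

for test, (base, apm, down_s) in ROWS.items():
    perf_hit = apm / base - 1.0       # runtime increase with power management on
    savings = down_s / (apm * CORES)  # share of core-seconds spent at low power
    ratio = savings / perf_hit        # power savings per unit of performance hit
    print(f"{test}: perf hit {perf_hit:.1%}, savings {savings:.1%}, ratio {ratio:.2f}")
```

Running this reproduces the table’s derived values (e.g. 9.8%, 29.7%, and 3.04 for 107.leslie3d), which is a useful sanity check on the reporting methodology.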
Conclusions
MPI and OFA need to co-exist, but we both need to expand our reach in HPC
Open Fabrics needs to embrace new functionality, performance metrics, and definitions of ‘fabric’
Power savings, a user model (core affinity policies), best practices in interconnect design, and an emphasis on continued performance across the entire HPC system are critical to the relevance of OFA.