© 2010 Voltaire Inc. November 19, 2010 Voltaire Fabric Collective Accelerator™ (FCA) Ghislain de Jacquelot [email protected]




1. Voltaire Fabric Collective Accelerator (FCA) - November 19, 2010 - Ghislain de Jacquelot, [email protected]

2. MPI Collectives Percentage
   - Collective operations = group communication (all-to-all, one-to-all, all-to-one)
   - Synchronous by nature, so they consume many wait cycles on large clusters
   - Popular examples: Reduce, Allreduce, Barrier, Bcast, Gather, Allgather (a minimal example follows after slide 6 below)
   - [Chart: "Collective Operations % of MPI Job Runtime" for ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-adapco STAR-CD and Dacapo]
   - Your cluster might be spending half its time on idle collective cycles

3. The Challenge: Collective Operations Scalability
   - Grouping algorithms are unaware of the topology and inefficient
   - Network congestion due to all-to-all communication
   - Slow nodes and OS involvement impair scalability and predictability
   - The more powerful servers get (GPUs, more cores), the poorer collectives scale in the fabric
   - [Chart: expected vs. actual scaling]

4. The Voltaire InfiniBand Fabric: Equipped for the Challenge
   - Grid Director switches: fabric processing power
   - Unified Fabric Manager (UFM): topology-aware orchestrator
   - Fabric computing in use to address the collective challenge
   - [Diagram: two tiers of Grid Director 4036 switches managed by UFM]

5. Introducing: Voltaire Fabric Collective Accelerator
   - Breakthrough performance with no additional hardware
   - Grid Director switches: collective operations offloaded to switch CPUs
   - FCA Agent: inter-core processing localized and optimized
   - Unified Fabric Manager (UFM) / FCA Manager: topology-based collective tree; separate virtual network; IB multicast for result distribution; integration with job schedulers

6. Efficient Collectives with FCA
   1. Pre-configuration
   2. Inter-core processing on each node
   3. 1st-tier offload (648)
   4. 2nd-tier offload, result at root (11,664)
   5. Result distribution (single message)
   6. Allreduce on 100K cores in 25 usec
   - [Diagram: hierarchical aggregation across two switch tiers - 36, 648 and 11,664 ranks per level]
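Slide 2 lists the collectives that dominate MPI runtime, and slides 5-6 stress that the offload is transparent to the application. As a minimal illustration (not from the deck; the variable names and values are chosen here for clarity), the kind of Allreduce and Barrier calls being accelerated look like this in C, and the code is identical with or without the fabric offload:

```c
/* Minimal MPI collective example (illustrative, not from the deck).
 * FCA sits beneath the standard MPI interface, so this code is the same
 * whether or not the fabric offload is active. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    double local_value, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    local_value = (double)rank;   /* each rank contributes one value */

    /* Allreduce: every rank ends up with the sum of all contributions. */
    MPI_Allreduce(&local_value, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    /* Barrier: all ranks synchronize before continuing. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %g\n", nranks, global_sum);

    MPI_Finalize();
    return 0;
}
```

Because the acceleration happens beneath the standard interface, existing applications pick it up without source changes, which is the "no changes to the application required" point made later in the deck.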
7. FCA Benefits: Slashing Job Runtime
   - Slashing runtime and eliminating runtime variation
   - OS jitter: eliminated in the switches
   - Traffic congestion: significantly lower number of messages
   - Cross-application interference: collectives offloaded onto a private virtual network
   - [Chart: IMB Allreduce, 2,048 cores - completion-time distribution, server-based collectives vs. FCA-based collectives; label: FCA 3,000 usec]

8. FCA Benefits: Unprecedented Scalability on HPC Clusters
   - [Chart: collective latency vs. number of nodes (0-1,200) for ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce and FCA-Barrier]
   - Extreme performance improvement on raw collectives (> 100X)
   - Scales with the number of switch hops, not the number of nodes: O(log18)
   - As the process count increases, the percentage of time spent in MPI and in collectives increases (> 50%)
   - Enabling capability computing on HPC clusters

9. Additional Benefits
   - Simple, fully integrated; no changes to the application required
   - Tolerance to a higher oversubscription (blocking) ratio: same performance at lower cost
   - Enables use of non-blocking collectives, part of future MPI implementations
   - FCA guarantees no computation-power penalty

10. FCA: What Is the Alternative/Competitive Solution?
   - Comparison of FCA vs. NIC-based offload on: topology awareness, network congestion elimination, offload of computation to fabric switches, result distribution based on IB multicast, support for non-blocking collectives, OS noise reduction
   - Expected MPI job runtime improvement: FCA 30-40%, NIC-based offload 1-2%
   - A fabric-wide challenge requires a fabric-wide solution

11. Benchmarks 1/4

12. FCA Impact on Fluent (Rating: higher is better)
   - [Charts: 88 ranks, InfiniBand vs. InfiniBand + FCA for the aircraft_2m, eddy_417k, sedan_4m and truck_111m cases]
   - Setup: 11 x HP DL160; Intel Xeon 5550; parallel FLUENT 12.1.4 (1998); CentOS 5.4; Open MPI 1.4.1

13. Benchmarks 2/4

14. System Configuration
   - Newest installation; node type: NEC HPC 1812Rb-2; CPU: 2 x Intel X5550; memory: 6 x 2 GB; IB: 1 x InfiniHost DDR onboard
   - System: 186 nodes; 24 nodes per switch (DDR); 12 QDR links to tier-2 switches (non-blocking)
   - OS: CentOS 5.4; Open MPI 1.4.1; FCA 1.0_RC3 rev 2760; UFM 2.3 RC7; switch software 3.0.629
   - [Diagram: edge switches with 24 x DDR node links and 4 x QDR uplinks each]

15. IMB (Pallas) Benchmark Results
   - [Chart: collective latency (usec) vs. number of ranks (16 ranks per node) for ompi-Allreduce, ompi-Reduce, ompi-Barrier, FCA-Allreduce, FCA-Reduce, FCA-Barrier] - up to 100X faster (a measurement sketch follows after slide 17 below)
   - [Chart: collective runtime reduction (%), FCA vs. Open MPI, for Allreduce, Reduce and Barrier] - up to 99.5% runtime reduction

16. OpenFOAM - I
   - OpenFOAM: open-source CFD solver produced by a commercial company, OpenCFD
   - Used by many leading automotive companies
   - [Chart: OpenFOAM CFD aerodynamic benchmark (64 cores), runtime in seconds, Open MPI 1.4.1 vs. Open MPI 1.4.1 + FCA]

17. Benchmarks 3/4
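The IMB (Pallas) figures on slide 15 are average per-call latencies. As a rough sketch of that kind of measurement (not the actual IMB source; the iteration count and message size are arbitrary choices for illustration), the loop looks like this in C:

```c
/* Rough sketch of an Allreduce latency measurement of the kind IMB
 * (Pallas) reports -- not the actual IMB source. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank;
    double in = 1.0, out;
    double t0, t1, local_usec, max_usec;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* start all ranks together */
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    /* Average per-call latency on this rank, in microseconds. */
    local_usec = (t1 - t0) * 1.0e6 / ITERS;

    /* Report the slowest rank's average, a common convention. */
    MPI_Reduce(&local_usec, &max_usec, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Allreduce latency: %.2f usec (%d iterations)\n", max_usec, ITERS);

    MPI_Finalize();
    return 0;
}
```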
18. System Configuration
   - Node type: NEC HPC; CPU: Nehalem X5560 2.8 GHz, 4 cores x 2 sockets; IB: 1 x InfiniHost DDR HCA
   - System: 700 nodes; 30 nodes per switch (DDR); 6 QDR links to tier-2 switches (oversubscribed)
   - OS: Scientific Linux 5.3; Open MPI 1.4.1; FCA 1.1; UFM 2.3; switch software 3.0.629
   - [Diagram: edge switches with 30 x DDR node links and 3 x QDR uplinks each]

19. OpenFOAM - II
   - ERCOFTAC UFR 2-02: http://qnet-ercoftac.cfms.org.uk/index.php?title=Flow_past_cylinder
   - Used in many areas of engineering, including civil and environmental
   - Run with OpenFOAM (pimpleFoam solver)
   - [Chart: ERCOFTAC UFR 2-02, flow past a square cylinder (256 cores), runtime, Open MPI 1.4.1 vs. FCA]

20. Molecular Dynamics: LS1-Mardyn
   - Case: 50,000 molecules, single Lennard-Jones, homogeneous distribution of molecules at the start of the simulation
   - "agglo" uses a custom reduce operator (not supported by FCA), while "split" uses a standard one (see the operator sketch after slide 29 below)
   - > 95% improvement

21. Benchmarks 4/4

22. Setup
   - 80 x BL460 blades, each with two Intel Xeon X5670 CPUs @ 2.93 GHz
   - Voltaire QDR InfiniBand
   - Platform MPI 8.0; Fluent 12.1; Star-CD 4.12
   - 192 cores per enclosure

23. Fluent, 192 Cores (Rating: higher is better)
   - [Charts: PMPI vs. PMPI + FCA for the truck_poly_14m, truck_14m and truck_111m cases]

24. Star-CD A-Class Benchmark, 192 Cores (Runtime: lower is better)

25. Logistics & Roadmap (November 19, 2010)

26. FCA Ordering & Packaging
   - SWL-00347: FCA add-on license for 1 node
   - SWL-00344: UFM-FCA bundle license for 1 node
   - Switch CPU software ships automatically on all switches starting from version 3.0; upgrading to the latest version is recommended
   - The FCA add-on package includes the FCA Manager (add-on to UFM) and OMA, the host add-on for Open MPI (not required for other MPIs once they are supported)
   - The bundle includes the above as well as UFM itself
   - The FCA license is installed on the UFM server

27. FCA Roadmap
   - FCA v1.1 (available Q2 2010)
     - Collective operations: MPI_Reduce and MPI_Allreduce (MAX & SUM); MPI_Bcast; integer & floating point (32/64-bit), up to 8 elements (128 bytes); MPI_Barrier
     - Topologies: fat tree, HyperScale, torus
     - MPI: Open MPI; SDK available for MPI integration
   - FCA v2.0 (available Q4 2010)
     - Allgather
     - Support for all well-known arithmetic functions for Reduce/Allreduce (MIN, XOR, etc.)
     - Increased message size for Bcast, Reduce & Allreduce

28. FCA SDK: Integration with Additional MPIs
   - Easy-to-use software development kit; integration to be performed by the MPI vendor
   - Package includes: documentation, a high-level & flow presentation, software packages (dynamically linked library, binary only), header files, and a sample application

29. Coming Soon: Platform MPI (formerly HP-MPI) Support
   - Platform MPI version 8.x - Q3 2010
   - Initial benchmarking expected end of Q2 2010
   - Other MPI vendors are evaluating the technology as well, leveraging the Voltaire SDK
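Slide 20 hinges on the difference between a standard, built-in reduction operator (the "split" variant) and a custom one (the "agglo" variant, which FCA does not offload and which therefore runs through the regular host-based MPI path). The sketch below is written for illustration rather than taken from LS1-Mardyn; the operator and variable names are invented here. It shows the distinction at the MPI level: MPI_SUM versus a user-defined operator registered with MPI_Op_create.

```c
/* Illustrative sketch: built-in vs. user-defined reduction operators.
 * Names are invented for this example; this is not LS1-Mardyn code. */
#include <mpi.h>

/* User-defined commutative reduction: element-wise maximum. */
static void elementwise_max(void *invec, void *inoutvec, int *len, MPI_Datatype *dt)
{
    double *in = (double *)invec, *inout = (double *)inoutvec;
    (void)dt;                                 /* datatype is MPI_DOUBLE here */
    for (int i = 0; i < *len; i++)
        if (in[i] > inout[i]) inout[i] = in[i];
}

int main(int argc, char **argv)
{
    double local[4] = {1.0, 2.0, 3.0, 4.0};
    double std_result[4], usr_result[4];
    MPI_Op user_op;

    MPI_Init(&argc, &argv);

    /* Built-in operator (MPI_SUM): the kind of standard reduction the
     * "split" variant uses, eligible for fabric offload. */
    MPI_Allreduce(local, std_result, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* User-defined operator via MPI_Op_create: the "agglo" style, which
     * FCA does not offload, so it runs through the regular MPI path. */
    MPI_Op_create(elementwise_max, 1 /* commutative */, &user_op);
    MPI_Allreduce(local, usr_result, 4, MPI_DOUBLE, user_op, MPI_COMM_WORLD);
    MPI_Op_free(&user_op);

    MPI_Finalize();
    return 0;
}
```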
30. Voltaire Fabric Collective Accelerator Summary
   - Fabric computing offload: a combination of software and hardware in a single solution; blocking computational tasks are offloaded; algorithms leverage the topology for computation (trees)
   - Extreme MPI performance & scalability: capability computing on commodity clusters; two orders of magnitude (hundred-times) faster collective runtime; scales by the number of hops, not the number of nodes; variation eliminated, consistent results
   - Transparent to the application: plug & play, no code changes needed
   - Accelerate your fabric!

31. Thank You (November 19, 2010)