www.openfabrics.org
Open Fabrics BOF
Supercomputing 2009
Tziporet Koren, Gilad Shainer, Yiftah Shahar, Stan Smith, Hal Rosenstock, Jeff Squyres, DK Panda, Bob Woodruff, Betsy Zeller
Rev. 1.0
Agenda
Open Fabrics Linux Update (15 minutes)
  • OFED 1.4.1, 1.4.2, and OFED 1.5 releases
  • OFED 1.6 plans and roadmap
Open Fabrics Windows Update (15 minutes)
  • WinOF 2.1 release
  • WinOF 2.2 plans and roadmap
Open Discussion (60 minutes)
  • OFED scalability: should we have a scalability roadmap?
  • Should we be including MPIs in the OFED releases?
  • OFA solutions for Ethernet clusters for HPC
  • Questions and feedback from the community
Linux: OFED Components
HCA/NIC drivers
  • IB: IBM, Mellanox, QLogic
  • iWARP: Chelsio, Intel
Core: Verbs, MAD, SMA, CMA, SA cache
IPoIB, SDP, SRP and SRP Target, iSER, RDS, Qlogic_VNIC, uDAPL, OSM, diagnostic tools, iSER Target, NFS-RDMA
Bonding module, Open iSCSI
MPI components: MVAPICH, Open MPI, MVAPICH2, benchmark tests
Proprietary MPIs: Intel, HP, Platform MPI
Proprietary SMs: Sun, Voltaire, QLogic, Mellanox
Diagram legend: OFA Development, Add on, Tested with
Update from Sonoma ’09 Session
Progress: user space components are provided as tarballs, as requested by the distros
OFED 1.4.1 – Released May 2009
New features
  • Added support for RHEL 5.3 and SLES 11
  • NFS/RDMA: beta quality, with support for RHEL 5.2, 5.3 and SLES 10 SP2
  • Updated MPI packages: MVAPICH 1.1.0-3355, Open MPI 1.3.2
  • Updated bonding package: ib-bonding-0.9.0-40
  • Updated DAPL: compat-dapl-1.2.14 and dapl-2.0.19
  • Updated OpenSM version to include critical bug fixes
  • Fixed RDS iWARP support
  • Low-level drivers updated: ehca, mlx4, cxgb3, nes, ipath, mthca
  • Added a module parameter to control the number of MTTs per segment in Mellanox HCAs (mlx4 & mthca)
  • mstflint update
  • Enhanced OpenSM and management tools: user interface, HA, routing enhancements, and more (details in the backup slides)
OFED 1.4.2 – Released August 2009
New features
  • Critical bug fixes only
    • Fixes to NES (Intel iWARP) driver
    • Fixes to support running with Lustre installed
    • NFS/RDMA critical bug fixes
  • Minimal QA, so recommended only for people hitting these critical bugs
OFED 1.5 – Release December 2009
New features
  • Added support for RHEL 5.4, RHEL 4.8 and SLES 10 SP3
  • Added support for kernel.org 2.6.29 and 2.6.30
  • uDAPL scalability enhancements, new UCM provider
  • Hardware driver for new QLogic QDR HCA
  • All user space packages released as tarballs for easier distro integration
  • MVAPICH2 1.4
  • Open MPI 1.3.3
  • Several new enhancements to OpenSM and management tools for improved scalability, performance, QoS, routing, etc. (see backup slides for details)
  • Bug fixes
  • SDP Zero Copy and other performance improvements
OFED 1.5-RDMAoE branch
  • Experimental branch of OFED 1.5 that also includes support for Mellanox RDMAoE
  • For those who want to try out this new technology
  • The Open Fabrics board has voted to include this code in OFED/WinOF
    • Which release should it go into: OFED 1.5, or wait until the code is accepted upstream and there is a standard spec?
OFED 1.5 OS Matrix
List of supported kernels for OFED 1.5
  • RHEL4: up6, up7, up8
  • RHEL5: up2, up3, up4
  • SLES10: SP2, SP3
  • SLES 11
  • Fedora Core 11*
  • OpenSuSE 11*
  • Kernel.org: 2.6.18-2.6.30
* Minimal QA for these versions.
OFED 1.6 Plans
Preliminary schedule
  • Release in Nov 2010
  • Detailed schedule will be derived from the above
Preliminary feature list
  • Kernel.org: 2.6.33 and 2.6.34
  • SR-IOV support
  • Mellanox Vnic for BridgeX
  • MMU notification for MPI (if accepted by the kernel)
  • New HW from vendors (if any)
  • RDMAoE (if not already in an earlier release)
OFED 1.6 OS Matrix
  • kernel.org: 2.6.33 and 2.6.34
  • RHEL4: up6, up7, up8 (may be dropped entirely if RHEL 6 is out; to be discussed in the meeting)
  • RHEL5: up2, up3, up4, up5
  • RHEL6
  • SLES10: SP2, SP3, SP4
  • SLES 11: SP1
  • Fedora Core: latest
  • OpenSuSE: latest
Slide legend: new for OFED 1.6 in bold; support to be dropped for items in blue
Windows OpenFabrics (WinOF)
WinOF 2.1 released 9/30/2009
  • Winverbs fully integrated into IB core feature set
  • OFED compatibility layers
    • libibverbs, libmad, libumad, librdmacm
  • OFED diagnostics on OFED compat layers
    • ibaddr, ibnetdiscover, ibroute, ibstat, saquery, sminfo…
  • Installer fully integrated with DriverStore + PNP
  • OFED uDAT/uDAPL code base on Windows
  • Server 2008 HPC integration
  • Numerous bug fixes
WinOF Roadmap
WinOF 2.2
  • Release target Q1’2010, freeze in Q4’09
  • Features:
    • Windows 7 & Server 2008 R2 fully supported
    • NDIS 6.0 IPoIB driver based on WHQL’ed source
    • OpenSM 3.3.3 (WinOF 2.1 @ 3.0.0 ~OFED 1.2+)
    • SRP multi-path fixes
WinOF 2.3
  • Release target Q4’2010, freeze early Q4’2010
  • Connected Mode IPoIB
OFA Scalability
  • Challenges and Goals
  • Infrastructure Scalability
  • ULPs/Apps Scalability
  • Possible Improvements
Challenges and Goals
  • Scale out to 10K-20K or more nodes
    • Performance
    • Reliability
  • Sometimes hard to differentiate feature from scalability
  • Focus additional attention/resources on issues
  • Get ready for more detailed discussion at Sonoma
Infrastructure Scalability/Features
Improved multicore affinity/awareness/support
  • Binding to specific hw threads in a core
    • e.g. http://arstechnica.com/hardware/news/2009/09/ibms-8-core-power7-twice-the-muscle-half-the-transistors.ars
  • Interrupt distribution
  • Binding HCAs/RNICs to NUMA nodes
Multicast
  • Reliable multicast
    • New IBTA optional feature
  • Better UD multicast performance
    • Small message mcast latency, with just two members in the mcast group, is 2x to 3x that of unicast latency between the same pair
Flow control for SRQ
  • New IBTA optional feature
CM extension
Infrastructure Scalability/Features
Fault tolerance
  • Application-transparent fault detection, isolation, recovery
  • Multiple HCAs/NICs with transparent failover
IB monitoring
  • Performance counters, throughput, hotspots, degraded links
  • This is IB's Achilles' heel...
    • Need much better monitoring tools to discover congestion and bottlenecks
Adaptive routing
  • HCA out-of-order delivery
  • Switch logic for state info & adaptive algorithm, etc.
Infrastructure Scalability
SA aspects
Primarily PathRecord
• OpenSM
• SA client
RDMA CM
Resolve route
• ARP query scalability
Resolve address
• SA PathRecord query scalability
Infrastructure Scalability
CM
  • Higher abstraction model
    • Current APIs are cumbersome & difficult to use
OpenSM
  • Stateful failover
    • Replication
    • Eliminate client re-registration
Congestion manager
Possible Infrastructure Improvements
  • Adaptive MAD retransmission
  • Better duplicate transaction handling by SA (and MAD?)
  • SA scalability in terms of PathRecord responses
    • More parallelization
      • Shadow DB?
      • SA distribution beyond node
  • Tunable retry mechanism for various components
  • RDMA CM API addition and ACM (Assistant to the IB CM)
    • Does this address the higher abstraction model requested?
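The adaptive retransmission and tunable retry items above could look roughly like the following. This is only an illustrative sketch: the class name, parameters and default values are hypothetical, and the smoothing scheme is borrowed from TCP's retransmission timer (RFC 6298) rather than taken from any OFED component.

```python
class AdaptiveTimeout:
    """Hypothetical retransmission timer that adapts to observed response times.

    Smoothing follows the TCP scheme from RFC 6298 (SRTT/RTTVAR); nothing here
    is an existing MAD or OpenSM interface.
    """

    ALPHA = 0.125  # weight of a new sample in the smoothed response time
    BETA = 0.25    # weight of a new sample in the variance estimate

    def __init__(self, initial_ms=1000.0, min_ms=50.0, max_ms=30000.0):
        self.min_ms = min_ms          # tunable lower bound
        self.max_ms = max_ms          # tunable upper bound
        self.timeout_ms = initial_ms  # used until the first response arrives
        self.srtt = None
        self.rttvar = None

    def on_response(self, rtt_ms):
        """Feed the measured response time of a completed transaction."""
        if self.srtt is None:
            self.srtt = rtt_ms
            self.rttvar = rtt_ms / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt_ms)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt_ms
        candidate = self.srtt + 4 * self.rttvar
        self.timeout_ms = max(self.min_ms, min(candidate, self.max_ms))

    def on_timeout(self):
        """Back off exponentially after a lost request, up to max_ms."""
        self.timeout_ms = min(self.timeout_ms * 2, self.max_ms)
```

A sender would call on_response() for every answered request and on_timeout() for every retransmission, so the timeout shrinks on a quiet fabric and grows when the SA is overloaded instead of flooding it with retries.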
Possible ULPs/Apps Improvements
MPI
  • Don’t query PathRecords per core
  • Hardware collective support
    • Common API
  • Reliable multicast
  • ummunotify
BoIB (Boot over IB)
  • SM improvements for handling non-responsive SMAs as node transitions from boot ROM to kernel
  • InfiniBand as boot interface without Ethernet suspenders
Bonding
  • Load balancing
    • Not just active/standby (failover)
DHCP
  • Use raw (mmap) rather than BSD socket interface due to inadequate performance
Others?
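To illustrate the "don't query PathRecords per core" point, here is a minimal sketch of a node-level cache. Everything in it is hypothetical (the class, the query callback and the GID strings are made up for illustration); it only shows how sharing results turns one SA query per rank and destination into one query per destination.

```python
class PathRecordCache:
    """Hypothetical node-level cache in front of an SA PathRecord query.

    query_fn stands in for whatever actually talks to the SA; it is an
    assumption of this sketch, not a real OFED call.
    """

    def __init__(self, query_fn):
        self._query_fn = query_fn
        self._cache = {}
        self.sa_queries = 0  # number of queries that really reached the SA

    def lookup(self, src_gid, dst_gid):
        key = (src_gid, dst_gid)
        if key not in self._cache:
            self.sa_queries += 1
            self._cache[key] = self._query_fn(src_gid, dst_gid)
        return self._cache[key]


def fake_sa_query(src_gid, dst_gid):
    # Placeholder result; a real PathRecord carries LIDs, SL, MTU, rate, etc.
    return {"src": src_gid, "dst": dst_gid}


cache = PathRecordCache(fake_sa_query)
destinations = ["gid-b", "gid-c", "gid-d"]
for rank in range(8):  # 8 ranks (cores) on the same node
    for dst in destinations:
        cache.lookup("gid-a", dst)
```

In a real job the ranks are separate processes, so the cache would have to live in a shared node-local service; the sketch keeps it in one process for brevity.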
MPI Distribution in OFED: Rationale
Open source MPIs were initially included to “bootstrap” the OFED project
MPI was the main user for OFED, so this seemed like a natural pairing
Made it (significantly) easier for customers to get their MPI jobs running on InfiniBand
Also necessary for political buy-in: unify under one, standard verbs API (vs. different MVAPI stacks)
QA testing of MPI + OFED is still extremely valuable
This is not a discussion of removing MPI + OFED QA
MPI Distribution in OFED: Pros
MPI is still the most common OFED “customer”
HPC customers get network stack + MPI in one package
Helps rapid MPI deployment on new clusters (out-of-box)
The MPI-selector function allows selecting the MPI stack of choice during installation
Customers get QA assurance of specific MPI + OFED version tuples
Helps to test multiple functionalities of the OFED stack and IB/iWARP fabric with a comprehensive suite of MPI-level benchmarks
MPI Distribution in OFED: Cons
MPIs have their own QA cycles
MPI+OFED QA testing is more for OFED, not MPI
Bundling induces project scheduling difficulties between OFED and various MPI packages
RedHat and SuSE both say “Don’t do this!”
They both already include the open source MPIs
Makes it more difficult for them to take OFED drops
Many users will download the latest-n-greatest MPIs anyway – not the ones included in OFED
OFA Solutions for Ethernet clusters for HPC
Would like to get some community feedback on success stories for building HPC clusters using Ethernet.
  • What works well?
  • Things that need improvement to make it easier?
  • Other?
Backup Slides
Open SM – OFED 1.4.1 and 1.4.2
Versions in OFED 1.4.2:
libibumad-1.3.1, libibmad-1.3.1, opensm-3.3.1, infiniband-diags-1.5.1
User Interface:
Unified configuration file
Configuration reloading on the fly
Improved Plugin interface – multiple plugins are supported
Multicast: IPv6 Solicited Node consolidation
Better diagnostic tools (new - ibsendtrap)
HA:
• OpenSM will query Standby SMs periodically
• Standby OpenSM notifies Master SM about priority change (Trap 144)
Open SM - OFED 1.4.2 Routing
Cached routing (- -R ftree,updn,minhop)
LMC improvements:
Preserve base LIDs routes
Ensure LMC paths balancing over different switches/chassis
Ordered paths balancing
Ports are sorted by switch loads
Port order file option (--guid_routing_order_file option)
Better LASH support: Mesh geometry analysis, Paths balancing over multiple links
General
Port IDs for Up/Down
Min hop weights
Connecting root nodes with Up/Down
Connecting IO nodes with Fat-Tree
Open SM - new features in OFED 1.5
Scalability & performance
Optimized SL2VL setup
Parallel LFTs setup
Parallel MFTs setup
Routing & multicast
FTree improvements
Routing engine reloading
Mesh switch reordering optimizations
MGID to MLID compression
QoS improvements
SL2VL setup optimization
QoS/LASH co-exist
Major bug fix
MCG join/leave fixes
Clean delayed MCG deletion