A presentation about UCS and usNIC to the Math & Computer Science and Leadership Computing Facility divisions at Argonne National Laboratory (ANL). Presented to ANL by Dave Goodell (Cisco) on 2014-09-02.
© 2013 Cisco and/or its affiliates. All rights reserved.
Introduction to Cisco UCS and Userspace NIC (usNIC)
Argonne National Laboratory
September 2, 2014
Dave Goodell (Cisco)
Record-setting Intel Ivy Bridge 1U and 2U servers (with GPU support)
Low-latency Ethernet. Yes, really!
10 & 40 Gbps top-of-rack & core switching
Callouts: 1.6 usecs, 190 nsecs, up to 1.5 TB RAM, 10 & 40 Gbps!
Unified Computing System Innovations
• Integrated Design: performance optimized for any type of workload
• Service Profiles: agility and reduced time to deploy and provision applications
• UCS Manager: role-based management, automation, ease of integration
• UCS Central: centralized, multi-domain management, alerting and visibility
• Unified Fabric: simplified infrastructure
• Virtualized I/O: security isolation per application, scale, improved performance
• Form Factor Independence: supports both blades and rack mount servers in a single domain
• Low Latency: low latency over industry-standard Ethernet networking
Unified Fabric for HPC & HPDA
Consolidating the messaging/interconnect network
[Diagram: a traditional network with separate LAN (Ethernet), storage (FC), and cluster (InfiniBand) networks, next to a unified fabric that carries them over Ethernet with DCB, FCoE & low latency]
Low Latency Messaging on Ethernet
• Benefits
• Low Latency Ethernet delivers high performance while retaining all the advantages of managing a unified network fabric
• HPC Compute Clusters can coexist with Enterprise IT under the same management framework
• Leverage True Hybrid Solutions From All IT Resources
• Simplifies Procurement
• Accelerates Deployment
• Non Intrusive
• Extends the Product Life Cycle / Reusability
Lower CAPEX and OPEX
Unified HPC Fabric
One wire to rule them all:
• OS Mgmt Traffic (e.g., ssh)
• Server Hardware Mgmt
• File System / IO Traffic
• MPI / Application Traffic
10 & 40 Gbps Ethernet with QoS
Cisco CIMC: rich XML interface
Unified Management
HPC Networking / Routing
Network QoS
Host port to switch port:
eth0: VLAN 27, MTU 1500 B, bandwidth: 100 Mbps
eth1: VLAN 42, MTU 9000 B, bandwidth: 2 Gbps
eth2: VLAN 64, MTU 9000 B, bandwidth: not limited
[Diagram: on the host CPU, an SSH process uses eth0 and an MPI process uses eth2. Labels: PCIe Physical Function, Virtual Functions, RX/TX queue pairs, isolated HW resource.]
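One way to read the bandwidth caps above is as a bound on how quickly a single MTU-sized frame can be put on the wire. The following is illustrative arithmetic only. The pairing of eth0 and eth1 with VLANs 27 and 42 is an assumption based on the order of the labels on the slide, and `serialization_us` is a hypothetical helper name.

```python
# Time to serialize one MTU-sized frame at each class's bandwidth cap.
# Assumption: eth0 -> VLAN 27 and eth1 -> VLAN 42, inferred from slide order.
# The uncapped VLAN 64 class is omitted because its rate depends on the link.

def serialization_us(mtu_bytes: int, rate_bps: float) -> float:
    """Microseconds needed to put mtu_bytes on the wire at rate_bps."""
    return mtu_bytes * 8 / rate_bps * 1e6

classes = {
    "eth0 (VLAN 27)": (1500, 100e6),  # MTU 1500 B, capped at 100 Mbps
    "eth1 (VLAN 42)": (9000, 2e9),    # MTU 9000 B, capped at 2 Gbps
}

for name, (mtu, rate) in classes.items():
    print(f"{name}: {serialization_us(mtu, rate):.1f} us per full frame")
```

Under these assumptions a full management frame takes longer to serialize than a jumbo frame in the 2 Gbps class, which is fine for ssh-style traffic but would be a poor fit for latency-sensitive MPI messages.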
Cisco UCS B-Series
Characteristics
• Up to 20 Chassis (160 Blades)
• 3840 CPU Cores
• 20 Gbps Bandwidth/Blade
• Burst Capacity up to 80 Gbps
• Single Wire Management
• Enterprise & HPC
• Pod Architecture
• Scalable
96 or 48 Ports
5.3 usecs any-to-any latency
Up to 82.94 TeraFLOPs (Intel Ivy Bridge)
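The headline figures on this slide are consistent with each other under two assumptions that the slide does not state: a 2.7 GHz core clock and 8 double-precision FLOPs per cycle per core (AVX). A quick sanity check, with both assumptions marked in the code:

```python
# Figures taken from the slide.
CHASSIS = 20
BLADES = 160
CORES = 3840

# Assumptions, NOT from the slide.
CLOCK_HZ = 2.7e9       # assumed core clock
FLOPS_PER_CYCLE = 8    # assumed AVX double-precision FLOPs per cycle per core

blades_per_chassis = BLADES // CHASSIS
cores_per_blade = CORES // BLADES
peak_tflops = CORES * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12

print(f"{blades_per_chassis} blades/chassis, {cores_per_blade} cores/blade")
print(f"theoretical peak: {peak_tflops:.2f} TFLOPs")
```

This is a theoretical peak (cores x clock x FLOPs per cycle), not a measured benchmark result.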
Cisco UCS C-Series
C220 M3 - 1RU Dual Socket Rack Server (Up to 384 GB RAM)
C240 M3 - 2RU Dual Socket Compute OR Storage Rack Server
C420 M3 - 2RU Dual OR Quad Socket Server (Up to 1.5 TB RAM)
3rd Party GPU Expansion
Cisco Nexus Ultra Low Latency Ethernet Switching
Port-to-port latency:
• Nexus 3548 (48 ports x 10 Gbps, 12 x 40 Gbps): 190 nsecs
• Nexus 3172PQ (72 ports x 10 Gbps, 6 x 40 Gbps): <500 nsecs
• Nexus 3132Q (32 ports x 40 Gbps): <500 nsecs
• Nexus 9000 (9504: 144 ports x 40 Gbps; 9508: 288 ports x 40 Gbps; 9516: 576 ports x 40 Gbps): <500 nsecs
usNIC Low Level Details
App to App Latency Components
[Chart: app-to-app latency factors (middleware, kernel, NIC, network) for TCP/IP versus usNIC; x-axis: latency (usecs), 0 to 10. Stack labels: Application, TCP/IP, OS kernel, vEth, usNIC.]
TCP/IP: 9.42 usecs (kernel overhead)
usNIC: 2.02 usecs (kernel bypass using SR-IOV)
HW resource isolation using IOMMU
Dual functionality!
Cisco Userspace NIC (usNIC) overview
• Direct access to NIC hardware from Linux userspace
Operating System bypass
via the Linux Verbs API (UD)
• Utilizes Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency
2nd generation 80Gbps Cisco ASIC
2 x 10Gbps Ethernet ports, or
2 x 40Gbps Ethernet ports
PCI and mezzanine form factors
• Half-round trip (HRT) ping-pong latencies (Intel E5-2690 v2 servers):
Raw back to back: 1.57μs
MPI back to back: 1.85μs
Through MPI+N3548: 2.02μs
These numbers keep going down
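The three half-round-trip figures above can be subtracted to get a rough idea of what each layer adds. This is only a sketch: differences between separate measurements are not an exact decomposition, and `half_round_trip` is a hypothetical helper showing how an HRT number is conventionally derived from a ping-pong loop.

```python
# Half-round-trip (HRT) latencies from the slide, in microseconds.
hrt_us = {
    "raw back to back": 1.57,
    "MPI back to back": 1.85,
    "MPI through N3548": 2.02,
}

# Rough per-layer contributions, obtained by subtraction.
mpi_overhead = hrt_us["MPI back to back"] - hrt_us["raw back to back"]
switch_hop = hrt_us["MPI through N3548"] - hrt_us["MPI back to back"]

def half_round_trip(total_elapsed_us: float, iterations: int) -> float:
    """HRT latency from a ping-pong loop: each iteration is one full round trip."""
    return total_elapsed_us / iterations / 2

print(f"MPI layer adds  ~{mpi_overhead:.2f} us")
print(f"switch hop adds ~{switch_hop:.2f} us")
```

The switch contribution computed this way is in the same range as the 190 nsecs port-to-port latency quoted for the Nexus 3548 elsewhere in the deck.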
Cisco 2x10|40Gb Virtual Interface Card (VIC)
• 2nd generation VIC:
Can present itself 256 times on the PCI bus
Has enough hardware queues / buffering for 256 actual NICs
• Created for virtualization
Designed for hypervisor bypass
• Intent:
Each vNIC assigned to a single virtual machine
Can therefore bypass hypervisor
“Bare metal” network performance in a VM
[Diagram: one physical VIC presenting multiple virtual NICs on the PCI bus]
PCIe Single Root IO Virtualization (SR-IOV)
[Diagram: a VIC with two physical ports presenting six vNICs. Each vNIC is a PCI Physical Function (PF) with its own MAC address, aa:bb:cc:dd:ee:fa through aa:bb:cc:dd:ee:ff.]
PCI Passthrough for Hypervisor Bypass (non-usNIC)
[Diagram: three VMs, each running an app on a guest kernel with a guest driver. Labels: hypervisor, virtual switch, host driver, VIC, two PCI PFs, data path.]
SR-IOV for Hypervisor Bypass (non-usNIC)
[Diagram: three VMs, each running an app on a guest kernel with a guest driver. Labels: hypervisor, virtual switch, host driver, VIC, one PCI PF, two PCI VFs, data path.]
SR-IOV for OS Bypass (usNIC)
[Diagram: user processes, each running an app; two of them include a user space driver. Labels: host OS, host TCP/IP stack, host driver, hypervisor, virtual switch, VIC, one PCI PF, two PCI VFs, data path.]
OS bypass: Software architecture comparison
TCP/IP: Application → userspace sockets library → kernel (TCP stack, general Ethernet driver, Cisco VIC driver) → Cisco VIC hardware
usNIC: Application → userspace verbs library → Cisco VIC hardware (send and receive fast path); bootstrapping and setup go through the kernel (Verbs IB core, Cisco USNIC driver)
Ethernet OS bypass applied to MPI
[Diagram: MPI on top of the userspace verbs library, which sits directly on the Cisco VIC hardware]
MPI receives L2 frames directly from the VIC
MPI directly injects L2 frames (with UDP/IP payloads)
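To make "L2 frames with UDP/IP payloads" concrete, here is a minimal sketch that assembles such a frame as bytes: Ethernet header, then IPv4 header, then UDP header, then payload. It only illustrates the layering. The addresses, ports, and the `build_frame` helper are hypothetical, nothing is sent on the wire, and the slide does not say which optional fields (for example the UDP checksum) usNIC actually fills in.

```python
import socket
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words, as used by the IPv4 header."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_frame(dst_mac, src_mac, src_ip, dst_ip, src_port, dst_port, payload):
    """Return Ethernet + IPv4 + UDP + payload as one bytes object."""
    eth = dst_mac + src_mac + struct.pack("!H", 0x0800)  # EtherType: IPv4
    udp_len = 8 + len(payload)
    # UDP checksum 0 means "not computed", which IPv4 permits.
    udp = struct.pack("!HHHH", src_port, dst_port, udp_len, 0)
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + udp_len,   # version/IHL, DSCP/ECN, total length
        0, 0,                    # identification, flags/fragment offset
        64, 17, 0,               # TTL, protocol 17 (UDP), checksum placeholder
        socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
    )
    # Patch the real header checksum into bytes 10..11.
    ip = ip[:10] + struct.pack("!H", ipv4_checksum(ip)) + ip[12:]
    return eth + ip + udp + payload

# Hypothetical example values.
frame = build_frame(
    bytes.fromhex("aabbccddeeff"), bytes.fromhex("aabbccddeefe"),
    "10.0.0.1", "10.0.0.2", 40000, 40001, b"hello",
)
print(len(frame), "bytes:", frame.hex())
```

Because the payload is ordinary UDP/IP, switches and routers along the path can treat the traffic like any other IP flow.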
OS Bypass Architecture
[Diagram: MPI processes each use a queue pair (QP) on the SR-IOV NIC (VIC). A classifier on the VIC handles inbound L2 frames; outbound L2 frames leave from the QPs. The I/O MMU (x86 chipset, VT-d) sits between the processes and the VIC.]
PCIe Single Root IO Virtualization (SR-IOV)
[Diagram: a VIC with two physical ports. Each port has one Physical Function (PF) with its own MAC address (aa:bb:cc:dd:ee:fe and aa:bb:cc:dd:ee:ff) and six Virtual Functions (VFs); the VFs carry queue pairs (QPs).]
PCIe Single Root IO Virtualization (SR-IOV)
[Diagram: the same VIC layout, with two physical ports, each with a PF (MAC) and six VFs. Two MPI processes reach their QPs through the Intel IO MMU.]
Intel I/O MMU
• Used for physical ↔ virtual memory translation
• usnic verbs driver programs (and de-programs) the IOMMU
[Diagram: the userspace process works with virtual addresses, RAM is addressed physically, and the Intel IO MMU translates between virtual and physical addresses for the VIC]
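The translation role described above can be pictured with a toy page table. The model below is a sketch, not the driver's logic: `ToyIommu` and its methods are hypothetical names, and it assumes 4 KiB pages and physically contiguous, page-aligned buffers to keep the arithmetic short.

```python
PAGE = 4096  # assumed page size

class ToyIommu:
    """Maps device-visible virtual pages to physical pages for one device."""

    def __init__(self):
        self._table = {}  # virtual page number -> physical page number

    def program(self, virt_addr, phys_addr, length):
        """Install mappings for every page in [virt_addr, virt_addr + length)."""
        first = virt_addr // PAGE
        last = (virt_addr + length - 1) // PAGE
        for i, vpage in enumerate(range(first, last + 1)):
            self._table[vpage] = phys_addr // PAGE + i

    def deprogram(self, virt_addr, length):
        """Remove the mappings again, e.g. when the buffer is released."""
        first = virt_addr // PAGE
        last = (virt_addr + length - 1) // PAGE
        for vpage in range(first, last + 1):
            self._table.pop(vpage, None)

    def translate(self, virt_addr):
        """Return the physical address, or fail if the device has no mapping."""
        vpage, offset = divmod(virt_addr, PAGE)
        if vpage not in self._table:
            raise PermissionError(f"DMA fault: {virt_addr:#x} is not mapped")
        return self._table[vpage] * PAGE + offset

iommu = ToyIommu()
iommu.program(virt_addr=0x7F0000000000, phys_addr=0x1234000, length=2 * PAGE)
print(hex(iommu.translate(0x7F0000001008)))
```

The point of the model is the failure case: once a range is de-programmed, a device access to it is rejected rather than reaching arbitrary physical memory, which is what makes it safe to hand the NIC directly to a userspace process.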
usNIC Higher Level Details
Quick Quiz
• Do you know what these are?
MAC address
IP Subnet
ARP
GID
LID
GRH
Standard Ethernet + UDP/IP
• Manage your Ethernet network however you want
• Manage and monitor UDP/IP traffic with standard tools
• Can use IP routing + ECMP to create spine+leaf (Clos) networks
• Incrementally grow deployments without rejiggering existing sub-cluster subnet config
• No additional cost for IP: Cisco switches route L2/L3 at same speed
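The ECMP bullet above relies on each flow being hashed onto one of several equal-cost paths. A minimal sketch of that idea follows. Real switches use their own hash functions and field selections; CRC32 over the 5-tuple is only a stand-in, and the spine names, addresses, and ports are hypothetical.

```python
import zlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, next_hops, proto=17):
    """Pick one equal-cost next hop by hashing the flow's 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

spines = ["spine-1", "spine-2", "spine-3", "spine-4"]

# Eight flows between the same two hosts, differing only in UDP source port.
flows = [("10.0.1.5", "10.0.2.9", 40000 + i, 50000) for i in range(8)]
for flow in flows:
    print(flow, "->", ecmp_next_hop(*flow, spines))
```

All packets of one flow take the same path, so ordering within a flow is preserved, while different UDP port pairs can be spread across the spines.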
Routable UDP/IP Considerations
• Design Principle: Behave like OS network stack as much as possible!
• Examples
Routing
ARP
UDP/IP port usage + visibility
MAC in L2 frames
• Can’t always achieve full parity
exotic routing configurations (e.g., ip rule add blackhole …)
tcpdump (no OS in datapath*)
Routable UDP/IP Considerations
1. call ibv_create_qp()
2. allocates a full Linux UDP socket w/ port in OS tables
3. pass to kmod w/ create_qp command
4. bump refcount before installing filter, prevents freeing socket before QP destruction
[Diagram: software stack from top to bottom: MPI, libibverbs, and libusnic_verbs in userspace; usnic_verbs.ko in the kernel. The UDP socket shows up in lsof/netstat.]
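The four steps above can be modeled in plain userspace Python to show why a real socket and an extra reference are useful. This is a sketch under loud assumptions: `ToyQueuePair` and its methods are hypothetical, the real reference count and filter live in the kernel module, and no verbs calls are made here.

```python
import socket

class ToyQueuePair:
    """Holds a UDP socket so its port stays reserved while the QP exists."""

    def __init__(self, bind_ip="127.0.0.1"):
        # Step 2: a real UDP socket, so the port appears in the OS tables.
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind((bind_ip, 0))      # port 0: let the OS pick a free port
        self.port = self.sock.getsockname()[1]
        # Step 3 stand-in: pretend the socket was handed over with "create_qp".
        self.filter_installed = True
        # Step 4 stand-in: one reference for the app, one held for the filter.
        self.refcount = 2

    def app_close(self):
        """The application drops its reference; the port must stay reserved."""
        self._put()

    def destroy(self):
        """QP destruction removes the filter and drops the last reference."""
        self.filter_installed = False
        self._put()

    def _put(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.sock.close()

qp = ToyQueuePair()
print("QP owns UDP port", qp.port)
qp.app_close()
print("socket still open after app close:", qp.sock.fileno() != -1)
qp.destroy()
```

Because the port belongs to an ordinary socket, no other process can bind it while the QP is alive, and standard tools can attribute the traffic to its owner.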
MPI + Routable UDP/IP
• Open MPI natively supports multi-rail
• Open MPI automagic configuration philosophy (when possible)
• VICs have 2 ports, can have >1 VIC per server
• Want to avoid artificial contention
pair local interfaces with remote interfaces
• Remote MPI process might be on the same subnet, might not
• Nontrivial software problem
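Part of the problem listed above is deciding whether a remote peer is on the same subnet as a given local interface. A minimal sketch using Python's standard `ipaddress` module is shown here; the addresses are hypothetical, and this is not a description of how Open MPI performs the check.

```python
import ipaddress

def same_subnet(local_if: str, remote_ip: str) -> bool:
    """True if remote_ip falls inside the local interface's subnet."""
    iface = ipaddress.ip_interface(local_if)   # e.g. "10.1.0.5/24"
    return ipaddress.ip_address(remote_ip) in iface.network

print(same_subnet("10.1.0.5/24", "10.1.0.77"))  # peer on the local subnet
print(same_subnet("10.1.0.5/24", "10.2.0.77"))  # peer must be reached via a router
```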
Example Interface Pairing
[Diagram: Host A (NIC A1, NIC A2) and Host B (NIC B1, NIC B2), shown before pairing and with two valid pairings (valid pairing 1, valid pairing 2). Key: possible connectivity; OMPI selected pairing; P1, P2: an MPI process.]
Simple Routing Scenario
[Diagram: Host A (NIC A1, NIC A2) and Host B (NIC B1, NIC B2) on subnets S1 and S2, connected through interfaces NIC R1a, NIC R2a, NIC R1b, and NIC R2b. Switch (does not need L3 capability).]
Matching Logic Must Watch For Sub-optimal Pairings
Host A (NIC A1, NIC A2) and Host B (NIC B1, NIC B2):
A1 can reach B1 and B2
A2 can only reach B1
Case 1 (sub-optimal)
• A2 cannot pair with any interface on Host B
• reduces aggregate bandwidth
Case 2 (desired)
• Both Host A interfaces can pair with Host B interfaces
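The two cases on this slide correspond to a greedy pairing versus a maximum bipartite matching. The sketch below reproduces both on the slide's reachability (A1 reaches B1 and B2, A2 reaches only B1). It illustrates the general technique of augmenting-path matching and is not claimed to be Open MPI's actual algorithm; both function names are hypothetical.

```python
def greedy_pairing(reach):
    """Give each local interface the first free remote it can reach."""
    taken, result = set(), {}
    for local, remotes in reach.items():
        for remote in remotes:
            if remote not in taken:
                taken.add(remote)
                result[local] = remote
                break
    return result

def max_pairing(reach):
    """Maximum bipartite matching via augmenting paths."""
    owner = {}  # remote interface -> local interface currently paired with it

    def try_pair(local, seen):
        for remote in reach[local]:
            if remote in seen:
                continue
            seen.add(remote)
            # Take a free remote, or evict its owner if the owner can move.
            if remote not in owner or try_pair(owner[remote], seen):
                owner[remote] = local
                return True
        return False

    for local in reach:
        try_pair(local, set())
    return {local: remote for remote, local in owner.items()}

reach = {"A1": ["B1", "B2"], "A2": ["B1"]}
print("greedy :", greedy_pairing(reach))   # Case 1: A2 is left unpaired
print("maximum:", max_pairing(reach))      # Case 2: both interfaces are paired
```

The greedy version gives B1 to A1 and strands A2, while the augmenting-path version moves A1 to B2 so that A2 can use B1.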
Performance
B2B 40 GbE IMB PingPong (OMPI v1.8)
1.88 µs on this SB machine
Open Source
• Everything above the firmware is open source
• Open MPI
Distributing in Cisco Open MPI v1.6.5 (soon to be v1.8.2)
Upstream in Open MPI v1.7.3 and beyond (current stable is v1.8.1)
• Libibverbs plugin
• Verbs kernel module
Roadmap: Fall / Winter 2014
• 3rd Generation VIC
2 x 40G and PCIe gen 3
More MPI offload to hardware
• Software update (expected this week)
Upgrade transport from custom L2 protocol to UDP
Key rationale point: Cisco switches L2 and L3 at same speed
Allows switching usNIC traffic around data center
Allows easier monitoring and policy control of usNIC traffic
Kernel + userspace support for RHEL 7.0, SLES 12
Open MPI optimizations for 3rd generation VIC
Thank you.