Introduction to Cisco UCS and Userspace NIC (usNIC). Argonne National Laboratory, September 2, 2014. Dave Goodell ([email protected]). © 2013 Cisco and/or its affiliates. All rights reserved.

2014/09/02 Cisco UCS HPC @ ANL

DESCRIPTION

A presentation about UCS and usNIC to the Math & Computer Science and Leadership Computing Facility divisions at Argonne National Laboratory (ANL). Presented to ANL by Dave Goodell (Cisco) on 2014-09-02.

Page 1: 2014/09/02 Cisco UCS HPC @ ANL

Introduction to Cisco UCS and Userspace NIC (usNIC)

Argonne National Laboratory, September 2, 2014

Dave Goodell [email protected]

Page 2: 2014/09/02 Cisco UCS HPC @ ANL

Record-setting Intel Ivy Bridge

1U and 2U servers (with GPU support)

Low-latency Ethernet

Yes, really!

10 & 40 Gbps top-of-rack & core switching

1.6 usecs

190 nsecs

Up to 1.5 TB RAM

10 & 40 Gbps!

Page 3: 2014/09/02 Cisco UCS HPC @ ANL

Unified Computing System Innovations

Performance optimized for any type of workload

Integrated Design

• Service Profiles: agility and reduced time to deploy and provision applications

• UCS Manager: role-based management, automation, ease of integration

• UCS Central: centralized, multi-domain management, alerting and visibility

• Unified Fabric: simplified infrastructure

• Virtualized I/O: security isolation per application, scale, improved performance

• Form Factor Independence: supports both blades and rack mount servers in a single domain

• Low Latency: low latency over industry-standard Ethernet networking

Page 4: 2014/09/02 Cisco UCS HPC @ ANL

Unified Fabric for HPC & HPDA

Consolidating the messaging/interconnect network

[Figure: a traditional network, with separate LAN (Ethernet), FC, and cluster (InfiniBand) links, compared with a unified fabric carrying Ethernet and FC, using DCB, FCoE & low latency.]

Page 5: 2014/09/02 Cisco UCS HPC @ ANL

Low Latency Messaging on Ethernet

• Benefits

• Low Latency Ethernet delivers high performance while retaining all the advantages of managing unified network fabric

• HPC Compute Clusters can coexist with Enterprise IT under same management framework

• Leverage True Hybrid Solutions From All IT Resources

• Simplifies Procurement

• Accelerates Deployment

• Non Intrusive

• Extends the Product Life Cycle / Reusability

Lower CAPEX and OPEX

Page 6: 2014/09/02 Cisco UCS HPC @ ANL

Unified HPC Fabric

One wire to rule them all:

• OS Mgmt Traffic (e.g., ssh)

• Server Hardware Mgmt

• File System / IO Traffic

• MPI / Application Traffic

10 & 40 Gbps Ethernet with QoS

Cisco CIMC, rich XML interface

Unified Management

HPC Networking / Routing

Page 7: 2014/09/02 Cisco UCS HPC @ ANL

Network QoS

[Figure: host ports eth0, eth1, and eth2 connected to switch ports, with one policy per link:]

• VLAN 27, MTU 1500 B, bandwidth: 100 Mbps

• VLAN 42, MTU 9000 B, bandwidth: 2 Gbps

• VLAN 64, MTU 9000 B, bandwidth: not limited

[Figure: a PCIe physical function with virtual functions. Labels: SSH process on eth0; MPI process on the CPU with eth2 RX/TX queue pairs; isolated HW resource.]

Page 8: 2014/09/02 Cisco UCS HPC @ ANL

Cisco UCS B-Series Characteristics

• Up to 20 Chassis (160 Blades)

• 3840 CPU Cores

• 20 Gbps Bandwidth/Blade

• Burst Capacity up to 80 Gbps

• Single Wire Management

• Enterprise & HPC

• Pod Architecture

• Scalable

96 or 48 Ports

5.3 usecs any-to-any latency

Up to 82.94 TeraFLOPs (Intel Ivy Bridge)
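
The core count and peak FLOPs figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes two 12-core sockets per blade, a 2.7 GHz clock, and 8 double-precision FLOPs per core per cycle. None of those three values appear on the slide; they are assumptions chosen because they reproduce the quoted totals.

```python
# Cross-check of the B-Series figures quoted on the slide.
CHASSIS = 20                 # from the slide
BLADES_PER_CHASSIS = 8       # from the slide: 160 blades / 20 chassis
SOCKETS_PER_BLADE = 2        # assumption, not stated on the slide
CORES_PER_SOCKET = 12        # assumption, not stated on the slide
CLOCK_HZ = 2.7e9             # assumption, not stated on the slide
FLOPS_PER_CYCLE = 8          # assumption: double precision, per core


def total_blades():
    """Number of blades in a fully populated domain."""
    return CHASSIS * BLADES_PER_CHASSIS


def total_cores():
    """Number of CPU cores across all blades."""
    return total_blades() * SOCKETS_PER_BLADE * CORES_PER_SOCKET


def peak_tflops():
    """Theoretical peak, in TeraFLOPs, under the assumptions above."""
    return total_cores() * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12


print(total_blades(), "blades")
print(total_cores(), "cores")
print(f"{peak_tflops():.2f} peak TeraFLOPs")
```

Under these assumptions the totals come out to 160 blades, 3840 cores, and roughly 82.94 TeraFLOPs, matching the slide. This is a theoretical peak, not a measured result.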

Page 9: 2014/09/02 Cisco UCS HPC @ ANL

Cisco UCS C-Series

C220 M3 - 1RU Dual Socket Rack Server (Up to 384 GB RAM)

C240 M3 - 2RU Dual Socket Compute OR Storage Rack Server

C420 M3 - 2RU Dual OR Quad Socket Server (Up to 1.5 TB RAM)

3rd Party GPU Expansion

Page 10: 2014/09/02 Cisco UCS HPC @ ANL

Cisco Nexus Ultra Low Latency Ethernet Switching

Port-to-port latency:

• Nexus 3548 (48 port x 10 Gbps, 12 x 40 Gbps): 190 nsecs

• Nexus 3172PQ (72 port x 10 Gbps, 6 x 40 Gbps): <500 nsecs

• Nexus 3132Q (32 port x 40 Gbps): <500 nsecs

• Nexus 9000 (9504: 144 port x 40 Gbps; 9508: 288 port x 40 Gbps; 9516: 576 port x 40 Gbps): <500 nsecs

Page 11: 2014/09/02 Cisco UCS HPC @ ANL

usNIC Low Level Details

Page 12: 2014/09/02 Cisco UCS HPC @ ANL

App to App Latency Components

App to App Latency Factors

[Figure: two host stacks, each showing Application, TCP/IP, usNIC, OS Kernel, and vEth layers.]

[Chart: latency in usecs (axis 0 to 10) for TCP/IP and usNIC, broken down into middleware, kernel, NIC, and network.]

• TCP/IP: 9.42 usecs (kernel overhead)

• usNIC: 2.02 usecs (kernel bypass using SR-IOV)

HW resource isolation using IOMMU

Dual functionality!

Page 13: 2014/09/02 Cisco UCS HPC @ ANL

Cisco Userspace NIC (usNIC) overview

• Direct access to NIC hardware from Linux userspace

  • Operating System bypass

  • via the Linux Verbs API (UD)

• Utilizes Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency

  • 2nd generation 80 Gbps Cisco ASIC

  • 2 x 10 Gbps Ethernet ports, or

  • 2 x 40 Gbps Ethernet ports

  • PCI and mezzanine form factors

• Half-round trip (HRT) ping-pong latencies (Intel E5-2690 v2 servers):

  • Raw back to back: 1.57 μs

  • MPI back to back: 1.85 μs

  • Through MPI+N3548: 2.02 μs

These numbers keep going down
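
A half-round-trip figure is normally derived by timing many ping-pong exchanges and dividing by twice the iteration count. The sketch below shows that calculation and the per-layer deltas implied by the three numbers on this slide. The function name and the sample timing are illustrative assumptions; this is not the benchmark that produced the slide's results.

```python
def half_round_trip_us(total_seconds, iterations):
    """Half-round-trip latency in microseconds for a ping-pong run.

    Each iteration is one full round trip (ping plus pong), so the
    one-way estimate is the total time divided by 2 * iterations.
    """
    return total_seconds / (2 * iterations) * 1e6


# Latencies quoted on the slide, in microseconds.
RAW_B2B_US = 1.57
MPI_B2B_US = 1.85
MPI_N3548_US = 2.02

# Differences between successive layers, rounded to the slide's precision.
mpi_overhead_us = round(MPI_B2B_US - RAW_B2B_US, 2)
switch_overhead_us = round(MPI_N3548_US - MPI_B2B_US, 2)

print("MPI layer adds", mpi_overhead_us, "usecs")
print("Switch hop adds", switch_overhead_us, "usecs")
```

By this arithmetic the MPI layer accounts for about 0.28 usecs over the raw transport and the Nexus 3548 hop for about 0.17 usecs, which is in the same range as the 190 nsecs port-to-port figure quoted for that switch.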

Page 14: 2014/09/02 Cisco UCS HPC @ ANL

Cisco 2x10|40Gb Virtual Interface Card (VIC)

• 2nd generation VIC:

  • Can present itself 256 times on the PCI bus

  • Has enough hardware queues / buffering for 256 actual NICs

• Created for virtualization

  • Designed for hypervisor bypass

• Intent:

  • Each vNIC assigned to a single virtual machine

  • Can therefore bypass hypervisor

  • “Bare metal” network performance in a VM

[Figure: one physical VIC presenting multiple virtual NICs on the PCI bus.]

Page 15: 2014/09/02 Cisco UCS HPC @ ANL

PCIe Single Root IO Virtualization (SR-IOV)

[Figure: a VIC with two physical ports presenting six vNICs. Each vNIC is a PCI Physical Function (PF) with its own MAC address, aa:bb:cc:dd:ee:fa through aa:bb:cc:dd:ee:ff.]

Page 16: 2014/09/02 Cisco UCS HPC @ ANL

PCI Passthrough for Hypervisor Bypass (non-usNIC)

[Figure: three VMs, each with an app, guest kernel, and guest driver, above a hypervisor with a host driver and virtual switch. The VIC exposes PCI PFs; the data path is marked.]

Page 17: 2014/09/02 Cisco UCS HPC @ ANL

SR-IOV for Hypervisor Bypass (non-usNIC)

[Figure: three VMs, each with an app, guest kernel, and guest driver, above a hypervisor with a host driver and virtual switch. The VIC exposes a PCI PF and PCI VFs; the data path is marked.]

Page 18: 2014/09/02 Cisco UCS HPC @ ANL

SR-IOV for OS Bypass (usNIC)

[Figure: user processes, each with an app and a user space driver, above a host OS with a host driver and host TCP/IP stack. The VIC exposes a PCI PF and PCI VFs; the data path is marked. Other labels: VM, hypervisor, virtual switch.]

Page 19: 2014/09/02 Cisco UCS HPC @ ANL

OS bypass: Software architecture comparison

[Figure: two software stacks side by side, split into userspace and kernel.]

• TCP/IP: Application, userspace sockets library, TCP stack, general Ethernet driver, Cisco VIC driver, Cisco VIC hardware

• usNIC: Application, userspace verbs library, Verbs IB core, Cisco USNIC driver, Cisco VIC hardware

Arrows mark two paths: bootstrapping and setup, and the send and receive fast path.

Page 20: 2014/09/02 Cisco UCS HPC @ ANL

Ethernet OS bypass applied to MPI

MPI

MPI receives L2 frames directly from the VIC

Userspace verbs library

Cisco VIC hardware

MPI directly injects L2 frames (with UDP/IP payloads)

Page 21: 2014/09/02 Cisco UCS HPC @ ANL

OS Bypass Architecture

[Figure labels: I/O MMU (x86 chipset VT-d); SR-IOV NIC (VIC) with classifier; MPI processes, each with a queue pair (QP); inbound L2 frames; outbound L2 frames.]

Page 22: 2014/09/02 Cisco UCS HPC @ ANL

PCIe Single Root IO Virtualization (SR-IOV)

[Figure: a VIC with two physical ports. Each port has a Physical Function (PF) with its own MAC address (aa:bb:cc:dd:ee:fe and aa:bb:cc:dd:ee:ff), with Virtual Functions (VFs) and queue pairs (QPs) beneath it.]

Page 23: 2014/09/02 Cisco UCS HPC @ ANL

PCIe Single Root IO Virtualization (SR-IOV)

[Figure: a VIC with two physical ports, each with a PF (MAC) and six VFs. Two MPI processes and their queue pairs (QPs) are shown with the Intel IO MMU.]

Page 24: 2014/09/02 Cisco UCS HPC @ ANL

Intel I/O MMU

• Used for physical <-> virtual memory translation

• usnic verbs driver programs (and de-programs) the IOMMU

[Figure: a userspace process and the VIC both use virtual addresses; the Intel IO MMU sits between those virtual addresses and physical RAM.]
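
The programming and de-programming of translations can be pictured with a toy model. The sketch below is purely illustrative: the class, its method names, and the 4 KiB page size are assumptions, and it models neither real VT-d hardware nor the usnic verbs driver.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class ToyIommu:
    """Toy page-granular translation table: device-visible virtual
    addresses on one side, physical RAM addresses on the other."""

    def __init__(self):
        self._table = {}  # virtual page number -> physical page number

    def map_page(self, virt_addr, phys_addr):
        """'Program' one translation; both addresses must be page aligned."""
        if virt_addr % PAGE_SIZE or phys_addr % PAGE_SIZE:
            raise ValueError("addresses must be page aligned")
        self._table[virt_addr // PAGE_SIZE] = phys_addr // PAGE_SIZE

    def unmap_page(self, virt_addr):
        """'De-program' the translation covering virt_addr, if any."""
        self._table.pop(virt_addr // PAGE_SIZE, None)

    def translate(self, virt_addr):
        """Translate a device access; unmapped addresses are rejected."""
        page, offset = divmod(virt_addr, PAGE_SIZE)
        if page not in self._table:
            raise PermissionError(f"no mapping for address {virt_addr:#x}")
        return self._table[page] * PAGE_SIZE + offset


iommu = ToyIommu()
iommu.map_page(0x10000, 0x7F000)
print(hex(iommu.translate(0x10010)))
iommu.unmap_page(0x10000)
```

The point of the model is the isolation property: once a mapping is removed, a device access to that address is refused rather than reaching memory.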

Page 25: 2014/09/02 Cisco UCS HPC @ ANL

usNIC Higher Level Details

Page 26: 2014/09/02 Cisco UCS HPC @ ANL

Quick Quiz

• Do you know what these are?

MAC address

IP Subnet

ARP

GID

LID

GRH

Page 27: 2014/09/02 Cisco UCS HPC @ ANL

Standard Ethernet + UDP/IP

• Manage your Ethernet network however you want

• Manage and monitor UDP/IP traffic with standard tools

• Can use IP routing + ECMP to create spine+leaf (Clos) networks

• Incrementally grow deployments without rejiggering existing sub-cluster subnet config

• No additional cost for IP: Cisco switches route L2/L3 at same speed
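
ECMP in a spine+leaf network spreads flows across equal-cost next hops by hashing packet header fields, so every packet of one flow takes the same path. The sketch below illustrates the idea with a software hash over the UDP/IP 5-tuple. The function name, the hash choice, and the spine names are assumptions for illustration; real switches use their own hardware hash functions.

```python
import hashlib


def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, next_hops, proto=17):
    """Pick one of several equal-cost next hops for a flow.

    The 5-tuple (addresses, ports, protocol; 17 is UDP) is hashed so
    that all packets of a flow map to the same next hop.
    """
    if not next_hops:
        raise ValueError("need at least one next hop")
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical spine switches reachable from one leaf.
spines = ["spine-1", "spine-2", "spine-3", "spine-4"]
print(ecmp_next_hop("10.0.1.5", "10.0.2.9", 40000, 40001, spines))
```

Because usNIC traffic is ordinary UDP/IP, different source and destination ports give different flows, which is what lets a hash like this spread traffic across the spine.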

Page 28: 2014/09/02 Cisco UCS HPC @ ANL

Routable UDP/IP Considerations

• Design Principle: Behave like OS network stack as much as possible!

• Examples

  • Routing

  • ARP

  • UDP/IP port usage + visibility

  • MAC in L2 frames

• Can’t always achieve full parity

  • exotic routing configurations (e.g., ip rule add blackhole …)

  • tcpdump (no OS in datapath*)

Page 29: 2014/09/02 Cisco UCS HPC @ ANL

Routable UDP/IP Considerations

1. call ibv_create_qp()

2. allocates a full Linux UDP socket w/ port in OS tables

3. pass to kmod w/ create_qp command

4. bump refcount before installing filter, prevents freeing socket before QP destruction

[Figure: software stack. Userspace: MPI, libibverbs, libusnic_verbs. Kernel: usnic_verbs.ko. The UDP socket shows up in lsof/netstat.]
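
The idea behind steps 2 and 4 (hold a real, bound UDP socket open for as long as the QP needs the port) can be illustrated from ordinary userspace. The sketch below is an analogy only: the class and helper names are hypothetical, it uses the loopback address, and it is not the libusnic_verbs or usnic_verbs.ko code.

```python
import socket


class ReservedUdpPort:
    """Keeps a bound UDP socket open so the OS treats the port as in use."""

    def __init__(self, address="127.0.0.1"):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.bind((address, 0))  # port 0: let the OS pick a free port
        self.address = address
        self.port = self._sock.getsockname()[1]

    def close(self):
        """Release the port; analogous to destroying the QP."""
        self._sock.close()


def port_is_free(port, address="127.0.0.1"):
    """Return True if another UDP socket can bind to (address, port)."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        probe.bind((address, port))
        return True
    except OSError:
        return False
    finally:
        probe.close()


reservation = ReservedUdpPort()
print("reserved UDP port", reservation.port)
print("free for others?", port_is_free(reservation.port))
reservation.close()
```

While the reservation is held, the port appears in the OS socket tables like any other UDP socket, and no other process can bind to it, which is the visibility property the slide calls out.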

Page 30: 2014/09/02 Cisco UCS HPC @ ANL

MPI + Routable UDP/IP

• Open MPI natively supports multi-rail

• Open MPI automagic configuration philosophy (when possible)

• VICs have 2 ports, can have >1 VIC per server

• Want to avoid artificial contention

  • pair local interfaces with remote interfaces

• Remote MPI process might be on the same subnet, might not

• Nontrivial software problem

Page 31: 2014/09/02 Cisco UCS HPC @ ANL

Example Interface Pairing

[Figure: Host A (NIC A1, NIC A2) and Host B (NIC B1, NIC B2) with MPI processes P1 and P2, shown in three panels: before pairing, valid pairing 1, and valid pairing 2. Key: possible connectivity; OMPI selected pairing; an MPI process.]

Page 32: 2014/09/02 Cisco UCS HPC @ ANL

Simple Routing Scenario

[Figure: Host A (NIC A1, NIC A2) and Host B (NIC B1, NIC B2) on subnets S1 and S2, together with NIC R1a, NIC R2a, NIC R1b, and NIC R2b, connected through a switch (does not need L3 capability).]

Page 33: 2014/09/02 Cisco UCS HPC @ ANL

Matching Logic Must Watch For Sub-optimal Pairings

Host A has NIC A1 and NIC A2; Host B has NIC B1 and NIC B2.

• A1 can reach B1 and B2

• A2 can only reach B1

Case 1 (sub-optimal)

• A2 cannot pair with any interface on Host B

• reduces aggregate bandwidth

Case 2 (desired)

• Both Host A interfaces can pair with Host B interfaces
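
One way to avoid the sub-optimal case is to treat pairing as a maximum bipartite matching problem rather than assigning interfaces greedily. The sketch below uses a standard augmenting-path matching on the reachability example from this slide. It is an illustration under that assumption; the slides do not say which algorithm Open MPI uses, and the function name is hypothetical.

```python
def pair_interfaces(reachable):
    """Pair local interfaces with remote ones, maximizing the pair count.

    reachable maps each local interface to the remote interfaces it
    can reach. Returns a dict of local -> remote pairings.
    """
    owner = {}  # remote interface -> local interface currently paired

    def try_assign(local, visited):
        for remote in reachable[local]:
            if remote in visited:
                continue
            visited.add(remote)
            holder = owner.get(remote)
            # Take a free remote, or move its current holder elsewhere.
            if holder is None or try_assign(holder, visited):
                owner[remote] = local
                return True
        return False

    for local in reachable:
        try_assign(local, set())
    return {local: remote for remote, local in owner.items()}


# Reachability from the slide: A1 reaches B1 and B2, A2 reaches only B1.
pairs = pair_interfaces({"A1": ["B1", "B2"], "A2": ["B1"]})
print(pairs)
```

A greedy pass would give A1 the first interface it can reach (B1) and leave A2 unpaired, which is Case 1. The augmenting step moves A1 to B2 so that A2 can take B1, which is Case 2.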

Page 34: 2014/09/02 Cisco UCS HPC @ ANL

Performance

Page 35: 2014/09/02 Cisco UCS HPC @ ANL

B2B 40 GbE IMB PingPong (OMPI v1.8)

1.88 µs on this SB machine

Page 36: 2014/09/02 Cisco UCS HPC @ ANL

B2B 40 GbE IMB PingPong (OMPI v1.8)

Page 37: 2014/09/02 Cisco UCS HPC @ ANL

Open Source

• Everything above the firmware is open source

• Open MPI

  • Distributing in Cisco Open MPI v1.6.5 (soon to be v1.8.2)

  • Upstream in Open MPI v1.7.3 and beyond (current stable is v1.8.1)

• Libibverbs plugin

• Verbs kernel module

Page 38: 2014/09/02 Cisco UCS HPC @ ANL

Roadmap: Fall / Winter 2014

• 3rd Generation VIC

  • 2 x 40G and PCIe gen 3

  • More MPI offload to hardware

• Software update (expected this week)

  • Upgrade transport from custom L2 protocol to UDP

  • Key rationale point: Cisco switches L2 and L3 at same speed

  • Allows switching usNIC traffic around data center

  • Allows easier monitoring and policy control of usNIC traffic

  • Kernel + userspace support for RHEL 7.0, SLES 12

  • Open MPI optimizations for 3rd generation VIC

Page 39: 2014/09/02 Cisco UCS HPC @ ANL

Thank you.