
Cisco's journey from Verbs to Libfabric


Page 1: Cisco's journey from Verbs to Libfabric

© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public

Cisco’s Journey From Verbs to Libfabric

Abandon the shackles of Verbs

Embrace the freedom of Libfabric

Jeffrey M. Squyres, Cisco Systems, 23 September 2015

Page 2: Cisco's journey from Verbs to Libfabric

[Figure: TCP/IP and usNIC software stacks on a Cisco VIC ethX port. TCP/IP: Application, userspace sockets API, kernel TCP stack, general Ethernet driver (enic.ko). usNIC: Application, userspace library, Verbs IB core, usnic.ko, with a send and receive fast path.]

Page 3: Cisco's journey from Verbs to Libfabric


Verbs is a fine API. …if you make InfiniBand hardware.

Page 4: Cisco's journey from Verbs to Libfabric


...but now there’s this libfabric thing (see libfabric.org community for details)

Page 5: Cisco's journey from Verbs to Libfabric


Keep in mind, Cisco already supports UD Verbs

Page 6: Cisco's journey from Verbs to Libfabric

Verbs

•  Monotonic enum: IBV_MTU_256, IBV_MTU_512, IBV_MTU_1024, IBV_MTU_2048, IBV_MTU_4096
•  Could not add popular Ethernet values (1500, 9000)
•  usNIC verbs provider had to lie (!) …just like iWARP providers
•  MPI had to match verbs device with IP interface to find real MTU
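The MTU mismatch can be shown with a small sketch. The table below holds the five byte sizes that the verbs enum on this slide can express; `sketch_largest_verbs_mtu()` is a hypothetical helper (not part of any verbs library) that assumes a provider rounds down to the nearest expressible value. Neither 1500 nor 9000 is representable, so whatever gets reported differs from the real Ethernet MTU.

```c
#include <stddef.h>

/* Byte sizes expressible by the verbs MTU enum listed on this slide. */
static const int sketch_verbs_mtus[] = { 256, 512, 1024, 2048, 4096 };

/* Hypothetical helper: largest expressible MTU that does not exceed the
 * real Ethernet MTU. Returns 0 if even the smallest value is too large.
 * Rounding down is an assumption made for this illustration. */
int sketch_largest_verbs_mtu(int ethernet_mtu)
{
    int best = 0;
    size_t n = sizeof(sketch_verbs_mtus) / sizeof(sketch_verbs_mtus[0]);

    for (size_t i = 0; i < n; i++) {
        if (sketch_verbs_mtus[i] <= ethernet_mtu)
            best = sketch_verbs_mtus[i];
    }
    return best;
}

/* Returns 1 when the real MTU can be reported exactly, 0 otherwise. */
int sketch_mtu_is_exact(int ethernet_mtu)
{
    return sketch_largest_verbs_mtu(ethernet_mtu) == ethernet_mtu;
}
```

With this helper, a 1500-byte link would be reported as 1024 and a 9000-byte link as 4096, which is why MPI had to look up the matching IP interface to learn the real value.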

Page 7: Cisco's journey from Verbs to Libfabric

Libfabric

•  Integer (not enum) endpoint attribute

Page 8: Cisco's journey from Verbs to Libfabric


DONE

Page 9: Cisco's journey from Verbs to Libfabric

Verbs

•  Mandatory GRH structure
   InfiniBand-specific header
•  GRH is 40 bytes; UDP header is 42 bytes
   …and a different format
•  Breaks ibv_ud_pingpong
•  usnic verbs provider used “magic” ibv_query_port() to return extensions pointers
   E.g., enable 42-byte UDP mode

[Figure: header layouts. UDP header: 42 bytes (fields include et, len, chk, smac, dmac, …). GRH: 40 bytes (fields include ver, len, next, hop, sgid, dgid).]

Page 10: Cisco's journey from Verbs to Libfabric

Libfabric

•  FI_MSG_PREFIX and ep_attr.msg_prefix_size

[Figure: buffer layout with a header prefix (et, len, chk, smac, dmac, …) followed by the payload.]
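A minimal sketch of the prefixed-buffer idea, assuming the provider reports a prefix size (the slide names `ep_attr.msg_prefix_size`) and the application reserves that many bytes at the front of each buffer. The struct and function names are hypothetical and do not come from libfabric; the 42-byte total is Ethernet (14) + IPv4 without options (20) + UDP (8), matching the figure quoted on the previous slide.

```c
#include <stddef.h>
#include <stdint.h>

/* Header sizes that add up to the 42-byte "UDP header" on the slide. */
enum {
    SKETCH_ETH_HDR  = 14,  /* dmac + smac + ethertype */
    SKETCH_IPV4_HDR = 20,  /* IPv4 header without options */
    SKETCH_UDP_HDR  = 8    /* ports, length, checksum */
};

/* Hypothetical view of a receive buffer that begins with a provider prefix. */
struct sketch_rx_buf {
    uint8_t *base;         /* start of the buffer handed to the provider */
    size_t   total_len;    /* bytes received, prefix included */
    size_t   prefix_size;  /* prefix size the provider reported */
};

/* Pointer to the application payload, just past the prefix. */
uint8_t *sketch_payload(const struct sketch_rx_buf *b)
{
    return b->base + b->prefix_size;
}

/* Payload length, or 0 if the buffer is shorter than the prefix. */
size_t sketch_payload_len(const struct sketch_rx_buf *b)
{
    return b->total_len > b->prefix_size ? b->total_len - b->prefix_size : 0;
}
```

Because the prefix size is a value reported at run time rather than a fixed 40-byte structure, the same application code works whether the provider needs 42 bytes, some other size, or none.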

Page 11: Cisco's journey from Verbs to Libfabric


DONE

Page 12: Cisco's journey from Verbs to Libfabric

Verbs

•  Tuple: (device, port)
   Usually a physical device and port
   Does not match virtualized VIC hardware
•  Queue pair
•  Completion queue

[Figure: hwloc topology (physical indexes, dated Sat Mar 14 09:27:31 2015) of a 64GB machine with two 32GB NUMA nodes, one 10-core socket each. Four Cisco VIC ports (PCI 1137:0043; eth4 to eth7, usnic_0 to usnic_3) each have four PCI 1137:00cf devices listed with them. The drawing is labeled with ibv_device, ibv_port, QP, and CQ.]

Page 13: Cisco's journey from Verbs to Libfabric

Libfabric

•  Maps nicely to SR-IOV
•  Fabric → PCI physical function (PF)
•  Domain → PCI virtual function (VF)
•  Endpoint → Resources in VF

[Figure: hwloc topology (physical indexes, dated Sat Mar 14 09:27:31 2015) of a 64GB machine with two 32GB NUMA nodes and four Cisco VIC ports (PCI 1137:0043; eth4 to eth7, usnic_0 to usnic_3), each with four PCI 1137:00cf devices. The drawing is labeled with fi_fabric, fi_domain, fi_endpoint (resources in domain), EP, and CQ.]

Page 14: Cisco's journey from Verbs to Libfabric

Verbs

•  GID and GUID
   No easy mapping back to IP interface
•  usnic verbs provider encoded MAC in GID
   Still cumbersome to map back to IP interface
•  Could use RDMA CM
   …but that would be a ton more code

    mac[0] = gid->raw[8] ^ 2;
    mac[1] = gid->raw[9];
    mac[2] = gid->raw[10];
    mac[3] = gid->raw[13];
    mac[4] = gid->raw[14];
    mac[5] = gid->raw[15];
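The decoding shown on this slide can be wrapped in a small function. This is a sketch only: `struct sketch_gid` is a local 16-byte stand-in for the verbs GID type, and the encoder is an assumed inverse of the slide's decoding. In particular, the `ff:fe` filler in bytes 11 and 12 (the usual modified EUI-64 layout) and the zeroed upper eight bytes are assumptions, since the slide shows only the decoding direction.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-in for a 16-byte GID (assumption: not the verbs header type). */
struct sketch_gid {
    uint8_t raw[16];
};

/* Decode, exactly as on the slide: MAC bytes live in raw[8..10] and
 * raw[13..15], with one bit of the first byte flipped. */
void sketch_gid_to_mac(const struct sketch_gid *gid, uint8_t mac[6])
{
    mac[0] = gid->raw[8] ^ 2;
    mac[1] = gid->raw[9];
    mac[2] = gid->raw[10];
    mac[3] = gid->raw[13];
    mac[4] = gid->raw[14];
    mac[5] = gid->raw[15];
}

/* Assumed inverse: place the MAC into the lower 8 bytes of the GID.
 * The upper 8 bytes are left as zero in this sketch. */
void sketch_mac_to_gid(const uint8_t mac[6], struct sketch_gid *gid)
{
    memset(gid->raw, 0, sizeof(gid->raw));
    gid->raw[8]  = mac[0] ^ 2;
    gid->raw[9]  = mac[1];
    gid->raw[10] = mac[2];
    gid->raw[11] = 0xff;  /* assumed EUI-64 filler */
    gid->raw[12] = 0xfe;  /* assumed EUI-64 filler */
    gid->raw[13] = mac[3];
    gid->raw[14] = mac[4];
    gid->raw[15] = mac[5];
}
```

Even with the MAC recovered, the application still has to search the host's interfaces for the one that owns that MAC, which is the cumbersome step the slide refers to.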

Page 15: Cisco's journey from Verbs to Libfabric


•  Can use IP addressing directly

Libfabric

Everything is awesome

Page 16: Cisco's journey from Verbs to Libfabric

DONE

Page 17: Cisco's journey from Verbs to Libfabric

Verbs

•  Generic send call: ibv_post_send(…SG list…)
   Lots of branches
•  Wasteful allocations
•  No prefixed receive
•  Branching in completions

Page 18: Cisco's journey from Verbs to Libfabric

Libfabric

•  Multiple types of send calls: fi_send(buffer, …)
•  Variable-length prefix receive
   Provider-specific
•  Fewer branches in completions

Page 19: Cisco's journey from Verbs to Libfabric

[Figure: Open MPI with usNIC: IMB PingPong Latency. Y axis: time (microseconds), ticks from 1.9 to 2.4. X axis: buffer size, log scale from 0.1 to 100. Series: imb-pingpong-ompi-1.8-verbs.out and imb-pingpong-ompi-1.8-libfabric.out.]

Page 20: Cisco's journey from Verbs to Libfabric

[Figure: Open MPI with usNIC: IMB SendRecv Bandwidth. Y axis: bandwidth (megabits/second), ticks from 61000 to 69000. X axis: buffer size, with a tick at 1e+06. Series: imb-sendrecv-ompi-1.8-verbs.out and imb-sendrecv-ompi-1.8-libfabric.out.]

Page 21: Cisco's journey from Verbs to Libfabric

Verbs

•  Performance issues
•  Memory registration still a problem
•  No MPI-style tag matching
•  One-sided capabilities do not match MPI
•  Network topology is a separate API

Page 22: Cisco's journey from Verbs to Libfabric

Libfabric

•  Performance happiness
•  Many MPI-helpful features:
   Tag matching
   One-sided operations
   Triggered operations
•  Inherently designed to be more than just point-to-point
•  More work to be done… but promising
   MMU notify
   Network topology

Page 23: Cisco's journey from Verbs to Libfabric

Verbs

•  Long design discussions about how to expose Ethernet / VIC concepts in the verbs API
   …usually with few good answers
   Especially problematic with new VIC features over time
•  Conclusion: possible (obviously), but not preferable

Libfabric

•  Whole API designed with multiple vendor hardware models in mind
•  Much easier to match our hardware to core Libfabric concepts
•  Conclusion: far preferable to verbs

Page 24: Cisco's journey from Verbs to Libfabric


Ok, so let’s do libfabric!

Page 25: Cisco's journey from Verbs to Libfabric


Does it play well with MPI?

Page 26: Cisco's journey from Verbs to Libfabric


Byte Transport Layer (BTL) plugins

Matching Transport Layer (MTL) plugins

MPI_Send(…)

Page 27: Cisco's journey from Verbs to Libfabric

Byte Transport Layer (BTL) plugins

•  Inherently multi-device
•  Round-robin for small messages
•  Striping for large messages
•  Major protocol decisions and MPI message matching driven by an Open MPI engine
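The two multi-device policies named on this slide can be sketched as follows. This is an illustration under stated assumptions, not Open MPI's implementation: the 4096-byte cutoff, the struct, and every function name are hypothetical.

```c
#include <stddef.h>

#define SKETCH_SMALL_MSG_MAX 4096  /* hypothetical small/large cutoff */

/* Hypothetical scheduler state for a set of equivalent devices. */
struct sketch_sched {
    size_t num_devices;  /* must be at least 1 */
    size_t next_device;  /* round-robin cursor */
};

/* Returns 1 if a message of this length counts as "small". */
int sketch_is_small(size_t len)
{
    return len <= SKETCH_SMALL_MSG_MAX;
}

/* Small message: the whole message goes to one device, chosen round-robin. */
size_t sketch_pick_device(struct sketch_sched *s)
{
    size_t dev = s->next_device;

    s->next_device = (s->next_device + 1) % s->num_devices;
    return dev;
}

/* Large message: split len into one stripe per device. Writes the stripe
 * lengths into out[0 .. num_devices-1]; the first (len % num_devices)
 * stripes carry one extra byte so the lengths sum to len. */
void sketch_stripe(const struct sketch_sched *s, size_t len, size_t *out)
{
    size_t base  = len / s->num_devices;
    size_t extra = len % s->num_devices;

    for (size_t i = 0; i < s->num_devices; i++)
        out[i] = base + (i < extra ? 1 : 0);
}
```

Round-robin keeps per-message overhead low for small messages, while striping lets a single large message use the bandwidth of every device at once.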

Page 28: Cisco's journey from Verbs to Libfabric

Matching Transport Layer (MTL) plugins

•  Most details hidden by network API
   •  MXM
   •  Portals
   •  PSM
•  As a side effect, must handle:
   •  Process loopback
   •  Server loopback (usually via shared memory)

Page 29: Cisco's journey from Verbs to Libfabric

Byte Transport Layer (BTL) plugins

•  IB / iWARP (verbs)
•  Portals
•  SCIF
•  Shared memory
•  TCP
•  uGNI
•  usNIC (verbs)

Matching Transport Layer (MTL) plugins

•  MXM
•  Portals
•  PSM
•  PSM2

Page 30: Cisco's journey from Verbs to Libfabric

Byte Transport Layer (BTL) plugins

•  IB / iWARP (verbs)
•  Portals
•  SCIF
•  Shared memory
•  TCP
•  uGNI
•  usNIC

Matching Transport Layer (MTL) plugins

•  MXM
•  Portals
•  PSM
•  PSM2
•  ofi

libfabric

Page 31: Cisco's journey from Verbs to Libfabric

libfabric

usnic BTL

•  Cisco developed
•  usNIC-specific
•  OFI point-to-point / UD
•  Tested with usNIC

ofi MTL

•  Intel developed
•  Provider neutral
•  OFI tag matching
•  Tested with PSM / PSM2

Page 32: Cisco's journey from Verbs to Libfabric

There are two main parts of the usNIC BTL

•  Bootstrapping
•  Message passing

Page 33: Cisco's journey from Verbs to Libfabric

These two parts were previously written to the Verbs API

•  verbs bootstrapping
•  verbs message passing

Page 34: Cisco's journey from Verbs to Libfabric

Per the previous slides, the Verbs API requires some… help… in the form of sideband bootstrapping

•  verbs bootstrapping
•  verbs message passing
•  sideband bootstrapping
   1.  Find the corresponding ethX device
   2.  Obtain MTU
   3.  Open usNIC-specific configuration options

Page 35: Cisco's journey from Verbs to Libfabric

Now let’s convert to use the libfabric API

•  verbs bootstrapping → libfabric bootstrapping
•  verbs message passing → libfabric message passing
•  sideband bootstrapping

Page 36: Cisco's journey from Verbs to Libfabric

•  verbs bootstrapping → libfabric bootstrapping
   Bootstrapping sequence totally different / not comparable
•  verbs message passing → libfabric message passing
   Pretty much a ~1:1 swap of verbs → libfabric calls
•  sideband bootstrapping
   …but libfabric needs no sideband bootstrapping (got to delete several hundred lines of OMPI code – yay!)

Page 37: Cisco's journey from Verbs to Libfabric

usnic BTL

•  For a specific provider: ask fi_getinfo() for prov_name=“usnic”
•  Use usNIC extensions: netmask, link speed, IP device name, etc.
•  usNIC-specific error messages

ofi MTL

•  For any tag-matching provider
•  No extension use: 100% portable
•  Generic error messages
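The two selection models can be contrasted in a self-contained sketch. It deliberately does not call libfabric: `struct sketch_provider`, the capability bit, and both lookup functions are hypothetical stand-ins for the list of candidates a real application would obtain from `fi_getinfo()`.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_CAP_TAGGED 0x1u  /* hypothetical "supports tag matching" bit */

/* Hypothetical description of one available provider. */
struct sketch_provider {
    const char *name;  /* provider name, e.g. "usnic" */
    unsigned    caps;  /* bitmask of SKETCH_CAP_* values */
};

/* usnic BTL model: insist on one specific provider, selected by name. */
const struct sketch_provider *
sketch_find_by_name(const struct sketch_provider *list, size_t n,
                    const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i].name, name) == 0)
            return &list[i];
    }
    return NULL;
}

/* ofi MTL model: accept the first provider that offers every requested
 * capability, whatever its name. */
const struct sketch_provider *
sketch_find_by_caps(const struct sketch_provider *list, size_t n,
                    unsigned caps)
{
    for (size_t i = 0; i < n; i++) {
        if ((list[i].caps & caps) == caps)
            return &list[i];
    }
    return NULL;
}
```

Selecting by name lets the caller rely on provider-specific extensions and error messages; selecting by capability keeps the caller portable across any provider that meets the requirement.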

Page 38: Cisco's journey from Verbs to Libfabric

Both libfabric usage models (usnic BTL and ofi MTL) co-exist (and play well with each other) inside a single MPI implementation.

Proof positive of successful co-design of libfabric and MPI implementations.


Page 40: Cisco's journey from Verbs to Libfabric

•  Libfabric is the Way Forward for Cisco
   Open community
   Matches our hardware
   Performance benefits
   Feature benefits
•  Libfabric matches MPI
   Has features MPI has been asking for… for years
   Optimistic about its future (come join us!)

http://libfabric.org

Page 41: Cisco's journey from Verbs to Libfabric


Thank you.