
Page 1

Evolution of the netmap architecture

Luigi Rizzo, Università di Pisa
http://info.iet.unipi.it/~luigi/vale/

These slides at http://info.iet.unipi.it/~luigi/netmap/talk-coseners.html

Starting point

Packet I/O in commodity OSes is not fast enough:

- typically ~1 Mpps/core, vs. 14.88 Mpps on a 10 Gbit/s link
- insufficient for software packet processing nodes
- several ad-hoc or proprietary solutions exist (Click-based, PacketShader, DPDK)

We tried to come up with a more generic solution. Ideally:

- as fast as (or faster than) the others
- device-independent, OS-independent
- developer-friendly

Netmap (2011):

- 10..40 times faster than raw sockets, PF_PACKET, bpf
- 14.88 Mpps with 1 core at 900 MHz
- pcap library on top of netmap
- as of 2012, also available on Linux

Page 2

Next, get rid of the hardware

Use netmap's key ideas for virtual switches:

- simplify experiments with high-speed network applications
- useful to interconnect VMs

VALE (2012):

- a Virtual Local Ethernet implementing an Ethernet learning bridge
- up to 20 Mpps (64-byte frames) or 70 Gbit/s (1500-byte frames)

And then: how fast is networking in VMs?

Focus so far has mostly been on bulk TCP:

- emulation of network devices is very poor
- paravirtualized devices (vendor-specific) are reasonably fast for TCP
- 20-30 Gbit/s using tricks (good ones)

We want to deal with high packet rates, too:

- SDN must be implemented, not just Defined
- some applications have high pps requirements

Fast QEMU (2013):

- accelerated network I/O path in QEMU
- paravirtualized e1000
- guest-guest rates on top of e1000: >1 Mpps with sockets, 5 Mpps with netmap

More at http://info.iet.unipi.it/~luigi/papers/20130520-rizzo-vm.pdf

Page 3

Availability

- netmap and VALE are in standard FreeBSD distributions (HEAD and stable/9)
- the same code also runs on Linux as an add-on module
- support for multiple NICs (Intel, Realtek, Mellanox, Nvidia)
- QEMU enhancements submitted to QEMU-dev

Netmap design principles

Design principles:

- no requirement/reliance on special hardware features
- amortize costs over large batches (syscalls)
- remove unnecessary work (copies, mbuf alloc/free)
- reduce runtime decisions (a single frame format)
- modifying device drivers is permitted, as long as the code can be maintained

Page 4

Netmap data structures and API

Access:

- open("/dev/netmap")
- ioctl(fd, NIOCREG, arg) --> disconnects the datapath from the OS
- mmap(..., fd, 0)

Transmit:

- fill buffers, update the netmap ring
- ioctl(fd, NIOCTXSYNC) queues the packets

Receive:

- ioctl(fd, NIOCRXSYNC) reports newly received packets
- process buffers, update the netmap ring

poll() and select() are used for synchronization; POLLIN and POLLOUT select the rings to monitor. A minimal usage sketch follows.
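To make the sequence concrete, here is a minimal transmit loop in C. It is a sketch assuming the circa-2012 API from <net/netmap_user.h> (struct nmreq, NIOCREGIF, ring->cur/ring->avail); exact field and macro names vary across netmap versions, the NIC name "em0" is just an example, and error checks are omitted.

/* Minimal netmap TX loop: open, register, mmap, then batch-fill the
 * ring and issue one NIOCTXSYNC per batch. Assumes the circa-2012
 * netmap API; names vary by version. Error checks omitted. */
#include <fcntl.h>
#include <string.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

int main(void)
{
    struct nmreq req;
    int fd = open("/dev/netmap", O_RDWR);

    memset(&req, 0, sizeof(req));
    strncpy(req.nr_name, "em0", sizeof(req.nr_name)); /* NIC to take over */
    ioctl(fd, NIOCREGIF, &req);      /* disconnect the datapath from the OS */

    char *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
    struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };

        poll(&pfd, 1, -1);           /* wait until the ring has free slots */
        while (ring->avail > 0) {    /* fill buffers, update the ring */
            struct netmap_slot *slot = &ring->slot[ring->cur];
            char *buf = NETMAP_BUF(ring, slot->buf_idx);

            memset(buf, 0, 60);      /* build the real frame here */
            slot->len = 60;
            ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
            ring->avail--;
        }
        ioctl(fd, NIOCTXSYNC, NULL); /* queue the packets for transmission */
    }
}

The receive side is symmetric: poll for POLLIN, then NIOCRXSYNC and walk the RX ring. The syscall-per-batch pattern is where the amortization from the design principles slide comes from.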

Netmap performance

- 65 cycles/packet between the wire and userspace (14.88 Mpps at 900 MHz)
- good scalability with CPU frequency and number of cores
- ported several apps (OVS, Click, ipfw, ...) with a 5-10x speedup
- netmap exposes bottlenecks in applications
- some kernel functions may now run more efficiently in userspace than in the kernel

Page 5

VALE, a high-performance Virtual Local Ethernet

- NIOCREG on "valeX:Y" dynamically creates switch instances (X) and ports (Y)
- same API as physical netmap ports, but separate memory regions
- each switch runs the learning bridge algorithm (and now, also OVS)
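Because a VALE port speaks the same API as a physical port, attaching to one is essentially a one-line change from the sketch above; again a sketch assuming the circa-2012 struct nmreq/NIOCREGIF names:

/* Open port 1 on VALE switch 0; both the switch instance and the port
 * are created on demand by the register call. Same caveat as before:
 * exact nmreq fields vary by netmap version. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

int open_vale_port(void)
{
    struct nmreq req;
    int fd = open("/dev/netmap", O_RDWR);

    memset(&req, 0, sizeof(req));
    strncpy(req.nr_name, "vale0:1", sizeof(req.nr_name));
    ioctl(fd, NIOCREGIF, &req);  /* instantiates switch 0 and port 1 */
    return fd;                   /* then mmap() and use the rings as before */
}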

Operation is sender-driven:

- each incoming packet is dispatched to the correct destination(s)
- writes are non-blocking; packets are dropped when the destination queue is full
- reads are blocking
- the cost of forwarding/copying is charged to the sender

VALE performance

Multi-stage forwarding amortizes locking and cache miss costs:

1. fetch a batch of packets, prefetch payloads
2. compute the destination(s) of each packet in the batch
3. forward traffic iterating on the output interfaces
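An illustrative sketch of that three-stage loop (the helpers lookup_dst, lock_port, copy_to_port are hypothetical stand-ins, not the in-kernel code); the point is that each lookup miss and each output lock is paid once per batch rather than once per packet:

#include <stdint.h>

struct pkt { char *buf; int len; uint64_t dst; /* destination port bitmap */ };

/* hypothetical helpers standing in for the real kernel machinery */
uint64_t lookup_dst(const char *buf, int len);
void lock_port(int p);
void unlock_port(int p);
void copy_to_port(int p, const char *buf, int len);

void forward_batch(struct pkt *p, int n, int nports)
{
    int i, port;

    for (i = 0; i < n; i++)             /* 1. prefetch the payloads      */
        __builtin_prefetch(p[i].buf);

    for (i = 0; i < n; i++)             /* 2. one lookup per packet      */
        p[i].dst = lookup_dst(p[i].buf, p[i].len);

    for (port = 0; port < nports; port++) {  /* 3. per-interface pass    */
        lock_port(port);                /* one lock per port per batch   */
        for (i = 0; i < n; i++)
            if (p[i].dst & (1ULL << port))
                copy_to_port(port, p[i].buf, p[i].len);
        unlock_port(port);
    }
}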

Page 6

VALE performance

(figure: forwarding rate for three backends)
- VALE: our software switch
- NIC: switch/NIC-supported forwarding
- TAP: Linux/BSD bridge, in-kernel OVS

VALE achieves 18-20 Mpps / 70 Gbit/s.

Virtual machine network performance

Emulation of network peripherals is historically slow, due to the following:

- MMIO is ~100 times more expensive than on bare metal, due to VM exits
- interrupts can be expensive
- VMM/host/switch performance

Paravirtualized peripherals (virtio, vmxnet, xenfront) address only one part of the problem:

- the VMM, host, and switch can still be limiting

Page 7

Our work

A set of host, guest, and VMM modifications that can be used depending on operational constraints:

1. proper interrupt moderation (VMM only)
2. send combining (guest only)
3. e1000 paravirtualization (host-guest)
   (btw, FreeBSD's DEVICE_POLLING (Rizzo 2001) is almost as good as paravirtualization!)
4. use VALE instead of the Linux bridge (host, VMM)
5. clean up the backend-frontend datapath (VMM)

Send combining

Send combining reduces VM exits on TX:

- when there is a pending interrupt, postpone transmissions until the next interrupt arrives
- especially useful when combined with moderation
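A sketch of the idea as it might look in the guest driver (queue_descriptor and write_tail_register are hypothetical stand-ins; the real change is the ~100-line driver patch mentioned below):

/* Send combining: defer the MMIO tail write (which costs a VM exit)
 * while an interrupt is pending, then flush the whole deferred batch
 * from the interrupt handler with a single tail write. */
void queue_descriptor(const void *frame, int len);  /* fill a TX slot        */
void write_tail_register(void);                     /* MMIO: causes a VM exit */

static int irq_pending;   /* a NIC interrupt was requested, not yet seen */
static int tx_deferred;   /* descriptors queued without a tail write    */

void transmit(const void *frame, int len)
{
    queue_descriptor(frame, len);
    if (irq_pending) {
        tx_deferred++;            /* postpone the expensive exit */
    } else {
        write_tail_register();    /* first packet: kick the NIC  */
        irq_pending = 1;
    }
}

void tx_interrupt(void)
{
    irq_pending = 0;
    if (tx_deferred) {            /* flush everything deferred   */
        write_tail_register();    /* with one tail write         */
        tx_deferred = 0;
        irq_pending = 1;
    }
}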

IMPLEMENTATION: ~100 lines, only in the guest device driver

PERFORMANCE: TX rate, Kpps (guest-guest, TAP backend)

Function      1 VCPU       2 VCPU
---------------------------------
moderation    24 -> 90     65 -> 87
send comb.    24 -> 23     65 -> 301
both          24 -> 322    65 -> 334

Page 8

e1000 paravirtualization

Reduce VM exits using a shared-memory mailbox (CSB) between the VCPU and the I/O thread:

- the first I/O or interrupt acts as a "kick" to wake up the process on the other side;
- afterwards, the guest driver and the I/O thread chase each other, exchanging read/write positions through the CSB;
- data buffers and descriptors are in shared memory (guest physical, host virtual).

IMPLEMENTATION:

- no need to create a brand-new device; an easy add-on for any modern NIC
- ~100 lines in the guest driver, ~100 lines in the frontend
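The slides do not show the CSB layout; a plausible shape, with purely illustrative field names (not the actual QEMU patch), is a pair of producer/consumer positions plus kick flags:

#include <stdint.h>

/* Illustrative CSB: one direction the guest writes and the host reads,
 * one for the reverse. Each side polls the peer's position and only
 * kicks (MMIO write or interrupt) when the peer has gone to sleep. */
struct csb {
    /* guest writes, host I/O thread reads */
    volatile uint32_t guest_tdt;        /* last TX descriptor posted          */
    volatile uint32_t host_need_kick;   /* host is idle: one MMIO kick needed */
    /* host writes, guest driver reads */
    volatile uint32_t host_tdh;         /* last TX descriptor consumed        */
    volatile uint32_t guest_need_irq;   /* guest is idle: one interrupt needed */
};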

e1000 paravirtualization performance

- kicks and interrupts are almost completely removed as the packet rate increases
- 24 -> 492 Kpps (1 VCPU), 65 -> 507 Kpps (2 VCPU)
- performance equivalent to that of virtio
- does not incur the latency of interrupt moderation
- reliability/portability/stability advantages

Page 9

Cleaning up the hypervisor datapath

Data moves from frontend -> backend -> switch:

- access guest descriptors and buffers
- copy to hypervisor memory (backend)
- pass to the switch

Initial throughput was limited to ~3 Mpps even before going through the switch, so we:

- removed unnecessary address translations
- optimized the copy routine

The current code can drive the switch at 10 Mpps.

Guest-guest, TAP, sockets (Kpps):

         normal         send_combining   paravirt.
itr      1 cpu  2 cpu   1 cpu  2 cpu     1 cpu  2 cpu
-----------------------------------------------------
   0       24     65      23    301        492    507
   1       22     68      23    303
 100       80     79     322    334
1000       90     87     293    323

Page 10

Guest-guest, VALE, sockets (Kpps; for each itr value, the first row is the TAP backend, the second is VALE):

              normal         send_combining   paravirt.
itr           1 cpu  2 cpu   1 cpu  2 cpu     1 cpu  2 cpu
----------------------------------------------------------
   0  TAP       24     65      23    301        492    507
      VALE      27    112      27    650       1080   1080
   1  TAP       22     68      23    303
      VALE      25     97      24    670
 100  TAP       80     79     322    334
      VALE     129    125     850    860
1000  TAP       90     87     293    323
      VALE     147    140     960    930

~500 Kpps with 1500-byte frames; ~5 Mpps or 25 Gbit/s with netmap within the guest.

Evolution of the netmap architecture

After this experience, we decided to add features that proved useful, while trying to avoid feature bloat and additions that impact performance:

- "transparent mode" for netmap: packets not marked by the user application are passed to the other side
- direct connection between VALE and the NIC (acting as a real host bridge)
- programmable destination-lookup function in VALE (now we can hook into OVS; see the sketch below)
- indirect buffers and scatter-gather I/O in netmap (to optimize bulk I/O)

Page 11

Acknowledgements

Many contributions come from colleagues and former students:

Matteo Landi, Gaetano Catalli, Marta Carbone, Giuseppe Lettieri, Vincenzo Maffione, Michio Honda

Funding from EU projects (CHANGE, OPENLAB) and companies (NetApp, Google).

http://info.iet.unipi.it/~luigi/vale/
[email protected]