Linux Multiqueue Networking
David S. Miller, Red Hat Inc.
Netfilter Workshop, ESIEA, Paris, 2008

Outline: Basic Premise, Receive Side Multiqueue, Transmit Side Multiqueue, Existing Design, New Design, Implementation Notes

NETWORKING PACKETS AS CPU WORK

Each packet generates some CPU work
  Both at protocol and application level
Faster interfaces can generate more work
Sophisticated networking can generate more work
  Firewalling lookups, route lookups, etc.

CPU EVOLUTION

Multiprocessor systems used to be uncommon
This is changing... significantly
Common systems will soon have 64 cpu threads, or more
Lots of CPU processing resources
The trouble is finding ways to use them

END TO END PRINCIPLE

Basic premise: Communication protocol operations should be defined to occur at the end-points of a communications system, or as close as possible to the resource being controlled
The network is dumb
The end nodes are smart
Computational complexity is pushed as far to the edges as possible

END-TO-END APPLIED TO “SYSTEM”

The end-to-end principle extends all the way into your computer
Each cpu in your system is an end node
The “network” pushes work to your system
Internally, work should be pushed down further to the execution resources
With multi-threaded processors like Niagara, this is more important to get right than ever before

NETWORK DEVICE

Devices must provide facilities to balance the load of incoming network work
But they must do so without creating new problems (such as reordering)
Packet reordering looks like packet loss to TCP
Independent “work” should call upon independent cpus to process it
PCI MSI-X interrupts
Multiqueue network devices, with classification facilities

SOFTWARE SOLUTIONS

Not everyone will have multiqueue capable network devices or systems.
Receive balancing is possible in software
Older systems can have their usefulness extended by such facilities

NAPI, “NEW API”

Basic abstraction for packet reception under Linux, providing software interrupt mitigation and load balancing.
Device signals interrupt when receive packets are available.
Driver disables chip interrupts, and NAPI context is queued up for packet processing
Kernel generic networking processes NAPI context in software interrupt context, asking for packets from the driver one at a time
When no more receive packets are pending, the NAPI context is dequeued and the driver re-enables chip interrupts
Under high load with a weak cpu, we may process thousands of packets per hardware interrupt
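The receive cycle above can be sketched as a small userspace model. This is not kernel code: `NapiDevice`, `packet_arrives`, `run_softirq` and the per-poll `weight` limit are hypothetical names and assumptions chosen for illustration. The model only shows how masking the interrupt while polling lets many packets share one hardware interrupt.

```python
from collections import deque

class NapiDevice:
    """Toy model of one device with one NAPI context (all names hypothetical)."""

    def __init__(self, weight=64):
        self.rx_ring = deque()    # packets waiting in the device's receive ring
        self.irq_enabled = True   # chip interrupt state
        self.hard_irqs = 0        # count of hardware interrupts taken
        self.delivered = []       # packets handed to the stack, in order
        self.weight = weight      # assumed cap on packets per poll call

    def packet_arrives(self, pkt, poll_list):
        self.rx_ring.append(pkt)
        if self.irq_enabled:
            # Hardware interrupt: mask further interrupts and queue the
            # context for polling in software interrupt context.
            self.hard_irqs += 1
            self.irq_enabled = False
            poll_list.append(self)

    def poll(self):
        """Pull up to `weight` packets from the ring; return the count."""
        done = 0
        while self.rx_ring and done < self.weight:
            self.delivered.append(self.rx_ring.popleft())
            done += 1
        return done

def run_softirq(poll_list):
    """Poll scheduled contexts until none has packets pending."""
    while poll_list:
        dev = poll_list.popleft()
        dev.poll()
        if dev.rx_ring:
            poll_list.append(dev)   # more work pending: stay scheduled
        else:
            dev.irq_enabled = True  # ring drained: re-enable chip interrupts
```

In this model, a burst of 1000 packets that arrives before the software interrupt runs costs a single hardware interrupt; the next interrupt is taken only after the ring has been drained.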

NAPI LIMITATIONS

It was not ready for multi-queue network devices
One NAPI context exists per network device
It assumes a network device only generates one interrupt source which must be disabled during a poll
Cannot handle cleanly the case of a one to many relationship between network devices and packet event sources

FIXING NAPI

Implemented by Stephen Hemminger
Remove NAPI context from network device structure
Let device driver define NAPI instances however it likes in its private data structures
Simple devices will define only one NAPI context
Multiqueue devices will define a NAPI context for each RX queue
All NAPI instances for a device execute independently, no shared locking, no mutual exclusion
Independent cpus process independent NAPI contexts in parallel
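A minimal sketch of the ownership change, assuming a hypothetical driver-private structure (`DriverPrivate`) that holds one `NapiContext` per RX queue. Neither class mirrors a real kernel structure; the point is only that each context carries its own ring and interrupt state, so servicing one queue never touches another.

```python
from collections import deque

class NapiContext:
    """One pollable RX event source (hypothetical model)."""

    def __init__(self, queue_index):
        self.queue_index = queue_index
        self.ring = deque()       # this queue's pending packets
        self.irq_enabled = True   # this queue's interrupt vector state
        self.scheduled = False    # queued for polling?

    def packet_arrives(self, pkt):
        self.ring.append(pkt)
        if self.irq_enabled:
            self.irq_enabled = False  # mask only this queue's interrupt
            self.scheduled = True

    def poll(self):
        """Drain this queue only, then unmask its interrupt."""
        pkts = list(self.ring)
        self.ring.clear()
        self.scheduled = False
        self.irq_enabled = True
        return pkts

class DriverPrivate:
    """Driver-private state: the driver decides how many contexts exist."""

    def __init__(self, num_rx_queues):
        self.napi = [NapiContext(i) for i in range(num_rx_queues)]
```

A simple device would build `DriverPrivate(1)`; a multiqueue device would pass its RX queue count. Because no state is shared between contexts, nothing in the model requires a lock across queues.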

OK, WHAT ABOUT NON-MULTIQUEUE HARDWARE?

Remote softirq block I/O layer changes by Jens Axboe
Provides a generic facility to execute software interrupt work on remote processors
We can use this to simulate network devices with hardware multiqueue facilities
netif_receive_skb() is changed to hash the flow of a packet and use this to choose a remote cpu
We push the packet input processing to the selected cpu
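The software steering step can be modelled in userspace as below. This is a sketch under stated assumptions: `SoftwareRxSteering`, the per-cpu `backlog` queues, the `softirq_pending` set and the CRC32-based `rx_flow_hash` are all hypothetical and stand in for whatever the kernel implementation actually uses. What it shows is that hashing on the flow keeps every packet of one flow on one cpu, in arrival order.

```python
import zlib
from collections import deque

def rx_flow_hash(pkt):
    """Hash the flow identity (addresses and ports); CRC32 is an assumption."""
    key = "{}|{}|{}|{}".format(
        pkt["saddr"], pkt["daddr"], pkt["sport"], pkt["dport"])
    return zlib.crc32(key.encode())

class SoftwareRxSteering:
    """Toy model of pushing receive processing to a remote cpu."""

    def __init__(self, cpus):
        self.cpus = list(cpus)
        self.backlog = {cpu: deque() for cpu in self.cpus}
        self.softirq_pending = set()  # cpus with remote work scheduled

    def receive(self, pkt):
        """Choose a cpu from the flow hash and queue the packet there."""
        cpu = self.cpus[rx_flow_hash(pkt) % len(self.cpus)]
        self.backlog[cpu].append(pkt)
        self.softirq_pending.add(cpu)
        return cpu

    def run_softirq(self, cpu):
        """Process everything queued for one cpu, in arrival order."""
        pkts = list(self.backlog[cpu])
        self.backlog[cpu].clear()
        self.softirq_pending.discard(cpu)
        return pkts
```

Since the cpu is a pure function of the flow tuple, two packets of the same TCP connection are never processed concurrently on different cpus, which avoids the reordering problem raised earlier in the talk.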

CONFIGURABILITY

The default distribution works well for most people.
Some people have special needs or want to control how work is distributed on their system
Configurability of hardware facilities is variable
  Some provide lots of control, some provide very little
We can fully control the software implementation

IMPETUS

Writing NIU driver
Peter W’s multiqueue hacks
TX multiqueue will be pervasive.
Robert Olsson’s pending paper on 10GB routing (found TX locking to be bottleneck in several situations)

PRESENTATIONS

Tokyo 2008
Berlin 2008
Seattle 2008
Portland 2008
Implementation plan changing constantly
End result was different than all of these designs
Full implementation will be in 2.6.27

PICTURE OF TX ENGINE

[Figure: existing TX path. Labels in order: dev_queue_xmit() -> QDISC, dev->queue_lock, hard_start_xmit, TX lock, set SKB queue mapping, DRIVER, three TXQs.]

DETAILS TO NOTE

Generic networking has no idea about multiple TX queues
Parallelization impossible because of single queue and TX lock
Only single root QDISC is possible, sharing is not allowed

PICTURE OF DEFAULT CONFIGURATION

[Figure: default configuration. dev_queue_xmit fans out to three TXQs; each TXQ has its own qdisc->q.lock and its own TX lock, all leading to the driver.]
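To make the locking difference between the existing design and the default new configuration concrete, here is a minimal userspace sketch. It is an illustration only: `SingleLockDevice`, `PerQueueLockDevice` and `can_transmit_concurrently` are hypothetical names, and the qdisc lock and TX lock of each queue are collapsed into one `threading.Lock` for brevity.

```python
import threading

class SingleLockDevice:
    """Existing design: one lock sits in front of every TX queue."""

    def __init__(self, num_queues):
        self.num_queues = num_queues
        self.lock = threading.Lock()

    def lock_for(self, queue_index):
        return self.lock

class PerQueueLockDevice:
    """Default new configuration: each TX queue carries its own lock."""

    def __init__(self, num_queues):
        self.locks = [threading.Lock() for _ in range(num_queues)]

    def lock_for(self, queue_index):
        return self.locks[queue_index]

def can_transmit_concurrently(dev, q_a, q_b):
    """True if a sender on q_b could proceed while another holds q_a."""
    with dev.lock_for(q_a):
        other = dev.lock_for(q_b)
        got = other.acquire(blocking=False)  # non-blocking probe
        if got:
            other.release()
        return got
```

With a single lock the probe fails for any pair of queues, which is the serialization described for the existing design. With per-queue locks it succeeds whenever the two senders target different queues.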

PICTURE WITH NON-TRIVIAL QDISC

[Figure: non-trivial qdisc. A single qdisc->q.lock is shown for three TXQs, each with its own TX lock; SKBs pass through to the driver.]

THE NETWORK DEVICE QUEUE

Direction agnostic, can represent both RX and TX
Holds:
  Backpointer to struct net_device
  Pointer to ROOT qdisc for this queue, and sleeping qdisc
  TX lock for ->hard_start_xmit() synchronization
  Flow control state bit (XOFF)
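The fields listed above can be pictured with a small data-structure sketch. It is a userspace model, not the kernel definition: the class and field names (`NetdevQueue`, `xmit_lock`, `xoff`, `alloc_device`) are assumptions made for illustration, and the comments on `qdisc` versus `qdisc_sleeping` reflect one reading of the slide.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class Qdisc:
    kind: str

@dataclass(eq=False)
class NetDevice:
    name: str
    tx_queues: list = field(default_factory=list)

@dataclass(eq=False)
class NetdevQueue:
    dev: NetDevice                          # backpointer to the owning device
    qdisc: Optional[Qdisc] = None           # root qdisc for this queue
    qdisc_sleeping: Optional[Qdisc] = None  # the "sleeping" (configured) root
    # Serializes calls into the driver's transmit routine for this queue.
    xmit_lock: threading.Lock = field(default_factory=threading.Lock)
    xoff: bool = False                      # flow control bit: True = stopped

    def stop(self):
        self.xoff = True

    def wake(self):
        self.xoff = False

def alloc_device(name, num_tx_queues):
    """Build a device whose queues each point back at it."""
    dev = NetDevice(name)
    dev.tx_queues = [NetdevQueue(dev) for _ in range(num_tx_queues)]
    return dev
```

Because the lock and the XOFF bit live in the queue rather than in the device, stopping one queue in this model leaves the others free to transmit.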

QDISC CHANGES

Holds state bits for qdisc_run() scheduling
Has backpointer to referring dev_queue
Therefore, current configured QDISC root is always qdisc->dev_queue->qdisc_sleeping
qdisc->q.lock on root is now used for tree synchronization
qdisc->list of root qdisc replaces netdev->qdisc_list
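The backpointer relationship can be sketched as follows. This is a simplified userspace model with hypothetical helpers (`attach_root`, `add_child`, `qdisc_root`, `tree_lock`); `lock` stands in for qdisc->q.lock and `qdisc_list` for the per-root list. It shows how any qdisc in a tree reaches its root, and therefore the tree lock, through the queue it refers to.

```python
import threading

class NetdevQueue:
    def __init__(self):
        self.qdisc_sleeping = None  # configured root qdisc for this queue

class Qdisc:
    def __init__(self, name, dev_queue):
        self.name = name
        self.dev_queue = dev_queue    # backpointer to the referring queue
        self.lock = threading.Lock()  # stands in for qdisc->q.lock
        self.qdisc_list = []          # on a root: the qdiscs in its tree

def qdisc_root(qdisc):
    """Reach the configured root through the queue backpointer."""
    return qdisc.dev_queue.qdisc_sleeping

def tree_lock(qdisc):
    """The root's lock serializes changes to the whole tree."""
    return qdisc_root(qdisc).lock

def attach_root(dev_queue, name):
    root = Qdisc(name, dev_queue)
    dev_queue.qdisc_sleeping = root
    return root

def add_child(parent, name):
    """Create a qdisc in the parent's tree and record it on the root."""
    child = Qdisc(name, parent.dev_queue)
    qdisc_root(parent).qdisc_list.append(child)
    return child
```

In this model a leaf several levels down needs no walk up the tree to find the lock to take: two pointer hops from the leaf give the root.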

TX QUEUE SELECTION

Simple hash function over the packet FLOW information (source and destination address, source and destination port)
Driver or driver layer is allowed to override this queue selection function if the queues of the device are used for something other than flow separation or prioritization
  Example: Wireless
In the future, this function will disappear.
To be replaced with socket hash marking of transmit packets, and RX side hash recording of received packets for forwarding and bridging
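A minimal sketch of the selection logic, assuming a dictionary stands in for the SKB. The function names (`pick_tx_queue`, `default_select_queue`, `wireless_select_queue`), the CRC32 hash and the priority-based override are all hypothetical; the kernel's actual hash and hook differ. The sketch shows the two paths: a default flow hash, and a driver-supplied override that ignores the flow.

```python
import zlib

def flow_hash(skb):
    """Hash source/destination address and port; CRC32 is an assumption."""
    key = "{}|{}|{}|{}".format(
        skb["saddr"], skb["daddr"], skb["sport"], skb["dport"])
    return zlib.crc32(key.encode())

def default_select_queue(skb, num_tx_queues):
    """Default policy: the same flow always maps to the same TX queue."""
    return flow_hash(skb) % num_tx_queues

def wireless_select_queue(skb, num_tx_queues):
    """Hypothetical override: queues represent priorities, not flows."""
    return min(skb.get("priority", 0), num_tx_queues - 1)

def pick_tx_queue(skb, num_tx_queues, driver_select_queue=None):
    """Use the driver's override when present, else the flow hash."""
    if driver_select_queue is not None:
        index = driver_select_queue(skb, num_tx_queues)
    else:
        index = default_select_queue(skb, num_tx_queues)
    skb["queue_mapping"] = index  # record the choice on the packet
    return index
```

Keeping the override as an optional callback mirrors the slide's point that most devices want flow separation, while a few (wireless being the example given) assign a different meaning to their queues.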

IMPLEMENTATION AND SEMANTIC ISSUES

Synchronization is difficult.
People want to use queues for other tasks such as priority scheduling; this will be supported by the sch_multiq.c module from Intel in 2.6.28
TX multiqueue interactions with the packet scheduler and its semantics are still not entirely clear
Most of the problems are about flow control