Upload
networksguy
View
1.728
Download
1
Tags:
Embed Size (px)
Citation preview
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
Linux Multiqueue Networking
David S. Miller
Red Hat Inc.
Netfilter Workshop, ESIEA, Paris, 2008
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
NETWORKING PACKETS AS CPU WORK
Each packet generates some CPU workBoth at protocol and application levelFaster interfaces can generate more workSophisticated networking can generate more workFirewalling lookups, route lookups, etc.
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
CPU EVOLUTION
Multiprocessor systems used to be uncommonThis is changing... significantlyCommon systems will soon have 64 cpu threads, ormoreLots of CPU processing resourcesThe trouble is finding ways to use them
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
END TO END PRINCIPLE
Basic premise: Communication protocol operationsshould be defined to occur at the end-points of acommunications system, or as close as possible to theresource being controlledThe network is dumbThe end nodes are smartComputational complexity is pushed as far to the edgesas possible
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
END-TO-END APPLIED TO “SYSTEM”
The end-to-end principle extends all the way into yourcomputerEach cpu in your system is an end nodeThe “network” pushes work to your systemInternally, work should be pushed down further to theexecution resourcesWith multi-threaded processors like Niagara, this ismore important to get right than ever before
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
NETWORK DEVICE
Devices must provide facilities to balance the load ofincoming network workBut they must do so without creating new problems(such as reordering)Packet reordering looks like packet loss to TCPIndependant “work” should call upon independant cpusto process itPCI MSI-X interruptsMultiqueue network devices, with classification facilities
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
SOFTWARE SOLUTIONS
Not everyone will have multiqueue capable networkdevices or systems.Receive balancing is possible in softwareOlder systems can have their usefulness extended bysuch facilities
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
NAPI, “NEW API”
Basic abstraction for packet reception under Linux,providing software interrupt mitigation and loadbalancing.Device signals interrupt when receive packets areavailable.Driver disables chip interrupts, and NAPI context isqueued up for packet processingKernel generic networking processes NAPI context insoftware interrupt context, asking for packets from thedriver one at a timeWhen no more receive packets are pending, the NAPIcontext is dequeued and the driver re-enabled chipinterruptsUnder high load with a weak cpu, we may processthousands of packets per hardware interrupt
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
NAPI LIMITATIONS
It was not ready for multi-queue network devicesOne NAPI context exists per network deviceIt assumes a network device only generates oneinterrupt source which must be disabled during a pollCannot handle cleanly the case of a one to manyrelationship between network devices and packet eventsources
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
FIXING NAPI
Implemented by Stephen HemmingerRemove NAPI context from network device structureLet device driver define NAPI instances however it likesin it’s private data structuresSimple devices will define only one NAPI contextMultiqueue devices will define a NAPI context for eachRX queueAll NAPI instances for a device execute independantly,no shared locking, no mutual exclusionIndependant cpus process independant NAPI contextsin parallel
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
OK, WHAT ABOUT NON-MULTIQUEUE
HARDWARE?
Remote softirq block I/O layer changes by Jens AxboeProvides a generic facility to execute software interruptwork on remote processorsWe can use this to simulate network devices withhardware multiqueue facilitiesnetif_receive_skb() is changed to hash the flow of apacket and use this to choose a remote cpuWe push the packet input processing to the selectedcpu
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
CONFIGURABILITY
The default distribution works well for most people.Some people have special needs or want to controlhow work is distributed on their systemConfigurability of hardware facilities is variableSome provide lots of control, some provide very littleWe can fully control the software implementation
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
IMPETUS
Writing NIU driverPeter W’s multiqueue hacksTX multiqueue will be pervasive.Robert Olsson’s pending paper on 10GB routing (foundTX locking to be bottleneck in several situations)
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
PRESENTATIONS
Tokyo 2008Berlin 2008Seattle 2008Portland 2008Implementation plan changing constantlyEnd result was different than all of these designsFull implementation will be in 2.6.27
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
PICTURE OF TX ENGINE
QDISCdev_queue_xmit() ->
dev->queue_lock
hard_start_xmit
TX lock
set SKB queue mapping
DRIVER
TXQ TXQ TXQ
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
DETAILS TO NOTE
Generic networking has no idea about multiple TXqueuesParallelization impossible because of single queue andTX lockOnly single root QDISC is possible, sharing is notallowed
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
PICTURE OF DEFAULT CONFIGURATION
dev_queue_xmit
TXQ
TXQ
TXQ
qdisc->q.lock
qdisc->q.lock
qdisc->q.lock
TX lock
TX lock
TX lock
driver
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
PICTURE WITH NON-TRIVIAL QDISC
TXQ
TXQ
TXQ
qdisc->q.lock
TX lock
TX lock
TX lock
SKB
skb
driverSKB
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
THE NETWORK DEVICE QUEUE
Direction agnostic, can represent both RX and TXHolds:
Backpointer to struct net_devicePointer to ROOT qdisc for this queue, and sleepingqdiscTX lock for ->hard_start_xmit() synchronizationFlow control state bit (XOFF)
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
QDISC CHANGES
Holds state bits for qdisc_run() schedulingHas backpointer to referring dev_queueTherefore, current configured QDISC root is alwaysqdisc->dev_queue->qdisc_sleepingqdisc->q.lock on root is now used for treesynchronizationqdisc->list of root qdisc replaces netdev->qdisc_list
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
TX QUEUE SELECTION
Simple hash function over the packet FLOWinformation (source and destination address, sourceand destination port)Driver or driver layer is allowed to override this queueselection function if the queues of the device are usedfor something other than flow seperation or prioritizationExample: WirelessIn the future, this function will disappear.To be replaced with socket hash marking of transmitpackets, and RX side hash recording of receivedpackets for forwarding and bridging
LinuxMultiqueueNetworking
DavidS. Miller
BasicPremise
Receive SideMultiqueue
TransmitSideMultiqueue
ExistingDesign
New Design
ImplementationNotes
IMPLEMENTATION AND SEMANTIC ISSUES
Synchronization is difficult.People want to use queues for other tasks such aspriority scheduling, will be supported by sch_multiq.cmodule from Intel in 2.6.28TX multiqueue interactions with the packet schedulerand it’s semantics are still not entirely clearMost of the problems are about flow control