kernel-tlv
Fun with Network Interfaces
Shmulik Ladkani, March 2016
This work is licensed under a Creative Commons Attribution 4.0 International License.
On the Menu
● Linux network stack, a (quick) intro
○ What’s this net_device anyway?
○ Programming interfaces
○ Frame reception and transmission
● Logical network interfaces
○ What?
○ Why?
○ Examples
○ Examples
○ Examples
Agenda
● Goals
○ Strengthen foundations
○ Explain interaction of main network stack components
○ Familiarize with building blocks of virtual networks
○ Ease further research
● Non Goals
○ Mastering device driver programming
○ How network gear operates in detail
○ Specific component deep dive
Network Stack Layers
● L7: usermode app (socket api)
● L4: tcp, udp, icmp, igmp, gre, ipip, ...
● L3: ipv4, arp, ipv6, pppoe, ...
● L2: network core
● Device-specific L2: device drivers
● Generic functionalities of a network device
● RX
○ Processing of incoming frames
○ Delivery to upper protocols
● TX
○ Queuing
○ Final processing
○ Hand-over to driver’s transmit method
Struct net_device
● Represents a network interface
● One for each network device in the system
○ Either a physical device or a logical (software) one
Struct net_device: Common properties
● Identified by a ‘name’ and ‘ifindex’
○ Unique within a network namespace
● Has BSD-like ‘flags’
IFF_UP, IFF_LOOPBACK, IFF_POINTOPOINT, IFF_NOARP, IFF_PROMISC...
● Has ‘features’
NETIF_F_SG_BIT, NETIF_F_HW_CSUM_BIT, NETIF_F_GSO_BIT,
NETIF_F_GRO_BIT, NETIF_F_LRO_BIT, NETIF_F_RXHASH_BIT,
NETIF_F_RXCSUM_BIT...
● Has many other fields...
● Holds associated device operations
const struct net_device_ops *netdev_ops;
Struct net_device_ops
● Interface. Defines all device methods
○ Driver implements
● E.g. e1000e_netdev_ops, bcmgenet_netdev_ops …
○ Network-core uses
● Fat interface…
○ 44 methods in v3.4
○ 59 methods in v3.14
○ 68 methods in v4.4
○ A few methods are #ifdef protected
○ Some are optional
Struct net_device_ops: Common methods
● ndo_open()
○ Upon device transition to UP state
● ndo_stop()
○ Upon device transition to DOWN state
● ndo_start_xmit()
○ When a packet needs to be transmitted
● ndo_set_features()
○ Update device configuration to new features
● ndo_get_stats()
○ Get device usage statistics
● ndo_set_mac_address()
○ When the MAC address needs to be changed
● Many more...
Stack’s core interfaces: For device implementers
● napi_schedule()
○ Schedule driver’s poll routine to be called
● netif_receive_skb()
○ Pass a received buffer to network core processing
○ A few other interfaces exist
● netif_stop_queue()
○ Stop upper layer from calling device’s ndo_start_xmit
● netif_wake_queue()
○ Allow upper layer to call device’s ndo_start_xmit
● More...
Frame Reception: __netif_receive_skb_core()
● Deliver to network taps (protocol sniffers)
● Ingress classification and filtering
● VLAN packet handling
● Invoke a specially registered ‘rx_handler’
○ May consume the packet
● Deliver to the registered L3 protocol handler
○ No handler? Drop
net_device->rx_handler
● Per-device registered function
○ Called internally from ‘__netif_receive_skb_core’
○ Prior to delivery to protocol handlers
● Allows special L2 processing during RX
● Semantics
○ At most one registered ‘rx_handler’ per device
○ May consume the packet
● ‘netif_receive_skb’ will not further process it
○ May instruct ‘netif_receive_skb’ to do “another round”
● Notable users
○ bridge, openvswitch, bonding, team, macvlan, macvtap
Frame Transmission: dev_queue_xmit()
● Well, the packet is set up for transmission
○ Yay! Let’s pass it to the driver’s ndo_start_xmit()!
○ Wait a minute… literally
● Device has no queue?
○ Final preps & xmit
● Device has a queue?
○ Enqueue the packet
● Using device queueing discipline
○ Kick the queue
○ Will eventually get to “final preps & xmit”
● Synchronously or asynchronously
● According to discipline
Software Net Device
● Not associated with a physical NIC
● Provides logical rx/tx functionality
○ By implementing the net_device interface
● Allows special-purpose packet processing
○ Without altering the network stack
Variants of Logical Devices
● Directly operate on specified net device(s)
○ Protocols (vlan, pppoe…)
○ Logical constructs (bridge, bonding, veth, macvlan...)
● Interact with higher network layers
○ IP based tunnels (ipip, gre, sit, l2tp…)
○ UDP based tunnels (vxlan, geneve, l2tp-udp…)
● Other constructs
○ May or may not interact with other net devices
○ lo, ppp, tun/tap, ifb...
lo: Loopback interface
static netdev_tx_t loopback_xmit(struct sk_buff *skb,
                                 struct net_device *dev)
{
	...
	netif_rx(skb); // eventually gets to netif_receive_skb
	return NETDEV_TX_OK;
}
● Every transmitted packet is bounced back for reception
○ Using the same device
[Diagram: lo device below the network core; looped-back packets are delivered up to ipv4, ipv6, ...]
vlan (circa 2.4): 802.1q Virtual LAN interface
● Has an underlying “link” net device
struct vlan_dev_priv {
u16 vlan_id;
struct net_device *real_dev;
…
● Xmit method
○ Tags the packet
○ Queues for transmission on the underlying device
vlan_tci = vlan->vlan_id;
vlan_tci |= …
skb = __vlan_hwaccel_put_tag(skb, vlan->vlan_proto, vlan_tci);
skb->dev = vlan->real_dev;
…
ret = dev_queue_xmit(skb);
tun/tap (circa 2.4): Usermode packet processing
● Device is associated with a usermode fd
○ write(fd) --> device RX
○ device TX --> ready to read(fd)
● Operation mode
○ tun: L3 packets
○ tap: L2 frames
[Diagram: tap0 device exposed to usermode via a file descriptor, below the network core]
TUN use cases
● Usermode VPN applications
○ Routing to the VPN subnet is directed to tun device
● E.g. 192.168.50.0/24 dev tun0
○ read(tun_fd, buf)
○ encrypt(buf)
○ encapsulate(buf)
○ send(tcp_sock, buf)
TAP use cases
● VM networking
○ Emulator exposes a tap for each VM NIC
○ Emulator traps VM xmit
○ Issues write(tap_fd)
○ Packet arrives at host’s net stack via tap device
veth (circa 2.6.24): Virtual Ethernet Pair
● Local ethernet “wire”
● Comes as pair of virtual ethernet interfaces
● veth TX --> peer veth RX
○ And vice versa
[Diagram: veth0 and veth1 below the network core, wired back-to-back]
veth use cases
● Container networking
○ First veth in host’s net namespace
○ Peer veth in container’s net namespace
● Local links of a virtual network
● Network emulation
Bridge: 802.1d Ethernet Bridging
● Software L2 forwarding switch
○ Expands the L2 network across various links
○ 802.1q VLAN capable circa 3.9
● Bridge has multiple slave devices (“ports”)
● rx_handler registered on slave devices
○ Picks output device(s) based on FDB
○ Calls ‘dev_queue_xmit’ on output device
○ Packet consumed! Never enters L3 stack via input device
[Diagram: br0 on top of slave ports eth0, eth1, eth2; ingress via rx_handler(), egress via dev_queue_xmit()]
Bridging physical networks
● Same / different medium
○ Slave devices present an L2 ethernet network
[Diagram: br0 bridging wifi0, eth1, ethoa3, bnep0, usbnet1]
Bridging virtual networks
[Diagram: br0 with ports tap0, tap1 (VM A, VM B), veth0 (peer veth1 in Container X), veth2 (peer veth3 in Container Y), and vxlan0 (tunnel to a remote bridge)]
MacVLAN (circa 2.6.32): MAC Address based VLANs
● Network segmentation based on destination MAC
● Macvlan devices have an underlying “lower” device
○ Macvlans on the same link have unique MAC addresses
● Macvlan xmit
○ Calls ‘dev_queue_xmit’ on lower device
● rx_handler registered on lower device
○ Look up a macvlan device based on the packet’s dest-MAC
○ None found? Return “Pass” (normal processing)
○ Found? Change skb->dev to the macvlan dev, return “Another round”
[Diagram: mvlan0, mvlan1, mvlan2 on top of eth0; demux via rx_handler()]
MacVLAN use cases
● Network segmentation
○ Where 802.1q VLAN can’t be used
[Diagram: eth0_data (54.6.30.15/24, Internet Access Service) and eth0_voip (10.0.5.94/16, VoIP Service) macvlans on top of eth0]
MacVLAN use cases
● Lightweight virtual network
○ For containers / VMs
○ Various operation modes
● MacVTAP (circa 2.6.34)
○ Each device has a tap-like FD interface
[Diagram: mvlan0, mvlan1, mvlan2 on top of eth0, one per Container X, Y, Z]
ip_gre (circa 2.2): GRE Tunnel, IP Based
● Global initialization
○ Register the IPPROTO_GRE transport protocol handler
[Diagram: the gre stack registered as a transport protocol handler alongside the ipv4 and ipv6 stacks, above the network core]
ip_gre (circa 2.2): GRE Tunnel, IP Based
● Per-device initialization
○ Store tunnel instance parameters
○ E.g. encapsulating iph.saddr, iph.daddr
[Diagram: tunnel devices gre0 (54.90.24.7) and gre1 (81.104.12.5) registered with the gre stack]
ip_gre device: Transmit method
● Routing to the remote subnet is directed to the gre device
○ E.g. 192.168.50.0/24 dev gre0
● Install the GRE header
● Consult IP routing for output decision
○ Based on tunnel parameters
● Install the encapsulating IP header
● Pass to IP stack for local output
[Diagram: gre0 and gre1 devices above the gre stack and network core]
ip_gre device: Receive path
● Encapsulating packet arrives on a net device
● IP stack invokes the registered transport handler
● GRE handler looks up a matching tunnel instance
○ Based on encapsulating IP header fields
● Changes skb->dev to the matched tunnel device
● Skb points to inner packet
● Re-submit to network’s core RX path
[Diagram: gre0 and gre1 devices above the gre stack and network core]
bond (circa 2.2): Link Aggregation
● Aggregates multiple interfaces into a single “bond”
○ Various operating modes:
Round-robin, active-backup, broadcast, 802.3ad...
● Bond device has multiple “slave” devices
● Bond xmit
○ Calls ‘dev_queue_xmit’ on slave device(s)
● rx_handler registered on slave devices
○ Changes skb->dev to the bond device, returns “Another round”
[Diagram: bond0 on top of slaves eth0 and eth1; ingress via rx_handler(), egress via dev_queue_xmit()]
bond use cases
● Bandwidth aggregation
● Fault tolerance / HA
● L2 based Load Balancing
See the similar ‘team’ driver (circa 3.3)