Fun with Network Interfaces
Shmulik Ladkani, March 2016
This work is licensed under a Creative Commons Attribution 4.0 International License.


On the Menu

● Linux network stack, a (quick) intro
○ What’s this net_device anyway?
○ Programming interfaces
○ Frame reception and transmission
● Logical network interfaces
○ What?
○ Why?
○ Examples
○ Examples
○ Examples

Agenda

● Goals
○ Strengthen foundations
○ Explain the interaction of the main network stack components
○ Familiarize with the building blocks of virtual networks
○ Ease further research
● Non-goals
○ Mastering device driver programming
○ How network gear operates in detail
○ Specific component deep dives

Disclaimer

● The Linux network stack is huge and complex
○ Let’s skip some fine details

Network Stack Intro

[Diagram by Arnout Vandecappelle]

Network Stack Intro — Take II

[Diagram by Jan Engelhardt]

Network Stack Intro — Take III

Network Stack Layers

L7: usermode app (socket API)
L4: tcp, udp, icmp, igmp, gre, ipip, ...
L3: ipv4, ipv6, arp, pppoe, ...
L2: network core
Device-specific L2: device drivers

Network Core

● Generic functionalities of a network device

● RX
○ Processing of incoming frames
○ Delivery to upper protocols
● TX
○ Queuing
○ Final processing
○ Hand-over to the driver’s transmit method

Struct net_device

● Represents a network interface

● One for each network device in the system
○ Either a physical device or a logical (software) one

Struct net_device — Common properties

● Identified by a ‘name’ and an ‘ifindex’
○ Unique within a network namespace
● Has BSD-like ‘flags’
IFF_UP, IFF_LOOPBACK, IFF_POINTOPOINT, IFF_NOARP, IFF_PROMISC...
● Has ‘features’
NETIF_F_SG_BIT, NETIF_F_HW_CSUM_BIT, NETIF_F_GSO_BIT,
NETIF_F_GRO_BIT, NETIF_F_LRO_BIT, NETIF_F_RXHASH_BIT,
NETIF_F_RXCSUM_BIT...
● Has many other fields...
● Holds the associated device operations:
const struct net_device_ops *netdev_ops;

Struct net_device_ops

● Interface. Defines all device methods
○ Implemented by the driver
  ● E.g. e1000e_netdev_ops, bcmgenet_netdev_ops …
○ Used by the network core
● Fat interface…
○ 44 methods in v3.4
○ 59 methods in v3.14
○ 68 methods in v4.4
○ A few methods are #ifdef protected
○ Some are optional

Struct net_device_ops — Common methods

● ndo_open()
○ Upon device transition to the UP state
● ndo_stop()
○ Upon device transition to the DOWN state
● ndo_start_xmit()
○ When a packet needs to be transmitted
● ndo_set_features()
○ Update device configuration to new features
● ndo_get_stats()
○ Get device usage statistics
● ndo_set_mac_address()
○ When the MAC address needs to be changed
● Many more...

Stack’s core interfaces — For device implementers

● napi_schedule()
○ Schedule the driver’s poll routine to be called
● netif_receive_skb()
○ Pass a received buffer to network core processing
○ A few other similar interfaces exist
● netif_stop_queue()
○ Stop the upper layer from calling the device’s ndo_start_xmit
● netif_wake_queue()
○ Allow the upper layer to call the device’s ndo_start_xmit
● More...

Frame Reception — __netif_receive_skb_core()

● Deliver to network taps (protocol sniffers)
● Ingress classification and filtering
● VLAN packet handling
● Invoke a specially registered ‘rx_handler’
○ May consume the packet
● Deliver to the registered L3 protocol handler
○ No handler? Drop

net_device->rx_handler

● Per-device registered function
○ Called internally from ‘__netif_receive_skb_core’
○ Prior to delivery to protocol handlers
● Allows special L2 processing during RX
● Semantics
○ At most one registered ‘rx_handler’ per device
○ May consume the packet
  ● ‘netif_receive_skb’ will not process it further
○ May instruct ‘netif_receive_skb’ to do “another round”
● Notable users
○ bridge, openvswitch, bonding, team, macvlan, macvtap

Frame Transmission — dev_queue_xmit()

● Well, the packet is set up for transmission
○ Yay! Let’s pass it to the driver’s ndo_start_xmit()!
○ Wait a minute… literally
● Device has no queue?
○ Final preps & xmit
● Device has a queue?
○ Enqueue the packet
  ● Using the device’s queueing discipline
○ Kick the queue
○ Will eventually get to “final preps & xmit”
  ● Synchronously or asynchronously
  ● According to the discipline

Software Network Interfaces

a.k.a. logical network interfaces

Software Net Device

● Not associated with a physical NIC
● Provides logical rx/tx functionality
○ By implementing the net_device interface
● Allows special-purpose packet processing
○ Without altering the network stack

Variants of Logical Devices

● Directly operate on specified net device(s)
○ Protocols (vlan, pppoe…)
○ Logical constructs (bridge, bonding, veth, macvlan...)
● Interact with higher network layers
○ IP-based tunnels (ipip, gre, sit, l2tp…)
○ UDP-based tunnels (vxlan, geneve, l2tp-udp…)
● Other constructs
○ May or may not interact with other net devices
○ lo, ppp, tun/tap, ifb...

Loopback

lo: Loopback interface

static netdev_tx_t loopback_xmit(struct sk_buff *skb,
                                 struct net_device *dev)
{
        ...
        netif_rx(skb); /* eventually gets to netif_receive_skb */
}

● Every transmitted packet is bounced back for reception
○ Using the same device

[Diagram: the lo device sits below the network core, feeding ipv4/ipv6]

VLAN

vlan: (circa 2.4) — 802.1q Virtual LAN interface

● Has an underlying “link” net device:

struct vlan_dev_priv {
        u16 vlan_id;
        struct net_device *real_dev;
        ...
};

● Xmit method
○ Tags the packet
○ Queues it for transmission on the underlying device

vlan_tci = vlan->vlan_id;
vlan_tci |= …
skb = __vlan_hwaccel_put_tag(skb, vlan->vlan_proto, vlan_tci);
skb->dev = vlan->real_dev;
ret = dev_queue_xmit(skb);

TUN/TAP

tun/tap: (circa 2.4) — Usermode packet processing

● Device is associated with a usermode fd
○ write(fd) --> device RX
○ device TX --> ready to read(fd)
● Operation mode
○ tun: L3 packets
○ tap: L2 frames

[Diagram: tap0 paired with a usermode fd, below the network core]

TUN use cases

● Usermode VPN applications
○ Routing to the VPN subnet is directed to the tun device
  ● E.g. 192.168.50.0/24 dev tun0
○ read(tun_fd, buf)
○ encrypt(buf)
○ encapsulate(buf)
○ send(tcp_sock, buf)

TAP use cases

● VM networking
○ Emulator exposes a tap device for each VM NIC
○ Emulator traps VM xmit
○ Issues write(tap_fd)
○ Packet arrives at the host’s net stack via the tap device

VETH

veth: (circa 2.6.24) — Virtual Ethernet Pair

● Local ethernet “wire”
● Comes as a pair of virtual ethernet interfaces
● veth TX --> peer veth RX
○ And vice versa

[Diagram: veth0 and veth1 connected back-to-back below the network core]

veth use cases

● Container networking
○ First veth in the host’s net namespace
○ Peer veth in the container’s net namespace
● Local links of a virtual network
● Network emulation

Bridge

Bridge: 802.1d Ethernet Bridging

● Software L2 forwarding switch
○ Expands the L2 network across various links
○ 802.1q VLAN capable circa 3.9
● Bridge has multiple slave devices (“ports”)
● rx_handler registered on the slave devices
○ Picks output device(s) based on the FDB
○ Calls ‘dev_queue_xmit’ on the output device
○ Packet consumed! Never enters the L3 stack via the input device

[Diagram: br0 with ports eth0, eth1, eth2; rx_handler() on ingress, dev_queue_xmit() on egress]

Bridging physical networks

● Same / different medium
○ Slave devices present an L2 ethernet network

[Diagram: br0 bridging eth1, wifi0, ethoa3, bnep0, usbnet1]

Bridging virtual networks

[Diagram: br0 bridging tap0 (VM A), tap1 (VM B), veth0 (peer veth1 in Container X), veth2 (peer veth3 in Container Y), and vxlan0 — a tunnel to a remote bridge]

MACVLAN

MacVLAN: (circa 2.6.32) — MAC Address based VLANs

● Network segmentation based on destination MAC
● Macvlan devices have an underlying “lower” device
○ Macvlans on the same link have unique MAC addresses
● Macvlan xmit
○ Calls ‘dev_queue_xmit’ on the lower device
● rx_handler registered on the lower device
○ Looks for a macvlan device based on the packet’s dest MAC
○ None found? Return “Pass” (normal processing)
○ Found? Change skb->dev to the macvlan dev, return “Another round”

[Diagram: mvlan0, mvlan1, mvlan2 on top of eth0; rx_handler() demuxes by dest MAC]

MacVLAN use cases

● Network segmentation
○ Where 802.1q VLAN can’t be used

[Diagram: eth0_data (54.6.30.15/24, Internet Access Service) and eth0_voip (10.0.5.94/16, VoIP Service) on top of eth0]

MacVLAN use cases

● Lightweight virtual network
○ For containers / VMs
○ Various operation modes
● MacVTAP (circa 2.6.34)
○ Each device has a tap-like FD interface

[Diagram: mvlan0, mvlan1, mvlan2 on top of eth0, attached to Containers X, Y, Z]

GRE

ip_gre: (circa 2.2) — GRE Tunnel, IP Based

● Global initialization
○ Register the IPPROTO_GRE transport protocol handler

[Diagram: the gre stack registered alongside ipv4/ipv6 under the network core]

ip_gre: (circa 2.2) — GRE Tunnel, IP Based

● Per-device initialization
○ Store tunnel instance parameters
○ E.g. encapsulating iph.saddr, iph.daddr

[Diagram: gre0 (54.90.24.7) and gre1 (81.104.12.5) devices on top of the gre stack]

ip_gre device — Transmit method

● Routing to the remote subnet is directed to the gre device
○ E.g. 192.168.50.0/24 dev gre0
● Install the GRE header
● Consult IP routing for the output decision
○ Based on tunnel parameters
● Install the encapsulating IP header
● Pass to the IP stack for local output


ip_gre device — Receive path

● The encapsulating packet arrives on a net device
● The IP stack invokes the registered transport handler
● The GRE handler looks up a matching tunnel instance
○ Based on the encapsulating IP header fields
● Changes skb->dev to the matched tunnel device
● Skb now points to the inner packet
● Re-submit to the network core’s RX path


Bonding

bond: (circa 2.2) — Link Aggregation

● Aggregates multiple interfaces into a single “bond”
○ Various operating modes:
  Round-robin, active-backup, broadcast, 802.3ad...
● The bond device has multiple “slave” devices
● Bond xmit
○ Calls ‘dev_queue_xmit’ on slave device(s)
● rx_handler registered on the slave devices
○ Changes skb->dev to the bond device, returns “Another round”

[Diagram: bond0 aggregating eth0 and eth1; rx_handler() on ingress, dev_queue_xmit() on egress]

bond use cases

● Bandwidth aggregation
● Fault tolerance / HA
● L2-based load balancing

See the similar ‘team’ driver (circa 3.3)

Summary

● Network stack brief

● net_device abstraction

● Logical network interfaces
○ Many covered!

○ Many others exist (see: ifb, vrf…)

● Questions?

● Contact me!