A RINA light implementation Vincenzo Maffione 20/02/2017

Rlite software-architecture (1)


Introduction (1)

● A Free and Open Source light implementation of RINA for Linux

● Implementation split between user-space and kernel-space

● KISS approach → codebase is clean and essential

● Focus:

○ basic functionality - do few things but do them well

○ stability and performance - support deployments with hundreds of nodes

○ minimality - avoid over-engineering

● Main goal: a baseline implementation for future RINA products

● Code and documentation available at https://github.com/vmaffione/rlite

Introduction (2)

● ~ 27 Klocs (not including blanks)

○ kernel-space: ~ 9 Klocs

○ user-space: ~ 18 Klocs

■ including tools and example applications

● Written mostly in C (some parts are C++ for convenience)

○ C: 14 Klocs

○ C++: 7 Klocs

● Network applications can be written in C

● Python bindings available to write network applications in Python

Introduction (3)

● kernel-space is implemented as a set of out-of-tree kernel modules, which run on the unmodified Linux kernel.

○ Linux Kbuild system is used to build the modules against the running kernel

○ Build time (no parallel make): 3-15 seconds

● user-space is implemented as a set of shared libraries and programs

○ CMake is used to configure and build libraries and executables

○ Build time (no parallel make): 15-60 seconds

Basic features (1)

● Applications:

○ Flow allocation and deallocation, with QoS specification

○ Application registration and unregistration

○ Data transfer

● Stack administration:

○ Creation, deletion and configuration of IPCPs

○ Registration and enrollment among IPCPs

○ Monitoring and inspection

■ inspection of IPCPs in the system

■ inspection of RIBs

■ per-flow statistics

Basic features (2)

● QoS (supported through DTCP):

○ Flow control

○ Retransmission control

○ Maximum allowable gap

○ Simple token-bucket rate-limiting

● Decent performance (detailed performance plots to come)

○ About 9.5 Gbps on a 10 Gbit link without flow control and retransmission

○ About 6 Gbps on a 10 Gbit link with flow control

○ A lot of room for optimizations

● Stability indicators

○ Ran 10-day VM-based experiments with up to 35 nodes, two levels of normal DIFs and 50 flow allocations per second

○ Ran experiments with up to 10 levels of DIFs
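The token-bucket rate-limiting listed among the QoS features can be illustrated with a minimal sketch. This is not rlite's kernel code: the class name, the byte-based accounting and the explicit `now` timestamps are assumptions chosen to keep the example deterministic.

```python
class TokenBucket:
    """Byte-based token bucket: `rate` bytes/s sustained, bursts up to `burst` bytes.

    Illustrative model only; names and units are assumptions, not rlite code.
    """

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, size, now):
        """Return True (and consume tokens) if `size` bytes may be sent at time `now`."""
        # Refill in proportion to the elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


bucket = TokenBucket(rate=1000, burst=1500)
first = bucket.allow(1500, now=0.0)   # full bucket: the PDU passes
second = bucket.allow(1500, now=0.5)  # only 500 bytes refilled: the PDU is held back
third = bucket.allow(1500, now=1.5)   # one more second refills the bucket
```

A sender that gets `False` would typically queue the PDU and retry once enough tokens have accumulated.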

Architecture overview (1)

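[Figure: architecture diagram. User-space: rlite-ctl, uipcps daemon and applications, with the librlite-cdap, librlite-conf and librlite libraries. Kernel-space: rlite, normal, shim-eth, shim-loopback, shim-tcp4, shim-udp4 and shim-hv. The /dev/rlite and /dev/rlite-io devices sit between user-space and kernel-space.]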

Architecture overview (2)

● kernel-space

○ Supports control operations

○ Implements datapath

○ Keeps state

● user-space

○ Libraries to abstract interaction with kernel-space functionalities

○ A daemon to implement management part of (many) IPCPs

○ An iproute2-like command-line tool to administer the stack

● Interactions between kernel-space and user-space only happen through character devices → therefore through file descriptors

Kernel-space architecture (1)

● Supported functionalities:

○ IPCP creation, deletion and configuration (kernel keeps a per-IPCP data structure)

○ Flow (de)allocation (kernel keeps a per-flow data structure)

○ Application (un)registration (kernel keeps a data structure for each registered application)

○ RMT, DTP and DTCP components of the normal IPCP

○ Shim IPCPs (e.g. interaction with network device drivers)

● State is maintained in kernel-space:

○ user-space can crash or be restarted at any time

○ user-space can recover state from kernel

Kernel-space architecture (2)

● User-space interacts with kernel-space only through two character devices

○ /dev/rlite for control operations

○ /dev/rlite-io for data transfer and synchronization

● Consequently, interactions only happen through file descriptors

● Both are “cloning devices”

○ each open() creates a new independent kernel-space instance

● Both devices support blocking and non-blocking operation

○ Standard poll() and select() widely used with the devices
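The blocking and non-blocking modes mentioned above follow the usual file-descriptor semantics. The sketch below shows the pattern on a plain pipe, used as a stand-in because it assumes no rlite kernel module is loaded; on a real system the descriptor would come from opening one of the two devices. The helper names are illustrative.

```python
import os
import select

# Stand-in descriptor (assumption): a pipe instead of an open /dev/rlite-io.
rfd, wfd = os.pipe()
os.set_blocking(rfd, False)  # switch the read side to non-blocking mode


def try_read(fd, size=4096):
    """Non-blocking read: return the bytes read, or None if nothing is ready."""
    try:
        return os.read(fd, size)
    except BlockingIOError:
        return None


def wait_readable(fd, timeout_s):
    """Sleep in select() until fd is readable or the timeout expires."""
    ready, _, _ = select.select([fd], [], [], timeout_s)
    return fd in ready


before = try_read(rfd)            # nothing written yet
os.write(wfd, b"hello")
ready = wait_readable(rfd, 1.0)   # returns as soon as data is available
after = try_read(rfd)
```

The same two helpers work on any descriptor that supports `select()`, which is what makes a device-based interface easy to integrate into existing event loops.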

Kernel-space architecture (3)

● /dev/rlite used for control operations

○ flow (de)allocation

○ Application (un)registration

○ IPCP creation, deletion and configuration

○ Management of PDU forwarding table

○ interactions between user-space and kernel-space parts of IPCPs

○ inspection and monitoring operations on flows and IPCPs

○ ...

Kernel-space architecture (4)

● Control operations follow a request/response paradigm:

○ write() to the control device to submit a request message

○ Response messages (not always present) can be read through read()

● The control device is used to avoid ioctl() and netlink

○ Easier porting to other OSes (e.g. FreeBSD)

● Request and response messages are represented by packed structs and are serialized/deserialized during the user-space ←→ kernel-space transition

○ support for string (de)serialization

○ support for (apn, api, aen, aei) name (de)serialization
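A minimal sketch of the request/response exchange, under stated assumptions: the header layout, the message type numbers and the length-prefixed string encoding are all hypothetical, and a datagram socket pair stands in for the control device so that the example runs without the kernel modules.

```python
import socket
import struct

HDR = struct.Struct("<HI")   # hypothetical packed header: msg_type (u16), event_id (u32)
MSG_REGISTER = 1             # hypothetical message type numbers
MSG_REGISTER_RESP = 2


def pack_str(text):
    """Serialize a string as a 16-bit length followed by the bytes."""
    data = text.encode()
    return struct.pack("<H", len(data)) + data


def unpack_str(buf, off):
    """Deserialize a length-prefixed string; return (string, next offset)."""
    (length,) = struct.unpack_from("<H", buf, off)
    off += 2
    return buf[off:off + length].decode(), off + length


def build_register(event_id, dif_name, appl_name):
    return HDR.pack(MSG_REGISTER, event_id) + pack_str(dif_name) + pack_str(appl_name)


def parse_register(buf):
    msg_type, event_id = HDR.unpack_from(buf, 0)
    dif_name, off = unpack_str(buf, HDR.size)
    appl_name, _ = unpack_str(buf, off)
    return msg_type, event_id, dif_name, appl_name


# Stand-in for the control device: one end plays user-space, the other the kernel.
user_end, kernel_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

user_end.send(build_register(7, "n.1.DIF", "echo-server"))        # write() a request
request = parse_register(kernel_end.recv(4096))
kernel_end.send(HDR.pack(MSG_REGISTER_RESP, request[1]) + b"\x00")  # result code 0

response = user_end.recv(4096)                                     # read() the response
resp_type, resp_event = HDR.unpack_from(response, 0)
result = response[HDR.size]
```

The event identifier echoed in the response is what lets the requester match a response to its pending request in this sketch.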

Kernel-space architecture (5)

● /dev/rlite-io for data transfer and synchronization

○ read()

○ write()

○ select(), poll(), epoll()

● Application workflow:

○ Use the control device to allocate a flow (kernel-space object)

○ Bind the flow to a newly-created data transfer file descriptor - this is the only task performed by means of ioctl()

○ Use the data transfer file descriptor to exchange SDUs and/or wait for events

○ Close file descriptor to deallocate the associated flow

● Special binding mode to exchange management SDUs

Kernel-space architecture (6)

● Usual abstract factory pattern to manage different types of IPCPs

○ normal: implementation of the regular IPCP

○ shim-loopback: supports system-local IPC, with optional queued mode to decouple TX and RX code-paths, and optional packet drop emulation

○ shim-eth: uses network device drivers to transmit and receive SDUs, sharing the device with the Linux network stack

○ shim-udp4: tunnels RINA traffic over UDP socket; mostly implemented in user-space, only data transfer is implemented in kernel-space

○ shim-tcp4: same as shim-udp4, but using a TCP socket; deprecated, since it duplicates flow-control and congestion control done in higher layers

○ shim-hv: uses VMPI devices to transmit and receive SDUs
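The factory idea above can be sketched as a registry that maps an IPCP type name to a constructor. This is an illustrative model, not the kernel implementation: the class names, the `sdu_write` method and the 4-byte placeholder header are assumptions.

```python
IPCP_FACTORIES = {}  # IPCP type name -> class that implements it


def register_ipcp_type(dif_type):
    """Class decorator that records a factory under the given type name."""
    def decorator(cls):
        IPCP_FACTORIES[dif_type] = cls
        cls.dif_type = dif_type
        return cls
    return decorator


class Ipcp:
    def __init__(self, name):
        self.name = name

    def sdu_write(self, sdu):
        raise NotImplementedError


@register_ipcp_type("shim-loopback")
class ShimLoopback(Ipcp):
    def __init__(self, name):
        super().__init__(name)
        self.rx_queue = []  # what is written is handed straight back to the receiver

    def sdu_write(self, sdu):
        self.rx_queue.append(bytes(sdu))
        return len(sdu)


@register_ipcp_type("normal")
class Normal(Ipcp):
    HEADER = b"PCI:"  # placeholder for the real protocol control information

    def __init__(self, name):
        super().__init__(name)
        self.lower = None  # IPCP providing the N-1 flow

    def sdu_write(self, sdu):
        if self.lower is None:
            raise RuntimeError("no N-1 flow available")
        self.lower.sdu_write(self.HEADER + bytes(sdu))
        return len(sdu)


def ipcp_create(dif_type, name):
    """Generic entry point: callers never name a concrete class."""
    try:
        factory = IPCP_FACTORIES[dif_type]
    except KeyError:
        raise ValueError("unknown IPCP type: " + dif_type) from None
    return factory(name)


lo = ipcp_create("shim-loopback", "lo.IPCP")
normal = ipcp_create("normal", "n.IPCP")
normal.lower = lo
normal.sdu_write(b"hello")
```

Adding a new IPCP type in this model only requires a new decorated class; the creation path stays unchanged.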

Some kernel-space internals

● Reference counters widely used to manage lifetime of objects (e.g. IPCPs, flows, registered applications, PDUs)

● sk_buff-like approach to avoid copies throughout the datapath

● dynamic allocation of PDU buffers

○ The amount of header space to reserve at allocation time is precomputed by the user-space daemon, depending on the local IPCP dependency graph

● All PDU queues are limited in size to keep memory usage under control

● Deferred work (workqueues) used only when necessary, to keep latency low

○ Example: driver transmission routine directly executes in the context of an application write() system call, when possible
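The combination of reserved header space and reference counting can be modelled in a few lines. The sketch below is only an analogy to the sk_buff-like approach: `PduBuf`, its methods and the 4-byte headers are hypothetical names and values, not rlite data structures.

```python
class PduBuf:
    """Buffer allocated once, with headroom so lower layers can prepend headers in place."""

    def __init__(self, payload, headroom):
        self.buf = bytearray(headroom) + bytearray(payload)  # single allocation
        self.start = headroom  # offset of the first valid byte
        self.refcnt = 1        # the creator holds the first reference

    def push(self, header):
        """Prepend a header by moving the start offset back; the payload is not copied."""
        if len(header) > self.start:
            raise ValueError("not enough headroom reserved")
        self.start -= len(header)
        self.buf[self.start:self.start + len(header)] = header

    def data(self):
        """Return a copy of the valid bytes (for inspection only)."""
        return bytes(self.buf[self.start:])

    def get(self):
        """Take an additional reference, e.g. when queuing for retransmission."""
        self.refcnt += 1
        return self

    def put(self):
        """Drop a reference; free the storage when the last one goes away."""
        self.refcnt -= 1
        if self.refcnt == 0:
            self.buf = None
        return self.refcnt == 0


pdu = PduBuf(b"data", headroom=8)  # headroom precomputed for two 4-byte headers
pdu.push(b"DTP1")                  # header added by the upper IPCP
queued = pdu.get()                 # second owner, e.g. a retransmission queue
pdu.push(b"ETH0")                  # header added by the lower IPCP
wire = pdu.data()
still_alive = not queued.put()     # one reference left after this put
freed = pdu.put()                  # last reference dropped
```

If the headroom were too small, the `push()` in the lower layer would fail, which is why the amount to reserve has to be known at allocation time.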


user-space libraries

● librlite (written in C)

○ main library, abstracts interactions with the rlite control device (/dev/rlite)

○ provides common utilities and helpers (application names, flow specification, control messages, ...)

○ provides an API for RINA applications

● Other libraries

○ librlite-conf (C): extends librlite with kernel-space IPCP management functionalities

○ librlite-cdap (C++): CDAP implementation based on Google Protocol Buffers

librlite - Overview

● librlite provides API calls to interact with control device instances

○ Validation, serialization and deserialization of control messages in both directions (user → kernel, kernel → user)

● It defines a POSIX-like API for applications:

○ Reminiscent of the socket API, to ease porting of existing socket applications...

○ … yet with the full power of RINA API (QoS support and complete naming scheme)

○ Easy to learn for grown-up network developers!

○ Documentation available at https://github.com/vmaffione/rlite/blob/master/include/rina/api.h

○ Other resources: https://github.com/IRATI/stack/wiki/Application-API

librlite - Application API

● Main API calls:

○ int rina_open() → fd

■ Opens a control device instance, returning a file descriptor.

○ int rina_flow_alloc(dif_name, local_name, remote_name, flowspec, flags) → fd

■ Issues a flow allocation request and possibly waits for the associated response. Returns a file descriptor to be used for data transfer.

○ int rina_register(fd, dif_name, appl_name, flags)

■ Registers an application into a given DIF.

○ int rina_unregister(fd, dif_name, appl_name, flags)

■ Unregisters an application from a given DIF.

○ int rina_flow_accept(fd, flags) → remote_appl, flowspec

■ Waits for and possibly accepts an incoming flow request, where the destination application is one of those registered to the control device referred to by fd. Returns a file descriptor to be used for data transfer.
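A hedged sketch of how a client might drive these calls from Python through `ctypes`. Several details are assumptions rather than documented facts: the shared-object name `librlite.so`, the C argument types (inferred from the call list above), and the use of a NULL flowspec to request default QoS. The function returns `None` when the library is not installed, so the sketch is safe to run anywhere.

```python
import ctypes
import os


def load_rina_api(soname="librlite.so"):
    """Load the library and declare the calls used below; None if unavailable.

    The soname and the argument types are assumptions, not taken from the slides.
    """
    try:
        lib = ctypes.CDLL(soname, use_errno=True)
        lib.rina_open.restype = ctypes.c_int
        lib.rina_flow_alloc.restype = ctypes.c_int
        lib.rina_flow_alloc.argtypes = [
            ctypes.c_char_p,  # dif_name
            ctypes.c_char_p,  # local_name
            ctypes.c_char_p,  # remote_name
            ctypes.c_void_p,  # flowspec (NULL assumed to mean default QoS)
            ctypes.c_uint,    # flags
        ]
    except (OSError, AttributeError):
        return None
    return lib


def echo_once(lib, dif_name, local_name, remote_name, payload):
    """Allocate a flow, send one SDU, read one SDU back, deallocate the flow."""
    fd = lib.rina_flow_alloc(dif_name.encode(), local_name.encode(),
                             remote_name.encode(), None, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "rina_flow_alloc failed")
    try:
        os.write(fd, payload)              # the returned fd behaves like any other fd
        return os.read(fd, len(payload))
    finally:
        os.close(fd)                       # closing the fd deallocates the flow


lib = load_rina_api()
# With the stack running and an echo server registered (hypothetical names):
#   echo_once(lib, "n.1.DIF", "client", "echo-server", b"ping")
```

The point of the sketch is the shape of the workflow: once a flow is allocated, data transfer uses ordinary descriptor operations.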

librlite-conf

● It is the backend for the rlite-ctl stack administration tool

● Exports the management and inspection functionalities:

○ IPCP creation

○ IPCP deletion

○ IPCP configuration

○ Fetch of current flows (with related statistics)

○ Dump state of a specific flow

○ Synchronization with uipcps daemon, to wait for the user-space part of an IPCP to show up

○ ...

librlite-cdap

● CDAP implementation using Google Protocol Buffers as concrete syntax

● Provides CDAP message constructors, serializers and deserializers

● Provides a CDAP connection object to send and receive CDAP messages

● Each CDAP connection wraps a file descriptor

○ In this way CDAP can be used over arbitrary file descriptors

○ Primarily meant to be used with /dev/rlite-io file descriptors

○ No dependencies on other parts of rlite, can be reused as a stand-alone component


Uipcps daemon - Overview

● A multi-threaded single-process daemon that implements the management part of some IPCPs

● When an IPCP is created by the kernel, the daemon gets notified, and creates the corresponding user-space IPCP (uipcp)

● For regular IPCPs, it implements:

○ Flow allocation RIB objects

○ Directory Forwarding Table RIB objects

○ Enrollment RIB objects and enrollment state machines

○ Routing RIB objects

○ Address allocation RIB objects

● For shim-udp4 IPCPs it implements UDP sockets setup and dynamic UDP port allocation

● For shim-tcp4 IPCPs it implements TCP connection setup and teardown for both client and server side (connect(), accept(), etc.)

Uipcps daemon - Internals

● A custom event-loop thread for each IPCP

● An additional thread that implements a UNIX socket server to serve requests coming from the rlite-ctl tool (or other future agents)

● Abstract factory pattern to manage different types of uipcps

● Reference counters used to manage uipcps lifetime

● Subsystems:

○ UNIX socket server, written in C

○ uipcps container for generic uipcp management (creation, deletion, …), written in C

○ shim-udp4 and shim-tcp4 user-space implementation, written in C

○ normal IPCP user-space implementation, written in C++ mainly because of CDAP

● C++ code confined inside the uipcp-normal statically linked library.

Uipcps daemon - Subsystems

[Figure: uipcps daemon subsystems. Labels: rlite-ctl, uipcp daemon, application, unix server, uipcps container, normal, shim udp4, librlite-cdap, librlite.]

Uipcps daemon - Event loop

● A custom event-loop on top of rlite control devices

● The event-loop thread uses select() to wait on many file descriptors

○ rlite control devices: when events happen on the control device, event-specific callbacks get executed

○ Other file descriptors: when an event is ready on one of those, a user-provided callback gets executed

● Supports timers, which can be used to execute a callback after a certain amount of time
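A minimal sketch of such an event loop: `select()` over registered descriptors, one callback per descriptor, plus a timer heap. The class and method names are assumptions for illustration, and a pipe stands in for an rlite device; the daemon's actual loop is written in C.

```python
import heapq
import os
import select
import time


class EventLoop:
    def __init__(self):
        self.fd_callbacks = {}  # fd -> callback(fd)
        self.timers = []        # heap of (deadline, sequence number, callback)
        self.seq = 0

    def add_fd(self, fd, callback):
        self.fd_callbacks[fd] = callback

    def add_timer(self, delay_s, callback):
        self.seq += 1  # tie-breaker so that callbacks are never compared
        heapq.heappush(self.timers, (time.monotonic() + delay_s, self.seq, callback))

    def run_once(self, max_wait_s=1.0):
        # Sleep no longer than the time left before the earliest timer.
        timeout = max_wait_s
        if self.timers:
            timeout = min(timeout, max(0.0, self.timers[0][0] - time.monotonic()))
        ready, _, _ = select.select(list(self.fd_callbacks), [], [], timeout)
        for fd in ready:
            self.fd_callbacks[fd](fd)
        now = time.monotonic()
        while self.timers and self.timers[0][0] <= now:
            _, _, callback = heapq.heappop(self.timers)
            callback()


events = []
rfd, wfd = os.pipe()  # stand-in for a device file descriptor
loop = EventLoop()
loop.add_fd(rfd, lambda fd: events.append(("fd", os.read(fd, 64))))
loop.add_timer(0.01, lambda: events.append(("timer", None)))
os.write(wfd, b"ping")

give_up = time.monotonic() + 2.0
while len(events) < 2 and time.monotonic() < give_up:
    loop.run_once(0.05)
```

Because readable descriptors are served before expired timers within one iteration, the descriptor callback in this example always runs before the timer callback.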

Uipcps daemon - Advanced features

● The uipcp-containers module keeps track of the IPCPs in the local system and the flows allocated among them

○ This information is maintained in a graph of local IPCPs

○ A node for each IPCP, an edge for each inter-IPCP flow

○ Graph used for automatic computation of:

■ per-IPCP Maximum SDU size (using the constraints provided by shim DIFs)

■ per-IPCP PCI header space to be reserved at kernel buffer allocation

○ Result of computation is pushed to the kernel for optimized operation

● Optional automatic re-enrollment triggers to create N-1 flows where they are missing
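One way to picture the graph computation is the sketch below. It assumes an acyclic dependency graph, takes the most restrictive lower IPCP for the maximum SDU size, and the deepest header stack for the space to reserve. The rule itself, the dictionary layout and all sizes are illustrative assumptions, not the daemon's actual algorithm or values.

```python
def compute_limits(ipcps):
    """Return {name: (max_sdu_size, header_space)} for every IPCP.

    `ipcps` maps a name to {"hdr": own header size, "lowers": names of the
    IPCPs it uses as N-1 providers, "mss": size limit (shims only)}.
    """
    memo = {}

    def visit(name):
        if name in memo:
            return memo[name]
        node = ipcps[name]
        if not node["lowers"]:
            # Shim IPCP: the constraint comes from the underlying technology.
            result = (node["mss"], node["hdr"])
        else:
            lowers = [visit(lower) for lower in node["lowers"]]
            # An SDU must fit in every lower IPCP once our own header is added.
            max_sdu = min(size for size, _ in lowers) - node["hdr"]
            # Reserve room for our header plus the deepest stack below us.
            space = node["hdr"] + max(hdr for _, hdr in lowers)
            result = (max_sdu, space)
        memo[name] = result
        return result

    return {name: visit(name) for name in ipcps}


# Hypothetical local system: two shims, a normal IPCP over both, and a second
# normal IPCP stacked on the first one.
limits = compute_limits({
    "e.1": {"hdr": 14, "lowers": [], "mss": 1500},
    "e.2": {"hdr": 14, "lowers": [], "mss": 9000},
    "n.1": {"hdr": 20, "lowers": ["e.1", "e.2"]},
    "n.2": {"hdr": 20, "lowers": ["n.1"]},
})
```

With these made-up sizes, `n.1` is limited by the smaller shim, and `n.2` must reserve room for three stacked headers.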

rlite-ctl

● An iproute2-like command-line tool to administer and monitor IPCPs

● Functionalities:

○ IPCP creation and deletion

○ IPCP configuration

○ Registration of an IPCP to a DIF

○ Enrollment between a local IPCP and a remote IPCP

○ Show list of IPCPs

○ Show RIB of a DIF

○ Show list of flows

○ Dump state of a specific flow

Common functionalities

● Common code is compiled both in user-space and kernel-space, to ease maintenance:

○ Serialization and deserialization routines of control messages across user/kernel interface

■ Table-based serialization/deserialization, adding a new message is straightforward

○ Helper functions for RINA names - (APN, API, AEN, AEI) tuples.
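Table-based serialization can be sketched as one generic pair of routines driven by a per-message field table. The message types, field names and wire layout below are hypothetical; only the table-driven idea is taken from the slide.

```python
import struct

# Hypothetical message table: type -> (name, ordered list of (field, kind)).
MSG_TABLE = {
    1: ("ipcp_create", [("event_id", "u32"), ("name", "str"), ("dif_type", "str")]),
    2: ("flow_stats", [("event_id", "u32"), ("port_id", "u16"), ("tx_bytes", "u64")]),
}
INT_FMT = {"u16": "<H", "u32": "<I", "u64": "<Q"}


def serialize(msg_type, fields):
    """Encode a message by walking its table entry field by field."""
    out = bytearray(struct.pack("<H", msg_type))
    for name, kind in MSG_TABLE[msg_type][1]:
        if kind == "str":
            data = fields[name].encode()
            out += struct.pack("<H", len(data)) + data  # length-prefixed string
        else:
            out += struct.pack(INT_FMT[kind], fields[name])
    return bytes(out)


def deserialize(buf):
    """Decode a message using the same table; return (msg_type, fields)."""
    (msg_type,) = struct.unpack_from("<H", buf, 0)
    off = 2
    fields = {}
    for name, kind in MSG_TABLE[msg_type][1]:
        if kind == "str":
            (length,) = struct.unpack_from("<H", buf, off)
            off += 2
            fields[name] = buf[off:off + length].decode()
            off += length
        else:
            fmt = INT_FMT[kind]
            (fields[name],) = struct.unpack_from(fmt, buf, off)
            off += struct.calcsize(fmt)
    return msg_type, fields


original = {"event_id": 3, "name": "n.IPCP", "dif_type": "normal"}
wire = serialize(1, original)
decoded_type, decoded = deserialize(wire)
```

Supporting a new message in this model means adding one row to `MSG_TABLE`; neither routine changes, and the same table can be shared by both sides of the interface.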

Available RINA applications

● Example applications:

○ rinaperf: multi-threaded client/server capable of parallel flow allocation, implementing basic connectivity and performance testing: ping, request-response, unidirectional bandwidth

○ rina-echo-async: single-threaded event-loop based client/server tool, capable of concurrent flow allocation and concurrent flow management

● Real applications:

○ nginx: RINA port of the popular Nginx server

○ dropbear: RINA port of the Dropbear ssh client/server

○ rina-gw: Event-loop application acting as an application gateway between a RINA network and an IP network

■ It forwards TCP connections over RINA flows and the other way around
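The forwarding done by a gateway of this kind reduces to relaying bytes between two descriptors. The sketch below is not rina-gw code: socket pairs stand in for both the TCP connection and the RINA flow (an assumption made so the example runs without a RINA stack), and `relay_once` is a hypothetical helper.

```python
import select
import socket


def relay_once(a, b, timeout_s=1.0, bufsize=4096):
    """Forward whatever is readable on either endpoint to the other one.

    Returns the number of bytes forwarded during this step.
    """
    forwarded = 0
    ready, _, _ = select.select([a, b], [], [], timeout_s)
    for src in ready:
        dst = b if src is a else a
        data = src.recv(bufsize)
        if data:
            dst.sendall(data)
            forwarded += len(data)
    return forwarded


# Stand-ins: (tcp_client <-> gateway) and (gateway <-> server reached over RINA).
tcp_client, gw_tcp_side = socket.socketpair()
gw_rina_side, rina_server = socket.socketpair()

tcp_client.sendall(b"GET / HTTP/1.0\r\n\r\n")
sent_up = relay_once(gw_tcp_side, gw_rina_side)    # request crosses the gateway
request = rina_server.recv(4096)

rina_server.sendall(b"HTTP/1.0 200 OK\r\n\r\n")
sent_down = relay_once(gw_tcp_side, gw_rina_side)  # response travels back
response = tcp_client.recv(4096)
```

A real gateway would call such a step from its event loop for every active pair of endpoints and would also handle connection close and partial writes.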

Demo

● RINA/TCP gateway, to make TCP/IP world interact with RINA world

● Minimally patched Nginx Web Server runs over RINA

[Figure: demo topology. TCP/IP network: web browsers on client host 1 and client host 2. Proxy host: rina-gw. RINA network: patched nginx on server host 1. Legend: TCP connection, RINA flow.]


[Figure: demo setup. VM A: patched nginx. VM B: rina-gw and browser. Labels: n.1.DIF (normal), Shim-eth (e.1.DIF), TCP.]