
Department of Computing Science
University of Glasgow

MASTERS RESEARCH THESIS
MSCI, 2004/2005

Network Router Resource Virtualisation

by

Ross McIlroy


Abstract

The Internet was created as a best effort service; however, there is now considerable interest in applications which transport time sensitive data across the network. This project investigates a novel network router architecture which could improve the Quality of Service guarantees that can be provided to such flows. This router architecture makes use of virtual machine techniques to assign an individual virtual routelet to each network flow requiring QoS guarantees. In order to evaluate the effectiveness of this virtual routelet architecture, a prototype router was created, and its performance was measured and compared with that of a standard router. The results suggest promise in the virtual routelet architecture; however, the overheads incurred by current virtual machine techniques prevent this architecture from outperforming a standard router. Evaluation of these results suggests a number of enhancements which could be made to virtual machine techniques in order to effectively support this virtual routelet architecture.


Acknowledgements

I would like to acknowledge the help and advice received from the following people during the course of this project.

My project supervisor, Professor Joe Sventek, for the invaluable advice, support and encouragement he gave throughout this project.

Dr Peter Dickman and Dr Colin Perkins for the help and advice which they freely provided whenever I had questions.

Jonathan Paisley for his invaluable assistance when setting up the complex network routing required by this project.

The Xen community for their help with any questions I had and their willingness to provide new unreleased features for the purposes of this project. Special thanks to Keir Fraser for his advice when I was creating the VIF credit limiter, and Stephan Diestelhorst for providing an unfinished version of the sedf scheduler.

The Click Modular Router community for help and advice when writing the Click port to the Linux 2.6 kernel, especially Francis Bogsanyi for his preliminary work and Eddie Kohler for his advice.

The technical support staff within the department, especially Mr Gary Gray, Mr Naveed Khan and Mr Stewart MacNeill from the support team, and all of the department technicians, for their help in building the experimental testbed and providing equipment for the project.

The other MSci students, especially Elin Olsen, Stephen Strowes and Christopher Bayliss, for answering various questions and for generally keeping me sane.

My parents for their support and their help proof reading this report.


Contents

1 Introduction
  1.1 Background
  1.2 Document Outline

2 Survey
  2.1 Router Technology
  2.2 Quality of Service Techniques
  2.3 Virtualisation Techniques
  2.4 Similar Work

3 Approach

4 Integration of Open Source Projects
  4.1 Overview of Open Source Projects
  4.2 Integration
    4.2.1 The Click Modular Router
    4.2.2 MPLS RSVP-TE daemon
    4.2.3 Summary

5 QoS Routelet
  5.1 Routelet Guest Operating System
    5.1.1 Linux Kernel
    5.1.2 Linux Distribution
    5.1.3 Routelet Startup Configuration
  5.2 Routelet Packet Processing
    5.2.1 Overall Design
    5.2.2 ProcessShim Element

6 Main Router
  6.1 Routelet Creation
  6.2 Assignment of QoS Network Flows to Routelets
    6.2.1 ConnectRoutelet Script
    6.2.2 ClickCom
    6.2.3 RSVP-TE Daemon Integration
  6.3 Packet Demultiplexing
    6.3.1 Overall Design
    6.3.2 Automatic Demultiplexing Architecture Creation
    6.3.3 MplsSwitch Element
    6.3.4 Returning Routelets to the Idle Pool
  6.4 Routelet Management
    6.4.1 CPU Time Allocation
    6.4.2 Network Transmission Rate Allocation

7 Experimental Testbed Setup
  7.1 Network Setup
  7.2 Network QoS Measurement Tools
    7.2.1 Throughput Measurement
    7.2.2 Per-Packet Timing Measurement
  7.3 Accurate Relative Time Measurements
  7.4 Network Traffic Generation
  7.5 Straining Router with Generated Traffic

8 Experimental Results and Analysis
  8.1 Virtual Machine Overhead
    8.1.1 Latency
    8.1.2 Interpacket Jitter
    8.1.3 Throughput
  8.2 Network Flow Partitioning
    8.2.1 Effect of Timer Interrupt Busy Wait
    8.2.2 Latency
    8.2.3 Interpacket Jitter
    8.2.4 Throughput
  8.3 Future Experiments

9 Evaluation
  9.1 Experimental Results Overview
  9.2 Virtualisation Overheads
    9.2.1 Linux OS for each Routelet
    9.2.2 Context Switch Overhead
    9.2.3 Routelet Access to NIC
    9.2.4 Classifying Packets
    9.2.5 Soft Real Time Scheduler

10 Conclusion
  10.1 Aims
  10.2 Achievements
  10.3 Further Work
  10.4 Summary

Bibliography


Chapter 1

Introduction

There is considerable interest in applications that transport isochronous data across a network, for example teleconferencing or Voice-over-IP applications. These applications require a network that provides Quality of Service (QoS) guarantees. While there are various networks that provide some form of QoS provisioning, there has been little investigation into the allocation and partitioning of a router's resources between flows, so that network flows can only access their own allocation of resources and cannot affect the QoS provided to other flows by the network.

This project is based on the hypothesis that virtual machine techniques can be used within network routers to provide effective partitioning between different network flows, and thus provide Quality of Service guarantees to individual network flows. The aim of this research project is to create a prototype experimental router which uses virtualisation techniques to provide QoS guarantees to network flows, in order to prove or disprove this hypothesis.

1.1 Background

The Internet was created as a best effort service. Packets are delivered as quickly and reliably as possible; however, there is no guarantee as to how long a packet will be in transit, or even whether it will be delivered. There is an increasing interest in the creation of networks which can provide Quality of Service (QoS) guarantees. Such networks would be able to effectively transport isochronous data, such as audio, video or teleconference data feeds, since each feed could have guaranteed bounds on latency and throughput.

A number of attempts have been made to tackle Quality of Service issues in computer networks. Various approaches have been investigated, including resource reservation protocols, packet scheduling algorithms, overlay networks and new network architectures. Networks have been built using these approaches; however, one of the problems they face is allocating a fixed proportion of the network router resources to each stream. For example, if a denial of service attack is attempted on one of these streams, the router could be overwhelmed and unable to provide the quality of service needed by the other streams. Some method is needed for partitioning the router resources among streams, so that traffic on one stream is minimally affected by the traffic of other streams.

One option is to have separate routers for each stream. This over-provisioning is obviously very expensive and inflexible; however, it is currently the standard commercial method of guaranteeing certain bounds on quality of service to customers. This research project investigates the feasibility of creating a single routing system which is built up of a number of virtual routelets. Each QoS flow is assigned its own individual virtual routelet in each physical router along the path it takes through the network.


These virtual routelets are scheduled by a virtual machine manager, and have a fixed resource allocation dependent upon the QoS required by the flow. This resource partitioning between routelets should provide a similar level of independence between flows as separate routers, but at a much reduced cost. This system should also allow for greater flexibility in the network, with dynamic flow creation and dynamic adjustment of flow QoS requirements.

Processor sharing queueing schemes, such as Weighted Fair Queueing (WFQ) [27], are often used to guarantee bounds on transmission latency and throughput for network flows. However, I believe that full router resource virtualisation would offer a number of advantages over WFQ. Firstly, WFQ is typically placed just before packet transmission; the packet has already been fully processed by the router before it reaches this stage. Therefore, although the router's outgoing transmission resources are fairly shared between network flows, the internal memory and processing resources of the router are not. Another advantage router virtualisation offers over WFQ is its ability to support smarter queueing within network flows. In WFQ, when a network flow is scheduled to transmit a packet, it simply transmits the next packet on its queue in FIFO order. A virtual routelet would have much more flexibility in choosing which packet to send, or which packets to drop. For example, if a routelet servicing an MPEG [16] stream became overloaded, it could choose to drop packets containing difference (P or B) frames rather than the key (I) frames, thus ensuring a gradual degradation in performance rather than a sharp drop-off.
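To make this contrast concrete, the sketch below shows the kind of frame-aware drop policy a routelet could apply, which a FIFO queue behind WFQ cannot. The FrameType field and the MpegAwareQueue class are purely illustrative; real code would have to recover frame types by inspecting packet payloads.

```cpp
#include <cstddef>
#include <deque>

// Illustrative frame classification; assumed to be done elsewhere.
enum class FrameType { I, P, B };

struct VideoPacket {
    FrameType type;
    // ...payload omitted...
};

class MpegAwareQueue {
public:
    explicit MpegAwareQueue(std::size_t capacity) : capacity_(capacity) {}

    // On overload, drop an already-queued difference frame (B first,
    // then P) in preference to rejecting an incoming key (I) frame,
    // so quality degrades gradually instead of losing whole groups
    // of pictures.
    bool enqueue(const VideoPacket& pkt) {
        if (queue_.size() < capacity_) { queue_.push_back(pkt); return true; }
        if (pkt.type == FrameType::I) {
            if (dropFirstOf(FrameType::B) || dropFirstOf(FrameType::P)) {
                queue_.push_back(pkt);
                return true;
            }
        }
        return false;  // incoming frame is droppable, or queue is all I frames
    }

private:
    bool dropFirstOf(FrameType t) {
        for (auto it = queue_.begin(); it != queue_.end(); ++it)
            if (it->type == t) { queue_.erase(it); return true; }
        return false;
    }

    std::deque<VideoPacket> queue_;
    std::size_t capacity_;
};
```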

1.2 Document Outline

The remainder of this dissertation is presented in the following format. Chapter 2 surveys relevant literature in this research area, and provides a broad background of the various aspects of this project. Chapter 3 outlines the overall approach taken during this project and gives a high level design of the prototype router which was created. Chapter 4 discusses some of the work which needed to be performed in order to integrate the various open source subsystems used by the prototype router. The design and implementation of the virtual QoS routelet is described in Chapter 5. Chapter 6 describes the modifications which were made to the main router in order to support the creation and assignment of routelets to network flows. In order to measure the virtual routelet prototype's performance, an experimental testbed was set up. The creation of this testbed is described in Chapter 7, and the results and analysis of the experiments are presented in Chapter 8. These results are evaluated in more detail, with regard to the virtual routelet architecture, in Chapter 9. Finally, the conclusions drawn from this research are presented in Chapter 10.


Chapter 2

Survey

The aim of this project is to create a router which can allocate a fixed proportion of its resources to each QoS flow it handles. To do this, it is important to first understand how a standard router works. Section 2.1 discusses papers involving both hardware and software routers, which were investigated to understand routing techniques.

There have been a number of efforts which have approached the issue of Quality of Service in the Internet. These efforts tackle different areas required by truly QoS-aware networks, such as flow specification, routing, resource reservation, admission control and packet scheduling. As part of this project, I have investigated efforts in a number of these areas, and describe my findings in Section 2.2.

This project will make use of virtualisation techniques to partition the individual routelets from each other. A number of virtualisation systems have been created, each with its own aims and objectives. Section 2.3 discusses the virtualisation techniques and systems which were investigated as part of this project.

Some efforts are strongly related to this project, as they attempt to provide fully QoS-aware networks with resource partitioning. Section 2.4 discusses the techniques used by these networks, and explains how their approach differs from that taken by this project.

2.1 Router Technology

The first stage in creating a router which provides QoS guarantees is to understand how a standard router operates. There are a number of papers which describe the internal workings of modern IP routers (see [22] or [29] for details). A router's primary function is to forward packets between separate networks. Routers, or collections of routers, form the core of an internetwork. Each router must understand the protocols of its connected networks and translate packet formats between these protocols. Packets are forwarded according to the destination address in the network layer (e.g. the IP destination). Each router determines where packets should be forwarded by comparing this destination address with entries in a routing table. However, new services are calling upon routers to do much more than simply forward packets. For example, multicast, voice, video and QoS functions call upon the router to analyse additional information within the packets to determine how they should be handled.

A router consists of three main components: network line cards, which physically connect the router to a variety of different networks; a routing processor, which runs routing protocols and builds the routing tables; and a backplane, which connects the line cards and the routing processor together.


The routing process can be seen more clearly by describing the stages involved in the processing of a packet by a typical IP router. When a packet arrives from a network, the incoming network interface card first processes the packet according to the data-link layer protocol. The data-link layer header is stripped off; the IP header is examined to determine if it is valid; and the destination address is identified. The router then performs a longest match lookup (explained in the following paragraph) in the routing table for this destination address. Once a match has been found, the packet's time to live field is decremented, a new header checksum is calculated, and the packet is queued for the appropriate network interface. This interface card then encapsulates the IP packet in an appropriate data-link-level frame, and transmits the packet.

No single router can store a complete routing table for the whole internetwork. However, IP addresses have a form of geographical hierarchy, where nodes on the same network have a similar class of IP address. This allows entries in the routing table to aggregate entries for similar addresses that exit the router on the same interface card. In other words, routers know in detail the addresses of nearby networks, but only know roughly where to send packets for distant networks. The routing table stores address prefix/mask pairs. The address prefix is used to match the entry to the destination address, and the mask specifies which bits of the address prefix are significant. A longest match lookup finds the entry that has the most bits in common with the destination address. For example, if the address 120.33.195.2 is matched against 120.0.0.0/8, 120.33.1.4/16, and 120.33.225.0/18 in the routing table, the entry 120.33.225.0/18 will be chosen.
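The example above can be made concrete with a short sketch of longest match lookup. The RouteEntry structure and the linear scan are purely illustrative (production routers use trie-based or hardware lookup structures), but the masking logic is exactly the comparison described.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative routing table entry: a prefix and its significant-bit count.
struct RouteEntry {
    uint32_t prefix;   // the address prefix, packed into a 32-bit integer
    int      maskBits; // the "/8", "/16", "/18" part
};

// Return the index of the longest matching entry, or -1 if none match.
int longestMatch(const std::vector<RouteEntry>& table, uint32_t dest) {
    int best = -1, bestBits = -1;
    for (std::size_t i = 0; i < table.size(); ++i) {
        uint32_t mask = table[i].maskBits == 0
                        ? 0 : ~uint32_t(0) << (32 - table[i].maskBits);
        if ((dest & mask) == (table[i].prefix & mask) &&
            table[i].maskBits > bestBits) {
            best = int(i);
            bestBits = table[i].maskBits;
        }
    }
    return best;
}

// Build an address from dotted-quad octets.
constexpr uint32_t ip(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    return (a << 24) | (b << 16) | (c << 8) | d;
}

int main() {
    std::vector<RouteEntry> table = {
        { ip(120, 0, 0, 0),    8  },
        { ip(120, 33, 1, 4),   16 },
        { ip(120, 33, 225, 0), 18 },
    };
    int idx = longestMatch(table, ip(120, 33, 195, 2));
    std::printf("matched entry %d\n", idx);  // prints 2: the /18 entry wins
}
```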

The hardware architecture used in routing systems has evolved over time. Initially, the architecture was based upon that of a typical workstation. Network line cards were connected together with a CPU using a shared bus (e.g. ISA or PCI), and all routing decisions and processing were done by the central CPU. Figure 2.1 shows the architecture of this type of router. The problem with this approach was the single shared bus. Every packet processed has to traverse this bus twice: once from the incoming network line card to the CPU, and once from the CPU to the outgoing line card. Since this is a shared bus, only one packet can traverse it at any time. This restricts the overall number of packets which can be processed by a single router to the bandwidth of the shared bus, which is often lower than the sum of the network cards' bandwidths.

Figure 2.1: The architecture of a first generation, “Shared Bus” router.

The use of switching fabrics, instead of shared busses, as a backplane increased the amount of network traffic a router could cope with, by allowing packets to traverse the backplane in parallel (Figure 2.2). The bandwidth of the backplane is no longer shared among the network interface cards. A switching fabric can be implemented in a number of ways; for example, a crossbar switch architecture could be used, where each line card has a point-to-point connection to every other line card. Another option is to use shared memory, where each line card can transfer information by writing to and reading from the same area of memory.

Figure 2.2: The architecture of a router which uses a switching fabric as its backplane.

The next component which comes under stress from increased network traffic is the shared CPU. A single CPU simply cannot cope with the amount of processing which is needed to support multi-gigabit, and therefore multi-mega-packet, flows. An approach used by many modern routers is to move as much of the packet processing to the line cards as possible. A forwarding engine is placed on each of the network line cards (or, in the case of [29], near the NICs). A portion of the network routing table is cached in each forwarding engine, allowing the forwarding engines to forward packets independently. A central processor is still used to control the overall routing, build and distribute the routing tables, and handle packets with unusual options or addresses outside the forwarding engine's cached routing table.

A number of other improvements have been made to router technology, including improved lookup algorithms, advanced buffering techniques and sophisticated scheduling algorithms. The major developments in router hardware, however, have been a movement towards expensive custom hardware, working in parallel, to provide the performance required for multi-gigabit networks.

Software routers run on commodity hardware, and allow a normal PC to perform network routing. Software routers do not have the performance or packet throughput of custom-built hardware routers, but they are considerably cheaper and provide more flexibility. They are also an ideal tool for router research, as new protocols, algorithms or scheduling policies can be implemented by simply changing the code, rather than building or modifying hardware components. This project will make use of a software router for these reasons.

The Zebra platform (http://www.zebra.org) is a popular open source software router. It has initiated a number of spinoff projects, such as Quagga (http://www.quagga.net) and ZebOS (http://www.ipinfusion.com/products/advanced/products advanced.html), a commercial version of Zebra (all collectively referred to as Zebra hereafter). Zebra runs on a number of UNIX-like operating systems (e.g. Linux, FreeBSD, Solaris) and provides similar routing functionality to commercial routers. The Zebra platform is modular in design. It consists of multiple processes, each of which deals with a different routing protocol. For example, a typical Zebra router will consist of a routing daemon for the Open Shortest Path First (OSPF) protocol, and another for the Border Gateway Protocol (BGP). These work together to create an overall routing table, under the control of a master Zebra daemon. This architecture allows protocols to be added at any time without affecting the whole system.

There have been a number of studies which have compared the features and performance of Zebra to standard commercial hardware routers. Fatoohi et al [14] compared the functionality and usability of the Zebra router with Cisco routers. They found that Zebra provides many of the features found in Cisco routers, with a similar user interface. They do not, however, compare the relative performance of the two router types. Ramanath [32] investigates Zebra's implementation of the BGP and OSPF protocols, but concentrates on the experimental set-up, with little post-experimental analysis.

One problem with Zebra is that the open source implementations do not provide MPLS (Multiprotocol Label Switching) support. MPLS routing support is a requirement of this project, so another software router is required. NIST Switch [7] is a research routing platform, specifically created for experimentation in novel approaches to QoS guaranteed routing. It is based on two fundamental QoS concepts: local packet labels, which define the queueing characteristics of packets, and traffic shaping measures applied to overall packet routes. NIST Switch supports MPLS, and uses these MPLS labels to define QoS characteristics and route packets. It also uses RSVP, with Traffic Engineering (TE) extensions, to signal QoS requests and distribute labels. This use of RSVP was also an important requirement in this project, as RSVP will be used to signal router resource requests. MPLS and RSVP are explained in more detail in Section 2.2, and their use in this project is described in Chapter 3 and Chapter 6.

NIST Switch only runs on the FreeBSD operating system. Lai et al [20] compared Linux, FreeBSD and Solaris running on the x86 architecture. They found that FreeBSD has the highest network performance, making it an ideal operating system on which to base a router. However, the difference in network performance cited by Lai et al was not significant enough to make a major difference to the performance of the router created in this project. Linux has also been ported to the para-virtualised virtual machine monitors required as part of this project, whereas FreeBSD has not. It was, therefore, decided that a Linux software router was required. Van Heuven et al [18] describe an MPLS-capable Linux router, created by merging a number of existing projects, using the NIST Switch router as a base.

A number of modular routers were also investigated as a means of providing additional functionality in a modular fashion. The PromethOS router architecture, described in [19] and based upon the work in [12], provides a modular router framework which can be dynamically extended using router plugins. The PromethOS framework is based upon the concept of selecting the implementation of the processing components within the router based upon the network flow of the packet currently being processed. When a plugin is loaded, it specifies which packets it will deal with by providing a packet filter. When a packet arrives at a PromethOS router, it traverses the network stack in the usual way. However, as it traverses the network stack, it comes across gates: points where the flow of execution can branch off to plugin instances. The packet is compared against the filters specified by the plugin instances, and then sent to the correct plugin. New plugins can be created, and can be inserted at various points along the network stack, using these gates. This allows a variety of different functions to be added to the router, such as new routing algorithms, different queueing techniques or enhancements to existing protocols. However, it still uses the standard IP network stack. This means that it cannot be used to investigate other network protocols.

The Click Modular Router [24] takes a different approach. Click is not a router itself; instead, it is a software framework which allows the creation of routers. A Click router consists of a number of modules, called elements. These elements are fine-grained components which perform simple operations, such as packet queueing or packet classification. New elements can easily be built by writing a C++ class which corresponds to a certain interface. Elements are plugged together using a simple language to create a certain router configuration. The same elements can be used to create different routers, by using this simple language to rearrange the way in which they are organised. These router configurations are compiled down to machine code, which runs within the Linux kernel to provide a fast, efficient router. The approach taken by the Click router provides a flexible method of creating new routers easily.
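To give a flavour of the element interface described above, the sketch below shows a minimal pass-through Click element that counts packets. It is indicative only: the method names follow the Click element interface described in [24], but exact signatures and required macros vary between Click releases, and the code compiles only against a Click source tree.

```cpp
// Sketch of a minimal Click element; indicative of the interface only,
// as signatures vary between Click releases.
#include <click/config.h>
#include <click/element.hh>
CLICK_DECLS

class CountShim : public Element {
public:
    CountShim() : _count(0) {}

    const char *class_name() const { return "CountShim"; }
    const char *port_count() const { return "1/1"; }  // one input, one output

    // Called once per packet; returning the packet passes it downstream.
    Packet *simple_action(Packet *p) {
        _count++;
        return p;
    }

private:
    unsigned _count;
};

CLICK_ENDDECLS
EXPORT_ELEMENT(CountShim)
```

Such an element would then be composed with others in a configuration file, for example FromDevice(eth0) -> CountShim -> Queue -> ToDevice(eth1), and the same element can be reused in any number of different configurations.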

2.2 Quality of Service Techniques

There are a number of issues which need to be addressed when creating a QoS-aware network. Firstly, the overall routing algorithm needs to take QoS into account, dealing with low quality links or failures in network nodes. QoS flows also require a mechanism for reservation of routing resources on a network. For a flow to reserve the resources it requires, there needs to be some universal description of the flow's requirements: a flow specification. As network resources are finite, some form of admission control is required to prevent network resources from becoming overloaded. Finally, the packet scheduling algorithms used by the network routers need to be aware of the QoS requirements of each flow, so that bounds on latency and throughput can be guaranteed.

Multiprotocol label switching (MPLS), described in [9] and [39], is a network routing protocol which uses labels to forward packets. This approach attempts to provide the performance and capabilities of layer 2 switching, while providing the scalability and heterogeneity of layer 3 routing. It is, therefore, often referred to as layer 2.5 routing. These labels also allow an MPLS network to take QoS issues into account while routing packets by, for example, engineering traffic.

In conventional (layer 3) routing, a packet's header is analysed by each router. This information is used to find an entry in a routing table, which determines the packet's next hop. In MPLS routing, this analysis is performed only once, at an ingress router. A Label Edge Router (LER), placed at the edge of the MPLS domain, analyses incoming packets, and maps these packets to an unstructured, fixed-length label. Many different headers can map to the same label, as long as these packets all leave the MPLS domain at the same point. Labels, therefore, define forwarding equivalence classes (FECs). An FEC defines not only where a packet should be routed, but also how it should be routed across the MPLS domain.

This labelling process should not interfere with the layer 3 header, or be affected by the layer 2 technology used. The label is therefore added as a shim between the layer 2 and layer 3 headers. This shim is added by the ingress LER as a packet enters the MPLS domain, and removed by the egress LER as it leaves the MPLS domain.

Once a packet has entered the MPLS domain, and has been labelled by the LER, it traverses the domain along a label-switched path (LSP). An LSP is a series of labels along the path from the source to the destination. Packets traverse these LSPs by passing through label switching routers (LSRs). An LSR analyses an incoming packet's label, decides upon the next hop which that packet should take, and changes the label to the one used by the next hop for that packet's FEC. The LSR can process high packet volumes because a label is a fixed-length, unstructured value, so a label lookup is fast and simple.
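The following sketch illustrates why this lookup is cheap: because a label is a small, fixed-length integer, forwarding reduces to a direct table index and a label rewrite. The types are illustrative and not drawn from any particular MPLS implementation.

```cpp
#include <cstdint>
#include <vector>

// One row of an LSR's label forwarding table. Because labels are
// fixed-length, unstructured values, the incoming label can index
// the table directly; no longest-prefix match is needed.
struct LabelEntry {
    bool     valid    = false;
    uint32_t outLabel = 0;  // label the next hop uses for this FEC
    int      outPort  = 0;  // interface leading to the next hop
};

struct LabelTable {
    std::vector<LabelEntry> entries;

    explicit LabelTable(std::size_t size) : entries(size) {}

    // Look up the incoming label, rewrite it to the next hop's label,
    // and report which port to transmit on. Returns false for unknown
    // labels (such packets would be dropped or punted to the control
    // plane).
    bool switchLabel(uint32_t& label, int& port) const {
        if (label >= entries.size() || !entries[label].valid)
            return false;
        port  = entries[label].outPort;
        label = entries[label].outLabel;  // the label swap
        return true;
    }
};
```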

Each LSR (and LER) assigns a different label to the same FEC. This means that globally unique labels are not required for each FEC, as long as labels are locally unique. Each LSR needs to know the labels that neighbouring LSRs use for each FEC. There are a number of ways in which MPLS can distribute this information. The Label Distribution Protocol was designed specifically for this purpose; however, it is not commonly used in real networks. Existing routing protocols, such as the Open Shortest Path First (OSPF) or Border Gateway Protocol (BGP), have been enhanced so that label information can be “piggybacked” on the existing routing information. This means that traffic going from a source to a destination will always follow the same route. Another option is to use RSVP with traffic engineering extensions (RSVP-TE) to explicitly engineer routes through the MPLS domain. This allows different routes to be taken by traffic travelling to the same destination, depending on its QoS class. For example, a low latency route could be chosen for voice traffic, while best effort traffic going to the same destination travels through a more reliable, but slower, route.

The Resource ReSerVation Protocol (RSVP) [45] is a protocol which can be used to reserve network resources needed by a packet flow. It reserves resources along simplex (i.e. single-direction) routes. One novel feature of this protocol is that it is receiver-oriented: the receiver, rather than the sender, initiates the resource reservation and decides on the resources required. This allows heterogeneous data flows to be accommodated; for example, lower video quality for modem users than for broadband users.

An RSVP session is initiated by a PATH message, sent from the sender to the receiver. This PATH message traverses the network, setting up a route between the sender and receiver in a hop-by-hop manner. The PATH message contains information about the type of traffic the sender expects to generate. Routers on the path can modify this data to specify their capabilities. When the receiver application receives the PATH message, it examines this data and generates a RESV message, which specifies the resources required by the data flow. This RESV message traverses the network in the reverse of the route taken by the PATH message, informing routers along this path of the specified data flow. Each router uses an admission control protocol to decide whether it has sufficient resources to service the flow. If a router cannot service a flow, it will deny the RSVP flow and inform the sender.

The RSVP protocol is independent of the admission protocol used, so each router can make whatever decision it wishes about the flows it can support. RSVP is also independent of the specification used to describe the flow (the flowspec). As long as the sender, receiver and routers agree on the format, any information can be transmitted in the flowspec to inform routers of the resources required by a flow. RSVP is also unaffected by the routing protocol used. PATH messages can be explicitly routed, as in RSVP-TE, and the RESV message follows this route in reverse, thus reserving resources along the explicitly routed path.

RSVP can support dynamic changes in path routes, as well as dynamic changes in the data-flow characteristics of paths. It does this by maintaining soft, rather than hard, state for a reserved path in the network. Data sources periodically send PATH messages to maintain or update the path state in a network. Meanwhile, data receivers periodically send RESV messages to maintain or update the reservation state in the network. If a network router has not received PATH or RESV messages from an RSVP flow for some time, the state of that flow is removed from the router. This makes RSVP robust to network failures.
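The sketch below illustrates this soft-state discipline. The SoftStateTable class, the string flow key and the fixed lifetime are hypothetical simplifications; a real RSVP daemon keys state on session and filter specifications, and derives state lifetimes from the refresh periods carried in PATH and RESV messages.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical per-flow reservation state held by a router.
struct Reservation {
    Clock::time_point lastRefresh;
    // ...flowspec, reserved bandwidth, etc. would live here...
};

class SoftStateTable {
public:
    // Called whenever a PATH or RESV message for this flow arrives.
    void refresh(const std::string& flowId) {
        table_[flowId].lastRefresh = Clock::now();
    }

    // Periodic sweep: reservations that have not been refreshed recently
    // simply disappear. No explicit teardown is needed when a sender or
    // a route vanishes, which is what makes the protocol robust to
    // network failures.
    void expireStale(std::chrono::seconds lifetime) {
        auto now = Clock::now();
        for (auto it = table_.begin(); it != table_.end(); ) {
            if (now - it->second.lastRefresh > lifetime)
                it = table_.erase(it);
            else
                ++it;
        }
    }

private:
    std::unordered_map<std::string, Reservation> table_;
};
```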

RSVP was created specifically to support the Quality of Service model called Integrated Services ([4] and [43]). The Integrated Services approach attempts to provide real time QoS guarantees for every flow which requires them. This requires every router to store the state of the resources requested by every flow, which could involve a huge amount of data. RSVP addresses this somewhat by having a number of multicast reservation types (e.g. no-filter, fixed-filter and dynamic-filter) which allow individual switches to decide how resource reservation requests can be aggregated. However, the state required for an Integrated Services router, especially in the core of the network, is still formidable.

The Differentiated Services, or Diffserv, model ([3] and [46]) takes a different approach. Each packet carries with it information on how it should be treated. For example, a Type of Service (TOS) byte, carried in the packet header, could specify whether the packet requires low-delay, high-throughput or reliable transport. The problem with this approach is that only simple requirements can be specified (e.g. low latency, rather than latency less than 10 ms), so that each packet does not incur excessive overheads carrying its QoS requirements. There is also no way to perform admission control, so it is impossible to guarantee the QoS of individual flows. All that can be guaranteed is that certain classes of traffic will be treated better than other classes.

MPLS is in the middle ground between the two extremes of Integrated Services and Diffserv. It provides the capability to set up flows of traffic with certain requirements, much like Integrated Services; however, Forwarding Equivalence Classes can be used to aggregate common flows together. Also, the use of unstructured labels greatly simplifies the filtering of traffic into flows, compared with the filtering on source and destination address (as well as other possible IP header fields) used by Integrated Services.

This is a simplified view of Internet QoS techniques, with specific emphasis on those aspects of the technologies which will be used in this project. A more detailed survey is provided by Xiao et al [44].

2.3 Virtualisation Techniques

This project will use virtualisation techniques to partition resources between virtual routelets. Virtualisation was developed by IBM in the mid-1960s, specifically for mainframe computers. Currently, there are a number of virtualisation systems available, each of which addresses different issues. Hardware constraints require that the router created by this project run on standard, commodity x86 hardware. This survey, therefore, concentrates on x86 virtualisation techniques, as opposed to more general virtualisation techniques, since the x86 architecture is an inherently difficult platform to virtualise, compared with other processor architectures.

Virtualisation software allows a single machine to run multiple independent guest operating systems concurrently, in separate virtual machines. These virtual machines provide the illusion of an isolated physical machine for each of the guest operating systems. A Virtual Machine Monitor (VMM) takes complete control of the physical machine's hardware, and controls the virtual machines' access to it.

VMware [38] and other full-virtualisation systems attempt to fully emulate the appearance of the underlying hardware architecture in each of the virtual machines. This allows unmodified operating systems to run within a virtual machine, but presents some problems. For example, operating systems typically expect unfettered, privileged access to the underlying hardware, as it is usually their job to control this hardware. However, guest OSs within a virtualised system cannot have privileged mode access, otherwise they could interfere with other guest OS instances by, for example, modifying their virtual memory page tables. The VMM, on the other hand, controls the underlying physical hardware in a virtualised machine, and therefore has privileged mode CPU access. The guest OSs must be given the appearance of having access to privileged CPU instructions, without being able to harm other OSs.

Fully virtualised systems, such as VMware, typically approach this problem by allowing guest OSs to call these privileged instructions, but having them handled by the VMM when they are called. The problem is that the x86 architecture fails silently when a privileged instruction is called within a non-privileged environment, rather than providing a trap to a handling routine. This means that VMware has to dynamically rewrite portions of the guest OS kernel at runtime, inserting calls to the VMM where privileged instructions are needed. VMware also keeps shadow copies of each guest OS's virtual page tables, to ensure that guest OS instances do not overwrite each other's memory. However, this requires a trap to the VMM for every update of the page table. These requirements mean that an operating system running within a fully virtualised system will have severely reduced performance, compared to the same operating system running on the same physical hardware without the virtualisation layer.

A number of projects have attempted to reduce this performance overhead, at the cost of not providing full physical architecture emulation. Para-virtualisation is an approach where the virtual machines do not attempt to fully emulate the underlying hardware architecture of the machine being virtualised. Instead, an idealised interface is presented to the guest operating systems, which is similar, but not identical, to that of the underlying architecture. This interface does not include the privileged instructions which cause problems in fully virtualised systems, replacing them with calls which transfer control to the VMM (sometimes known as hypercalls). This change in interface means that existing operating systems need to be ported to this new interface before they will run as guest operating systems. However, the similarity between this interface and the underlying architecture means that the application binary interface (ABI) remains unchanged, so application programs can run unmodified on the guest OSs. The idealised interface substantially reduces the performance overhead of virtualisation, and also provides the guest OSs with more information. For example, information on both real and virtual time allows better handling of time-sensitive operations, such as TCP timeouts.

The Denali Isolation Kernel ([41] and [42]) uses these para-virtualisation techniques to provide an architecture which allows a large number of untrusted Internet services to be run on a single machine. The architecture provides isolation kernels for each untrusted Internet service. These isolation kernels are small guest operating systems which run in separate virtual machines, providing partitioning of resources and achieving safe isolation between untrusted application code. The use of simple guest operating systems, and the simplified virtual machine interface, are designed to support Denali's aim of scaling to more than 10,000 virtual machines on a single commodity machine.

The problem with using Denali, however, is that the interface has been oversimplified to support the scaling needed to run tens of thousands of virtual machines. A number of omissions have been made in the application binary interface, compared with the x86 architecture; for example, memory segmentation is not supported. These features are expected by standard x86 operating systems and applications. It would therefore be very difficult to port standard operating systems (such as Linux or FreeBSD) to Denali, and even if they were ported, many applications would also have to be rewritten to run on Denali. The Denali project has therefore provided a simple guest OS called Ilwaco, which runs under a Denali virtual machine. Ilwaco is simply implemented as a library, much like an Exokernel libOS, with applications linking against the OS. There is no hardware protection boundary in Ilwaco, so essentially each virtual machine hosts a single-user, single-application, unprotected operating system. Since Ilwaco was created specifically for Denali, and is so simple, there is no routing software available for it. Therefore, if Denali were used as the virtualisation system in this project, a software router would need to be written from scratch.

Xen [2] is an x86 virtual machine monitor that uses para-virtualisation techniques to reduce the virtualisation overhead. Although it uses an idealised interface like Denali, this interface was specifically created to allow straightforward porting of standard operating systems, and to provide an unmodified application binary interface. A number of common operating systems, such as Linux and NetBSD, have been ported to the Xen platform, and because of the unmodified ABI, unmodified applications run on these ports.

The Xen platform consists of a Xen VMM (virtual machine monitor) layer above the physical hardware. This layer provides virtual hardware interfaces to a number of domains. These domains are effectively virtual machines running the ported guest OSs; however, the guest OSs, and their device drivers, are aware that they are running on Xen. The Xen VMM is designed to be as simple as possible, so although it must be involved in data-path aspects (such as CPU scheduling between domains, data block access control, etc.), it does not need to be aware of higher level issues, such as how the CPU should be scheduled. For this reason, the policy (as opposed to the mechanism) is separated from the VMM, and run in a special domain, called Domain 0. Domain 0 is given special privileges. For example, it has the ability to start and stop other domains, and is responsible for building the memory structure and initial register state of a new domain. This significantly reduces the complexity of the VMM, and prevents the need for additional bootstrapping code in ported guest OSs.

There are two ways in which the Xen VMM and the overlying domains can communicate: synchronous calls from a domain to Xen can be made using hypercalls, and asynchronous notifications can be sent from Xen to an individual domain using events. Hypercalls made from a domain perform a synchronous trap to the Xen VMM, and are used to perform privileged operations. For example, a guest OS could perform a hypercall to request a set of page table updates. The Xen VMM validates these updates, to make sure the guest OS is allowed to access the memory requested, and then performs the requested update. These hypercalls are analogous to system calls in conventional operating systems. So that each hypercall does not involve a change in address space, with the associated TLB flush and page misses this would entail, the first 64MB of each domain's virtual memory is mapped onto the Xen VMM code.

Asynchronous events allow Xen to communicate with individual domains. This is a lightweight notification system which replaces the device interrupt mechanism used by traditional OSs to receive notifications from hardware. Device interrupts are caught by the Xen VMM, which then performs the minimum amount of work necessary to buffer any data and determine the specific domain that should be informed. This domain is then informed using an asynchronous event. The domain is not scheduled immediately, but is informed of the event when it is next scheduled by the VMM. At this time, the device driver on the guest OS performs any necessary processing. This approach limits the crosstalk between domains, i.e. a domain will not have its scheduled time eaten into by another domain servicing device interrupts.

Xen provides a “Safe Hardware Interface” for virtual machine device I/O. This interface is discussed in detail by Fraser et al [15]. Domains can communicate with devices in two ways: they can either use a legacy interface, or an idealised, unified device interface.

The legacy interface allows domains to use legacy device drivers. Domains using this interface, however, cannot share devices, which enforces isolation between domains. If a legacy device driver crashes, it will affect the domain it is running in, but not any other domains.

The unified device interface provides an idealised hardware interface for each class of device. This is intended to reduce the cost of porting device drivers to this safe hardware interface. The benefit of the safe hardware interface is that it allows sharing of devices between domains. To enforce isolation between domains, even when they share a device, device drivers can be run within Isolated Driver Domains (IDDs). These IDDs are effectively isolated virtual machines, loaded with the appropriate device driver. If a device driver crashes, the IDD will be affected, but the domains which share that driver will not crash. In fact, Xen can detect driver failures and restart the affected IDD, providing minimal disturbance to domains using that device.

Guest operating systems communicate with drivers running within IDDs using device channels. These channels use shared memory descriptor rings to transfer data between the IDD and domains (see Figure 2.3). Two pairs of producer/consumer indices are placed around the ring: one for data transfer from the domain to the IDD, and the other for data transfer from the IDD to the domain. The use of shared memory descriptor rings avoids the need for an extra data copy between the IDD and domain address spaces.
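The sketch below illustrates one direction of such a device channel: a fixed-size descriptor ring with free-running producer and consumer indices. It is a simplification of Xen's actual I/O rings, which pair requests with responses and use grant tables for the buffers themselves; only the index discipline is shown, and the type names are illustrative.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// A descriptor points at a payload buffer that is already shared
// between the two sides of the channel.
struct Descriptor {
    uint64_t bufferAddr;
    uint32_t length;
};

template <std::size_t N>  // N must be a power of two
struct DescriptorRing {
    Descriptor            slots[N];
    std::atomic<uint32_t> producer{0};  // advanced by the writing side
    std::atomic<uint32_t> consumer{0};  // advanced by the reading side

    bool push(const Descriptor& d) {
        uint32_t p = producer.load(std::memory_order_relaxed);
        if (p - consumer.load(std::memory_order_acquire) == N)
            return false;                    // ring is full
        slots[p % N] = d;
        producer.store(p + 1, std::memory_order_release);
        return true;                         // peer is notified via an event
    }

    bool pop(Descriptor& d) {
        uint32_t c = consumer.load(std::memory_order_relaxed);
        if (producer.load(std::memory_order_acquire) == c)
            return false;                    // ring is empty
        d = slots[c % N];
        consumer.store(c + 1, std::memory_order_release);
        return true;
    }
};
```

Because the descriptors point at buffers already shared between the two sides, moving a packet across the channel costs an index update rather than a data copy, which is the property the text above relies on.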

Although the use of these descriptor rings is a suitable approach for low latency devices, it does not scale to high bandwidth devices. DMA-capable devices can transfer data directly into memory, but if a device is shared, and the demultiplexing between domains is not performed in hardware, then there is no way of knowing into which domain's memory the data should be transferred. In this case, the data is transferred to some memory controlled by Xen. Once the I/O has been demultiplexed, Xen knows to which domain the data should be transferred, but the data is in an address space which is not controlled by that domain. The data is mapped into the correct domain's address space using a page previously granted by this domain (i.e. the domain's page table is updated so that the domain's granted virtual page now points to the physical memory which contains the data to be transferred). This page table update avoids any additional data copying due to the virtualisation layer.

Figure 2.3: The I/O ring used to transfer data between Xen and guest OSs.

The use of these techniques means that Xen has a very low overhead compared with other virtualisation technologies. The performance of Xen was investigated in [2], which, in a number of benchmarks, found that the average virtualisation overhead of Xen was less than 5%, compared with the standard Linux operating system. This contrasts with overheads of up to 90% in other virtualisation systems such as VMware and User-Mode Linux. These results have been repeated by a team of research students in [11].

2.4 Similar Work

There have been a number of efforts which have attempted to produce QoS-aware networks with some form of resource partitioning. Many of these projects use programmable networks. Programmable networks [5] can be used to rapidly create, manage and deploy network services, based on the demands of users. There are many different levels of network programmability, directed by two main trends: the Open Signalling approach and the Active Networks approach. The Open Signalling approach argues for network switches and routers to be opened up with a set of open programming interfaces. This approach is similar to that of telecommunication services, with services being set up, used, then pulled down. The Active Network community supports a more dynamic deployment of services. So-called “active packets” are transported by active networks; at one extreme, these packets could contain code to be executed by switches and routers as the packet traverses the network.

A number of research projects have created programmable networks in an attempt to support differing objectives. Many of these programmable networks have been created in an attempt to ease network management; however, some have had the objective of creating Quality of Service aware networks. The Darwin project [8] is an attempt to create a set of customisable resource management mechanisms that can support value-added services, such as video and voice data streams, which have specific QoS requirements. This project is focused on creating a network resource hierarchy, which can be used to provide various levels of network service to different data streams, depending on their requirements. The architecture contains two main components: Xena and Beagle. Xena is a service broker, which allows components to discover resources and identify the resources needed to meet an application's requirements. Beagle is a signalling protocol, used to allocate resources by contacting the owners of those resources.

The Darwin architecture has been implemented for routers running FreeBSD and NetBSD. However, this project's focus is on providing a middleware environment for value-added network services, not on the underlying mechanisms needed to guarantee quality of service flows within a network. As such, the prototype implementation of the Darwin router simplifies the aspect of resource partitioning between QoS flows. Routing resources are managed by delegates, which run within a Java Virtual Machine, with resource partitioning managed using static priorities. This does not provide the level of performance, or the real-time guarantees, required for a truly QoS-aware router.

Members of the DARPA active networking program have developed a router operating system called NodeOS. This provides a node kernel interface for all routers within an active network. The NodeOS interface defines a domain as an abstraction which supports accounting and scheduling of the resources needed for a particular flow. Each domain contains the following set of resources: an input channel, an output channel, a memory pool and a thread pool. When traffic from a certain flow enters the router, it consumes allocated network bandwidth from its domain's input and output channels. CPU cycles and memory usage are also charged to the domain's thread and memory pools as the packet is processed by the router. This framework allows resources to be allocated to QoS flows as required; however, the NodeOS platform is simply a kernel interface, and does not provide details on how the resources used by domains should be partitioned.

The Genesis Kernel [6] is another network node operating system which supports active networks. This kernel supports spawning networks, which automate the creation, deployment and management of network architectures “on-the-fly”. The Genesis Kernel supports the automatic spawning of routelets. Virtual overlay networks can be set up over the top of the physical network, and routelets are spawned to process the traffic from these virtual networks. Further child virtual networks can be spawned above these virtual networks. These child networks automatically inherit the architecture of their parent networks, thus creating a hierarchy of function from parent to child networks.

This architecture is created with the intention of automating network management. It supports a virtual network life cycle, through the stages of network profiling to capture the “blueprint” of a virtual network architecture, the spawning of that virtual network, and finally the management needed to support the network. It does not, however, deal with resource management, and so does not provide service guarantees to QoS flows.

The Sago project [10] has a similar aim to this project: to virtualise the resources of a network in order to provide QoS guarantees for individual flows. The Sago platform uses virtual overlay networks to provide protection between separate networks which use the same physical hardware. It consists of two main components: a global resource manager (GRM) and a local packet processing engine (LPPE). The GRM oversees the allocation of nodes and links on the underlying network. An LPPE is placed at each network node, and performs the actual resource management. The GRM uses a separate administrative network to signal the creation of virtual overlay networks. These virtual overlay networks have bandwidth and delay characteristics, which can be used to provide end-to-end QoS guarantees.

The Sago platform, however, requires significant extra complexity compared with standard networks. For example, two physical networks are needed: one for data transmission, and an entirely separate network to deal with control signals. This extra complexity increases the overall cost of the network and the possibility of failures. Also, although Sago attempts to provide end-to-end QoS guarantees, it does not deal specifically with the partitioning of resources within a network router.

Although there are many efforts which have attempted to provide QoS guarantees to network flows, very few have specifically dealt with the partitioning of resources among QoS flows. Of the efforts which did deal with router resources, none could provide strict guarantees on the router resources available to network flows.


Chapter 3

Approach

This chapter describes the overall approach taken during this research project. Over the course of the project, the original approach changed as problems were encountered and details were investigated further; any major changes from the original approach are also described here.

The goal of this project is to produce an experimental router which demonstrates the feasibility of using virtual machine techniques to partition router resources between independent flows, thus providing guarantees on each flow's quality of service. It is not necessary that this router exhibit commercial speed or robustness, however, it should be sufficiently usable, such that the partitioning between flows and the service provided to those flows can be investigated. To this end, my original approach involved using the work of open source projects, modified so that they meet the needs of this project. Due to hardware and time constraints on a research project such as this, the router was created as a software router for commodity x86 hardware.

The overall architecture of the router (known hereafter as a QuaSAR - Quality of Service Aware Router) consists of multiple guest OSs running on top of a single virtual machine manager. Each of these guest OSs runs routing software to route incoming packets. There is one main guest OS router, whose job is to route all of the best effort, non-QoS traffic. The rest of the guest OSs are known as routelets, and are available for the routing of QoS traffic flows. When a new QoS flow is initiated, the flow's requirements are sent along the proposed route of the flow. When a QuaSAR router receives details of a new flow's requirements, it first decides whether it has resources available to service the new flow. If there are insufficient resources available, the flow can be routed as best effort traffic, or the flow can be rejected. If, however, there are enough resources available to support the flow's requested QoS, then the flow is associated with a particular routelet. This routelet is then allotted sufficient resources (e.g. CPU time, memory, network bandwidth, etc.) to service that flow. When a packet arrives at a QuaSAR router, it is demultiplexed and sent to the routelet which is dealing with that packet's flow. If the packet is not part of any specified flow, it is sent to the main best-effort router. Figure 3.1 gives an overview of the architecture of the QuaSAR system.
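The admission decision described above can be summarised in a few lines of code. The sketch below is illustrative only - the type and field names are invented, not taken from the QuaSAR source - but it captures the logic: a flow receives a routelet only if every resource it requests can be carved out of what remains unallocated.

struct FlowSpec        { double cpu, memory, bandwidth; bool qos_mandatory; };
struct RouterResources { double cpu, memory, bandwidth; };   // still unallocated

enum class Admission { AssignRoutelet, BestEffort, Reject };

Admission admit(const FlowSpec &f, const RouterResources &avail) {
    if (avail.cpu >= f.cpu && avail.memory >= f.memory &&
        avail.bandwidth >= f.bandwidth)
        return Admission::AssignRoutelet;   // reserve resources, bind flow to a routelet
    // Not enough spare capacity: degrade to best effort, or reject outright
    // if the flow insists on its QoS guarantees.
    return f.qos_mandatory ? Admission::Reject : Admission::BestEffort;
}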

Since each routelet is running in a separate operating system (to partition its resource usage), QuaSAR uses virtualisation software to allow multiple operating system instances to be run on the same machine at the same time. This virtualisation software needs to have as little overhead as possible, so that QuaSAR can cope with high-volume packet traffic demands. To this end, this project uses a para-virtualisation system, rather than a fully emulated virtualisation system, because performance and scalability are more important in the QuaSAR router than full emulation of the original instruction set, especially where the hardware being used (x86) is inherently virtualisation-unfriendly. Both Xen and Denali provide para-virtualisation in an attempt to increase performance and scalability, however Denali does not have any well-used operating systems ported to its virtual instruction set, whereas Xen has ports for both Linux and NetBSD. This project, therefore, uses the Xen virtualisation software because: it is open source and can be modified for the project's requirements; it has good performance and scalability; and the availability of Linux or NetBSD guest OSs allows the project to make use of pre-built routing software.

[Figure 3.1: An overview of the QuaSAR architecture. The diagram shows the input NIC channels feeding a packet demultiplexer, which passes traffic either to the main best effort router or to the QoS routelets, and from there out through the output NIC channels.]

Xen loads guest OSs in separate domains. The guest OS in domain 0 has special privileges, such as being able to start new guest OS domains and configure the resources which they can use. The main best-effort router is therefore started in domain 0, so that this main router can start, stop and control the QoS routelet instances as new QoS flows arrive, or their requirements change.

RSVP messages are used by hosts to signal the creation of new QoS flows. When the QuaSAR router receives an RSVP RESV message, specifying the creation of a new QoS flow, it starts a new routelet, or modifies an existing routelet, so that packets from the flow are routed by that routelet. RSVP messages can contain a flowspec, which details the requirements of the new flow being created (e.g. bandwidth, latency etc.). The original approach called for the QuaSAR router to process RSVP flowspecs and automatically allot enough resources to the routelet assigned to that flow, to meet the flow's QoS requirements. However, the mapping between a flow's QoS requirements and the resources needed by a routelet to meet those requirements was not known before experiments were performed on the QuaSAR router. Therefore, a routelet's resources are assigned manually in this project, with provisions made for the automatic allotting of resources when this mapping is discovered.

When packets arrive at the router, they need to be demultiplexed to the appropriate routelet. Xen contains a mechanism for sharing network cards, and demultiplexing packets to the appropriate guest OS according to a set of rules. The original design of the QuaSAR router called for this mechanism to be used to demultiplex packets to the appropriate routelet, however, it was discovered that this mechanism was not appropriate for the QuaSAR router (see Section 6.3 for details). Therefore, an architecture was created, using the Click Modular Router, to demultiplex packets as they arrived in the router, and send them to the routelet processing their flow, or to the best effort router if they are not part of a QoS flow.


QuaSAR routes MPLS (Multi-Protocol Label Switching) traffic, since this traffic is already routed as a flow according to its forwarding equivalency class label. The main best-effort router needs to route any traffic which is sent to it. It, therefore, needs to create, maintain and use label and routing tables. It also needs to be able to support MPLS label distribution protocols (e.g. LDP or RSVP-TE), routing distribution protocols (e.g. OSPF or BGP), as well as the RSVP protocol used to create new QoS flows. The main router, therefore, uses an open source implementation of an MPLS Linux router from the IBCN (Intec Broadband Communication Networks) research group at Ghent University. This was modified so that it could create and configure QoS routelets as required, but provided a solid basis from which to build the router, without writing a completely new router.

The QoS routelets do not have complex routing requirements. They are only routing packets from a single flow. The work simply involves obtaining a new packet from the flow's input NIC, processing the packet (e.g. substituting one label for another), queuing the packet, and sending the packet to the flow's output NIC. Since only one flow at a time is ever routed by these routelets, they do not need to maintain label or routing tables. There is also no need for them to understand, or initiate, routing or label distribution protocols. In fact, if routelets initiated distribution protocols, they could interfere with the main, best-effort router. Instead, the main router keeps track of table entries which affect routelets that have been created. If one of these entries changes, then the main router can inform the routelet which will be affected.

Due to the simple requirements of the QoS routelets, they use the Click Modular Router to route packets. The simple work-flow needed by the routelets can be provided by joining a number of Click elements together. New elements can be written in C++ to provide any extra functionality required, such as MPLS label substitution. The main router can communicate with the routelets by writing to special files which are created by the Click architecture. When these files are written to, functions within the appropriate element are invoked, thus allowing routelets to be dynamically reconfigured.

Figure 3.2 gives a more detailed overview of this architecture.

Once the prototype QuaSAR system was built, an important part of this project involved performing experiments with this prototype, to evaluate the effectiveness of using virtual routelets to service QoS flows. These experiments should ensure that the virtualisation process does not massively reduce the performance of the router compared with standard routers, and discover if this technique improves the performance and partitioning between QoS flows.

To this end, a testbed network was set up, with a number of computers acting as communication nodes and one machine acting as a QuaSAR router. The QuaSAR machine contained multiple network line cards, each of which was connected to a separate test network. Synthesised test network traffic was sent across the QuaSAR router between these networks, to test various aspects of QuaSAR's performance. The machine running the QuaSAR software also had a standard Linux software router installed, so that QuaSAR's performance could be compared with that of a standard router.

Initially, best effort traffic was sent across both the QuaSAR router and the standard router to discover the additional overhead incurred by running virtualisation software below the router. The next stage involved setting up a Quality of Service flow, and comparing the performance of traffic sent through this flow to that of best effort traffic. This evaluates whether demultiplexing packets between routelets incurs significant overhead, or if the simplified forwarding engine in the QoS routelets actually increases performance. Badly-behaved flows, which use more network bandwidth than they have reserved, were then introduced to discover their effect on well-behaved flows. This evaluates the partitioning that QuaSAR provides between network flows and evaluates QuaSAR's immunity to denial of service attacks.

These experiments should evaluate whether ensuring QoS by partitioning a router's resources, using virtualisation, is effective. They should also identify the areas in which this concept could be improved with future work.


[Figure 3.2: A detailed overview of the QuaSAR architecture. The diagram shows the Xen VMM hosting the main best effort router (the IBCN MPLS under Linux router together with Click, running in Linux in domain 0, and incorporating the packet demultiplexer), alongside an idle routelet pool and the QoS routelets (each a Click Modular Router running in Linux in its own domain), connected between the input and output NIC channels.]


Chapter 4

Integration of Open Source Projects

When beginning this project, a decision was made to build the prototype router from a number of open source projects. The use of this open source software meant that the creation of a prototype virtualised router was feasible in the time provided by this project; however, using so many open source projects together brought a number of problems when trying to integrate them. This chapter describes where the open source projects were used within the QuaSAR router, as well as the changes necessary to integrate these external projects.

4.1 Overview of Open Source Projects

The following open source projects were used within the QuaSAR router:

Xen Virtual Machine Monitor

Xen was chosen as the virtualisation technology for the QuaSAR router because of its relatively minor performance overhead compared with other virtualisation platforms. However, since it uses para-virtualisation to achieve this performance increase, it requires the guest operating systems running on top of it to be modified to support the architecture presented by Xen. The changes required to run Linux as a guest OS within a Xen virtual machine (or Domain) are provided as a patch to the source code of a standard Linux build. The most current version of Xen available at the start of this project was version 2.0.1, which provided a patch for version 2.6.9 of the Linux kernel. Therefore, without substantial rewriting of this patch, Xen 2.0.1 supports only the 2.6.9 version of the Linux kernel.

Requirements: Linux Kernel 2.6.9

Click Modular Router

The Click modular router provides a framework for the creation of flexible routers by the connection of a number of simple packet processing components (or elements), in such a way as to provide the overall required routing functionality. Click was used within the QuaSAR router to implement the routelet's packet processing code, as this was more easily implemented using Click's supporting framework than if it had been written from scratch. Click also provided the ability to quickly create prototype networking code for other purposes, through the creation of new elements or the building of new Click configurations, which proved invaluable throughout the project.


Click has the ability to run either as a user-space process, or as a kernel module. For performance reasons QuaSAR is required to run fully within the Linux Kernel, so that each packet does not incur a switch from user to system operation mode. The QuaSAR router therefore needed to run Click as a kernel module. The Linux kernel must be modified before it can support Click as a module. This is because Click needs access to packets at various points along the Linux network stack, and the kernel requires basic runtime support for C++, which the Click elements are written in. A Linux kernel patch is supplied with Click to make the necessary changes, however, the latest Linux kernel supported by the Click patches was 2.4.22.

Requirements: Linux Kernel 2.4.22

MPLS Linux

QuaSAR’s main, best effort router needs the ability to set up label switched paths and route anyMPLS traffic which it receives. The MPLS Linux open source project provides MPLS support forLinux by modifying the Linux kernel with a source code patch. The most recent version of MPLSLinux, available at the start of this project (1.935), provides a patch for the 2.6.6 version of theLinux kernel. The Linux MPLS project also provides modifications for iproute2 and iptables sothat these tools can be used to modify and manage MPLS routing.

Requirements: Linux Kernel 2.6.6
              iproute2 2.4.7
              iptables 1.2.9

IBCN MPLS RSVP-TE daemon

The QuaSAR router uses RSVP-TE to set up MPLS label switched paths (LSPs), distribute labels and specify a flow specification (flowspec). QuaSAR's main, best effort router, therefore, uses an MPLS enabled RSVP-TE daemon, from the IBCN research group at Ghent University, to provide these capabilities. This daemon makes use of a number of components, including an MPLS enabled Linux Kernel, and MPLS enabled iproute2 and iptables tools (MPLS support provided by MPLS Linux). However, there has been no support for this daemon for some time, so the versions of MPLS Linux which it uses are now out of date.

Requirements: Linux Kernel 2.4.13
              Linux MPLS 1.72
              iproute2 2.4.2
              iptables 1.2.4

4.2 Integration

As can be seen from the previous section, each of the open source projects used as part of the QuaSAR router required a different Linux Kernel version. These projects could obviously not be run together within the router, unless they were modified so that they could all run under the same Linux Kernel version. I decided to use the 2.6.9 Linux Kernel for the QuaSAR router, mainly because substantial changes would need to be made to port the Linux Kernel to the Xen virtual machine architecture, therefore porting a Linux Kernel other than version 2.6.9 would be infeasible in the time provided for this project. This choice of Kernel also had the advantage of being the most recent of the kernels used by any of the projects, therefore porting the other projects to this kernel would benefit the community by updating these projects to work under a newer operating system, rather than porting them to older operating systems, which would be of limited benefit.

The MPLS RSVP-TE daemon did not modify the Linux Kernel, and was therefore less dependent on which version of the Kernel was used. However, it did rely on MPLS Linux, which does modify the kernel. The RSVP daemon is written to interface with version 1.72 of MPLS Linux, which in turn is written for version 2.4.13 of the Linux Kernel. The most recent version of MPLS Linux at the time of this project (1.935) is written for the 2.6.6 version of the Linux Kernel; however, it provides a very different interface to user-space processes (such as the RSVP daemon) for MPLS management. There were therefore two choices for integrating these projects into the QuaSAR router: port the 1.72 version of MPLS Linux to the 2.6.9 Kernel; or rewrite the RSVP daemon to make use of the new MPLS interface provided by version 1.935 of MPLS Linux. The first of these options would incur a major change to MPLS Linux, and very little change to the RSVP daemon, whereas the second option would incur a minor port of MPLS Linux from the 2.6.6 to the 2.6.9 kernel, but major modifications to the RSVP daemon. I decided upon the second of these options, because the modifications to the RSVP daemon are in user level code, rather than within the kernel as with MPLS Linux, and so they would be easier to debug. The consequences of this choice are described in Section 4.2.2.

The rest of this section discusses how the Click modular router was ported to the 2.6.9 Linux Kernel, and how the MPLS RSVP-TE daemon was ported to MPLS Linux version 1.935.

4.2.1 The Click Modular Router

The kernel level Click Modular Router consists of two main parts: a patch for the Linux Kernel to allow it to support Click; and a kernel module which, when loaded, provides Click's functionality. To integrate these components into the QuaSAR router, they needed to be ported from the 2.4.22 Linux Kernel to the 2.6.9 Kernel. The changes necessary to port these two parts are described below.

Click Kernel Patch

The Click Kernel patch performs four main functions: C++ support, Click Filesystem support, Network Stack Modifications and Network Driver Polling support. The changes to the Linux Kernel between versions 2.4 and 2.6 affected each of these areas. The patch's functions are described below, along with a brief description of the changes necessary to port the Click patch to the 2.6.9 kernel.

C++ support

Click is written in C++, whereas the Linux Kernel is written in C, with no runtime support for C++. The patch, therefore, needs to add basic C++ runtime support to the Kernel, for example, mapping the new() function to kmalloc(), adding support for virtual functions, etc. The patch does not, however, support full C++ functionality within the kernel; for example, exceptions are not supported, as this would negatively affect the kernel's performance. The Click module also makes use of a number of kernel header files. The kernel header files are written to be included in C source files. Since C++ reserves a number of new keywords compared with C (for example, new, class or the :: symbol), including these Kernel headers within C++ source code could cause syntax errors.

The C++ runtime support did not require any modification when porting. However the Kernel header files had changed significantly between 2.4.22 and 2.6.9, therefore the new header files, which are included by Click, needed to be modified, so that they did not cause C++ syntax errors.
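The flavour of these header modifications can be sketched as follows. The snippet is illustrative rather than taken from the actual patch - the renamed identifiers are invented - but it shows the two standard techniques: wrapping kernel includes in an extern "C" block, and renaming identifiers that clash with C++ keywords.

// C++ source cannot include kernel headers directly: they declare C linkage
// symbols, and sometimes use C++ reserved words ('new', 'class') as ordinary
// identifiers.  Hiding the keywords behind the preprocessor while the
// headers are parsed avoids the syntax errors.
extern "C" {
#define new   linux_new
#define class linux_class
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#undef new
#undef class
}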


Xen introduces its own architecture specific header files to Linux, so that it can run on Xen's idealised virtual machine architecture. When integrating Click with Xen, these additional header files also needed to be modified, so that they were C++ safe.

Click Filesystem support

The Click module provides a virtual filesystem which can be used to control and monitor the Click router. This filesystem makes use of Linux's proc filesystem; however, a small number of changes to the proc filesystem code are required before the Click Filesystem can be supported.

The Linux proc filesystem code has a slightly different layout in the 2.6 kernel, compared with the 2.4 kernel. Therefore, the appropriate locations for the changes, needed for it to support the Click Filesystem, had to be investigated before these changes could be made in the 2.6.9 kernel.

Network Stack Modifications

A number of Click elements can intercept, or inject, packets at various points in the Linux network stack. Click makes a number of changes to the Linux networking code, to give it access to points within the network stack. These changes also require a minor modification to every network device driver within the Linux Kernel.

Again, the network stack had a slightly different structure in the 2.6 kernel, with a number of extra features, compared with the 2.4 kernel. It was necessary to have an overall understanding of the Linux network stack, and its interactions, before the changes necessary for Click could be added without interfering with the features added between 2.4.22 and 2.6.9 in the network stack ([37] walks through the Linux networking functions which are called when a packet arrives at, or is sent from, a host).

A number of new network device drivers have been added to Linux between the 2.4 and 2.6 kernels. These new device drivers all needed to be modified to support the network stack changes made by Click. Xen also adds a virtual network device driver to the Linux Kernel. When Click was integrated with Xen, this virtual network device driver also needed to be modified to support the changes in the network stack.

Network Driver Polling support

When Click was originally written, the Linux network drivers did not include polling (as opposed to interrupt driven) support. Polling support can greatly improve a router's performance under heavy workloads, therefore Click added polling support to Linux by modifying a number of the network drivers.

Linux 2.6 includes polling support for network devices, therefore a decision needed to be made as to whether it is better to keep adding Click's own polling support to the Linux Kernel, or whether Click should be modified to use the new polling support provided by Linux. I did not intend to add polling support to the QuaSAR prototype router, as this would give QuaSAR an advantage, not included in the project's hypothesis, over the router it was being compared with, and so could unbalance the experimental results. Since I did not intend to use polling support, and there were unresolved decisions which needed to be made by the Click community, I decided not to include Click's polling support in the 2.6.9 patch.

Click Kernel Module

The Click Kernel Module performs the Click routing functions when it is loaded into the Linux Kernel. It consists of four main components: the compiled Click elements that are linked together to provide the routing functionality; a library of functionality used by all elements; router management components; and a Click filesystem used to monitor and configure Click.

Porting the Click Kernel Module to the 2.6.9 kernel involved the following main changes:

Update System Calls

A number of system and kernel calls have changed format between the 2.4 and 2.6 kernel versions. Changes were also made to the way in which modules register their use by other modules. Previously, the macros MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT were used to record a module's use count. These macros have now been deprecated, and the details are now dealt with internally. The Click Kernel Module was modified to reflect these changes.
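A sketch of what this change looks like in practice (the structure name is illustrative, not Click's actual code):

#include <linux/fs.h>
#include <linux/module.h>

/* 2.4 style: the module maintained its own reference count.
 *   MOD_INC_USE_COUNT;   -- when a user opened the module's files
 *   MOD_DEC_USE_COUNT;   -- when the last user went away
 *
 * 2.6 style: setting the owner field lets the kernel take and drop
 * the reference around every call into the module. */
static struct file_operations example_fops = {
    .owner = THIS_MODULE,
    /* .open, .read, .write, ... as before */
};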

Update Click Filesystem

The interface between Linux and filesystem implementations changed between the 2.4.22 and 2.6.9 versions. A number of functions were added to the Click Filesystem to provide the additional functionality required by the new Linux Kernel.
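As an illustration of the shape of the 2.6 interface (this is a generic sketch of registering a small virtual filesystem under a 2.6.9-era kernel, not the actual Click Filesystem code, which hooks into procfs as described above):

#include <linux/fs.h>
#include <linux/module.h>

static int example_fill_super(struct super_block *sb, void *data, int silent);

static struct super_block *example_get_sb(struct file_system_type *fs,
                                          int flags, const char *dev,
                                          void *data)
{
    /* 2.6 expects filesystems to supply a get_sb callback which builds
     * the superblock; under 2.4 this was done through read_super. */
    return get_sb_single(fs, flags, data, example_fill_super);
}

static struct file_system_type example_fs_type = {
    .owner   = THIS_MODULE,
    .name    = "examplefs",
    .get_sb  = example_get_sb,
    .kill_sb = kill_anon_super,
};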

Once the Click Kernel Module could be compiled and loaded into the 2.6.9 kernel, a problem with the Click Filesystem became apparent. The Click Filesystem caused the Linux Kernel to crash randomly after being mounted within the root filesystem. Since this crash occurred within the Linux Kernel, at a seemingly arbitrary point in time, it was extremely difficult to debug. After investigation, I discovered that the GDB debugging environment could be attached to a User-Mode Linux kernel. User-Mode Linux [13] is a build of Linux which runs as a normal userspace process within a standard Linux operating system, allowing the GDB debugging environment to attach to it, as it would with any normal Linux process.

GDB provided the ability to trace through the Kernel source code and probe the value of variables as the Kernel was running. The call tracing showed that the crash was occurring within Linux's memory management code, specifically the slab memory allocator. When the slab memory control structures were probed, they were found to have nonsensical values. From this I deduced that the Click Filesystem was “walking over” unallocated memory (i.e. putting data into memory which had not been assigned to it) or was unallocating (or freeing) memory which had never been allocated. After experimentation, it was found that the Click Filesystem was freeing a memory location which was unused in the 2.6 Linux Kernel. Memory for this pointer had not been allocated, thus when free() was called upon it, the memory management control structures became corrupted. When this free was removed, the Click Filesystem no longer crashed the Linux Kernel.

Integrate Click Module into new Kernel Build Process

The build process for external modules changed considerably between the 2.4 and 2.6 kernels. Previously, external modules were simply compiled against the appropriate kernel header files, with a set of standard compiler directives asserted. The 2.6 Linux Kernel build process for external modules involves pointing the kernel's kbuild script at the external module's source code, which then builds the module using the kernel's own build process.

Normally this change would not significantly affect the porting of a module, in fact, this would normally simplify the module's build process. However Click has an unusual build process, due in part to its simultaneous support for user level and kernel level processes, as well as its use of C++ within the kernel.

To support user level and kernel level processes simultaneously, Click uses preprocessor commands within the source code, with compiler directives creating different object files for a source file, depending on whether it was compiled for the user level or kernel level process. In order that these different object files do not get confused with each other, the object files are not compiled into the same directory as their source code, but are instead compiled into a user level directory or a kernel level directory, depending on their target. To easily support this, Click used a vpath command within its makefile (used by the kernel module's build process). This vpath, or virtual path, command directs the make build tool to search the given paths for source files, but compiles these source files in the current directory. This allows the object files, compiled for user and kernel level, to remain separate in their own respective directory, even when compiled from the same source file. However, the Linux Kernel build process does not support the vpath command, so these source files were not found when building the kernel module under the 2.6 kernel. New commands were therefore added to the 2.6 Click Kernel module makefile, to identify the location of the source files explicitly. Since Click elements can be added by users, their locations are not necessarily known ahead of time. A script was therefore created, which adds the location of new elements' source code to a file used by the makefile.

The use of C++ within Click also introduced problems for the Linux 2.6 Kernel build process. Since the kernel expects C source files, it does not provide any support for compiling C++ source files. Commands were added to the Click Kernel module to enable C++ compilation within the kernel build process.

The compilation of the Click Kernel module within the Linux kernel build process introduces an additional problem, that of common symbols in the object files built. Common symbols are variables without initialised data storage. They are created by the gcc compiler when the compiler finds a global variable which is initialised within a function, rather than immediately after it is declared. For example, the following C or C++ code would create the variable global_value as a common symbol:

int global_value;

void initialise() { global_value = 1; }

The 2.6 Linux Kernel cannot load modules with common symbols, therefore source code with the above code structure needed to be modified. The above case could easily be fixed by initialising global_value to any value when it is declared. However, Click contained a number of instances which were more difficult to fix. For example, Click creates an atomically accessible integer type, for use in locking situations. This atomic type redefines the = operator with a section of code to ensure its atomicity. The code represented by the atomic type's = operator cannot be compiled outside of a function. Therefore a global variable of this atomic type cannot be initialised where it is declared; it must be initialised within a function. This problem could be circumvented by having a global pointer to the value, rather than declaring the value itself globally. Common symbols for atomic types could thus be removed by using the following code:

atomic_int *global_value = NULL;   /* pointer is initialised, so no common symbol */

void initialise() {
    global_value = (atomic_int *) kmalloc(sizeof(atomic_int), GFP_KERNEL);
    (*global_value) = 1;           /* atomic = operator now runs inside a function */
}

The final problem, which this new build process introduced, occurred when the Click Kernel Module was integrated into a Linux Kernel modified for Xen (XenoLinux). Once the Click module was fully operational within the 2.6 Linux Kernel, it was tested on a XenoLinux kernel. The XenoLinux kernel would load the Click module correctly, however, as soon as a Click router configuration was loaded, the machine would hang. The problem was eventually traced to a soft interrupt which was called within Click. As discussed previously, XenoLinux contains its own architecture specific header files, so that it can interface with Xen's idealised virtual machine architecture. The external module build process was compiling Click with the standard x86 architecture header files, rather than the Xen architecture header files. When the interrupt was being called, the standard x86 mechanism was being used, which Xen would not allow, therefore the guest operating system hung. To solve this problem, it was necessary to modify the Linux build process to ensure it used Xen's own architecture specific header files when compiling Click.

4.2.2 MPLS RSVP-TE daemon

As discussed previously, I decided to port the RSVP-TE daemon to a newer version of MPLS Linux, rather than porting an old version of MPLS Linux to the newer kernel. This meant that the port of MPLS Linux to the 2.6.9 kernel was relatively simple, involving minor changes to function call formats and moving the location of some source code which is added to the Linux Kernel for MPLS support. However, this choice meant that the RSVP-TE daemon needed to be substantially modified to support the new MPLS Linux interface.

Integration with MPLS Linux 1.935

The interface for management of MPLS label switched paths (LSPs) and label stacks changed substantially between MPLS Linux 1.72 and MPLS Linux 1.935. Fortunately, the RSVP-TE daemon used a single source file to provide a library of MPLS functions to the rest of the daemon. Therefore, instead of having to search throughout the source code for MPLS operations, re-implementing each one for the new MPLS Linux interface separately, it was possible to simply re-implement the complete MPLS library file, thus providing a wrapper to the new MPLS interface.

Integration with rtnetlink

The library of MPLS Linux functions, used by the RSVP-TE daemon, gains access to the MPLS management functions through the MPLS administration tool (mplsadmin). However, rather than starting mplsadmin each time it required access to an MPLS management function, the MPLS Linux administration program was compiled within the RSVP-TE daemon. This meant that the RSVP-TE daemon could access these commands directly, rather than having to fork a new process (in the form of mplsadmin) whenever it accessed MPLS management functions. When updating the RSVP-TE daemon to support the 1.935 version of MPLS Linux, the original mplsadmin compiled within the RSVP-TE daemon was replaced with the 1.935 version of the mplsadmin.

However, replacing the 1.72 mplsadmin with the 1.935 administration tool introduced some complications. Both mplsadmin and the RSVP-TE daemon itself use rtnetlink to update the routing table in the Linux Kernel. Both of these tools, therefore, have a library of rtnetlink commands, which they use to provide this communication. The 1.72 mplsadmin uses the same rtnetlink library as the RSVP-TE tool, therefore, when it is compiled into the RSVP-TE daemon, it can use the rtnetlink library provided by the RSVP-TE daemon. However, the 1.935 version of the MPLS administration tool uses a newer version of the rtnetlink library than that used by the RSVP-TE daemon. The rtnetlink library within the RSVP-TE daemon was therefore replaced by the newer library, to provide the functionality required by mplsadmin. This required modification of much of the RSVP-TE daemon's source code which dealt with rtnetlink communication, in order that it conformed to the format of the updated rtnetlink library.

4.2.3 Summary

Significant effort has been expended to integrate several open source subsystems to produce the QuaSAR prototype. This level of effort is summarised in the following table.

Subsystem Name          Lines of Code                  Type of Code
Click Kernel Patch      ∼640                           Kernel Level Code in C
Click Module Patch      ∼360                           Kernel Level Code in C++/C
Click Module Makefile   305 (∼100 significant lines)   Makefile Configuration
RSVP-TE Daemon Patch    ∼3400                          User Level C

Some of the 640 lines of code required to patch the 2.6 Linux Kernel, so that it can support the Click Modular Router, came either from previous Click Kernel patches, or from some preliminary work which had been done on the 2.6 kernel patch before the start of this project. However, many of the changes required significant modification to support the 2.6 kernel, required substantial debugging before they worked, or were brand new. Similarly, some of the Click Module Patch was based upon preliminary work which had been performed before the start of this project; however, much more of it was brand new, and debugging was required on the preliminary work, since it had never been fully tested. Most of the Click Module Makefile was replaced to support the 2.6 build process. Much of this replacement was simple variable set up, however, about 100 lines required careful thought. These changes were passed back to the Click community, where, with slight modification, they have been integrated into the Click development tree.

The RSVP-TE daemon patch required by far the most changed lines. A large portion of these line changes consisted of simply replacing the MPLS management tool and the rtnetlink source code files. However, a substantial number of these changes were incurred by rewriting the MPLS function library and by modifying the RSVP-TE daemon's code to support the updated rtnetlink library.


Chapter 5

QoS Routelet

The original approach of QuaSAR, described in Chapter 3, calls for the creation of simplified routers, or routelets, which provide routing functionality for Quality of Service traffic flows. Many of these routelets run within QuaSAR at any one time, such that each network flow requiring QoS guarantees is serviced by its own independent routelet. As such, it is important that each of these routelets uses as little of the overall system resources as possible, so that QuaSAR can support a reasonable number of network flows at any one time.

Fortunately, the routing requirements of a single network flow are relatively simple, therefore the QuaSAR routelet can have a simple design with modest system requirements. A routelet only routes packets from a single, unidirectional network flow, therefore it only needs to provide static routing and simple packet processing. When a network flow's route changes, or its QoS requirements change, the main QuaSAR router (described in Chapter 6) will simply reconfigure the routelet. The routelet does not, therefore, need to cope with dynamic routing changes, however, it does need to be dynamically reconfigurable.

This Chapter describes the design and implementation of QuaSAR's QoS routelet. Section 5.1 describes the operating system configuration used by the routelets. Section 5.2 describes the implementation of the routelets' packet processing and routing functionality.

5.1 Routelet Guest Operating System

Each routelet runs within its own Linux guest operating system. The routelets have simple requirements, and could have been supported by a much simpler operating system than Linux (thus reducing the system resources required by each routelet). However, the routelet must be run within a Xen Virtual Machine, therefore it must run on an operating system which has been ported to the Xen virtual machine architecture. Although Linux, and a number of other major operating systems, have been ported to the Xen virtual machine architecture, no simple, small operating systems run on Xen. Porting one of these operating systems to Xen would be beyond the scope of this project, therefore the Linux Operating System was used to support the QuaSAR routelets. However, an attempt was made to use a minimal configuration of Linux, so that QuaSAR could support many routelets.

The Linux kernel provides the basis of the Linux operating system, in that it manages the underlying hardware, however, it does not provide the full functionality of a Linux operating system. A fully functional Linux Operating System requires both a Linux Kernel and a collection of libraries (such as the GNU C Library) and system tools (such as ifconfig, which is used to set up network interface cards). Although it is possible to set up a Linux system from scratch with these libraries and system tools, they often come bundled in the form of a distribution. The next two sections (5.1.1 and 5.1.2) describe how these two components of the operating system were configured, in order to reduce the resource utilisation of each QuaSAR routelet. Section 5.1.3 describes the changes made to the routelets' startup scripts, so that they can process network packets, and communicate with the main, best effort router, as soon as they have started.

5.1.1 Linux Kernel

The QuaSAR routelet runs within a Xen virtual machine, therefore its Linux Kernel was modified with the XenoLinux patch before being compiled. The routelet also makes use of the Click Modular Router, so the Kernel was also patched to support this (this patch is described in Section 4.2.1). Although a routelet deals with MPLS packets, the processing of these packets occurs within Click, not within the Linux networking stack, therefore the MPLS Linux support was not required within the routelets' kernel.

The Linux Kernel’s build process allows sections of the kernel to be excluded from a build if they arenot needed. This meant that the system requirements of each routelet could be reduced by compiling aminimal kernel, where major sections, such as SCSI disk drivers or USB support, were excluded as theywould never be used by the routelet.

5.1.2 Linux Distribution

It is possible to create a custom distribution, by choosing the appropriate libraries and system tools manually and configuring the system to use these components. This would allow the creation of an ideal minimal distribution for the QuaSAR routelet. However, manually creating a distribution is a very time consuming and error prone task, so I chose to investigate whether a pre-built Linux distribution would provide the features required by the routelet. A Linux distribution typically takes up much more hard drive space than the Linux Kernel, therefore it was important that a small distribution was chosen for use within the routelet. The routelet was required to execute standard Linux programs, so although a small distribution was preferable, it had to contain the full GNU C Library. It also had to contain (or be able to use) the tools necessary to set up the networking configuration needed by the routelet. A number of Linux distributions (e.g. SUSE or Redhat) contain the necessary libraries and tools, however, they are very large and contain features which are unnecessary for the routelet. Similarly, there are many small Linux distributions (e.g. 0sys [25]) which do not use the full GNU C Library or have networking support.

The ttyLinux project [36] is an attempt to create a fully featured Linux distribution which is small and can run on limited resources. ttyLinux was the chosen distribution for the QuaSAR routelets as it can run with 4MB of RAM, yet still provides the full GNU C Library. It also provides the networking support required by a routelet.

5.1.3 Routelet Startup Configuration

Each time a routelet is started, a number of tasks need to be performed before that routelet can communicate with the main QuaSAR router, or process network packets. These tasks were integrated into the routelets' startup scripts. There are three main tasks which a routelet must perform on startup:


Set up Network Devices

Xen virtual network device interfaces (VIFs) are used by QuaSAR to provide communication, and packet passing, between domains (i.e. QoS packets are passed from the main QuaSAR router to the appropriate QoS routelet through a VIF). When a new Xen domain is started, a parameter in the domain creation script specifies how many VIFs are created for that domain. These VIFs form a point-to-point link between their back-end within a privileged domain (which is creating this new domain), and their front-end within the new domain. The back-end and front-end parts appear as normal network devices to both domains, and so normal network messaging can be used to communicate between domains.

Each routelet, when started, is configured with a VIF to correspond to each of the Ethernet devices in the underlying QuaSAR router. A script on the privileged domain (i.e. the domain with access to the underlying Ethernet devices - in QuaSAR's case, the main, best effort router's domain) connects the back-end of these VIFs to their corresponding physical Ethernet devices using a virtual Ethernet bridge. This means that any packets sent out of a VIF within a routelet are sent out of the corresponding physical device. When QuaSAR demultiplexes packets destined for a routelet (described in Section 6.3), it sends these packets to the routelet's VIF which corresponds to the physical device on which the packet arrived. These VIFs therefore appear to the routelet like the corresponding physical Ethernet devices, but carry only the traffic destined for that routelet.

The routelet routes packets based upon their MPLS label, not based upon their IP header. MPLS is a lower level network protocol than IP (layer 2.5 instead of layer 3), therefore the routelets' VIFs do not have to be set up with IP addresses. In fact, if they were set up with the addresses of their corresponding Ethernet devices, this would cause confusion within the virtual Ethernet bridge being used to connect the VIFs to the Ethernet devices. The VIFs were, therefore, started without being assigned an IP address. The ttyLinux networking startup script could not bring up (or start) network devices without assigning them an IP address. This script was therefore modified to allow these VIFs to be brought up without an IP address.
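At the system call level, bringing up a device without an address simply means setting the interface's IFF_UP flag without ever issuing an address assignment. The sketch below shows the underlying mechanism (the startup script itself achieves the same effect through the standard ifconfig tool; the function name here is invented):

#include <cstring>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

// Bring an interface up without assigning it an IP address -- the
// equivalent of 'ifconfig <name> up' with no address argument.
int bring_up_without_ip(const char *name) {
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0) return -1;
    struct ifreq ifr;
    std::memset(&ifr, 0, sizeof(ifr));
    std::strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    if (ioctl(s, SIOCGIFFLAGS, &ifr) < 0) { close(s); return -1; }
    ifr.ifr_flags |= IFF_UP;                // no SIOCSIFADDR call is made
    int rc = ioctl(s, SIOCSIFFLAGS, &ifr);
    close(s);
    return rc;
}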

Each routelet also starts with an extra control VIF, so that the main QuaSAR router can send control information to each of the routelets. A unique IP address is assigned to each routelet's control VIF, therefore standard TCP/IP communication can be used for communication between the main router and the routelets. The ClickCom tool, described in Section 6.2.2, uses this communication link to allow the main router to control the routelet's Click Router configuration.

Figure 5.1 provides an overview of the routelet network setup.¹ Although it appears that a packet must pass through a large number of interfaces to get to a routelet, the packet is never actually copied between these interfaces. Instead, packets move between interfaces by less expensive methods, such as pointer passing, or page table sharing. Page table sharing is used to move data from one Xen domain to another, for example, moving packets between the front-end and back-end of Xen virtual network interfaces. This approach involves manipulating the memory page tables of the two guest operating systems involved, to transfer ownership of the packet's memory page(s) from one guest operating system to the other. Therefore, transfer of packets between the back-end VIFs in domain 0, and the routelet's front-end VIFs, involves subtle manipulation of both guest operating systems' page tables, rather than copying of the packet.

¹ Note that Figure 5.1 does not show the packet demultiplexing mechanism used to send incoming packets to the correct routelet for processing. This demultiplexing is described in Section 6.3, however at present it is sufficient to view the bridges as performing the necessary demultiplexing between routelets.


[Figure 5.1: This diagram outlines the network setup used to connect routelets' virtual Ethernet devices to the physical Ethernet devices. The physical Ethernet devices in the best effort router's domain are attached to virtual Ethernet bridges (xbr-eth0, xbr-eth1 and xbr-control, the latter carrying control messages). The bridges connect to the back-ends of the Xen virtual interfaces (vif1.0-vif1.2 for routelet 1, vif2.0-vif2.2 for routelet 2), whose front-ends appear as normal Ethernet devices inside each routelet. Data moves by network transmission to and from the physical devices, by pointer passing within a domain, and by memory page transfer between the VIF back-ends and front-ends.]

Mount Click Filesystem

The routelets control their Click Router configuration by reading and writing to files in the Click Filesystem. This virtual filesystem needs to be mounted within the routelet's root filesystem before it can be accessed. An entry was added to the routelet's fstab configuration file, in order to mount the Click Filesystem automatically on startup.

Start ClickCom Server Daemon

The ClickCom tool (described in Section 6.2.2) is used by the main QuaSAR router to control the routing and packet processing of routelets, by transferring the required configuration to the routelet's Click Router. The ClickCom server was added to the routelet's startup script, allowing the main router to send Click configurations to a routelet as soon as it is started.

5.2 Routelet Packet Processing

As each routelet supports a single network flow, it only has to deal with packets arriving at a single inbound network interface, static processing of these packets, then transmission on a single outbound interface. Packets therefore simply need to pass through a single chain of commands, with no branches or decision points, while being processed by a routelet.

The QuaSAR routelet uses the Click Modular Router to process packets. As discussed in Section 2.1, a Click Router consists of a number of elements, linked together by one way links. When a packet arrives in a Click Router, it traverses this sequence of one way links, being processed by each element it arrives at, until the packet arrives at an element which causes it to leave the Click Router (this could involve being discarded, being sent to a network device, being passed up to the Linux networking stack, or a number of other options). This structure is ideal for the QuaSAR router, with a single sequence of elements performing the necessary processing on packets between their arrival and departure.² Many of the routelet's packet processing requirements can be provided by elements which are already available within Click. Elements can also be written to provide any required functionality which is not provided by the standard Click elements.

² Click does allow elements to have more than one output or input, providing branching and merging of packet flows. This ability was used within the main QuaSAR router (Section 6.3), but was not necessary in a routelet's packet flow.

5.2.1 Overall Design

The QuaSAR router routes packets at the MPLS layer, therefore it has to process both the data-link (in this case Ethernet) header and the MPLS shim header. The routelet, therefore, processes packets by first stripping off the outermost (old) Ethernet header, processing the MPLS shim as required for the next network hop, and finally encapsulating this packet in a new Ethernet header, before sending it to the next hop in that network flow. Figure 5.2 outlines the Click configuration used by each routelet to perform this packet processing.

FromDevice([inbound nic])
    -> Strip(14)
    -> ProcessShim([new MPLS label])
    -> EtherEncap([src MAC], [dst MAC], 0x8847)
    -> Queue
    -> ToDevice([outbound nic])

Figure 5.2: An overview of the Click architecture used by QuaSAR routelets to process packets.

Each of the Click elements used by the routelet in the above Click configuration is described in more detail below:

FromDevice

The FromDevice element captures any packets arriving from a network device and pushes them to the next element. These packets have been captured from the network interface's device driver, and have therefore not passed through any of Linux's networking stack. Consequently, the packets are still encapsulated with their original data-link header.

The FromDevice element takes a single configuration parameter - the name of the device it should capture packets from. This device name is specified, when the main QuaSAR router configures a routelet for a particular network flow, as the device upon which packets from that flow arrive. The FromDevice element creates a special file in the routelet's Click Filesystem, which can be used to change the device from which it will capture packets. Therefore, if a network flow's inbound device changes, the routelet does not have to be completely reconfigured. Instead, the main router can use the ClickCom tool to write the new device name into this special file.

Strip

The Strip element simply removes a certain number of bytes from the front of a packet. This is used to remove the old data-link header from packets arriving into a routelet. The Strip element accepts one configuration parameter - the number of bytes to be stripped from the front of a packet. This is set to 14 bytes in the QuaSAR router, as this is the size of an Ethernet header. If a different technology is used, this parameter can simply be changed to support the header size of that technology.

ProcessShim

This element processes the MPLS shim on the packet. This involves changing the shim's MPLS label to that used by the next hop in the network flow, and decrementing the shim's time to live (ttl) field. This element is configured by providing the value to which an outgoing packet's MPLS label should be changed. This element also provides a special file which can be used by the main router to change this outgoing label value, without reconfiguring the whole routelet. The ProcessShim element was created especially for the QuaSAR router. Its design and implementation are described in more detail in Section 5.2.2.

EtherEncap

The EtherEncap element encapsulates packets in their outgoing Ethernet header. An Ethernet header consists of the source hardware (MAC) address of the outgoing interface, the destination hardware address of the next hop in this packet's route, and the protocol type of this packet. The EtherEncap element is thus configured by providing the values of these three components. The two MAC addresses are provided by the main router when it configures a routelet, the protocol type always being set to MPLS. The element again provides special files to alter these values dynamically without reconfiguring the whole routelet.

Queue

Click provides two mechanisms for moving packets between elements - push and pull. If a link is in push mode, then the first element will push packets to the second element as soon as it has finished dealing with them. The second element will then immediately process this pushed packet. In pull mode, the second element is responsible for passing packets - it will ask the first element for a packet when it is prepared to process one. An element can be of type push, pull or agnostic (either push or pull), and elements can only be linked to other elements of the same type (an agnostic element can link to either a push or a pull element, however both its input and output must be of the same type; once connected, it effectively becomes either a push or pull element, depending on the elements to which it is connected).

Most of the elements used by the routelet are agnostic, however, FromDevice is a push element (as it pushes packets as soon as they are captured) and ToDevice is a pull element (as it waits until the network device is ready before asking for the next packet to send). A FromDevice element and a ToDevice element cannot be connected together directly (even with agnostic elements in between) as they are of different types. A Queue element has a push input, and a pull output, therefore a Queue element is placed between the FromDevice and ToDevice elements to connect them. A Queue also provides storage for packets when more are being pushed into it than pulled out. This provides a degree of buffering between the rate of incoming packets and the rate at which they can be sent out of the outbound network interface. The Queue's storage capacity can be altered, depending upon the requirements of the routelet.
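In Click's C++ API, an element declares the type of its ports through a processing() method; a queue-like element marks its input as push ("h") and its output as pull ("l"). The outline below is a simplified sketch of such an element, not Click's actual Queue source:

#include <click/element.hh>
#include <click/packet.hh>

class SimpleQueue : public Element { public:
    const char *class_name() const { return "SimpleQueue"; }
    const char *processing() const { return "h/l"; }  // push input / pull output

    // Upstream elements push packets in; they are buffered internally.
    void push(int port, Packet *p);
    // The downstream ToDevice pulls packets out when the NIC is ready.
    Packet *pull(int port);
};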

ToDevice

The ToDevice element sends packets out of a network device. These packets are sent straight to the network interface's device driver, and have therefore not passed through any of Linux's networking stack. The ToDevice element is configured by providing the name of the device it should send packets out on. This can be modified using a special file, in the same way as the FromDevice element.


The configuration values for each element are provided by the main router when it configures a routelet for a particular network flow. This process is described in Section 6.2.

5.2.2 ProcessShim Element

The ProcessShim Click element was created so that the QuaSAR routelets could modify the MPLS shims of packets as they pass through the router. An MPLS shim is a 32 bit header which is used to route packets in an MPLS network. In MPLS over Ethernet, this shim is located between the Ethernet header and the IP header of a packet. An MPLS shim has four fields: a label, used for routing; an experimental field (which can be used for DiffServ over MPLS); a “bottom of stack” bit, used when tunnelling MPLS over MPLS; and a time to live (ttl) counter, which allows layer 3 functions, such as ping route tracing, to occur, even though an MPLS router doesn't have access to the layer 3 header. These fields are arranged as shown in Figure 5.3.

 0                   19 20     22 23 24         31
+----------------------+---------+--+------------+
| MPLS Label (20 bits) | Exp (3) |S |  TTL (8)   |
+----------------------+---------+--+------------+

Figure 5.3: The layout of an MPLS Shim (32 bits).

The experimental field and the “bottom of stack” bit are not modified by the QuaSAR routelet, however, both the label and ttl fields must be modified before a packet is sent to the next hop on its route. The ttl field must be decremented by each router the packet traverses across the network. If a packet arrives at a router with a ttl count of zero, then that packet is discarded. This prevents packets travelling forever because of a loop within a network. The ttl is a 1 byte number at the end of the MPLS shim. The ProcessShim element, therefore, accesses the shim's ttl using an unsigned char pointer to the last byte of the shim. The ProcessShim element tests this ttl against 0, discarding the packet if it is 0, otherwise the ttl is decremented.

A label switched path (LSP) does not use a single label value throughout the whole network. Instead, each router assigns its own label to an LSP. This means that it is not necessary to find a globally unique label for an LSP across all of the routers that LSP traverses. When an LSP is created, a label distribution protocol (RSVP-TE in the QuaSAR router) enables neighbouring routers to exchange the label value they have assigned to that LSP. MPLS routers use this information to swap an incoming packet's MPLS label with the label that the next hop router has assigned to that LSP, before sending the packet.

When a new ProcessShim element is created, it is passed a new label value (as an unsigned integer) by a per-element configuration method. A packet's MPLS label is replaced by this new label value when it is processed by the ProcessShim element. An MPLS label is a 20 bit integer, therefore, since C++ does not have an integer type which is 20 bits long, bit manipulation is used to mask the 12 bits at the end of the shim when the label is replaced.
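The masking might look like the following sketch (a hypothetical helper, not the element's exact code); it preserves the shim's low 12 bits - the experimental field, "bottom of stack" bit and ttl - while overwriting the 20 bit label:

#include <cstdint>
#include <arpa/inet.h>  // ntohl / htonl

void swap_label(uint32_t *shim, uint32_t new_label) {
    uint32_t h = ntohl(*shim);                // the shim is big-endian on the wire
    h = (h & 0x00000FFFu)                     // keep Exp, S and ttl (low 12 bits)
      | ((new_label & 0x000FFFFFu) << 12);    // insert the 20 bit label
    *shim = htonl(h);
}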

A write handler was added to the ProcessShim element. This creates a special file in the Click Filesystem and handles the effect of writing to that file. The write handler was implemented so that writing to this special file updates the new label value with the value written to the file, allowing dynamic reconfiguration of the ProcessShim element.


Chapter 6

Main Router

In order to support QoS routelets, a number of changes must be made to the best effort router, which is in overall control of the QuaSAR router. This best effort router runs within a privileged Xen domain (virtual machine), and therefore has direct access to the physical networking devices and the ability to create new domains for use by routelets. The main QuaSAR router performs two functions: routing of any best effort, non-QoS, packets arriving at the router; and setting up and managing QoS routelets.

The routing of best effort packets is performed by two components - Linux IP forwarding for IP packets, and MPLS Linux for best effort MPLS packets. Both provide static routing of packets through a network, with tools available to manually modify their routing tables. The MPLS RSVP-TE daemon is used by the QuaSAR main router to provide dynamic creation of MPLS LSPs when the router receives RSVP-TE path creation messages. Similarly, QuaSAR could be modified to support dynamic IP routing, with the addition of routing daemons which support dynamic routing protocols, such as OSPF or BGP (for example, those in the Zebra Open Source Router). However, dynamic IP routing was not required by this project.

Support for QoS routelets required the addition of four main components in the main QuaSAR router - Routelet Creation, Assignment of QoS Network Flows to Routelets, Packet Demultiplexing and Routelet Management. These components are described fully in the following sections of this chapter.

6.1 Routelet Creation

The creation of a new routelet involves creating a new Xen domain, starting Linux within that domain and linking the domain's virtual network devices to the router's physical devices. It therefore takes significant CPU time (a matter of seconds) to start each new routelet. Starting a routelet each time a new network flow requesting QoS guarantees arrives would therefore considerably disrupt the routing performance of the QuaSAR router while the routelet is started. Instead, QuaSAR creates a pool of routelets at startup, when no packets are being routed, then assigns routelets to QoS network flows as they arrive.

A StartRoutelets script was implemented, which automatically creates a pool of routelets, ready for use by network flows. A single argument specifies the number of routelets which this script should create. Each routelet requires a unique ID number, both to identify the Xen domain in which it is running and to create a unique IP address, used by the routelet's control VIF. The StartRoutelets script begins by searching the list of Xen domains currently running, to find the highest domain ID currently in use. It does this by running a simple awk script over the output of the Xen management tool's domain listing option. Each new routelet is assigned a higher ID than the highest currently in use, to ensure this new ID is unique, which allows the StartRoutelets script to add routelets to an existing pool, as well as create a new pool.

Xen is passed a domain startup script and this unique ID for each routelet to be created. This domain startup script contains the information used by Xen to create the routelet's domain, such as:

The location of the routelet's Linux Kernel: A single Linux Kernel is shared by every routelet (described in Section 5.1.1). Since the Kernel is loaded into the routelet's memory immediately, sharing a single Kernel file across routelets does not cause interference amongst them.

The amount of memory assigned to that routelet: Each routelet is assigned the smallest amount of memory within which it can run (6MB), so that the QuaSAR router can support the maximum possible number of routelets. Each routelet is assigned the same amount of memory, regardless of its QoS requirements, since additional memory would not have a significant effect on the simple packet processing performed by the routelets.

The location of the routelet's filesystem: A model routelet filesystem was created with the files needed by each routelet. This model filesystem was initially located on a Logical Volume Manager (LVM) partition, where each routelet could use an LVM snapshot of this model filesystem as its filesystem. A snapshot partition only writes to disk the changes which the routelet makes to the model filesystem. This allows multiple routelets to share the same filesystem without interfering with each other when they write to that filesystem, whilst still saving disk space, as the shared model filesystem is only stored once on disk. It was discovered, however, that routelets do not require a writable filesystem. By turning off system logging and filesystem verification checks in the routelet's guest operating system, the routelet simply requires access to a read only filesystem. Every routelet is therefore given read only access to a single file backed root filesystem, simplifying the filesystem layout and reducing the disk space required to that of a single routelet filesystem, no matter how many routelets are running simultaneously.

The routelet's virtual network interface configuration: The routelet domain startup script contains the information necessary to set up a routelet's network configuration, as described in Section 5.1.3. This includes the number of virtual network interfaces a routelet should have, the virtual bridges to which these network interfaces should be connected, and the IP address (generated from the routelet's ID number) which should be assigned to the routelet's control VIF.

Once all of the routelets have been started, their domains are paused until they are assigned to a network flow. Therefore, unused routelets are not scheduled to run by the Xen Virtual Machine Monitor, and the pool of unused routelets does not use any CPU time. The routelets are paused after all of the routelets have been started, rather than after each individual routelet starts, in order to allow the routelets time to boot their operating systems fully.

Finally, the framework used to demultiplex incoming packets between routelets is set up. This is discussed in Section 6.3.

6.2 Assignment of QoS Network Flows to Routelets

Once a pool of routelets has been created, a mechanism is needed to remove routelets from this pool and assign them to the routing of a QoS network flow. A ConnectRoutelet script was created to automatically perform the tasks necessary to assign a routelet to a network flow. This script is described in Section 6.2.1.


In order to configure a routelet for the routing of a QoS network flow, the main QuaSAR router must be able to communicate with routelets. The ClickCom tool was created to provide communication of Click configuration files between the main router and routelets. The design of this tool is described in Section 6.2.2.

The assignment of routelets to network flows was integrated within the RSVP-TE daemon, so that routelets are assigned to network flows automatically when an MPLS LSP is set up with RSVP-TE messages. Section 6.2.3 describes this integration.

Finally, routelets must be returned to the idle pool after a network flow is torn down. This process is described in Section 6.3.4.

6.2.1 ConnectRoutelet Script

Three tasks need to be performed to assign a routelet to a network flow: identify a routelet and remove it from the pool, configure this routelet so that it can route packets from the network flow, and finally configure packet demultiplexing so that packets from that network flow are sent to the newly configured routelet. The main QuaSAR router uses a shell script to perform these tasks automatically when a new QoS network flow arrives.

The ConnectRoutelet script identifies which routelets are assigned to network flows and which are in the pool of available routelets, by identifying routelets whose domains are currently in the paused state. An awk script is run over the output of the Xen domain listing to identify routelets which are paused. One of these paused routelets is selected and un-paused using the Xen domain management tool. This routelet will be assigned to the new QoS network flow.

In order to configure this routelet for the QoS network flow which has just arrived, the main router must pass a Click configuration, as described in Section 5.2.1, to the routelet. First, however, the configuration parameters of each Click element must be determined, as appropriate for the network flow.

Firstly, the routelet must route packets between the network flow's incoming and outgoing network interfaces correctly. These interfaces are known when a network flow is specified, as they are a fundamental characteristic of the network flow. The names of these network devices are passed to the ConnectRoutelet script, where they are used to fill in the FromDevice and ToDevice element configurations of the routelet's Click configuration.

The routelet must replace the MPLS label of the packets it processes with the label used by the next router for this network flow. This outbound label is also passed to the ConnectRoutelet script, where it is used to configure the ProcessShim element.

Finally, the EtherEncap element needs to be configured with the source and destination MAC addresses of the packet's Ethernet header. The name of the network interface out of which the packets will be sent is already known, therefore the source MAC address can be found by searching for this interface name in the output of the ifconfig tool, which lists network interface information, including the interfaces' MAC addresses. The destination address is the address of the next hop on the network flow's route. The MAC address of the next hop is not typically known, however, the next hop's IP address is part of a network flow's specification. The ConnectRoutelet script is passed the next hop IP address, which it translates to a MAC address by searching for this address in the router's ARP cache¹.

¹ARP [30], or address resolution protocol, is used by hosts to find the Ethernet MAC address which corresponds to a given IP address. When a host sends packets to an IP address, it first sends an ARP message to all neighbouring hosts to ask which would accept packets with this IP address, thereby discovering the MAC address of the host to which it should send packets with this IP address. Recent IP address / MAC address pairs are stored in an ARP cache, so that ARP messages do not have to be sent for packets whose IP addresses have been recently resolved.


If an ARP entry cannot be found for this next hop IP address, then the ConnectRoutelet script pings the next hop IP address and checks the ARP cache again. If the IP address is still not in the ARP cache, then the next hop cannot be reached by this router, and the network flow cannot be set up. Otherwise, the MAC address corresponding to the next hop IP address is used to configure EtherEncap's destination address.

Once these configuration parameters have been discovered, the resulting Click configuration is sent to the newly assigned routelet, using the ClickCom tool (Section 6.2.2). Finally, the QuaSAR demultiplexing support is configured, so that packets arriving at this router from the newly established QoS network flow are sent to the assigned routelet. QuaSAR's demultiplexing details are described in Section 6.3.

Performing these tasks using a compiled language such as C, instead of a shell script, was investigated, however, due to QuaSAR's reliance on external tools (such as the Xen management tool), the speed increase this would provide would be negligible.

6.2.2 ClickCom

A Click Router can be loaded by writing a Click configuration into a config file in the Click Filesystem. Once a Click router is running, its elements can be reconfigured by writing values into Click files, created when elements are initialised. Although a routelet can control its own Click Router configuration through the Click Filesystem, the main router cannot directly access another operating system's Click filesystem (i.e. the main router cannot directly manipulate a routelet's Click Filesystem in order to set up its configuration). It would initially seem that NFS [35] could be used to share this filesystem between the routelet and the main best effort router, however, this is not the case. NFS, or Network File System, can be used to share files across a network, however, the Click Filesystem does not consist of any physical files on disk (the files are effectively just interfaces into procedures within the Click Kernel Module), therefore there are no physical files for NFS to share.

A tool (ClickCom) was therefore created to transfer configuration data from the main router into a routelet's Click Filesystem. A Click configuration is written in a specially created programming language, therefore it can be transmitted as a simple character stream. ClickCom consists of a client and a server process, which communicate using TCP/IP. The server process runs on each routelet, binding to a known port on that routelet's control VIF. The client process runs on the main QuaSAR router. When the main router needs to change the configuration of a routelet's Click Router (for example when a new QoS network flow arrives), it provides the ClickCom client with the routelet's IP address, a Click configuration (either through standard input or from a local file), and the location of the file on the routelet's Click Filesystem which is to be written to (e.g. Click's config file). The ClickCom client connects to the routelet's server, sends the location of the Click file to which the routelet should write the configuration, and then sends the configuration as a character stream. When the server receives a connection, it opens the file at the location sent by the client, then writes the configuration character stream into that file.
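The following is a minimal sketch of the client side of this exchange; the port number, the newline terminating the file path, and the function name are assumptions for illustration, not ClickCom's actual wire format:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <cstdint>
#include <string>

// Send a Click configuration to the ClickCom server on a routelet's control VIF.
bool send_config(const std::string &routelet_ip, uint16_t port,
                 const std::string &click_file, const std::string &config) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);                       // the server's known port
    inet_pton(AF_INET, routelet_ip.c_str(), &addr.sin_addr);

    if (connect(fd, (sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return false;
    }
    // First the location of the Click file to write (e.g. "config"),
    // then the configuration itself as a character stream.
    std::string path_line = click_file + "\n";
    write(fd, path_line.data(), path_line.size());
    write(fd, config.data(), config.size());
    close(fd);
    return true;
}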

The ClickCom tool can be used both to create new Click router configurations within routelets, and to reconfigure the Click elements of running routelets, by writing into their configuration files (see Section 5.2.1 for an account of the reconfiguration which is possible for each element).

6.2.3 RSVP-TE Daemon Integration

To support routelets, the RSVP-TE daemon was modified to assign routelets to QoS network flows when a new label switched path (LSP) is created. An LSP is created when an RSVP-TE Resv message, corresponding to a previously received RSVP Path message, is received by the RSVP-TE daemon. Therefore, the RSVP-TE daemon was modified to call the ConnectRoutelet script when a Resv message is received, passing parameters specifying the inbound and outbound network interfaces, the incoming and outgoing MPLS labels and the next hop IP address of the network flow to this script. All of these parameters, except for the inbound network interface, are available to the RSVP-TE daemon in a structure which stores the details of the RSVP-TE Resv message. The inbound network interface can be inferred from the interface on which the corresponding RSVP Path message arrived, as RSVP-TE Path and Resv messages follow inverse routes through the network (defining the route of the LSP).

The QoS requirements of a network flow can be specified using a flow specification (flowspec) within the RSVP-TE messages used to set up the network flow. The RSVP-TE daemon could be modified to extract the QoS requirements from this flowspec, and map these requirements to the resources required by the routelet assigned to that flow. The RSVP-TE daemon could then use the management tools described in Section 6.4 to allocate the appropriate resources to the newly assigned routelet. However, a flow's QoS requirements could not automatically be mapped to the resources a routelet requires to provide this QoS before experiments were performed on the QuaSAR router. Therefore, resources are manually assigned to routelets in this project.

6.3 Packet Demultiplexing

Initially, QuaSAR's design called for the use of Xen's inbuilt packet demultiplexing between domains to send incoming packets to the correct routelet for processing. However, further investigation showed that Xen's demultiplexing support was not appropriate for the QuaSAR router. Xen's packet demultiplexing involves the use of a virtual Ethernet bridge. A virtual network interface (VIF) back-end of each target domain is connected to a virtual bridge in the privileged domain, along with the physical device where packets to be demultiplexed are arriving. Bridges route packets based upon their data-link layer address (i.e. the Ethernet MAC address in this case), therefore each VIF is assigned a unique MAC address, used by the bridge to route incoming packets to the correct VIF (and therefore domain). There are two possible methods of demultiplexing packets between domains using this virtual bridge approach:

• Put the physical Ethernet device into promiscuous mode, and direct hosts wishing to send packets to a domain to use the domain VIF's MAC address, instead of the physical Ethernet device's MAC address, as the packet's destination. With the physical Ethernet device in promiscuous mode, it would accept these packets, even though their destination MAC address does not match the device's MAC address. The bridge would then route these packets to the appropriate domain, based upon the VIFs' MAC addresses.

This approach is not appropriate for the QuaSAR router because the RSVP-TE messages, used to set up an LSP, use IP to find their route through the network. Since the LSP follows the same route as the RSVP-TE messages used to set up that route, the RSVP-TE messages must be sent to the MAC address of the routelet which is going to process that LSP. A fake IP address could be set up for each routelet, with the QuaSAR router faking ARP message responses, associating that IP address with the routelet VIF's MAC address. The RSVP-TE messages could then be sent to the appropriate fake IP address, however, this would require that the previous hop on the LSP's route has knowledge of which routelet will manage the LSP, before the QuaSAR router even knows that a new LSP is being created. Without major changes in LSP creation (which would make the QuaSAR router unable to interoperate with other RSVP-TE routers), this could not be accommodated. This approach would also require a unique IP address for each routelet, which would be extremely wasteful.

• Packets are sent to the physical device's MAC address, where they traverse the Linux network stack until their IP address is exposed. This IP address would be associated with the appropriate domain by entries added to the privileged domain's ARP cache (either through ARP requests sent across the virtual bridge, or static entries added to the ARP cache). Standard Linux IP forwarding could then send the packets to the appropriate VIF's MAC address, and thus the correct domain, over the virtual bridge. This approach is not appropriate for the QuaSAR router because QuaSAR routes packets based on their MPLS labels, a layer below the IP header. Therefore, the router should not use the IP header of these packets at all. Also, this approach would effectively require two stages of routing - routing of packets to the appropriate routelet's domain, then routing the packets to the next hop. This would have a serious negative performance impact on the QuaSAR router, and would decrease the partitioning which QuaSAR provides between network flows.

Since Xen's packet demultiplexing is not appropriate for the QuaSAR router, a new method of demultiplexing packets, based upon their MPLS label, was required. Click was used to provide a framework for this demultiplexing support, since pre-existing Click elements could provide some of the functionality required to demultiplex packets between routelets.

6.3.1 Overall Design

The main QuaSAR router must classify incoming packets as either belonging to a QoS assured network flow, or being best effort packets, and route them appropriately. Packet classification consists of two stages. Firstly, packets are classified as either MPLS or non-MPLS packets. Non-MPLS packets cannot be routed by QoS routelets, therefore they are sent to the main router's networking stack to be processed. Secondly, the packet's MPLS label is examined to discover which QoS routelet should process it, or whether it is from a best effort MPLS flow. It is important that this classification should be dynamically reconfigurable, so that new QoS flows can be supported as they are created. Figure 6.1 gives an example of the Click architecture which is used by the main QuaSAR router to support demultiplexing of two network interfaces between two routelets.

Each of the Click elements used by the demultiplexing Click configuration is described in more detail below:

FromDevice

The FromDevice element captures packets received at an Ethernet device before they traverse the Linux networking stack. The FromDevice element is described in more detail in Section 5.2.1.

Classifier

The Classifier element classifies packets based upon simple bit pattern matching at a given offset into the received packet. The Classifier is configured with a sequence of patterns assigned to different output ports. The Classifier checks the packet against each pattern in turn, pushing the packet out of the port of the first pattern which matches that packet.

The packet demultiplexer uses two pattern / output port pairs. The first compares the protocol field of the packet's Ethernet header to the MPLS protocol value. If the packet has an MPLS protocol field, it is sent out of port 0, to the MplsSwitch. The second pattern is a default pattern which matches any packet, therefore packets which are not MPLS are pushed out of port 1, to the ToHost element.
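As an illustration of the first pattern's effect, the following sketch tests whether a frame's EtherType marks it as MPLS (0x8847 is the standard EtherType for MPLS unicast over Ethernet; the check shown is a simplification of the Classifier's generic bit pattern matching, not its actual code):

#include <cstdint>

// Returns true if the Ethernet frame carries MPLS, in which case the
// Classifier sends it out of port 0 (towards the MplsSwitch); otherwise
// the default pattern pushes it out of port 1 (towards ToHost).
bool is_mpls(const uint8_t *frame) {
    // The EtherType occupies bytes 12 and 13 of the Ethernet header,
    // immediately after the destination and source MAC addresses.
    uint16_t ethertype = (uint16_t(frame[12]) << 8) | uint16_t(frame[13]);
    return ethertype == 0x8847;   // MPLS unicast over Ethernet
}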

MplsSwitch

The MplsSwitch directs an incoming packet to one of its output ports based upon the packet's MPLS label. The MplsSwitch checks incoming packets' MPLS labels against a number of MPLS label / output port pair entries². Packets are sent to the output port corresponding to their MPLS label. MPLS label / output port pair entries can be added dynamically to the MplsSwitch, to support newly created network flows.

Figure 6.1: An overview of the Click architecture used to demultiplex packets between routelets (each physical device's FromDevice feeds a Classifier, which sends MPLS packets to an MplsSwitch for distribution to the routelets' VIFs, and non-MPLS packets to a ToHost element for the best effort router).

The packet demultiplexing architecture contains an MplsSwitch element for packets arriving from each physical Ethernet device. Each MplsSwitch is connected to a VIF on each of the routelets (the VIF which, on the routelet, corresponds to the physical Ethernet device which the MplsSwitch deals with - see Section 5.1.3). When a new QoS network flow is being set up, the ConnectRoutelet script adds an MPLS label / output port pair to the MplsSwitch which deals with packets from the network flow's inbound network device, in order to switch packets from that network flow to its newly assigned routelet. If a packet arrives which does not match any of the MPLS label / output port pairs, it is sent out of the default port 0 (where it is sent to the main, best effort router for processing).

The MplsSwitch element was created especially for the QuaSAR router. Its design and implementation are described in more detail in Section 6.3.3.

ToHost

The ToHost element allows packets to be injected into the Linux networking stack at the same point as they were captured by the FromDevice element. Any packets which the packet demultiplexer sends to this element, therefore, appear to the Linux network stack as if they had never been captured by Click in the first place. The ToHost elements are configured with the name of the device from which packets sent to this element should appear to have arrived.

²A hashtable is used to store the MPLS label / output port pairs (see Section 6.3.3), therefore an MPLS label lookup has O(1) complexity, and scales well as the number of network flows increases.


Best effort packets are sent to the ToHost elements to be processed by the main, best effort router in the privileged Xen domain.

Queue

Queue elements are needed between FromDevice and ToDevice elements in Click. See Section 5.2.1 for more details.

ToDevice

The ToDevice element is used to send packets to the appropriate routelet's VIF, for processing by that routelet. The ToDevice element is described in more detail in Section 5.2.1.

6.3.2 Automatic Demultiplexing Architecture Creation

The packet demultiplexing Click architecture depends upon the number of routelets which have been created by the QuaSAR router. Therefore, when new routelets are created (by the StartRoutelets script) the demultiplexing architecture must be changed. The StartRoutelets script, therefore, creates a new demultiplexing architecture whenever it is run.

The script searches the output of the ifconfig tool for a list of the Xen virtual network devices (VIFs) connected to the currently operational routelets. Each routelet has a VIF corresponding to each of the physical network devices on the router, therefore, each routelet's VIF is connected to the MplsSwitch which handles packets from the corresponding physical device. The VIF is connected to the MplsSwitch output port number which corresponds to the routelet's ID, so that the ConnectRoutelet script can create an MPLS label / MplsSwitch output port pair from a new flow's MPLS label and the assigned routelet's ID number.

However, if the QuaSAR router has shut down some routelets, there will be gaps in the routelet ID number sequence. This would leave some of the MplsSwitches' ports unconnected, thus creating an invalid Click configuration. Therefore, the StartRoutelets script keeps track of unconnected ports, and connects a Discard element to those ports, as they will never be used. This may seem wasteful if many routelets are shut down and restarted, however, this is not normal practice in the QuaSAR router. When a network flow is torn down, the associated routelet is simply returned to the idle pool (see Section 6.3.4) to be reused. Therefore, shutting down a routelet in the QuaSAR router would only occur in exceptional circumstances (such as a crashed routelet).

6.3.3 MplsSwitch Element

The MplsSwitch switches packets to an output port based upon the packet's MPLS label. It does this by comparing a packet's MPLS label to a table of MPLS label / output port pairs, and sending the packet out of the output port specified by the pair which matches the packet's label. These pairs are stored in a hash table, indexed by the hashed MPLS label, to ensure fast lookup when a packet arrives. If a packet does not match any of the pairs in the table, the packet is sent out of a default port.

Initially the table is empty, but MPLS label / output port pairs can be added by writing to a special file which this element creates. If a character stream with the format:

[MPLS Label] > [output port number]

is written to this special file, then this pair will be added to the table. This special file is written to by the ConnectRoutelet script, to update the demultiplexing support when a new network flow has been assigned to a routelet. Another special file, when written to with an MPLS label, will remove that MPLS label's entry from the MplsSwitch's table. The final special file created by the MplsSwitch will, when read, output the current table.

6.3.4 Returning Routelets to the Idle Pool

Eventually, network flows will be torn down. The QuaSAR router must therefore be able to disconnect unused routelets, and return these routelets to the idle pool, otherwise it would quickly run out of routelets, or the resources necessary to start new routelets. Disconnecting a routelet from a network flow involves two tasks.

Firstly, the MPLS label / output port pair, used to demultiplex packets to the routelet being disconnected, must be removed from the MplsSwitch. This means that any remaining packets from the flow can be routed through the best effort router, and the MPLS label can be reused for another network flow. A pair can be removed from an MplsSwitch by writing the flow's MPLS label to the remove label special file within the Click Filesystem.

Secondly, the routelet must be returned to the idle pool so that it can be reused by another network flow. The idle pool simply consists of all the routelets whose domains are paused. Therefore, to return a routelet to the idle pool, its domain simply needs to be paused. A routelet does not need to be reconfigured before being returned to the idle pool, as its configuration is overwritten as soon as it is assigned to a new network flow.

These tasks were added to a disconnect option in the ConnectRoutelet script. The overall routelet lifecycle is described by Figure 6.2.

Figure 6.2: This diagram outlines the routelet lifecycle (idle and working states), with the commands (StartRoutelets, ConnectRoutelet connect, ConnectRoutelet disconnect, xm destroy) which move routelets between states.

6.4 Routelet Management

The purpose of routing QoS network flows through routelets was to guarantee an allocation of the router's resources to each QoS flow, thus allowing the router to meet flows' QoS requirements even when under heavy load. Routelet management tools are required to assign resources, such as CPU time, network transmission rate, memory and disk access, to each routelet, depending on the QoS required by the network flow it is servicing. The QoS provided by a routelet is not significantly affected by the memory or disk access rate granted to that routelet. This is because each routelet only performs very simple packet processing and, after startup, does not use the disk or request additional memory at all. These two resources are therefore assigned statically at routelet creation. However, the CPU time and network transmission rate granted to a routelet have a considerable effect on the QoS which can be provided by that routelet. Tools are therefore used by the main QuaSAR router to dynamically allocate these resources to routelets, depending on the QoS requirements of the flows they service.

6.4.1 CPU Time Allocation

CPU time is assigned to a domain, and therefore to a routelet, by the Xen virtual machine monitor's scheduler. The Xen virtual machine monitor can support multiple schedulers, each with different scheduling policies. To provide a guaranteed allocation of CPU time to each routelet, a Xen scheduler which provides soft real time guarantees must be used by the QuaSAR router.

The original version of the Xen virtual machine monitor provided a soft real time scheduler called ATROPOS. With the ATROPOS scheduler, each domain can be assigned a certain slice of the CPU time (e.g. 20000µs of CPU time every 100000µs), which the scheduler will, within limits, guarantee. The original design of QuaSAR called for this ATROPOS scheduler to guarantee CPU time to each routelet, however, the version of Xen used by QuaSAR no longer supports the ATROPOS scheduler.

A scheduler which provides similar guarantees of CPU time allocation between domains was being written at the time of this project. This scheduler (named SEDF, after its earliest deadline first scheduling policy) was not finished before the end of this project, nevertheless, an early version was obtained, which allowed some experiments to be performed with a soft real time scheduler. Unfortunately, this early scheduler was unstable and could not cope under heavy packet loads, therefore many of the experiments could not make use of this scheduler.

The only Xen scheduler which was stable at the time of this project was a time shared scheduler, called BVT (borrowed virtual time). The BVT scheduler bases its scheduling decision on how much CPU time a domain has previously been allotted, with domains which have received less CPU time than others being more likely to be scheduled (in effect providing fair shares of CPU time to each domain). Weights can be assigned to each domain, to skew the balance of CPU time assigned to each domain, giving each domain a proportional fair share of CPU time. However, this division of CPU time between domains is not guaranteed. There was therefore no stable Xen scheduler which could be used to guarantee CPU time partitioning between routelets at the time of this project, and writing such a scheduler was beyond the scope of this project.

6.4.2 Network Transmission Rate Allocation

It is important that each routelet can be guaranteed a minimum network transmission rate, so that it can provide the required QoS to the flow it is servicing. The transmission rate of a physical network card is finite, therefore, the network transmission rate of each routelet must be limited, to prevent one routelet from overwhelming a network card and preventing other routelets from making use of their allotted transmission rate on that card.

Originally, the Xen virtual network devices (VIFs) had the ability to limit the transmission rate of a domain. The original design of the QuaSAR router called for this capability to be used to limit the transmission rate of routelets, however, the change to a new virtual network device driver in newer versions of Xen had removed this ability. The Xen community planned to reintroduce the ability to limit a domain's network transmission rate in a later version of Xen, however, this would not be implemented before this project's deadline. I therefore implemented transmission rate limiting in the new Xen virtual network device drivers, and passed these changes back to the Xen community.


The transmission limiting system is implemented as a simple credit allocation scheme. Each virtual interface can be allocated a credit of bytes which it can transmit, and a time period between credit replenishments. Each transmission of x bytes through a VIF uses up x credits on that VIF. Once a VIF runs out of credit, it can no longer transmit packets until the VIF's credit is replenished. This allows a domain's (and therefore a routelet's) transmission rate to be shaped, for example, limiting a domain to transmitting 500kB of data every 100ms.

To implement this, the Xen virtual network interface driver was modified so that it stores each VIF's current credit, its maximum credit, its previous replenishment time and its replenishment period. Each time a request is made for the driver to transmit a packet on the network, the size of that packet is compared against the VIF's remaining credit. If the VIF has enough remaining credit, the credit is decremented by the size of the packet, and the packet is sent. If the VIF does not have enough remaining credit, the next expected replenishment time (the previous credit replenishment time added to the credit replenishment period) is compared against the current time. If the expected replenishment time has passed, then the credit is replenished immediately (with the previous replenishment time set to the current time), and the packet is sent. If the expected replenishment time has not yet passed, a timer is set to replenish the VIF's credit at the expected replenishment time, and the packet is dropped.

Rather than checking the actual time (which involves a slow do_gettimeofday system call), times are compared using Linux's jiffy based timing. Linux increments a global jiffies variable each time a hardware timer interrupt occurs. This jiffies variable therefore provides a lightweight method of comparing different times. However, as the jiffies variable is only incremented each time a hardware timer interrupt occurs, it is coarse grained, with an accuracy of no more than about 10ms on an x86 system. Therefore, this network transmission limiter cannot be used for fine grained traffic shaping.
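The following sketch models the credit check described above; the structure fields and function are illustrative stand-ins for the driver's actual state, and the jiffies values stand for the kernel's jiffy counter:

#include <cstdint>

struct VifCredit {
    uint64_t remaining;    // bytes the VIF may still transmit this period
    uint64_t max_credit;   // credit granted at each replenishment
    uint64_t last_refill;  // jiffies value at the previous replenishment
    uint64_t period;       // replenishment period, in jiffies
};

// Decide whether a packet of packet_bytes may be transmitted now.
bool may_transmit(VifCredit &vif, uint64_t packet_bytes, uint64_t now_jiffies) {
    if (packet_bytes <= vif.remaining) {
        vif.remaining -= packet_bytes;          // enough credit: spend and send
        return true;
    }
    uint64_t next_refill = vif.last_refill + vif.period;
    if (now_jiffies >= next_refill) {
        // The period has elapsed: replenish immediately and send.
        vif.last_refill = now_jiffies;
        vif.remaining = vif.max_credit >= packet_bytes
                      ? vif.max_credit - packet_bytes : 0;
        return true;
    }
    // Not enough credit and the period has not elapsed: the driver arms a
    // timer to replenish the credit at next_refill, and the packet is dropped.
    return false;
}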

Finally, the Xen management tool was modified so that the transmission rate of a VIF can be limited dynamically. These changes were submitted as a patch to the Xen project, where they have been integrated into the latest version of the Xen virtual machine. The VIF rate limiting command is used by the QuaSAR router to limit the transmission rate of routelets, to ensure partitioning of network transmission between network flows.


Chapter 7

Experimental Testbed Setup

In order to evaluate the performance of the QuaSAR router, an experimental testbed was set up. This testbed was built to allow experiments which could evaluate the QoS (e.g. latency, jitter, throughput etc.) provided by the QuaSAR router, and compare this QoS with that provided by a standard router.

An important aspect of this project involves evaluating the partitioning provided by using routelets to route individual network flows. Therefore, a testbed was required which could evaluate the performance of the QuaSAR prototype in the presence of conflicting flows.

This chapter describes the creation of the experimental testbed which was set up in order to evaluate the QuaSAR prototype. Section 7.1 describes the example network which was built for the experiments on the QuaSAR router. Section 7.2 discusses the tools which were used to measure the various parameters of the QoS provided by a router (such as latency, jitter and throughput). Section 7.3 describes some of the changes in the testbed hosts which were necessary to accurately measure relative times of events (necessary for per-packet measurements of transmission latency). Section 7.4 discusses the creation of an MPLS traffic generator, which was used to overwhelm the router while another flow was being measured, thus allowing evaluation of the partitioning a router provides between network flows. Finally, Section 7.5 discusses the problems encountered when trying to overwhelm the QuaSAR router with network traffic, and how these problems were overcome.

7.1 Network Setup

A test network was set up in order to perform experiments to compare the QuaSAR prototype's performance to that of a standard router. The main component of this network is therefore a machine which routes packets between subnetworks, acting either as a QuaSAR router or as a standard router. The machine chosen to act as a router had a 1GHz Pentium 3 processor and 1GB of RAM. This roughly approximates the type of machine typically used as a software router at the time of this project, with, however, more RAM than would be normal, in order to support a large number of routelets running simultaneously.

This router machine was loaded with the software necessary to run either as a QuaSAR router or as a standard router. The standard router chosen to compare against the QuaSAR prototype was MPLS Linux, controlled by a version of the RSVP-TE daemon not modified to support routelets. These are unmodified versions of the software used by QuaSAR to route best effort traffic. Both routers run on similarly configured 2.6.9 Linux Kernels, with QuaSAR's main router's kernel targeted at the Xen architecture, and the standard MPLS router's kernel targeted at the x86 architecture. Both routers also run with the same filesystem image, and use the same Linux distribution (SUSE 9.2), with the same Linux configuration, to eliminate unnecessary disparities between the two routers.

A router can route packets from one subnetwork (subnet) to another. The number of subnets a router can connect is dependent upon the number of network interface cards present on that router. The machine used as a router had only three PCI slots, therefore it could support a maximum of three network interface cards, limiting the number of subnets that this router could support to three (unless it used non-standard network cards with more than one port, which were not available during this project). The router was fitted with three similar 100Mbit/s Ethernet network cards, each allocated its own subnet (10.1.0.0/16, 10.2.0.0/16 and 10.3.0.0/16). Figure 7.1 gives an overview of the hosts which were connected to these subnets, and the overall network structure. The details of how these hosts were used are given below.

Figure 7.1: This diagram shows the basic network setup which was used throughout the QuaSAR router experiments.

In order to test the QoS provided by each router, network flows must be measured as their packets traverse the router, travelling from one subnet to another. In order to test the effect that the router has on this traffic, no other traffic should be present on the two subnetworks it traverses, otherwise collisions, queuing and other effects within the network could be attributed to the router. Therefore, two of these subnetworks were reserved as the arrival and departure subnets of the network flow(s) whose QoS is being measured. A client host was connected to each of these subnets, one to act as a source (10.1.0.2 on the diagram) and the other as a sink (10.2.0.2) for the QoS measuring flow(s). The source and sink hosts were connected to their subnets using a crossover Ethernet cable between the client host and the network card of the router assigned to that subnet.

To test how the use of routelets affects the partitioning between network flows, an additional traffic generation network flow needs to be set up. However, this traffic generation flow cannot pass through the subnets which contain the source and sink of the QoS measuring flow. Otherwise, this traffic generation flow could interfere with the QoS measuring flow, due to elements of the network other than the router (e.g. queuing on a network switch, or collisions in a network hub). Thus, the results derived would not provide an accurate representation of the router's partitioning capability. Therefore, the traffic generation flow requires its own source and sink subnets. However, the machine used as the router could only support one additional subnet. If both the source and sink traffic generation hosts were placed on the same subnet, their packets would not normally travel through the router (they would be routed directly through the subnet, through the Ethernet switch or hub used to create the subnet). MPLS, though, is a directed, or traffic engineered, routing protocol, so an LSP can be engineered to route traffic through the router, even though it will be entering and exiting the same subnet. Two hosts (10.3.0.2 and 10.3.0.3), used as a traffic generator source and sink, were therefore connected to the final network card of the router using an Ethernet switch.

All hosts which were used to inject traffic into the network were Dell Optiplex Gxa desktops, with 233MHz Pentium 2 processors, 128MB of memory, and 100Mbit/s Ethernet cards. These machines were loaded with the same SUSE 9.2 Linux distribution and 2.6.9 Linux Kernel as that used by the router machine. The same (unmodified) versions of MPLS Linux and the RSVP-TE daemon were used to create, and route LSPs through, the network as those used by the router. The switch used to connect the cross-traffic generating hosts to the router was a Netgear 100/1000Mbit/s GS516T Ethernet switch.

7.2 Network QoS Measurement Tools

A number of tools were used in order to measure the QoS provided by the router. The maximum throughput of a network flow through the router was measured using the iPerf tool, described in Section 7.2.1. Per-packet timing measurements through a network flow, such as latency and jitter, were measured using a specially created tool, described in Section 7.2.2.

7.2.1 Throughput Measurement

The throughput of the QuaSAR router (and the standard MPLS router for comparison) was measured using the iPerf [40] network measurement tool. iPerf measures the maximum bandwidth of the network between its client and server components. It does this by setting up a TCP connection between the client and the server, then sending as much data as it can from the client to the server. The TCP protocol will use its sliding window flow control mechanism to regulate the traffic between the client and the server, to the maximum throughput supported by the network in between. The TCP flow control mechanism does not send the maximum amount of data across the network continually. Instead, the transmission rate of a TCP flow fluctuates around the maximum transmission rate, due to dynamic changes in the TCP window size introduced by the flow control mechanism used by Linux (TCP Reno [26]). Therefore, iPerf does not measure the absolute maximum bandwidth of a network flow, but does measure the likely maximum bandwidth which an application will receive through the network flow.


7.2.2 Per-Packet Timing Measurement

The QuaSAR prototype was created with the needs of isochronous (time sensitive) network flows, such as voice over IP or streaming media, in mind. It was therefore important to measure the effect of the QuaSAR prototype router on per-packet timing measures, such as latency and jitter, of network flows. A tool was created which mimics the traffic patterns of a simple voice over IP application, and measures the per-packet timing effects of the router on this traffic.

This tool sends UDP traffic from a client component to a server component. A 70 byte packet is sent from the client to the server every 20ms (this was the minimum time period which a Linux process can accurately wait for, without using a busy wait loop which would interfere with packet transmission). Just before a packet is sent across the network, a timestamp of the current time is added to the packet. The packet then traverses the network, through the router, until it reaches the server. As soon as a packet is received by the server, the current time is again checked, and both this received time and the sent time (extracted from the packet) are written to a file¹. These two times can be used to measure: the latency of each packet across the network (received_n − sent_n); the interarrival time between packets (received_n − received_(n−1)); the interdeparture time between packets (sent_n − sent_(n−1)); and the jitter induced by the network (interdeparture − interarrival).
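The following sketch (illustrative names; times in microseconds) shows how these measures follow from the logged timestamp pairs:

#include <cstddef>
#include <vector>

struct Sample { double sent, received; };  // one logged timestamp pair

struct Timings {
    std::vector<double> latency, interarrival, interdeparture, jitter;
};

Timings compute(const std::vector<Sample> &s) {
    Timings t;
    for (std::size_t n = 0; n < s.size(); ++n) {
        t.latency.push_back(s[n].received - s[n].sent);
        if (n > 0) {
            double ia = s[n].received - s[n - 1].received;  // interarrival
            double id = s[n].sent - s[n - 1].sent;          // interdeparture
            t.interarrival.push_back(ia);
            t.interdeparture.push_back(id);
            t.jitter.push_back(id - ia);  // jitter = interdeparture - interarrival
        }
    }
    return t;
}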

7.3 Accurate Relative Time Measurements

In order to measure the per-packet latency and jitter using the tool described above, both the sent and received times must be measured on clocks which are accurate relative to each other, within the expected ranges of per-packet timings. Initially, a network time synchronisation protocol (such as NTP [23]) was considered to synchronise the clocks of the source and sink hosts while measuring per-packet timings. This was investigated, however, NTP could only synchronise clocks to an accuracy of about 4000µs, which was an order of magnitude greater than the minimum packet delay through the router, and two orders of magnitude greater than the jitter induced by the router.

Another option was to synchronise the clocks of both hosts using a GPS receiver. GPS [28], or the Global Positioning System, consists of a number of satellites orbiting the earth. Each satellite contains an atomic clock, which it uses to transmit a very accurate time signal. This signal can be used by GPS receivers to synchronise a host's clock very accurately². Therefore, two GPS receivers could be used to synchronise the two hosts to a high accuracy, however, there was not sufficient time, or equipment, to use GPS to synchronise the hosts used by this experiment.

The final option was to use the same clock to measure both the received time and the sent time. Therefore, both the source and sink of the time measuring network flow must be located on the same host (i.e. 10.1.0.2 and 10.2.0.2 are on the same machine in Figure 7.1). This project used this approach, of routing traffic from one network interface on a host, through the router, then back to another network interface on the same host, for the per-packet timing analysis (however, throughput continued to be measured using separate hosts). As the same host both sends and receives packets, both the source and sink network interfaces contend for the resources of the same system bus, therefore they could interfere with each other. However, the per-packet timing measurement tool only sends a packet every 20ms, therefore it is unlikely that an outgoing packet would interfere with the previous incoming packet, as the network transmission latency was usually much less than 20ms.

¹The server and client machines' clocks must be synchronised very accurately for this process to produce accurate results. This synchronisation is discussed in Section 7.3.

²This GPS time signal is usually used to find the location of a GPS receiver. The signal takes a certain amount of time to travel from the satellite to the receiver, dependent on the distance between the two. The GPS receiver can therefore work out its exact position by comparing the time difference between received signals from three or more GPS satellites.


This method required the routing of network traffic from one network interface to another on the same host, whilst still traversing the network. However, Linux forwards any packets addressed to local interfaces directly to those interfaces internally, without sending them across the network, no matter how the routing tables are set up. In order to send packets across the network, the host needs to be tricked into thinking it is sending packets to a non-local address. Packets are therefore sent from one network interface to an imaginary address. The other network interface of the host is set up to respond to ARP requests for this imaginary address, therefore the router will route packets with this imaginary address as their destination to the host's other network interface. However, although the packets have traversed the network from one network interface of the host to the other, the host does not know it has to deal with packets which have this imaginary destination IP address (as it is not a local address). Therefore, NAT (Network Address Translation) was set up on the host to translate the destination address of incoming packets having this imaginary IP address to the local address of the interface they arrive on. A similar source address translation takes place on outgoing packets, so that the source of the network flow sees replies from the imaginary address, as it expects. The network flow therefore appears, to the host, to be being sent to another host, and to be arriving from another host, even though both its source and sink are on the same machine.

7.4 Network Traffic Generation

In order to evaluate the partitioning which the QuaSAR prototype provides between network flows, a traffic generation network flow needs to be sent through the router. This induces contention for router resources between the flows, thereby allowing a router's partitioning of resources between flows to be measured. The hosts generating this conflicting traffic must generate sufficient traffic to overwhelm, or at least stress, the router, so that the effect of this conflicting traffic on the QoS provided to other flows can be measured.

Initially, it was thought that an application could produce this level of traffic. However, the system call overheads involved in sending each packet through the network stack from user-level code only allowed roughly 50,000 packets per second to be sent by a single host. This was not a sufficient number of packets per second to stress the router³. The traffic therefore needed to be generated within the Linux kernel, where it would not incur these overheads.

The Linux Kernel contains a packet generator module, which can be used to generate a certain number of packets per second. This packet generator creates a packet, then repeatedly sends this packet directly to the network interface's device driver, thereby bypassing the network stack entirely. This allows a single host to generate sufficient packets per second to stress the router. However, this packet generator module only generates IP over Ethernet network traffic, whereas the QuaSAR routelets only process MPLS packets. Therefore, to perform experiments on QuaSAR routelets, an MPLS packet generator was required.

A PktGen Click element was created in order to generate sufficient MPLS traffic for experiments on the QuaSAR prototype. A Click router can be configured so that a chain of elements creates any type of packet (e.g. an MPLS packet), which is then pushed into this PktGen element. The PktGen element then repeatedly sends this packet to the network interface device driver, in much the same way as the Linux packet generator module does. This PktGen element can be configured with three parameters - the name of the network interface from which the generated traffic should be sent, the number of copies of each packet to send, and the amount of time which the PktGen element should wait between sending each packet. This allows the rate of packet generation, the length of time spent generating traffic, and the outbound network interface to be altered.

³Note that the amount of processing a router must perform to route a network flow is more dependent on the number of packets per second (pps) of the flow than on the flow's overall bandwidth. This is because a router needs to perform work to route every packet. Simply increasing the bandwidth of a flow by increasing the size of each packet, without increasing the number of pps, does not significantly increase the processing load on the router. Therefore, the maximum stress a network flow can place on a router occurs when that flow is sending the smallest possible packet size (64 bytes in Ethernet) at the maximum throughput (100Mbit/s for Fast Ethernet), i.e. the maximum pps.

The PktGen element sends packets to the network device driver in a similar way to the ToDevice Click element, however, the sending is surrounded by a loop which sends a single packet the number of times specified by the PktGen's configuration. Between each of these packet sends, the element waits for the amount of time specified by its configuration. It uses a busy wait, where it iterates in a loop of useless processing for the required period. The number of iterations necessary to produce the required delay is calibrated when the element is initialised: the difference in time (measured using the kernel do_gettimeofday system call) across a fixed number of busy loop iterations is used to measure the number of iterations the machine performs per second. This is used to calculate the number of busy loop iterations necessary to generate a certain wait period. This busy wait method of specifying a wait period obviously involves unnecessary computation, however, it is the only method of very accurately specifying a wait period, as Linux timer interrupt methods are only accurate to tens of milliseconds.
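The following user-space sketch illustrates the calibration idea (the element itself runs in the kernel and uses do_gettimeofday; the names here are illustrative):

#include <sys/time.h>
#include <cstdint>

volatile uint64_t sink;  // defeats optimisation of the useless work

// Time a fixed number of busy-loop iterations, giving iterations per
// microsecond for this machine.
double calibrate_iters_per_us(uint64_t probe_iters) {
    timeval start{}, end{};
    gettimeofday(&start, nullptr);
    for (uint64_t i = 0; i < probe_iters; ++i)
        sink += i;                                   // useless processing
    gettimeofday(&end, nullptr);
    double us = (end.tv_sec - start.tv_sec) * 1e6
              + (end.tv_usec - start.tv_usec);
    return probe_iters / us;
}

// Spin for the requested period using the calibrated rate.
void busy_wait_us(double wait_us, double iters_per_us) {
    uint64_t iters = uint64_t(wait_us * iters_per_us);
    for (uint64_t i = 0; i < iters; ++i)
        sink += i;
}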

A special file handler was added to the PktGen element, so that the actual number of packets per second it has transmitted can be examined. This pps value is calculated by measuring the time at the start and at the end of the traffic generation, then dividing the number of packets sent by this time difference.
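
The bookkeeping behind that handler amounts to the following helper. The function name and signature are illustrative, not the element's actual members:

    #include <sys/time.h>
    #include <cstdint>

    // Average transmission rate over a generation run.
    double packets_per_second(const timeval& start, const timeval& end,
                              uint64_t packets_sent) {
        double seconds = (end.tv_sec - start.tv_sec)
                       + (end.tv_usec - start.tv_usec) / 1e6;
        return packets_sent / seconds;
    }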

Overall, this PktGen element allows a single host (with a 233MHz processor) to generate up to 100,000 64 byte MPLS packets per second on a Fast Ethernet card.

7.5 Straining Router with Generated Traffic

In order to test the partitioning provided by routelets, the generated traffic needs to stress the router, so that its effect on other flows can be investigated. The PktGen element can transmit a maximum of 100,000 pps. It was found that this level of traffic did not cause a noticeable change in the QoS provided by the router to other network flows.

In order to stress the router, an attempt was made to use more than one host to generate an aggregate flow of more than 100,000 pps through the router. However, recall from Section 7.1 that there was only one network interface card available for stress inducing traffic. Additional hosts were connected to the Ethernet switch, where the switch combined the traffic generated by these flows and sent it through the single link to the router. However, this aggregation of traffic from additional hosts actually decreased the load on the router: it seems that collisions and queued traffic decreased the overall pps which could be sent to the router.

Since the amount of generated traffic could not be increased, the processing power of the router needed to be reduced, so that the effect of processing this traffic could be measured. There was not enough time to make all the necessary changes to another, slower machine for it to act as the QuaSAR and standard MPLS routers. Therefore, the original machine had to be slowed down in some way.

An attempt was made to under-clock the machine being used. Under-clocking is a process by which the machine's processor clock speed is deliberately slowed down (e.g. 500MHz instead of 1GHz), meaning that the processor performs fewer operations every second. Unfortunately, the machine being used as the router did not have any controls (either hardware or software) which could be used to change the processor's clock speed.


Another option was to start a busy wait process, which simply used up CPU cycles uselessly. However, when this process was run in user mode, the Linux kernel (which was routing the packets) simply pre-empted the busy wait process whenever it had any work to do; therefore, a user-mode busy wait process had no effect on the routing of packets.

The busy wait process could have been run as a kernel thread, where it would not be pre-empted by any packets being routed. However, although the standard MPLS router only has one kernel, the QuaSAR router has multiple kernels running, depending upon the current number of routelets. Therefore, the CPU consumed by the busy wait thread(s) would have to be shared across all of these kernels. It would therefore be very difficult to generate the same busy wait overhead for each flow on the QuaSAR prototype as for each flow in the standard MPLS router.

Instead, an approach was taken which would induce the same overhead in both routers. A clock timer interrupt is produced by the x86 system clock every 1ms⁴. This interrupt is handled by the Linux kernel every time it occurs. However, in a Xen system, the interrupt is first handled by the Xen VMM before it is passed to any of the guest operating systems. Therefore, a busy wait loop could be inserted into the Xen timer interrupt handler on the QuaSAR router, and into the Linux timer interrupt handler in the standard MPLS router, to slow down both routers equally. Although this slows down both routers equally, all of the slowdown occurs at the very start of every 1ms period. Since the average latency (on an unloaded router) is an order of magnitude less than this 1ms period, the per-packet timings could be offset in some unexpected way. However, after averaging over a significant number of packets, this effect should even itself out.
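
The shape of this modification, common to both routers, can be sketched as follows. This is an illustrative reconstruction only; patched_timer_interrupt, original_timer_interrupt and slowdown_loops are placeholder names, not the actual Xen or Linux symbols, and the loop count would be calibrated as described in Section 7.4:

    // Placeholder standing in for the existing tick processing.
    static void original_timer_interrupt(void) { /* normal tick handling */ }

    // Calibrated beforehand so the loop burns a fixed amount of CPU time.
    static const unsigned long slowdown_loops = 50000;  // example value

    static volatile unsigned long scratch = 0;

    void patched_timer_interrupt(void) {
        // Fixed-cost useless work at the start of every tick, so both
        // routers are slowed down by the same amount.
        for (unsigned long i = 0; i < slowdown_loops; i++)
            scratch += i;
        original_timer_interrupt();
    }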

This approach of slowing down the router by modifying the timer interrupts was only used in the experiments which measured the partitioning effect of the router (Section 8.2); all other experiments used unmodified timer interrupt handlers. Section 8.2.1 compares the unloaded QoS provided by each of the routers, and shows that slowing the router using this method does not overly affect the results.

⁴This period is actually programmable at system boot; however, both Xen and Linux use the same timer period.


Chapter 8

Experimental Results and Analysis

In order to evaluate the QuaSAR prototype router, a number of experiments were performed to assess the Quality of Service provided by the router to various network flows. These experiments can be split into two main sets: those which investigate the overhead which virtualisation introduces (Section 8.1), and those which examine the partitioning provided by routelets to individual network flows (Section 8.2).

8.1 Virtual Machine Overhead

The major difference between the QuaSAR router and a standard software router is the use of virtual machine technologies. The use of virtual machines will always incur an overhead, because of guest operating system context switches, the protection enforced by the VMM and the additional memory required to run multiple operating systems at once. It is, therefore, important to evaluate the overhead incurred within the QuaSAR router due to its use of virtualisation.

In order to determine the overhead incurred by the use of virtual machine technologies in QuaSAR, and to discover what causes this overhead, the Quality of Service provided by a number of different routers was measured. A network was set up, as described in Section 7.1, with no cross-traffic being generated. These experiments therefore measure the ideal QoS characteristics provided by the router when it is under no load other than the flow being measured.

Three QoS characteristics were measured: latency, interpacket jitter, and maximum throughput. These characteristics were measured for three types of router:

• A standard MPLS router. This router does not use any virtualisation technology, and provides the baseline for these results.

• QuaSAR's best effort router. This is effectively the same as the standard MPLS router, except that it runs within a Xen virtual machine. These experiments run with QuaSAR's best effort router running within a privileged Xen domain, with full access to the physical network devices and no other guest operating systems running concurrently. This therefore evaluates the overhead caused by simply running on top of Xen's virtualisation layer.

• QuaSAR's routelet. This involves routing packets from the measurement flow through a QuaSAR routelet. The routelet runs within a single unprivileged domain, with incoming packets being classified within the privileged best effort router, as described in Chapters 5 and 6. These experiments evaluate the overhead involved in classifying packets and passing them between domains.


The QuaSAR routers were also tested, where possible, with two Xen domain scheduling policies: Borrowed Virtual Time (bvt) and Simple Earliest Deadline First (sedf). The bvt scheduler does not provide any guarantees to routelets apart from proportional fairness; however, it was stable throughout the project. The sedf scheduler provides soft real time guarantees to routelets; however, it was not stable, and many experiments caused this scheduler to fail. In all virtual machine overhead experiments, the schedulers were set up so that they did not limit the CPU time of any domain, i.e. all domains were given a fair share of the CPU time, and any domain could pre-empt another when it received an interrupt or an event notification. This attempts to limit any unexpected results caused by one domain being given priority over another (the partitioning experiments in Section 8.2 purposely give some domains priority over others, in order to assign different amounts of CPU time to each flow).

8.1.1 Latency

The unloaded latency of each router was measured using the tool described in Section 7.2.2, with the network setup shown in Figure 8.1. This tool measures end to end packet latency; it therefore includes the latency induced by the network and the time packets spend traversing the host's network stack, and so does not produce an absolute value for the time a packet takes to travel through the router. However, the latency induced by the network and the host's network stack stays constant throughout the experiments. The experiment therefore provides an accurate measure of the relative difference between the latencies of each type of router, and thus of the extra overhead incurred by each type of router.

Figure 8.1: This diagram shows the network setup used to perform per-packet timing experiments. Note that the source and sink machines are the same, in order to accurately synchronise time measurement (see Section 7.3). The labels show the interfaces' real IP addresses, and the fake addresses used to trick Linux into sending the packets across the network.

The latency of the network and the host's network stack can, however, be approximated, in order to provide a more accurate approximation of the absolute time spent by a packet in each type of router. Each packet transmitted by the latency measuring tool is 70 bytes + a 14 byte Ethernet header + a 20 byte IP header + an 8 byte UDP header = 112 bytes. Each 112 byte packet traverses two 100Mbit/s Ethernet links (one from the host to the router, the other from the router to the host), therefore the time a packet spends on the network can be approximated as:

\[
\frac{2 \times (112 \times 8)}{100{,}000{,}000} = 17.9\,\mu\text{s} \qquad (8.1)
\]

The time spent traversing the host's networking stack is dependent upon the number of operations which need to be performed on each packet between the application layer and the IO necessary to send the packet from the physical network device. When testing the traffic generating capabilities of the hosts, it was discovered that they could send roughly 50,000 pps from user level before the network stack overhead overwhelmed the host's CPU (see Section 7.4). These hosts were clocked at 233MHz, therefore the number of cycles necessary to process a single packet through the network stack can be estimated as:

\[
\frac{\text{CPU cycles per second}}{\text{maximum pps}} = \frac{233{,}000{,}000}{50{,}000} = 4660\ \text{cycles per packet} \qquad (8.2)
\]

Therefore, the time taken for a packet to traverse the host's network stack twice (once when sending and once when receiving) can be very roughly estimated as:

\[
\frac{2 \times 4660}{233{,}000{,}000} = 40\,\mu\text{s} \qquad (8.3)
\]

Therefore, the per-packet latency due to factors other than the router can be estimated at 57.9µs (17.9µs + 40µs).

These experiments measure the per-packet latency. However, since this value is very small, it can be greatly affected by random events, such as scheduling delays, cache misses and interrupt handling. Therefore, each run of the experiment measured the latency of a large number of packets, so that these fluctuations could be averaged out. To ensure that each run is averaged over enough packets to provide a steady result, a run for each router was graphed to show how the average latency changes as the number of packets used to create that average increases. The spread of results is also measured against the number of packets in a run, using 10th and 90th percentile error bars. Figure 8.2 shows these graphs for each of the router types.

These graphs show that the average latency becomes steadier as the number of packets averaged over increases. They also show that the spread of latency values is bounded, because the 90th / 10th percentile error bars shrink to a minimum, then hold steady at that minimum without fluctuating considerably (notice that the QuaSAR best effort router using the sedf scheduler has a considerably larger spread of values, shown by its larger error bars, which will be discussed later). The random fluctuations in per-packet latency average out after between 500 and 1000 packets, depending on the router type. Therefore, each run should be averaged over at least 1000 packets to remove fluctuations and ensure a consistent result. Each run was chosen to cover 3000 packets, since this did not considerably affect the overall time required by the experiments.

To calculate the number of 3000 packet runs necessary to give a stable result, the average latency was graphed against the number of runs averaged over. Figure 8.3 shows this graph for 1 to 5 runs (note that the Y axis range has been decreased, compared with the previous graphs, to show the results in more detail). As can be seen, after about three to four runs the average latency does not fluctuate considerably, and the error bars do not increase any further. Therefore, at least four separate runs must be made to ensure a consistent result. To add some leeway, all experiments were averaged over five runs.

The average per-packet latency induced by each of the router types is shown in Figure 8.4. A 70 byte packet takes, on average, 194µs to travel through the network using the standard MPLS router. Of this, it is estimated that 136.1µs is spent within the router (recall from above that each packet is estimated to spend 57.9µs traversing the network and being processed by the host's network stack). Figure 8.5(a) shows the per-packet timings for every packet in a single experimental run. From this graph, it is clear that most of the packets incur the minimum latency of about 190µs, with another significant, but much less substantial, band at 230µs and then a scattering of packets delayed for much longer.

The same packet will have, on average, a latency of 204µs through the network when being routed by the best effort QuaSAR router. Recall that the QuaSAR best effort router (when it uses the bvt Xen scheduler) is running the same routing software as the standard MPLS router, but from within a Xen virtual machine. Therefore, the overhead incurred by simply using Xen, even without context switches between domains, is estimated at about 10µs. This is 7.3% of the time spent in the router, or 5.2% of the overall time spent by each packet within the network.


[Figure 8.2 graphs: panels (a) Standard MPLS router, (b) QuaSAR Best Effort (bvt), (c) QuaSAR Routelet (bvt), (d) QuaSAR Best Effort (sedf) and (e) QuaSAR Routelet (sedf), each plotting average latency (µs) against the number of packets averaged over (0 to 3000).]

Figure 8.2: These graphs compare the number of packets in a run against the average latency. The error bars show the 10th / 90th percentile values (for every increase of 50 packets).


[Figure 8.3 graphs: panels (a) to (e) as in Figure 8.2, each plotting average latency (µs) against the run number (1 to 5).]

Figure 8.3: These graphs compare the number of runs used to calculate the average latency against the value of the average latency. The error bars show the minimum and maximum run average values.


[Figure 8.4 chart: average latency (µs) for the Standard MPLS, QuaSAR Best Effort (bvt), QuaSAR Routelet (bvt), QuaSAR Best Effort (sedf) and QuaSAR Routelet (sedf) routers.]

Figure 8.4: The average latency of each type of router. The error bars show the minimum and maximum run (not per packet) latencies.

The per-packet timings in Figure 8.5(b) are relatively similar to those of the standard router in Figure 8.5(a), but simply shifted up by about 10µs. Therefore, Xen's virtualisation appears to incur a consistent overhead in each packet's latency. This suggests that the overheads incurred purely by Xen are caused by overheads in interrupt handling (which occurs for each packet) and/or overheads in the operations necessary to process each packet, rather than random (e.g. scheduling) delays.

When a packet is being processed by a QuaSAR routelet, it passes through two domains: the privileged domain with access to the physical devices, and the QoS routelet's domain. It is, therefore, expected that there is an additional per-packet overhead caused by moving between domains. Packets processed by a routelet have an average latency of 249µs; therefore, processing a packet through a routelet adds an additional overhead of 45µs over and above the overhead induced by Xen's virtualisation (22% of the overall network time, 30.1% of the routing time).

The major additional overheads, when processing a packet in a routelet, involve moving the packet between the domains twice (once from the privileged domain to the QoS routelet when the packet arrives, then again from the routelet to the privileged domain for transmission). Moving a packet between domains involves transferring ownership of the area of memory where the packet is stored from one domain to another, then context switching between the domains. Transferring the memory ownership from one domain to another involves manipulating the page tables of both domains and some reasonably complex event channel communication between domains to initiate this process. However, this memory transfer does not account for such a large increase in per-packet overhead.

The two domain context switches, on the other hand, involve a number of operations which could significantly contribute to this overhead. Firstly, all of the processor's state (for example, register values) has to be stored to memory, and the previously saved state of the new domain has to be restored. A more important overhead involved in context switching between domains is the change of address space.


[Figure 8.5 graphs: panels (a) to (e) as in Figure 8.2, each plotting per-packet latency (µs, 0 to 800) against packet number (0 to 3000).]

Figure 8.5: The per-packet latencies for an example run of each type of router


Since different domains have different virtual memory address spaces (to ensure one domain does not have unauthorised access to another domain's memory), a context switch between the domains must change the page table used to translate virtual addresses into physical memory addresses. This involves invalidating the TLB (Translation Look-aside Buffer, the hardware lookup table used for virtual to physical memory address translation), as the entries it contains will no longer be appropriate for the new domain. The memory cache will also no longer be relevant, as it is likely to be filled with entries used by the old domain. Therefore, a context switch involves a major direct overhead, due to saving and restoring the processor state, as well as major indirect overheads, e.g. slower initial memory access and address translation, caused by the hardware caches being invalidated.

It is revealing to roughly estimate the amount of time spent on each of these overheads. The total overhead can be expressed as:

\[
(2 \times \text{context switch cost}) + (2 \times \text{packet transfer cost}) + \text{other costs} = 45\,\mu\text{s} \qquad (8.4)
\]

where the other costs account for packet demultiplexing and network device bridging. Preliminary experience indicates that the other costs account for approximately 5µs per packet, and that a context switch is approximately three times more costly (due to its indirect overheads) than transferring a packet between domains. When this information is applied to equation 8.4, one finds that a single context switch accounts for 15µs of the overhead, and transferring a packet between domains takes approximately 5µs. Note that these overheads have not been directly measured; they are estimates which are simply provided to give a rough overview of the relative sizes of the overheads involved.
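
Spelling out the arithmetic: writing t for the packet transfer cost, and taking the context switch cost as 3t and the other costs as 5µs, equation 8.4 becomes:

\[
2(3t) + 2t + 5\,\mu\text{s} = 45\,\mu\text{s} \;\Rightarrow\; 8t = 40\,\mu\text{s} \;\Rightarrow\; t = 5\,\mu\text{s}, \quad 3t = 15\,\mu\text{s}
\]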

The per-packet timings in Figure 8.5(c) show that there is a significant increase in the direct, per-packet latency overheads, since the minimum packet latency increases by about 35-40µs compared with the previous two routers. However, there is also a significant increase in the number of packets scattered above this minimum latency, suggesting that indirect overheads, such as cache invalidation and scheduling delays, produce a large proportion of the increased overheads.

The final two values show the per-packet overhead incurred by running the QuaSAR router with a soft real time scheduler (sedf). The use of a real time scheduler will introduce additional overheads, due to a more complex scheduling operation and to domains not necessarily being scheduled immediately when a packet arrives, because of scheduling constraints. The use of an unfinished version of the sedf scheduler is also likely to have contributed to the additional overhead. An additional overhead of 20µs (224µs overall) is incurred in the QuaSAR best effort router when using the sedf scheduler. However, when the actual per-packet timings are examined for a run of this experiment (Figure 8.5(d)), a regular patterning is clearly visible. The solid band of minimum packet latencies in both Figure 8.5(b) and Figure 8.5(d) does not appear to increase significantly when moving from the bvt to the sedf scheduler. Therefore, almost all of the increase in average latency is caused by this patterning effect, rather than by a consistent 20µs overhead on every packet. This patterning effect suggests a delay between the time a packet arrives and the domain being woken up. However, in this experiment there is only one domain running, and that domain is given full access to all the available CPU time, so the domain should never need to be woken up. Therefore, this patterning appears to be an artifact of the sedf scheduler being unfinished at the time of the experiment.

The sedf scheduler increases the latency of the QuaSAR routelet by 10µs to 259µs. This increase of 10µs is less than the 20µs incurred by the sedf scheduler in the best effort router. This is unexpected, since packets passing through a routelet cause two additional scheduling decisions; it would therefore be expected that the additional scheduling overhead of the sedf scheduler would be more prominent with the QuaSAR routelet. However, comparing Figures 8.5(d) and 8.5(e), it is clear that the sedf routelet configuration does not suffer from the same regular latency patterning as the sedf best effort router. It appears that the context switching between domains, necessary to route packets through a routelet, prevents the apparent delay in domain wake up which may cause the patterning in the sedf best effort router. Thus, the 10µs additional overhead induced by the sedf scheduler in the QuaSAR routelet router is more representative of the overhead which should be expected from a finished soft real time scheduler than the 20µs overhead observed in the QuaSAR best effort router.

8.1.2 Interpacket Jitter

Jitter between packet arrival times is another important QoS measure, especially for isochronous network flows, such as VOIP traffic. The jitter induced by each type of router can be estimated by measuring the inter-packet arrival time (i.e. the time difference between arriving packets) and comparing this with the packet period. The inter-packet arrival time is calculated by subtracting the arrival time of one packet from the arrival time of the next (i.e. received_n − received_{n−1}). The period between packets is intended to be 20ms, to imitate a VOIP traffic flow; however, the timers used by Linux are not accurate enough to guarantee that every packet is sent exactly 20ms after the previous one.

Therefore, to accurately measure the jitter induced by each type of router, the inter-arrival time should be compared against the inter-departure time (sent_n − sent_{n−1}), which accurately measures the difference between each packet's actual send times. The per-packet jitter can therefore be accurately calculated as interarrival − interdeparture. Figure 8.7 shows the jitter induced by each router type over a run of 3000 packets.

As can be seen, the jitter fluctuates around 0µs (positive jitter must be balanced by negative jitter, otherwise an infinite delay would gradually build up), therefore simple averaging will not generate any useful value. Instead, the root mean square average was taken over each packet's jitter value, to find the average deviation from an ideal of zero jitter induced by each router type. These experiments used the same raw data as the latency experiments, therefore the jitter measurements were also averaged over 5 runs of 3000 packets. Figure 8.6 shows the average root mean squared jitter induced by each of the router types.
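
In symbols, the per-packet jitter and the root mean square figure reported below are as follows (writing recv_n and sent_n for the arrival and departure times of packet n, with N packets per run; the formulas restate the definitions above rather than adding anything new):

\[
J_n = (recv_n - recv_{n-1}) - (sent_n - sent_{n-1}),
\qquad
J_{\mathrm{rms}} = \sqrt{\frac{1}{N-1}\sum_{n=2}^{N} J_n^{\,2}}
\]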

[Figure 8.6 chart: interpacket jitter (µs) for the Standard MPLS, QuaSAR Best Effort (bvt), QuaSAR Routelet (bvt), QuaSAR Best Effort (sedf) and QuaSAR Routelet (sedf) routers.]

Figure 8.6: The average root mean squared jitter of each type of router. The error bars show the minimum and maximum run (not per packet) jitter values.

The jitter between packets when they are sent through the standard MPLS router is, on average, 12.4µs. Figure 8.7(a) shows that most of the packets have a very small jitter (< 10µs).


[Figure 8.7 graphs: panels (a) to (e) as in Figure 8.2, each plotting per-packet jitter (µs, -600 to 600) against packet number (0 to 3000).]

Figure 8.7: The per-packet jitter for a single run of each type of router


However, there is a thin band of packets with approximately 25µs jitter, and a small scattering of packets with a much larger jitter. The thin band of packets with about 25µs of jitter suggests a regular, intermittent delay. As packets in this band occur roughly every 1 second, this implies that their extra jitter is caused by a timer event which occurs every second.

When packets are passed through the QuaSAR best effort router, the jitter increases to 15.1µs, an increase of 2.7µs or 21.7%. Closer investigation of the per-packet jitter values (Figure 8.7(b)) reveals that the vast majority of packets still experience less than 10µs of jitter. The band of packets which experience additional jitter every 1 second have a jitter of about 50µs, double that of the same band when routed through the standard MPLS router. This suggests that Xen takes twice as long as Linux in the standard MPLS router to process the timer event which causes this delay (if indeed it is caused by a timer expiring). This may occur if the timer event is being processed by both the Xen VMM and any other domains running. With only a single domain running, each of these timer events would incur twice as much work: once when being handled in the Xen VMM, then again when being processed by the best effort Linux kernel. The QuaSAR best effort router also suffers from more large jitter packets than the standard MPLS router. This suggests that Xen incurs more large, random delays, probably due to processing within the Xen VMM.

The average jitter of packets being routed by a QuaSAR routelet is 16.3µs, 1.2µs or 7.9% greater than the jitter produced by the QuaSAR best effort router. Since the QuaSAR routelet incurs such a large increase in overhead, it is surprising that it does not increase packet jitter more than it does. Processing packets through a routelet involves at least two context switches and all of the scheduling delays and cache misses that entails (see Section 8.1.1). These effects therefore cause fairly consistent increases in delay, without inducing significant jitter. A closer look at the per-packet values in Figure 8.7(c) shows that the vast majority of packet jitter values are still less than 10µs. The extra jitter seems to be caused by an increased percentage of packets in the 10 to 100µs range. It is difficult to say why there are more packets in this range; however, they still seem to appear in clumps separated by about 1 second, suggesting they may also be caused by the timer event which delays packets in the QuaSAR best effort router.

The QuaSAR best effort router, using the sedf scheduler, incurs a massive average jitter of 55.5µs, 40.4µs (267.5%) greater than the same router using the bvt scheduler. Examining the per-packet jitter values (Figure 8.7(d)) shows that the vast majority of packets still have a jitter of less than 10µs, therefore this increase is not caused by a uniform increase in each packet's jitter. The massive increase in jitter is, instead, clearly due to the regular diagonal bands of increasing jitter. These delayed packets occur every 200ms (i.e. every tenth packet), with each delayed packet having roughly 50µs more delay than the previously delayed packet. After about 10 of these delayed packets (every 2 seconds), the delay decreases to zero again. This strange cycle of delayed packets is highly suggestive of a deficiency in the unfinished sedf scheduler, which should be resolved before the scheduler is released.

The sedf scheduler increases the jitter of the QuaSAR routelet by 4.7µs or 28.8% (to 21µs), compared with the bvt scheduler. Again, Figure 8.7(e) shows that the majority of packets have a jitter of less than 10µs. The additional jitter appears to be caused by an increase in the number of packets with a jitter of between 10 and 100µs, rather than an increase in the jitter of packets in this range. There were also more, seemingly random, packets with a jitter greater than 100µs.

8.1.3 Throughput

The maximum throughput which each type of router can support through a single network flow was investigated using the iPerf network performance measurement tool (see Section 7.2.1). Figure 8.8 shows the network setup used for these throughput experiments.


Figure 8.8: This diagram shows the network setup used to perform throughput measurements.

This measurement stresses the router more than the previous experiments, as it involves sending a much greater traffic flow than that sent by the per-packet timing measurement tool. However, the throughput measurement is more dependent upon the network speed, and upon the ability of the hosts to produce the required amount of traffic.

Figure 8.9 shows the maximum single flow throughput which each type of router can support, averaged over 5 runs. The results show that the overhead introduced by Xen decreases a router's maximum single flow throughput by 0.1%, from 93.68Mbit/s to 93.58Mbit/s. The overheads involved in passing packets to a routelet for processing incur a further 0.02% decrease in throughput; however, this is well within the bounds of experimental uncertainty. Overall, the QuaSAR router does not significantly affect the throughput of a single network flow when using the bvt scheduler.

However, when it uses the sedf scheduler, more significant decreases in throughput are seen. The QuaSAR best effort router's throughput decreases by 2.6% to 91.24Mbit/s when using the sedf scheduler. This sharp decrease is almost certainly due to the patterning effect seen in the timing experiments performed on the sedf best effort router. Since TCP bases its flow control mechanism on timeouts and packet loss, the delay pattern brought about by the sedf scheduler may confuse TCP's flow control into thinking that the network has a lower maximum bandwidth than it does. Noticeably, the graph does not report the maximum throughput of a QuaSAR routelet scheduled by the sedf scheduler. This is because this type of router could not support the bandwidth created by the iPerf tool at all: the routelet's domain would immediately hang when the experiment was started, so no results could be determined for this router type. These results show that the sedf scheduler was not stable enough to be used within the QuaSAR router, and it was therefore not used for any further experiments.

The results of these experiments are, however, limited. The throughput of a single flow is limited to 100Mbit/s by the speed of the Ethernet network used to connect the hosts to the router. In addition, the specification of the host used to generate this traffic was significantly lower than that of the router. The host struggled to cope with the overheads involved in sending this level of traffic, and its CPU time was fully utilised before it even reached the maximum throughput of 100Mbit/s. Therefore, the results of this experiment were heavily constrained by factors other than the type of router. Each type of router could support an overall throughput much greater than that of the single flow presented here; however, the router used did not have enough network cards to fully investigate the maximum overall throughput which each router could support. The partitioning experiments in Section 8.2 provide a better overview of the overall traffic load which each router type can sustain; nevertheless, these throughput experiments do identify the limitations placed upon each flow due to the virtualisation overheads.


[Figure 8.9 chart: average TCP bandwidth (Mbit/s, 90 to 95) for the Standard MPLS, QuaSAR Best Effort (bvt), QuaSAR Routelet (bvt) and QuaSAR Best Effort (sedf) routers.]

Figure 8.9: The average maximum throughput of a single flow in each type of router


8.2 Network Flow Partitioning

The main purpose of the QuaSAR router was to use virtualisation techniques to improve the partitioning between network flows. In order to evaluate the effectiveness of the QuaSAR router at providing partitioning between network flows, the QoS of one network flow was measured as a competing flow tried to use an increasing proportion of the router's resources. Figure 8.10 shows the network setup used to perform these partitioning experiments. These experiments mainly evaluate how well each type of router can partition the CPU processing required by different flows, as the flows do not share the same network resources of the router.

These experiments again evaluate the three router types (standard MPLS, QuaSAR best effort and QuaSAR routelet); however, all QuaSAR experiments were performed under the bvt scheduler, as the sedf scheduler was not stable enough to support the traffic loads this experiment employed. The QuaSAR routelet experiments used two routelets: one for processing of the measured flow, and one for processing of the cross-traffic flow. This meant that the resources used by the cross-traffic could be controlled better than if it was processed as best effort traffic. The problems of controlling the cross-traffic in the best effort router are discussed in Section 8.3.

The cross-traffic flow emulates a misbehaving flow which is sending more traffic than it requested resources for beforehand. In this situation, the Xen scheduler should only assign CPU time to the cross-traffic's routelet for the requested traffic load, and no extra. However, with the bvt scheduler it is not possible to guarantee that the cross-traffic's routelet is not being assigned more CPU time than it was guaranteed. It is only possible to weight the cost assigned to each domain for a slice of CPU time so that, ideally, it is scheduled often enough to service the QoS requirements of the requested flow and no more.


Figure 8.10: This diagram shows the network setup used to perform partitioning experiments. Note that the cross-traffic does not traverse network links used by the measured flow.

The QuaSAR routelet experiments therefore weighted both routelets' domains (the routelet processing the measured flow and the one processing the cross-traffic) so that they were likely to be assigned enough CPU time to process the traffic of the measured flow (so that the scheduler did not limit the measured flow); however, this could not be guaranteed.

The same Quality of Service parameters of latency, jitter and throughput were measured for each type of router, while that router was being loaded with an increasing amount of cross-traffic. The parameters were measured using the same tools as were used to evaluate the unloaded routers in Section 8.1, with each measurement again averaged over 5 runs and each per-packet timing run consisting of 3000 packets.

8.2.1 Effect of Timer Interrupt Busy Wait

In order to produce usable results with these experiments, the machine used as the router needed to be slowed down. This is because the maximum amount of interference traffic which could be injected into the network was not enough to significantly affect the QoS provided by each of the router types to the measured flow. If the routing machine had not been slowed down, the QoS characteristics provided by the routers would have been the same across the whole range of interference traffic load. This would not allow any useful conclusions to be drawn which had not already been discovered in the unloaded router experiments.



[Figure 8.11 charts: (a) Latency (µs), (b) Jitter (µs) and (c) Throughput (Mbit/s) for the Standard MPLS, QuaSAR Best Effort (bvt) and QuaSAR Routelet (bvt) routers, each measured without and with the busy wait.]

Figure 8.11: This figure shows the difference in unloaded performance of each router type when it is slowed down with the timer interrupt busy wait loop. The three QoS parameters used for the partitioning experiments (latency, jitter and throughput) are shown.

The method used to slow down the router machine (described in Section 7.5) involved inserting a busy wait loop into the timer interrupt handler. While this did mean that the available range of cross-traffic could produce measurable differences in the QoS characteristics of each router type, it is possible that it could introduce unexpected consequences. For example, since the slow down overhead occurs at the start of every 10ms period, the packet jitter could be inadvertently affected.

To investigate whether this busy wait loop substantially affected the results presented here, the QoS provided by each unloaded router was investigated both with and without the added busy wait loop. Figure 8.11 shows the results of this investigation. It can be seen that the busy wait loop decreases the performance of each router type in a uniform manner. The latency of each packet's transmission increased by between 2-4% for each type of router. The jitter introduced by each router type increased more, by between 5-10% depending upon the router type. However, it is expected that the jitter would increase more significantly than the latency: the busy wait loop introduces all of its delay at the start of every 10ms period, therefore packets just before and just after the busy wait period will experience significant extra delay, and thus jitter. The busy wait loop has much less impact on the maximum per-flow throughput of each router (between a 0.1-0.3% decrease depending upon the router). Since the throughput is more heavily constrained by other factors, such as network speed and host performance, it was expected that the busy wait loop would not greatly affect an unloaded router's per-flow throughput.

Overall, the busy wait loop within the timer interrupt does seem to provide the required result of slowing the machine down in a reasonably uniform manner. Loading the router may cause some unforeseen consequences; however, this cannot be directly uncovered, since there are no results to compare with the loaded router characteristics.

8.2.2 Latency

Figure 8.12 shows the effect of increasing cross-traffic on the latency of a QoS flow in each type of router being evaluated. The latency of the standard MPLS router is not significantly affected until the cross-traffic reaches about 90,000 pps. At the maximum cross-traffic load of 97,700 pps, the latency of the measured flow increases by 140% to 480µs. The latency seems to increase exponentially after 90,000 pps; however, this cannot be confirmed without increasing the cross-traffic load placed upon the router.

[Figure 8.12 graph: latency (µs, 0 to 20,000) against interference flow packets per second (0 to 100,000) for the Standard MPLS, QuaSAR Best Effort (bvt) and QuaSAR Routelet (bvt) routers.]

Figure 8.12: This graph shows the effect of an increasingly badly behaved network flow on the latency of another flow for each type of router.

The QuaSAR best effort router has a very different latency characteristic as the router becomes more loaded by the cross-traffic. The latency is not significantly affected until the cross-traffic reaches 60,000 pps; however, between 60,000 and 65,000 pps, the latency suddenly jumps by 2700% to about 6000µs. After 65,000 pps the latency does not increase significantly further. Recall that the QuaSAR best effort router uses the same routing software as the standard MPLS router, but within a Xen domain; it therefore identifies the overheads produced by the Xen VMM itself. Such a sudden jump, followed by no further increase, is highly suggestive of a timer of some sort being missed, rather than an increased per-packet overhead caused by Xen. It is possible that the increasing cross-traffic causes the Xen domain scheduler to block the best effort routing domain. However, since no other domains are running in this experiment, the best effort router should never be blocked by the Xen scheduler. It is also possible, although unlikely, that this characteristic is caused by the busy wait loop which was added to Xen's timer handler to slow the routing machine down.

The QuaSAR routelet's latency also starts rising at 60,000 pps; however, it does not rise in a sudden jump. Instead, the rise starts off gradually, then increases exponentially, eventually reaching almost 100 times the original latency. The overheads incurred by virtualisation clearly prevent the virtual routelet approach from achieving greater partitioning between flows than the standard MPLS router. However, the virtual routelet approach does outperform the best effort router's latency when the cross-traffic is between 60,000 and 80,000 pps. It seems that the virtual routelet approach prevents the, possibly timer-based, glitch which Xen causes at 60,000 pps, by limiting the time spent processing cross-traffic packets. However, increasing the cross-traffic beyond this point causes an exponential rise in the time spent processing packets from the measured flow.

It seems likely that this exponential increase is caused by context switches between domains. Each time a packet arrives from the cross-traffic flow, it causes an interrupt in the router. Xen may delay processing this interrupt somewhat; however, the bvt scheduler is specifically built to make each domain appear highly interactive. It is therefore likely to context switch immediately to the domain which handles an interrupt as soon as one arrives (as this usually provides low latency to asynchronous external events, making the domains appear more responsive). If the measured flow's routelet is processing a packet and an arriving cross-traffic packet causes an interrupt, the bvt scheduler may immediately context switch to the privileged domain to handle this interrupt, then context switch back to the routelet's domain to continue processing the packet. Since context switches between domains are very costly (see Section 8.1.1), this greatly increases the time taken to process a packet in the routelet. This failing could have been mitigated by using a soft real time scheduler, which does not context switch to process an interrupt until the currently running domain has used its guaranteed time slice. Another approach would be to poll the network card for packets instead of using an interrupt driven model; a sketch of this idea follows. Neither of these approaches was available during this project, for reasons discussed earlier in this report.
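
The polling alternative would structure each routelet's main loop roughly as below. This is an illustrative sketch only; nic_poll, route_packet and yield_to_scheduler are hypothetical names standing in for device and hypervisor hooks, not functions from Xen, Linux or Click:

    struct Packet;
    // Hypothetical hooks, stubbed so the sketch is self-contained.
    Packet* nic_poll() { return nullptr; }   // next received packet, or nullptr
    void route_packet(Packet*) {}            // the routelet's forwarding path
    void yield_to_scheduler() {}             // voluntarily give up the CPU

    void routelet_main_loop() {
        for (;;) {
            // Drain the receive queue without taking a per-packet interrupt,
            // so cross-traffic cannot force a context switch mid-packet.
            while (Packet* p = nic_poll())
                route_packet(p);
            // Switch domains only at this controlled point.
            yield_to_scheduler();
        }
    }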

8.2.3 Interpacket Jitter

Figure 8.13 shows the effect of increasing cross-traffic on the jitter of a QoS flow in each type of router being evaluated. As can be seen, virtualisation does not have as extreme an effect on a flow's jitter as on its latency when the router is under strain. When being processed by the standard MPLS router, the flow's jitter stays relatively steady until the cross-traffic reaches 80,000 pps. At this point, the jitter increases steeply, reaching an average of 209µs at 97,700 pps. It looks likely that this increase would continue at a similar rate if more cross-traffic could be generated.

The jitter characteristic of the QuaSAR best effort router is similar to its latency characteristic. The jitter suddenly increases substantially, to 370µs, when the cross-traffic reaches about 60,000 pps, but does not increase any further with increasing cross-traffic. The possible missed timer in Xen, which may be the cause of the sharp increase in latency at 60,000 pps, could also be the cause of the substantial increase in jitter at the same cross-traffic load. This suggests that the timer which is being missed by Xen does not affect each packet equally (otherwise the jitter would not increase as the average latency increases). Instead, it seems that the missed timer substantially increases the latency of some packets but does not affect others, increasing the average latency as well as the interpacket jitter.

The QuaSAR routelet's jitter more closely follows the jitter induced by the standard MPLS router than the QuaSAR best effort router does.


[Figure 8.13 graph: jitter (µs, 0 to 450) against interference flow packets per second (0 to 100,000) for the Standard MPLS, QuaSAR Best Effort and QuaSAR Routelet routers.]

Figure 8.13: This graph shows the effect of an increasingly badly behaved network flow on the jitter of another flow for each type of router.

At the maximum cross-traffic load of 97,700 pps, the QuaSAR routelet induces jitter of 325µs, 55% greater than the standard MPLS router at the same point. However, at this point both routers' jitter is increasing exponentially. Rather than the QuaSAR routelet's overheads causing a consistent 55% increase in jitter, it appears that the knee point (i.e. where the jitter suddenly starts increasing exponentially) occurs at 2000 pps less cross-traffic for the QuaSAR routelet than for the standard MPLS router. This suggests that the virtual routelet approach incurs the same jitter characteristic as the standard MPLS router, except that the standard MPLS router can cope with 2000 pps more cross-traffic than the QuaSAR routelet.

Overall, the QuaSAR routelet may not provide better jitter characteristics under load than the standard MPLS router. However, it substantially improves upon the jitter characteristics of the QuaSAR best effort router¹. Since the QuaSAR best effort router is the standard MPLS routing software running within a Xen virtual domain, this demonstrates that the virtual routelet approach reduces some of the additional cross-traffic overhead introduced by virtualisation, with respect to a flow's jitter.

8.2.4 Throughput

Figure 8.14 shows the effect of increasing cross-traffic on the maximum bandwidth of a QoS flow in each type of router being evaluated. Testing the throughput of each type of router is a much more strenuous test of the router's performance than the per-packet timing experiments.

¹It appears from the graph that the jitter induced by the QuaSAR routelet will overtake the QuaSAR best effort jitter just after 100,000 pps, if the QuaSAR best effort jitter stays steady. However, Section 8.2.4 suggests that the jitter induced by the best effort router will jump again as the cross-traffic load continues to increase. Therefore, the jitter introduced by the QuaSAR routelet may never overtake the QuaSAR best effort router's jitter.


This is because the throughput testing tool (iPerf) injects a much greater traffic load into the measured network flow than the per-packet timing tool does (the per-packet timing tool only injects roughly 28kbit/s, whereas iPerf tries to reach about 100Mbit/s). This means that the router is already significantly loaded, even before the cross-traffic contends for router resources. This experiment therefore gives some indication of the effect on each router type of a greater traffic load than can be generated by the cross-traffic flow itself.

[Figure 8.14 graph: TCP bandwidth (Mbit/s, 0 to 100) against interference flow packets per second (0 to 100,000) for the Standard MPLS, QuaSAR Best Effort and QuaSAR Routelet routers.]

Figure 8.14: This graph shows the effect of an increasingly badly behaved network flow on the maximum throughput of another flow for each type of router.

The throughput which the standard MPLS router can provide to a single flow is not affected until the cross-traffic reaches about 90,000 pps. At the maximum cross-traffic of 97,700 pps, the throughput decreases by 4.3% to 89.6Mbit/s.

The QuaSAR best effort router again incurs a steep decrease in performance, followed by a level performance characteristic, as seen in the previous two experiments. However, in this case the sharp decrease in performance occurs when the cross-traffic reaches 40,000 pps, then again at 60,000 pps. The performance drop seen at 60,000 pps in the previous two experiments now seems to occur at 40,000 pps. Since the throughput measurement tool increases the load on the router much more than the per-packet timing measurement tool, it is expected that this performance drop would occur at an earlier cross-traffic load.

The second drop in performance suggests that, had the cross-traffic load in the previous experiments been increased above 100,000 pps, another sharp drop in performance would have appeared in their results as well. The movement of the first performance drop (from 60,000 pps to 40,000 pps) and of the second drop (from possibly 100,000 pps to 60,000 pps) is not the same between the timing and throughput experiments. However, throughput performance does not necessarily scale with increased router load in the same way as timing performance. This is especially true since the experiment measures a single network flow's maximum throughput (which is limited by the speed of the network connections), rather than the router's overall maximum throughput. Also, the load on the router is affected by the throughput measurement flow as well as by the cross-traffic flow. This explains the lack of another performance drop at 80,000 pps, which one might naively expect to find (e.g. if performance drops occur every 20,000 pps). At 80,000 pps, the throughput being achieved by iPerf is 3Mbit/s, 3.2% of that achieved at 20,000 pps. Therefore, the throughput measurement flow only exerts about 3% of the load it did when the cross-traffic was at 20,000 pps. So, although the load placed on the router by the cross-traffic flow is much greater, the load introduced by the throughput measurement tool is much less.

The QuaSAR routelet sees a much more gradual decrease in throughput performance than the best effort router. Its throughput does, however, start to decrease almost as soon as cross-traffic is injected into the router. This suggests that the overheads involved in passing about 100Mbit/s of traffic between domains (with the context switches this incurs) use almost all of the available CPU time. The QuaSAR routelet's maximum throughput is not as significantly affected by cross-traffic beyond 40,000 pps as the QuaSAR best effort router's throughput, suggesting that the virtual routelet design does provide some partitioning between network flows, even when the overheads of virtualisation are taken into account.

8.3 Future Experiments

A number of other experiments were considered to further analyse the various features of each type of router and to further evaluate the performance of the QuaSAR router. These experiments were rejected because of design issues, time limitations or equipment shortages. They are presented here as possible future work, with the reasons for their rejection.

Best Effort Cross-Traffic

When evaluating the partitioning performance of the QuaSAR routelet in Section 8.2, all of the cross-traffic was routed by a QoS routelet. While this tested the partitioning which could be provided between two QoS network flows (one well behaved and one badly behaved), it did not evaluate a QoS flow's partitioning from best effort traffic. This was a conscious choice following from a design decision made as the QuaSAR router was being built. Since the best effort router had to control the routelets when RSVP control messages arrive, it must run within a privileged Xen domain (unprivileged Xen domains have no control over other domains). The packet demultiplexing must also run within a privileged domain, since it requires direct access to the physical network devices, which can only be granted to privileged domains. It was therefore decided to run both the best effort router and the packet demultiplexing within the same privileged domain. This meant that the available CPU time of the best effort router could not be limited without also affecting the QoS routelets' packet demultiplexing time or the control code.

It would be possible to separate out the control code, the best effort router and the packet processing. A very simple domain could act as a hardware device controller and packet demultiplexer. A privileged control domain would be passed all RSVP messages, and a best effort router domain would handle all best effort traffic. These domains could be hooked up to the simple packet demultiplexing domain in the same way as the QoS routelets are, and could also be limited in their resource usage (e.g. CPU time, memory, network transmission rate, etc.) in the same way. Unfortunately, this design would have required more time to build than was available during this project. It would also have incurred extra overheads, since all packets would have to be passed from one domain (the simple packet demultiplexer) to another (the domain which processes that packet) and back again.

Scalability

One aspect of the QuaSAR router's performance which was not measured was its scalability to an increase in network QoS flows. The QuaSAR router would not have been anywhere near as scalable as a normal router, since it runs a separate Linux operating system for each network flow, and so it would quickly run out of resources (e.g. memory) as additional Linux operating systems were started to support new network flows. However, it would still have been interesting to evaluate how the QuaSAR router copes with an increasing number of network flows.

To measure the QoS parameters of each network flow to the level of detail provided in the experiments presented here, each flow would require its own source and sink hosts. If multiple flows were run from one host, then each could interfere with the others, especially since the host machines were of a significantly lower specification than the router machine. There was not enough time or equipment to set up separate machines for each network flow. There was also not enough time to create experiments which would evaluate a flow's QoS in less detail, but with greater consistency, when multiple flows are sourced by a single host.

Network Resource Partitioning

The partitioning experiments described in Section 8.2 concentrated on partitioning the CPU time assigned to each flow's routelet. However, the QuaSAR router also allows the network transmission rate of a routelet to be limited. In order to evaluate the network transmission rate partitioning provided by the QuaSAR router, a cross-traffic network flow could be set up which would be routed to the same destination as the measured flow (rather than merely passing through the same router while travelling on different network line cards from the measured flow, as in Section 8.2). This cross-traffic flow would consist of maximum sized packets, at a rate designed to saturate a 100Mbit/s network. Since large packets are used, fewer packets per second (pps) are needed to saturate the network, so the router's CPU would not be as overwhelmed as with the small packet, high pps flow used in Section 8.2; the arithmetic is sketched below. The routelet network transmission rate limiting could then be used to prevent the cross-traffic flow from saturating the destination network, providing the measured flow with a better QoS.
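
As a rough illustration, the packet rates involved differ by a factor of around 23. The calculation below ignores Ethernet framing overhead (preamble and inter-frame gap), so true line-rate figures would be slightly different.

    /* Why maximum sized packets need far fewer pps to fill 100Mbit/s.
     * Framing overhead (preamble, inter-frame gap) is ignored here. */
    #include <stdio.h>

    int main(void)
    {
        const double link_bps   = 100e6;     /* 100 Mbit/s                  */
        const double big_bits   = 1500 * 8;  /* near-maximum Ethernet frame */
        const double small_bits = 64 * 8;    /* minimum Ethernet frame      */
        printf("pps to fill link, 1500 byte frames: %.0f\n", link_bps / big_bits);
        printf("pps to fill link,   64 byte frames: %.0f\n", link_bps / small_bits);
        return 0;    /* roughly 8,300 pps versus 195,000 pps */
    }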

However, this experiment is very dependent upon the host's ability to cope with 100Mbit/s of traffic. Because of the limited hosts available during this project, this experiment would have been heavily limited by the hosts' performance. It would therefore have been very difficult to make any worthwhile measurements in this experiment.

Direct Measurement of Individual Overheads

The measurements performed here allow evaluation of the overall overheads involved in the virtualisation of a network router. Although these measurements can be used to make a rough estimate of the individual overheads, they do not provide direct evidence of which individual operations contribute most significantly to these overheads. To uncover exactly which areas of the virtualisation approach need to be modified so that the virtual routelet architecture is effective, more fine grained detail about the individual overheads needs to be measured.

To measure these overheads directly, profiling code could be added at strategic locations within the Xen virtual machine monitor, along the lines of the sketch below. This added profiling code would have to be carefully written, to ensure that it does not add significant overhead to any of the areas being profiled. This would also require detailed knowledge of almost all of the Xen virtual machine monitor's code, which would not have been possible over the timescale of this project.
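
The fragment below illustrates the kind of low-overhead probe that could be embedded in the VMM: reading the x86 time-stamp counter on entry to and exit from a region of interest and accumulating the deltas. It is an illustrative sketch, not actual Xen code, and since the probe itself costs a few tens of cycles per call, placement would need care.

    /* Illustrative low-overhead probe, not actual Xen code: read the x86
     * time-stamp counter around a region of interest and accumulate the
     * elapsed cycles. */
    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    struct probe { uint64_t total_cycles, count, t0; };

    static inline void probe_enter(struct probe *p) { p->t0 = rdtsc(); }

    static inline void probe_exit(struct probe *p)
    {
        p->total_cycles += rdtsc() - p->t0;   /* cycles spent in the region */
        p->count++;                           /* so a mean can be reported  */
    }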


Chapter 9

Evaluation

The previous chapter presented the results of the experiments which were performed on the QuaSAR router, and provided some analysis of the possible reasons behind these results. This chapter attempts to provide an overview of these results and bring together the analysis of them, in order to evaluate the QuaSAR router's overall performance. Section 9.2 goes on to relate this analysis to the overall design of the QuaSAR router, attempting to identify the areas of QuaSAR's design which prevented it from providing better partitioning between network flows than a standard router.

9.1 Experimental Results Overview

The unloaded latency experiments provide a good measure of the overhead incurred by the various components of the QuaSAR router, since they measure the actual time taken to process an individual packet. These experiments suggest that the Xen virtualisation software (without context switching between domains) incurs roughly a 7.3% processing overhead per routed packet. This compares well with the performance evaluation performed by Barham et al. [2], which found a 5% average overhead incurred by Xen virtualisation. The virtual routelet approach used by QuaSAR incurs an additional 30.1% processing overhead per packet. This suggests that the major overheads incurred by a QuaSAR routelet are caused by the context switch time and the transferring of packets between domains. The overhead incurred by a soft real time scheduler could not be fully evaluated, due to the unfinished nature of the sedf scheduler used. However, the fact that the unfinished sedf scheduler incurs only a 10% additional overhead (for QuaSAR routelets) implies that a soft real time scheduler, designed specifically for network routing, could be used by the QuaSAR router without incurring major overheads.

The measurement of unloaded jitter identifies how evenly the overheads are spread across individual packet transit times. If a large overhead is identified by an increased average packet latency, but the jitter does not increase substantially, then this suggests that the overhead affects each packet equally. Conversely, if a large increase in average latency is accompanied by a substantial increase in jitter, then it suggests that large overheads affect a small number of packets, whereas other packets are unaffected (e.g. the overhead is caused by a random event which delays 1 in 10 packets).
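
For concreteness, one common way of computing such a jitter figure (the mean absolute difference between consecutive packet latencies, in the spirit of RFC 3550) is sketched below; the precise metric used in these experiments is the one defined earlier in the thesis.

    /* One common jitter definition: the mean absolute difference between
     * consecutive packet latencies. */
    #include <math.h>
    #include <stddef.h>

    double mean_latency(const double *lat, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += lat[i];
        return s / (double)n;
    }

    double jitter(const double *lat, size_t n)
    {
        double s = 0.0;
        for (size_t i = 1; i < n; i++) s += fabs(lat[i] - lat[i - 1]);
        /* high jitter: delays hit some packets much harder than others */
        return s / (double)(n - 1);
    }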

The 7.3% overhead incurred by Xen virtualisation is accompanied by a 21.7% increase in jitter. This suggests that the overheads caused by Xen are random in nature and do not affect each packet equally. Therefore, the overheads caused by Xen virtualisation occur either because of an increase in the number of random events which can delay packets (for example, a domain may periodically yield control to the VMM, to ensure it does not hog the CPU), or because the length of random delays, present before virtualisation, increases (e.g. the periodic timer interrupt now has more work to do to ensure all domains, and the Xen VMM, are notified of time changes).

The 30.1% overhead incurred by the virtual routelet approach to network routing is accompanied by only a 7.9% increase in jitter. It seems likely, therefore, that most of the overheads incurred by the virtual routelet approach are incurred equally across packets. Since the major overheads which were identified (context switching and passing packets between domains) occur for each and every packet, this may not be surprising. However, it would be expected that passing a packet between domains would introduce more random jitter than it does, since a context switch needs to occur before the packet can be processed any further. The event which signals a packet being passed between domains does not immediately cause a context switch in Xen (otherwise thrashing between domains could easily occur); it would therefore be expected that some random packets would be significantly delayed (causing considerable jitter) as they wait for a context switch to occur. The fact that this does not happen shows that Xen correctly assumes that the privileged domain has no more work to do, and then context switches to the routelet's domain because it has an outstanding event. There are also some indirect consequences of context switching, such as cache memories and TLBs being invalidated. The lack of jitter accompanying these events suggests that a consistent set of memory locations is required during each packet's transmission, so the overhead incurred by these indirect consequences is equal for each packet.

The unloaded throughput experiments are less valuable, since their results are considerably constrained by both the maximum network speed and the performance of the hosts. The results could also be affected in unexpected ways by the TCP flow control algorithm. They do, however, provide a useful measurement of the real world performance which could be expected of the QuaSAR router. The results show that the QuaSAR router can cope with a high throughput load, while confirming the slight overhead introduced by the QuaSAR router seen in previous experiments.

The evaluation of QuaSAR’s flow partitioning capabilities identifies some other interesting aspects. Inall three experiments, the virtualisation overheads incurred by the QuaSAR router prevent it from cop-ing with greater cross-traffic loads than the standard MPLS router. However, in all three experimentsthere are areas the virtual routelet approach (QuaSAR routelet) can outperform a standard router withina virtual machine (QuaSAR best effort). Since the QuaSAR routelet approach incurs much greateroverheads, per-packet, than the QuaSAR best effort approach, this suggests that the virtual routelet ar-chitecture has some promise. The overheads incurred by the form of virtualisation used in this projectwere, however, too great for the QuaSAR router to be effective, compared with a standard router. Themajor overheads which prevented the QuaSAR router from providing better partitioning than a standardrouter are discussed in the next section.

9.2 Virtualisation Overheads

The Xen virtualisation software was built to support complex, multi-threaded, multi-address space operating systems, in order to provide virtual server farms [33]. It was not built for this type of virtual routelet architecture. A number of design decisions were therefore made in Xen which are aimed at providing support for virtual server farms, and which do not suit the virtual routelet design model. Xen must enforce security between domains, since it is designed to support several mutually untrusted domains. However, trust and security are secondary concerns for the QuaSAR router (each routelet is trusted). What is more important, in a virtual routelet architecture, is an approach which enforces cycle and bandwidth guarantees for each routelet. This section discusses some of the aspects of Xen's design which incurred major overheads in this virtual routelet architecture or prevented the enforcement of each routelet's cycle or bandwidth guarantees. It also discusses how these aspects could be designed differently to suit the virtualisation technique required by the QuaSAR router.


9.2.1 Linux OS for each Routelet

Each routelet within the QuaSAR router requires an individual Linux operating system. As the number of network flows being routed by a QuaSAR router rises, this produces an increasingly large overhead. The QoS routelets have very simple requirements, so the use of a complex multi-user operating system, such as Linux, for each routelet vastly increases the resources required by each routelet. This will, in turn, decrease the scalability of the QuaSAR router, since it will more quickly run out of memory as additional flows arrive, and unnecessary operations within the multiple Linux operating systems (such as processing timer events) decrease the usable CPU time.

As no simple operating systems were available for Xen during the course of this project, the use of a complex operating system was unavoidable. If, however, time had allowed, the scalability of the QuaSAR router, and possibly its performance under load, could have been improved by designing and implementing a simple operating system aimed directly at the router's requirements.

9.2.2 Context Switch Overhead

One major overhead in QuaSAR’s design seemed to be the time taken to context switch between do-mains. The processing of each QoS packet requires two context switches, therefore this overhead be-comes very significant as the traffic load on the router increases. Xen is designed to support multiplecomplex multi-user operating systems, each running complex software which is not necessarily trusted.Each domain, therefore, runs within its own virtual memory address space, to ensure isolation betweendomains and enforce inter-domain security. Although this guarantees that one domain cannot access an-other domain’s memory, it increases the cost of each context switch, as each context switch now requiresa change in memory address space (with a move to a different page table, and invalidation of the TLB),as well as saving and restoration of the CPU state.

The virtual routelet architecture has a very different set of requirements from the intended uses of Xen. The routelets only run a very small, pre-known set of software, which is therefore trusted. Inter-domain memory security is thus not very important to the QuaSAR router. The performance of the QuaSAR router could be substantially improved by sharing a single address space between all of the routelets, as this would substantially reduce the overhead caused by context switches between domains (a rough illustration of this cost is sketched below). This would reduce some of the partitioning between routelets (as they would now be sharing the same memory); however, if it was carefully designed, so that no guest operating system tramples on the memory of any other guest operating system, it would not affect the architecture's overall approach.
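
The cost being discussed can be illustrated, at least approximately, with a standard user-level microbenchmark: two processes forced to alternate through pipes, so that each round trip includes two context switches, each with an address space change. Xen domain switches differ in detail, but the page table reload and TLB effects are of the same nature; this sketch is an analogue, not a measurement of Xen itself.

    /* Process-level analogue of the cost under discussion: two processes
     * alternate via pipes, so each round trip contains two context switches
     * (each with an address space change) plus pipe overhead. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
        int a[2], b[2];
        char c = 'x';
        const long iters = 100000;
        if (pipe(a) || pipe(b)) { perror("pipe"); return 1; }

        if (fork() == 0) {                       /* child: echo each byte back */
            for (long i = 0; i < iters; i++) {
                if (read(a[0], &c, 1) != 1) _exit(1);
                if (write(b[1], &c, 1) != 1) _exit(1);
            }
            _exit(0);
        }
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (long i = 0; i < iters; i++) {       /* parent: ping-pong */
            (void)write(a[1], &c, 1);
            (void)read(b[0], &c, 1);
        }
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("~%.2f us per round trip (two switches)\n", us / iters);
        return 0;
    }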

9.2.3 Routelet Access to NIC

Another major problem with the design of the QuaSAR router is that it requires packets to be passed between domains multiple times. Although Xen uses page table manipulation to achieve this packet transfer, it still incurs a significant overhead (due to event messaging, scheduling between domains, and other factors). This approach was necessary when using the Xen virtual machine, since only one domain can have access to the physical networking devices (again, to ensure partitioning and security between domains).

As discussed previously, security between routelets is not very important in the virtual routelet architecture, and while partitioning of each routelet's network transmission rate is important, it could still be achieved if each routelet was given full, but rate limited, access to the network devices. However, naively giving each routelet access to the physical network devices would cause serious problems. For example, if one domain is pre-empted by another while in the middle of sending a packet to a network device, that packet would become corrupted, especially if the other domain also starts sending a packet to the same network device. A shared network device driver would have to be very carefully constructed to ensure one domain cannot interfere with another domain's packet sending or receiving. This driver would include critical sections to prevent a domain from being pre-empted when it is in the middle of sending or receiving data from a device. These shared device drivers would also need some form of rate limiting, to prevent one routelet from hogging a network device (since the routelets are trusted, the Xen VMM would not necessarily have to enforce this rate control). A sketch of such a transmit path follows.

9.2.4 Classifying Packets

A large proportion of each packet's processing time is spent classifying the packet and deciding which routelet processes that packet's flow. This processing time cannot be assigned to the flow's routelet, because the packet has not yet been classified to its particular flow. The fact that the QuaSAR router was designed to route MPLS traffic could have exacerbated this problem. MPLS was chosen because it is very simple to classify (by simply checking the MPLS label), and its processing requirements are also extremely simple (replacing the MPLS label, then encapsulating the packet in an Ethernet frame header), as sketched below. Although this simplified the design and implementation of the QuaSAR router, it meant that the difference between classification time and processing time was not as large as it could have been, thereby reducing the effectiveness of the routelet partitioning. A more complex networking protocol, such as IP, could see better partitioning with the QuaSAR router. Although this would involve more complex classification, the greater routing and processing requirements of IP could have offset this classification cost, increasing the overall proportion of each flow's processing which could be partitioned.
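
The per-packet work just described is small enough to sketch directly. The 32-bit MPLS shim header carries the 20-bit label in its top bits; forwarding amounts to a table lookup keyed on that label, a label swap, and transmission on the outgoing interface. The table layout and helper functions below are hypothetical, and a full implementation would also decrement the shim TTL.

    /* Sketch of MPLS label-swap forwarding; lookup_label() and eth_output()
     * are hypothetical helpers. */
    #include <stdint.h>
    #include <arpa/inet.h>   /* ntohl/htonl */

    #define MPLS_LABEL(shim)        (ntohl(shim) >> 12)   /* top 20 bits */
    #define MPLS_SET_LABEL(shim, l) htonl((ntohl(shim) & 0xFFFu) | ((l) << 12))

    struct fwd_entry { uint32_t out_label; int out_port; };

    extern struct fwd_entry *lookup_label(uint32_t in_label);
    extern void eth_output(int port, void *frame, uint32_t len);

    void mpls_forward(uint8_t *frame, uint32_t len)
    {
        uint32_t *shim = (uint32_t *)(frame + 14);    /* after Ethernet header */
        struct fwd_entry *e = lookup_label(MPLS_LABEL(*shim));
        if (!e) return;                               /* unknown label: drop   */
        *shim = MPLS_SET_LABEL(*shim, e->out_label);  /* label swap            */
        eth_output(e->out_port, frame, len);          /* new Ethernet framing  */
    }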

9.2.5 Soft Real Time Scheduler

One of the most important deficiencies in the use of Xen for this project was the lack of a stable soft real time scheduler to provide effective CPU partitioning between domains. The use of the ATROPOS soft real time scheduler to partition each routelet's CPU processing was an important aspect of the original design. As discussed previously, the ATROPOS scheduler was not available, and its replacement (sedf) was not stable enough to use in the partitioning experiments. Therefore, a time shared (bvt) scheduler had to be used. This scheduler does not guarantee that the routelet processing the cross-traffic does not use more processing time than it is allowed. The cross-traffic flow's routelet may, therefore, have been using some of the CPU time assigned to the measured flow's routelet, thereby reducing the apparent effectiveness of the QuaSAR routelet. The kind of reservation such a soft real time scheduler works with is sketched below.
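
For concreteness, a scheduler of the sedf family works with reservations of the form "slice s of CPU time in every period p", and can only keep its guarantees while the total utilisation stays below 1. The sketch below shows this admission check; the field names are illustrative, not sedf's actual parameters.

    /* Period/slice reservations: each routelet is promised slice_us of CPU
     * time in every period_us, admitted only while total utilisation <= 1. */
    #include <stddef.h>

    struct reservation { double period_us, slice_us; };

    /* returns 1 if a new routelet's reservation can be admitted */
    int admit(const struct reservation *rsv, size_t n, struct reservation nu)
    {
        double u = nu.slice_us / nu.period_us;  /* e.g. 2ms/10ms = 0.2 */
        for (size_t i = 0; i < n; i++)
            u += rsv[i].slice_us / rsv[i].period_us;
        return u <= 1.0;
    }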

This lack of a soft real time scheduler was one of the most important problems in this project. The purpose of the QuaSAR router was to evaluate the effectiveness of a virtual routelet approach. The other problems reduce the performance of the QuaSAR routelet; however, this can be taken into account when evaluating the router. The lack of a soft real time scheduler prevents the QuaSAR router from fully following the design philosophy of the virtual routelet approach. Therefore, the partitioning measurements on the QuaSAR router do not accurately evaluate the partitioning performance which could be expected of a true virtual routelet approach.


Chapter 10

Conclusion

10.1 Aims

The main aim of this project was to test the hypothesis that virtual machine techniques can be used within network routers to provide effective partitioning between different network flows, and thus provide Quality of Service guarantees to individual network flows. To this end, the project involved the creation of a prototype experimental router (QuaSAR) which uses virtualisation techniques to provide QoS guarantees to network flows.

One of the primary aims of this research project, therefore, involved developing techniques which allow virtual machine technology to be used effectively within network routers to improve the QoS guarantees they can make to network flows. The virtual routelet architecture was proposed as a technique which could improve the partitioning between network flows using virtualisation methods. This aim, therefore, led to the goal of creating design techniques and tools which would enable this virtual routelet architecture, and allow the QuaSAR prototype to be built.

An important aspect of this work involved performing experiments on the prototype router, to evaluate its effectiveness and discover the areas of its design that could be improved. This evaluation should also uncover whether the underlying virtual routelet architecture is an effective way of increasing the partitioning between flows, and it should suggest ways in which this architecture can be improved.

The process of hardware virtualisation adds an inherent overhead to the cost of performing a given task. The amount of overhead incurred depends upon how well the virtualisation process matches the given task. The virtualisation technology used within this project (Xen) was not designed for the process of network routing, but was instead aimed at complex server virtualisation. Another aim of this project was therefore to evaluate the design features of Xen which incur the greatest overhead in this virtual routelet architecture, and to predict the design features needed by a virtualisation approach suited to the virtual routelet architecture.

10.2 Achievements

Many of the aims of this project were successfully met. The creation of the QuaSAR router prototype involved the development of a number of tools and techniques which enable virtual machine technologies to be used within a network router. The virtual routelet architecture was elaborated upon by the creation of the components necessary to implement this design in the QuaSAR prototype router. As such, the architecture of the virtual routelet was designed and implemented, and packet processing capabilities were added so that it could route MPLS traffic. Demultiplexing support was developed, so that packets could be sent to the appropriate routelet for processing. The complex internal networking, necessary to connect virtual routelets to the appropriate physical devices, was determined and set up as part of the QuaSAR prototype. Control code was implemented to allow RSVP messages to set up new network flows and associate those flows with virtual routelets. Finally, methods were developed which allow management of each routelet's allocation of overall resources, such as CPU time and network transmission rate.

The creation of the QuaSAR router also brought about some indirect benefits to the research community which were not specific aims of this project. The integration of various open source projects meant that a number of them had to be upgraded to support the latest Linux kernel release. As part of this process, the Click Modular Router was ported from the 2.4 kernel to the 2.6 kernel. The changes I made to the Click Modular Router, along with some preliminary porting work, were passed along to the Click community as a source code patch. These changes (modified to allow backward compatibility with previous kernel releases) have now been integrated into the Click development tree. Therefore, an indirect contribution of this project to the research community as a whole was the porting of the Click Modular Router to the most recent Linux kernel.

The Xen virtual network interface (vif) transmission rate limiting support was included in QuaSAR's original design as a means of guaranteeing a proportion of the overall router's network transmission resources to each routelet (and thus, each network flow). The original vif rate limiting support had been dropped from Xen as it evolved; I therefore reimplemented this rate limiting support for the new Xen release and added support to Xen's management tools to allow user-level control over this function (the credit-based scheme is sketched below). These improvements were contributed to the Xen community as a source code patch, whereupon they were integrated within the Xen development tree for use in the next Xen release.
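
The scheme is essentially credit based: each virtual interface is granted a byte budget per replenishment period and must stop transmitting once the budget is spent. The sketch below illustrates the idea; the structure and field names are illustrative and are not the actual patch.

    /* Illustrative credit-based vif limiter (not the actual patch). */
    #include <stdint.h>

    struct vif_credit {
        uint64_t credit_bytes;   /* bytes granted per period      */
        uint64_t credit_usec;    /* replenishment period          */
        uint64_t remaining;      /* credit left in current period */
        uint64_t period_start;   /* time of last top-up           */
    };

    extern uint64_t now_usec(void);

    int vif_may_send(struct vif_credit *v, uint64_t pkt_len)
    {
        uint64_t t = now_usec();
        if (t - v->period_start >= v->credit_usec) {  /* new period: top up */
            v->remaining = v->credit_bytes;
            v->period_start = t;
        }
        if (pkt_len > v->remaining)
            return 0;            /* out of credit: hold until next period */
        v->remaining -= pkt_len;
        return 1;
    }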

The creation of the QuaSAR router showed that it is possible to use virtual machine techniques within a router and still achieve acceptable performance. The QuaSAR router could support high throughput network flows, and achieved acceptable per-packet latencies and inter-packet jitter for use in a standard edge networking environment. Virtualisation did, however, incur significant performance overheads, particularly in regard to per-packet latencies. These overheads prevented the QuaSAR router from achieving better partitioning between network flows than a standard software router; however, when some of the virtualisation overheads were taken into account (by comparing the performance of the QuaSAR best effort router to that of a QuaSAR routelet), the virtual routelet approach seemed to show promise. The fact that the QuaSAR routelet outperformed QuaSAR best effort routing in some partitioning experiments is particularly significant, since, when unloaded, the QuaSAR routelet approach incurs 29.5% more virtualisation overhead (due to context switching, packet passing between domains and other factors) than the QuaSAR best effort router. This suggests that the virtual routelet architecture could be effective at increasing the partitioning between network flows, if the overheads of virtualisation were substantially reduced.

The evaluation of the QuaSAR router provided considerable insight into the design features of the virtualisation software used which were not appropriate for the virtual routelet architecture. These insights suggest a number of design features which would be important in the creation of a virtualisation approach appropriate to the virtual routelet architecture (such as low context switch overhead and shared access to physical devices; see Section 9.2).

In fact, as well as evaluating the effectiveness of the virtual routelet architecture, the QuaSAR router is an effective tool with which to measure different performance aspects of virtual machine technologies. The internal routing can be set up to pass packets between any number of different domains before routing them to the correct network. The latency between packet arrivals can therefore be used to measure context switch times, scheduling delays, domain memory sharing and many other virtualisation overheads. Modification of the virtual machine manager and the QuaSAR router would allow more precise measurements of each individual overhead, providing profiling of the performance of different aspects of Xen, or any other virtualisation technology. Since the virtualisation overheads are profiled externally (i.e. by measuring the latency of packets at the external hosts), the actual profiling code does not affect the measurement results. Throughput measurements of the QuaSAR router could also be used to evaluate scalability issues of virtualisation, with regard, for example, to the maximum number of context switches that can be performed within a certain period.

Overall, although the QuaSAR router could not provide better partitioning between network flows, it did uncover a number of valuable insights into potential improvements in the virtual routelet architecture (for example, in the classification stage) and the virtualisation techniques necessary to make this architecture effective.

10.3 Further Work

As well as performing the experiments described in Section 8.3, the identification of some of the deficiencies in the QuaSAR router suggests some natural extensions to this work. Firstly, each QoS routelet is only required to perform simple packet forwarding (complex packet forwarding, routing table maintenance and configuration are dealt with by the best effort router in the main guest operating system). However, even though the QoS routelets have such simple processing requirements, they still run on top of a full (albeit small) Linux operating system. This is excessive in terms of the resources required by each routelet. While the best effort router requires some of the functionality provided by a complex operating system such as Linux, the QoS routelets could be supported by a much simpler operating system. Therefore, one possible extension would be the creation of a simple operating system which performs the required routing without using extra resources, or introducing additional forwarding latency through unnecessary complexity.

Another problem with this design is the fact that every packet from a QoS flow must be demultiplexed within the best effort router before being sent to its routelet for processing. This causes possible crosstalk between flows' traffic patterns, and reduces the effective partitioning between network flows. For example, the QoS flows still require some of the resources of the best effort router, so if a router is overwhelmed with best effort traffic, the QoS flows will also suffer. This could be alleviated somewhat by demultiplexing packets to their required routelet or best effort router within hardware, rather than software. The Arsenic network interface card [31] provided the ability to demultiplex packets between virtual interfaces, based upon bit patterns in their headers. While this was created with the intention of demultiplexing packets between applications, it could easily be modified to send packets directly to the routelet processing them. This would remove the QoS flows' dependency on the best effort router's resources. While it would not remove all possible crosstalk between flows, as the resources used by demultiplexing are still shared between flows, it would at least provide partitioning between flows arriving on different network line cards. It would also move the demultiplexing to hardware, increasing its speed (possibly to network line speed) and reducing the probability of the demultiplexing stage becoming overwhelmed.

This work could also be extended by investigating the use of another method of resource virtualisation. The virtualisation software used in the prototype (Xen) was developed with the express purpose of sharing a machine's resources between full multi-threaded, virtual address space, complex operating systems. These capabilities are not required by a router. A similar effect could be realised through the use of a real time operating system with a stateless system library. Each routelet would run as a separate real time process, with the OS's scheduler giving each "routelet process" the resources it requires. The use of a stateless system library (such as that used by the Nemesis operating system [34], [17] and [21]) would provide partitioning between the routelets, as any system calls they make would be independent of each other (i.e. if a routelet made a system call, it would not have to wait for another routelet's system call to finish). Therefore, each routelet could only make use of the resources allocated to it by the system's real time scheduler. So that each QoS flow is not affected by the queues of other flows, each routelet would queue its own outgoing packets. The transmit function of the network line card's driver would be modified so that it obtains the next packet to send from the routelets' outgoing queues, choosing a queue based upon the resources allocated to each routelet, as sketched below.
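
The modified transmit function could, for example, use a deficit round-robin discipline over the per-routelet queues, with each routelet's quantum set in proportion to its allocation. The thesis does not fix a particular discipline, so the sketch below (with hypothetical peek/pop queue operations) is only one possibility.

    /* One possible discipline for the modified transmit function: deficit
     * round-robin over per-routelet queues. */
    #include <stdint.h>
    #include <stddef.h>

    struct pkt { uint32_t len; /* ... payload ... */ };

    struct routelet_q {
        uint32_t quantum;            /* bytes of credit added per visit   */
        uint32_t deficit;            /* accumulated sending rights        */
        struct pkt *(*peek)(void);   /* next packet, or NULL if empty     */
        struct pkt *(*pop)(void);    /* remove and return the next packet */
    };

    struct pkt *next_to_send(struct routelet_q *q, size_t n)
    {
        static size_t i = 0;
        for (size_t scanned = 0; scanned < n; scanned++, i = (i + 1) % n) {
            struct routelet_q *r = &q[i];
            struct pkt *p = r->peek();
            if (!p) { r->deficit = 0; continue; }  /* empty queues lose credit */
            r->deficit += r->quantum;
            if (p->len <= r->deficit) {
                r->deficit -= p->len;
                return r->pop();   /* stay on this queue while credit lasts */
            }
        }
        return NULL;  /* no queue had enough credit on this scan */
    }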

The QuaSAR prototype also suffers by being a software router using commodity x86 hardware, rather than the hardware of high performance commercial routers. The architecture presented by a hardware router is substantially different from that of commodity hardware, requiring different design techniques. A hardware router is typically designed to work on a number of packets in parallel. For example, a modern hardware router may use forwarding engines on network interface cards (NICs) to process packets independently and in parallel. These packets may then be transferred between NICs in parallel using a switching fabric. However, although hardware routers provide much better routing performance than software routers, they are much less flexible. Network processors bridge the gap between hardware and software routers. These processors provide the flexibility of software routers but are generally better suited to processing packets in parallel than commodity hardware. For example, the Intel IXP processor [1] consists of a general purpose CPU (generally used for control purposes), as well as a number of simple microengines (generally used to process multiple packets in parallel).

Developing the router resource virtualisation techniques proposed here for a network processor would be an interesting extension to this work. Network processors, such as the IXP, contain multiple processing engines (or microengines) which can process packets in parallel. Since these processing engines are independent, they could be used to partition the router's resources between network flows. For example, the processing of packets from each QoS flow could be assigned to separate processing engines. There are 16 processing engines in the IXP2800, which are obviously not enough to service the number of network flows typically processed by an Internet router, if they are assigned in this way¹. One solution to this problem would be to give each network flow its own virtual processing engine, which has limited access to the physical processing engines. These virtual processing engines could be assigned to the physical processing engines by a scheduler, based on the requirements of the QoS flows which they process. These virtual processing engines would be independent of each other, so that QoS guarantees could be made by the router.

10.4 Summary

The evaluation of the QuaSAR prototype router suggests that it is indeed feasible to use virtualisation techniques within a network router, without overly compromising the router's ability to route the high throughput, low latency traffic of a single network flow. With regard to the hypothesis that virtual machine techniques can improve a router's ability to provide partitioning between network flows, the results are ambiguous. While routing through a QuaSAR routelet provided some improvements in partitioning over the QuaSAR best effort router, the virtualisation overheads involved prevented it from ever reaching the level of performance achieved by a standard software router. This suggests that the virtual routelet architecture has some promise, if the virtualisation overheads can be reduced substantially. Evaluating the results of the QuaSAR router experiments, however, provides some valuable insight into the areas of virtualisation which could be enhanced to improve the effectiveness of the virtual routelet architecture.

¹ The IXP actually supports up to 8 hardware threads per processing engine; however, even if each QoS flow was assigned to its own hardware thread, this would only allow for 64 network flows.


Overall, this project has uncovered a number of interesting aspects of virtual machine research and novel routing architectures.


Bibliography

[1] Matthew Adiletta, Mark Rosenbluth, Debra Bernstein, Gilbert Wolrich, and Hugh Wilkinson. The next generation of Intel IXP network processors. Intel Technology Journal, 2002.

[2] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 164–177. ACM Press, 2003.

[3] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. RFC 2475: An Architecture for Differentiated Services, December 1998. http://www.faqs.org/rfcs/rfc2475.html.

[4] R. Braden, D. Clark, and S. Shenker. RFC 1633: Integrated services in the Internet architecture:an overview, June 1994. http://rfc.sunsite.dk/rfc/rfc1633.html.

[5] Andrew T. Campbell, Herman G. De Meer, Michael E. Kounavis, Kazuho Miki, John B. Vicente, and Daniel Villela. A survey of programmable networks. SIGCOMM Comput. Commun. Rev., 29(2):7–23, 1999.

[6] A.T. Campbell, H.G. De Meer, M.E. Kounavis, K. Miki, J. Vicente, and D.A. Villela. The Genesis Kernel: a virtual network operating system for spawning network architectures. In Second International Conference on Open Architectures and Network Programming (OPENARCH), pages 115–127, New York, March 1999.

[7] Mark Carson and Michael Zink. NIST Switch: a platform for research on quality-of-service routing. Internet Routing and Quality of Service, 3529(1):44–52, 1998.

[8] P. Chandra, A. Fisher, C. Kosak, T.S.E. Ng, P. Steenkiste, E. Takahashi, and Hui Zhang. Darwin: customizable resource management for value-added network services. In Sixth International Conference on Network Protocols, pages 177–188, Austin, October 1998.

[9] Cisco. Cisco IOS Switching Services Configuration Guide, Release 12.2, chapter 6, pages 123–185. Cisco Systems Inc, 2003.

[10] Tzi-cker Chiueh. Resource virtualization techniques for wide-area overlay networks. Technical report, Computer Science Department, State University of New York at Stony Brook, 2003.

[11] Bryan Clark, Todd Deshane, Eli Dow, Stephen Evanchik, Matthew Finlayson, Jason Herne, and Jeanna Neefe Matthews. Xen and the art of repeated research. In USENIX Annual Technical Conference, 2004.

[12] Dan Decasper, Zubin Dittia, Guru M. Parulkar, and Bernhard Plattner. Router plugins: A software architecture for next generation routers. In SIGCOMM, pages 229–240, 1998.

[13] J. Dike. A user-mode port of the Linux kernel. 5th Annual Linux Showcase & Conference, 2001.


[14] Rod Fatoohi and Rupinder Singh. Performance of Zebra Routing Software. Technical report,Computer Engineering, San Jose State University, 1999.

[15] Keir Fraser, Steven Hand, Rolf Neugebauer, Ian Pratt, Andrew Warfield, and Mark Williamson. Safe hardware access with the Xen virtual machine monitor. In 1st Workshop on Operating System and Architectural Support for the on demand IT InfraStructure (OASIS'04), October 2004.

[16] Didier Le Gall. MPEG: a video compression standard for multimedia applications. Commun. ACM, 34(4):46–58, 1991.

[17] Steven M. Hand. Self-paging in the Nemesis operating system. In OSDI '99: Proceedings of the third symposium on Operating systems design and implementation, pages 73–86. USENIX Association, 1999.

[18] P. Van Heuven, S. Van Den Berghe, J. Coppens, and P. Demeester. RSVP-TE daemon for Diffservover MPLS under Linux. http://dsmpls.atlantis.rug.ac.be accessed November 2004.

[19] R. Keller, L. Ruf, A. Guindehi, and B. Plattner. PromethOS: A dynamically extensible router architecture for active networks. In IWAN 2002, Zurich, Switzerland, 2002.

[20] Kevin Lai and Mary Baker. A performance comparison of UNIX operating systems on the Pentium. In USENIX Annual Technical Conference, pages 265–278, 1996.

[21] I.M. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, R. Fairbairns, and E. Hyden. The design and implementation of an operating system to support distributed multimedia applications. IEEE Journal on Selected Areas in Communications, pages 1280–1297, 1996.

[22] Christopher Metz. IP routers: New tools for gigabit networking. IEEE Internet Computing, 2(6):4–18, November 1998.

[23] D.L. Mills. Internet time synchronization: The Network Time Protocol. IEEE Transactions on Communications, 39(10):1482–1493, 1991.

[24] Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek. The Click modular router. In Symposium on Operating Systems Principles, pages 217–231, 1999.

[25] Kensuke Otake. 0sys Minimal Linux Distribution. http://phatboydesigns.net/0sys/.

[26] J. Padhye, V. Firoiu, D.F. Towsley, and J.F. Kurose. Modeling TCP Reno performance: a simple model and its empirical validation. IEEE/ACM Transactions on Networking, 8(2):133–145, 2000.

[27] Abhay K. Parekh and Robert G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the multiple node case. IEEE/ACM Trans. Netw., 2(2):137–150, 1994.

[28] B.W. Parkinson and J.J. Spilker, editors. The Global Positioning System: Theory and Applications. Progress in Astronautics and Aeronautics, v. 163–164. American Institute of Aeronautics and Astronautics, Washington, DC, 1996.

[29] Craig Partridge. A 50-Gb/s IP Router. IEEE/ACM Transactions on Networking, 6(3):237–248,June 1998.

[30] D. Plummer. RFC 826: An Ethernet Address Resolution Protocol, November 1982. http://www.faqs.org/rfcs/rfc826.html.


[31] Ian Pratt and Keir Fraser. Arsenic: A user-accessible gigabit ethernet interface. In INFOCOM,pages 67–76, 2001.

[32] Avinash Ramanath. A study of the interaction in BGP/OSPF in Zebra/ZebOS/Quagga. Technicalreport, Computer Science Department, State University of New York at Stony Brook, 2000.

[33] Dickon Reed, Ian Pratt, Paul Menage, Stephen Early, and Neil Stratford. Xenoservers: Accountable execution of untrusted programs. In Workshop on Hot Topics in Operating Systems, pages 136–141, 1999.

[34] Timothy Roscoe. Linkage in the Nemesis single address space operating system. SIGOPS Oper. Syst. Rev., 28(4):48–55, 1994.

[35] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Design and implementation of the Sun Network Filesystem. In Proc. Summer 1985 USENIX Conf., pages 119–130, Portland, OR (USA), 1985.

[36] Pascal Schmidt. ttylinux User Guide. http://www.minimalinux.org/ttylinux/docs/user_guide.pdf, version 4.5 edition, February 2005.

[37] Jonathan Sevy. Linux Network Stack Walkthrough. http://edge.mcs.drexel.edu/GICL/people/sevy/network/Linux_network_stack_walkthrough.html.

[38] Jeremy Sugerman, Ganesh Venkitachalam, and Beng-Hong Lim. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In Proceedings of the General Track: 2001 USENIX Annual Technical Conference, pages 1–14. USENIX Association, 2001.

[39] The International Engineering Consortium. Multiprotocol Label Switching (MPLS).http://www.iec.org/online/tutorials/mpls/ - Accessed November 2004.

[40] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, and K. Gibbs. Iperf: The TCP/UDP Bandwidth Measurement Tool. http://dast.nlanr.net/Projects/Iperf/.

[41] A. Whitaker, M. Shaw, and S. Gribble. Denali: Lightweight virtual machines for distributed andnetworked applications. Technical report, University of Washington, 2002.

[42] Andrew Whitaker, Marianne Shaw, and Steven D. Gribble. Scale and performance in the Denaliisolation kernel. SIGOPS Operating Systems Review, 36(SI):195–209, 2002.

[43] J. Wroclawski. RFC 2210: The use of RSVP with IETF integrated services, September 1997. http://www.faqs.org/rfcs/rfc2210.html.

[44] X. Xiao and L. M. Ni. Internet QoS: A big picture. IEEE Network, 13(2):8–18, March 1999.

[45] Lixia Zhang, Steve Deering, Deborah Estrin, Scott Shenker, and Daniel Zappala. RSVP: A newresource reservation protocol. IEEE Network Magazine, 7(5):8–18, September 1993.

[46] Junaid Ahmed Zubairi. An automated traffic engineering algorithm for MPLS-Diffserv domain.Proc. Applied Telecommunication Symposium, pages 43–48, April 2002.
