
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Linux Kernel Packet Transmission Performance in High-speed Networks

CLÉMENT BERTIER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Kungliga Tekniska högskolan

Master thesis

Linux Kernel packet transmission performance in high-speed networks

Clément Bertier

August 27, 2016


Abstract

The Linux kernel protocol stack is getting more and more additions as time goes by. As new technologies arise, more functions are implemented, which might result in a certain amount of bloat. However, new methods have been added to the kernel to circumvent common throughput issues and to maximize overall performance under certain circumstances. To assess the ability of the kernel to produce packets at a given rate, we will use the pktgen tool.

Pktgen is a loadable kernel module dedicated to UDP-based traffic generation. Its philosophy is to sit low in the kernel protocol stack in order to minimize the overhead caused by the usual APIs. As measurements are usually expressed in packets per second rather than bandwidth, the UDP protocol makes perfect sense to minimize the time spent creating a packet. Pktgen has several options which will be investigated, and for further insight its transmission algorithm will be analysed.

But software is not just a compiled piece of code; it is a set of instructions run on top of hardware. This hardware may or may not comply with the design of one's software, making execution slower than expected or, in extreme cases, not functional at all.

This thesis aims to investigate the maximum capabilities of Linux packet transmission in high-speed networks, e.g. 10 or 40 Gigabit per second. To gain a deeper understanding of the kernel's behaviour during transmission we will use profiling tools such as perf and the newly adopted eBPF framework.


Sammanfattning

The Linux kernel protocol stack receives more and more additions as time goes by. As new technologies arise, more functions are implemented, which can lead to a certain amount of bloat. However, new methods have been added to the kernel to circumvent common throughput problems and to maximize overall performance under certain circumstances. To determine the ability of the kernel to produce packets at a given rate, we will use the pktgen tool. Pktgen is a loadable kernel module dedicated to traffic generation based on UDP. Its philosophy is to sit low in the kernel protocol stack in order to minimize the amount of overhead caused by the usual APIs. As measurements are usually made in packets per second rather than bandwidth, the UDP protocol makes sense to minimize the time spent creating a packet. It has several options that will be investigated, and for further insight its transmission algorithm will be analysed. But software is not just a compiled piece of code; it is a set of instructions run on top of hardware. And this hardware may or may not comply with the design of one's software, making execution slower than expected or, in extreme cases, not working at all. This thesis aims to investigate the maximum capacity of Linux packet transmission in high-speed networks, e.g. 10 Gigabit or 40 Gigabit. To go deeper into the understanding of the kernel's behaviour during transmission we will use profiling tools, such as perf and the newly adopted eBPF framework.


Contents

1 Introduction
  1.1 Problem
  1.2 Methodology
  1.3 Goal
  1.4 Sustainability and ethics
  1.5 Delimitation
  1.6 Outline

2 Background
  2.1 Computer hardware architecture
    2.1.1 CPU
    2.1.2 SMP
    2.1.3 NUMA
    2.1.4 DMA
    2.1.5 Ethernet
    2.1.6 PCIe
    2.1.7 Networking terminology
  2.2 Linux
    2.2.1 OS Architecture design
    2.2.2 /proc pseudo-filesystem
    2.2.3 Socket Buffers
    2.2.4 xmit_more API
    2.2.5 NIC drivers
    2.2.6 Queuing in the networking stack
  2.3 Related work – Traffic generators
    2.3.1 iPerf
    2.3.2 KUTE
    2.3.3 PF_RING
    2.3.4 Netmap
    2.3.5 DPDK
    2.3.6 Moongen
    2.3.7 Hardware solutions
  2.4 Pktgen
    2.4.1 pktgen flags
    2.4.2 Commands
    2.4.3 Transmission algorithm
    2.4.4 Performance checklist
  2.5 Related work – Profiling
    2.5.1 perf
    2.5.2 eBPF

3 Methodology
  3.1 Data yielding
  3.2 Data evaluation
  3.3 Linear statistical correlation

4 Experimental setup
  4.1 Speed advertisement
  4.2 Hardware used
    4.2.1 Machine A – KTH
    4.2.2 Machine B – KTH
    4.2.3 Machine C – Ericsson
    4.2.4 Machine D – Ericsson
  4.3 Choice of Linux distribution
  4.4 Creating a virtual development environment
  4.5 Empirical testing of settings
  4.6 Creation of an interface for pktgen
  4.7 Enhancing the system for pktgen
  4.8 pktgen parameters clone conflict

5 eBPF Programs with BCC
  5.1 Introduction
  5.2 kprobes
  5.3 Estimation of driver transmission function execution time

6 Results
  6.1 Settings tuning
    6.1.1 Influence of kernel version
    6.1.2 Optimal pktgen settings
    6.1.3 Influence of ring size
  6.2 Evidence of faulty hardware
  6.3 Study of the packet size scalability
    6.3.1 Problem detection
    6.3.2 Profiling with perf
    6.3.3 Driver latency estimation with eBPF

7 Conclusion
  7.1 Future work

A Bifrost install
  A.1 How to create a bifrost distribution
  A.2 Compile and install a kernel for bifrost

B Scripts

C Block diagrams


List of Figures

2.1 Caches location in a 2-core CPU.
2.2 Theoretical limits of the link according to packet size on a 10G link.
2.3 Theoretical limits of the link according to packet size on a 40G link.
2.4 Tux, the mascot of Linux.
2.5 Overview of the kernel [4].
2.6 How pointers are mapped to retrieve data within the socket buffer [18].
2.7 Example of a shell command to interact with pktgen.
2.8 pktgen transmission algorithm.
2.9 Example of call-graph generated by perf record -g foo [38].
2.10 Assembly code required to filter packets on eth0 with tcp port 22.

3.1 Representation of the methodology algorithm used.
3.2 Pearson product-moment correlation coefficient formula.

4.1 Simplification of block diagram of the S7002 motherboard configuration [46, p. 19].
4.2 Simplification of block diagram of the ProLiant DL380 Gen9 motherboard configuration.
4.3 Simplification of block diagram of the S2600IP [47] motherboard configuration.
4.4 Simplification of block diagram of the S2600CWR [48] motherboard configuration.
4.5 Output using the --help parameter on the pktgen script.

6.1 Benchmarking of different kernel versions under Bifrost (Machine A).
6.2 Performance of pktgen on different machines according to burst variance.
6.3 Influence of ring size and burst value on the throughput.
6.4 Machine C parameter variation against the number of cores.
6.5 Machine C bandwidth test with MTU packets.
6.6 Throughput against packet size, in millions of packets per second.
6.7 Throughput against packet size, in Mbps.
6.8 Superposition of the amount of cache misses and the throughput "sawtooth" behaviour.

C.1 Block diagram of motherboard Tyan S7002.
C.2 Block diagram of the motherboard S2600IP.
C.3 Block diagram of the motherboard S2600CW.
C.4 Patch proposed to fix the burst anomalous cloning behaviour.


List of Tables

2.1 PCIe speeds.
2.2 Flags available in pktgen.

6.1 Comparison of throughput with eBPF program.


Chapter 1

Introduction

Throughout the evolution of network interface cards towards high speeds such as 10, 40 or even 100 Gigabit per second, the amount of packets to handle on a single interface has increased drastically. Whilst the enhancement of the NIC is the first step for a system to handle more traffic, there is an inherent consequence to it: the remainder of the system must be capable of handling the same amount of traffic. We are in an era where the bottleneck of the system is shifting towards the CPU [1], due to a more and more bloated protocol stack.

To ensure the capability of the operating system to produce or receive a given amount of data, we need to assess it with the help of network testing tools. There are two main categories of network testing tools: software and hardware based. Hardware network testing tools are usually seen as accurate, reliable and powerful in terms of throughput [2], but expensive nonetheless. While software-based testing might in fact be less trustworthy than hardware-based testing, it has a tremendous advantage in malleability. Modifying the behaviour of the software (e.g. for a protocol update) is easily realized; on the other hand this is not only complex in the case of hardware, but also likely to increase the price of the product [3], and usually impossible for the consumer to tamper with, as such devices are commonly proprietary products. There is no better solution between the two; they are different approaches to the same problem, and hence testing a system from both perspectives, if possible, is recommended. However in this document we will focus solely on software testing, as we did not have specialised hardware.

The Linux operating system will be used to conduct this research, as it is fully open-source and recent additions aiming to enable high performance have been developed for it. It is based on a monolithic-kernel design, meaning the OS can be seen as split into two parts: kernel-space and user-space [4]. The kernel-space is a contiguous chunk of memory in which everything related to the hardware is handled, as well as core system functions, for instance process scheduling. The user-space is where regular user programs are executed; they have much more freedom, as they ignore the underlying architecture and access it through system calls: secure interfaces to the kernel.

The issue in this model for a software-based network tool is the trade-off regarding the level at which the software is located: a user-space network testing program is likely to be slowed down by the numerous system calls it must perform, and has no control over the path the packet is going to take through the stack. A kernel-space network testing program will be faster but much more complex to design, as respecting the rules within the kernel is paramount to its stability: since the kernel is a single address space, any part of the kernel can call any other function located in the kernel. This paradigm can result in disastrous effects on the system if not manipulated cautiously.

As we require high performance to achieve line rate we will therefore use a kernel-space network testing tool: pktgen [5]. It is a purely packet-oriented traffic generator which does not mimic actual protocol behaviour, located at the lowest level of the stack, allowing minimum overhead. Its design takes full advantage of the symmetric multiprocessing capabilities found on commodity hardware nowadays, enabling parallelization of tasks by having a separate queue for each CPU. Due to the overhead caused by the treatment of each packet, we will orient our research towards performance with minimum-sized packets, which is also the common practice within network testing assessment. However, MTU-sized packets are a good way to benchmark the ability of a system to handle a maximum amount of throughput, as smaller-sized packets should never yield a higher throughput given the same parameters. A notable advantage of pktgen is that the module is part of the official kernel, therefore it does not require any additional installation and can be found in all common distributions.

Getting a low-level traffic generator is not enough to tell whether the system is correctly optimized, since it does not always reveal the bottleneck of the system. To go deeper into the performance analysis we must obtain a profile: an overview of the current state of the system. In order to perform such investigations we will use perf_events [6], a framework to monitor the performance of the kernel by watching well-known bottleneck functions or hardware events likely to reveal problems, and outputting a human-readable summary.

To complete the profiling done by perf, we will use the extended Berkeley Packet Filter, a.k.a. eBPF [7]. It is an in-kernel VM designed to be safe (e.g. programs cannot crash the kernel and must terminate) thanks to strict verification of the code before it is executed. It can be hooked onto functions and will be used to monitor certain points of interest by executing small programs in the kernel and reporting the results to user-space.

1.1 Problem

While the speed of network interface cards increases, Linux's protocol stack is also gaining more additions, for instance to implement new protocols or to enhance already-existing features. More packets to treat, as well as more instructions per packet, intrinsically end up in heavier CPU loads. However, some countermeasures have been introduced to mitigate the performance loss due to outdated designs in certain parts of the kernel: for instance NAPI [8], which reduces the flow of interrupts by switching to a polling mode when overwhelmed, or the recently added xmit_more API [9], which allows bulking of packets in order to defer the usual per-packet actions to groups of packets. Considering all the recent improvements, can the vanilla kernel scale up to performance high enough to saturate 100G links?

We will assess the kernel's performance at the lowest possible level to avoid as much overhead as possible, therefore allowing maximum packet throughput, hence the use of pktgen. It is important to understand that pktgen's results will not reflect any kind of realistic behaviour, as its purpose is to determine the performance of a system by doing aggressive packet transmission, and the absence of overhead is the key to its functionality: it has to be seen as a tool to reveal underlying problems rather than a reflection of regular protocol-stack overhead. In other words, it is the first step in verifying a system's transmission abilities and should therefore be seen as an upper bound to real-life transmission. This implies that if the results fall below the maximum NIC speed, actual transmission scenarios cannot exceed that result either. The follow-up question is: can pktgen's performance scale up to 100G-link saturation?

Ideally the performance indicated by pktgen should be double-checked, meaning a second method is needed to attest the accuracy of pktgen's reported performance. Hence we will use eBPF to bind a program onto the NIC driver's transmission function in order to measure the throughput. Can eBPF correctly quantify the amount of outgoing packets, knowing each call is potentially in the order of nanoseconds? If so, do the measured performances match pktgen's results?

We hypothesize that, with the current technologies added to the kernel, we will be able to reach line rate at 100G with minimum-sized packets using the pktgen module, given proper hardware.

1.2 Methodology

We will use an empirical approach throughout this thesis. The harvesting method will consist in running pktgen while modifying certain parameters to assess the impact of the given parameter on the final result. This will be done by iterating over the parameters with a stepping big enough to finish within a reasonable amount of time, but small enough to pinpoint any important change in the results. The value of the stepping will require prior tuning. Each experiment, in order to assert its validity, has to be run several times and on different machines with similar software configuration. To make the results human-readable and concise they will be processed into relevant figures, comparing the parameters which were adjusted with their related pktgen experiment results. To realize the performance assessment the following experiments will be carried out, in order:

• Draw a simple baseline with straightforward parameters.

• Verify whether the kernel version improves or degrades the performance, and select the best-suited one for the rest of the experiments.

• Assess the performance of the packet bulking technique through the xmit_more API option of pktgen, and verify whether it improves the packet throughput.

• Tamper with the size of the NIC's buffer in an attempt to increase the performance of packet bulking.

• Find the optimal performance parameter of pktgen.

We will also be monitoring certain metrics through profiling frameworks, which are not guaranteed to be directly correlated with the experiment. To test the linear correlation of two variables (i.e. an experiment result and a monitored metric) we will use a complementary statistical analysis with the help of the Pearson product-moment correlation coefficient.
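For reference, the Pearson product-moment correlation coefficient (shown again in Figure 3.2) of two samples $x$ and $y$ of size $n$ is:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}} $$

where $\bar{x}$ and $\bar{y}$ are the sample means. The coefficient $r$ lies in $[-1, 1]$: values close to $\pm 1$ indicate a strong linear relationship between the two variables, while values close to $0$ indicate no linear relationship.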

1.3 Goal

As the performance reached by NICs is now extremely high, we need to know whether the systems they are supposed to be used with are capable of handling the load without any extra products or libraries. The purpose is to understand whether or not 100Gb/s NICs are in fact of any use for vanilla Linux kernels. Therefore the goal is to provide a study of the current performance of the kernel.

1.4 Sustainability and ethics

As depicted above, the goal is to assess the ability of the kernel to output a maximum amount of packets. In other words, with perfect hardware and software tuning the system should be able to reach a certain rate of packets per second, sent over a wire. Whilst this does not address the environmental aspect directly (e.g. power-saving capabilities are disabled in favour of performance), assessing the global tuning of the system will logically help to understand whether a system is using more resources than it should, hence also indirectly assessing its power consumption. If an issue that reduces the global throughput is revealed, it could imply that machines running under the same configuration also have to spend extra computing power to counteract the loss, bringing ecological issues on a larger scale.

1.5 Delimitation

To limit the length of the thesis and impose boundaries that avoid going into too many endeavours, we will solely focus on the transmission side. Only simple examples of the use of eBPF will be provided, to avoid going into too much detail about that framework. Regarding the bulking of packets, we will exclusively look into (packet) throughput performance, while in reality such an addition might introduce some latency and could therefore create dysfunctions in latency-sensitive applications. The kernel should not be modified with extra libraries specialized in packet processing.

1.6 Outline

The thesis is divided as follows:

• Chapter 2 will provide a background on hardware, software, profiling and pktgen uses.


• Chapter 3 will explain the methodology used behind the experiments.

• Chapter 4 will summarize the experimental setup including:

– Detailed hardware description.

– Research behind the performance optimization.

– Practical description of the realization of the experiments and how they were exported into consequential data.

– How a prototype of an interface for pktgen was realized to standardize the results.

• Chapter 5 will be a brief introduction to BCC programming, presenting the structure used to create programs with the framework.

• Chapter 6 will present the most probing results from the experiments as graphical data, along with their associated analysis.

• Chapter 7 will conclude and wrap up the results.


Chapter 2

Background

This section is dedicated to providing the required knowledge for the reader to fully understand the results at the end of the thesis. Going into deep details of the system was necessary to interpret the results, and hence a great part of this thesis was dedicated to understanding various software and hardware techniques and technologies. To do so we will follow a path divided into several sections:

• Firstly we will introduce technical terms related to hardware, as those factors will be investigated to give a deeper overview of the system. This will be done by examining different bottlenecks, like the speed of a PCIe bus or the maximum theoretical throughput on an Ethernet wire.

• Secondly we will dig into the inner workings of the Linux operating system, mainly to understand the global architecture of the system, but also to provide insights into how the structures and different sections interact to transmit a packet over the wire. This will include interaction with the hardware, hence a brief study of the drivers.

• Then we will review the related work accomplished on software traffic generation to compare the perks and drawbacks of each solution. A thorough study of the pktgen module will then be realised, from its internal workings to the parameters with the greatest influence on throughput performance.

• Last but not least there will be a brief introduction to profiling, which consists of tracing the system to assess its choke-points by analysing the amount of time spent executing functions. We will also explain eBPF, an extended version of the Berkeley Packet Filter originally created for simple packet analysis, which is now a fully functional in-kernel virtual machine that may be used to investigate parts of the kernel by binding small programs to certain functions.


2.1 Computer hardware architecture

As we are going to introduce numerous terms that are closely tied to the hardware of the machine, this section will clarify most of them for the reader.

2.1.1 CPU

A CPU, or central processing unit, is the heart of the system, as it executes all the instructions stored in memory.

CPU Caches are a part of the CPU that stores data which will supposedly be needed again by the CPU. An entry in the cache table is called a cache line. When the CPU needs to access data, it first checks the cache, which is implemented directly inside the CPU. If the needed data is found, it is a hit, otherwise a miss. In case of a miss, the CPU must fetch the needed data from the main memory, making the whole process slower. In principle, the size of the cache needs to be small, for two reasons: firstly because it is implemented directly in the CPU, making space scarce, and secondly because the bigger the cache, the longer the lookup, therefore introducing latency inside the CPU.

Multi-level caches are a way to counteract the trade-off between cache size and lookup time. There are different levels of caches, which are all "on-chip", meaning on the CPU itself.

• The first-level cache, abbreviated L1 Cache, is small, fast, and the first one to be checked. Note that in real-life scenarios this cache is actually divided in two: one cache that stores instructions and one that stores data.

• The second-level cache, abbreviated L2 Cache, is bigger than the L1 cache, with about 8 to 10 times more storage space.

• The third and last level cache, abbreviated L3 Cache, is much larger than the L2 cache, although this characteristic varies vastly with the price of the CPU. This cache is not implemented in all brands of CPUs; however, the ones used for this thesis did have it (cf. Experimental setup – Hardware used, 4.2). Moreover, L3 caches have the particularity of being shared between all the cores, which leads us to the notion of Symmetric Multiprocessing.

[Figure: two cores (CORE 0 and CORE 1), each with private L1 instruction and L1 data caches and a private L2 cache, both sharing a common L3 cache.]

Figure 2.1: Caches location in a 2-core CPU.


Please note that Figure 2.1 is a simplification of the actual architecture.

2.1.2 SMP

Symmetric Multiprocessing involves two or more processing units on a single system which run the same operating system and share a common memory and I/O devices, e.g. hard drives or network interface cards. The notion of SMP applies both to completely separate CPUs and to CPUs that have several cores. The obvious aim of such an architecture is to benefit from the parallelism of programs to maximize the speed of the overall tasks to be executed by the OS.

Hyperthreading is Intel's proprietary version of SMT (simultaneous multi-threading), which is another technique to improve symmetric thread execution, and adds logical cores on top of the physical ones.

2.1.3 NUMA

Non-Uniform Memory Access is a design in SMP architectures which states that CPUs should have dedicated spaces in memory which can be accessed much faster than the others due to their proximity. This is done by segmenting the memory and assigning a specific part of it to a CPU. CPUs are joined by a dedicated bus (called the QPI, for Quick Path Interconnect, on modern systems). The memory segment assigned to a specific CPU is called the local memory of that CPU. If the CPU needs to access another part of the memory than its own, that part is designated as remote memory, since the CPU must go through a network of bus connections in order to access the requested data. This technique aims to mitigate the issue of memory access on an SMP architecture, as a single bus for all the CPUs is a latency bottleneck in modern system architecture [10]. A NUMA system is subdivided into NUMA nodes, which represent the combination of a local memory and its dedicated CPU. With the help of the command lscpu one can view all the NUMA nodes that are present on a system. It also reports the latency to access one remote memory node from another.

2.1.4 DMA

Direct Memory Access is a technique to avoid having the CPU intervene between an I/O device and the memory to copy data from one to the other. The CPU simply initiates the transfer with the help of a DMA controller, which then takes care of the transfer between the two entities. When the transfer is done, the device involved raises an interrupt in order to notify the OS, and therefore the CPU, that the operation has been completed and that the consequential actions should be taken, e.g. process the packets in the case of reception, or clean up the buffer memory in the case of transmission.

2.1.5 Ethernet

Ethernet is the standard used nowadays for layer-2 frame transmission, and the one we will be using throughout this thesis. The minimum size of an Ethernet frame was originally 64 bytes due to the CSMA/CD technique being used on the link. The idea was to have a minimum time-slot ensured by this fixed size, so that the time taken sending those bits on the wire would be enough for all stations within a maximum cable radius to hear the transmission of the frame before it ended. Therefore, if two stations started transmitting over the common medium (i.e. wire), they would be able to detect the collision. When a collision happens, a jam sequence is sent by the station noticing it. Its aim is to make the CRC (located at the end of the frame) bogus, making the NIC discard the entire frame before computation. The minimum frame size of 64 bytes makes sense in 10Mb/100Mb Ethernet networks, as the maximum length of the cable is respectively 2500 meters and 250 meters. However, if we push the same calculation to 1000Mb, a.k.a. 1G Ethernet, the maximum length of 25 meters can be considered too small, not to mention 2.5 meters on 10G Ethernet. Whilst there are techniques in 1G Ethernet to extend the slot size while keeping the minimum frame size at 64 bytes, we will not consider them in this thesis, as we will be using 10G Ethernet, which is full-duplex and therefore needs no medium-sharing techniques. The 64-byte minimum will still be used as a standard.


In reality, when one sends a 64-byte frame on the wire, there are in total 84 bytes that must be counted per frame:

• The 64-byte frame itself, composed of:

– A 14-byte MAC header: destination MAC, source MAC and packet type.

– A 46-byte payload, typically an IP packet with TCP or UDP on top of it.

– A 4-byte CRC at the end.

• An 8-byte preamble, for the sender and receiver to synchronise their clocks.

• 12 bytes of inter-frame gap. Nothing is actually transmitted, but it is the required amount of bit-time that must be respected between frames.

Theoretical limit As shown above, for a 60-byte payload (including IP and TCP/UDP headers) we must in reality count 84 bytes on the wire. This implies that for a 10-Gigabit transmission we will have a maximum of:

$$ \text{Max} = \frac{\text{Bandwidth}}{\text{Frame size}} = \frac{10 \times 10^{9}}{84 \times 8} = \frac{10 \times 10^{9}}{672} \approx 14\,880\,952 \approx 14.88 \times 10^{6} \text{ frames per second} $$

We can conclude that the maximum number of minimum-sized frames that can be sent over a 10G link is 14.88 million per second. By applying the same calculation to 40G and 100G links we find respectively 59.52 and 148.81 million per second.
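As a quick sanity check of these figures, the short C program below (an illustrative sketch, not part of the thesis toolchain) reproduces the calculation for an arbitrary frame size, using the 84-byte on-wire size derived above (frame plus preamble plus inter-frame gap):

```c
#include <stdio.h>

/* Per-frame overhead on the wire in addition to the Ethernet frame itself:
 * 8-byte preamble + 12-byte inter-frame gap. */
#define PREAMBLE_BYTES  8
#define IFG_BYTES       12

/* Maximum frames per second for a given link speed (bit/s) and frame size (bytes). */
static double max_fps(double link_bps, unsigned frame_bytes)
{
    double wire_bits = (frame_bytes + PREAMBLE_BYTES + IFG_BYTES) * 8.0;
    return link_bps / wire_bits;
}

int main(void)
{
    double links[] = { 10e9, 40e9, 100e9 };

    for (int i = 0; i < 3; i++)
        printf("%5.0f Gbit/s: %.2f Mpps with 64-byte frames\n",
               links[i] / 1e9, max_fps(links[i], 64) / 1e6);
    return 0;
}
```

Running it prints 14.88, 59.52 and 148.81 Mpps for the three link speeds, matching the figures above.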

[Figure: "Maximum amount of packets per second to size on 10G-Link" – Mpps (0–16) plotted against packet size (200–1400 bytes), single curve labelled "Limit".]

Figure 2.2: Theoretical limits of the link according to packet size on a 10G link.


[Figure: "Maximum amount of packets per second to size on 40G-Link" – Mpps (0–60) plotted against packet size (200–1400 bytes), single curve labelled "Limit".]

Figure 2.3: Theoretical limits of the link according to packet size on a 40G link.

Figure 2.3 will be useful as a benchmark during our experiments, as it represents the upper bound.

2.1.6 PCIe

Peripheral Component Interconnect Express, usually called PCIe, is a type of bus used to attach components to a motherboard. It was developed in 2004 and, as of 2016, its latest release is version 3.1, but only 3.0 products are available. A new 4.0 standard is expected in 2017. PCIe 3.0 (sometimes called Revision 3) is the most common type of bus found among high-speed NICs, because the other standards are in fact too slow to provide the bus speed required to sustain 40 or even 10 Gigabit per second if the number of lanes is too small (see the next paragraphs).

Bandwidth To actually understand the speed of PCIe buses we must define the notion of "transfer", as the speed is given in "gigatransfers per second" (GT/s) in the specification [11]. A transfer is the action of sending one bit symbol on the channel; it does not by itself tell us the amount of useful data sent, because one needs the channel encoding to compute it. In other words, without the number of payload bits carried per transfer we cannot calculate the actual bandwidth of the channel. Skipping the complex design details, PCIe versions 1.0 and 2.0 use an 8b/10b encoding [11, p. 192]. This forces 10 bits to be sent for every 8 bits of data, implying an overhead of $1 - \frac{8}{10} = 20\%$ for every transfer. The 3.0 revision uses a 128b/130b encoding, limiting the overhead to $1 - \frac{128}{130} \approx 1.5\%$. Knowing the encoding, we can calculate the per-direction bandwidth $B$ of a link with $n$ lanes:

$$ B = \text{Transfer rate} \times (1 - \text{overhead}) \times n $$

Table 2.1 holds the results of the bandwidth calculation; the configurations fast enough for 10G and for 40G can be read directly from it (any configuration suitable for 40G is also suitable for 10G). Using a bus bandwidth smaller than the theoretical throughput of a NIC will still function (given enough lanes for the device), but it will result in packet throttling because of the bus speed.


Version        | 1.1      | 2.0      | 3.0
Speed          | 2.5 GT/s | 5 GT/s   | 8 GT/s
Encoding       | 8b/10b   | 8b/10b   | 128b/130b
Bandwidth 1x   | 2 Gb/s   | 4 Gb/s   | 7.88 Gb/s
Bandwidth 4x   | 8 Gb/s   | 16 Gb/s  | 31.50 Gb/s
Bandwidth 8x   | 16 Gb/s  | 32 Gb/s  | 63.01 Gb/s
Bandwidth 16x  | 32 Gb/s  | 64 Gb/s  | 126.03 Gb/s

Table 2.1: PCIe speeds
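The table values can be reproduced from the formula above; the following sketch (illustrative only, assuming the per-version transfer rates and encodings listed in Table 2.1) prints the per-direction bandwidth for the usual lane counts:

```c
#include <stdio.h>

struct pcie_gen {
    const char *name;
    double gt_per_s;      /* transfer rate in GT/s */
    double efficiency;    /* payload bits per transferred bit (encoding) */
};

int main(void)
{
    struct pcie_gen gens[] = {
        { "1.1", 2.5, 8.0 / 10.0 },
        { "2.0", 5.0, 8.0 / 10.0 },
        { "3.0", 8.0, 128.0 / 130.0 },
    };
    int lanes[] = { 1, 4, 8, 16 };

    for (int g = 0; g < 3; g++) {
        for (int l = 0; l < 4; l++) {
            /* B = transfer rate * (1 - overhead) * number of lanes */
            double gbps = gens[g].gt_per_s * gens[g].efficiency * lanes[l];
            printf("PCIe %s x%-2d: %7.2f Gb/s\n", gens[g].name, lanes[l], gbps);
        }
    }
    return 0;
}
```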

2.1.7 Networking terminology

DUT Device Under Test, the targeted device whose performance we aim to assess.

Throughput The throughput is the fastest rate at which the count of test frames transmitted by the DUT is equal to the number of test frames sent to it by the test equipment [12].


2.2 Linux

The Linux operating system was started in 1991 by Linus Torvalds as a common effort to provide a fully open-source operating system. It is UNIX-based, and the usage of Linux is between 1 and 5% of the global market, implying that it is scarcely used by end users. However this data is quite unreliable, as most companies or researchers rely on publicly available data, for instance the User-Agent header passed in an HTTP request (which can be forged), or worldwide device shipments, which tend to be unreliable as well, since most laptops will at least allow dual-booting with a second OS. While not frequently used by the major part of the population, Linux is extremely popular in the server market; its stability, open-source code and constant updates make it a weapon of choice for most system administrators. Whilst it will be referred to as "Linux" in this document, the correct term would be GNU/Linux, as the operating system is a collection of programs on top of the Linux kernel and depends on GNU software.

Figure 2.4: Tux, the mascot of Linux

2.2.1 OS Architecture design

Linux is a monolithic kernel design [4, p. 7], meaning that it is loaded as a single binary image at boot, stored and run in a single address space. In other words: the base kernel is always loaded into one big contiguous area of real memory, whose real addresses are equal to its virtual addresses [13]. The main perk of such an architecture is the ability to run all needed functions and drivers directly from kernel space, making it fast. However it comes at the price of stability issues: as the whole kernel runs as a single entity, if there is an error in any subset of it, the system's stability as a whole cannot be guaranteed.

Whilst such drawbacks could seem an impediment for the OS, monolithic kernels are not only mature nowadays, but the almost-exclusive design used in industry. They stand in opposition to micro-kernels, which we will not detail as they are outside the scope of this study. But it is not realistic to talk about a "pure" monolithic kernel, as Linux actually has ways to dynamically load code inside the kernel space, more precisely pre-compiled portions of code which are called loadable kernel modules, or LKMs. As the code cannot be loaded inside the same address space that the kernel uses, the memory will be allocated dynamically [13]. The flexibility offered by LKMs is absolutely crucial to Linux's malleability: if every component had to be loaded at boot, the size of the boot image would be colossal.
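To make the LKM mechanism concrete, here is a minimal, illustrative module skeleton (our own sketch, not taken from the thesis); pktgen itself is structured this way, registering its resources in its init function and releasing them on unload:

```c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

/* Called when the module is loaded with insmod/modprobe. */
static int __init hello_init(void)
{
    printk(KERN_INFO "hello: module loaded\n");
    return 0;               /* a non-zero return would abort the load */
}

/* Called when the module is removed with rmmod. */
static void __exit hello_exit(void)
{
    printk(KERN_INFO "hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal loadable kernel module skeleton");
```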


The operating system can be seen as being split into three parts: the hardware, which is obviously fixed; the kernel space; and the user space. This segmentation makes sense when it comes to memory allocation, as explained above. The kernel space is static and contiguous; it runs all the functions that interact directly with the hardware (drivers), and its code cannot change (unless the code being executed is an LKM). The user space has much more freedom of action, as the memory it uses can be allocated dynamically, making the loading and evolution of programs quite seamless. However, to interact with hardware, e.g. memory or I/O devices, it must go through system calls. System calls are functions that make use of a service from the kernel while abstracting the underlying complexity behind simple functions.

Figure 2.5: Overview of the kernel [4]

2.2.2 /proc pseudo-filesystem

The /proc folder is actually a whole separate file-system of its own, called procfs [4, p. 126]. Loaded at boot, its purpose is to be a way to harvest information from the kernel. In reality it does not contain any physical files (i.e. files written to hard disks); all of the ones represented inside it are actually stored in the memory of the computer (a RAM-based file-system) rather than on a hard drive, also implying that they will disappear at shutdown. It was designed to gather any kind of information the user could need to inspect about the kernel, often related to performance. A lot of programs interact directly with this information to gain knowledge of the system; for instance the well-known command ps makes use of different statistics included in /proc. However it is even more powerful, as we can directly "hot-plug" functionalities from inside the kernel by interacting with /proc; for instance, which CPUs are pinned to a particular interrupt can be changed with its help. Needless to say, not all functionalities inside the kernel can be changed by simply writing a number or a string inside /proc. This becomes a key element when linked not only to the vanilla kernel but also to its modules. As explained previously, we can load or unload LKMs, and as they are technically part of the kernel we will therefore find their status and configuration interfaces in /proc.
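As an illustration of this "hot-plugging" through /proc (a sketch of our own, not from the thesis), the snippet below reads and then rewrites the CPU affinity mask of an interrupt via /proc/irq/<irq>/smp_affinity; the IRQ number and mask used here are arbitrary examples, and the write requires root privileges:

```c
#include <stdio.h>

int main(void)
{
    const char *path = "/proc/irq/24/smp_affinity";  /* example IRQ number */
    char mask[64];

    /* Read the current CPU affinity bitmask of the interrupt. */
    FILE *f = fopen(path, "r");
    if (!f || !fgets(mask, sizeof(mask), f)) {
        perror("read smp_affinity");
        return 1;
    }
    fclose(f);
    printf("current mask: %s", mask);

    /* Pin the interrupt to CPU 0 (mask 0x1); requires root. */
    f = fopen(path, "w");
    if (!f || fputs("1\n", f) == EOF) {
        perror("write smp_affinity");
        return 1;
    }
    fclose(f);
    return 0;
}
```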

Other information systems

• sysfs: another RAM-based file-system, this time whose goal is to export kernel data structures, their attributes, and the linkages between them to user space [14]. Usually mounted on /sys.

• configfs: complementary to sysfs, it allows the user to create, configure and delete kernel objects [15].

2.2.3 Socket Buffers

Socket buffers, or SKBs, are single-handedly the most important structure in the Linux networking stack. For every packet present in the operating system, an SKB must be affiliated to it in order to store its data in memory. This has to be done in kernel space, as the interaction with the driver happens inside the kernel [16]. The structure sk_buff is implemented as a doubly linked list in order to loop through the different SKBs easily. Since the content of the sk_buff structure is gigantic, we will not go into too much detail here, but here are the basics [17]:

Figure 2.6: How pointers are mapped to retrieve data within the socket buffer [18].

The socket buffers were designed to easily encapsulate any kind of protocol, hence there are ways to access the different parts of the payload by moving a pointer around and mapping its content onto a structure.


As shown in Figure 2.6, the data is located in a contiguous chunk of memory and pointers indicate the location of the structure. When going up the stack, extra pointers are mapped to easily recognize and access the desired part of the packet, e.g. the IP header or the TCP header. Important note: the data pointer DOES NOT refer to the payload of the packet, and reading from it will most likely yield gibberish values for the user. With the help of system calls, SKBs are abstracted away from user-space programs, which most likely will not make use of the underlying stack architecture. However those system calls are not accessible from inside kernel space. To decode data easily from within the kernel, pre-existing structures with the usual fields of the protocols are available, and by mapping a pointer to such a structure one can make the packet content trivially understandable.

Reference counter Another very important variable the structure holds is called atomic_t users. It is a reference counter, a simple integer that counts the number of entities using the SKB. It is implemented as an atomic integer, meaning that it must be modified only through specific functions that ensure the integrity of the data among all cores. It is initialized to the value 1, and if it reaches 0 the SKB ought to be deleted. Users should not interact with such counters directly; however, as we will see with pktgen, the latter rule is not always respected.
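For orientation, a heavily simplified, illustrative subset of the fields discussed above is reproduced below; the real struct sk_buff in include/linux/skbuff.h contains many more members and slightly different types:

```c
/* Illustrative subset of struct sk_buff (see include/linux/skbuff.h);
 * field names follow the real structure, but types are simplified. */
struct sk_buff_simplified {
    struct sk_buff_simplified *next;   /* doubly linked list of SKBs */
    struct sk_buff_simplified *prev;

    atomic_t        users;     /* reference counter, initialized to 1 */

    unsigned int    len;       /* length of the data currently held */

    unsigned char  *head;      /* start of the allocated buffer */
    unsigned char  *data;      /* start of the data for the current layer */
    unsigned char  *tail;      /* end of the data currently held */
    unsigned char  *end;       /* end of the allocated buffer */
};
```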

2.2.4 xmit_more API

Since kernel 3.18 some efforts have been made to optimize the global throughput through batching, i.e. bulking packets into a block to be sent instead of treating them one by one. Normally, when a packet is given to the hardware through the driver, several actions are performed, like locking the queue, copying the packet to the hardware buffer, telling the hardware to start the transmission, etc. [9]. The idea is to simply communicate to the driver that several more packets are coming, so that it can delay some of these actions, knowing it is a better fit to postpone them until there are no more packets to be sent. It is important to note that the driver is not forced in any way to delay its usual procedures, and is the one taking the decision. To make this functionality available to drivers while not breaking compatibility with old ones, a new boolean, xmit_more, has been added to the SKB structure. If set to true, the driver knows there are more packets to come.
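A hedged sketch of how a driver's transmit routine might honour this flag is shown below; the foo_* types and helpers are fictional placeholders, but the pattern of deferring the doorbell write while skb->xmit_more is set (unless the ring is about to fill up) reflects how drivers of that kernel generation use the API:

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative fragment of an ndo_start_xmit handler; the foo_* ring type
 * and helpers are fictional. */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct foo_tx_ring *ring = foo_select_tx_ring(dev, skb);

    /* Place the packet in the hardware descriptor ring (DMA mapping etc.). */
    foo_map_skb_to_descriptor(ring, skb);

    /* The doorbell write tells the NIC to start fetching descriptors; it is
     * comparatively expensive, so it is skipped while the stack promises
     * more packets, unless the ring is about to fill up. */
    if (!skb->xmit_more || foo_ring_almost_full(ring))
        foo_write_tail_doorbell(ring);

    return NETDEV_TX_OK;
}
```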

2.2.5 NIC drivers

NIC drivers handle the communication between the NIC and the OS, primarily packet sending and reception. There are two solutions for receiving packets:

• Interrupts: upon reception of a packet, the NIC sends an interrupt to the OS in order for it to retrieve the packet. But in the case of high-speed reception, the CPU will most likely be overwhelmed by the interrupts, as they are all executed with a higher priority than other tasks.

• NAPI: to mitigate the latter issue, interrupts are temporarily disabled and the driver switches to a polling mode (a skeleton of this pattern is sketched below). This is done through the New API, an extension to the device driver packet processing framework [8], by switching off the interrupts of a NIC when it reaches a certain threshold fixed at driver initialization [16].
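A minimal sketch of the NAPI pattern referenced above is shown here; the foo_* structure and helpers are fictional, while netif_napi_add(), napi_schedule() and napi_complete() are the standard kernel calls:

```c
#include <linux/netdevice.h>
#include <linux/interrupt.h>

struct foo_adapter {
    struct napi_struct napi;
    /* ... device-specific state (rings, registers, ...) ... */
};

/* Interrupt handler: mask device interrupts and defer the work to polling. */
static irqreturn_t foo_interrupt(int irq, void *data)
{
    struct foo_adapter *adapter = data;

    foo_disable_device_interrupts(adapter);
    napi_schedule(&adapter->napi);
    return IRQ_HANDLED;
}

/* Poll routine: process at most 'budget' packets, then re-enable interrupts
 * once the backlog is drained. */
static int foo_poll(struct napi_struct *napi, int budget)
{
    struct foo_adapter *adapter = container_of(napi, struct foo_adapter, napi);
    int done = foo_process_rx_ring(adapter, budget);

    if (done < budget) {
        napi_complete(napi);
        foo_enable_device_interrupts(adapter);
    }
    return done;
}

/* At driver initialization, the poll routine is registered with a weight:
 *     netif_napi_add(netdev, &adapter->napi, foo_poll, 64);
 */
```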

Here are the common pitfalls that can influence NIC driver performance [19]:

• DMA should have better performance than programmed I/O; however, due to the high overhead it causes, one should not allow DMA below a certain threshold.

• For PCI network cards (the only relevant type for high-speed networks nowadays) the DMA burst size is not always fixed and must be determined. It should coincide with the cache size of the CPU, making the process faster.

• Some drivers have the ability to compute TCP checksums, offloading work from the CPU and gaining efficiency thanks to optimized hardware.


2.2.6 Queuing in the networking stack

The queuing system in Linux is implemented through an abstraction layer called Qdisc [20]. Its uses range from a classical FIFO algorithm to more advanced QoS-aimed queuing (e.g. HTB or SFQ), though those methods can be circumvented if one user-level application fires multiple flows at once [21].

Driver queues are the lowest-level networking queues that can be found in the OS. They interact directly with the NIC through DMA. However this interaction is done asynchronously between the two entities (the opposite would make the communication extremely inefficient), hence the need for locks to ensure data integrity. The lowest-level function one can use to interact directly with the driver queue is dev_hard_start_xmit(). Nowadays most NICs have multiple queues, to benefit best from the SMP capabilities of the system [22]. For instance, the 82580 NIC from Intel and its variants support multiple queues. Some frameworks (e.g. DPDK, cf. 2.3.5) allow direct access to the NIC registers for better analysis and tuning of the hardware.


2.3 Related work – Traffic generators

2.3.1 iPerf

iPerf [23] is a user-space tool made to measure the bandwidth of a network. Due to its user-space design, it cannot achieve high packet rates because of the need to use system calls to interact with the lower interfaces, e.g. NIC drivers or even qdiscs. To mitigate this overhead issue the user may use a zero-copy option to make access to the packet content faster. It is able to saturate links through the use of large packets, and can even report the MTU if unknown to the user. It may measure jitter/latency through UDP packets. Both the server and the client instances of iPerf must be running to use the program. An interesting new option is the handling of the SCTP protocol in version 3. The simplicity of installation and use makes it a weapon of choice for network administrators who wish to check their configurations. It is important to note that this project is still maintained and updated frequently at the time of this thesis. Note that this is the only purely user-space-oriented traffic generation tool that we will describe here, as the performance of such tools cannot match the other, optimized frameworks. Other user-space examples include (C)RUDE [24], NetPerf [25], Ostinato [26] and lmbench [27].

2.3.2 KUTE

KUTE [28] is an in-kernel UDP traffic generator. The program is divided into two LKMs, a sender and a receiver. Once loaded, the sender will compute a static inter-frame gap based on the speed specified by the user during setup. One improvement they advertise is to directly use the cycle counter located in a CPU register instead of the usual kernel function to check the time, as the latter was considered not precise enough. Note that as this work is from 2005, this information might be outdated. An interesting aspect is that KUTE does not handle the layer-2 header, making it theoretically possible to use over any L2 network. The receiver module will provide statistics to the user at the end, when it is unloaded.

2.3.3 PF_RING

PF_RING [29] is a packet processing framework developed by the ntop company. The idea was, as for pktgen and KUTE, to put the entire program inside the kernel. However it goes a step further by proposing actual kernel-to-user-space communication. The architecture, as the name suggests, is based on a ring buffer. It polls packets from the driver buffers into the ring [30] through the use of NAPI. While it does not require particular drivers, PF_RING-aware drivers can be added and should provide extra efficiency. Entirely implemented as an LKM, it is advertised at a speed of 14.8 Mpps "and above" on a "low-end 2.5 GHz Xeon". However it is not stated clearly whether that concerns packet transmission or capture, leaving the latter statement ambiguous.

PF_RING ZC is a proprietary variant of PF_RING and is not open-source. On top of the previous features it offers an API which is able to handle NUMA nodes, as well as zero-copy packet operation, supposedly enhancing the global speed of the framework. In this version traffic generation is explicitly possible. It can also share data among several threads easily, thanks to the ring-buffer architecture coupled with zero-copying.

2.3.4 Netmap

Netmap [31] aims to reduce kernel-related overhead by bypassing the kernel with its own home-brewed network stack. They advertise 10G wirespeed (i.e. 14.88 Mpps) transfers with a single core at 1.2 GHz. Among their improvements, they:

• Do a shadow copy (snapshot) of the NIC's buffer into their own ring buffer to support batching, bypassing the need for skbuffs and hence gaining speed on (de)allocations.


• Use efficient synchronization to make the best use of the ring buffer.

• Natively support multi-queues for SMP architectures through the setting of interrupt affinities.

• Keep the API completely independent from the hardware used. The device drivers ought to be modified to interact correctly with the netmap ring buffer, but those changes should always be minimal.

• Do not block any kind of "regular" transmission from or to the host, even with a NIC being used by their drivers.

• Also handle the widely used libpcap library by implementing their own version on top of the native API.

• Expose the API through /dev/netmap, whose content is updated by polling. The packets are checked by the kernel for consistency.

It is also implemented as an LKM, making it easy to install; however, drivers might need to be changed for full usability of the framework.

2.3.5 DPDK

The Data Plane Development Kit [32] is a "set of libraries and drivers for fast packet processing". It was developed by Intel and is only compatible with Intel's x86 processor architecture. They advertise a speed of 80 Mpps on a single Xeon CPU (8 cores), which is enough to saturate a 40G link. DPDK moves its entire process into user-space, including ring buffers, NIC polling and other features usually located inside the kernel. It does not go through the kernel to push those changes or actions, as it features an Environment Abstraction Layer, an interface that hides the underlying components and bypasses the kernel by loading its own drivers. They offer numerous enhancements regarding software and hardware, e.g. prefetching or setting up core affinity, among many other concepts.

2.3.6 Moongen

Moongen [33] is based on the DPDK framework, therefore inheriting its perks and drawbacks. Moongen brings new capabilities to the latter framework by adding several paradigms as "rules" for the software: it must be fully implemented in software, and therefore run on off-the-shelf hardware; it must be able to saturate links at 10G wirespeed (i.e. 14.88 Mpps); it must be as flexible as possible; and last but not least it must support precise time-stamping/rate control (i.e. inter-packet latency). They found that the requirements were best fulfilled by implementing malleability through Lua scripting, as the language also has fast performance thanks to JIT support (cf. 2.5.2). The architecture behind Moongen lies on a master/slave interaction, set up within the script the user must provide. The master process sets up the counters, including the ones located on NICs, and the slave performs the traffic generation. An interesting feature introduced in this traffic generator is a new approach to rate control. As explained previously, NICs have an asynchronous buffer to take care of packet transmission, and the usual approach to control the rate is to wait between packets. However, the NIC might not send the packets exactly as they arrive. Instead of waiting, Moongen fills the inter-packet gap with a faulty packet: it forges a voluntarily incorrect packet checksum so that the receiving NIC will discard it upon arrival. However, this method is limited by the NIC having a minimum packet size, forcing the faulty packets to be of a certain size, which can be impractical in some situations. They advertise a speed of 178.5 Mpps at 120 Gbit/s, with a CPU clock at 2 GHz.

2.3.7 Hardware solutions

There are numerous examples that we could provide of hardware technologies oriented towards network testing, but as this work mostly focuses on the software spectrum of traffic generation, we will not expand too much on this topic. Companies like Spirent [34] or IXIA [35] provide such solutions.


2.4 Pktgen

Introduction  pktgen is a module of the Linux kernel that aims to analyse the networking performance of a system by sending as many packets as possible [5]. It was developed by Robert Olsson and was integrated into the Linux main tree in 2002 [36]. The tool is used through the procfs; all the files mentioned in the following paragraphs related to pktgen are located in /proc/net/pktgen. To interact with the module, one must write into a pre-existing file representing a kernel thread dedicated to pktgen. There are as many threads as there are cores; for instance, the file "kpktgend_0" is the file bound to the thread for core number 0. This information is important as nowadays CPUs all have SMP, hence the need to support such architectures. The user then passes commands by writing directly into those files.

# echo "add device eth0" > /proc/net/pktgen/kpktgend 0

Figure 2.7: Example of a shell command to interact with pktgen.

Figure 2.7 shows a typical example of interaction between the pktgen module and the user. By redirecting the output of the echo command, we pass the command "add_device" with the argument "eth0" to thread 0. Please note that all writing operations in the proc filesystem must be done as superuser (a.k.a. root); if the operation is unsuccessful, there will be an I/O error on the command. While this might seem slightly disconcerting at first, the design choice behind this interface is due to the module being in-kernel, making a proc directory the simplest design to allow interaction with the user.

Example  Now that the interaction between the user and the module has been clarified, here is a representative description of how pktgen is typically used, which can be logically split into 3 steps.

1. Binding: The user must bind one or more NICs to a kernel thread. Fig 2.7 is an example of such an action.

2. Setting: If the operation is successful, a new file will be created, matching the name of the NIC (or associated queue). For instance, by executing the command in Fig 2.7, a new file eth0 will be created in the folder. The user must then pass the desired parameters by writing into the latter file. A non-exhaustive list of parameters would be:

• count 100000 – Send an amount of 100000 packets.

• pkt_size 60 – Set the packet size to 60 bytes. This does include the IP/UDP headers. Note that 4 extra bytes are added to the frame by the CRC.

• dst 192.168.1.2 – Set the destination IP.

3. Transmitting: When all the parameters are set, the transmission may start by passing the parameter start to the pktgen control file pgctrl. The transmission will stop either when the writing operation is interrupted (typically CTRL+C in the terminal) or when the total amount of packets to be sent is matched by the pktgen counter. The transmission statistics, such as time spent transmitting or number of packets per second, can be found in the file(s) matching the name of the interfaces used in the second step, e.g. eth0.

While it is possible to associate one kernel thread with several NICs, the opposite is not possible. However, pktgen has a workaround to profit from multi-core capacities, by appending the number of the core after the name of the NIC: eth0@0 will result in interacting with the NIC eth0 through core 0. A complete configuration session following these three steps is sketched below.
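As an illustration, a minimal session (run as root; the interface name, packet count and destination address are arbitrary example values) could look like this:

# 1. Binding
echo "add_device eth0" > /proc/net/pktgen/kpktgend_0
# 2. Setting
echo "count 1000000"   > /proc/net/pktgen/eth0
echo "pkt_size 60"     > /proc/net/pktgen/eth0
echo "dst 192.168.1.2" > /proc/net/pktgen/eth0
# 3. Transmitting (blocks until the count is reached; results end up in /proc/net/pktgen/eth0)
echo "start" > /proc/net/pktgen/pgctrl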

2.4.1 pktgen flags

pktgen has several flags that can be set upon configuration of the software. The following list is complete and up to date, as it was directly fetched and interpreted from the latest version of the code (v2.75 – Kernel 4.4.8).

Flag              Purpose

IPSRC_RND         Randomize the source IP.
IPDST_RND         Randomize the destination IP.
UDPSRC_RND        Randomize the UDP source port of the packet.
UDPDST_RND        Randomize the UDP destination port of the packet.
MACSRC_RND        Randomize the source MAC address.
MACDST_RND        Randomize the destination MAC address.
TXSIZE_RND        Randomize the size of the packet to send.
IPV6              Enable IPv6.
MPLS_RND          Get random MPLS labels.
VID_RND           Randomize the VLAN ID label.
SVID_RND          Randomize the SVLAN ID label.
FLOW_SEQ          Make the flows sequential.
IPSEC_ON          Turn IPsec on for flows.
QUEUE_MAP_RND     Map packets to a random queue.
QUEUE_MAP_CPU     Map packets to the queue bound to the current CPU.
NODE_ALLOC        Bind memory allocation to a specific NUMA node.
UDPCSUM           Include UDP checksums.
NO_TIMESTAMP      Do not include a timestamp in packets.

Table 2.2: Flags available in pktgen.

Two flags in Table 2.2, QUEUE_MAP_CPU and NODE_ALLOC, are the most important ones to enforce the performance of the system. QUEUE_MAP_CPU is a huge performance boost because of the threading behaviour of pktgen. In short, when the pktgen module is loaded it creates a thread for each CPU core detected on the system, including logical cores; a queue is then created to handle the packets to be sent (or received) for each thread, so that they can all be used independently instead of a single queue that would require heavy concurrency control to function. It also benefits from the ability of recent NICs to do multi-queueing. Setting this flag ensures that the queue the packet will be sent to is located on the same core as the one currently treating the packet. NODE_ALLOC is obviously only needed in a NUMA-based system, and allows binding an interface (or queue, as explained) to a particular NUMA memory bank, avoiding the latency caused by having to fetch from remote memory. An example of setting these flags is given below.
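Flags are passed through the same device files as the other parameters; for instance, a configuration using the two flags above might look as follows (the device file name and node number are illustrative):

echo "flag QUEUE_MAP_CPU" > /proc/net/pktgen/eth0
echo "flag NODE_ALLOC"    > /proc/net/pktgen/eth0
echo "node 0"             > /proc/net/pktgen/eth0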

Note that within the scope of this thesis we will not be treating pktgen options that change or modify the protocol used during the transmission, e.g. VLAN tagging, IPsec, or MPLS. This is outside the scope, as we only care about maximum throughput and therefore have no use for such technologies.

2.4.2 Commands

There are quite a few commands that can be passed to the module.

1. The commands used on the threads "kpktgend_X" are straightforward: add_device adds a device (appending '@core' to the device name creates a new associated queue), and rem_device_all removes all associated devices (and their configuration) from a thread.


2. The commands used on the "pgctrl" file are also obvious: start begins the transmission (or reception) and stop ends it.

3. Most of the commands passed to the device are easily understandable and well documented in [37]. We will only list the commands that need to be explained:

• node <integer>: when the NODE_ALLOC flag is on, this binds the selected device to the wanted memory node.

• xmit_mode <mode>: set the mode pktgen should be running in. By default the value is start_xmit, which is the normal transmission mode and which we will detail further in the next paragraphs. The other mode is netif_receive, which turns pktgen into a receiver instead of a transmitter. We will not go into the details of that algorithm, as it is not charted here; however, it is summarized through a diagram in the appendix.

• count <integer>: select the amount of packets to be sent. A zero will result in an infinite loop until stopped. It is important to note that, because of the granularity of the timestamping inside pktgen, an amount of packets considered too small will result in a biased advertised speed. As a recommendation, the program must run for at least a few milliseconds, therefore the count number must match the speed of the medium.

• clone_skb <integer>: This option aims to mitigate the overhead caused by having to do a memory allocation for each packet sent. This is done by "recycling" the SKB structure used, hence sending a carbon copy of the packet over the wire. This is done through a simple incrementation of the reference counter, to avoid its destruction by the system. The integer passed as an argument is the amount of copies sent over the network for one SKB allocation. For example, by using clone_skb 1000, packets number 1 to 1000 will be the same, then packets 1001 to 2000 will be the same, etc.

• burst <integer>: This option is the most important one for maximum throughput, as testified by the experiments further on. It makes use of the xmit_more API, hence allowing bulk transmission as explained previously (see the example right after this list).
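As a sketch of a throughput-oriented configuration combining the two options above (the values are arbitrary and would need to be tuned per setup):

echo "clone_skb 1000" > /proc/net/pktgen/eth0
echo "burst 8"        > /proc/net/pktgen/eth0
echo "count 10000000" > /proc/net/pktgen/eth0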

2.4.3 Transmission algorithm

Through a code review, we will now explain the internal workings of pktgen when it comes to packet transmission. The following explanation concerns the pktgen_xmit() function located in net/core/pktgen.c.

Everything described in this section is condensed in Figure 2.8.

1. At the start, options such as burst (equal to 1 by default) are retrieved, through atomic access if necessary. The device is checked to be up and to have a carrier; if not, the function returns. This implies that, in case of the device not being up, no error is returned to the user.

2. If there is no valid SKB to be sent, or it is time for a new allocation, pktgen frees the current SKB pointer with kfree_skb() (if the pointer is null the function simply returns). A new packet is allocated and filled with the correct headers through the fill_packet() function. If the latter did not work, the function returns.

3. If inter-packet delay is required, the spin() function is fired.

4. The final steps before sending packets are to retrieve the correct transmission queue, disable software IRQs (as bottom halves could delay the traffic) and lock the queue for this CPU.

5. Increment the reference counter by the amount of bursting data about to be sent. This should not happen here, and will be discussed in section 4.8.


6. Start of the sending loop: send a packet with the xmit_more-compliant function netdev_start_xmit(). The latter takes as an argument, among others, a boolean indicating whether there is more data to come; if the SKB is unique it is set to false, otherwise it stays true until we run out of bursting data to send.

7. In case of a transmission error returned by netdev_start_xmit(), the loop exits, except if the device was busy, in which case we try once more.

8. In case of success, update the counters: number of packets sent, amount of bytes sent and sequence number.

9. If there is still data to be sent (i.e. burst > 0), go back to the start of the sending loop, also checking that the queue is not frozen. Otherwise exit the loop.

10. Exit of the loop: unlock the queue bound to the CPU and re-enable software IRQs.

11. If this is the end of all programmed transmissions, pktgen checks that the reference counter of the last SKB is 1, then stops the program.

12. Otherwise the function ends here. A condensed code sketch of this sending loop is given below.
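The following fragment is a condensed paraphrase of the loop described above, written in kernel-style C for illustration only; it is not the verbatim code of pktgen_xmit(), and some helpers (need_new_alloc, pick_tx_queue) and field names are only approximations of the actual implementation:

/* Condensed paraphrase of the pktgen_xmit() sending loop (illustrative). */
unsigned int burst = pkt_dev->burst;                  /* (1) 1 by default          */

if (!netif_running(odev) || !netif_carrier_ok(odev))  /* (1) device down: silent   */
        return;                                       /*     exit, no error        */

if (!pkt_dev->skb || need_new_alloc)                  /* (2) allocate a fresh SKB  */
        pkt_dev->skb = fill_packet(odev, pkt_dev);

if (pkt_dev->delay)                                   /* (3) inter-packet delay    */
        spin(pkt_dev, pkt_dev->next_tx);

txq = pick_tx_queue(odev, pkt_dev);                   /* (4) queue selection       */
local_bh_disable();
HARD_TX_LOCK(odev, txq, smp_processor_id());

atomic_add(burst, &pkt_dev->skb->users);              /* (5) see section 4.8       */

do {                                                  /* (6) sending loop          */
        ret = netdev_start_xmit(pkt_dev->skb, odev, txq, --burst > 0);
        if (ret == NETDEV_TX_OK) {                    /* (8) update counters       */
                pkt_dev->sofar++;
                pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
        } else if (ret != NETDEV_TX_BUSY) {           /* (7) real error: bail out  */
                break;
        }
} while (burst > 0 &&                                 /* (9) more data, queue OK   */
         !netif_xmit_frozen_or_drv_stopped(txq));

HARD_TX_UNLOCK(odev, txq);                            /* (10) */
local_bh_enable();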


Figure 2.8: pktgen transmission algorithm


2.4.4 Performance checklist

Turull et al. [36] issued a series of recommendations to make sure the system is properly configured to yield the best performance from pktgen traffic generation.

• Disable frequency scaling, as we will not focus on energy matters.

• The same goes for CPU C-states: their purpose being power saving, we should limit their use so that the CPU avoids creating latency by falling into a "sleep" state.

• Pin the NIC queue interrupts to the matching CPU (or core), a.k.a. "CPU affinity". This recommendation was already issued by Olsson [5].

• Because of the latter statement, one should also deactivate interrupt load balancing, as it spreads the interrupts among all the cores.

• NUMA affinity, which maps a packet to a NUMA node, can be a problem if the node is far from the CPU used, for instance. As explained previously, pktgen supports assigning a device to a specific node.

• Ethernet flow control has the possibility of sending a "pause frame" to temporarily stop the transmission of packets. We will disable this, as we will not focus on the receiver side.

• Adaptive Interrupt Moderation (IM) must be kept on for maximum throughput, minimizing the overhead on the CPU.

• Place the sender and receiver on different machines to avoid having the bottleneck located on the bus of a single machine.

We will later carefully adjust the parameters of the machines used with the help of scripting and/or kernel/BIOS settings where possible.


2.5 Related work – Profiling

Profiling means collecting records of a system (or several systems), called the profile. It is commonly used to evaluate the performance of a system by estimating whether certain parts of the system are being too greedy or too slow, e.g. taking too many CPU cycles for their operations compared to the rest of the actions to be executed. We will only pay attention to Linux profiling, as the entire subject is based on this specific OS, and will therefore talk about techniques that might not be shared among other commonly used operating systems (e.g. Windows or BSD based).

Two profiling systems were investigated during this thesis: the first one is perf [38], and the second one is in fact more than a profiling tool, as it has several other purposes and was only recently turned into a profiling tool in the latest kernel versions: eBPF [7].

2.5.1 perf

Perf, also called perf_events, covers a fairly broad spectrum of profiling capabilities. It is based on the notion of events, which are tracepoints that perf pre-programmed inside the kernel. The tool has several default events from different sources [39]:

• Hardware Events: use CPU performance counters to gain knowledge of CPU cycles used, cache misses and so on.

• Software Events: low-level events based on kernel counters, for example minor faults, major faults, etc.

• Tracepoint Events: perf has several pre-programmed tracepoints inside the kernel. They are located on "important" functions, meaning functions that are almost mandatory to execute for a system call to function correctly. For example, the tracepoint to deallocate an SKB structure is called sock:skb_free. The list of tracepoints used by perf can be found by running sudo perf list tracepoints.

• Dynamic tracing: this is NOT exclusive to perf; it is a kernel facility that perf uses for monitoring. The principle is to create a "probe", called kprobe if located in kernel code or uprobe if in user code. This is an interesting functionality as it brings us the ability to monitor any precise function we wish to investigate, instead of relying on general-purpose functions (tracepoints).

• Snapshot frequency: perf is able to take snapshots at a given frequency to check the CPU usage.

The more a function is called, the more samples are aggregated and the more the function is considered to be taking CPU (total percentage of samples). One of the perks is perf's ability to also record the PID and the call-stack, providing full knowledge of what and who caused the system to use the CPU, as a single function might not only be complex to pinpoint by name, but might also be called from several spots.

Kernel symbols  The kernel keeps a mapping from addresses to names to be able to translate the results of executions into human-readable output. The names can be matched to several things, but we will only pay attention to function and variable names.

Overhead calculation  With the -g option, which walks the entire call stack for the calculation of the total percentage of utilization, perf shows two percentages per function: self and children. This is because a function can obviously call other functions recursively, making the "actual" total amount of time spent in the caller function biased. Therefore the split makes perfect sense: the "self" number represents the percentage of samples attributed to the function itself, and the "children" number corresponds to the total percentage induced by the function, including the calls it performs, whose percentages are therefore included in that number too.
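A typical session producing such self/children percentages could look as follows (the recording duration, event names and profiled command are arbitrary example values):

# record call-stacks system-wide for 10 seconds, then browse self/children percentages
perf record -g -a -- sleep 10
perf report
# count specific events while running a workload
perf stat -e cycles,cache-misses -- ./some_benchmark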


Figure 2.9: Example of call-graph generated by perf record -g foo [38]

2.5.2 eBPF

BPF  Historically, there was a need for efficient packet capturing. Other programs existed, but they were usually costly. Along came BPF, the Berkeley Packet Filter, with the idea of making user-level packet capture efficient. eBPF is the extended version of BPF, as in recent versions of the kernel it has been greatly enhanced; we will discuss those differences shortly. The idea is to run user-space programs, i.e. filters, inside the kernel space. While this may sound dangerous, the code produced by the user-space MUST be secure, meaning there are only a few instructions that can actually be put inside such a filter.

To restrict the available instructions, BPF has created its own interpreted language, a sort of assembly-like instruction set.

There is a structure (linux/filter.h) that can be used by the user-space program to explicitly pass the BPF code to the kernel:

struct sock_filter {    /* Filter block */
        __u16 code;
        __u8  jt;
        __u8  jf;
        __u32 k;
};

Listing 2.1: Structure of a BPF program

The variables within this structure are:

• code: unsigned integer which contains the opcode to be executed.

• jt: unsigned char containing the address to jump to in case the test is true.

• jf: unsigned char containing the address to jump to in case the test is false.

• k: unsigned 32-bit integer, usually containing a test constant or an address to load from/store to.

To attach a filter to a socket (as it was originally designed for) one must pass through another structure:


struct sock_fprog {     /* Required for SO_ATTACH_FILTER. */
        unsigned short len;               /* Number of filter blocks */
        struct sock_filter __user *filter;
};
[...]
struct sock_fprog val;
[...]
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));

Listing 2.2: Binding to a socket

The first member is simply the number of instructions, and the second one is a pointer to the previous structure. The __user macro adds an attribute telling the kernel that the code it is about to run shall not be trusted; this is needed for security. Last but not least, to actually make the connection between the sock_fprog structure and the socket itself (assuming we correctly opened a socket with the file descriptor sock), one runs setsockopt() as shown in Listing 2.2.

The complexity of BPF programming resides in being forced to write this pseudo-assembly. The generation is done automatically by libpcap or tcpdump; however, for programs written directly in C this quickly becomes too inconvenient and should not be done by hand.

Figure 2.10: Assembly code required to filter packets on eth0 with TCP port 22.

Figure 2.10 illustrates how complex creating even a simple BPF program is. As you might recognize from Listing 2.1, each row is indeed divided into the four fields explained above.
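Such listings do not have to be written by hand: tcpdump can print the classic BPF program it compiles for a given filter expression, which is presumably how a figure like 2.10 is obtained. For example (interface name illustrative):

tcpdump -i eth0 -d 'tcp port 22'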

Extended BPF  Over the last few years, the BPF machinery has been remodelled. It is no longer limited to packet filtering, and can now be seen as a virtual machine inside the kernel thanks to its specific instruction set [7]. Breaking the shackles of packet filtering came with wholesome features which we will explain.

• The size of the registers and of the arithmetic operations switched from 32 bits to 64 bits, unlocking the power of today's CPUs, which are 64-bit oriented, at least for performance-oriented systems.


• While eBPF programs are not retro-compatible with classical BPF, the translation between the two is done before execution, making it seamless for the user.

• Instead of being bound to a socket, there is now a dedicated system call, bpf(), to be able to insert eBPF programs easily from user-space (a minimal usage sketch is given after this list).

– The system call is unique and takes as parameters the different actions that can be executed. There are wrapper functions that abstract the use of the system call, making it more human-readable.

– On execution of the system call, a verifier is run to check that the instructions are considered "secure". The program is in fact simulated to check whether any access might be problematic for the security of the system.

– Since kernel 4.4 (released at the beginning of this thesis), the bpf() syscall does not require root to be launched; however, this is of course only relevant for what is accessible to a regular user, and therefore limited to socket filters [40].

• The framework now has integrated maps:

– The maps are a simple key/value storage format.

– They can be accessed either by user-space or kernel-space.

– The key and value format can be a custom structure.

• eBPF programs can be used as kprobes, mainly because of their "secure" property making sure they will not leave the system hanging. However, certain functions are not allowed to be used as kprobes.

• eBPF programs can be used as a tc classifier.
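As a minimal sketch of the dedicated system call, the following user-space program creates an eBPF map through bpf() directly, without any wrapper library. It assumes a kernel and libc recent enough to expose <linux/bpf.h> and __NR_bpf; the map sizes are arbitrary example values:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int main(void)
{
        union bpf_attr attr;

        /* A hash map with int keys and long values, 256 entries at most. */
        memset(&attr, 0, sizeof(attr));
        attr.map_type    = BPF_MAP_TYPE_HASH;
        attr.key_size    = sizeof(int);
        attr.value_size  = sizeof(long);
        attr.max_entries = 256;

        /* glibc provides no bpf() wrapper, so the raw syscall is used. */
        int map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
        if (map_fd < 0) {
                perror("bpf(BPF_MAP_CREATE)");
                return 1;
        }
        printf("created eBPF map, fd = %d\n", map_fd);
        close(map_fd);
        return 0;
}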

Tool chain  eBPF uses the Just-In-Time (JIT) compilation method, which makes the compilation into machine code happen at run-time [41]. We will not get into the details of how this actually makes the process faster, but it is said to increase performance 3.5x to 9x for interpreted languages [42]. Note that it has to be turned on through the procfs to function, and the kernel must have the CONFIG_BPF_JIT option turned on. To generate optimized code through JIT, the tool-chain behind it is complex, but it is black-boxed through the use of the Low Level Virtual Machine (LLVM): a project that provides a modern source- and target-independent optimizer, along with code generation support for many popular CPUs. Their libraries are built around the IR language that they use to represent data [43]. To compile from the C language, the clang program is used; developed alongside LLVM, it is a fast compiler that produces code for LLVM.
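On a kernel built with CONFIG_BPF_JIT, enabling the JIT through procfs typically amounts to:

echo 1 > /proc/sys/net/core/bpf_jit_enable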

BCC  The above paragraphs show how complex getting an eBPF program from C code to an actual execution truly is, and because of the lack of examples on how to compile eBPF, even producing a typical "hello world" is not straightforward. The silver bullet to this problem is brought by the IO Visor project, which provides a "compiler collection" to automate and simplify the creation of eBPF programs: the BPF Compiler Collection [44]. Understanding and building programs with BCC was a large part of the work of this thesis, and will be detailed in the BCC programming section.


Chapter 3

Methodology

In this chapter we will be describing how the experiments were carried out.

3.1 Data yielding

During the experiments, the ability to create a lot of data without having to constantly monitor its execution quickly became a need. The principle is to create convenient methods that automatically vary the parameters, more or less aggressively according to the needs, and store the results, along with the experiment settings, into a file to be post-processed into human-understandable data, e.g. plots. The solution was to create a program that takes as parameters the setting(s) to vary, the stepping and the limit to be reached.

Figure 3.1: Representation of the methodology algorithm used


3.2 Data evaluation

The data acquired was created through empirical testing, adjustment and tuning to fit the situation. The data must follow several rules in order to be kept as a final result of this work:

• Reproducibility: the experiment must yield the same results when run with the same settings. While it may sound obvious, a lot of data has been discarded after several hundred tests due to the behaviour not being exactly reproducible. This does not necessarily mean that the results are bogus; it is either due to bad measuring or because the anomaly is spurious and would take too much time to pinpoint.

• Granularity: as this thesis focuses mostly on high performance, a single byte might or might not change the outcome of an experiment. Therefore the experiments were first run with an average stepping, meaning settings variations large enough to end a set of experiments in a reasonable amount of time (e.g. a few hours) but small enough to narrow an anomaly down to a particular range of settings. Of course, finding this trade-off has also been part of the work and required experimenting.

• Interpretation: to ensure that the results are correctly interpreted, extra profiling tests were always run to be certain that they were not compromised by another program that would conflict in any way.

3.3 Linear statistical correlation

Throughout the thesis we used profiling whose goal was to find a correlation between a problem and its origin. The idea was to create a batch of tests, measure a particular event along with each test, and see if there would be a possible match between the two sets. For instance, we would run a throughput test with pktgen while increasingly growing the size of the packet; for every experiment, we would also record the number of cache misses. To find out whether or not those factors are linearly correlated, we use the Pearson product-moment correlation coefficient.

r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}

Figure 3.2: Pearson product-moment correlation coefficient formula.

Without getting into the details of the formula, the r value it yields lies in the range -1 to 1: −1 ≤ r ≤ 1. The interpretation goes as follows: a value of 1 indicates a positive linear correlation between the two sets of data, -1 indicates a negative correlation, and 0 implies no correlation. In the case of realistic data, none of the above will ever happen exactly; rather, r will be a real value between -1 and 1, and the results will be interpreted as follows [45]:

• 0.00 ≤ |r|≤ 0.19 ”Very weak”

• 0.20 ≤ |r|≤ 0.39 ”Weak”

• 0.40 ≤ |r|≤ 0.59 ”Moderate”

• 0.60 ≤ |r|≤ 0.79 ”Strong”

• 0.80 ≤ |r|≤ 1 ”Very Strong”


Chapter 4

Experimental setup

While most of the documented effort of this thesis is aimed at unveiling the underlying framework and its associated performance bottlenecks, a lot of the concrete work resided in installing and setting up environments. Needless to say, it is irrelevant for the reader to know every detail; however, when interesting issues arose they will be mentioned.

4.1 Speed advertisement

During this research, a lot of traffic generators and libraries have been examined. No matter what their perks were, they have a common problem: the "speed advertised" by almost all of them is unreliable. This is a direct consequence of the lack of common practices within the traffic generation community; as a result, no academic benchmarks have been set for performance reviews, making the advertised speed more of a marketing figure than a factual processing indicator. There are three major factors that should be advertised to be able to assess the throughput of a traffic generator accurately.

Hardware  The first and most important factor should always be an accurate description of the machine's architecture. Describing it merely as "commodity hardware", even though it heavily implies off-the-shelf parts were used, is too vague, and such wide ranges should not be tolerated in a careful investigation. The most important criteria are (but not exclusively):

• CPU(s): model, clock speed, number of cache levels including their respective sizes, number of cores, maximum PCIe version capability.

• NIC(s): model, maximum theoretical speed, PCIe version and number of lanes, multi-queue capa-bility.

• Motherboard: model, block diagram, QPI if needed.

Those key performance criteria are obviously subject to change especially with new features being added.

Underlying software  While this factor is less relevant in some cases, e.g. DPDK which bypasses the Linux kernel, the configuration may still be relevant as it is very often subject to change and might still affect the overall performance of the system. The criteria that should be reported are straightforward: the version of the kernel used and any kind of performance-affecting options. The drivers and their versions should also be mentioned, along with any optimizations that could affect performance.

Scalability  SMP architectures being the only ones available for purchase nowadays, giving a single example of your software's performance is not good enough: the scalability of the process must be documented. And this is not limited to a single NIC; as link aggregation is a very common technique, showing the results of the software running several processes over different NICs is an excellent way to testify to its scalability.


4.2 Hardware used

This section covers hardware-specific information. In total, four machines were provided in support of this thesis, two from the KTH CoS laboratory and two from the Ericsson performance lab. Each is given a thorough description of its components and a block diagram as a summary.

4.2.1 Machine A – KTH

First and foremost, this is the machine that helped calibrate and carry out most of the experiments, as a benchmark. It did not possess the most recent hardware but was chosen because of its convenient accessibility in the laboratory at Electrum, Kista.

CPU

Model      Xeon E5520 @ 2.27 GHz
L1i Cache  32K
L1d Cache  32K
L2 Cache   256K
L3 Cache   8192K
QPI        5.86 GT/s

Motherboard  The motherboard used was the Tyan S7002. While it supports up to 2 CPUs, only one was present while carrying out the tests. This implies there are NUMA nodes; however, the machine was set up in such a way that the memory bank is always local to the unique CPU, making NUMA nodes almost irrelevant in our case. The CPU and the NIC are linked through a northbridge.

Memory  Total available memory: 31GB.

NIC  The network interface card assessed was the 82599ES 10-Gigabit controller, using an SFP+ transceiver. The driver used was ixgbe version 4.3.15.

[Block diagram summary: one Xeon E5520 (2.27 GHz) with three DDR3 memory channels, connected over QPI (5.86 GT/s) to the northbridge; the 82599ES 10-Gigabit NIC sits behind the northbridge on a PCIe 2.0 x8 link; the second CPU slot is empty.]

Figure 4.1: Simplification of block diagram of the S7002 motherboard configuration [46, p. 19]


4.2.2 Machine B – KTH

CPU (x 2)

Model      Xeon E5-2650 v3 @ 2.30 GHz
L1i Cache  32K
L1d Cache  32K
L2 Cache   256K
L3 Cache   25600K
QPI        9.6 GT/s

Motherboard  The exact name of the product is ProLiant DL380 Gen9, an HP server board from 2014. Two CPUs were used, hence implying NUMA nodes. Important note: the official block diagram was not found.

Memory  A total of 98GB of RAM was present on the system.

NIC  The NIC from Machine A was moved over to this machine to check performance differences between the two, hence the same 82599ES 10-Gigabit controller was present. The driver used was ixgbe version 4.3.15.

Distribution  To carry out experiments on pktgen, the same Bifrost 7.2 distribution along with kernel 4.4 was tested on this machine. A Fedora 23 server version, also with kernel 4.4, was used to work on eBPF.

[Block diagram summary: two Xeon E5-2650 v3 (2.30 GHz, 10 cores each) linked by QPI (9.6 GT/s), each with four DDR3 memory channels; the 82599ES 10-Gigabit NIC is attached to one CPU over a PCIe 3.0 x8 link.]

Figure 4.2: Simplification of block diagram of the ProLiant DL380 Gen9 motherboard configuration.


4.2.3 Machine C – Ericsson

CPU (x 2)

Model      Xeon E5-2658 v2 @ 2.40 GHz
L1i Cache  32K
L1d Cache  32K
L2 Cache   256K
L3 Cache   25600K
QPI        8 GT/s

Motherboard  The Intel motherboard S2600IP was used to carry out the experiments on this machine. Please note that this board turned out to be faulty, as explained in the results section, and we in no way endorse its use for experiments related to high-speed networks.

Memory  A total of 32GB of RAM was present on the system.

NIC  Intel's Ethernet Controller XL710 for 40GbE QSFP+ transceivers was connected to this machine. The driver used was i40e version 1.5.16.

[Block diagram summary: two Xeon E5-2658 v2 (2.40 GHz, 10 cores each) linked by QPI (8 GT/s), each with four DDR3 memory channels; the XL710 40-Gigabit NIC is attached to one CPU over a PCIe 3.0 x16 link.]

Figure 4.3: Simplification of block diagram of the S2600IP [47] motherboard configuration.

On a side note, we did not have physical access to this machine, but we were allowed to supervise its setup to check that the hardware was put in the correct PCI slots, as one cannot map a bus number to a physical slot from commands alone.


4.2.4 Machine D – Ericsson

CPU (x 1)

Model      Xeon E5-2680 v4 @ 2.40 GHz
L1i Cache  32K
L1d Cache  32K
L2 Cache   256K
L3 Cache   35840K
QPI        9.6 GT/s

Motherboard  The Intel motherboard S2600CWR was used to carry out the experiments on this machine.

Memory  A total of 132GB of RAM was present on the system.

NIC  Intel's Ethernet Controller XL710 for 40GbE QSFP+ transceivers was connected to this machine. The driver used was i40e version 1.5.16. As previously, we did not have direct access to the machine, but we were granted permission to check the settings.

[Block diagram summary: one Xeon E5-2680 v4 (2.40 GHz, 14 cores) with four DDR3 memory channels; the XL710 40-Gigabit NIC is attached over a PCIe 3.0 x16 link; the second CPU slot is empty.]

Figure 4.4: Simplification of block diagram of the S2600CWR [48] motherboard configuration



4.3 Choice of Linux distribution

ELX  To follow Ericsson's policy on security, it was strongly advised to install ELX, which is Ericsson's home-brew version of Ubuntu with enhanced security updates. It was primarily used to compile whatever version of the kernel was needed for the experiments and to transfer the resulting boot image to the target distribution. The installation is trivial, as there is an included GUI (as for Ubuntu) that makes all the choices for the user, e.g. ciphering the hard drive by default.

Arch Linux [49]  On the recommendation of an employee at Ericsson we installed Arch Linux, the principal reason being the very active community and the constant updates brought to the distribution. The drawback is that it does not include any kind of graphical interface by default, making the installation fairly lengthy on the command line. On the other hand, seeing that, preferably, the latest stable release of the kernel should be used to carry out experiments, it was the best choice to easily compile and make use of new kernel versions. This distribution was also mandatory as Machine C from Ericsson was pre-configured with this OS and we did not have the rights to modify it.

Bifrost – 7.2 [50]  Bifrost is a distribution that aims to provide a small, network-oriented Linux system. Its small size is the result of a no-frills mentality, stripping down a lot of commands and programs that are commonly found (e.g. Python, Perl) but adding extra packages designed to monitor and help manage the network-related attributes of the machine. This distribution's kernel is not trivial to modify, as it has a special initramfs that has to be compiled with the kernel in order for it to work. On another side note, in order to easily boot different kernels, a Syslinux bootloader is included with the Bifrost image by default. While this avoids the need to install one, tweaking its content is a rather painful manoeuvre, and we made the choice of installing grub to simplify the process of updating through a single command. The installation of Bifrost is fairly simple, as there are two ways to install it: either decompress the OS directly at the root of the key, in which case several commands must be executed in order to install the boot-loader, or make a carbon copy of the image provided on their website. This second method comes with the drawback of a mandatory ext2 file-system with a fixed size of 1 GiB; however this can be extended through several commands, cf. appendix A.1.

Ubuntu – 16.04 [51]  For the same reason as Arch Linux, Ubuntu was mandatory as it was installed on Machine D from Ericsson and we did not have the rights to modify it.

Fedora – 23 [52]  As we ran into numerous troubles with the installation of the BCC framework (notably a total breakdown of the pacman package manager on Arch), we ultimately decided to switch to a distribution which we were used to manipulating, and that had pre-compiled binaries for the framework. We went for the server version to get fewer graphical-interface-bloated packages, as having a GUI on a system installed on a USB stick may cause severe latency at boot.

4.4 Creating a virtual development environment

As we did not have direct access to machines upon our arrival at Ericsson, we decided to set up a virtual machine to be able to develop without risking the safety of our machine and especially the office network. Therefore we installed, on referral from a colleague, Arch Linux. We then compiled our own version of the kernel 4.4 to acclimate ourselves to the procedure. However, the limits of such an infrastructure were quickly reached, not only performance-wise: since a virtualized architecture is substantially different from an actual one, several problems occurred. For instance, trying to profile the virtual machine became a hassle as perf did not have access to hardware events.

4.5 Empirical testing of settings

A good length of our time was spent trying to find settings that could influence the overall performance of pktgen, and hence the kernel itself. The first step was to check a large range of pktgen parameters and see whose presence or absence led to the most significant change. Quickly, the burst variable along with clone_skb turned out to sky-rocket the overall traffic. Running a "vanilla" pktgen experiment, meaning without options supposed to enhance the speed or latency of the system, turned out to be quite slow, but drew a good baseline to compare against.

An important note regarding pktgen experiments using the Bifrost distribution: version 7.0 and onwards includes patches from Daniel Turull that have not been added to the official kernel tree. This is important since those patches concern the receiver side, and since we only care about transmission we can discard the change. Moreover, during our profiling of the system, the functions introduced by those patches often turned out to gather the largest amount of samples, implying they are still called from the sender side somehow and perhaps lower the maximum throughput. Note that this could be a side-effect of perf instead of an actual problem.

This sole process of repeated trials, while automated through scripts, took at least a hundred hours to carry out all the required experiments, usually because we ran several nested loops to see whether two parameters somehow conflict or benefit each other depending on their values.

The scripting itself was first realised with a bash script. Interesting note: on Bifrost the built-in echo command does not function correctly when redirected towards /proc/net/pktgen, hence one should use /bin/echo instead. To avoid having to constantly monitor the experiment and manually stop it, we always limit the amount of packets. As explained in the literature review, pktgen must run for at least a few milliseconds to ensure reliable results. To enforce this rule, we always set at least a million packets when running minimum-sized packet transfers; this is usually enough for 10G and 40G networks. After a pktgen experiment was run, the results are caught and stored in a simple text file, as there is no requirement for complex encoding or compression. Moreover it makes the operation fast, making the loop run faster.

Post-processing  To make interesting data out of the harvested results, we used Perl scripts to easily loop through them. With the -n or -p switch, Perl adopts a behaviour close to that of the awk command, but provides more flexibility with its built-in regular-expression engine, making the recognition of text patterns easy. As we usually ran the script for 1 to 8 or 10 cores, the amount of lines expected was easily calculated (e.g. if run on 5 cores, 5 lines of results are expected), which allowed short and elegant parsing solutions like the one provided in appendix B, even when looped over several hundred times.
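As a toy illustration of this style of parsing (the file name is an assumption, and we only rely on pktgen result lines containing a packet rate followed by "pps"):

perl -ne 'print "$1\n" if /(\d+)pps/' results.txt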

4.6 Creation of an interface for pktgen

While we believe that pktgen is not suited for people to whom kernel performance is of little interest, who would perhaps prefer the plain bandwidth tests perfectly served by tools like iperf, we think that having data from a larger set of users would be interesting. But we believe the kernel interaction through the procfs is too esoteric for pktgen to be adopted by some users. On top of that, the documentation provided in [37] never in fact clearly stipulates how to interact with /proc, which can be misleading for neophytes. In the same documentation the links at the bottom are not reachable any more; on the other hand, in the kernel source tree the directory samples/pktgen is filled with concrete examples. We created a program whose aim would be to kill two birds with one stone:

• Provide a simple command-line interface for pktgen. This includes short-cuts for the different settings to be provided, and the possibility to apply them to several threads at once instead of having to program each thread one by one. It also stores the current configuration to a subsidiary configuration file, allowing the user to re-create a carbon copy of the experiment.

• Standardize the performance results of pktgen through the ability to easily export its results together with numerous system metrics that might be of influence. As said previously, performance figures turn out to be meaningless if not coupled with several paradigms, hence the program aimed to export a portable format that could be

– Parsed by the original program to produce a simple, and if needed reduced, output.

– Understood by browsers, as sharing on blogs/websites is a common practice in the kerneldevelopment community.

– Pretty easily human read if needed.

To fulfil the above requirements, the output was produced in the JSON format.

This program was also created because of a simple problem: constantly varying parameters with scripts quickly became messy, as constant editing of the same file, or having different versions of the same file, often ended up in confusion, at least on the scale of thousands of different experiments over several machines. The program was written in the Perl language for several reasons:

• We already had a certain affinity with it.

• It is included with most Linux distributions.

• It has more advanced features than Bash.

• Several sample scripts included with the Linux kernel are already written in Perl.

• It is allegedly the language with the most performance in text parsing [53].

The software is about 300 lines of code and stores the custom configuration to a temporary file, to allow re-using the same configuration very easily. It prints the available options when called with the --help parameter.

The primary strength of this script is to allow configuring all threads on a single line: when passing the -t argument you can give either:

• A single integer

• A list separated by commas

• A range separated by a dash.

For example: pktgen -t 0-3 -d eth0 -c 10000 -b 10 will configure pktgen threads 0 to 3 included, to use the interface (device) eth0 with 10000 packets and an associated burst value of 10. To launch the program, simply do pktgen run or append run to the previous command; re-issuing the same command will launch the exact same configuration. To ensure two instances of the script cannot run concurrently, lockfiles were added. When the -w FILE parameter is given, the output will be written to the given file; it will contain the results from each thread along with the entirety of the parameters pktgen took.


pktgen v0.1
-p            Print current configuration.
-r            Remove all devices.
-f            Flush: clean all configuration and remove all devices.
-t NUMBER     Bind actions to a specific thread number.
-d INTERFACE  Bind actions to a specific interface. Mandatory.
-c NUMBER     Set number of packets. 0 for infinite generation.
-s NUMBER     Set size of packet. Minimum 60, maximum should be MTU.
-D NUMBER     Set delay between packets.
-C NUMBER     Set amount of cloned packets.
-b NUMBER     Set amount of bursted packets.
-md MAC       Modify MAC destination address.
-ms MAC       Modify MAC source address.
-ad IP        Modify IP destination address.
-as IP        Modify IP source address.
-w FILE       Output the results to a JSON file.


Figure 4.5: Output using the --help parameter on the pktgen script.

4.7 Enhancing the system for pktgen

As explained in 2.4.4 there are several things we can tune for pktgen to achieve maximum throughput.

Disabling frequency scaling  The purpose is to avoid frequency scaling, which could skew the results, especially if the transmission is short. Frequency scaling is a great power saver and should not be disabled on a normal basis [54]. On kernels post version 3.9 the frequency scaling is in fact regulated through a driver; for Intel CPUs, which were the only brand tested here, the driver called intel_pstate manages the frequency scaling. There are ways to interact with the driver; however, the commands are not available on all distributions, so to facilitate and generalize the procedure we simply disabled it. To do so, one must add intel_pstate=disable to the kernel boot line. For instance, if you use Syslinux:

1. Search for the configuration file. Usual locations are /boot/syslinux/syslinux.cfg, /syslinux/syslinux.cfg and /syslinux.cfg.

2. With a text editor, open the file and find the section matching your kernel version (uname -r prints it).

3. On the line starting with "APPEND", add intel_pstate=disable.

4. Reboot.

You can now set a CPU frequency governor for your cores [55]; it is the policy your CPU will follow. We have to select the "performance" governor to make sure the frequency stays at its maximum potential without variation. To do so, you must write into the /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor file. Example: echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

To check that the frequency stays put, you can monitor the frequency of all cores by running watch 'grep "cpu MHz" /proc/cpuinfo'.

If you see variation, the governor was not set properly or the driver is still in place.

IRQ Pinning  The kernel sets an interrupt "affinity" for each interrupt that is registered, which can be translated as a list of CPUs allowed to catch and treat the interrupt. This is implemented as a bit-mask corresponding to the allowed cores. The list of interrupts registered by the OS, together with their associated numbers, can be found in /proc/interrupts. To check the allowed cores we must check the value of /proc/irq/X/smp_affinity, with X being the corresponding interrupt number. When you need to pin an interrupt to a particular core, you need to calculate the bit-mask as 2^core; you can also stack several bit-masks by adding them together. For example, to pin interrupt 40 to core 3: 2^3 = 8, then echo 8 > /proc/irq/40/smp_affinity. Keep in mind that the core numbering starts from 0.

Interrupt load balancing  On certain systems there is a daemon that takes care of setting the interrupt masks to balance the system load. It is called "irqbalance"; however, as we do manual IRQ pinning, it would collide with and change our settings, hence if it is present on your system you must disable it.

C-states  C-states could be an issue, as they were shown to introduce added latency. To disable them, one must go into the BIOS, look for C-states and set them to "performance" or the equivalent setting to have minimum issues with them.

Adaptive Interrupt Moderation  The hardware creates interrupts at a certain interval when receiving and sending frames. With one interrupt per frame, the overhead caused can end up in such CPU usage that it ultimately becomes the bottleneck. Adaptive Interrupt Moderation has to be left on to achieve maximum throughput, as it saves CPU consumption. This can be set up with the modprobe command and given as a parameter; for instance, with the i40e driver, modprobe i40e InterruptThrottleRate=1 enables it, although this is the default value.

Segmenting sender and receiver  This scenario was always respected, as we tried to gather the best possible performance from one machine and therefore did not want issues from having to share the same system between the two functions. However, the receiver side was never investigated, as it was used as a black hole for packets; its only purpose was to give the interface a carrier and to check that connectivity happened correctly, to avoid flooding a regular network with pktgen packets.

4.8 pktgen parameters clone conflict

While examining the code of pktgen we found out that the current implementation of the xmit_more API (i.e. the burst parameter) collides with the cloning of packets (the clone_skb parameter). The code manually increments the SKB reference counter by the burst value, effectively cloning the packet even though no clone_skb parameter had been passed.

3450         atomic_add(burst, &pkt_dev->skb->users);
3451
3452 xmit_more:
3453         ret = netdev_start_xmit(pkt_dev->skb, odev, txq, --burst > 0);

Listing 4.1: Lines proving the incoherent behaviour in pktgen.c (v2.75)

This does not create any problems when using both parameters, but according to the program's specification there is no obligation to clone the packet when using the xmit_more API. In the current state of things you cannot use burst without an inherent clone_skb. A patch was crafted to fix this issue (cf. Appendix C.4); however, due to the lack of review it received, it was not applied.

For our experiments this problem is not critical: as we are looking for the best achievable performance, stacking the xmit_more capabilities on top of cloning would have been mandatory anyway.


Chapter 5

eBPF Programs with BCC

This chapter is dedicated to showing the reader how the eBPF programs were created. Figuring out how to create eBPF programs was a lengthy process. For the reader to understand the code used in the results chapter, we give a short overview of its structure with the help of BCC [44].

5.1 Introduction

Programming with BCC is divided into two parts: the eBPF program, which is the in-kernel code being executed, and the front-end written in Python, which reads the results of that execution. eBPF programs can be attached at several points in the kernel:

• socket: as originally intended, you can bind an eBPF program to a socket to filter the traffic passing through it.

• xt_bpf: kernel module for netfilter

• cls_bpf: kernel module for traffic-shaping classification

• act_bpf: BPF-based action (since Linux 4.1)

• kprobes: BPF-based kprobes

The above list is not exhaustive; however, we only use the kprobe hook.

5.2 kprobes

While they are not a new technology, kprobes are an efficient way of putting a tracepoint in the kernel. The traditional way of adding them requires compiling a whole new kernel module (or modifying an existing one) and adding it to the system, so that the OS registers and activates the kprobe. This method is complex, especially as the user should be able to treat kprobes as a dynamic, convenient way to add breakpoints to functions that require monitoring or tracing. The perf_events tool allows adding kprobes but cannot run dedicated programs on them. This is where eBPF comes into play.

Hello, world!
The bcc/examples/hello_world.py file gives a good and simple overview of how to add a kprobe; however, for the sake of segmenting the code into relevant parts we tweak it into the following two code listings. This separation between C and Python should always be used for the sake of code clarity, even though the original hello_world.py found in the repository does not use it.

1 #include <uapi/linux/ptrace.h>
2 void handler(void *ctx){
3     bpf_trace_printk("Hello, World!\n");
4 }

Listing 5.1: hello world.c


Listing 5.1 shows the minimum amount of code needed to create a kprobe handler in the attached .c file.

1. Inclusion of the ptrace user-space API header, required to bind a kprobe with BCC.

2. The handler function will be called when the probe is hit, and MUST take a context pointer, as all eBPF functions attached to kprobes do. If needed, extra arguments can be added to match the prototype of the probed function, in order to access their values.

3. bpf_trace_printk is a helper included in helpers.h that simplifies printing from kernel space to user space, as the original printk normally redirects to the dmesg output.

1 #!/usr/bin/python
2 from bcc import BPF
3 b=BPF(src_file="hello_world.c")
4 b.attach_kprobe(event="sys_clone", fn_name="handler")
5 b.trace_print()

Listing 5.2: hello world.py

The above code represents a classical kprobe monitoring program with BCC. The first two lines are mandatory and will not work without the kernel headers being installed on the distribution. This is because the .c file includes the ptrace user-space API, but the error prompted in that case is not explicit. The 3rd line initializes the program by giving it the source file of the eBPF program to be run. Note that the BPF object in fact recognizes keywords inside the program and will automatically add the corresponding headers if they are missing; however, the list is small and the user should not rely on it. When initializing, you can pass the program either as a single separate C file, as recommended, or as a string of text; doing both will not function correctly. The 4th line creates and attaches the kprobe to the given event, which is a kernel symbol. The handler function must be provided in the "fn_name" parameter. Last but not least, the 5th line simply waits indefinitely for data from a bpf_trace_printk call and prints it to the user.

5.3 Estimation of driver transmission function execution time

We aimed to use eBPF to check whether the performance figures reported by pktgen were accurate or not. In this case we wanted to verify whether some unusual driver latency could be revealed. Thus we used eBPF to calculate the driver latency by binding kprobes onto the driver transmission function.

This experiment was based on the assumption that if we were running a single pktgen thread, only a single core should be generating traffic, hence the driver functions should not be called concurrently. Therefore, one can calculate the amount of time the driver took to send packets by taking a timestamp at the beginning of a driver function and computing the difference with a timestamp taken at the return of the same function. With pktgen there is no way to uniquely identify each SKB: because of the burst option, the same packet is passed to the driver for copying, hence there is no way to differentiate two SKBs as they are exactly the same when cloned.

In our case, the function traced was i40e_lan_xmit_frame. The program was split into two parts as explained above. The C code is composed of two handler functions:

BPF_TABLE("hash", u64, u64, start, 262144);

int handler_entry(struct pt_regs *ctx, struct sk_buff *skb,
                  struct net_device *netdev)
{
    u64 ts = bpf_ktime_get_ns();
    u64 key = 0;

    start.update(&key, &ts);
    return 0;
}

Listing 5.3: kprobe at entry of i40e_lan_xmit_frame

The BPF_TABLE macro creates the eBPF map, called start. The handler_entry function executes as follows:


• Firstly, we fetch the current timestamp through the eBPF helper function bpf_ktime_get_ns().

• We store the value of the timestamp in the start map under the key 0.

• Important note: anything stored inside an eBPF map has to be passed through a pointer to a variable with an initialized value. If one tries to store by taking the address of an immediate (e.g. &0), the program will not compile; hence the "key" variable with value 0.

• Storing the timestamp in the same map under key 0 avoids having to dedicate a second map to this sole variable. This is not a problem since the other keys, representing the time taken by the function to run, cannot be equal to 0.

void handler_return(struct pt_regs *ctx, struct sk_buff *skb,
                    struct net_device *netdev)
{
    u64 *tsp = NULL, delta = 0;
    u64 key = 0;

    tsp = start.lookup(&key);
    if (tsp != 0) {
        delta = bpf_ktime_get_ns();
        delta -= *tsp;
        start.increment(delta);
    }
}

Listing 5.4: latency measurement from driver interaction

For handler_return:

• We create a u64 pointer to hold the timestamp value, and initialize a delta value which will be thedifference between the current time and the one stored in the map.

• The key has the same purpose as in the handler, i.e. retrieving the timestamp and respecting theeBPF access paradigms.

• We retrieve the value of the previous timestamp in tsp.

• If the timestamp has a value ≠ 0, we retrieve the current time, compute the difference and store it in delta.

• The increment method of the table looks up the key and increments the associated value. Hence the map content will be pairs of a key representing the execution time and a value representing the number of times the function was executed at that speed; e.g. key 450 with value 12 means that an execution time of 450 nanoseconds happened 12 times.

The associated python code is minimal.

#!/usr/bin/env python
from bcc import BPF
from time import sleep

b = BPF(src_file="xmit_latency.c")
b.attach_kprobe(event="i40e_lan_xmit_frame", fn_name="handler_entry")
b.attach_kretprobe(event="i40e_lan_xmit_frame", fn_name="handler_return")
print "Probe attached"
try:
    sleep(100000)
except KeyboardInterrupt:
    for k, v in b["start"].items():
        # Calculate mean, variance and standard dev

Listing 5.5: Python code to attach the probes and retrieve the map data

• We must import the bcc module in order for it to function.

• We associate the C program with the python front-end.


• Then the functions are bound as a kprobe and a kretprobe onto i40e_lan_xmit_frame, the function we want to investigate.

• We wait for the user to create an interrupt (CTRL+C)

• We then loop through the key/value pairs present in the map. Be careful to ignore the entry whose key is 0, as it holds a timestamp and would completely skew the statistics (a sketch of this computation is given below).
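The statistics step left as a comment in listing 5.5 can be written directly in the Python front-end. The following is a minimal sketch, reusing the BPF object b from listing 5.5 and assuming the map layout described above (key = execution time in nanoseconds, value = number of occurrences), with the timestamp entry under key 0 skipped:

import math

# samples: (latency in ns, number of occurrences), ignoring the key-0 entry
samples = [(k.value, v.value) for k, v in b["start"].items() if k.value != 0]
total = sum(count for _, count in samples)
mean = sum(lat * count for lat, count in samples) / float(total)
variance = sum(count * (lat - mean) ** 2 for lat, count in samples) / float(total)
stddev = math.sqrt(variance)
print "mean = %.1f ns, variance = %.1f, stddev = %.1f ns" % (mean, variance, stddev)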


Chapter 6

Results

This chapter will be dedicated to the results of the pktgen experiments along with the profiling realisedwith perf and eBPF.

6.1 Settings tuning

This section regroups the results from the different settings tested on each machine, to establish which ones produce the greatest amount of throughput. All the data in this section use a packet size of 64 bytes.

6.1.1 Influence of kernel version

It was required to investigate the influence of the kernel version on the global results. To do so, all the long-term versions from 3.18 (introduction of the xmit_more API) onwards were tested: 3.18, 4.1, 4.4, 4.5 and 4.6 (a release candidate at the time of the tests). Figures 6.1a and 6.1b clearly demonstrate how close the performance is between the different versions of the kernel. Kernel 3.18.32 seems to have a slightly off throughput calculation, as it showed performance above the achievable theoretical limit. On the other hand, kernel 4.5.3 seems to be slightly below the others, perhaps due to another miscalculation or simply because the kernel is less efficient in that version.

Because of this benchmark we decided to settle for version 4.4:

• It was the latest long term version available at the beginning of this thesis.

• It seemed to show accurate performance.

• As of April, machines C and D had been updated to version 4.4 by the administrator, hence using it on machines A and B for fairness of comparison seemed a sensible strategy.

• It had extra eBPF features compared to older versions, which could come in handy for later.

6.1.2 Optimal pktgen settings

There are only a few options that can be varied to influence the performance: clone_skb and burst. All the following graphs show only a few of the tested parameter values; the actual testing range was in fact much greater. For instance, if a graph shows burst values of 10 and 1000, the range 10–100 was also tested, as well as 1000, 10000 and so on; values that did not show any significant difference are hidden for readability purposes. As it turns out, the clone_skb parameter is currently implied when using burst (cf. 4.8), and it is not shown on the graphs, also for readability purposes. A value of "burst 1" is the baseline: it is the default setting and does not in fact profit from the xmit_more API.
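For reference, these parameters are handed to pktgen through its /proc interface, one command per write. The following is a minimal sketch, assuming the interface eth0 has already been added to pktgen thread 0 (the device file name is therefore only an example):

#!/usr/bin/env python
# Minimal sketch of how pktgen is configured through /proc; the device file
# name below (eth0@0) depends on which interface was added to which thread.
def pgset(path, cmd):
    # pktgen expects one command per write
    with open(path, "w") as f:
        f.write(cmd + "\n")

dev = "/proc/net/pktgen/eth0@0"
pgset(dev, "count 0")        # 0 = run until explicitly stopped
pgset(dev, "pkt_size 64")
pgset(dev, "clone_skb 0")    # redundant once burst > 1 (cf. section 4.8)
pgset(dev, "burst 10")       # enables xmit_more bursting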


(a) pktgen throughput with no options

(b) pktgen throughput with burst 10

Figure 6.1: Benchmarking of different kernel version under bifrost (Machine A)

Interpreting the results:

• Figure 6.2a shows a slight speed advantage with a burst value of 10, until the different profiles merge into the line rate (14.88 Mpps) at around 4 cores.

• Figure 6.2b, on the other hand, shows a small speed disadvantage with a burst value of 10, until the different profiles merge into the line rate (14.88 Mpps) at around 4 cores.

• Figure 6.2c is on another scale than the others because of its 40G NIC. While the results are similar during the starting phase (1 to 4 cores), from 5 cores onwards bursting shows an advantage by a great margin.

On a machine with default ring settings, a burst value of around 10 seems to provide the best and most consistent performance across all machines.


(a) burst variance on machine A – 10G

(b) burst variance on machine B – 10G

(c) burst variance on machine D – 40G

Figure 6.2: Performance of pktgen on different machines according to burst variance.


6.1.3 Influence of ring size

The default transmission ring size can be obtained with ethtool -g dev. It represents the size of the ring buffer shared between the kernel and the NIC, and is managed by the driver. The official pktgen documentation [37] advises increasing the size of the original buffer, as "the default NIC settings are (likely) not tuned for pktgen's artificial overload type of benchmarking, as this could hurt the normal use-case". This recommendation was likely issued by Jesper Brouer [56], who is also the author of the xmit_more API. Therefore we created a nested loop to test the influence of the ring size along with the associated burst value, while monitoring the throughput; a sketch of this sweep is given below. This was done on a single core, on machine D.
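In the sketch below, the interface name is a placeholder and run_pktgen() is a hypothetical helper assumed to configure pktgen through its /proc interface, run it for a fixed duration and return the reported packets-per-second figure; only the ethtool call corresponds to a real command.

#!/usr/bin/env python
# Minimal sketch of the nested ring-size / burst sweep on a single core.
# "eth0" and run_pktgen() are placeholders, not part of the actual test code.
import subprocess

IFACE = "eth0"

for ring in (512, 640, 800, 1024, 2048, 4096):
    # Resize the TX ring; requires root and driver support
    subprocess.check_call(["ethtool", "-G", IFACE, "tx", str(ring)])
    for burst in (5, 10, 25, 40, 100):
        pps = run_pktgen(IFACE, burst=burst, pkt_size=64)  # hypothetical helper
        print "%d %d %d" % (ring, burst, pps)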

Figure 6.3: Influence of ring size and burst value on the throughput

The above figure starts its burst value at 5 and ends at 100. This choice was made because the baseline (burst 1) yields results too small to be shown properly on the graph, far under 8 million packets per second for any ring size.

The collected data depicts two things:

• The original ring size (512) is indeed too small to reach the maximum performance. Setting a value between 1024 and 4096 does not seem to influence the maximum throughput of pktgen.

• Whilst a burst size of 10 may in fact be the best setting for small ring sizes (512–640 on this graph), the best settings overall are around a burst value of 25 to 40, with a ring size from 800 to 4096.

The recommended value of 1024 seems to be more of a mnemonic than a factual threshold that greatly influences the overall performance, but increasing the ring size until maximum performance is reached is necessary.


6.2 Evidence of faulty hardware

We will now showcase the discovery of a problem related to the hardware of machine C, which led to completely discarding the results acquired from it.

Figure 6.4: Machine C parameter variation to amount of cores

The above figure shows a classical benchmark of the system, varying the burst and clone_skb values. However, the plateau shown when burst is set to 10 and 100 barely reaches a total of 22 Mpps. While one might think this is due to a limitation of either the CPU or the kernel's performance, a test with MTU-sized packets revealed that even a simple bandwidth test at 40G did not work correctly. Figure 6.5 clearly shows a bandwidth ceiling maintained around 26 to 28 Gigabits per second. And since figure 6.4 shows the hardware should be capable of producing at least 20 million packets per second, it is very unlikely that the NIC is unable to deliver the 3.289 million packets per second required to saturate the link with MTU-sized packets.

A good hint for unveiling the issue was located in the kernel buffer output, read with dmesg | grep "i40e": "PCI-Express bandwidth available for this device may be insufficient for optimal performance". We first assumed the administrator did not place the NIC in the correct slot, as the block diagram showed another slot with PCIe 2.0 instead of 3.0. But even after moving the card to the fastest slot available (Slot 1 – PCIe 3.0 x16), the results still stalled at the same total throughput and the kernel message remained.

After doing some research we found that Intel issued several technical advisories [57] [58] stating that there are issues with PCIe connections. It is therefore very likely that the board automatically downgrades the PCIe 3.0 link to 2.0. In [57] it is stated that upgrading the BIOS would fix the issue; however, we did not have physical access to the machine, located in a data-center. We asked an administrator to do the upgrade, but were answered that the BIOS-upgrade attempt he performed was not successful, and hence the upgrade would not be performed on live machines. This is what led us to discard all performance results from this machine.


Figure 6.5: Machine C bandwidth test with MTU packets.

6.3 Study of the packet size scalability

This section was realized entirely on machine D, as it was the only one with a 40G NIC. We ran similar tests on machines A and B; however, the same behaviour could not be reproduced, most probably because the throughput was too low for the issue to be noticeable.

6.3.1 Problem detection

The idea of the test was fairly simple: incrementally scale the size of the packet until line rate is reached.

Figure 6.6: Throughput to packet size, in millions of packets per second.


The expected behaviour is a constant initial packet rate until the theoretical limit is met, after which the packet rate should obviously follow that limit downwards. Figure 6.6 instead clearly shows regular losses of packet throughput at growing intervals, for instance at packet sizes of 606, 703, 840, 1044 and 1386 bytes. Note that varying the burst size (> 1, otherwise line rate is not reached) or the ring buffer size does not prevent those drops.

Another approach to visualizing the problem is to simply plot the throughput in Mbps. This is done in figure 6.7, and another problem becomes obvious: because of the throughput drop, the line speed of 40G is not met when using MTU-sized (1500-byte) packets, whilst with smaller packets under the exact same configuration (e.g. 1200 bytes) a 40G bandwidth is achieved. This implies that, because of this "sawtooth" profile, it is very unlikely that pktgen would be able to saturate a link with a single core if used on a 100G board.

Figure 6.7: Throughput to packet size, in Mbps.

6.3.2 Profiling with perf

As we did not have any other machine with a 40G NIC to compare against, we first tried to monitor events with perf. The idea was to run a batch of perf stat tests over particular events (mostly hardware events were investigated, as we suspected cache issues on a hunch) and to compute the Pearson product-moment correlation coefficient between each event and the throughput. Also, to visualize the results, since the latter formula is only valid for linear relations and might obfuscate other behaviours, we plotted the events on top of the graph from figure 6.7 to see if any event matched the losses. Needless to say, the same event-monitoring experiment had to be run several times, as spurious events might have skewed the results, making any match more of a coincidence than a meaningful implication between the problem and the explanation. However, none of the hardware metrics tested corroborates a direct implication of the hardware in this issue; Pearson's formula always yielded r values considered "very weak", often between 0.05 and 0.1. Figure 6.8 is an example of this procedure, the green data representing the same pattern as found in figure 6.7 but slightly shrunk because the range of the y axis was enlarged. This choice was made to align the two sets of data and find correlations between them more easily.


Interpreting this specific result, one cannot see an obvious match between the two sets, implying there might not be one at all. The same sort of result was found for all the perf hardware events tested, including: LLC-store-misses, LLC-stores, branch-load-misses, branch-loads, iTLB-loads, node-load-misses, node-loads, node-store-misses, dTLB-load-misses, dTLB-loads, dTLB-store-misses, iTLB-load-misses, cpu-migrations, page-faults and context-switches. So far none of them revealed an issue that would rationalize the problem.

Figure 6.8: Superposition of the amount of cache misses and the throughput ”sawtooth” behaviour.

Applying the Pearson product-moment correlation coefficient formula to the data of figure 6.8 yields an r value of ≈ 0.09, which we interpret as close to no linear correlation at all.
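For completeness, the correlation itself is straightforward to compute from the two data series (throughput per run and the matching perf event counts). The following is a minimal sketch; the two series are placeholder values, not measured data.

#!/usr/bin/env python
# Minimal sketch: Pearson product-moment correlation coefficient between a
# perf event count and the measured throughput. The series below are
# placeholders standing in for the real measurements.
import math

throughput = [14.2, 13.8, 12.9, 13.1, 11.7]          # Mpps, one value per run
event_count = [1.2e6, 1.3e6, 1.1e6, 1.4e6, 1.2e6]    # e.g. cache misses per run

n = len(throughput)
mean_x = sum(throughput) / n
mean_y = sum(event_count) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(throughput, event_count))
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in throughput))
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in event_count))
print "Pearson r = %.3f" % (cov / (std_x * std_y))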

6.3.3 Driver latency estimation with eBPF

Another attempt at finding a justification for the problem was to create a kprobe through an eBPF program to calculate the amount of time the driver took to send each packet. If some unusual latency were found, it would possibly indicate hardware problems located on the NIC, or driver issues. This procedure was explained in section 5.3.

Execution overhead
While the program works on the system, it is inapplicable in real-life high-speed network situations because of the amount of overhead it causes.

Packet size (bytes)   Without eBPF   With eBPF    Mean latency measured
64                    5200 Mb/s      686 Mb/s     520 ns
500                   33100 Mb/s     4410 Mb/s    542 ns
1000                  39600 Mb/s     10800 Mb/s   533 ns
1500                  38100 Mb/s     16170 Mb/s   550 ns

Table 6.1: Comparison of throughput with eBPF program

The table shows the amount of overhead created by the insertion of the probe with eBPF. Note that the run with a packet size of 1500 bytes and the eBPF program loaded is in fact the only scenario found where a better throughput is achieved compared to a size of 1000 or 1200 bytes. The interpretation is trivial, however: when a minimum-sized packet is used, there are more packets per second and therefore more kprobe hits, causing more overhead. Hence a bigger packet size causes less overhead, because fewer packets are sent.

Moreover, the mean latency calculation did not show an increase in the execution latency; perhaps because the xmit_more API is enabled, the actual sending is delayed, making the function execution time almost static. Regarding the actual latency results, they enable us to draw two conclusions:

• The time spent executing the function is in fact not related to the size of the packet. While the function is named "xmit" (for transmission) and is the lowest-level function the kernel has access to for packet transmission, it does not immediately transfer the packet to the NIC. Instead, it copies the packet content located inside the SKB to the driver ring buffer. Hence the constant nature of the latency measurements is expected.

• It is quite hard to assess the accuracy of such measurements because of the extreme granularity required. Nanosecond-precise measurements are extremely hard to achieve because any kind of instruction takes several nanoseconds to execute and hence cannot be neglected. In this example, as we cannot assess the amount of time taken accessing or storing the timestamps inside the eBPF maps, those operations, or any other eBPF instruction for that matter, might skew the results dramatically. And given a measured latency of ≈ 500 nanoseconds (cf. Table 6.1), this implies copying a 64-byte (512-bit) packet into the ring buffer takes the equivalent time of sending 2500 bytes (20000 bits) over a 40G wire, which is unrealistic for high-speed transmission.

While this attempt at using eBPF for profiling was not successful, there are currently efforts to add a new eBPF hook optimized for throughput [59], which may make this kind of performance assessment feasible.


Chapter 7

Conclusion

The results gathered provide us with several insights:

• The clone_skb option of pktgen is currently useless when coupled with burst. This is because burst has the same advantages but is even more efficient thanks to the xmit_more API.

• To use the full potential of the machine, a ring size of 512 is not enough. However, the recommended value of 1024 is not mandatory, as anything in the range of 800 to 4096 works as well. An associated burst value of 30 to 40 is the best setting found, with a rate of 13.2 million packets per second achieved on a single core on machine D.

• There is an issue with the packet size on machine D, as experiments with a single core and MTU-sized packets produce less bandwidth than packets with a size of 1000 bytes. The hardware profiling of this issue was unsuccessful.

• eBPF profiling seems like an interesting option but introduces too much overhead to provide usable data.

We are now able to answer our initial hypothesis. With pktgen, the line rate of 10G is reached with 2 to 4 cores depending on the machine. The line rate of 40G with minimum-sized packets, however, is not reached, although ≈40 million packets per second were achieved with high-end hardware. Therefore we can conclude that aiming for 100G line rate is currently unrealistic given the current kernel and hardware conditions. During this thesis pktgen was deeply investigated, and we also gave a solid technical background, going from computer hardware to profiling tools. We also gave the reader an insight into the eBPF technology and its possible uses, notably through kprobes. As this project is constantly evolving, it might become a very powerful technology for profiling entire frameworks in the future, perhaps even drivers.

7.1 Future work

It is paramount that the packet-size scaling problem be addressed, and that attempts be made to recreate it on machines with different configurations but also equipped with a 40G NIC, as we believe 10G is too slow to trigger the problem. Extra profiling techniques combining both eBPF and perf could provide a new angle of approach and help pinpoint it. Re-coding the pktgen interface, currently written in Perl, in C should help make it portable and usable on all distributions, hence growing the community of users. And the more data provided by the community, the more chances of studying the capabilities of a Linux-based system through pktgen.


Bibliography

[1] Sebastian Gallenmuller et al. “Comparison of Frameworks for High-Performance Packet IO”. In:ANCS ’15 Proceedings of the Eleventh ACM/IEEE Symposium on Architectures for networkingand communications systems (2015), pp. 29–38.

[2] Alessio Botta, Alberto Dainotti, and Antonio Pescape. “Do You Trust Your Software-Based TrafficGenerator?” In: IEEE Communications Magazine (2010), pp. 158–165.

[3] Olof Hagsand, Robert Olsson, and Bengt Gorden. “Open-source routing at 10Gb/s”. In: (2009).

[4] Robert Love. Linux Kernel Developement. 4th ed. Addison-Wesley, 2010.

[5] Robert Olsson. “pktgen the linux packet generator”. In: Proceedings of the Linux Symposium 2(2005). url: https://www.kernel.org/doc/ols/2005/ols2005v2-pages-19-32.pdf.

[6] Arnaldo Carvalho de Melo. “The New Linux ’perf’ tools”. In: Linux Kongress (2010).

[7] Jonathan Corbet. Extending extended BPF. 2 July 2014. url: https://lwn.net/Articles/603983/.

[8] Linux Foundation. NAPI. 2009. url: http://www.linuxfoundation.org/collaborate/workgroups/networking/napi.

[9] Jonathan Corbet. Bulk network packet transmission. 17 May 2016. url: https://lwn.net/Articles/615238/.

[10] Christoph Lameter. “NUMA (Non-Uniform Memory Access): An Overview”. In: acmqueue 11.7(2013). url: https://queue.acm.org/detail.cfm?id=2513149.

[11] PCI-SIG. PCI Express Base Specification. Specification. Version Rev. 3.0. PCI-SIG, Nov. 2010,pp. 192–200.

[12] S. Bradner and J. McQuaid. Benchmarking Methodology for Network Interconnect Devices. RFC.IETF, 1999.

[13] Bryan Henderson. Linux Loadable Kernel Module HOWTO. 10. Technical Details. Version v1.09.2016. url: http://www.tldp.org/HOWTO/Module-HOWTO/x627.html.

[14] Patrick Mochel and Mike Murphy. sysfs - The filesystem for exporting kernel objects. 16 August2011. url: https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt.

[15] Joel Becker. configfs - Userspace-driven kernel object configuration. 31 March 2005. url: https://www.kernel.org/doc/Documentation/filesystems/configfs/configfs.txt.

[16] Thomas Petazzoni. Network drivers. Free Electrons. 2009. url: http://free-electrons.com/doc/network-drivers.pdf.

[17] David S. Miller. David S. Miller Linux Networking Homepage. 2016. url: http : / / vger .kernel.org/˜davem/skb.html.

[18] Hyeongyeop Kim. Understanding TCP/IP Network Stack Writing Network Apps. CUBRID. 2013.url: http://www.cubrid.org/blog/dev- platform/understanding- tcp- ip-network-stack/.

[19] Sreekrishnan Venkateswaran. Essential Linux Device Drivers. 2008.

[20] Dan Siemon. “Queueing in the Linux network stack”. In: Linux Journal 2013.231 (July 2013).

[21] Martin A. Brown. Traffic Control HOWTO. 2006-10-28. url: http://www.tldp.org/HOWTO/Traffic-Control-HOWTO/classless-qdiscs.html.


[22] Tom Herbert and Willem de Bruijn. Scaling in the Linux Networking Stack. 2015. url: https://www.kernel.org/doc/Documentation/networking/scaling.txt.

[23] Jon Dugan et al. iPerf. 2016-04-12. url: https://iperf.fr/.

[24] Juha Laine, Sampo Saaristo, and Rui Prior. RUDE & CRUDE. 17 May 2016. url: http://rude.sourceforge.net/.

[25] rick jones. RUDE & CRUDE. 17 May 2016. url: http://rude.sourceforge.net/.

[26] P. Srivats. Ostinato. 17 May 2016. url: http://ostinato.org/.

[27] Larry McVoy. lmbench. 17 May 2016. url: http://www.bitmover.com/lmbench/.

[28] Sebastian Zander, David Kennedy, and Grenville Armitage. KUTE A High Performance Kernel-based UDP Traffic Engine. Technical. Centre for Advanced Internet Architecture, 2005.

[29] ntop. PF RING Website. 17 May 2016. url: http://www.ntop.org/products/packet-capture/pf%5C_ring/.

[30] L. Deri. “nCap: wire-speed packet capture and transmission”. In: End-to-End Monitoring Techniques and Services, 2005. Workshop on (15 May 2005), pp. 47–55.

[31] Luigi Rizzo. netmap. 2016-04-12. url: http://info.iet.unipi.it/˜luigi/netmap/.

[32] Intel. DPDK. 2016-04-12. url: http://dpdk.org/.

[33] Paul Emmerich et al. “MoonGen: A Scriptable High-Speed Packet Generator”. In: Internet Mea-surement Conference 2015 (IMC’15). Tokyo, Japan, Oct. 2015.

[34] Spirent Communications. Website. 2016-04-12. url: http://www.spirent.com/.

[35] IXIA. Website. 2016-04-12. url: https://www.ixiacom.com/.

[36] Daniel Turull, Peter Sjodin, and Robert Olsson. “Pktgen: Measuring performance on high speednetworks”. In: Computer Communications 82 (Mar. 2016), pp. 39–48.

[37] Robert Olsson. HOWTO for the linux packet generator. 17 May 2016. url: https://www.kernel.org/doc/Documentation/networking/pktgen.txt.

[38] Stephane Eranian. Perf tutorial. 14-May-2016. url: https://perf.wiki.kernel.org/index.php/Tutorial.

[39] Brendan Gregg. Linux Perf Examples. 14-May-2016. url: http://www.brendangregg.com/perf.html#Events.

[40] Kernel Newbies. Linux 4.4. 14-May-2016. url: http://kernelnewbies.org/Linux_4.4.

[41] Jay Schulist, Daniel Borkmann, and Alexei Starovoitov. Linux Socket Filtering aka Berkeley PacketFilter (BPF). 24 Aug 2015. url: https : / / www . kernel . org / doc / Documentation /networking/filter.txt.

[42] Suchakra Sharma. BPF Internals - I. 24 Aug 2015. url: https://github.com/iovisor/bpf-docs/blob/master/bpf-internals-1.md.

[43] Alexei Starovoitov. LLVM Website. 17 May 2016. url: http://llvm.org/.

[44] Iovisor project. BCC repository. 17 May 2016. url: https://github.com/iovisor/bcc.

[45] statstutor. Pearson’s correlation. 20 June 2014. url: http://netoptimizer.blogspot.se/2014/06/pktgen-for-network-overload-testing.html.

[46] MiTAC Computer Corporation. S7002 technical specification. 2009.

[47] Intel. Server Board S2600IP. march 2015. url: http://www.intel.com/content/dam/support/us/en/documents/motherboards/server/sb/g34153004_s2600ip_w2600cr_tps_rev151.pdf.

[48] Intel. Server Board S2600CW. April 2016. url: http://www.intel.com/content/dam/support/us/en/documents/server-products/S2600CW_TPS_R2_1.pdf.

[49] Judd Vinet and Aaron Griffin. Arch Linux. 2016-04-12. url: https://www.archlinux.org/.

[50] Bifrost Network Project. bifrost. 2016-04-12. url: http://www.bifrost-network.org/.


[51] Canonical Ltd. Ubuntu Website. 14-May-2016. url: http://www.ubuntu.com/.

[52] Inc Red Hat. Fedora. 2016-04-12. url: https://getfedora.org/.

[53] Tim O’Reilly and Ben Smith. The Importance of Perl. 17 May 2016. url: http://archive.oreilly.com/pub/a/oreilly/perl/news/importance_0498.html.

[54] Kihwan Choi, Ramakrishna Soma, and Massoud Pedram. “Fine-Grained Dynamic Voltage andFrequency Scaling for Precise Energy and Performance Trade-off based on the Ratio of Off-chipAccess to On-chip Computation Times”. In: IEEE Transactions on Computer-Aided Design ofIntegrated Circuits and Systems 24 (1 27 December 2004), pp. 18–28.

[55] Dominik Brodowsk and Nico Golde. CPU frequency and voltage scaling code in the Linux(TM)kernel. 17 May 2016. url: https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt.

[56] Jesper Dangaard Brouer. Pktgen for network overload testing. 4 June 2014. url: http : / /netoptimizer.blogspot.se/2014/06/pktgen-for-network-overload-testing.html.

[57] Intel Corporation. PCI Express* 3.0 Add-in Adapter Support Issue. technical advisory. Intel, 2014.url: http://www.intel.com/content/dam/support/us/en/documents/motherboards/server/sb/ta_102105.pdf.

[58] Intel Corporation. PCIe link width may intermittently downgrade to x4 or x2 with one third partyPCIe add-in card. technical advisory. Intel, 2012. url: http://www.intel.com/content/dam/support/us/en/documents/server-products/ta1000.pdf.

[59] Tom Herbert and Alexei Starovoitov. eXpress Data Path. 10 June 2016. url: https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf.


Appendix A

Bifrost install

A.1 How to create a bifrost distribution

The following instructions assume the device you are trying to write to is a USB stick, most likely mounted asynchronously. In case you had sensitive data on the device you are about to overwrite, please consider using the shred command. Before you get started, it is important to check which one of your devices maps to /dev/sdX.

Burning the image on the device

• mkdir /tmp/bifrost && cd /tmp/bifrost

• wget http://bifrost-network.org/files/distro/bifrost-7.2.img.gz (or download it through a browser)

• gzip -d bifrost-7.2.img.gz

• dd if=bifrost-7.2.img of=/dev/sdX bs=4096, where X is the letter of the device you want to write to. You can check the list of connected devices by running fdisk -l with admin rights.

• sync

Resizing the filesystem, in case you wish to make use of the entire space of your stick.

• parted /dev/sdX

• (parted) resizepart 1 -1s

• (parted) quit

• resize2fs /dev/sdX1

• fsck /dev/sdX1

• sync


A.2 Compile and install a kernel for bifrost

You probably want the same configuration as the original kernel provided with the image. You can copy the previous configuration with zcat /proc/config.gz > .config.

• Assuming you downloaded and extracted the kernel code and you are currently in the folder, andyou mounted the bifrost distribution on /media/user/bifrost

• wget http://jelaas.eu/pkg64/bifrost-initramfs-15.tar.gz

• tar xvf bifrost-initramfs-15.tar.gz ./boot/initramfs.cpio -O > initramfs.cpio

• make (this should take a while)

• cp arch/x86/boot/bzImage /media/user/bifrost/boot/kernel-XXX

• make modules_install XXXX

• sync


Appendix B

Scripts

Example of a simple Perl post-processing script to yield data for gnuplot.

#!/usr/bin/perl -nw
use List::Util qw(sum);
BEGIN{ $nbcore=1; $i=0; }
if( grep /bps/,$_){
    (my $pps) = $_=~/(\d+?)pps/;
    $pps=~s/\d{3}$//;
    push @res,$pps;
    (my $bps) = $_=~/(\d+)Mb\/sec/;
    push @res2,$bps;
    $i++;
}
if($i == $nbcore){
    $i=0;
    print $nbcore++," ",sum(@res)/1000," Mpps ";
    print sum(@res2)/1000," Mbps \n";
    @res=();
    @res2=();
}

../pkt-dat.pl


Appendix C

Block diagrams

Figure C.1: Block diagram of motherboard Tyan S7002


Figure C.2: Block diagram of the motherboard S2600IP


Figure C.3: Block diagram of the motherboard S2600CW


--- pktgen_old.c	2016-06-04 15:47:00.881493623 +0200
+++ pktgen.c	2016-06-04 15:46:17.953491778 +0200
@@ -3447,10 +3447,8 @@
 		pkt_dev->last_ok = 0;
 		goto unlock;
 	}
-	atomic_add(burst, &pkt_dev->skb->users);
-
-xmit_more:
-	ret = netdev_start_xmit(pkt_dev->skb, odev, txq, --burst > 0);
+	atomic_inc(&(pkt_dev->skb->users));
+	ret = netdev_start_xmit(pkt_dev->skb, odev, txq, pkt_dev->sofar % burst != 0);
 
 	switch (ret) {
 	case NETDEV_TX_OK:
@@ -3458,8 +3456,6 @@
 		pkt_dev->sofar++;
 		pkt_dev->seq_num++;
 		pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
-		if (burst > 0 && !netif_xmit_frozen_or_drv_stopped(txq))
-			goto xmit_more;
 		break;
 	case NET_XMIT_DROP:
 	case NET_XMIT_CN:
@@ -3478,8 +3474,7 @@
 		atomic_dec(&(pkt_dev->skb->users));
 		pkt_dev->last_ok = 0;
 	}
-	if (unlikely(burst))
-		atomic_sub(burst, &pkt_dev->skb->users);
+
 unlock:
 	HARD_TX_UNLOCK(odev, txq);

Figure C.4: Patch proposed to fix the burst anomalous cloning behaviour


TRITA-ICT-EX-2016:118

www.kth.se