
Proceedings of the Linux Symposium

July 23rd–26th, 2003
Ottawa, Ontario, Canada


Contents

UML Simulator 8

Werner Almesberger

SCSI Mid-Level Multipath 23

Mike S. Anderson

IPv4/IPv6 Translation 34

William Atwood

Building Enterprise Grade VPNs 44

Ken S. Bantoft

Linux memory management on larger machines 50

Martin J. Bligh

Integrating DMA into the Generic Device Model 63

James Bottomley

Linux Scalability for Large NUMA Systems 76

Ray Bryant

An Implementation of HIP for Linux 89

Catharina L. Candolin

Improving enterprise database performance 98

Ken W. Chen

High Availability Data Replication 109

Paul R. Clements


Porting NSA Security Enhanced Linux to Hand-held devices 117

Russell Coker

Strong Cryptography in the Linux Kernel 128

Jean-Luc R. Cooke

Porting Drivers to the 2.5 Kernel 134

Jonathan M. Corbet

Class-based Prioritized Resource Control in Linux 150

Hubertus Franke

Linux Support for NUMA Hardware 169

Patricia A. Gaughen

Kernel configuration and building in Linux 2.5 185

Kai Germaschewski

Device discovery and power management in embedded systems 201

David W. Gibson

Gnumeric – Using GNOME to go up against MS Office 213

Jody E. Goldberg

DMA Hints on IA64/PARISC 219

Grant Grundler

A 2.5 Page Clustering Implementation 233

William L. Irwin

Ugly Ducklings – Resurrecting unmaintained code 242

Dave Jones


udev – A Userspace Implementation of devfs 249

Greg Kroah-Hartman

Reliable NAS from Dirt Cheap Commodity Hardware 258

Benjamin C.R. LaHaise

Improving the Linux Test Project with Kernel Code Coverage Analysis 260

Paul W. Larson

Effective HPC Hardware Management and Failure Prediction Strategy Using IPMI 275

Richard Libby

Interactive Kernel Performance 285

Robert M. Love

Machine Check Recovery for Linux on Itanium Processors 297

Tony Luck

Low-level optimizations in the PowerPC Linux Kernels 304

Paul Mackerras

Sharing Page Tables in the Linux Kernel 315

Dave McCracken

Kernel Janitors: State of the Project 321

Arnaldo Melo

Linux Kernel Power Management 325

Patrick Mochel

Bringing PowerPC Book E Processors to Linux 340

Matthew D. Porter


Asynchronous IO Support in Linux 2.5 351

Badari Pulavarty

Towards an O(1) VM 367

Rik van Riel

Developing Mobile Devices based on Linux 373

Tim Riker

Lustre: Building a File System for 1,000-node Clusters 380

Phil Schwan

OSCAR Clusters 387

Stephen L. Scott

Porting Linux to the M32R processor 398

Hirokazu Takata

Linux in a Brave New Firmware Environment 410

Matthew E. Tolentino

Implementing the SMIL Specification 424

Malcolm Tredinnick

Benchmarks that Model Enterprise Workloads 434

Theodore Y. Ts’o

Large Free Software Projects and Bugzilla 447

Luis Villa

Performance Testing the Linux Kernel 457

Cliff White


Stressing Linux with Real-world Workloads 470

Mark A. Wong

Xr: Cross-device Rendering for Vector Graphics 480

Carl D. Worth

relayfs: An efficient unified approach for trasmitting data from kernel to user space 494

Karim Yaghmour

Linux IPv6 Networking 507

Hideaki Yoshifuji

Fault Injection Test Harness 524

Louis lz Zhuang


Conference Organizers

Andrew J. Hutton, Steamballoon, Inc.
Stephanie Donovan, Linux Symposium
C. Craig Ross, Linux Symposium

Review Committee

Alan Cox, Red Hat, Inc.
Andi Kleen, SuSE, GmbH
Matthew Wilcox, Hewlett-Packard
Gerrit Huizenga, IBM
Andrew J. Hutton, Steamballoon, Inc.
C. Craig Ross, Linux Symposium
Martin K. Petersen, Wild Open Source, Inc.

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.


UML Simulator

Werner Almesberger∗

[email protected]

Abstract

umlsim extends user-mode Linux (UML) with an event-driven simulation engine and other instrumentation needed for deterministically controlling the flow of time as seen by the UML kernel and applications running under it.

umlsim will be useful for a wide range of applications in research and kernel development, including simulations involving the networking code, regression tests, proof of race conditions, validation of configuration scripts, and also performance analysis.

This paper describes the design and implementation of umlsim, gives a brief overview of the scripting language, and shows a real-life usage example.

1 Introduction

Simulation is an effective means for examining properties of systems that are too complex, too volatile, too expensive, or simply too large to build and test in real life.

In the development of the Linux kernel, simulations only play a niche role, and are rarely used for more than helping in the design of individual components. For performance evaluation, too, there is broad reliance on benchmark suites, but little is done with simulations that would make it possible to pinpoint bottlenecks and regressions with much more accuracy.

∗The work presented in this paper is conducted as part of the FAST project at the California Institute of Technology, http://netlab.caltech.edu/FAST/

umlsim provides an environment that allows the use of regular Linux kernel or application code in event-driven simulations. It consists of an extension of user-mode Linux (UML, [1]) to control the flow of time as seen by the UML kernel and applications running under it, and a simulation control system that acts like a debugger, and that is programmed in a C- and Perl-like scripting language.

The key feature of umlsim is that, unlike most other simulators, which implement an abstract model of the system being simulated, it uses the original Linux kernel code, with only minor changes. This reduces the risk of creating simulations that differ in some important details from the original, avoids code forking, and generally shortens the process of designing and building a simulation.

The simulation environment is deterministic, i.e. running a simulation multiple times will produce exactly the same results, although one can of course also introduce real or pseudo randomness. This makes umlsim suitable for regression tests, and for exercising specific execution patterns that exhibit problems.

The project's home page is at http://umlsim.sourceforge.net/

One of the first uses of umlsim is to examine the behaviour of Linux TCP in gigabit networks [2], but it will also be useful for many other applications in research and kernel development, including regression tests, examination of race conditions and other kernel bugs, validation of configuration scripts, and performance analysis.

This paper is intended for two different audiences: first, it aims to introduce the capabilities and concepts of umlsim to prospective users. Second, it gives other kernel developers an overview of the kernel changes, and describes mechanisms that could also be useful in other projects.

This introduction continues with the historical background and related work. Section 2 discusses overall design and implementation aspects, and section 3 describes the most important elements of the scripting language. A real-life simulation example is given in section 4. We conclude with a discussion of future uses and improvements.

1.1 History

The basic concept behind umlsim, namely to use original kernel and user space code in simulations, was already explored in the earlier tcng ("Traffic Control Next Generation" [3]) project.

The main component of tcng is a compiler that translates traffic control configurations from a high-level language to the low-level commands understood by the tc command-line utility. In that project, a simulator called tcsim is used to validate that these commands are formally correct, and that they also yield the desired behaviour. In particular, since the configuration process involves many inter-related parameters with poorly documented semantics, it happened quite frequently that the use of some parameters or constructs was misinterpreted.

Simulators usually implement an abstracted model of the system they simulate. In the case of tcng, this approach could lead to a simulator that contains the same misinterpretations as the program being tested, so both would happily agree on incorrect results.

To avoid this problem, tcsim reduces the amount of abstraction needed by building the simulation environment from portions of the original traffic control code of the kernel, and the tc configuration utility. The structure of tcsim is depicted on the left-hand side of Figure 1.

This approach also allows the use of powerful user-space debugging tools like Electric Fence [4] and valgrind [5] to find bugs in the original code.

Figure 1: tcsim uses a monolithic approach, with many dependencies on kernel and user space internals. umlsim is modular, and requires only very minor changes.

The difficult part in writing tcsim was to extract precisely the right amount of kernel code, and to make it fit in the simulation environment. In many cases, some small code modifications are needed to eliminate unwanted references to structure elements, variables, and functions not available in the simulator. All this makes the extraction procedure very sensitive to even the smallest changes.

tcsim was written for 2.4 kernels. Early in 2.5 development, it became clear that the networking code had changed sufficiently to require a major rewrite of the extraction process.

Another limitation of tcsim is that it only covers a very small part of the networking stack. For instance, it would be interesting to use TCP as a reactive traffic source.

The bottom line of the experience with tcsim is that, while using the original source also for the simulator works well, the process of extracting it causes problems and confines the simulator to only a small part of the system. So, why not skip the extraction step entirely, and use the entire kernel?

This is the approach chosen for umlsim, as shown on the right-hand side of Figure 1: instead of extracting the interesting bits from the kernel, it builds on UML, where all the work of making the Linux kernel run in user space has already been done, and adds a few functions for simulation control to it. User space is left completely unchanged.

1.2 Other simulators

Particularly in the area of networking, simulators are rather common tools. In many cases, they focus only on a very limited set of functions, such as a specific protocol. Among the more general simulators, the network simulator ns-2 [6] is certainly the one most widely known.

ns-2 consists of a modular simulation core written in C++, which is configured through scripts written in an extended version of the Tcl scripting language. The core provides network elements, protocol engines, and traffic generators.

umlsim also has a "core," but this core provides only very low-level primitives, and higher level functions are implemented by scripts. On the other hand, large subsystems, such as TCP, are simply reused without needing any special treatment in the simulator, and they behave in every detail like in a real system.

ns-2 is much faster than umlsim, and will probably always be, while umlsim is more general and can also be used for simulations involving other subsystems, instead of or in addition to networking.

2 Simulator design

umlsim consists of a simulation control process (we shall call it simply "the simulator"), and the UML systems that are being studied in the simulation. Besides UML systems, a simulation can also include other processes, e.g. to implement communication services. The general structure of a simulation system is shown in Figure 2.

The simulator executes a script in a C/Perl-like language. Scripts serve two purposes: (1) they define the simulation and control its execution, and (2) they provide the "glue" between the actual simulation and the processes used in it, and also between elements in these processes.

A simulation can choose how closely the simulator and the UML systems interact: the simulator may just watch a few variables and perform basic synchronization, exercising no further control over execution, or it may intercept even the slightest activity, manipulate variables in the UML kernel, and even alter the flow of execution. Typically, umlsim controls UML at a very low level, but hides most of these interactions inside the simulator and behind library functions that provide a higher level of abstraction.


Figure 2: The simulator controls UML systems and other processes. Each UML system in turn consists of several processes.

The simulator basically acts like a debugger, and places breakpoints into the UML kernel. When the kernel is stopped, the simulator can read and change variables. The simulator can also call functions, make them return, etc.

In addition to this, the simulator exchanges time updates with the umlsim idle thread described in the next section directly through a pair of pipes.

Figure 3 shows the current structure of libraries. Work in this area of umlsim is still very much in progress.

2.1 Virtual time

It is frequently desirable to run simulations in a deterministic virtual time instead of real time. umlsim can accomplish this by adding code to the kernel that intercepts all functions reporting or advancing time, and puts them under its own control. This code also introduces a umlsim-specific idle thread that yields to all other tasks, except the kernel's regular idle task.

Figure 3: Organization of libraries in umlsim. (Layers: simulation scripts; high-level subsystem interface; general and subsystem-specific helper functions; core primitives of the scripting language and language extensions.)

Whenever the kernel is idle (i.e. no process is scheduled to run), the umlsim code in the kernel does one of the following:

• if a soft-interrupt is pending: it generates a timer interrupt, but does not advance time

• if a timer will expire within the next jiffy:1 it generates a timer interrupt, and allows do_timer to advance time by one jiffy

• if the next timer will expire in the future: the UML kernel reports this back to the simulator, and waits for further instructions

Since the kernel always runs some timers (such as neigh_periodic_timer and rt_check_expire), umlsim does not need to handle the case of a timeout without further activity.

1The "jiffy" is the basic time unit in the kernel. One jiffy typically equals 1–10 ms.


The kernel can also become active when a device interrupt arrives. umlsim currently only handles network events. Instead of using signals (which correspond to interrupts in UML), it calls the functions invoked by interrupt handlers directly.

When all kernels in a simulation report that they are waiting for a timer, the simulator picks the earliest expiration time among these timers (or a timeout specified in the wait command, if it is earlier), sets the global simulation time to that value, and updates the local time in all kernels.
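The time-advance step just described can be illustrated with a short sketch. The code below is purely illustrative: the structure, field, and function names are invented for this example and do not appear in umlsim. It shows how the earliest pending timer, or an earlier wait() timeout, becomes the new global simulation time, which is then propagated to every kernel.

#include <stddef.h>

struct sim_kernel {
        double next_timer;      /* expiration time reported back by this kernel */
        double local_time;      /* this kernel's view of the simulation time */
};

/*
 * Pick the earliest pending timer among all waiting kernels (or the wait()
 * timeout, if that comes first), advance the global clock to it, and
 * update the local time in every kernel.
 */
static double advance_time(struct sim_kernel *k, size_t n,
                           double now, double wait_timeout)
{
        double next = wait_timeout;
        size_t i;

        for (i = 0; i < n; i++)
                if (k[i].next_timer < next)
                        next = k[i].next_timer;
        if (next < now)         /* never step backwards */
                next = now;
        for (i = 0; i < n; i++)
                k[i].local_time = next;
        return next;
}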

2.2 Running UML under a debugger

When running UML under a debugger or a similar program (such as strace), the tracing thread watches the debugger with ptrace, intercepts calls to functions like ptrace and waitpid, and emulates them or redirects them to the process currently executing the kernel. This part of UML is called the "ptrace proxy."

Unfortunately, this design allows only a single UML system per debugger, because a process can be watched with ptrace by at most one process at any given time.

In order to control multiple UML systems, umlsim forks a forwarder process for each such system. This process communicates with the main simulator through pipes, and executes the ptrace calls on its behalf. This is shown in Figure 4.

As an example, Figure 5 shows a simplified flow of control when the simulator performs a ptrace call on a UML system.
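As a rough illustration of the forwarder idea, the following sketch shows a child process that owns the ptrace attachment and executes requests it reads from a pipe. The message format, the structure and function names, and the error handling are all invented for this example; the real umlsim forwarder differs in detail.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <unistd.h>

/* hypothetical request format sent by the simulator over the pipe */
struct ptrace_req {
        int op;                 /* e.g. PTRACE_PEEKDATA, PTRACE_POKEDATA */
        pid_t pid;              /* process currently executing the UML kernel */
        void *addr;
        void *data;
};

static void forwarder_loop(int req_fd, int reply_fd)
{
        struct ptrace_req req;
        long res;

        /* execute each ptrace request on behalf of the simulator */
        while (read(req_fd, &req, sizeof(req)) == (ssize_t)sizeof(req)) {
                res = ptrace((enum __ptrace_request)req.op,
                             req.pid, req.addr, req.data);
                if (write(reply_fd, &res, sizeof(res)) != (ssize_t)sizeof(res))
                        break;
        }
}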

2.3 Debugging the kernel

There are several idiosyncrasies of kernel code and of the way gcc compiles it that need special attention in umlsim. This section describes some of them.

Figure 4: umlsim uses an intermediate forwarder process to "debug" the UML system.

Because the jiffies variable is defined by the linker, the debugging information generated by the compiler only contains its declaration, but not its location. umlsim therefore retrieves this information from the symbol table of the kernel executable, and augments the declaration with it.

Some functions use registers instead of the stack to pass arguments (e.g. those declared with FASTCALL). umlsim currently does not support or even recognize this.

Figure 5: Simplified control flow when the simulator performs a ptrace call on a UML system.

The kernel makes heavy use of inline functions. One peculiarity of inline functions is that breakpoints in an inline function need to be repeated for each instance of this function. Furthermore, gcc rearranges and sometimes even removes labels (the ones used as targets for goto statements) when optimizing. umlsim introduces a mechanism called "reliable markers" that includes an explicit label in the function, which can then be used for breakpoints. Reliable markers also work in functions that are not inlined, and are used as follows:

void some_function(int a)
{
    int b = 10;

    MARKER(label_name, a, b);
    ...

Variables that may be accessed by the simulator when stopped at the location of the marker are listed after the label name. This makes sure that the variables in question have a memory location, that they are not cached in registers when passing the marker, and that no code accessing these variables gets moved across the marker. (E.g. in the example above, the compiler might otherwise try to move the initialization of b after the marker.)

Also, since many low-level service functions are declared static inline, they cannot be called directly. umlsim generates callable instances of the most common inline functions by including their definitions in a file compiled with -fkeep-inline-functions.
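To make the effect of -fkeep-inline-functions concrete, here is a minimal, self-contained sketch (not taken from umlsim, whose real file instead includes the kernel definitions it needs): compiling a file like this with "gcc -O2 -fkeep-inline-functions -c" makes the compiler emit an out-of-line, callable copy of the inline function even though nothing in the file calls it, which is exactly what a debugger-like tool needs.

/* callable_inlines.c -- illustrative only */
static inline int double_it(int x)
{
        return 2 * x;   /* would normally vanish completely at -O2 */
}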

2.4 Kernel changes

The kernel changes required for umlsim are comparatively minor, and most of them are in the area specific to the UML architecture.

umlsim requires the following changes in generic kernel code:

• calibrate_delay explicitly waits for a timer interrupt, which would never happen under umlsim, because at that time, the umlsim idle thread does not yet exist. Therefore, umlsim simply skips calibrate_delay when using virtual time, and sets loops_per_jiffy to one.

• functions are added to timer.c to retrieve the expiration time of the next timer.

umlsim replaces the following functions using the linker's --wrap mechanism (a generic sketch of this mechanism follows the list):

• do_gettimeofday and gettimeofday return the simulation time instead of the system's real time.

• setitimer becomes a no-op, because umlsim generates all timer interrupts under its own control.

• a switch is added to control whether a timer interrupt invokes do_timer. This way, timer interrupts can be used to run soft-interrupts without advancing the jiffies count.

• idle_sleep leads to the timeout handling code of umlsim.
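The following generic sketch shows how the --wrap mechanism works; it is not the actual umlsim wrapper. Linking with "-Wl,--wrap=gettimeofday" makes calls to gettimeofday() resolve to __wrap_gettimeofday(), while the original function remains reachable as __real_gettimeofday(). The simulated_time variable is a stand-in for the virtual clock maintained by the timeout handling code.

#include <sys/time.h>

/* stand-in for the virtual clock maintained elsewhere */
static struct timeval simulated_time;

/* the original implementation remains reachable under this name */
int __real_gettimeofday(struct timeval *tv, struct timezone *tz);

int __wrap_gettimeofday(struct timeval *tv, struct timezone *tz)
{
        (void)tz;
        *tv = simulated_time;   /* report virtual time, not real time */
        return 0;
}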

The timeout handling code decides which actions to take (e.g. to raise a timer interrupt if there are pending soft-interrupts), communicates with the simulator, and maintains the various "current time" variables.

The umlsim patches also add the reliable markers, and callable definitions of common inline functions, which are both described in the previous section.

2.5 Network simulation

When simulating network elements, umlsim builds on the infrastructure used by uml_switch, but the script intercepts the transmit function and replaces most of the UML-specific part of the stack. Figure 6 shows the key functions invoked when sending and receiving packets.

Figure 6: Call sequence and packet flow with and without umlsim (simplified). (Two columns: "Packet flow in unchanged UML" and "Packet flow with umlsim"; in the umlsim case the script handles flow control, queuing, transport, etc.)

With umlsim, the UML-specific networking part is only used for device setup, but all the transport and low-level packet manipulation functionality is provided directly by the simulation script.

umlsim currently only provides a single-link model, which will be extended and generalized in the near future. Figure 7 shows how the device interface and link are implemented. This code can be found in the files include/netsim.umlsim and include/nettap.umlsim of the umlsim distribution.

Figure 7: Control flow in script implementing network device and link. (The handlers $__netsim_outbound_handler, $__netsim_buffer_handler, and $__netsim_link_handler move packets from the device buffer through the link queue to the stack; all handlers are dispatched through $netsim_loop.)

When reaching the breakpoint uml_net_start_xmit, umlsim retrieves the packet, calculates the queuing delay, and stores it in an internal queue. If the device queue is full, the script calls the flow-control function netif_stop_queue.

When the packet is due for sending, it is dequeued and put into the link queue, from which it emerges after the transfer delay. umlsim then invokes basically the same functions as the original code, and finally pushes the packet to the stack by calling netif_rx.
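To make the timing in this model concrete, here is a small illustrative sketch (all names are invented and do not come from the umlsim scripts) of how a single-link model can compute when a packet taken from the device buffer is finally handed to netif_rx on the receiving side: the packet first waits for the link to finish serializing the previous packet, then takes its own serialization time plus the propagation delay.

#include <stddef.h>

struct link_params {
        double bits_per_sec;    /* link bandwidth */
        double prop_delay;      /* one-way propagation delay, in seconds */
};

/*
 * Return the simulation time at which a packet dequeued at 'now' reaches
 * the receiver. '*busy_until' remembers when the link finishes sending
 * the previous packet, so back-to-back packets queue up correctly.
 */
static double delivery_time(const struct link_params *l, double now,
                            double *busy_until, size_t pkt_bytes)
{
        double start = now > *busy_until ? now : *busy_until;
        double tx_done = start + (8.0 * pkt_bytes) / l->bits_per_sec;

        *busy_until = tx_done;
        return tx_done + l->prop_delay;
}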


3 The scripting language

The scripting language is mainly based on C and Perl, but it also borrows concepts from the Bourne shell, Pascal, and LISP. This section gives a brief overview of the most important concepts, and differences to similar languages.

umlsim passes scripts through the C preprocessor, so the usual comment handling, macro capabilities, and include files are available.

3.1 Variables and functions in the simulator

The names of variables in the simulator always begin with a dollar sign, like in Perl. Also like in Perl, variables can be used without prior declaration, their value can be of any type, and the type can change.

An uninitialized variable has the so-called undefined value. The undefined value can also be used explicitly, with the construct undef, and one can test whether an expression yields the undefined value with defined expression.

The scripting language, like C, uses lexical scoping, i.e. the visibility of a variable is determined by its location in the program, but not by the sequence of function calls that leads to a specific access.

By default, variables are visible within the enclosing function, but not in other functions. This can be changed by either declaring them local, which creates a new, uninitialized instance that is visible in the current block and any blocks inside it, or by declaring them global, which creates a new instance in the current function, which is visible also in functions defined inside this function. Example:

$a = 5;
{
    local $a = 0;

    $a++;
    {
        $a++;
        printf("inner %d\n", $a);
    }
    printf("middle %d\n", $a);
}
printf("outer %d\n", $a);

yields

inner 2
middle 2
outer 5

The goal of these slightly unusual scoping rules is to avoid explicit declarations as much as possible, but also to avoid the problem of functions accidentally altering global variables, which is common in other scripting languages.2

Functions are anonymous, similar to lambda expressions in LISP. In order to reference a function by name, the function has to be stored in a variable. Example:

global $gcd = function ($a, $b)
{
    if ($a == $b) return $a;
    return $a > $b ?
        $gcd($a - $b, $b) : $gcd($a, $b - $a);
};

print $gcd(300, 90);

The scripting language also supports associative arrays. Indices can be integers, pointers, strings, processes, or breakpoints. Elements can be of any type, including arrays. Examples:

2Only time (and users) will tell whether this is indeed an improvement over more traditional scoping rules. Users preferring to declare all their variables can set the -Wundeclared option to enable warnings when trying to access undeclared variables.


$a[0] = 3;
$a["string"] = $a;
print $a["string"][0];

3.2 Printing and files

The scripting language has two output statements: the C-like printf, and the "smart" and somewhat Perl-like print.

print accepts a list of items to print, appends a newline after the last item, and it pretty-prints structured types. Example:

$proc = uml("linux");
print "xtime = ", xtime;

yields

xtime = {
    tv_sec = 0 (0x0)
    tv_nsec = 0 (0x0)
}

print outputs integers as decimal and as hexadecimal numbers, enumeration type members by name, strings and signed character arrays as text strings, and arrays of unsigned characters as a hexdump. Like in Perl, a separator between printed arguments can be introduced by setting the special variable $,.

To send output to a file, the file first has to be opened with the open function, which has a file name argument like Perl's open, but returns a file handle. Then, the file handle can be used as the first argument of print or printf. Example:

$file = open(">tmp");
print $file, "example data";
printf($file, "answer = %d\n", 42);
close($file);

Data can be read from files with the read function, but this is rarely used.

3.3 Control statements

if-else, while (with break and continue), and for work exactly like in C. There is no do-while loop, because while can be used in its stead.3

3. . . and because the way programs are represented internally makes do-while somewhat difficult to express. It may be added at a later time.

switch-case is similar to C, with the difference that variable expressions can be used for case labels.

There is no goto.

3.4 Processes

Simple programs are started with the function run, and UML systems are started with the function uml. Both functions return a handle that identifies the process. They also set the "magic" variable $$ to this value. $$ always identifies the current process, i.e. the process that has most recently been created or stopped, and that is currently being manipulated. $$ can be changed by the simulator (when a different process becomes current) and by the script (if one wants another process to be current).

run and uml also support some basic IO-redirection, e.g. run("/bin/date", ">/tmp/xyz").

After run or uml, the process is in the starting state, but not yet running. A starting or stopped process is run with the continue statement.4

4This continue has the process handle as argument, in parentheses, and therefore differs syntactically from the continue control statement. If continuing the current process, the parentheses can be left empty.

The wait statement is used to make the simulator wait for the next event (process termination, breakpoint, timeout, etc.). If the event is related to a process, wait sets $$ to this process. Example:


$proc = run("/bin/echo", "hello world");
continue();
wait();
print $$ == $proc;

yields5

hello world
1 (0x1)

3.5 Breakpoints, functions, and timeouts

Breakpoints can be placed at function entry, at the point to which the current function returns, at labels, and at the reliable markers described in section 2.3. Breakpoints are set with the break function, which returns a handle that identifies the breakpoint.

In the example below, we set breakpoint $b1 at the entry of the main function in the current process, breakpoint $b2 at the label or reliable marker label inside the main function, and breakpoint $b3 at the location to which the current function returns.

$b1 = break(main);
$b2 = break(main.label);
$b3 = break(return);

When reaching a breakpoint, umlsim sets $$ to the process in which the breakpoint is located, and $! to the breakpoint handle.

5After warning that /bin/echo has neither symbols nor debugging information, so there is very little umlsim can do with this process.

When calling a function in the process, a breakpoint is also generated. This breakpoint is triggered when the function returns. The return value of the function is stored in the special variable $?. Example:

$b = call fn(1,2,3);
continue();
...
wait();
if ($! == $b)
    printf("result = %d\n", $?);

This example shows an asynchronous call, because other breakpoints, timeouts, or events in other processes can be handled before the function returns. If this flexibility is not needed, one can use the simpler synchronous form, which does not change $! or $?. Example:

printf("result = %d\n",fn(1,2,3));

Breakpoints can be removed implicitly, by destroying all references to them, or explicitly with delete(breakpoint);

A script can not only call functions in a process, but it can also make a function return. For example, $__netsim_outbound_handler in Figure 7 forces uml_net_start_xmit to return, without executing any code of that function, with $$.return 0;.

Besides terminating or reaching a breakpoint, a process may also stop with a timeout. Timeouts are specified with a time argument to wait. When the specified absolute time is reached, wait sets $$ to the undefined value, and the "current time" variable $@ to the timeout, rounded up to the next nanosecond. Example:

wait(10.2);    /* wait until t = 10.2 seconds */
if (!defined $$) print $@;

If more than one timeout can occur at a given time (e.g. packets arriving within the same nanosecond at different points in the simulation), wait($@) must be called after handling each event, so that breakpoints reached when handling a timeout can be processed before handling further timeouts. An example for using wait($@) can be found in the event loop at the end of include/netsim.umlsim in the umlsim distribution.

Figure 8: User-visible process states. "Function call" and "return" refer to asynchronous function calls. (States: starting, running, stopped, timeout, terminated; the figure also shows which of the special variables $$, $!, and $? are set on each transition.)

Figure 8 summarizes the process states described in this section. States shown in grey allow manipulations of the process, such as the creation of new breakpoints, access to variables, or function calls. A function call from timeout puts the process in a state equivalent to stopped, but it does not affect any of the special variables.

3.6 Data in a process

umlsim scripts can directly read and write variables in a process, follow pointers, select struct or union members, and so on.

The basic operation is to access a variable. In many cases, simply specifying the variable's name is enough, e.g. given the example program below, foo retrieves the value 42.

static int foo = 42;

int main(void)
{
    static int bar = 5;

    MARKER(stop_here, bar);
    return bar;
}

Accessing bar is more complicated. If the program has not been started yet, umlsim looks for variables only in the global scope. To access a variable local to a function, it has to be qualified with the function name, i.e. main.bar.

If the program is stopped at the label main.stop_here, umlsim searches the local scope first, so just bar is sufficient.

Variables, functions, and labels can also be qualified with the process and the compilation unit. Compilation units are in double quotes. Examples:

$b1 = break($proc.main);
$b2 = break("fs/ext2/super.c".parse_options);
"drivers/net/tun.c".debug = 1;
"tun.c".debug = 0;

Since distinct processes may use the same name for different types, struct or union tags can also be qualified, e.g. struct $proc_a.sk_buff.

Type definitions with typedef differ from C in that umlsim cannot usefully distinguish at parse time between typedef names and other identifiers. Therefore, typedef names are always prefixed with the keyword typedef, e.g. typedef pte_addr_t. Like all other names, they can be qualified. For convenience, the C99 standard integer types uint32_t, int8_t, etc. are predefined.

Conflicts between C identifiers and keywords of the scripting language (e.g. printf) can be resolved by escaping the word with a backslash when the C identifier is meant, e.g. \printf.

A peculiarity in the way umlsim handles data is array copies: when accessing an object of array type, the entire array is copied. To obtain a pointer to the array, the & operator must be used. Example:

C program fragment:

int a[10];
int b[10];

umlsim script:

$array = a;
b = a;        /* memcpy equivalent */
$ptr = &a;

This can also be used in type casts. E.g. the following construct copies the content of a network packet:

$pkt = (unsigned char [skb->len])skb->data;

4 Simulation example

In this section, we use umlsim to demonstrate a bug in Linux TCP, and to show the effect of a possible fix. The problem in question, which was first observed on a simulator by Cheng Jin, is that Linux TCP6 decreases the congestion window (cwnd, TCP's estimate of how many packets can be "in flight" for a given connection) too much if there are multiple packet losses in a single round-trip time.

6Most if not all 2.4 and 2.5 kernels are affected. At the time of writing, this bug still exists in the mainstream kernel. The entire discussion can be found at [7].

When a packet is lost, TCP assumes that this was due to congestion, and reduces cwnd by half. However, if multiple losses occur within a single round-trip time, they should be treated only like a single loss. Linux TCP does not do this, and may reduce cwnd to as low as a quarter of the original value. This causes TCP to send data a little slower than it would be allowed to.

Figure 9: Network setup used in the simulation (sender and receiver connected via a rate-limiting policer and a link; parameters shown in the original figure: 20 packets/s, 560 kbps, 1500 bytes MTU, 5 packets, 500 ms one-way delay).

Figure 9 shows the network configuration used in the simulation: the TCP sender and receiver are connected by a single link with a round-trip time of one second. The maximum throughput is rate-limited to twenty packets per second. We simulate the transfer of a 1 MB file.

The left-hand side of Figure 10 shows the transfer with an unchanged 2.5.66 kernel. snd_cwnd is the congestion window, in segments. snd_ssthresh marks the point where TCP switches between "slow start" and "congestion avoidance" mode. snd_cwnd should not fall below snd_ssthresh. snd_una is the number of bytes that have been acknowledged by the receiver. packets lost is the cumulative number of packets dropped by the rate limiter.

Figure 10: Simulated transfer with and without the "cwnd quarter" bug. (Both panels plot segments (cwnd, ssthresh) and bytes (sequence) against t (sec), showing snd_cwnd, snd_ssthresh, snd_una, and packets lost; left panel: "With cwnd quarter bug", right panel: "Work-around for cwnd quarter bug".)

For the second simulation, we use the same kernel, but set a breakpoint at the beginning of tcp_cwnd_down, and execute a replacement in the script instead of the original function. This replacement implements a fix that keeps cwnd from being lowered too far.

With this work-around in place, snd_cwnd never falls below snd_ssthresh, and the transfer finishes considerably earlier than in the buggy version.
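For illustration only, the clamping idea behind such a work-around might look like the sketch below. This is not the actual replacement used in the simulation script, and the real fix operates on the kernel's TCP state rather than on plain integers.

/* Never let the reduced congestion window drop below the slow-start
 * threshold while cwnd is being brought down during recovery. */
static unsigned int clamp_cwnd(unsigned int snd_cwnd,
                               unsigned int snd_ssthresh)
{
        return snd_cwnd < snd_ssthresh ? snd_ssthresh : snd_cwnd;
}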

5 Future work

As a complex but relatively young project, umlsim still has shortcomings in many areas. This section discusses some of the problems, and outlines approaches for solving them.

Future work on umlsim will primarily focus on the needs of network simulations, and in particular the analysis of TCP performance.

5.1 Functionality

Networking simulations are essentially limited to a single-link scenario at the time of writing. Work is under way to provide building blocks that allow the construction of arbitrary network topologies.

Another issue all but ignored so far is portability to architectures with a byte order or word size other than that of ia32. Support for multiprocessing is also absent so far.

The simulator currently has no direct control over processes running in user space under a UML kernel. It would be useful if simulations could treat such processes like ordinary processes, i.e. by launching them with a simple command, by placing breakpoints, etc.

It would be interesting to explore the possibility of using umlsim to reconstruct the internal state of the kernel, based on traces obtained from "live" systems. For example, this could be used to explain anomalies in network activity captured with tcpdump. The open issue here is how quickly unavoidable time differences and events not recorded in the trace (such as soft-interrupt execution after a hardware interrupt) will cause the simulation to diverge from the original system, and how such errors can be compensated.

5.2 Usability

umlsim today is clearly a hacker's toy. Most users will want high-level components when implementing their simulations, and the scripting language could also use some minor cleanup.

Dark corners of the language include the cast operator, pointers to data in the simulator, inconsistencies in the syntax (e.g. normally, $$.thing is equivalent to just thing, but return is very different from $$.return), and subtle differences in the semantics of array indices and case expressions.

To be useful outside the kernel hacker community, umlsim needs libraries with application-oriented building blocks that provide a convenient level of abstraction. At the time of writing, such a library is slowly emerging for networking, with the main focus on TCP.

Beyond libraries, preprocessors that translate simpler application-oriented languages, like the one used by tcsim, to umlsim may also be useful.

One of the most important aspects of simulations is the visualization of results. While it is desirable to retain a maximum of flexibility, examples of data formats, and visualization packages for common tasks, will help users to obtain results more rapidly.

Also, as befits a hacker's toy, documentation is incoherent and spotty.

5.3 Performance

At the time of writing, umlsim is rather slow. While some optimization work has been done to reduce startup time and to accelerate some lookup operations, and more recently also to accelerate the communication between the simulator and the UML processes, several areas remain where major speed improvements are possible.

ptrace is a rather inefficient means for accessing process memory. It would be better if the simulator entirely bypassed the tracing thread when reading or changing variables, and accessed the address space of the UML processes directly.
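One plausible way to bypass ptrace for bulk memory access, sketched below, is to read the target's address space through /proc/<pid>/mem. The helper name and error handling are invented here, and on most kernels the reader must still be ptrace-attached to the stopped target for the read to be permitted.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Read 'len' bytes at address 'addr' in process 'pid' into 'buf'. */
static ssize_t read_target(pid_t pid, off_t addr, void *buf, size_t len)
{
        char path[64];
        int fd;
        ssize_t n;

        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;
        n = pread(fd, buf, len, addr);
        close(fd);
        return n;
}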

The performance of UML itself is also the object of on-going work [8]. In particular, the so-called "skas mode" ("skas" stands for "Separate Kernel Address Space") has been added recently, to accelerate context switches of processes under UML [9]. By following these changes, umlsim will permit UML to run faster, which in turn will benefit overall system performance, and may perhaps also itself be able to access UML systems more efficiently.

Last but not least, several algorithms and data structures inside the simulator are rather inefficient, and will have to be improved for larger simulations. For example, associative arrays just store their elements in a linear list. Also, results of identifier lookups could be cached.

6 Conclusion

umlsim provides the infrastructure for turning the (UML) Linux kernel into a versatile event-driven simulator that can be customized using a scripting language most programmers will find easy to learn.

The next challenges in the project will be to bring performance closer to that of comparable simulators, to improve overall usability, to apply umlsim to concrete problems, and to use experience gained from such real-life applications to further improve the simulator.

References

[1] Dike, Jeff et al. The User-mode Linux Kernel Home Page, http://user-mode-linux.sourceforge.net/


[2] C. Jin, D. Wei, S.H. Low, G. Buhrmaster, J. Bunn, D.H. Choe, R.L.A. Cottrell, J.C. Doyle, W. Feng, O. Martin, H. Newman, F. Paganini, S. Ravot, S. Singh. FAST TCP: From Theory to Experiments, submitted for publication in IEEE Communications Magazine, Internet Technology Series, April 1, 2003. http://netlab.caltech.edu/pub/papers/fast-030401.pdf

[3] Almesberger, Werner. Linux Traffic Control—Next Generation, Proceedings of the 9th International Linux System Technology Conference (Linux-Kongress 2002), pp. 95–103, September 2002. http://www.linux-kongress.org/2002/papers/lk2002-almesberger.html

[4] Perens, Bruce. Electric Fence, ftp://ftp.perens.com/pub/ElectricFence/

[5] Seward, Julian; Nethercote, Nick. Valgrind, an open-source memory debugger for x86-GNU/Linux, http://developer.kde.org/~sewardj/

[6] The Network Simulator – ns-2, http://www.isi.edu/nsnam/ns/

[7] Almesberger, Werner; Jin, Cheng; Kuznetsov, Alexey. snd_cwnd drawn and quartered, discussion thread on the netdev mailing list, December 25, 2002. http://marc.theaimsgroup.com/?t=104078129500001

[8] Dike, Jeff. Making Linux Safe for Virtual Machines, Proceedings of the Ottawa Linux Symposium 2002, pp. 107–116, June 2002. http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz

[9] Dike, Jeff. skas mode, http://user-mode-linux.sourceforge.net/skas.html


SCSI Mid-Level Multipath

Michael Anderson, Patrick Mansfield
IBM Linux Technology Center

[email protected], [email protected]

Abstract

Multipath IO is the ability to address the same storage device over multiple connections, providing improved reliability and availability. This concept is not new to Linux®. Multipath capabilities exist in the volume management layer, the SCSI upper level, and the SCSI lower level device driver. This paper examines an approach to providing multipath support in the Linux 2.5+ SCSI mid-level. An implementation at this level gives the reduced resource usage and better performance of lower level implementations, along with the device independent capabilities of upper level implementations.

The target audience is developers knowledgeable about SCSI or Linux SCSI internals who are also interested in multipath storage support.

1 Introduction

This section provides an overview of the characteristics of the multiple paths and multiple ports presented to the Linux kernel by the storage IO transport and by the storage device itself.

A path is the connection between the host system and the storage device. Multiple paths to a device result from the storage device having more than one port (multi-port storage device) or the host system having multiple connections into a given bus or fabric that connects to a storage device port (multi-initiated interconnect).

Figure 1: Multi-Initiated and Multi-ported multipath configuration

Utilization of a multipath device can increase the availability of the storage device presented to the operating system by reducing the loss of access due to a single transport problem. Multipath device support may also provide an increase in performance due to load balancing if the performance attributes of the storage device are greater than a single transport can deliver. When a system architecture like NUMA exhibits increased latency between local memory and non-local adapters, multipath with NUMA-aware routing can be used to route IO to adapters with the lowest latency.

Although the necessity for multipath in enterprise systems is clear, the selection of where to implement it has led to several different approaches in the Linux kernel. This SCSI mid-level multipath solution was created to address the following:

• Implementation at a level higher in the stack than vendor unique lower level driver solutions, while not exposing device specific knowledge to layers above SCSI.

• Reduction of kernel resources while still utilizing the existing IO scheduler and interfaces of the block layer.

• Binding paths to devices without obtaining information from the device's media, allowing support for both block and character devices.

• The ability to distinguish between path and device errors.

• Selection of the optimal path for IO based on Lower Level Device Driver (LLDD) attributes, NUMA topology, and device attributes.

1.1 Multi-Initiated Interconnect

A multi-initiated interconnect results from attaching multiple host adapters to a single interconnect, for example, a multi-initiated SCSI bus or multi-initiated Fibre Channel.

Multi-initiated paths to a single storage device normally present equal characteristics. Some hardware platforms can create performance inequalities down separate paths to a storage device port when different latencies exist between a host adapter and the memory it is referencing. A NUMA architecture based platform can present such latencies; depending on the magnitude of the latency, platform specific routing policies can increase performance. Path selection will be discussed in detail later in the document.

1.2 Multi-Port Storage Device

A storage device can present multiple protocol communication ports to an IO interconnect. These ports can be accessed from the host through a single host bus adapter (single initiator) or multiple host bus adapters (multi-initiated).

While the performance characteristics of IO down multiple paths to a single port of a device are nominally equal (excluding NUMA), the performance to different ports of the device can vary greatly due to the architecture of the storage device.

These different storage device architectures can be grouped into three behavior models based on the device's differing response to IO sent to more than one port. A device may exhibit differing port behavior only on ports that cross logical unit ownership boundaries. Some storage devices can be configured to operate in more than one behavior mode. (A hypothetical enumeration of these models is sketched after the list below.)

• Failover – When a device is operating with Failover behavior, IO to a secondary port must be preceded by control commands indicating a redirection of all IO to an alternate port. Once the storage device has transitioned, "failed over" IO may be directed to an alternate port. This behavior model is exhibited in devices with some performance penalty in the transition of logical unit ownership. Clustered shared storage systems may use this model to keep port thrashing from degrading performance.

• Transparent Failover – IO to a single volume should only be sent to a single port until an availability condition arises to cause IO to be redirected to a secondary port. The storage device will transition transparently to using the secondary port on the receipt of the first IO to this port. The storage device transition delta for this first IO is significant when compared to subsequent IOs, such that performance would degrade noticeably if port transitions occurred frequently (i.e., if a round robin routing policy were used).

• Active Load Balancing – IO to a single storage device volume can be sent down any path without degrading performance (note: some cache warmth benefits may be achieved by using more sophisticated path selection algorithms, but this is vendor unique). These storage devices usually have a cache that can be symmetrically accessed from any input port or a single cache that all input ports feed into.
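As referenced above, the three behavior models could be captured in code by something like the following hypothetical enumeration; the names are invented for illustration and do not come from the mid-level multipath patch.

/* Hypothetical classification of multi-port storage device behavior. */
enum mpath_port_model {
        MPATH_FAILOVER,                 /* explicit failover commands required */
        MPATH_TRANSPARENT_FAILOVER,     /* device transitions on first IO */
        MPATH_ACTIVE_LOAD_BALANCING,    /* any port may be used at any time */
};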

1.3 Linux Multipath Implementations

Multipath support to a storage device can be implemented at different levels in the Linux operating system's IO stack. This support can be provided by storage vendors, adapter vendors, and the base kernel.

• Volume Management – Multipath at this level is usually implemented as a modified case of existing RAID support. Multiple block devices exposed by the operating system point to the same storage device and are configured to be failover paths for IO. The "md" driver [3] and the LVM multipath patch [2] are examples of support at this level.

• Upper Level – Support at this level involves chaining or linking the multiple block devices exposed by the operating system as failover paths. An example of this type of implementation is the T3 Multipath failover driver written by Linuxcare Inc. [5].

• Mid-Level – This is the level of implementation described in this document.

• Lower Level – An implementation at this level involves a binding of common vendor adapters exposing only one device to the operating system. On failure, the adapter driver re-drives the IO through another adapter previously paired as a failover adapter. An example of this type of implementation is the Qlogic Fibre Channel failover driver.

2 Data Model

2.1 Current Linux SCSI Device Data Model

This section provides a high level overview of the Linux 2.5 SCSI subsystem data structures, with a focus on providing a background for later discussion on multipath support. General Linux SCSI information can be obtained from "The Linux 2.4 SCSI subsystem HOWTO" [1] and [4] listed in the References section.

Each LLDD that wishes to register with the Linux SCSI sub-system provides a SCSI host template (Scsi_Host_Template) data structure that describes the capabilities of the driver and its interface functions.

The LLDD can register with the SCSI subsystem in two ways. One method is a legacy interface, which is driven from the SCSI mid-level code and calls into the LLDD detect routine, which causes scsi_register() to be called for each adapter card detected by the driver. The other method allows the LLDD to call scsi_register() directly and then call scsi_add_host() when it is initialized and ready to be scanned. These two methods result in a Scsi_Host data structure being allocated for each instance of an adapter card.


If an adapter contains multiple busses or channels (not a PCI bridge of two cards), there will be only one SCSI host structure (Scsi_Host) created.

After a kernel boot, an insmod of an LLDD, or a hotplug event, a list (scsi_hostlist) will contain a SCSI host structure representing each adapter detected.

During device scanning, a SCSI device data structure (scsi_device) will be allocated for each logical unit discovered. Each SCSI device structure will be added to a linked list member of its SCSI host parent.

See Figure 2 for a diagram of the data structures and their relationships.

Figure 2: Linux SCSI Data Structures

Once the scanning phase is complete, a struct scsi_device will be associated to only one struct Scsi_Host.

2.2 Mid-Level Multipath Data Model

When no multipath capabilities are enabled in the Linux SCSI subsystem, multiple paths result in a SCSI device structure being created, and eventually exposed through the block or character layer, for each path. This redundancy wastes system resources and creates a non-optimal presentation of structures to the block layer and user level.

The mid-level multipath implementation coalesces these extraneous SCSI device structures while still maintaining the information relating to the paths.

The child relationship of the SCSI device structure to the SCSI host structure is removed and a new relationship is created using the mid-level multipath structures. The SCSI multipath structure (struct scsi_mpath) is a container for all the paths to the storage device. The SCSI mpath structure also contains information on the path routing policy, a count of active paths, and a reference to where the last IO was routed. The scsi_mpath structure is associated with the SCSI device structure through a new member, sdev_paths.

Each path is represented by a SCSI path structure (struct scsi_path). This path structure contains a fast reference to the next sibling path, the state of the path, and a SCSI nexus structure (struct scsi_nexus).

The nexus object contains the transport specific knowledge to communicate with the storage device. In an ideal world, this nexus would be an opaque object (i.e., a handle) that was handed to the SCSI mid-level during the device scanning process.
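The paper does not reproduce the structure definitions; the following is a minimal C sketch of the relationships just described. Only scsi_mpath, scsi_path, scsi_nexus, and the sdev_paths member are named in the text; the remaining field names are illustrative assumptions, not the actual patch.

#include <linux/list.h>

struct Scsi_Host;                       /* existing mid-level structure */

/* Sketch of the per-path addressing information (assumed field names). */
struct scsi_nexus {
        struct Scsi_Host *host;
        unsigned int channel;
        unsigned int id;
        unsigned int lun;
};

/* Sketch of one path to the device: sibling link, state, and nexus. */
struct scsi_path {
        struct list_head sibling;
        int state;                      /* e.g. good or dead */
        struct scsi_nexus nexus;
};

/* Sketch of the container for all paths to one storage device; a
 * scsi_device would point at this through its new sdev_paths member. */
struct scsi_mpath {
        struct list_head paths;
        int path_policy;                /* last-path-used or round-robin */
        int active_cnt;                 /* count of active paths */
        struct scsi_path *last_path;    /* where the last IO was routed */
};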

See figure 3 for a diagram of the multipath data structures and their relationships.

2.2.1 Multipath Data Model General Applicability

The data model presented for multipath has advantages even in non-multipath cases.

Because the mid-level model has separated the request queue presented to the block layer from the nexus object that is associated with the SCSI hosts, paths containing nexus objects can be added to or removed from a SCSI device structure.

Figure 3: Mid-Level Multipath Data Structures

When all paths have failed or there are no paths to a device, a policy could be created to allow the SCSI device structure to remain in place, but suspend SCSI IO request processing. This would allow block and file system level attachments to remain established while transport connectivity is in flux.

If UUID authentication can be ensured (this would be the case for multipath devices supported by this implementation) new paths could bind to the SCSI device and IO request processing would resume. If authentication cannot be ensured, lower level resources can be released in less time than current structures allow.

Depending on the future direction of the SCSI mid layer, this separation could be used to add or remove SCSI subsystem components while IO is active.

3 Mid-Level Multipath

3.1 Linux SCSI Scanning Overview

Following is an overview of the SCSI scan algorithm for a given logical unit within scsi_scan.c, as it pertains to the modifications for use with multipath (per the code in Linux version 2.5.68).

A call to scsi_alloc_sdev allocates and initializes a scsi_device (sdev). Note that a sdev, logical unit, and I_T_L nexus are all equivalent in the current Linux SCSI code. The LLDD slave_alloc() function is called for the sdev [4].

The sdev is sent an INQUIRY. Device attribute settings are obtained by calling scsi_get_device_flags().

If the logical unit responds, and it has a logical unit configured, the sdev is left in place; otherwise it is removed.

The scsi_load_identifier() function is called to get a UUID (universal unique identifier) via SCSI INQUIRY VPD pages [6], and the result is stored in sdev->name.

scsi_device_register() is called, generating a hotplug event.

Last of all, the LLDD slave_configure() function is called for the sdev.

Upper level attaches are done after all scanning (on insmod or initialization of a LLDD), or the upper level attach is done after a single logical unit is scanned via /proc/scsi/scsi (in scsi_add_device()). These in turn can generate their own set of hotplug events as the upper level drivers (sd, st, sr, and sg) attach to each sdev.

This means that a series of hotplug events occurs for many scsi_devices, followed by a series of hotplug events for each upper level device (such as a block device).

3.2 Scan Modifications for Multipath

The scanning code is changed as follows for mid-level multipath support.

The allocation of an sdev is modified to not only allocate the actual scsi_device, but to also allocate and add a single path (struct scsi_path, including a struct scsi_nexus) to the sdev. slave_alloc() is modified to take both an sdev and a scsi_nexus as arguments, such that it can access both logical unit data (in the sdev) and nexus specific data.

Future plans are to supply a set of parallel slave_nnn interfaces for use with multipath, so that existing drivers not supporting the new interfaces will behave as if the multipath patch were not applied (each path to a storage device will generate a new scsi_device).

After a UUID is retrieved, all existing sdev's are searched for a match.

If no match is found, this is the first path to the device, and it is handled the same way as the current non-multipath code.

If a match is found, the new path is added to the matching sdev (the paths are coalesced), and the current sdev is freed.

slave_configure() is also modified to take both an sdev and a scsi_nexus as arguments.
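A rough sketch of this coalescing step is shown below; the helper functions (scsi_find_sdev_by_uuid(), scsi_mpath_add_path(), scsi_free_sdev()) are illustrative assumptions rather than names from the actual patch, and the structures are those sketched in Section 2.2.

/* Sketch of UUID-based path coalescing during scan; helper names
 * are assumed.  The real changes live in scsi_scan.c. */
static struct scsi_device *scsi_coalesce_path(struct scsi_device *new_sdev,
                                              struct scsi_path *path)
{
        struct scsi_device *sdev;

        /* Search all existing sdevs for one with a matching UUID. */
        sdev = scsi_find_sdev_by_uuid(new_sdev->name);
        if (!sdev)
                return new_sdev;        /* first path: keep the new sdev */

        /* Another path to a known device: coalesce it into the existing
         * sdev and free the redundant one. */
        scsi_mpath_add_path(sdev->sdev_paths, path);
        scsi_free_sdev(new_sdev);
        return sdev;
}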

3.3 UUID

The immutability of the UUID is key to determining if more than one nexus can access the same storage device. This is not a simple problem to deal with, as some devices return no UUID, some return a UUID that is not unique, and others require device specific methods to retrieve a truly unique UUID. Future changes (such as a UUID white list) are required to properly handle the UUID in all cases; user level scanning would simplify the problem.

Discussions were actively in progress at the time this paper was written on whether or not to keep the existing UUID retrieval code in the kernel. Depending on the outcome, and the amount of time available, the multipath patch might have to carry the UUID retrieval.

The primary problem with moving UUID retrieval to user level (for use with multipath, assuming full user level scanning is beyond the scope of the current implementation) is that the current scan and upper level attach are initiated without the ability for user level intervention: all upper level devices are attached to all scsi_devices with no synchronization from user space.

Without the coalescing of paths as described above, devices can show up multiple times, leading to potential resource shortages (memory as well as major/minor numbers), and potential problems for applications dependent on the hiding of duplicate paths.

Further investigation is needed to determine if it is practical to modify the scan and upper level attach to be user initiated (versus the more difficult problem of complete user level scanning), such that all devices are scanned in kernel code, and then from user level the UUIDs are retrieved, the paths coalesced, and the upper level attaches triggered.


3.4 Current SCSI I/O Request Flow Overview

Following is an overview of the current IO request flow as it pertains to functions modified for use with SCSI mid-level multipath IO.

A SCSI device structure request queue member (request_queue) is registered with the block layer at SCSI scan time for en-queuing requests. Both SCSI character and block devices utilize this queue, as do the commands issued during SCSI scan and by upper level attachment.

For block IO devices, an IO request is sent to the block layer via the __make_request() function. In turn it eventually calls the SCSI block request function, scsi_request_fn().

For SCSI character devices, scanning, and commands sent during upper level attaches, scsi_do_req() or scsi_wait_req() are used to send SCSI commands to the scsi_device. These functions set up the sr_done function pointer, insert a request, and trigger a call to scsi_request_fn() via a call to the block queue blk_insert_request().

The scsi_request_fn() function retrieves a request and calls scsi_prep_fn() via a call to elv_next_request().

scsi_prep_fn() allocates and initializes the scsi_cmnd. The scsi_cmnd done function is set in upper level drivers via calls to their init_command functions. For users of scsi_do_req() or scsi_wait_req(), the done function is set to sr_done.

The scsi_cmnd is the key data structure used to issue a request to a LLDD.

Control continues in scsi_request_fn(), where resource limitations and hardware limits (such as queue depth) are checked via calls to scsi_dev_queue_ready() and scsi_host_queue_ready().

If resources are available, scsi_dispatch_cmd() is called; it adds a timeout, and transfers control to the LLDD by calling the scsi_host queuecommand function, passing the scsi_cmnd and scsi_done().

The LLDD is responsible for sending the command to the actual logical unit. After the request is submitted, queuecommand returns.

Upon completion of the IO request, the LLDD calls the scsi_cmnd scsi_done() function.

scsi_done() puts the completed command onto a per-CPU queue, and raises the SCSI_SOFTIRQ.

scsi_softirq() determines the completion status of each scsi_cmnd via calls to scsi_decide_disposition().

scsi_decide_disposition() classifies the completion status of the IO (the scsi_cmnd) and returns one of the following values, which lead to further actions:

SUCCESS: the IO has completed without error, and the command is completed by calling scsi_finish_command(). This includes an IO completion with failures (for example, an IO went to a disk, but had media errors).

ADD_TO_MLQUEUE: the IO completed with a SCSI QUEUE FULL status. The command is re-queued for a retry via scsi_queue_insert(), effectively resending the command.

NEEDS_RETRY: the IO had a temporary or other condition such that it can be immediately resent; resend the IO (without re-queuing it) by calling scsi_retry_command().

FAILURE or any other value: a device or transport error occurred. Call scsi_eh_scmd_add() to queue the failed command for error handling; when the host adapter has no more IO outstanding (when the active_count equals the host_failed count) the error handler wakes up and handles all failed commands.

scsi_finish_command() is the main path for IO completion; it calls the scsi_cmnd done function, either the upper level completion function (for sd, sd_rw_intr) or the function specified in scsi_wait_req() or scsi_wait_done().

3.5 Modifications for IO Path Selection and Path Failures

The SCSI code is modified as follows for mid-level multipath.

A path (including a nexus) is selected via a call to scsi_get_best_path() from scsi_request_fn().

Path selection is affected by the number of paths, path state, path policy, and NUMA topology.

Lists of all available paths and all active paths to a device are kept. In addition, for NUMA systems, there is a list of paths local to a given node.

If no active paths are available, the scsi_request_fn() function fails the IO request.

Currently, path selection policy is controlled via the global variable scsi_path_dflt_path_policy. This is set via the kernel config and can be modified at boot time, with future plans to allow setting this both per device and via sysfs.

Setting scsi_path_dflt_path_policy to SCSI_PATH_POLICY_LPU (1) sets the path selection policy to last path used. This means that another path will only be used on failure.

Setting scsi_path_dflt_path_policy to SCSI_PATH_POLICY_ROUND_ROBIN (2) sets the path selection policy to round robin. This means that paths will be rotated across all available paths on every request sent to the device.

A last-path-used policy is safest for general purpose use (for example with a device using a transparent failover model). Future plans are to add device specific attribute hooks and code to fully support a transparent failover model, so that round-robin path selection can be done across a subset (lowest path weight) of all paths, not just a single path.

NUMA path selection, where possible, picks a path local to the node containing the memory to be used for the IO operation. If no local path is available, all paths are valid for selection. So, for round-robin path selection, path selection is either round-robin with respect to all paths local to a given node, or for all paths to a device.

Current NUMA multipath support is limited to a one-to-one mapping of path to node. Future plans are to support multiple nodes connected to the same path (the same bus). Changes are required in the current kernel NUMA topology in order to support topologies that have varied or unequal distances (that is, where the inter-node distances can vary), or for cases where a node contains no CPUs.
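The two policies can be pictured with a simplified selector. This is only a sketch built on the structures assumed earlier (scsi_path_next_active() is a hypothetical helper), and it ignores the per-path state checks and the NUMA node-local lists just described.

/* Simplified sketch of policy-based path selection. */
static struct scsi_path *scsi_get_best_path(struct scsi_mpath *mp)
{
        struct scsi_path *path;

        if (mp->active_cnt == 0)
                return NULL;            /* caller fails the IO request */

        if (mp->path_policy == SCSI_PATH_POLICY_LPU)
                /* Last path used: stay on one path until it fails. */
                path = mp->last_path;
        else
                /* Round robin: rotate to the next active path. */
                path = scsi_path_next_active(mp, mp->last_path);

        mp->last_path = path;
        return path;
}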

For multipath, the scsi_cmnd device is changed from a struct scsi_device pointer to a struct scsi_nexus pointer; the scsi_nexus retains the same fields as those used by the LLDD in order to determine the nexus (that is, the scsi_nexus contains a scsi_host pointer, channel, id and lun).

Then as part of the path selection, the scsi_cmnd device pointer is set to point to the selected path's nexus. (Renaming the scsi_cmnd device to nexus would be appropriate.)

The LLDD queuecommand function is called, passing the scsi_cmnd that includes a pointer to a scsi_nexus. Existing LLDD code can then be used, with no changes required to the core of the LLDD.

Upon IO completion, scsi_path_decide_disposition() categorizes failures as transport (path failure) or device specific (device failure). Path specific failures cause a path to fail, and the IO can be retried on any remaining paths. Device specific failures generally offline the device, and do not allow an IO to be retried.

scsi_check_paths() is then called with an indication as to whether a path failure has occurred or not; it updates the path state, and returns a result specifying the action to take: the standard SUCCESS (meaning the IO has completed, not that it has completed successfully), FAILURE, or a new REQUEUE value signaling that the IO should be re-queued. NEEDS_RETRY is no longer returned.

In scsi_softirq(), when a REQUEUE result is returned from scsi_decide_disposition(), the IO is re-queued via scsi_queue_insert().

So, on a path failure, the IO is re-queued, and in scsi_request_fn() the IO can be retried on any of the remaining paths.
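In outline, the modified completion handling might look like the following sketch. The wrapping function, its arguments, and the scsi_queue_insert() reason code are assumptions; only the disposition values and the named mid-level functions come from the description above.

/* Sketch of completion handling with the multipath changes: a
 * transport failure yields REQUEUE so the request is re-driven on a
 * surviving path, while device failures go to the error handler. */
static void scsi_mpath_complete(struct scsi_cmnd *cmd)
{
        switch (scsi_decide_disposition(cmd)) {
        case SUCCESS:
                scsi_finish_command(cmd);       /* normal completion */
                break;
        case REQUEUE:
                /* Path failed; scsi_check_paths() has already marked it
                 * dead.  Re-queue so scsi_request_fn() selects another. */
                scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
                break;
        default:                                /* FAILURE */
                scsi_eh_scmd_add(cmd, 0);
                break;
        }
}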

3.6 User Space Interface

3.6.1 Procfs

The mid-level multipath code provides a procfs interface for viewing and setting attributes related to paths. The path to the procfs file is /proc/scsi/scsi_path/paths. The file supports both read and write operations, and displays attributes about the paths. The table below provides a description of each of the columns in the output.

Column  Description
1       UUID
2       Host Number
3       Host Channel
4       Target ID
5       Lun
6       State
7       Failures
8       Weight

Table 1: Procfs Columns

An edited output of a read, showing only one device, is as follows:

# cat /proc/scsi/scsi_path/paths
...
2000002037171f24 3 0 1 0 1 0 0
2000002037171f24 4 0 1 0 1 0 0
...

Writing to the file allows path attributes to be modified. Currently the only meaningful write operation is to modify path state. A path state may be modified from dead to good or good to dead. A good path state has a value of "1" and a dead path has a value of "3".

An example of failing a path is shown as follows:

echo "2000002037171f24 3 0 1 0 3 0 0" \
    > /proc/scsi/scsi_path/paths

In the near term the procfs interface will be replaced with a sysfs interface. At the time of this writing, the interface was not complete and is discussed in the Future Work section.

4 Future Work

4.1 Multipath Device Personality

The use of device-neutral information through standard SCSI interfaces limits the set of multipath capabilities that can be supported on a given device to a minimal set known to be safe for all devices. To utilize the extended capabilities of some storage devices, the device attributes or a device's "personality" needs to be exposed.

The interfaces provided for obtaining this personality knowledge will not be restricted to kernel space. Some data can be set from user space, but other operations will need to be kernel resident to avoid deadlock. Further direction toward user level scanning will affect these interfaces.

The single kernel config time path policy can be enhanced with device attribute information, allowing support for device specific path policies.

Path weighting values related to a device's attributes would allow proper primary and secondary paths to be determined. The ability to determine preferred paths, to assist in balancing load across a storage device's ports, can also be added.

4.2 SCSI Reservations

The support of SCSI Reserve/Release affects the type of path selection policy that can be selected for a storage device, restricting it to last path used. Support of SCSI persistent reserve requires an interface to accept a reservation key, and special IO operations, before paths can be used for normal IO. Future work is trying to meet the requirements of SCSI reservation with a general purpose path preparation capability.

4.3 Sysfs Interface

As mentioned in a previous section, the procfs interface to mid-level multipath is being migrated to a sysfs-based interface. The migration to a sysfs interface allows for linkage to the device tree for increased path topology information, and the utilization of common kernel code infrastructure, reducing code duplication.

5 Availability

The SCSI Mid-Level Multipath project page is located at:

http://www-124.ibm.com/storageio/multipath/scsi-multipath/

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

References

[1] Douglas Gilbert, “The Linux 2.4 SCSI subsystem HOWTO”, http://www.linuxdoc.org/HOWTO/SCSI-2.4-HOWTO/index.html


[2] “LVM multipath support”, http://oss.software.ibm.com/linux390/useful_add-ons_lvm.shtml

[3] “MD Multiple Devices driver”, drivers/md/*

[4] “SCSI mid_level - lower_level driver interface”, Documentation/scsi/scsi_mid_low_api.txt

[5] “Linux T3 Driver”, http://open-projects.linuxcare.com/t3/

[6] “SCSI Primary Commands - 3 (SPC-3)”, ftp://ftp.t10.org/t10/drafts/sam3/sam3r06.pdf

[7] “SCSI Architecture Model - 3 (SAM-3)”, ftp://ftp.t10.org/t10/drafts/sam3/sam3r06.pdf


IPv4/IPv6 Translation

Allowing IPv4 hosts to communicate with IPv6 hosts without modifying the software on the IPv4 or IPv6 hosts

J. William Atwood∗

Kedar C. Das
Xing (Scott) Jiang

Concordia University
Department of Computer Science

Montréal, Québec H3G 1M8

[email protected], http://www.cs.concordia.ca/˜faculty/bill

Abstract

As the Internet makes the transition from IP version 4 to IP version 6, it will be necessary to allow IPv4-based clients to access IPv6-based servers, and IPv6-based clients to access legacy services. Network Address Translation–Protocol Translation (NAT-PT) can provide network protocol translation, and Application Layer Gateways (ALGs) can handle the cases where peer addresses are embedded in application-layer messages. We describe an implementation on a Linux router/translation server, the necessary configuration of the IPv4 and IPv6 environments, and the operation of ALGs for the File Transfer Protocol and for the Session Initiation Protocol. We will present a demonstration of basic communication (web client to web server), and of multimedia communication (based on the open-source VOCAL project).

∗On leave at Ericsson Research Canada, Open Systems Research Laboratory, Montréal, Québec, Canada

1 Introduction

IPv6 is the next generation protocol designed by the IETF to replace the current version of the Internet Protocol, IPv4. During the last decade, IP has conquered the world's networks. Most of today's Internet uses IPv4, which has been remarkably resilient in spite of its age, but it is beginning to have problems.

One motivation for developing IPv6 was the anticipated exhaustion of addresses for individual hosts. While the rate of depletion has been slowed through the use of Network Address Translation (NAT) [1], it does continue, and the other virtues of IPv6 (routing and network autoconfiguration, and enhanced support for IP Security (IPsec) and IP Mobility (Mobile IPv6)) will encourage its deployment much more widely in coming years.

Although a significant percentage of the clients and servers will be dual stack (i.e., capable of using either IPv4 or IPv6), there will be a large number of existing clients and servers (legacy systems) that will only be capable of using IPv4, and there will be a growing number of clients and servers that will only be capable of using IPv6. For example, the Third Generation Partnership Project (3GPP) has mandated that third generation cellular networks will be "All-IP," and that the "IP" will be IP version 6 only.

In the same way as NAT has been used to connect hosts on private networks [2] with hosts on the public network, Network Address Translation–Protocol Translation (NAT-PT) [3] has been standardized as a way of connecting hosts in the IPv4 address space and hosts in the IPv6 address space. NAT and NAT-PT work by altering the IP headers. For NAT, only the address fields are replaced; for NAT-PT, the entire header is changed. This use of NAT-PT solves the network-layer problem for the IPv4/IPv6 transition, but it requires some auxiliary services to operate properly, and it does not solve a number of application-layer problems associated with crossing the IPv4/IPv6 boundary.

In this paper, we discuss the auxiliary services needed, report an experimental validation of the requirements, outline the solution to some of the application-layer problems, and speculate on the solution to the rest.

2 System Architecture

Figure 1 identifies the typical components that will be used to support communication between IPv4 hosts and IPv6 hosts. The IPv4 region represents the entire IPv4-based Internet of today. The IPv6 stub region contains the hosts that are to be granted the privilege of accessing legacy (IPv4-based) services. The IPv6 region represents the rest of the IPv6 address space. The v4/v6 Border Router provides the connection between the IPv6 stub hosts and the IPv4 hosts. The v6/v6 Border Router provides the connection between the IPv6 stub region hosts and the rest of the IPv6 address space. In some systems, the v4/v6 Border Router and the v6/v6 Border Router will be co-located. However, we leave them separate in the following, to make the explanations clearer.

Figure 1: System Architecture

Each host has a host name and a host address. The host name is a (globally unique) character string that is intended to be human-readable. The host address is a (globally unique) 32-bit (IPv4) or 128-bit (IPv6) number. A particular host may have more than one name, and more than one address, especially if it has multiple interfaces.

Two address pools are associated with the v4/v6 Border Router. The IPv4 address pool is a sequence of addresses that are associated (temporarily) with the IPv6 hosts that are communicating with IPv4 hosts. Given the scarcity of IPv4 addresses, this pool will be sized to correspond to the number of IPv6 hosts (in the IPv6 stub region) that are actively communicating with IPv4 hosts at a particular time. The IPv6 address pool is a sequence of addresses that represent hosts in the IPv4 region. Given the large size of the IPv6 address space, this pool is structured as a 96-bit prefix, catenated with a 32-bit IPv4 address. A motivation for this will be presented later, in Section 3.6.
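For illustration, forming such a pool address is simply a matter of appending the 32-bit IPv4 address to the 96-bit prefix. The sketch below uses an arbitrary example prefix and IPv4 address, not the values from our experimental setup:

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

/* Build a NAT-PT style IPv6 pool address: 96-bit PREFIX || 32-bit IPv4.
 * The prefix and IPv4 address below are arbitrary examples. */
int main(void)
{
        struct in_addr  v4;
        struct in6_addr v6;
        char buf[INET6_ADDRSTRLEN];

        inet_pton(AF_INET6, "3ffe:2e01:6:ffff::", &v6); /* 96-bit prefix */
        inet_pton(AF_INET, "192.0.2.10", &v4);          /* IPv4 host */

        memcpy(&v6.s6_addr[12], &v4, 4);  /* catenate prefix + v4 address */

        printf("%s\n", inet_ntop(AF_INET6, &v6, buf, sizeof(buf)));
        return 0;
}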

2.1 Translation Requirements

As packets move between the IPv4 region and the IPv6 region, it is necessary to rebuild their headers, since the IPv4 and IPv6 packet headers have different formats. This process is stateless: the necessary mapping information is determined by tables in the v4/v6 Border Router (for packets travelling from the IPv4 region to the IPv6 stub region) or by information carried in the packet address (for packets travelling from the IPv6 stub region to the IPv4 region). For NAT-PT, the mapping between an IPv4 pool address and the corresponding IPv6 host address is one-to-one. When there are insufficient IPv4 pool addresses available, NAPT-PT can be used, with a mapping from (IPv4 address, IPv4 port) to (IPv6 address). This allows about 65,000 IPv6 hosts to be serviced using a single IPv4 address, as long as the IPv4 application does not care about the port that is being used to access it.

Certain packets require special treatment. In general, these packets contain application data that have embedded IPv4 or IPv6 addresses. They are identified when their headers are processed (usually by noting which port they are using), and they are handled by application-specific code called an Application Layer Gateway (ALG). The ALGs are application-specific, because they need to be able to parse the packets being exchanged by the application end-points. The specific ALG then modifies the contents of the packet, to reflect the address translation that has just taken place in the headers.

2.2 Centralized Architecture

One approach to handling the entire requirement is to co-locate the NAT-PT software and the set of ALGs needed to support the desired applications. In this case the interaction between the ALG and the translation tables in the NAT-PT software is simplified. However, the v4/v6 Border Router must provide processing power for all functions, which could overload it.

2.3 Distributed Architecture

An alternate approach is to separate the ALGs from the NAT-PT software. This lowers the processing requirements for the v4/v6 Border Router, but it introduces a requirement to define a protocol for interaction between the ALG and the NAT-PT. This approach is favoured when the application makes use of some form of "proxy" server, because the proxy functions and the ALG functions can often be advantageously combined in a single host, and separated from the v4/v6 Border Router. Commercial systems will need to adopt this approach, to handle the large number of IPv6 hosts that will be in a typical IPv6 stub region. However, our project was concerned with exploring issues relating to establishing the right environment, and we did not require high performance at this time.

3 NAT-PT Tool

The experimental system was based on a user-space NAT-PT implementation developed at ETRI [4]. The original implementation was based on a Linux 2.4.0 kernel, and required modification to make it work on the more recent (2.4.20) Linux kernel.


3.1 General Flow

To establish communication between IPv4 and IPv6 using NAT-PT we need the interaction of at least three modules: NAT, PT, and ALG.

3.2 Network Address Translation (NAT)

The NAT module implements the transport/network layer translation mechanism. It uses a pool of IPv4 addresses for assigning to IPv6 nodes dynamically, and this assignment is done when sessions are initiated across the v4/v6 boundary.

3.3 Protocol Translation (PT)

Because the fields of the IPv6 header are not the same as those of the IPv4 header, the PT module translates IP/ICMP headers to make end-to-end communication possible. Due to the address translation function, and because of possible port multiplexing, PT also makes appropriate adjustments to the upper layer protocol (TCP/UDP) headers, e.g., the checksum.

The IPv4-to-IPv6 translator replaces the IPv4 header of an IPv4 packet with an IPv6 header to send it to the IPv6 host. Except for ICMP packets, the transport layer header and data portion of the packet are left unchanged. In IPv6, path MTU discovery is mandatory, but it is optional in IPv4. This implies that IPv6 routers will never fragment a packet; only the sender can do fragmentation. Path MTU discovery is implemented by sending an ICMP error message to the packet-sender stating that the packet is too big. When an IPv6 router sends an ICMP error message, it will pass through a translator, which will translate the ICMP error to a form that the IPv4 sender can understand. In this case an IPv6 fragment header is only included if the IPv4 packet is already fragmented. The presence of the DF flag in the IPv4 header is the indication of Path MTU discovery.

However, if the IPv4 sender does not perform path MTU discovery, the translator has to ensure that the packet does not exceed the path MTU on the IPv6 side. The translator fragments the IPv4 packet so that it fits in a 1280 byte IPv6 packet, since IPv6 guarantees that 1280 byte packets never need to be fragmented. Also, when the IPv4 sender does not perform path MTU discovery the translator must always include an IPv6 fragment header to indicate that the sender allows fragmentation.
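The core of the IPv4-to-IPv6 header rewrite amounts to copying a handful of fields. The following is a simplified sketch only: it ignores options, fragmentation, ICMP, and the traffic class, and the function name and calling convention are assumptions rather than the ETRI implementation.

#include <string.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/ip6.h>

/* Minimal sketch of the stateless IPv4 -> IPv6 header translation
 * described above.  "prefix" is the 96-bit IPv6 pool prefix and
 * "v6dst" the IPv6 host mapped to the IPv4 pool destination. */
static void translate_v4_to_v6(const struct ip *v4,
                               const struct in6_addr *prefix,
                               const struct in6_addr *v6dst,
                               struct ip6_hdr *v6)
{
        memset(v6, 0, sizeof(*v6));
        v6->ip6_vfc  = 6 << 4;                          /* version 6 */
        v6->ip6_plen = htons(ntohs(v4->ip_len) - v4->ip_hl * 4);
        v6->ip6_nxt  = v4->ip_p;                        /* TCP/UDP/... */
        v6->ip6_hlim = v4->ip_ttl;                      /* TTL -> hop limit */

        /* Source: PREFIX catenated with the IPv4 source address. */
        memcpy(&v6->ip6_src, prefix, 12);
        memcpy(&v6->ip6_src.s6_addr[12], &v4->ip_src, 4);

        /* Destination: the IPv6 host mapped to the IPv4 pool address. */
        v6->ip6_dst = *v6dst;
}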

3.4 Application Layer Gateway (ALG)

Several applications send IP addresses and host names within the payload of the IP packet. Because NAT-PT does not snoop the payload, it requires Application Level Gateways (ALGs) to extract those addresses and replace them with the appropriate IPv4 or IPv6 addresses. That is, an ALG works in conjunction with NAT-PT to provide support for many such applications. Two examples are the FTP-ALG and the SIP-ALG. These are discussed in Section 3.8 and Section 4.

While the FTP-ALG and the SIP-ALG are optional (provision of a specific ALG depends on the desire to support that particular application), the DNS-ALG is essential, because it is the trapping of the DNS queries that allows NAT-PT to discover the need for mapping between IPv4 addresses and IPv6 addresses.

In the DNS, "A" records represent IPv4 addresses and "AAAA" or "A6" records represent IPv6 addresses.

3.5 Communication from V4 to V6

Any packet originating on the IPv4 side destined to the IPv6 stub network should cross the v4/v6 Border Router, which is the NAT-PT device. When any packet is received on the IPv4 side, NAT-PT will check the destination address. If it is an IPv4 pool address, and if a mapping exists to an IPv6 host, then the IPv4 packet header is converted to an IPv6 packet header; otherwise the packet is dropped. The IPv6 source address will be the PREFIX catenated with the IPv4 source address; the IPv6 destination address will be the mapped IPv6 host address.

If the packet is a DNS query, then the DNS-ALG will change the query type from "A" to "AAAA", and alter the format of strings ending in "IN-ADDR.ARPA" to the IPv6 format ending in "IP6.INT". When the response is returned, the DNS-ALG will change the "AAAA" record to an "A" record (if the resolution was successful), and change the resolved IPv6 address to the corresponding IPv4 (pool) address.

When any other response packet is returned to the NAT-PT, the IPv4 destination address will be found by removing the PREFIX from the IPv6 destination address. The source address to be used in the generated IPv4 packet is the IPv4 pool address corresponding to the IPv6 host.

3.6 Communication from V6 to V4

When a packet is received from the IPv6 side, its destination address will consist of the PREFIX catenated with the actual IPv4 destination, so the IPv4 packet can be created using this address as the destination, and the IPv4 pool address corresponding to the IPv6 host as the source address.

The key to the creation of the mapping is the DNS queries. If a DNS query is received from an IPv6 host, it is not known a priori whether the target host is a v4 host or a v6 host. The DNS-ALG will therefore split the query into an "A" query sent to the v4 DNS, and an "AAAA" query sent to the v6 DNS. If a v6 response is received, then any v4 response is discarded (the v6 path is preferred). Otherwise, a received v4 response triggers creation of a mapping entry, and then an "AAAA" response is generated using PREFIX catenated with the v4 address.

Returning traffic on the IPv4 side will arrive at a pool address. This is used to determine the correct IPv6 destination address, so that the packet can be forwarded. The IPv6 source address will be the PREFIX catenated with the original IPv4 source address.

3.7 TCP/UDP/ICMP Checksum Update

NAT-PT retains the mapping between a specific IPv6 host address and an IPv4 address from the pool of available IPv4 addresses; this mapping is used in the translation of packets passing through NAT-PT. Along with the translation of the IP header, the TCP/UDP/ICMP checksum is also updated by NAT-PT according to a specific algorithm.

3.8 FTP Application Layer Gateway (FTP-ALG)

Existing FTP works with IPv4 addresses. Two important FTP commands, PORT and PASV, will no longer exist in IPv6. The PORT command is used to specify a port different from the default one, and it contains the IP address information, so it can't be used without translation. The PASV command is used to put the server into passive mode, which means the server listens on a specific data port rather than initiating the transfer. This command includes the host name and address of the FTP server and therefore does not work over IPv6 without modification. The PORT command is replaced by the EPRT command, which allows the specification of an extended address for the connection. The extended address specifies the network protocol (IPv6 or IPv4), as well as the IP address and the port to be used. The EPSV command replaces the PASV command. The EPSV command has an optional argument that allows it to specify the network protocol, if necessary. The server reply contains only the port number on which it listens, but the format of the answer is similar to the one used for the EPRT command, and has a placeholder for the network protocol and address information that might be used in the future to provide flexibility in using FTP through firewalls and NATs. An FTP control session carries the IP address and TCP port information for the data session in its payload; an FTP-ALG provides application level transparency.

If a V4 host originates the FTP session and uses the PORT or PASV command, the FTP-ALG will translate these commands into EPRT and EPSV commands respectively prior to forwarding to the V6 node. Likewise, an EPSV response from V6 nodes will be translated into a PASV response prior to forwarding to V4 nodes.

If a V4 host originated the FTP session and was using EPRT and EPSV commands, the FTP-ALG will simply translate the parameters to these commands, without altering the commands themselves.
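As an illustration, translating a classic PORT command into its EPRT equivalent (RFC 2428 syntax) could look like the sketch below. The function name, the buffer handling, and the assumption that the caller has already formed the PREFIX-catenated IPv6 address as a string are all illustrative, not taken from the implementation.

#include <stdio.h>

/* Sketch: rewrite "PORT h1,h2,h3,h4,p1,p2" from a V4 client into
 * "EPRT |2|<v6addr>|<port>|" for the V6 server.  <v6addr> is the
 * PREFIX-catenated form of the client's pool address, passed in
 * here already formatted.  Error handling is minimal. */
static int port_to_eprt(const char *port_arg, const char *v6addr,
                        char *out, size_t outlen)
{
        unsigned h1, h2, h3, h4, p1, p2;

        if (sscanf(port_arg, "%u,%u,%u,%u,%u,%u",
                   &h1, &h2, &h3, &h4, &p1, &p2) != 6)
                return -1;

        /* Protocol 2 = IPv6; the port is encoded as p1*256 + p2. */
        snprintf(out, outlen, "EPRT |2|%s|%u|\r\n", v6addr, p1 * 256 + p2);
        return 0;
}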

3.9 Payload Modifications for V6 originated FTP sessions

If a V6 host originates the FTP session, the FTP-ALG has two approaches:

In the first approach, the FTP-ALG will leave the command strings "EPRT" and "EPSV" unaltered and simply translate the <net-prt>, <net-addr> and <tcp-port> arguments from V6 to its NAT-PT (or NAPT-PT) assigned V4 information. <tcp-port> is translated only in the case of NAPT-PT. The same goes for the EPSV response from the V4 node. With this approach, the V4 hosts must have their FTP application upgraded to support the EPRT and EPSV extensions to allow access from V6 hosts.

In the second approach, the FTP-ALG will translate the command strings "EPRT" and "EPSV" and their parameters from the V6 node into their equivalent NAT-PT assigned V4 node info and attach them to "PORT" and "PASV" commands prior to forwarding to the V4 node. However, the FTP-ALG would be unable to translate the command "EPSV ALL" issued by V6 nodes. In such a case, the V4 host, which receives the command, may return an error code indicating an unsupported function, and this error response may cause FTP applications to simply fail. The benefit of this approach is that it does not impose any FTP upgrade requirements on V4 hosts.

3.10 Header updates for FTP control packets

All the payload translations considered in the previous sections are based on ASCII encoded data. As a result, these translations may result in a change in the size of the packet. If the new size is the same as the previous, only the TCP checksum needs adjustment as a result of the payload translation. If the new size is different from the previous, TCP sequence numbers should also be changed to reflect the change in the length of the FTP control session payload. The IP packet length field in the V4 header or the IP payload length field in the V6 header should also be changed to reflect the new payload size. A table is used by the FTP-ALG to correct the TCP sequence and acknowledgement numbers in the TCP header for control packets in both directions.

The table entries should have the source address, source data port, destination address and destination data port for the V4 and V6 portions of the session, the sequence number delta for outbound control packets, and the sequence number delta for inbound control packets.
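A sketch of one such table entry is shown below; the structure and member names are assumptions based on the fields listed above, not the actual implementation.

#include <netinet/in.h>

/* Hypothetical layout of an FTP-ALG fixup-table entry. */
struct ftp_alg_seq_fixup {
        /* V4 side of the control session */
        struct in_addr  v4_src, v4_dst;
        unsigned short  v4_src_port, v4_dst_port;

        /* V6 side of the control session */
        struct in6_addr v6_src, v6_dst;
        unsigned short  v6_src_port, v6_dst_port;

        /* Payload-size change applied to the TCP sequence numbers */
        int outbound_seq_delta;   /* for outbound control packets */
        int inbound_seq_delta;    /* for inbound control packets  */
};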


4 SIP-ALG Operation

Many communication applications on the Internet require a session protocol to negotiate and maintain the data exchange between endpoints in a session.

Moreover, with the evolution of mobile computing and wireless networks, a session is required to handle user mobility, different media types, and media addition and removal in an existing session. To meet these requirements, the Internet Engineering Task Force (IETF) issued the Session Initiation Protocol (SIP) to enable Internet endpoints (called user agents in SIP) to discover one another and to agree on the parameters of a session.

4.1 SIP overview

“SIP is an application-layer control protocol that can establish, modify, and terminate multimedia sessions (conferences) such as Internet telephony calls” [5]. SIP is specified as an agile and general-purpose tool that works independently of underlying transport protocols and without dependency on the various media types.

SIP establishes sessions by its invitations, in which session descriptions are used to negotiate a set of compatible media types to be shared among participants. In addition, SIP can invite participants to join an already existing session. SIP transparently supports name mapping and redirect services. SIP proxy servers can be used to facilitate routing SIP requests to the user's current location. SIP also provides a registration function that stores users' current locations, which can be used by proxy servers to redirect requests.

The characteristics of SIP are simplicity and flexibility. SIP is not a complete communication system. SIP is rather a component that can cooperate with other IETF protocols to provide complete services to the users. Typically, the Real-time Transport Protocol (RTP) [6] could be combined with SIP to support real-time data transfer and provide QoS feedback; the Session Description Protocol (SDP) [7] could be used for describing multimedia sessions; the Real-Time Streaming Protocol (RTSP) [8] could be used to control delivery of streaming media; and the Media Gateway Control Protocol (MEGACO) [9] could be used for controlling gateways to the Public Switched Telephone Network (PSTN).

It should be noted that SIP does not depend on any of the protocols above that provide services. Rather, SIP provides basic functionalities and operations that can be used to implement different services. Furthermore, “SIP provides a suite of security services, which include denial-of-service prevention, authentication (both user to user and proxy to user), integrity protection, and encryption and privacy services” [5].

4.2 SIP Messages

SIP defines two distinct types of messages: requests and responses. “A SIP message is either a request from a client to a server, or a response from a server to a client. ... Both types of messages consist of a start-line, one or more header fields, an empty line indicating the end of the header fields, and an optional message-body” [5].

4.3 SIP Behavior

SIP requests can be sent directly from a user agent client to a user agent server, or they can traverse one or more proxy servers along the way. User agents send requests either directly to the address indicated in the SIP URI or to a designated proxy (outbound proxy), independent of the destination address. The current destination address is carried in the Request-URI. Each proxy can forward the request based on local policy and information contained in the SIP request. The proxy may rewrite the request URI.

4.4 SIP-ALG Behavior

There exists an increasing need for IP address translation, because networks based on IPv6 addresses have been extended and because the supply of IPv4 addresses is inadequate.

SIP is an application layer control protocol for establishing media sessions. It encounters problems with NAT-like devices, because the payloads of SIP packets carry the addresses for the sessions to be established. However, the NAT function in NAT-PT is application unaware and does not snoop the payloads. Along with NAT-PT and the DNS-ALG, a SIP-ALG is needed on the boundary between IPv4 and IPv6.

This section describes a simple implementation of a SIP ALG to enable simple SIP sessions to pass through a NAT-PT box, based on the Vocal system (an open source SIP implementation created by Vovida.org [10]). Rather than attempt to make a full specification for a SIP-ALG, we have implemented a subset of the functionality that is sufficient for typical use.

An IP packet carrying a SIP message is identified by the NAT-PT box by the characteristic of the SIP protocol, using port 5060 as the destination.

Whenever a Vocal user agent (UA) initiates a SIP session by the host name of a callee, it looks up the IP address from DNS services. DNS servers transparently provide the address as normal, except that the DNS-ALG will set up an address mapping once the DNS query traverses a boundary of IPv4 and IPv6, as described in the previous sections. If a mapping occurs, the SIP message is sent to the NAT box. The SIP-ALG in the NAT box will build a table storing two pairs of the source address and the corresponding destination address for both the IPv4 and IPv6 domains. According to the table, the various fields in the SIP message will be modified. If the Content-Type is SDP, the SDP message and the Content-Length will also be adjusted. After that, the modified message is forwarded to the other IP version of the network. Similarly, the response messages will be modified back to the original messages correspondingly when they return to the NAT box. Thus, it seems as if the IPv4 UA and the IPv6 UA are each communicating with the NAT-PT box.

5 DNS Considerations

The IPv4 region has a DNS server, which returns Address (A) records when queried about a host name that exists in the IPv4 region. The IPv6 stub region has a local DNS server, which returns IPv6 address (AAAA) records when queried about a hostname that exists in the IPv6 stub region. In addition, there is a "global" IPv6 DNS server.

If any IPv6 hosts in the IPv6 stub region are to provide services to IPv4 clients (initiated by the IPv4 clients), the NAT-PT must permanently associate an IPv4 (pool) address to the IPv6 address of the serving host. In addition, one of the following must be true:

1. The IPv4 DNS server must map a host name (probably different from its IPv6 host name) to the associated IPv4 (pool) address, or

2. The IPv4 DNS server must refer the DNS request for the associated host name to a DNS server located at a specific IPv4 pool address. Queries sent to this address are processed by the DNS-ALG at the Border Router, and sent to the local IPv6 DNS (in the stub region). The returned reply is processed by the DNS-ALG, and sent to the original requesting host as a DNS "A" record. This record will contain the IPv4 pool address that was assigned to the IPv6 server.

We had to install and configure a local IPv4 DNS server to filter the requests for resolution of host names associated with the IPv6 stub domain. This was done on the host that would serve as the client, to minimize the impact on the rest of the system. Queries about names ending in ".ip6.lmc.ericsson.se" were sent to the statically assigned IPv4 pool address that was associated with the IPv6 DNS server. All other queries were forwarded to the regular IPv4 DNS server. (Note: this "extra" DNS server would be unnecessary in a production environment. The problem would be addressed by putting the necessary directives in the IPv4 DNS server itself.)

6 Conclusion and Future Work

We have demonstrated that it is possible to protect the investment in past hardware and systems, by installing a NAT-PT on the Border Router that provides access to IPv6-based equipment. To do so requires that special care be taken in configuring the Domain Name System servers (for both IPv4 and IPv6), the NAT-PT itself, and the IPv6 hosts (so that their DNS requests are sent to the NAT-PT).

The NAT-PT solution requires one IPv4 pool address for each IPv6 host that is concurrently accessing the IPv4 address space. This could clearly result in exhaustion of the IPv4 address pool. A potential solution is to use NAPT-PT [3], where host/port number pairs on the IPv6 side are mapped to a single IPv4 pool address and multiple port numbers on the IPv4 side. NAPT-PT bears the same relationship to NAPT that NAT-PT bears to NAT; the implementation details are more complex, but the use of NAPT-PT will be necessary when large IPv6 stub regions are able to use only a small pool of IPv4 addresses to provide the desired services.

7 Acknowledgments

The authors acknowledge the support of Ericsson Research Canada through the use of its laboratory. The first author acknowledges the support of the Natural Sciences and Engineering Research Council of Canada, through its Discovery Grants program.

8 Availability

A link to the developed implementation is on the web site

http://www.linux.ericsson.ca/ipv6

References

[1] K. Egevang and P. Francis. The IP network address translator (NAT). Request for Comments 1631, Internet Engineering Task Force, May 1994.

[2] Y. Rekhter, B. Moskowitz, D. Karrenberg, G. J. de Groot, and E. Lear. Address allocation for private internets. Request for Comments 1918, Internet Engineering Task Force, February 1996.

[3] G. Tsirtsis and P. Srisuresh. Network address translation - protocol translation (NAT-PT). RFC 2766, Internet Engineering Task Force, February 2000.


[4] ETRI-PEC. User space NAT-PT implementation. http://www.ipv6.or.kr/english/natpt-overview.htm.

[5] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. R. Johnston, J. Peterson, R. Sparks, M. Handley, and E. Schooler. SIP: Session Initiation Protocol. RFC 3261, Internet Engineering Task Force, June 2002.

[6] Henning Schulzrinne, S. Casner, R. Frederick, and V. Jacobson. RTP: a transport protocol for real-time applications. Request for Comments 1889, Internet Engineering Task Force, January 1996.

[7] M. Handley and V. Jacobson. SDP: Session Description Protocol. Request for Comments 2327, Internet Engineering Task Force, April 1998.

[8] H. Schulzrinne, A. Rao, and R. Lanphier. Real Time Streaming Protocol (RTSP). Request for Comments 2326, Internet Engineering Task Force, April 1998.

[9] F. Cuervo, N. Greene, A. Rayhan, C. Huitema, B. Rosen, and J. Segers. Megaco protocol version 1.0. Request for Comments 3015, Internet Engineering Task Force, November 2000.

[10] VOVIDA. VOCAL home page. http://www.vovida.org.


Building Enterprise Grade VPNs
A Practical Approach

Ken S. Bantoft
freeswan.ca / MDS Proteomics

[email protected], http://www.freeswan.ca

Abstract

As Linux scales its way into the enterprise, things like high availability and redundancy become more and more important.

As is the case with many Open Source applications, you can integrate several applications together to build in as much redundancy and failover as you need for your network.

By combining several Open Source applications, including Linux [1], FreeS/WAN [2], GNU/Zebra [3] and Heartbeat [4], we are able to build a very reliable, robust VPN solution.

1 Building an Enterprise VPN

FreeS/WAN has been used for several years by many Linux system administrators to build Virtual Private Networks (VPNs) between sites, end-users, and business partners. It is very configurable, inter-operates well with other IPSec implementations and is generally quite stable. It has few limits; however, two of the more common limits sysadmins run into are:

1. IPSec doesn't tunnel all IP traffic: it does not handle Multicast or Broadcast traffic. This means if we wanted to do Multicasting over our VPN, or run OSPF from Zebra on ipsec0 for dynamic routing, we can't. While it's possible to run OSPF in NBMA mode, we ran into problems with this as well.

2. You need one tunnel per network combination pair: thus if you have 4 IP subnets behind Secure Gateway #1, and 2 behind Secure Gateway #2, you will need to configure 8 separate tunnels (unless you can aggregate your IP network space to /23s, /16s or other CIDR compatible blocks).

Point 1 isn't too much of a limit, since many networks don't need multicast and broadcast traffic to be routed between sites. In large networks, point 2 quickly becomes an administrative nightmare to deal with.

This can quickly get out of hand for large networks, especially when more than two sites are involved. The ideal solution is to set up a single tunnel between each site, and then route all traffic from site 1 destined for site 2 over the VPN, and hope the other side accepts it. The problem here is that IPSec policies (which FreeS/WAN enforces) prevent this: if there is no explicit tunnel defined for the Source + Destination pair, the packet is dropped.

The solution here is to use GRE (Generic Routing Encapsulation). This has been part of the Linux kernel for quite some time, and allows us to solve both problems identified above. By setting up a GRE tunnel over one of our IPSec tunnels, we can then route anything we want over the GRE tunnel, including Multicast traffic.

Once we are using GRE, we need some way to dynamically inform the other side of the tunnel which networks we know about, and how to get to them. GNU/Zebra can provide this functionality, using either OSPF or BGPv4.

2 Challenges & Solutions

Integration of all of the applications used presented several challenges, since none of them were designed to work together. Some had to be extended, and others had to be scripted around in order for them to notify each other of various different sorts of failures.

Luckily, with source in hand, this was much easier than expected.

2.1 Zebra

Zebra was the easiest to integrate, as no direct code changes were required. There were initially some problems with using OSPF on aliased interfaces (e.g., eth0:0), but those were solved by an upgrade to the latest version (0.91a or 0.93 are known to work).

2.2 Heartbeat

Heartbeat needed some modifications so it was aware of the status of the physical interfaces, and then just needed a basic configuration and a lot of scripting.

One of the most significant differences between commercial grade routers and Linux is that if an interface is physically unplugged, or the switch/hub on the other side goes down, a commercial router drops the interface, and all routes that travel over it are removed. Linux desperately needs this capability, but until recently many network cards were unable to report the link status.

Donald Becker [5] wrote a handy toolset for this: mii-tool/mii-diag. During my tenure at IBM Canada, one of the developers I worked with took this and turned it into a patch against Heartbeat. If Heartbeat detects a physical problem with the network card/cable/switch, Heartbeat initiates a failover event. This code supported the Intel 10/100 (eepro.o|e100.o) as well as the Intel Gigabit Ethernet adapters (e1000.o).

The bulk of the time went into the rewriting of many scripts to properly bring interfaces up and down, to restart Zebra's bgpd correctly, and to cleanly restart FreeS/WAN.

2.3 FreeS/WAN

FreeS/WAN required no code changes, only configuration and some scripting to keep the configuration synchronised between each set of Secure Gateways. We used SSH for this.

3 Gluing it all Together

The complicated part is making all of these applications and protocols work together seamlessly. Our basic network layout is below:


We have two Secure Gateways, each with aconnection to the internet. Ideally, each wouldhave its own link to the Internet, preferably re-dundant, but you could share the link if needed.Each gateway also has a connection to the locallan, which Heartbeat sends keep-alives over.If possible, use a Null-modem cable betweeneach pair of gateways for out of band keepalives. If this isn’t possible, Heartbeat also sup-ports UDP keep-alives over any network inter-face.

Ensure each gateway has all of the applications installed, and the GRE and FreeS/WAN configurations should be synchronised. Heartbeat and GNU/Zebra configurations differ slightly, so they should not be shared.

The key interactions between the applications are described below.

Heartbeat is the control center in this setup, as it monitors each node in the group for failure, and checks its own Ethernet devices via MII calls. Heartbeat also starts and stops various scripts (from /etc/rc.d/init.d) which bring up FreeS/WAN, the GRE tunnels, and GNU/Zebra.

3.1 Dealing with Startup Scripts

FreeS/WAN and GNU/Zebra both provide scripts suitable for use in /etc/rc.d/init.d, so we used those. We also needed to add a GRE tunnel, so we wrote our own startup script for that. A quick example:

#!/bin/sh
# chkconfig: 2345 50 64
# description: Set up a GRE tunnel from
#              here to somewhere else

case "$1" in
start)
        ip tunnel add MYTunnel mode gre \
                remote 216.1.1.1 local 116.1.1.1 ttl 255
        ip link set MYTunnel up
        ip addr add 172.16.0.1 dev MYTunnel
        ip route add 172.16.0.2/32 dev MYTunnel
        ;;
stop)
        ip route del 172.16.0.2/32 dev MYTunnel
        ip addr del 172.16.0.1 dev MYTunnel
        ip link set MYTunnel down
        ;;
restart)
        $0 stop
        $0 start
        ;;
*)
        echo "Usage: $0 {start|stop|restart}" >&2
        exit 2
        ;;
esac

exit 0

Our script supports the traditional arguments start, stop, and restart, and will bring the GRE tunnel up and down when called from Heartbeat.

3.2 Heartbeat Configuration

Full details on configuring Heartbeat are available from the package itself [7], so I will cover only the /etc/ha.d/haresources config file here. From Heartbeat, we need to control the IP address takeover, and the services (/etc/rc.d/init.d scripts) we start and stop when a node fails. This can be done with a simple, single line entry in /etc/ha.d/haresources:

cluster1 116.1.1.1/28 192.168.0.1/24 \
        ipsec gre zebra bgpd

The above line tells Heartbeat to do IP address takeover on 116.1.1.1 (our External IP address) and 192.168.0.1 (our Internal IP address). It also lists the scripts (in order) to run when a takeover happens. It passes each of these scripts a parameter—either "start" or "stop" determined by what is occurring—taking over the IP, or releasing it. (I.e., /etc/rc.d/init.d/ipsec start.)

3.3 FreeS/WAN Configuration

FreeS/WAN configuration was straightforward. PSK (pre-shared secrets) use is not recommended, as recent bugtraq postings have shown some potential security flaws. We recommend RSA signatures, however X.509 digital certificates can also be used. The following example uses RSA signature authentication for one remote site:

config setup
        interfaces="ipsec0=eth0:0"
        klipsdebug=none
        plutodebug=none
        plutoload=%search
        plutostart=%search
        uniqueids=yes

conn %default
        keyingtries=0

conn site1tosite2
        authby=rsasig
        left=116.1.1.1
        leftnexthop=116.1.1.30
        leftrsasigkey=0xA0S8PIPI...
        leftid=@site1.company.com
        right=216.1.1.1
        rightnexthop=216.1.1.30
        rightrsasigkey=0xA0QKJ986...
        rightid=@site2.company.com
        auto=start

The critical line of the config is interfaces="ipsec0=eth0:0", as by default FreeS/WAN won't bind to an aliased interface. Since Heartbeat brings up the service IP addresses on aliases, we need to bind our ipsec interface to the alias.


3.4 GNU/Zebra Configuration

From GNU/Zebra, we use the BGPv4 daemon to handle our dynamic routing. This gives us much more control over which routes we share than OSPF would, as well as making configuration simple.

Sample bgpd.conf file:

!
hostname torcofw1
password a_secure_password
enable password a_more_secure_password
log file bgpd.log
log stdout
!
router bgp 65432
 bgp router-id 172.16.0.1
 network 172.16.0.0/30
 redistribute kernel
 redistribute connected
 redistribute static
 neighbor 172.16.0.2 remote-as 65432
 neighbor 172.16.0.2 next-hop-self
!

We use a reserved AS number (65432) in case we ever need to do BGP with peers from another company, or the Internet. All of the network and neighbour statements refer to our GRE tunnel IP addressing, as we wish to communicate with our BGP peer over the GRE tunnel—not the IPSec tunnel. neighbor 172.16.0.2 next-hop-self is critical—we need our BGP peer to send any traffic destined for our local networks through us directly, since we have an established tunnel.

4 Conclusions

It took a few weeks to get this setup stable, during which we changed from OSPF to BGPv4, which cleared up several problems we encountered with neighbours failing to exchange routes consistently. The current design has been running in production at 4 sites for over 2 years now, and we have had several successful failovers (several faulty network cards, a bad switch port, and the more common system administrator error).

Connections that do not pass through netfilter connection tracking (i.e., NAT/MASQ) are usually unaffected—with a keepalive time of 2 seconds, and a deadtime of 10 seconds, dead peer detection is fairly quick. This can be optimized down to about 5 seconds if needed. Changing over the IP addresses, starting FreeS/WAN, GRE tunnels and GNU/Zebra takes less than 10 seconds on modern hardware, so our total failover time is less than 20 seconds.

5 Future Improvements

There are more improvements to be made that could bring detection and failover down into the 1–3 second range.

Heartbeat seems currently limited to 1-second keepalives—this could be brought down to 1/4 second over the serial interface, meaning a deadtime of 1 second would be reasonable (3 missed polls).

FreeS/WAN has a routing limitation whereby you can't have two tunnels for the same source + destination pair going to two different remote gateways. Hopefully, this limitation will not be present in either kernel 2.6's IPSec implementation, or future versions of FreeS/WAN that implement the MAST [8] device.

Connections that do utilize the netfilter connection tracking are currently cut off, since the secondary firewall is not aware of the current state of the conntrack table on the primary firewall. There was some discussion at the OLS 2002 Netfilter BOF, and on the Netfilter Failover list [9], on how to handle synchronization of the conntrack table, however no code has emerged.

6 Acknowledgments

I would like to acknowledge the FreeS/WAN team, and extra thanks to Michael Richardson and JuanJo Ciarlante, who helped me set up a UML environment so I could demonstrate much of this. I'd also like to acknowledge IBM Canada, who employed me during the initial development of this solution, and my current employer, MDS Proteomics, who allows me to continue to refine it. Also, thanks to Astaro Corporation for funding some of my development of Super FreeS/WAN, and covering my hosting costs for freeswan.ca.

7 Availability

Software:

http://www.freeswan.ca
http://www.zebra.org
http://www.linux-ha.org

Documentation:

http://www.freeswan.ca/docs/HA

References

[1] Linux, http://www.linux.org

[2] The FreeS/WAN Project, http://www.freeswan.org, http://www.freeswan.ca

[3] The GNU Zebra Project, http://www.zebra.org

[4] The Linux-HA Project, http://www.linux-ha.org

[5] Donald Becker, Scyld Computing Corporation, http://www.scyld.com/diag#mii-diag

[6] Ken S. Bantoft, Building HA VPNs with FreeS/WAN, http://www.freeswan.ca/docs/HA/

[7] Getting Started with Heartbeat, http://www.linux-ha.org/download/GettingStarted.html

[8] John S. Denker, Next-Generation IPsec Packet Handling, http://www.quintillion.com/moat/ipsec+routing/mast.html

[9] Netfilter Failover Archives, http://lists.netfilter.org/pipermail/netfilter-failover/


Linux memory management on larger machines

Martin J. Bligh

David Hansen

Abstract

A large amount of work has gone into the memory management subsystem during the 2.5 series of Linux® kernels, and it is more stable under a wide variety of workloads than the 2.4 VM (virtual memory subsystem). Many scalability problems have been solved, making memory management perform much better on larger machines (meaning either with more than 1GB of RAM, or more than one processor, or both). Some of these changes also benefit smaller machines.

During the 2.4 series of kernels, the main Linux distributions diverged massively from the mainline kernel, particularly in the area of VM. This causes ongoing maintenance problems, and wasted, duplicated effort in problem solving and feature implementation. Many of the enhancements made by the distributions have been brought back into the mainline kernel during the 2.5 series, under the leadership of Andrew Morton, providing a solid base for future development, and a greater potential for co-operative work.

This paper discusses the changes made to the Linux VM system during 2.5 that will significantly impact larger machines. It also covers changes that are proposed for the future, most of which are currently available as separate patches. Larger machines also have to cope with a larger number of simultaneous tasks—I have focused on up to 5000.

For the sake of simplicity, clarity, and brevity, we assume an IA32 machine with PAE mode (3 level pagetables) and normal memory layout settings throughout the paper. Unless otherwise specified, measurements were taken on a 16-CPU NUMA-Q® system (PIII/700MHz/2MB L2 cache) with 16GB of RAM.

1 Introduction

Market economics dictate the prevalence of large 32 bit systems, despite the software complexity involved. Though cheap 64 bit chips are beginning to appear, they are still not available as large systems. However, the techniques and discoveries described in this paper are by no means only applicable to such machines.

2 The global kernel virtual area

The fundamental problem with 32 bit machines is the lack of virtual address space for both user processes and the kernel—32 bits limits us to 4GB total. Each user process's address space is local to that process, but the kernel address space is global. In order to ensure efficient operation, the user address space is shared with the global kernel address space (see Figure 1).

Figure 1: The process address space

The default address space split for Linux 2.4 and 2.5 is 3GB user : 1GB kernel. It is possible to change this split, but it is often not desirable—some applications (such as databases) want as much address space as possible for the application, whilst the kernel also wants as much space as possible for its data structures.

The first 896MB of physical memory is mapped 1:1 into the shared global kernel address space. This memory range is known as low memory (ZONE_NORMAL), and memory above the 896MB boundary is known as high memory (ZONE_HIGHMEM). The more physical memory we add to the machine, the bigger the kernel control structures need to be, but the control area is fixed in size by the virtual space limitation.

Thus the more RAM we add to the machine, the more pressure there is on the global kernel area. The standard Linux 2.4 kernel copes very badly with large amounts of memory, perhaps limited to 4GB at best. The Linux 2.4 enterprise distributions will work with 16–32GB of memory, depending on the distribution. Linux 2.5 will cope with approximately 32GB of memory.

Unfortunately, most of the data that is put into the kernel address space is not swappable, and the Linux kernel often does not shrink the data gracefully under memory pressure. Thus, the failure condition is often difficult to diagnose; kswapd goes into a flat spin, all kernel memory allocations stop, and the system appears to have hung. Monitoring the "lowfree" field of /proc/meminfo in the runup to the system hang will often help to detect this condition.

The main space consumers for the kernel space are:

• mem_map (physical page control structures)

• Slab caches, particularly:

– buffer_head

– dentry_cache

– inode_cache

• Pagetables

mem_map is an array of page control structures, one for each physical page of RAM on the system. On a 16GB machine, that takes 19% of the kernel's address space. For 64GB, it takes 78% of all the space we have, leaving insufficient space for the normal kernel text and data. Whilst the machine may boot, it will not be usable.
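
As a rough worked example (assuming a struct page of approximately 48 bytes on this PAE configuration):

16GB / (4K / page) = 4M pages; 4M pages * ~48 bytes = ~192MB, or roughly 19% of the 1GB kernel area

64GB / (4K / page) = 16M pages; 16M pages * ~48 bytes = ~768MB, or roughly 78% of the 1GB kernel area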

William Irwin and Hugh Dickins are implementing a technology called "page clustering," that makes one page control structure govern a group of pages, thus dramatically reducing the space taken (e.g. 8 page groups reduces us from 78% of space to 9%).

3 kmap

The kernel has permanent direct access to low memory, but needs to perform special operations to map high memory (however, note that user space can directly map high memory). High memory is usually mapped one 4K page at a time, via two main mechanisms: persistent kmap, and atomic kmap.

Persistent kmap uses a pool of 512 entries. All entries start out as clean; each entry is used in turn, and has a usage count associated with it. As entries are freed (usage count falls to 0) they are marked as dirty. When we reach the end of the pool, all dirty entries with 0 usage are marked as clean, a system-wide TLB flush is invoked, and now the buffers may be reused. All of these operations are global, and done under a global lock (kmap_lock).

Atomic kmap has a small number of entries per CPU—one for each of a few specific operations (that may need to be done in conjunction, so one entry is not sufficient). To reuse an atomic kmap slot, a single TLB entry needs to be flushed, and only on 1 CPU. This allows lockless operation, and CPU local data management (i.e., no cacheline bouncing). However, due to the CPU-local nature of the mapping, it is not possible to sleep, or reschedule onto another CPU, whilst holding the mapping.
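
For reference, a minimal sketch of how the two mechanisms are used in 2.5-era code (kmap_atomic() takes a per-CPU slot type such as KM_USER0):

#include <linux/highmem.h>
#include <linux/string.h>

static void zero_highpage(struct page *page)
{
        void *vaddr;

        /* Persistent kmap: may sleep; uses the global pool and kmap_lock. */
        vaddr = kmap(page);
        memset(vaddr, 0, PAGE_SIZE);
        kunmap(page);

        /* Atomic kmap: per-CPU slot; no sleeping or rescheduling while the
         * mapping is held, and only a single local TLB entry to flush. */
        vaddr = kmap_atomic(page, KM_USER0);
        memset(vaddr, 0, PAGE_SIZE);
        kunmap_atomic(vaddr, KM_USER0);
}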

The problem comes in that persistent kmap turns out to be heavily used and rather slow. Not only are the data and locking global, but the global TLB flushes are very expensive (particularly on machines without tlb_flush_range, such as IA32). Persistent kmap scales as O(N²) where N is the number of CPUs in the system (N times the frequency that the pool is exhausted * N times the impact from the TLB flushes). As CPU:memory speed ratios continue to grow, the caches become ever more important, and such algorithms are not suitable for heavy use.

The heaviest users were copy_to/from_user and related functions (copying data between kernel and userspace). It is possible that these operations would take a pagefault on the userspace page, and thus sleep (and thus cannot directly use atomic kmap). Other heavy users included one implementation of putting user pagetables into highmem—workloads with heavy pagetable manipulation (e.g. kernel compiles) were observed to spend more than half of their time just mapping and unmapping pte pages.

After much discussion on the subject during 2002, the following solution was agreed upon, and Andrew Morton implemented it. In essence, we now use atomic kmap for operations such as copy_to/from_user, but touch the page first to ensure it is faulted in, making it extremely unlikely that we will take a pagefault. In the unlikely event that a pagefault does occur, we handle the fault, then retry the copy operation using persistent kmap (in practice, this was never found to occur). Truly persistent global operations (typically for the lifetime of the OS instance) where performance is not a concern can still use persistent kmap.

4 Pagetables

The pagetables map the process' virtual addresses to the physical addresses of the machine. For an IA32 machine with PAE, each PTE entry controlling a 4K page consumes 8 bytes of space, resulting in a fully populated 3GB process address space consuming 6MB of PTE entries. In other words, the overhead of PTEs is 0.2% of physical RAM if we have no sharing going on.

In most workloads, however, there are significant amounts of space shared between processes, either in shared libraries, or as shared memory segments. In particular, database workloads often use large shared segments (e.g. 2GB) shared between large numbers of processes. Whilst the memory itself is shared between processes, the pagetables are duplicated; one copy for each process. Thus for 5000 processes sharing a 2GB shmem segment, the PTE overhead for that segment is now 20GB of RAM (i.e. the overhead is 1000% of the consumed space).
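
The arithmetic behind that figure is straightforward:

2GB / (4K / page) = 512K PTEs per process

512K PTEs * 8 bytes/PTE = 4MB of pagetables per process

4MB * 5000 processes = 20GB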

These levels of heavy sharing come from real workloads analysed, not a theoretical projection, and for a machine that might otherwise run happily with 8GB of RAM. There are two obvious ways to reduce the overhead: either we share the pagetables, or we reduce the size of each copy substantially.

Sharing the PTE level of the pagetables has been implemented by Dave McCracken. This enables us to share identical mappings over large areas between different processes, thereby reducing the mapping overhead for the case under consideration from 20GB to 4MB. It is totally transparent to applications, but is not currently in the 2.5 kernel as of 2.5.68.

The locking required for shared pagetables makes the patch slightly complex, and sharing can only occur on a pte-page sized basis (2MB of memory). Due to the mechanisms of shared libraries writing to certain areas of the shlib (and thus causing a copy-on-write split for the 2MB area) and the alignment requirements, it is not generally useful as a mechanism for reducing the overhead of shared libraries. It is, however, extremely effective on large shared memory segments, and reduces the overhead of fork+exec for large processes.

Support for large hardware pagesizes (aka hugetlbfs) can dramatically reduce the overhead of each process' pagetables. On IA32 PAE, there is only 1 large page size available (2MB), and this reduces the overhead by a factor of approximately 512. Memory consumption for the case under consideration is reduced from 20GB to about 60MB. This is in the Linux 2.5 kernel, but requires small modifications to applications in order to use this facility.

A static pool of memory reserved for large pages is established at boot time, and handed out to applications that request it via a flag to shared memory create calls. Future work is planned to make a more flexible mechanism, whereby it is not necessary to reserve a static number of pages, and the kernel automatically uses large pages where appropriate.

In order to accommodate the large numbers of pagetables that are potentially needed on larger systems, it is possible to put the third level of the pagetables (PTEs) into the high memory area, rather than the main global kernel space. While this can greatly alleviate the space consumption problem, it comes at a price in terms of time.

Though modern implementations of highmem pagetables use atomic kmap, the cost of setting up the mappings for access, and the subsequent TLB flush, is still expensive for such heavy usage, especially for workloads that create and destroy processes frequently. For kernel compilation, the overhead of highpte was an increase of approximately 8% of system time.

5 UKVA

The shortage of virtual space on IA32 keeps the kernel from directly mapping everything that it might like to, especially things which are in high memory. We have mechanisms to do this temporarily with kmap() and kmap_atomic(), but both of these mechanisms impose significant overhead in data management and TLB flushing.

For workloads with large numbers of processes, one of the largest consumers of virtual space is page tables, specifically the bottom-level PTE pages. There is an option (highpte) in the kernel to put these pages in high memory and map them via kmap_atomic() as needed, but this incurs an 8% increase in system overhead. It would be much more efficient to permanently map the PTE pages, but into a per-process area instead of a global one, thus giving efficient operation without wasting large amounts of virtual space.

UKVA (User-kernel virtual addressing) provides a per-process kernel memory area. The same virtual space in each process is mapped to different physical pages, just like the userspace addresses (thus the "U"), but with the protections of kernel space (hence the "K"). A previous implementation actually located this area in the current user area. However, that implementation would have made it difficult to locate things other than user PTEs in the area. The implementation described here will locate the area inside the current kernel virtual area, and concentrate on locating PTEs in the area.

The first question is, "How big does it need to be?" An IA32 machine with PAE enabled has 4K pages, each controlled by an 8-byte pte entry. Each process will only need enough UKVA space to map in its own page tables.

4GB / (4K / page) = 1M pages

1M pages * 8 bytes/pte = 8MB of virtual space for ptes

For the purpose of this example, this space will be from 4GB–8MB through 4GB; however, it does not really matter where this area is. It does not need to be aligned in a certain way, but this does make it easier to work with. First, the area should be aligned on a PTE page/PMD entry boundary (2MB with PAE). This will make it certain that the whole area can itself be mapped with only 4 PTE pages (more on this below). Secondly, the area should not straddle a PMD boundary, to avoid the initial setup being required to map more than 1 page, which makes it more expensive.

To map the required 8MB of virtual space, we require 2048 4K page table entries. We can fit 512 pte entries per 4K page, thus we require 4 UKVA PTE pages. Each time a pte page needs to be mapped into the UKVA area, one of these 2048 pte entries contained in the 4 UKVA PTE pages will need to be set.

5.1 Initialization

Since the UKVA area will be the primary means for access to all PTE pages, it must be available for the entire life of the process's pagetables. For this reason, the initialization will occur in pgd_alloc(), at the same time as the top level pagetable entry is created.

On IA32, a pmd entry and a pte entry are the same size, and PTRS_PER_PMD (the count of pmd entries per pmd page) is the same as PTRS_PER_PTE (the count of pte entries per pte page). Also, every time a pte page is allocated, a pmd entry is pointed at it. Each time you want to map an allocated PTE page, a pte entry is made somewhere (highpte uses kmap() to do this). Instead of setting kmap() ptes, we will use UKVA ptes. This means there will be a 1:1 relationship between PMD pages/entries and UKVA PTE pages/entries.

During pgd_alloc(), the 4 UKVA PTE pages are allocated as soon as the PMD page which will point to them is allocated. The 4 pmd entries are made for the 4 pte pages, as are the corresponding 4 pte entries. However, making the PTE entries is slightly complex. One of the goals of UKVA is to replace HIGHPTE, which means that all of the PTE pages will be allocated in highmem, including the special 4 UKVA PTE pages. This means that the pte page that contains the 4 pte entries will need to be mapped via atomic kmap to make the entries. However, after this is done, they may be accessed directly, without ever using kmap again.

This is the time where keeping the 8MB area from crossing a PMD boundary is important. Because of the 1:1 relationship, if the 4 pages are covered by more than 1 PMD page, they will also be covered by more than 1 PTE page, possibly doubling the number of kmap calls which must occur.

5.2 Runtime

The bulk of the UKVA work is done in pte_alloc_map(). In keeping with the 1:1 relationship, each time a pmd_populate() is done, a UKVA PTE is also set. However, there are 2 possible ways to set a PTE with UKVA.

The first method is to simply index into the UKVA area, and set the PTE directly. Since the UKVA area is virtually contiguous, it can be accessed just like an array of every PTE in the system. The PTE controlling the first page of memory is at the start of the UKVA PTE space, just as the last PTE in the space controls the last page of RAM. It will always be known where the PTE for any given virtual address will be mapped. This also means that the 4 UKVA PTE pages themselves will be mapped to constant, known places.
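
Under those assumptions, looking up the PTE for an address is a simple array index. A sketch, where UKVA_BASE is a hypothetical constant naming the start of the 8MB window (4GB minus 8MB in the example above):

#include <linux/mm.h>

#define UKVA_BASE       0xFF800000UL    /* hypothetical: 4GB - 8MB */

static inline pte_t *ukva_pte(unsigned long address)
{
        /* PTEs for the whole 4GB space are laid out contiguously, one
         * 8-byte entry per 4K page, so index by the page number. */
        return (pte_t *)UKVA_BASE + (address >> PAGE_SHIFT);
}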

The second method is used when pte_alloc_map() is asked to allocate a PTE for another process. Since the UKVA only contains the current process's pagetables, the pagetables of the other process must be walked, and appropriate entries made. During the walking process, the UKVA PTE pages must be mapped via kmap_atomic() so that they can be altered.

6 Hot & Cold pages

As the ratio of CPU speed to memory speed grows over time and CPU architectures change in ways such as pipelining, the efficient utilisation of processor caches becomes increasingly important. The hot and cold pages mechanism in Linux 2.5 (and its predecessors such as per-cpu pages) provides an important way to help increase the efficiency of the data cache. This is important for UP systems, but provides even greater benefit on SMP.

For each CPU in the system, for each zone of memory, we provide two queues for data pages: a hot queue, and a cold queue. The general precept is that pages in the hot queue are cache hot on that CPU, and pages on the cold queue are cache cold. Only 0-order pages (single page groups) are kept in these queues; the higher order allocations (multipage groups) are managed directly by the buddy allocator.

Both the hot and cold page lists allocate pages and free pages en masse from the buddy allocator for greater efficiency. This allows us to take multiple pages under one holding of the lock, whilst those codepaths and data management elements are cache hot. The lists have low watermarks, below which they will be refilled, and high watermarks, above which they will be emptied. The default batch size for allocations is 16 pages at a time; watermarks are 32–96 pages for the hot lists, and 0–32 pages for the cold lists.
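
For reference, the per-CPU queues are described by a small structure roughly of this shape (a sketch based on the 2.5 kernel's mmzone.h; treat the field names as approximate):

#include <linux/list.h>

struct per_cpu_pages {
        int count;              /* pages currently on the list */
        int low;                /* low watermark: refill from the buddy allocator */
        int high;               /* high watermark: return pages to the buddy allocator */
        int batch;              /* how many pages to move at a time */
        struct list_head list;  /* the pages themselves */
};

struct per_cpu_pageset {
        struct per_cpu_pages pcp[2];    /* index 0: hot, index 1: cold */
};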

The hot queue is managed as a LIFO stack—pages freed via the normal free_pages() route are pushed onto the hot stack (i.e. assumed to be cache warm). Further tuning in this area may be needed—the caller has better information about the cache warmth of the pages they are freeing than the generic routines. By default, page allocations come out of the hot list, unless __GFP_COLD is specified.

The cold list basically just functions as a batching mechanism for page allocations. It is used for pages that will not be first touched by the CPU in question (e.g. pagecache pages that will be filled by DMA before they are read). This preserves the valuable cache hot pages for other uses, and saves cacheline invalidates for the CPU's cache. shrink_list(), shrink_cache(), and refill_inactive_zone() all free pages back into the cold list via the pagevec mechanism.

Below is a comparison of the kernel profiles of an equivalent workload (kernel compile) with and without the hot & cold pages mechanism, measuring how many ticks are spent in each routine, and which routines see the greatest change. Those labelled '+' get more expensive with hot & cold pages, those labelled '-' get cheaper. The 17.5% overall reduction in the total number of ticks spent is evident.

Ticks   Percent   Routine
 +243      0.0%   buffered_rmqueue
 +197      6.6%   page_remove_rmap
 +131      0.0%   handle_mm_fault
 +116      0.0%   fget
  ...
  -77    -15.7%   release_pages
  -78    -34.4%   atomic_dec_and_lock
  -89    -18.5%   d_lookup
  -97    -16.2%   __get_page_state
 -155    -35.6%   link_path_walk
 -155    -23.8%   copy_page_range
 -178    -27.8%   shmem_getpage
 -193    -24.6%   do_no_page
 -209   -100.0%   pte_alloc_one
 -210    -26.0%   zap_pte_range
 -303   -100.0%   pgd_alloc
 -365   -100.0%   __free_pages_ok
 -532    -28.0%   do_anonymous_page
 -650    -37.0%   do_wp_page
 -700   -100.0%   rmqueue

-4595    -17.5%   total

The cost of rmqueue, pgd_alloc, pte_alloc_one, and __free_pages_ok has shifted into buffered_rmqueue, but it takes much less time to execute. The main cost for do_anonymous_page was in zeroing newly allocated pages, and the cost of do_wp_page is in copying pages from one to the other. Both obviously benefit greatly from the better cache warmth of the system. The profiles only show kernel time—userspace is actually the biggest beneficiary of this mechanism.

7 Page reclaim

In 2.5, the LRU lists were converted from global to per-zone. This makes it easier to free up one particular type of memory (e.g. ZONE_NORMAL) without affecting other types. It also breaks up the global locks and reduces cross-node cacheline traffic for NUMA machines. Following are the most significant elements from kernel profile data from a 2.4.18 kernel + NUMA patches doing a kernel compile on a 16-way NUMA-Q:

2763  _text_lock_dcache
2499  _text_lock_swap
1199  do_anonymous_page
 763  d_lookup
 651  lru_cache_add
 646  __free_pages_ok
 612  do_generic_file_read
 573  lru_cache_del

...

The _text_lock_swap entry is the pagemap_lru_lock.

The page-reclaim daemon (kswapd) must touch large amounts of data, both the user pages being manipulated, and their control structures (e.g. the LRU lists). However, on NUMA systems this is extremely problematic, as it causes a lot of cross-node memory traffic. Hence the global daemon was replaced with a per-node daemon, each of which only scans its own node's pages, which is much more efficient.

One of the last major global locks in the VM was pagemap_lru_lock. Andrew Morton's pagevec implementation reduced contention on it by 98% by batching page operations together into 'pagevecs'—vectors of pages that could be manipulated together as groups more efficiently.

8 rmap

Whilst the pagetables of each process provide a mapping from that virtual address space to the physical addresses backing it, in Linux 2.4 there is no easy way to map back from a physical address to a virtual address. This is what rmap provides—a "reverse" mapping from the physical address back to the set of virtual addresses mapping it.

To reclaim memory, 2.4 used a "virtual scan"—walk each process, and see if we can unmap the physical pages it is using. 2.5 uses a "physical scan"—walk each page of RAM, and see if it can be freed (this requires the rmap mechanism). This new mechanism has several advantages, perhaps the most important of which is stability. It has proven significantly more robust under pressure than the virtual scan in the 2.4 code.

Whilst the usage of the rmap mechanism has remained fairly stable, the method of keeping the data for the reverse mapping has been a source of more contention and trouble. The current 2.5 code as of 2.5.68 uses a mechanism called "pte-chains" which keeps (for each physical page) a simple linked list of pointers back to the pte entries of the processes mapping each page.

These pte-chains have several problems:

• Locking

• Space consumption

• Time consumption

The locking was at first implemented as a global lock, which was actually shared with the worst existing global VM lock (pagemap_lru_lock). This caused massive lock contention (data from a 12-way NUMA-Q), as seen in Table 1.

Therefore, the locking was changed to a per-chain lock, which was subsequently compacted into a 1 bit lock embedded in the flags field of the struct page to avoid more space consumption problems. This reduced kernel compile times by more than half on a 16-way NUMA-Q (from 85s to 40s).

The next problem with pte-chains is the space consumption. A simple singly linked list will consume 4 bytes per entry for the pointer to the PTE and 4 bytes per entry for the pointer to the next entry. Two methods were used to alleviate this:

1. Pages with only a single mapping can use the "page-direct" optimisation—instead of storing the pointer to the linked list in the struct page, we use the same space to point directly to the only PTE by using the pte union introduced into struct page:

union {
        struct pte_chain *chain;
        pte_addr_t direct;
} pte;

This switch is governed by the PG_direct flag from the flags field in the struct page.

2. The lists are grouped by cacheline, allowing multiple PTE pointers per 'list next' pointer. Not only does this reduce the size of the linked list by almost half (assuming sufficient grouping), but it also greatly increases the data locality and cache efficiency for walking the chain.

Moreover, this space consumption all comes from low memory, an extremely precious resource on large 32-bit machines.


SPINLOCKS              HOLD                  WAIT
UTIL   CON    MEAN   MAX    MEAN   MAX     % CPU   TOTAL    NOWAIT  SPIN   NAME
45.5%  72.0%  8.5us  341us  138us    11ms  44.3%   3067414  28.0%   72.0%  pagemap_lru_lock
0.03%  31.7%  5.2us   30us  139us  3473us  0.02%      3204  68.3%   31.7%  deactivate_page+0xc
 6.0%  78.3%  6.8us   87us  162us  9847us   9.4%    510349  21.7%   78.3%  lru_cache_add+0x2c
 6.7%  73.2%  7.6us  180us  120us  8138us   6.4%    506534  26.8%   73.2%  lru_cache_del+0xc
12.7%  64.2%  7.1us  151us  140us    10ms  13.4%   1023578  35.8%   64.2%  page_add_rmap+0x2c
20.1%  76.0%   11us  341us  133us    11ms  15.0%   1023749  24.0%   76.0%  page_remove_rmap+0x3c

Table 1: Massive lock contention on 12-way NUMA-Q

Figure 2: pte-chain based rmap (each mem_map entry either points directly at a single PTE, flagged by PG_direct, or at a chain of pte_chain structures linked by chain->next)

To take our example of 5000 processes sharing a 2GB memory segment again, not only do we now have 20GB of pagetables, but 10GB of pte_chains. Whilst the 20GB of pagetables can at least be moved off into high memory, this is not easy to do for pte chains. In order to move the chains into high memory, the "next element" pointers would need to become physical addresses, instead of virtual ones. Not only are these larger (36 bits instead of 32), they also need to be mapped into virtual addresses before use, an incredibly expensive procedure for walking the linked lists.

Last, but not least, of the problems is the time consumption. For every page used, and for every process that uses it, we must take a lock (using an expensive atomic operation) and create a pte-chain entry. Worse still, when we tear down the mapping, we must take that same lock, and then walk the pte-chain looking for the element to free. This takes an approximately linear amount of time, depending on the number of elements sharing that page. Even for just a load of 128 on SDET, the kernel profile shows the rmap functions massively dominating:

86159  page_remove_rmap
38690  page_add_rmap
17976  zap_pte_range
14431  copy_page_range
10953  __d_lookup
 9978  release_pages
 9369  find_get_page
 7483  atomic_dec_and_lock
 6924  __copy_to_user_ll
 6830  kmem_cache_free

The problem is especially acute under a workload such as SDET that does significant amounts of fork/exec/exit traffic, where mappings must be continually built up and then torn down again. I see this problem as fundamental with a page based approach; though it may be alleviated somewhat by tuning, it is still a per-page operation, and thus too expensive. Even for a simple kernel compile, page_remove_rmap is still the most expensive function in the whole kernel:


23222  page_remove_rmap
14034  do_anonymous_page
 7638  __d_lookup
 6406  page_add_rmap
 5188  __copy_to_user_ll
 3656  find_get_page
 3429  __copy_from_user_ll
 3126  zap_pte_range
 2108  do_page_fault
 1925  atomic_dec_and_lock
 1852  path_lookup

8.1 rmap shadow pages

Ingo Molnar has proposed a new rmap method which I shall call "rmap shadow pages," which alleviates some of the problems with pte-chains, but is still page based. At the time of writing, there is no implementation available, but some of its properties may be determined by analysis.

Instead of the chains being allocated as needed on a cacheline sized block, it is proposed to allocate two "shadow pages" for each pte-page (page filled with PTE entries). These would form a doubly-linked list with other shadow pages. To retrieve the list of PTE entries for a particular page, one would consult the pte-chain pointer in the relevant struct page, and walk the list (similarly to PTE-chains).

The PTE entry itself need not be stored inside the rmap shadow page, but the rmap pages are implicitly "linked" to the pte page in some fashion (e.g. being placed together in some contiguous group of two pages). However, the cacheline locality characteristics seem to be against this method for scanning, as it involves touching a separate cacheline for every element in the list.

To add a page to the linked list, we would take the pte_chain lock, and add ourselves to the head of the list (one modification to the shadow page, plus one to the struct page).

Figure 3: rmap shadow pages (each PTE page has a pair of shadow pages holding prev and next pointers for each PTE slot, forming per-page doubly-linked lists)

Removing a page from the list would not require walking the list (as for pte_chains); instead we would traverse from the PTE page to the corresponding shadow page, map both its prev and next element pages, and perform a regular unlink for a doubly-linked list.

One of the major advantages of the rmap shadow pages method is that the rmap data can be more easily moved into the highmem area. However, this is not without cost—each page accessed must be mapped via kmap, which has proven expensive for PTE pages in the highpte implementation.

Whilst rmap shadow pages may fix some of the problems of pte-chains, it is still page-based, and thus requires a large amount of data manipulation. It is therefore unlikely to solve the fundamental time and space problem, though moving the chains into high memory may be worthwhile.


8.2 Object based rmap

Other operating systems have taken a different approach to the physical to virtual address mapping problem. K42 has taken an approach based on file objects (akin to the Linux address_space structure), which seemed to be promising after initial discussions with Orran Krieger and other K42 engineers.

Instead of keeping a reverse mapping for each page in the system, we can discover the list of virtual addresses by going from the struct page to the address_space object. From the address_space object, we can walk a list of vmas—areas of process virtual memory which map that object. By adding the offset within the file (stored in struct page as "index") to the base virtual address of each vma, we can derive the virtual address within each process. From there, we can walk the pagetables to find the appropriate PTE.
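
A sketch of the address calculation this relies on, using 2.5-era field names (locking and corner cases omitted):

#include <linux/mm.h>

/* Given a file-backed page and one vma that maps the same address_space,
 * derive the virtual address at which this vma maps the page (or 0 if it
 * does not).  The pagetables of vma->vm_mm are then walked from here. */
static unsigned long page_address_in_vma(struct page *page,
                                         struct vm_area_struct *vma)
{
        unsigned long addr;

        /* page->index is the page's offset into the file; vma->vm_pgoff is
         * the file offset at which this vma starts. */
        addr = vma->vm_start +
                ((page->index - vma->vm_pgoff) << PAGE_SHIFT);

        if (addr < vma->vm_start || addr >= vma->vm_end)
                return 0;       /* page falls outside this vma */
        return addr;
}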

The key advantage of this method is that there is no overhead at all for the setup and teardown of each page. This comes at the cost of higher overhead to find the PTEs at scanning time (the list of VMAs and the pagetables for each process must be walked). However, many workloads will run without memory pressure, or only have pressure on the caches, which are easily freed, so the approach seems promising.

However, there are a few fundamental problems with an object-based approach within Linux. For one, there is not a backing file object for every page in the system; some pages (e.g. private process data allocated via malloc/sbrk) are anonymous, i.e. not associated with any file. Whilst it would be possible to create a file object for anonymous pages, this would not be a simple change.

Another problem is that the calculation of adding the offset to the base virtual address within the process assumes that the vma is linear. Whilst this used to be true, the 2.5 kernel contains a new mechanism called "sys_remap_file_pages" that allows for non-linear VMAs.

Bearing in mind that pte_chains are most expensive under heavy sharing (the linked list must be walked for page_remove_rmap), some analysis was taken of the length of the pte-chains for both file-backed and anonymous objects. This showed that the anonymous objects were only mapped once for nearly all pages, and the shared mappings were nearly all file-backed.

Based on these observations, and the simplicity of implementation, a "partially object-based" scheme was proposed. This used the object-based mappings for file-backed pages, and the pte-chains method for anonymous memory (and for nonlinear mappings). Dave McCracken implemented this scheme, and it was very effective in reducing both the space and time taken by pte-chains.

Kernbench-16: (make -j 256 vmlinux)
                  Elapsed   User     System   CPU
pte-chains        47.21     569.17   139.55   1500.67
partial objrmap   46.09     568.19   121.83   1496.67

Note the 12% drop in total system time.

SDET 64 (see disclaimer)
                  Throughput   Std. Dev
2.5.68            100.0%       0.2%
2.5.68-objrmap    121.2%       0.3%

Again, kernel profiles clearly show the reduction:

-30518  -78.9%  page_add_rmap
-72197  -83.8%  page_remove_rmap

Partial objrmap also drastically reduced the number of pte_chain objects in the slab cache, graphically demonstrating that file-backed chains are predominant. For a make -j256 vmlinux:


pte-chains  24116 pte_chain objects in slab cache
objrmap       716 pte_chain objects in slab cache

(a 97% reduction).

However, at the time of writing, there are still a couple of remaining objections to partial objrmap:

1. The interaction with sys_remap_file_pages' nonlinear vmas is complex and convoluted due to the conversion to and from pte_chains which may be necessary. However, it is generally recognised that sys_remap_file_pages is a special case. If those vmas are pre-declared as nonlinear, most of the problems disappear. Furthermore, for the intended use of sys_remap_file_pages (windowing onto large database shared segments), it is acceptable to lock those pages into memory (this is normally done in any case), in which case none of this information will ever be needed, so it is not necessary to keep it.

2. If 100 processes each map 100 vmas onto the same address space, then objrmap would have to scan 10,000 regions, not simply the 100 mappings that the page-based methods might have had for their chains. Whether this corner case is important or not is a matter of debate, as it was exactly the case that sys_remap_file_pages was designed to fix, and callers should be using that method for such an unusual situation. However, a simple optimisation is proposed that should alleviate the problem in any case:

For each distinct range of addresses mapped by a vma inside the address_space, we define an address_range. This takes advantage of the fact that we are likely to remap the same range repeatedly (e.g. for shared libraries). From each shared range, we attach a list of vmas that map that range. Furthermore, we sort the list of address ranges by start address.

Figure 4: list of lists (the address_space holds a sorted list of address_ranges (R), and each address_range holds a list of the vmas (V) that map it)

struct address_range {
        unsigned long start;
        unsigned long end;
        struct list_head ranges;
        struct list_head vmas;
};

9 NUMA support

On NUMA systems, it is more efficient to access node-local memory than remote memory. Thus Linux tries to allocate memory to a process from the node it is currently running on, providing for more efficient performance. This also allows for locality of memory and control structures, reducing cross-node cacheline traffic.

By default, we allocate memory from the local node (if some is free), then round robin amongst the remaining nodes by node number, starting at the local node, and progressing upwards. Kernel code can also request memory on specific nodes via alloc_pages_node().
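
For example, a node-targeted allocation looks something like this (a minimal sketch; nid is the node number):

#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *grab_page_on_node(int nid)
{
        /* Ask for a single (0-order) page from node 'nid'; the caller
         * frees it later with __free_pages(page, 0). */
        return alloc_pages_node(nid, GFP_KERNEL, 0);
}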

Matt Dobson has created a simple NUMA binding interface for userspace, allowing processes to request memory from a specific node, or group of nodes. This is very useful for large database applications, which wish to bind "partitions" of the database to certain nodes.

Several critical control structures (e.g. the mem_map array—the control structures for the physical RAM pages—and the pgdat—the node control structure) are now allocated in the node's own memory, providing better performance on NUMA systems. More items (e.g. the scheduler data, and per-cpu data) should be migrated into node-local memory in the future.

Of the three main memory allocators (alloc_pages, vmalloc, slab cache), only the slab cache is not NUMA aware. Manfred Spraul has created patches to do this, but we have not seen observable performance benefits from this method yet. Part of the problem is the inherently global nature of many of the caches (e.g. the directory cache), which will need to be attacked first.

Some of the kernel architectures (ones with hardware assistance from the CPU) have kernel text replication functioning. This makes a copy of the kernel text on each node, and processes will use their own node's local copy, reducing backplane traffic, and interconnect cache pollution. Replicating the read-only portion of shared libraries also seems promising, though this has not yet been implemented in Linux. Replicating any data that is not read-only is likely to be too complex to be beneficial.

10 Legal

This work represents the view of the authors and does not necessarily represent the view of IBM.

SPEC is a registered trademark and the benchmark name SDET is a trademark of the Standard Performance Evaluation Corporation. This benchmarking was performed for research purposes only and the run results are non-compliant and not comparable with any published results.

NUMA-Q is a registered trademark of International Business Machines Corporation in the United States and/or other countries.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.


Integrating DMA Into the Generic Device Model

James E.J. Bottomley
SteelEye Technology, Inc.

http://www.steeleye.com


Abstract

This paper will introduce the new DMA API for the generic device model, illustrating how it works and explaining the enhancements over the previous DMA Mapping API. In a later section we will explain (using illustrations from the PA-RISC platform) how conversion to the new API may be achieved hand in hand with a complete implementation of the generic device API for that platform.

1 Introduction

Back in 2001, a group of people working on non-x86 architectures first began discussing radical changes to the way device drivers make use of DMA. The essence of the proposal was to mandate a new DMA Mapping API[1] which would be portable to all architectures then supported by Linux. One of the forces driving the adoption of this new API was the fact that the PCI bus had expanded beyond the x86 architecture and was being embraced by non-x86 hardware manufacturers. Thus, one of the goals was that any driver using the DMA Mapping API should work on any PCI bus independent of the underlying microprocessor architecture. Therefore, the API was phrased entirely in terms of the PCI bus, since PCI driver compatibility across architectures was viewed as a desirable end result.

1.1 Legacy Issues

One of the issues left unaddressed by the DMA Mapping API was that of legacy buses: most non-x86 architectures had developed other bus types prior to the adoption of PCI (e.g. sbus for the sparc; lasi and gsc bus for PA-RISC) which were usually still present in PCI based machines. Further, there were other buses that migrated across architectures prior to PCI, the most prominent being EISA. Finally, some manufacturers of I/O chips designed them not to be bus based (the LSI 53c7xx series of SCSI chips being a good example). These chips made an appearance in an astonishing variety of cards with an equal variety of bus interconnects.

The major headache for people who write drivers for non-PCI or multiple bus devices is that there was no standard for non-PCI based DMA, even though many of the problems encountered were addressed by the DMA Mapping API. This gave rise to a whole hotchpotch of solutions that differed from architecture to architecture: on Sparc, the DMA Mapping API has a completely equivalent SBUS API; on PA-RISC, one may obtain a "fake" PCI object for a device residing on a non-PCI bus, which may be passed straight into the PCI based DMA API.


1.2 The Solution

The solution was to re-implement the DMA Mapping API to be non bus specific. This goal was vastly facilitated by the new generic device architecture[5], which was also being implemented in the 2.5 Linux kernel and which finally permitted the complete description of device and bus interconnections using a generic template.

1.3 Why A New API

After all, apart from legacy buses, PCI is the one bus to replace all others, right? So an API based on it must be universally applicable?

This is incorrect on two counts. Firstly, support for legacy devices and buses is important to Linux, since being able to boot on older hardware that may have no further use encourages others who would not otherwise try Linux to play with it; and secondly, there are other new non-PCI buses, support for which is currently being implemented (like USB and FireWire).

1.4 Layout

This paper will describe the problems caused by CPU caches in section 2, move on to introduce the new struct device based DMA API[2] in section 3 and describe how it solves the problems, and finally in section 4 describe how the new API may be implemented by platform maintainers, giving specific examples from the PA-RISC conversion to the new API.

2 Problems Caused By DMA

The definition of DMA: Direct Memory Access means exactly that: direct access to memory (without the aid of the CPU) by a device transferring data. Although the concept sounds simple, it is fraught with problems induced by the way a CPU interacts with memory.

2.1 Virtual Address Translation

Almost every complex CPU designed for modern Operating Systems does some form of Virtual Address Translation. This translation, which is usually done inside the CPU, means that every task running on that CPU may utilise memory as though it were the only such task. The CPU transparently assigns each task overlapping memory in virtual space, but quietly maps it to unique locations in the physical address space (the memory address appearing on the bus) using an integral component called an MMU (Memory Management Unit).

Unfortunately for driver writers, since DMA transfers occur without CPU intervention, when a device transfers data directly to or from memory, it must use the physical memory address (because the CPU isn't available to translate any virtual addresses). This problem isn't new, and was solved on the x86 by using functions which performed the same lookups as the CPU's MMU and could translate virtual to physical addresses and vice versa, so that the driver could give the correct physical addresses to the hardware and interpret any addresses returned by the hardware device back into the CPU's virtual space.
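
The helpers alluded to are essentially the old virt_to_bus()/bus_to_virt() style functions; a sketch of that legacy usage (which presumes a 1:1 mapping and no IOMMU):

#include <asm/io.h>

static unsigned long legacy_dma_addr(void *buf)
{
        return virt_to_bus(buf);        /* address to hand to the device */
}

static void *legacy_cpu_addr(unsigned long busaddr)
{
        return bus_to_virt(busaddr);    /* interpret a device-supplied address */
}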

However, the problems don’t end there. Withthe advent of 64 bit chips it became apparentthat they would still have to provide support forthe older 32 bit (and even 24 bit) I/O buses fora while. Rather than cause inconvenience todriver writers by arbitrarily limiting the physi-cal memory addresses to which DMA transferscould be done, some of the platform manufac-turers came up with another solution: Add anadditional MMU between the I/O buses and theprocessor buses (This MMU is usually calledthe IOMMU). Now the physical address lim-

Page 65: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 65

itation of the older 32 bit buses can be hid-den because the IOMMU can be programmedto map the physical address space of the busto anywhere in thephysical (not virtual)memory of the platform. The disadvantageto this approach is that now this bus physicalto memory physical address mapping must beprogrammed into the IOMMU and must alsobe managed by the device driver.

There may even be multiple IOMMUs in the system, so a physical address (mapped by a particular IOMMU) given to a device might not even be unique (another device may use the same bus address via a different IOMMU).

2.2 Caching

On most modern architectures, the speed of the processor vastly exceeds the speed of the available memory (or, rather, it would be phenomenally expensive to use memory matched to the speed of the CPU). Thus, almost all CPUs come equipped with a cache (called the Level 1 [L1] cache). In addition, they usually expose logic to drive a larger external cache (called the Level 2 [L2] cache).

In order to simplify cache management, most caches operate at a minimum size called the cache width.¹ All reads and writes from main memory to the cache must occur in integer multiples of the cache width. A common value for the cache width is sixteen bytes; however, higher (and sometimes, for embedded processors, lower) values are also known.

The effect of the processor cache on DMA can be extremely subtle. For example, consider a hypothetical processor with a cache width of sixteen bytes. Referring to Figure 1, supposing I read a byte of data at address 0x18.

¹Also called the "cache line size" in the PCI specification.

Figure 1: Incorrect data read because of cache effects

Because of the cache burst requirement, this will bring the address range 0x10 to 0x1f into the cache. Thus, any subsequent read of, say, 0x11 will be satisfied from the cache without any reference to main memory. Unfortunately, if I am reading from 0x11 because I programmed a device to deposit data there via DMA, the value that I read will not be the value that the device placed in main memory, because the CPU believes the data in the cache to be still current. Thus I read incorrect data.

Worse, referring to Figure 2, supposing I have designated the region 0x00 to 0x17 for DMA from a device, but then I write a byte of data to 0x19. The CPU will probably modify the data in cache and mark the cache line 0x10 to 0x1f dirty, but not write its contents to main memory. Now, supposing the device writes data into 0x00 to 0x17 by DMA (the cache still is not aware of this). However, subsequently the CPU decides to flush the dirty cache line from 0x10 to 0x1f.


[Figure 2: Data destruction by interfering device and cache line writes. The diagram shows the CPU cache and main memory over a time line: the CPU writes data at 0x18 (dirtying the cache line), the device does DMA of new data to main memory, and the CPU then flushes the dirty cache line, overwriting part of the new data.]

that was placed at 0x10 to 0x17 by the DMA from the device.

The above is only illustrative of some of the problems. There are obviously many other scenarios where cache interference effects may corrupt DMA data transfers.

2.3 Cache Coherency

In order to avoid the catastrophic consequences of caching on DMA data, certain processors exhibit a property called "coherency." This means that they take action to ensure that data in the CPU cache and data in main memory is identical (usually this is done by snooping DMA transactions on the memory bus and taking corrective cache action before a problem is caused).

Even if a CPU isn't fully coherent, it can usually designate ranges of memory to be coherent (in the simplest case by marking the memory as uncacheable by the CPU). Such architectures are called "partially coherent."

Finally, there is a tiny subset of CPUs that cannot be made coherent by any means, and thus the driver itself must manage the cache so as to avoid the unfortunate problems described in section 2.2.

2.4 Cache Management Instructions

Every CPU that is not fully coherent includes cache management instructions in its repertoire. Although the actual instruction format varies, they usually operate at the level of the cache line and they usually perform one of three operations:

1. Writeback (or flush): causes the cache line to be synced back to main memory (however, the data in the cache remains valid).

2. Invalidate: causes the cache line to be eliminated from the cache (often for a dirty cache line this will cause the contents to be erased). When the CPU next references data in that cache line it will be brought in fresh from main memory.

3. Writeback and Invalidate: causes an atomic sequence of Writeback followed by Invalidate (on some architectures, this is the only form of cache manipulation instruction implemented).

When DMA is done to or from memory that is not coherent, the above instructions must be used to avoid problems with the CPU cache. The DMA API contains an abstraction which facilitates this.


2.5 Other Caches

Although the DMA APIs (both the PCI and generic device ones) are concerned exclusively with the CPU cache, there may be other caches in the I/O system which a driver may need to manage. The most obvious one is on the I/O bus: buses, like PCI, may possess a cache which they use to consolidate incoming writes. Such behaviour is termed "posting." Since there are no cache management instructions that can be used to control the PCI cache (the cache management instructions only work with the CPU cache), the PCI cache employs special posting rules to allow driver writers to control its behaviour. The rules are essentially:

1. Caching will not occur on I/O mapped spaces.

2. The sequence of reads and writes to the memory mapped space will be preserved (although the cache may consolidate write operations).

One important point to note is that there is no defined time limit to hold writes in the bus cache, so if you want a write to be seen by the device it must be followed by a read.

Suffice it to say that the DMA API does not address PCI posting in any form. The basic reason is that in order to flush a write from the cache, a read of somewhere in the device's memory space would have to be done, but only the device (and its driver) know what areas are safe to read.
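For illustration, the usual driver idiom for forcing out a posted memory-mapped write is to read a harmless register back. In this minimal sketch the register offsets (CONTROL, STATUS), the value START_DMA and the regs pointer are all hypothetical; only the driver knows which register is safe to read:

writel(START_DMA, regs + CONTROL);   /* this write may sit in the bus cache */
(void) readl(regs + STATUS);         /* the read forces the posted write out */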

3 The New DMA API for Driver Writers

By "driver writer", we mean any person who wishes to use the API to manage DMA coherency without wanting to worry about the underlying bus (and architecture) implementation.

This part of the API and its relation to the PCI DMA Mapping API is fairly well described in [2] and also in other published articles [3].

3.1 DMA Masks, Bouncing and IOMMUs

Every device has a particular attachment to a bus. For most, this manifests itself in the number of address lines on the bus that the device is connected to. For example, old ISA type buses were only connected (directly) to the first twenty four address lines. Thus they could only directly access the first sixteen megabytes of memory. If you needed to do I/O to an address greater than this, the transfer had to go via a bounce buffer: e.g. data was received into a buffer guaranteed to be lower than sixteen megabytes and then copied into the correct place. This distinction is the reason why Linux reserves a special memory zone for legacy (24 bit) DMA, which may be accessed using the flag (GFP_DMA) or'd into the allocation flags.

Similarly, a 32 bit PCI bus can only access up to the first four gigabytes of main memory (and even on a 64 bit PCI bus, the device may be only physically connected to the first 32 or 24 address lines).

Thus, the concept of a dma mask is used to convey this information. The API for setting the dma mask is:

int dma_set_mask(struct device *dev, u64 mask)

• dev —a pointer to the generic device.

• mask—a representation of the bus connection. It is a bitmap where a 1 means the line is connected and a zero means it isn't. Thus, if the device is only connected to the first 24 lines, the mask will be 0xffffff.

• returns true if the bus accepted the mask.

Note also that if you are driving a device capable of addressing up to 64 bits, you must also be aware that the bus it is attached to may not support this (i.e. you may be a 64 bit PCI card in a 32 bit slot). So when setting the DMA mask, you must start with the value you want but be prepared that you may get a failure because it isn't supported by the bus. For 64 bit devices, the convention is to try 64 bits first but, if that fails, set the mask to 32 bits.
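As a minimal sketch of that convention (mydev is a hypothetical struct device pointer, the error path is illustrative, and the return-value convention follows the description above, where a true return means the bus accepted the mask):

u64 mask = 0xffffffffffffffffULL;       /* try all 64 address lines first */

if (!dma_set_mask(mydev, mask)) {
        /* the bus (or slot) cannot address 64 bits; fall back to 32 */
        mask = 0xffffffffULL;
        if (!dma_set_mask(mydev, mask))
                goto probe_failed;      /* hypothetical error path */
}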

Once you have set the DMA mask, the troubles aren't ended. If you need transfers to or from memory outside of your DMA mask to be bounced, you must tell the block² layer using the

void blk_queue_bounce_limit(request_queue_t *q, u64 mask)

• q—the request queue your driver is connected to.

• mask—the mask (again 1s for connected addresses) that an I/O transfer will be bounced if it falls outside of (i.e. address & mask != address)

function that it should take care of the bouncing. Note, however, that bouncing only needs to occur for buses without an IOMMU. For buses with an IOMMU, the mask serves only as an indication to the IOMMU of what the range of physical addresses available to the device is. The IOMMU is assumed to be able to address the full width of the memory bus and therefore transfers to the device need not be bounced by the block layer.

²Character and Network devices have their own ways of doing bouncing, but we will consider only the block layer in the following.

Thus, the driver writer must know whether the bus is mapped through an IOMMU or not. The means for doing this is to test the global PCI_DMA_BUS_IS_PHYS macro. If it is true, the system generally has no IOMMU³ and you should feed the mask into the block bounce limit. If it is false, then you should supply BLK_BOUNCE_ANY (informing the block layer that no bouncing is required).
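Put together, a sketch of this decision (with q being the driver's request queue and mask the value just accepted by dma_set_mask()):

if (PCI_DMA_BUS_IS_PHYS)
        blk_queue_bounce_limit(q, mask);           /* no IOMMU: bounce out-of-range transfers */
else
        blk_queue_bounce_limit(q, BLK_BOUNCE_ANY); /* IOMMU present: no bouncing needed */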

3.2 Managing block layer DMA transfers

It is the responsibility of the device driver writer to manage the coherency problems in the CPU cache when transferring data to or from a device and also mapping between the CPU virtual address and the device physical address (including programming the IOMMU if such is required).

By and large, most block devices are simply transports: they move data from user applications to and from storage without much concern for the actual contents of the data. Thus they can generally rely on the CPU cache management implicit in the APIs for DMA setup and tear-down. They only need to use explicit cache management operations if they actually wish to access the data they are transferring. The use of device private areas for status and messaging is covered in sections 3.5 and 3.6.

³This is an inherent weakness of the macro. Obviously, it is possible to build a system where some buses go via an IOMMU and some do not. In a future revision of the DMA API, this may be made a device specific macro.

⁴Physically contiguous regions of memory for DMA can be obtained from kmalloc() and __get_free_pages(). They may specifically not be allocated on the stack (because data destruction may be caused by overlapping cache lines, see section 2.2) or vmalloc().

For setup of a single physically contiguous⁴ DMA region, the function is

dma_addr_t dma_map_single(struct device *dev, void *ptr, size_t size, enum dma_data_direction direction)

• ptr—pointer to the physically contiguous data (virtual address)

• size—the size of the physically contiguous region

• direction—the direction, either to the device, from the device or bidirectional (see section 3.4 for a complete description)

• returns a bus physical address which may be passed to the device as the location for the transfer

and the corresponding tear-down after the transfer is complete is achieved via

void dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, enum dma_data_direction direction)

• dma_addr—the physical address returned by the mapping setup function

• size, direction—the exact values passed into the corresponding mapping setup function

The setup and tear-down functions also take care of all the necessary cache flushes associated with the DMA transaction.⁵

⁵They use the direction parameter to get this right, so be careful when assigning directions to ensure that they are correct.
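As a minimal usage sketch (dev, buf and BUF_SIZE are hypothetical, and buf must come from kmalloc() or __get_free_pages(), not the stack or vmalloc()):

dma_addr_t busaddr;

busaddr = dma_map_single(dev, buf, BUF_SIZE, DMA_TO_DEVICE);
/* ... hand busaddr to the device and wait for the transfer to finish ... */
dma_unmap_single(dev, busaddr, BUF_SIZE, DMA_TO_DEVICE);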

3.3 Scatter-Gather Transfers

By and large, almost any transfer that crosses a page boundary will not be contiguous in physical memory space (because each contiguous page in virtual memory may be mapped to a non-contiguous page in physical memory) and thus may not be mapped using the API of section 3.2. However, the block layer can construct a list of each separate page and length in the transfer. Such a list is called a Scatter-Gather (SG) list. The device driver writer must map each element of the block layer's SG list into a device physical address.

The API to set up an SG transfer for a given struct request *req is

int blk_rq_map_sg(request_queue_t *q, struct request *req, struct scatterlist *sg)

• q—the queue the request belongs to

• sg—a pointer to a pre-allocated physical scatterlist which must be at least req->nr_phys_segments in size.

• returns the number of entries in sg which were actually used

Once this is done, the SG list may be mapped for use by the device:

int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction)

• sg—a pointer to a physical scatterlist which was filled in by blk_rq_map_sg()


• nents—the allocated size of the SG list.

• direction—as per the API in section 3.2

• returns the number of entries of the sg list actually used. This value must be less than or equal to nents but is otherwise not constrained (if the system has an IOMMU, it may choose to do all SG inside the IOMMU mappings and thus always return just a single entry).

• returns zero if the mapping failed.

Once you have mapped the SG list, you may loop over the number of entries using the following macros to extract the bus physical addresses and lengths

dma_addr_t sg_dma_address(struct scatterlist *sge)

• sge—pointer to the desired entry in the SG list

• returns the bus physical DMA address for the given SG entry

unsigned int sg_dma_len(struct scatterlist *sge)

• returns the length of the given SG entry

and program them into the device's SG hardware controller. Once the SG transfer has completed, it may be torn down with

void dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction)

• nents should be the number of entries passed in to dma_map_sg(), not the number of entries returned.
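Putting the pieces together, here is a sketch of a driver mapping a request and programming its device. The names q, req and dev, the program_sg_entry() helper and the MAX_SEGMENTS bound (assumed to be at least req->nr_phys_segments) are all hypothetical:

#define MAX_SEGMENTS 128        /* hypothetical upper bound on segments */

struct scatterlist sg[MAX_SEGMENTS];
int nents, count, i;

nents = blk_rq_map_sg(q, req, sg);
count = dma_map_sg(dev, sg, nents, DMA_FROM_DEVICE);
if (count == 0)
        goto map_failed;                 /* hypothetical error path */

for (i = 0; i < count; i++)
        program_sg_entry(dev, sg_dma_address(&sg[i]), sg_dma_len(&sg[i]));

/* ... after the transfer completes ... */
dma_unmap_sg(dev, sg, nents, DMA_FROM_DEVICE);   /* note: nents, not count */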

3.4 Accessing the Data Between Mapping and Unmapping

Since the cache coherency is normally managed by the mapping and unmapping API, you may not access the data between the map and unmap without first synchronizing the CPU caches. This is done using the DMA synchronization API. The first

void dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction)

• dma_handle—the physical address of the region obtained by the mapping function.

• size—the size of the region passed into the mapping function.

synchronizes only areas mapped by the dma_map_single API, and thus only works for physically contiguous areas of memory. The other

void dma_sync_sg(struct device *dev, struct scatterlist *sg, int nelems, enum dma_data_direction direction)

• The parameters should be identical to those passed in to dma_map_sg

synchronizes completely a given SG list (and is, therefore, rather an expensive operation). The correct point in the driver code to invoke these APIs depends on the direction parameter:


• DMA_TO_DEVICE—Usually flushes the CPU cache. Must be called after you last modify the data and before the device begins using it.

• DMA_FROM_DEVICE—Usually invalidates the CPU cache. Must be called after the device has finished transferring the data and before you first try to read it.

• DMA_BIDIRECTIONAL—Usually does a writeback/invalidate of the CPU cache. Must be called both after you finish writing it but before you hand the data to the device and after the device finishes with it but before you read it.
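For example, a driver that wants to inspect data the device has written into a region mapped with dma_map_single() (busaddr, buf and BUF_SIZE are hypothetical) would place the call after the device finishes and before reading:

dma_sync_single(dev, busaddr, BUF_SIZE, DMA_FROM_DEVICE);
/* the CPU cache is now consistent, so this read sees what the device wrote */
u32 status = *(u32 *)buf;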

3.5 API for coherent and partially coherent architectures

Most devices require a control structure (or a set of control structures) to facilitate communication between the device and its driver. For the driver and its device to operate correctly on an arbitrary platform, the driver would have to insert the correct cache flushing and invalidate instructions when exchanging data using this.

However, almost every modern platform has the ability to designate an area of memory as coherent between the processor and the I/O device. Using such a coherent area, the driver writer doesn't have to worry about synchronising the memory. The API for obtaining such an area is

void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, int flag)

• dev —a pointer to the generic device

• size —requested size of the area

• dma_handle—a pointer to where the physically usable address of the area will be placed (i.e. this should be the address given to the device for the area)

• flag—a memory allocation flag. Either GFP_KERNEL if the allocation may sleep while finding memory, or GFP_ATOMIC if the allocation may not sleep

• returns the virtual address of the area or NULL if no coherent memory could be allocated

Note that coherent memory may be a constrained system resource and thus NULL may be returned even for GFP_KERNEL allocations.

It should also be noted that the tricks platforms use to obtain coherent memory may be quite expensive, so it is better to minimize the allocation and freeing of these areas where possible.

Usually, device drivers allocate coherent memory at start of day, both for the above reason and so an in-flight transaction will not run into difficulties because the system is out of coherent memory.

The corresponding API for releasing the coherent memory is

void dma_free_coherent(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle)

• vaddr—the virtual address returned by dma_alloc_coherent

• dma_handle—the device physical address filled in at allocation time
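A minimal start-of-day sketch (struct my_ctrl and the error handling are hypothetical):

struct my_ctrl *ctrl;
dma_addr_t ctrl_bus;

ctrl = dma_alloc_coherent(dev, sizeof(*ctrl), &ctrl_bus, GFP_KERNEL);
if (!ctrl)
        goto no_memory;                 /* hypothetical error path */
/* give ctrl_bus to the device; the CPU uses ctrl directly, with no
 * explicit synchronization required */

/* ... at driver shutdown ... */
dma_free_coherent(dev, sizeof(*ctrl), ctrl, ctrl_bus);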


3.6 API for fully incoherent architectures

This part of the API has no correspondence with any piece of the old DMA Mapping API. Some platforms (fortunately usually only older ones) are incapable of producing any coherent memory at all. Even worse, drivers which may be required to operate on these platforms usually tend to have to also operate on platforms which can produce coherent memory (and which may operate more efficiently if it were used). In the old API, this meant it was necessary to try to allocate coherent memory, and if that failed allocate and map ordinary memory. If ordinary memory is used, the driver must remember that it also needs to enforce the sync points. This leads to driver code which looks like

memory = pci_alloc_coherent(...);
if (!memory) {
        dev->memory_is_not_coherent = 1;
        memory = kmalloc(...);
        if (!memory)
                goto fail;
        pci_map_single(...);
}

....

if (dev->memory_is_not_coherent)
        pci_dma_sync_single(...);

This code cannot be optimized away. The non-coherent allocation additions are designed to make it more efficient, and to be optimized away at compile time on platforms that can allocate coherent memory.

void *dma_alloc_noncoherent(struct device *dev, size_t size, dma_addr_t *dma_handle, int flag)

• the parameters are identical to those of dma_alloc_coherent in section 3.5.

The difference here is that the driver must use a special synchronization API⁶ to synchronize this area between data transfers

dma_cache_sync(void *vaddr, size_t size, enum dma_data_direction direction)

• vaddr—the virtual address of the memory to sync (this need not be at the beginning of the allocated region)

• size—the size of the region to sync (again, this may be less than the allocated size)

• direction—see section 3.4 for a discussion of how to use this.

Note that the placement of these synchronization points should be exactly as described in section 3.4. The platform implementation will choose whether coherent memory is actually returned. However, if coherent memory is returned, the implementation will take care of making sure the synchronizations become nops (on a fully coherent platform, the synchronizations will compile away to nothing).

Using this API, the above driver example becomes

memory = dma_alloc_noncoherent(...);
if (!memory)
        goto fail;

...

dma_cache_sync(...);

⁶The driver could also use the API of section 3.4, but the point of having a separate one is that it may be optimized away on platforms that are partially non-coherent.

The (possibly) non-coherent memory area is freed using


void dma_free_noncoherent(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle)

• The parameters are identical to those of dma_free_coherent in section 3.5

Since there are very few drivers that need to function on fully non-coherent platforms, this API is of little use in modern systems.

3.7 Other Extensions in the New API

There are two other extensions over the old DMA Mapping API. They are

int dma_get_cache_alignment(void)

• returns the cache alignment width of the platform (see section 2.2). Note, the value returned guarantees only to be a power of two and greater than or equal to the current processor cache width. Thus its value may be relied on to separate data variables where I/O caching effects would destroy data.

int dma_is_consistent(dma_addr_t dma_handle)

• returns true if the physical memory area at dma_handle is coherent

And

void dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, unsigned long offset, size_t size, enum dma_data_direction direction)

• offset—the offset from the dma_handle

• size—the size of the region to be synchronized

which allows a partial synchronization of a mapped region. This is useful because on most CPUs the cost of doing a synchronization is directly proportional to the size of the region. Using this API allows the synchronization to be restricted only to the necessary parts of the data.
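For example, a device that writes a small status area at a known offset inside a large mapped region lets the driver confine the (potentially expensive) synchronization to just that area; in this sketch region_handle, STATUS_OFFSET and STATUS_SIZE are hypothetical:

dma_sync_single_range(dev, region_handle, STATUS_OFFSET, STATUS_SIZE, DMA_FROM_DEVICE);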

4 Implementing the DMA API for a Platform

In this section we will explore how the DMA API should be implemented from a platform maintainer's point of view. Since implementation is highly platform specific, we will concentrate on how the implementation was done for the HP PA-RISC [4] platform.

4.1 Brief Overview of PA-RISC

[Figure 3: An abridged example of PA-RISC architecture. The diagram shows the CPU and Memory on the Runway bus, two U2/Uturn IOMMUs, and the Cujo, Dino, Wax and Lasi adapters leading to PCI64, PCI32, EISA and LASI buses over GSC/10 and GSC/8 links.]

An abridged version of a specific PA-RISC architecture⁷ is given in Figure 3. It is particularly instructive to note that the actual chips involved (Cujo, U2/Uturn, Wax etc.) vary from machine to machine, as do the physical bus connections, so the first step that was required for PA-RISC was to build a complete model of the device layout from the Runway bus on down in the generic device model [5].

⁷The illustration is actually from a C360 machine, and does not show every bus in the machine.

4.2 Converting to the Device Model

Since PA-RISC already had its own device type (struct parisc_device), it was fairly simple to embed a generic device in this, assign it to a new parisc_bus_type and build up the correct device structure. For brevity, instead of giving all the PA-RISC specific buses (like Lasi, GSC, etc.) their own bus type, they were all simply assigned to the parisc_bus_type.

The next problem was that of attaching the PCI bus. Previously, PCI buses could only have other PCI buses as parents, so a new API⁸ was introduced to allow parenting a PCI bus to an arbitrary generic device. At the same time, others were working on bringing the EISA bus under the generic device umbrella [6].

With all these core and architecture specific changes, PA-RISC now has a complete generic device model layout of all of its I/O components and is ready for full conversion to the new DMA API.

4.3 Converting to the DMA API

Previously for PA-RISC, the U2/Uturn (IOMMU) information was cached in the PCI device sysdata field and was placed there at bus scan time. Since the bus scanning was done from the PA-RISC specific code, it knew which IOMMU the bus was connected to. Unfortunately, this scheme doesn't work for any other bus type, so an API⁹ was introduced to obtain a fake PCI device for a given parisc_device and place the correct IOMMU in the sysdata field. PA-RISC actually has an architecture switch (see struct hppa_dma_ops in asm-parisc/dma-mapping.h) for the DMA functions. All the DMA functions really need to know is which IOMMU the device is connected to and the device's dma_mask, making conversion quite easy since the only PCI specific piece was extracting the IOMMU data.

⁸pci_scan_bus_parented()

⁹ccio_get_fake()

Obviously, in the generic device model, the field platform_data is the one that we can use for caching the IOMMU information. Unfortunately there is no code in any of the scanned buses to allow this to be populated at scan time. The alternative scheme we implemented was to take the generic device passed into the DMA operations switch and, if platform_data was NULL, walk up the parent fields until the IOMMU was found, at which point it was cached in the platform_data field. Since this will now work for every device that has a generic device (which is now every device in the PA-RISC system), the fake PCI device scheme can be eliminated, and we have a fully implemented DMA API.
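In outline, the parent walk looks something like the following sketch (is_iommu() and iommu_data() are hypothetical stand-ins for the PA-RISC specific tests, not the actual code):

static void *find_iommu(struct device *dev)
{
        struct device *p;

        if (dev->platform_data)
                return dev->platform_data;              /* already cached */

        for (p = dev->parent; p != NULL; p = p->parent) {
                if (is_iommu(p)) {
                        dev->platform_data = iommu_data(p);  /* cache it */
                        break;
                }
        }
        return dev->platform_data;
}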

4.4 The Last Wrinkle—Non-Coherency

There are particular PA-RISC chips (the PCX-S and PCX-T) which are incapable of allocating any coherent memory at all. Fortunately, none of these chips was placed into a modern system, or indeed into a system with an IOMMU, so all the buses are directly connected.

Thus, an extra pair of functions was added to the DMA API switch for this platform which implemented noncoherent allocations as a kmalloc() followed by a map, and also for the dma_cache_sync() API. On the platforms that are able to allocate coherent memory, the noncoherent allocator is simply the coherent one, and the cache sync API is a nop.

5 Future Directions

The new DMA API has been successful on platforms that need to unify the DMA view of disparate buses. However, the API as designed is really driver writer direct to platform. There are buses (USB being a prime example) which would like to place hooks to intercept the DMA transactions for programming DMA bridging devices that are bus specific rather than platform specific. Work still needs doing to integrate this need into the current framework.

Acknowledgements

I would like to thank Grant Grundler and Matthew Wilcox from the PA-RISC linux porting team for their help and support transitioning PA-Linux to the generic device model and subsequently the new DMA API. I would also like to thank HP for their donation of a PA-RISC C360 machine and the many people who contributed to the email thread [7] that began all of this.

References

[1] David S. Miller, Richard Henderson, and Jakub Jelinek. Dynamic DMA Mapping. Linux Kernel 2.5, Documentation/DMA-mapping.txt

[2] James E.J. Bottomley. Dynamic DMA mapping using the generic device. Linux Kernel 2.5, Documentation/DMA-API.txt

[3] Jonathan Corbet. Driver Porting: DMA Changes. Linux Weekly News, http://lwn.net/Articles/28092

[4] The PA-RISC linux team. http://www.parisc-linux.org

[5] Patrick Mochel. The (New) Linux Kernel Driver Model. Linux Kernel 2.5, Documentation/driver-model/*.txt

[6] Marc Zyngier. sysfs stuff for EISA bus. http://marc.theaimsgroup.com/?t=103696564400002

[7] James Bottomley. [RFC] generic device DMA implementation. http://marc.theaimsgroup.com/?t=103902433500007


Linux® Scalability for Large NUMA Systems

Ray Bryant and John Hawkes
Silicon Graphics, Inc.


Abstract

The SGI® Altix™ 3000 family of servers and superclusters are nonuniform memory access systems that support up to 64 Intel® Itanium® 2 processors and 512GB of main memory in a single Linux image. Altix is targeted to the high-performance computing (HPC) application domain. While this simplifies certain aspects of Linux scalability to such large processor counts, some unique problems have been overcome to reach the target of near-linear scalability of this system for HPC applications. In this paper we discuss the changes that were made to Linux® 2.4.19 during the porting process and the scalability issues that were encountered. In addition, we discuss our approach to scaling Linux to more than 64 processors and we describe the challenges that remain in that arena.

1 Introduction

Over the past three years, SGI has been working on a series of new high-performance computing systems based on its NUMAflex™ interconnection architecture. However, unlike the SGI® Origin® family of machines, which used a MIPS® processor and ran the SGI® IRIX® operating system, the new series of machines is based on the Intel® Itanium® Processor Family and runs an Itanium version of Linux.

In January 2003, SGI announced this series of machines, now known as the SGI Altix 3000 family of servers and superclusters. As announced, Altix supports up to 64 Intel Itanium 2 processors and 512GB of main memory in a single Linux image. The NUMAflex architecture actually supports up to 512 Itanium 2 processors in a single coherency domain; systems larger than 64 processors comprise multiple single-system image Linux systems (each with 64 processors or less) coupled via NUMAflex into a "supercluster." The resulting system can be programmed using a message-passing model; however, interactions between nodes of the supercluster occur at shared memory access times and latencies.

In this paper, we provide a brief overview of the Altix hardware and discuss the Linux changes that were necessary to support this system and to achieve good (near-linear) scalability for high-performance computing (HPC) applications on this system. We also discuss changes that improved scalability for more general workloads, including changes for high-performance I/O. Plans for submitting these changes to the Linux community for incorporation in standard Linux kernels will be discussed. While single-system-image Linux systems larger than 64 processors are not a configuration shipped by SGI, we have experimented with such systems inside SGI, and we will discuss the kernel changes necessary to port Linux to such large systems. Finally, we present benchmark results demonstrating the results of these changes.


2 The SGI Altix Hardware

An Altix system consists of a configurable number of rack-mounted units, each of which SGI refers to as a brick. Depending on configuration, a system may contain one or more of the following brick types: a compute brick (C-brick), a memory brick (M-brick), a router brick (R-brick), or an I/O brick (P-brick or PX-brick).

The basic building block of the Altix system is the C-brick (see Figure 1). A fully configured C-brick consists of two separate dual-processor systems, each of which is a bus-connected multiprocessor or node. The bus (referred to here as a Front Side Bus or FSB) connects the processors and the SHUB chip. Since HPC applications are often memory-bandwidth bound, SGI chose to package only two processors per FSB in order to keep FSB bandwidth from being a bottleneck in the system.

The SHUB is a proprietary ASIC that implements the following functions:

• It acts as a memory controller for the local memory on the node

• It provides an interface to the interconnection network

• It provides an interface to the I/O subsystem

• It manages the global cache coherency protocol

• It supports global TLB shoot-down and inter-processor interrupts (IPIs)

• It provides a globally synchronized high-resolution clock

[Figure 1: Altix C-Brick]

The memory modules of the C-brick consist of standard PC2100 or PC2700 DIMMS. With 1GB DIMMS, up to 16GB of memory can be installed on a node. For those applications requiring even more memory, an M-brick (a C-brick without processors) can be used.

Memory accesses in an Altix system are either local (i.e., the reference is to memory in the same node as the processor) or remote. Local memory references have lower latency; the Altix system is thus a NUMA (nonuniform memory access) system. The ratio of remote to local memory access times on an Altix system varies from 1.9 to 3.5 depending on the size of the system and the relative locations of the processor and memory module involved in the transfer.

As shown in Figure 1, each SHUB chip provides three external interfaces: two NUMAlink™ interfaces (labeled NL in the figure) to other nodes or the network and an I/O interface to the I/O bricks in the system. The SHUB chip uses the NUMAlink interfaces to send remote memory references to the appropriate node in the system. Depending on configuration, the NUMAlink interfaces provide up to 3.2GB/sec of bandwidth in each direction.

I/O bricks implement the I/O interface in the Altix system. (These can be either an IX-brick or a PX-brick. Here we will use the generic term I/O brick.) The bandwidth of the I/O channel is 1.2GB/sec in each direction. Each I/O brick can contain a number of standard PCI cards; depending on configuration, a small number of devices may be installed directly in the I/O brick as well.

Shared memory references to the I/O brick can be generated on any processor in the system. These memory references are forwarded across the network to the appropriate node's SHUB chip and then they are routed to the appropriate I/O brick. This is done in such a way that standard Linux device drivers will work against the PCI cards in an I/O brick, and these device drivers can run on any node in the system.

The cache-coherency policy in the Altix system can be divided into two levels: local and global. The local cache-coherency protocol is defined by the processors on the FSB and is used to maintain cache-coherency between the Itanium processors on the FSB. The global cache-coherency protocol is implemented by the SHUB chip. The global protocol is directory-based and is a refinement of the protocol originally developed for DASH [11] (some of these refinements are discussed in [10]).

The Altix system interconnection network uses routing bricks (R-bricks) to provide connectivity in system sizes larger than 16 processors. (Smaller systems can be built without routers by directly connecting the NUMAlink channels in a ring configuration.) For example, a 64-processor system is connected as shown in Figure 2.

One measure of the bandwidth capacity of the interconnection network is the bisection bandwidth. This bandwidth is defined as follows: draw an imaginary line through the center of the system. Suppose that each processor on one side of this line is referencing memory on a corresponding node on the other side of this line. The bisection bandwidth is the total amount of data that can be transferred across this line with all processors driven as hard as possible. In the Altix system, the bisection bandwidth of the system is at least 400MB/sec/processor for all system sizes.

[Figure 2: 64-CPU Altix System (R's represent router bricks)]

3 Benchmarks

Three kinds of benchmarks have been used in studying Altix performance and scalability:

• HPC benchmarks and applications

• AIM7

• Other open-source benchmarks

HPC benchmarks, such as STREAM [12], SPEC® CPU2000, or SPECrate® [22], are simple, usermode, memory-intensive benchmarks to verify that the hardware architectural design goals were achieved in terms of memory and I/O bandwidths and latencies. Such benchmarks typically execute one thread per CPU with each thread being autonomous and independent of the other threads so as to avoid interprocess communication bottlenecks. These benchmarks typically do not spend a significant amount of time using kernel services.

Nonetheless, such benchmarks do test the virtual memory subsystem of the Linux kernel. For example, virtual storage for dynamically allocated arrays in Fortran is allocated via mmap. When the arrays are touched during initialization, the page fault handler is invoked and zero-filled pages are allocated for the application. Poor scalability in the page-fault handling code will result in a startup bottleneck for these applications.

Once these benchmark tests had been passed, we then ran similar tests for specific HPC applications. These applications were selected based on a mixture of input from SGI marketing and an assessment of how difficult it would be to get the application working in a benchmark environment. Some of these benchmark results are presented in section 7.

The AIM Benchmark Suite VII (AIM7) is a benchmark that simulates a more general workload than that of a single HPC application. AIM7 is a C-language program that forks multiple processes (called tasks), each of which concurrently executes a similar, randomly ordered set of 53 different kinds of subtests (called jobs). Each subtest exercises a particular facet of system functionality, such as disk-file operations, process creation, user virtual memory operations, pipe I/O, or compute-bound arithmetic loops.

As the number of AIM7 tasks increases, aggregated throughput increases to a peak value. Thereafter, additional tasks produce increasing contention for kernel services, they encounter increasing bottlenecks, and the throughput declines. AIM7 has been an important benchmark to improve Altix scalability under general workloads that consist of a mixture of throughput-oriented runs or of programs that are I/O bound.

Other open-source benchmarks have also been employed. pgmeter [6] is a general file system I/O benchmark that we have used to evaluate file system performance [5]. Kernbench is a parallel make of the Linux kernel (e.g., /usr/bin/time make -j 64 vmlinux). Hackbench is a set of communicating threads that stresses the CPU scheduler. Erich Focht's randupdt stresses the scheduler's load-balancing algorithms and ability to keep threads executing near their local memory, and has been used to help tune the NUMA scheduler for Altix.

4 Measurement and Analysis Tools

A number of measurement tools and benchmarks have been used in our optimization of Linux kernel performance for Altix 3000. Among these are:

• Lockmeter [4, 17], a tool for measuring spinlock contention

• Kernprof [18], a kernel profiling tool that supports hierarchical profiling

• VTune™ [20], a profiling tool available from Intel

• pfmon [13, 15], a tool written by Stephane Eranian that uses the perfmon system calls of the Linux kernel for Itanium to interface with the Itanium processor performance measurement unit


5 Scaling and the Linux Kernel on Altix

"Perfect scaling"—a linear one-to-one relationship between CPU count and throughput for all CPU counts—is rarely achieved because one or more bottlenecks (software or hardware) introduce serial constraints into the otherwise independently parallel CPU execution streams. The common Linux bottlenecks - lock contention and cache-line contention - are true for uniform-memory-access multiprocessor systems as well as for NUMA multiprocessor systems and Altix 3000. However, the performance impact of these bottlenecks on Altix can be exaggerated (potentially in a nonlinear way) by the high processor counts and the directory-based global cache-coherency policy.

Analysis of lock contention has therefore been a key part of improving scalability of the Linux kernel for Altix. We have found and removed a number of different lock-contention bottlenecks whose impacts are significantly worse on Altix than they are on smaller, non-NUMA platforms. Specific examples of these changes are discussed below, in sections 5.2 through 5.8.

Cache-line contention is usually more subtle than lock contention, but cache-line contention can still be a significant impediment to scaling large configurations. These effects can be broadly classified as either "false cache-line sharing" or cache-line "ping-ponging." "False cache-line sharing" is the unintended co-residency of unrelated variables in the same cache-line. Cache-line "ping-ponging" is the change in exclusive ownership of a cache-line as different CPUs write to it.

An example of "false cache-line sharing" is when a single L3 cache-line contains a location that is frequently written and another location that is only being read. In this case, a read will commonly trigger an expensive cache-coherency operation to demote the cache-line from exclusive to shared state in the directory, when all that would otherwise be necessary would be adding the processor to the list of sharing nodes in the cache-line directory entry. Once identified, "false cache-line sharing" can often be remedied by isolating a frequently dirtied variable into its own cache-line.
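A sketch of that remedy in kernel C (the structure and field names are illustrative, not from the Altix patches):

struct example_stats {
        unsigned long limit;                       /* read-mostly */
        /* frequently written counter, aligned to start its own cache line */
        unsigned long hits ____cacheline_aligned;
};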

The performance measurement unit of the Itanium 2 processor includes events that allow one to sample cache misses and record precisely the data and instruction addresses associated with these sampled misses. We have used tools based on these events to find and remove false sharing in user applications, and we plan to repeat such experiments to find and remove false sharing in the Linux kernel.

Some cache-line contention and ping-ponging can be difficult to avoid. For example, a multiple-reader single-writer rwlock_t contains a contending variable: the read-lock count. Each read_lock() and read_unlock() request changes the count and dirties the cache-line containing the lock. Thus, an actively used rwlock_t is continually being ping-ponged from CPU to CPU in the system.

5.1 Linux changes for Altix

To get from a www.kernel.org Linux kernel to a Linux kernel for Altix, we apply the following open-source community patches:

• The IA-64 patch maintained by David Mossberger [14]

• The "discontiguous memory" patch originally developed as part of the Atlas project [1]

• The O(1) scheduler patch from Erich Focht [7]


• The LSE rollup patch for the Big Kernel Lock [19]

This open-source base is then modified to support the Altix platform-specific addressing model and I/O implementation. This set of patches and the changes specific to Altix comprise the largest set of differences between a standard Linux kernel and a Linux kernel for Altix.

Additional discretionary enhancements improve the user's access to the full power of the hardware architecture. One such enhancement is the CpuMemSets package of library and kernel changes that permit applications to control process placement and memory allocation. Because the Altix system is a NUMA machine, applications can obtain improved performance through use of local memory. Typically, this is achieved by pinning the process to a CPU. Storage allocated by that process (for example, to satisfy page faults) will then be allocated in local memory, using a first-touch storage allocation policy. By making optimal use of local memory, applications with large CPU counts can be made to scale without swamping the communications bandwidth of the Altix interconnection network.

Another enhancement is XSCSI, a new SCSI midlayer that provides higher throughput for the large Altix I/O configurations. Figure 3 shows a comparison of I/O bandwidths achieved at the device-driver level for SCSI and XSCSI on a prototype Altix system. XSCSI was a tactical addition to the Linux kernel for Altix in order to dramatically improve I/O bandwidth performance while still meeting product-development deadlines.

5.2 Big Kernel Lock

The classic Linux multiprocessor bottleneck has been the kernel_flag, commonly known as

[Figure 3: Comparison of SCSI and XSCSI on Prototype Altix Hardware. The plot shows raw I/O read bandwidth (MB/s) against request size (in 512 byte blocks) for the xscsi and scsi midlayers.]

the Big Kernel Lock (BKL). Early Linux MP implementations used the BKL as the primary synchronization and serialization lock. While this may not have been a significant problem for workloads on a 2-CPU system, a single gross-granularity spinlock is a principal scaling bottleneck for larger CPU counts.

For example, on a prototype Altix platform with 28 CPUs running a relatively unimproved 2.4.17 kernel and with Ext2 file systems, we found that with an AIM7 workload, 30% of the CPU cycles were consumed by waiting on the BKL, and another 25% waiting on the runqueue_lock. When we introduced a more efficient multiqueue scheduler that eliminated contention on the runqueue_lock, we discovered that the runqueue_lock contention simply became increased contention pressure on the BKL. This resulted in 70% of the system CPU cycles being spent waiting on the BKL, up from 30%, and a 30% drop in AIM7 peak throughput performance. The lesson learned here is that we should attack the biggest bottlenecks first, not the lesser bottlenecks.


We have attempted to solve the BKL problem using several techniques. One approach has been our preferential use of the XFS® file system, vs. Ext2 or Ext3, as the Altix file system. XFS uses scalable, fine-grained locking and largely avoids the use of the BKL altogether. Figure 4 shows a comparison of the relative scalability of several different Linux file systems under the AIM7 workload [5].

[Figure 4: File System Scalability of Linux® 2.4.17 on Prototype Altix Hardware. The plot shows AIM7 throughput (relative to Ext2 at 2 CPUs) against number of processors for Ext2, Ext3, ReiserFS, and XFS.]

Other changes were targeted to specific functionality, such as back porting the 2.5 algorithm for process accounting that uses a new spinlock instead of using the BKL.

The largest set of changes was derived from the LSE rollup patch [19]. The aggregate effect of these changes has reduced the fraction of cycles spent spinning (out of all CPU cycles) for the BKL lock from 50% to 5% when running the AIM7 benchmark, with 64 CPUs and XFS filesystems on 64 disks.

5.3 TLB Flush

For the Altix platforms, a global TLB flush first performs a local TLB flush using the Intel ptc.ga instruction. Then it writes TLB flush address and size information to special SHUB registers that trigger the remote hardware to perform a local TLB flush of the CPUs in that node. Several changes have been incorporated into the Linux kernel for Altix to reduce the impact of TLB flushing. First, care is taken to minimize the frequency of flushes by eliminating NULL entries in the mmu_gathers array and by taking advantage of the larger Itanium 2 page size that allows flushing of larger address ranges. Another reduction was accomplished by having the CPU scheduler remember the node residency history of each thread; this allows the TLB flush routine to flush only those remote nodes that have executed the thread.

5.4 Multiqueue CPU Scheduler

The CPU scheduler in the 2.4 (and earlier) kernels is simple and efficient for uniprocessor and small MP platforms, but it is inefficient for large CPU counts and large thread counts [3]. One scaling bottleneck is due to the use of a single global runqueue_lock spinlock to protect the global runqueue, thereby making it heavily contended.

A second inefficiency in the 2.4 scheduler is contention on cache-lines that are involved with managing the global runqueue list. The global list was frequently searched, and process priorities were continually being recomputed. The longer the runqueue list, the more CPU cycles were wasted.

Both of these problems are addressed by new Linux schedulers that partition the single global runqueue of the standard scheduler into multiple runqueues. Beginning with the Linux® 2.4.16 for Altix version, we began to use the Multiqueue Scheduler contributed to the Linux Scalability Effort (LSE) by Hubertus Franke, Mike Kravetz, and others at IBM [8]. This scheduler was replaced by the O(1) scheduler contributed to the 2.5 kernel by Ingo Molnar [16] and subsequently modified by Erich Focht of NEC and others. The Linux kernel for Altix now uses a version of the 2.5 scheduler with some "NUMA-aware" enhancements and other changes that provide additional performance improvements for typical Altix workloads.

What is relatively peculiar to Altix workloads is the commonplace utilization of the task cpus_allowed bitmap to constrain CPU residency. This is typically done to pin processes to specific CPUs or nodes in order to efficiently use local storage on a particular node. The result is that the Altix scheduler commonly encounters processes that cannot be moved to other CPUs for load-balancing purposes. The 2.5 version of the O(1) scheduler's load_balance() routine searches only the "busiest" CPU's runqueue to find a candidate process to migrate. If, however, the "busiest" CPU's runqueue is populated with processes that cannot be migrated, the 2.5 version of the O(1) scheduler does not then search other CPUs' runqueues to find additional migration candidates. What is done in the Linux kernel for Altix is that load_balance() builds a list of "busier" CPUs and searches all of them (in decreasing order of imbalance) to find candidate processes for migration.

A more subtle scaling problem occurs when multiple CPUs contend inside load_balance() on a busy CPU's runqueue lock. This is most often the case when idle CPUs are load-balancing. While it is true that lock contention among idle CPUs is often benign, the problem is that the "busiest" CPU's runqueue lock has now become highly contended, and that stalls any attempt to try_to_wake_up() a process on that queue or on the runqueue of any of the idle CPUs that are spinning on another runqueue lock. Therefore, it is beneficial to stagger the idle CPUs' calls to load_balance() to minimize this lock contention. We continue to experiment with additional techniques to reduce this contention.

At the time this paper was being written, we have begun to use an adaptation of the O(1) scheduler from Linux® 2.5.68 in the Linux kernel for Altix. To this version of the scheduler, we have added a set of "NUMA-aware" enhancements that were contributed by various developers and that have been aggregated into a roll-up patch by Martin Bligh [2]. Early performance tests show promising results. AIM7 aggregate CPU time is roughly 6% lower at 2500 tasks than without these "NUMA-aware" improvements. We will continue to track further changes in this area and to contribute SGI changes back to the community.

5.5 dcache_lock

In the 2.4.17 and earlier kernels, the dcache_lock was a modestly busy spinlock for AIM7-like workloads, typically consuming 3% to 5% of CPU cycles for a 32-CPU configuration. The 2.4.18 kernel, however, began to use this lock in dnotify_parent(), and the compounding effect of that additional usage made this lock a major CPU cycle consumer. We have solved this problem in the Linux kernel for Altix by back porting the 2.5 kernel's finer-grained dparent_lock strategy. This has returned the contention on the dcache_lock to acceptable levels.

5.6 lru_list_lock

One bottleneck in the VM system we have encountered is contention for the lru_list_lock. This bottleneck cannot be completely eliminated without a major rewrite of the Linux 2.4.19 VM system. The Linux kernel for Altix contains a minor, albeit measurably effective, optimization for this bottleneck in fsync_buffers_list(). Instead of releasing the lru_list_lock and immediately calling osync_buffers_list(), which reacquires it, fsync_buffers_list() keeps holding the lock and instead calls a new __osync_buffers_list(), which expects the lock to be held on entry. For a highly contended spinlock, it is often better to double the lock's hold-time than to release the lock and have to contend for ownership a second time. This particular change produced 2% to 3% improvement in AIM7 peak throughput.

5.7 xtime_lock

The xtime_lock read-write spinlock is a severe scaling bottleneck in the 2.4 kernel. A mere handful of concurrent user programs calling gettimeofday() can keep the spinlock's read-count perpetually nonzero, thereby starving the timer interrupt routine's attempts to acquire the lock in write mode and update the timer value. We eliminated this bottleneck by incorporating an open-source patch that converts the xtime_lock to a lockless read using frlock_t functionality (which is equivalent to seqlock_t in the 2.5 kernel).
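For illustration, the reader side of this pattern (shown here with the 2.5 seqlock_t interface, and with the copying of the protected time values abbreviated) retries instead of holding off the writer:

unsigned seq;
do {
        seq = read_seqbegin(&xtime_lock);
        /* ... copy the time values protected by the lock ... */
} while (read_seqretry(&xtime_lock, seq));   /* retry if a writer intervened */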

5.8 Node-Local Data Structures

We have reduced memory latencies to various per-CPU and per-node data structures by allocating them in node-local storage and by employing strided allocation to improve cache efficiency. Structures where this technique is used include struct cpuinfo_ia64, mmu_gathers, and struct runqueue.

6 Beyond 64 Processors

The Altix platform hardware architecture supports a cache-coherent domain of 512 CPUs. Although SGI currently supports a maximum of 64 processors in a single Linux image, we believe there is potential interest in SSI Linux systems that are larger than 64 CPUs (judging from our experience with IRIX systems, where some customers run 512-CPU SSI systems). Additionally, testing on systems that are larger than 64 CPUs can help us find scalability problems that are present, but not yet obvious, in smaller systems.

The Linux kernel currently defines a CPU bit mask as an unsigned long, which for an Itanium architecture provides enough bits to specify 64 CPUs. To support a CPU count that exceeds the bit count of an unsigned long requires that we define a cpumask_t type, declared as unsigned long[N], where N is large enough to provide sufficient bits to denote each CPU. While this is a simple kernel coding change, the change affects numerous files. Moreover, it also affects some calling sequences which today expect to pass a cpumask_t as a call-by-value argument or as a simple function return value. Most problematic is when cpumask_t is involved in a user-kernel interface, such as we have with the SGI CpuMemSets functions like runon and dplace. Our plan is to follow the approach of the 2.5 kernel in this area (c.f., sched_set/get_affinity).
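A sketch of the kind of type involved, sized here for 512 CPUs on a 64-bit platform (the definition eventually adopted in the community kernels differs in detail):

#define NR_CPUS 512
typedef unsigned long cpumask_t[NR_CPUS / 64];   /* N = 8 words on Itanium */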

Our initial 128-CPU investigations have so far not yielded any great surprises. The Altix hardware architecture scales memory access bandwidths to these larger configurations, and the longer memory access latencies are small (65 nanoseconds per router "hop" in the current systems) and well understood.

The large NR_CPU configurations benefit from anti-aliasing the per-node data and from dynamically allocating the struct cpuinfo_ia64 and mmu_gathers. Dynamic allocation of the runqueue_t elements reduces the static size of the kernel, which otherwise produces a "gp overflow" at link time. Inter-processor interrupts and remote SHUB references are made in physical addressing mode, thereby avoiding the use of TLB entries and reducing TLB pressure. Finally, several calls to __cli() were eliminated or converted into locks.

7 HPC Application Benchmark Results

In this section, we present some benchmark results for example applications in the HPC market segment.

7.1 STREAM

The STREAM Benchmark [12] is a "simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/sec) and the corresponding computation rate for simple vector kernels" [12]. Since many HPC applications are memory bound, higher numbers for this benchmark indicate potentially higher performance on HPC applications in general. The STREAM Benchmark has the following characteristics:

• It consists of simple loops

• It is embarrassingly parallel

• It is easy for the compiler to generate scalable code for this benchmark

• In general, only simple optimization levels are allowed

Here we report on what is called the "triad" portion of the benchmark. The parallel Fortran code for this kernel is shown below:

!$OMP PARALLEL DO

DO j = 1, n

a(j) = b(j) + s ∗c(j)

CONTINUE

To execute this code in parallel on an Al-tix system, the data is evenly divided amongthe nodes, and each processor is assigned todo the calculations on the portion of the dataon its node. Threads are pinned to proces-sors so that no thread migration occurs dur-ing the benchmark. The scalability results forthis benchmark on Altix 3000 are as shownin figure 5. As can be seen from this graph,

0

20

40

60

80

100

120

140

0 8 16 24 32 40 48 56 64

GB

/s

Number of Processors

Altix 3000 STREAM Triad Benchmark

LinearSGI Altix 3000

Figure 5:Scalability of STREAM TRIAD bench-mark on Altix 3000

the results scale almost linearly with proces-sor count. In some sense, this is not surpris-ing, given the NUMA architecture of the Altixsystem, since each processor is accessing datain local memory. One can argue that any non-shared memory cluster with a comparable pro-cessor could achieve a similar result. However,what is important to realize here is that not allmultiprocessor architectures can achieve sucha result. A typical uniform-memory-accessmultiprocessor system, for example, could notachieve such linear scalability because the in-terconnection network would become a bottle-neck. Similarly, while it is true that a non-

Page 86: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 86

shared memory cluster can achieve good scal-ability, it does so only if one excludes the datadistribution and collection phases of the pro-gram from the test. These times are trivial onan Altix system due to its shared memory ar-chitecture.

We have also run the STREAM benchmarkwithout thread pinning, both with the standard2.4.18 scheduler and the O(1) scheduler. TheO(1) scheduler produces a throughput resultthat is nearly six times better than the standardLinux scheduler. This demonstrates that thestandard Linux scheduler causes dramaticallymore thread migration and cache damage thanthe O(1) scheduler.

7.2 Star-CD™

Star-CD [21] is a fluid-flow analysis systemmarketed by Computational Dynamics, Ltd. Itis an example of a computational fluid dynam-ics code. This particular code uses a message-passing interface on top of the shared-memoryAltix hardware. This example uses an “A”Class Model, with 6 million cells. The resultsof the benchmark are shown in Figure 6.

7.3 Gaussian® 98

Gaussian [9] is a product of Gaussian, Inc.Gaussian is a computational chemistry systemthat calculates molecular properties based onfundamental quantum mechanics principles.The problem being solved in this case is a sam-ple problem from the Gaussian quality assur-ance test suite. This code uses a shared mem-ory programming model. The results of thebenchmark are shown in Figure 7.

8 Concluding Remarks

With the open-source changes discussed in thispaper, SGI has found Linux to be a good fit

0

10

20

30

40

50

60

0 8 16 24 32 40 48 56 64

Spee

dup

Number of Processors

Star−CD Scalability Altix 3000 Prototype

LinearSGI Altix 3000

Figure 6:Scalability of Star-CD Benchmark

0

5

10

15

20

25

30

0 8 16 24 32

Spee

dup

Number of Processors

Gaussian Sclability Altix 3000 Prototype

LinearSGI Altix 3000

Figure 7:Scalability of Gaussian Benchmark

Page 87: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 87

for HPC applications. In particular, changes inthe Linux scheduler, use of the XFS file sys-tem, use of the XSCSI implementation, andnumerous small scalability fixes have signifi-cantly improved scaling of Linux for the Altixplatform. The combination of a relatively lowremote-to-local memory access time ratio andthe high bandwidth provided by the Altix hard-ware are also key reasons that we have beenable to achieve good scalability using Linux onthe Altix system.

Relatively speaking, the total set of changes inLinux for the Altix system is small, and mostof the changes have been obtained from theopen-source Linux community. SGI is com-mitted to working with the Linux communityto ensure that Linux performance and scalabil-ity continues to improve as a viable and com-petitive operating system for real-world envi-ronments. Many of the changes discussed inthis paper have already been submitted to theopen-source community for inclusion in com-munity maintained software. Versions of thissoftware are also available atoss.sgi.com .

We anticipate future scalability efforts in mul-tiple directions. One is an ongoing reductionof lock contention, as best we can accomplishwith the 2.4 source base. We have work inprogress on CPU scheduler improvements, re-duction of thepagecache_lockcontention, andreductions in other I/O-related locks.

Another class of work will be analysis of cacheand TLB activity inside the kernel, which willpresumably generate patches to reduce over-head associated with poor cache and TLB us-age.

Large-scale I/O is another area of ongoing fo-cus, including improving aggregate bandwidth,increasing transaction rates, and spreading in-terrupt handlers across multiple CPUs.

References

[1] sourceforge.net/projects/discontig

[2] www.kernel.org/pub/linux/kernel/people/mbligh/

[3] Ray Bryant and Bill Hartner, Javatechnology, threads, and scheduling inLinux, IBM Developerworkswww-106.ibm.com/developerworks/library/j-java2/index.html

[4] Ray Bryant and John Hawkes,Lockmeter: Highly-InformativeInstrumentation for Spin Locks in theLinux Kernel,Proceedings of the FourthAnnual Linux Showcase & Conference,Atlanta, Ga. (2000),oss.sgi.com/projects/lockmeter/als2000/als2000lock.html

[5] Ray Bryant, Ruth Forester, and JohnHawkes, Filesystem Performance andScalability in Linux 2.4.17,Proceedingsof the Freenix Track of the 2002 UsenixAnnual Technical Conference, Montery,Ca., (June 2002).

[6] Ray Bryant, David Raddatz, and RogerSunshine, PenguinoMeter: A newFile-I/O Benchmark for Linux,Proceedings of the 5th Annual LinuxShowcase and Conference, Oakland, Ca.(October 2001).

[7] home.arcor.de/efocht/sched

[8] M. Kravetz, H. Franke, S. Nagar, and R.Ravindran, Enhancing Linux SchedulerScalability,Proceedings of the OttawaLinux Symposium, Ottawa, CA, July2001.

[9] Gaussian, Inc., Carnegie Office Park,Building 6, Suite 230, Carnegie, PA15106 USA.www.guassian.com

Page 88: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 88

[10] James Laudon and Daniel Lenoski, TheSGI Origin: a ccNUMA Highly ScalableServer,ACM SIGARCH ComputerArchitecture News, Volume 25, Issue 2,(May 1997), pp. 241-251.

[11] Daniel Lenoski, James Laudon, TrumanJoe, David Nakahira, Luis Stevens,Anoop Gupta, and John Hennesy, TheDASH prototype: Logic overhead andperformance,IEEE Transacctions onParallel and Distributed Systems,4(1):41-61, January 1993.

[12] John D. McCalpin, STREAM:Sustainable Memory Bandwidth in HighPerformance Computers,www.cs.virginia.edu/stream

[13] David Mosberger and Stephane Eranian,IA-64 Linux Kernel, Design andImplementation, Prentice-Hall (2002),ISBN 0-13-061014-3, pp. 405-406.

[14] www.kernel.org/pub/linux/kernel/ports/ia64

[15] www.hpl.hp.com/research/linux/perfmon/ .

[16] www.ussg.iu.edu/hypermail/linux/kernel/0201.0/0810.html

[17] oss.sgi.com/projects/lockmeter

[18] oss.sgi.com/projects/kernprof

[19] lse.sourceforge.net/lockhier/bkl_rollup.html

[20] www.intel.com/software/products/vtune

[21] Star-CD,www.cd-adapco.com .

[22] www.spec.org

© 2003 Silicon Graphics, Inc. Permission to re-distribute in accordance with Ottawa Linux Sym-posium submission guidelines is granted; all otherrights reserved. Silicon Graphics, SGI, IRIX, Ori-gin, XFS, and the SGI logo are registered trade-marks and Altix, NUMAflex, and NUMAlink aretrademarks of Silicon Graphics, Inc., in the UnitedStates and/or other countries worldwide. Linux isa registered trademark of Linus Torvalds. MIPSis a registered trademark of MIPS Technologies,Inc., used under license by Silicon Graphics, Inc.,in the United States and/or other countries world-wide. Intel and Itanium are registered trademarksand VTune is a trademark of Intel Corporation. Allother trademarks mentioned herein are the propertyof their respective owners. (05/03)

Page 89: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

An Implementation of HIP for Linux

Miika Komu∗

[email protected]

Mika Kousa*[email protected]

Janne Lundberg†

[email protected]

Catharina Candolin‡

[email protected]

Abstract

One of the main problems with IP has been itslack of security. Although IPSec and DNSSechave provided some level of security to IP, thenotion of a true identity for hosts is still miss-ing. Typically, the IP address of the host hasbeen used as the host identity, regardless ofthe fact that it is nothing more than routing in-formation. The purpose of the Host IdentityPayload/Protocol (HIP) architecture is to add acryptographically based name space, the HostIdentity, to the IP protocol. The Host Identityserves as the identity of the host, whereas theIP address is merely used for routing purposes.In this paper, we describe the HIP architecturefurther, and present our IPv6 based implemen-tation of HIP for Linux.

1 Introduction

The lack of security has been one of the mainproblems with IP. Although IPSec [8] andDNSSec [14] have provided some level of se-curity to IP, such as data origin authentication,

∗Helsinki Institute for Information Technology, P.O.Box 9800, FIN-02015 HUT, Finland

†Laboratory for Theoretical Computer Science,Helsinki University of Technology, P.O. Box 9205, FIN-02015 HUT, Finland

‡Laboratory for Theoretical Computer Science,Helsinki University of Technology, P.O. Box 5400, FIN-02015 HUT, Finland

confidentiality, integrity, and so forth, the no-tion of a true identity for hosts is still miss-ing. The IP address has typically been usedboth to identify the host and to provide rout-ing information. This has led to the misuseof IP addresses for identification purposes inmany security schemes. To overcome the prob-lems related to the current use of IP addresses,the Host Identity Payload/Protocol (HIP) archi-tecture adds a cryptographically based namespace, the Host Identity, to the IP protocol.Each host (or more specifically, its network-ing kernel or stack) is assigned at least oneHost Identity, which can be either public oranonymous. The Host Identity can be usedfor authentication purposes to support trustbetween systems, enhance mobility and dy-namic IP renumbering, aid in protocol transla-tion/transition and reduce denial-of-service at-tacks. Furthermore, as all of the higher proto-cols are bound to the Host Identity instead ofthe IP address, the IP address can now be usedsolely for routing purposes.

In this paper, we describe the HIP architectureand present our IPv6 [6] based implementationof HIP for Linux. The rest of the paper is struc-tured as follows: in Section 2, the HIP architec-ture and the Host Layer Protocol is described.Section 3 describes our implementation, andSection 4 concludes the paper.

Page 90: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 90

2 The HIP architecture

There are two name spaces in use in the In-ternet today: IP addresses and domain names.IP addresses have been used both to identifythe network interface of the host and the rout-ing direction vector. The three main prob-lems with the current name spaces are thatdynamic readdressing cannot be directly man-aged, anonymity is not provided in a consis-tent and trustable manner, and authenticationfor systems and datagrams is not provided.

In [10][11][12], the HIP architecture is intro-duced. HIP introduces a new cryptographicallybased name space, the Host Identity (HI), andadds a Host Layer between the network and thetransport layer in the IP stack.

The modification to the IP stack is depicted inFigure 1. In the current architecture, each pro-cess is identified by a process ID (PID). Theprocess may establish transport layer connec-tions to other hosts (or to the host itself), andthe transport layer connection is then identi-fied using the source and destination IP ad-dresses as well as the source and destinationports. On the IP layer, the IP address is used asthe endpoint identifier, and on the MAC layer,the hardware address is used. In HIP, the trans-port layer is modified so that the connectionsare identified using the source and destinationHIs as well as the source and destination ports.HIP then provides a binding between the HIsand the IP addresses, e.g. using DNS [9].

The HI is typically a cryptographic public key,which serves as the endpoint identifier of thenode. Each host will have at least one HI as-signed to its networking kernel or stack. TheHI can be either public or anonymous. PublicHIs may be stored in directories, such as DNS,in order to allow the host to be contacted byother hosts. A host may have several HIs, andit may also generate temporary (anonymous)

TRANSPORTLAYER

NETWORKLAYER

ACCESS MEDIUM

TRANSPORTLAYER

HOST LAYER

NETWORKLAYER

ACCESS MEDIUM

ProcessID

(IP, port)pairs

ProcessIDAPPLICATION APPLICATION

IP address

MAC address

(HI, port)pairs

HI

MAC address

IP address

Figure 1: The current IP stack and the HIPbased stack

HIs on the fly for establishing connections toother hosts. The main purpose of anonymousHIs is to provide privacy protection to the host,should the host not wish to use its public HI(s).

The HI is never directly used in any Internetprotocol. It is stored in a repository, and ispassed in HIP. Protocols use a 128-bit HostIdentity Tag (HIT), which is a hash of the HI.Another representation of the HI is the LocalScope Identity (LSI), which has a size of 32bits, but is local to the host. Its main purposeis to support backwards compatibility with theIPv4 API.

The main advantages of using HIT in proto-cols instead of the HI is that its fixed lengthmakes protocol coding easier and also does notadd as much overhead to the data packets as apublic key would. It also presents a consistentformat to the protocol regardless of the under-lying identity technology used. HIT functionsmuch like the SPI does in IPSec, but instead ofbeing an arbitrary 32-bit value that identifiesthe Security Association for a datagram (to-gether with the destination IP address and se-curity protocol), HIT identifies the public key

Page 91: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 91

that can validate the packet authentication.

The probability that a collision will occur is ex-tremely small. However, should there be twopublic keys for one HIT, the HIT acts as a hintfor the correct public key to use.

The HIP architecture basically solves the prob-lems of dynamic readdressing, anonymity, andauthentication. As the IP address no longerfunctions as an endpoint identifier, the prob-lem of mobility becomes trivial, as the nodemay easily change its HI and IP address bind-ings as it moves. Anonymity is provided bytemporary and anonymous HIs. Furthermore,as the name space is cryptographically based,it becomes possible to perform authenticationbased on the HIs. In [13], the concept of in-tegrating security, mobility, and multi-homingbased on HIP is discussed further.

2.1 The Host Layer Protocol

The Host Layer Protocol (HLP) is a signal-ing protocol between the communicating end-points. The main purpose of the protocol is toperform mutual end-to-end authentication andto create IPSec ESP [7] Security Associationsto be used for integrity protection and possiblyalso encryption. Furthermore, the protocol per-forms reachability verification using a simplechallenge-response scheme.

The HLP protocol provides seven messagetypes, of which four are dedicated to the baseexchange. In Figure 2, the base exchange is de-picted. In the first message,I1, the initiatorIsends its own HIT and the HIT of the responderto the responder. The responderR replies withmessageR1, which contains the HITs ofI anditself as well as a puzzle based challenge forI to solve. The purpose of the challenge is tomake the protocol resistant to denial-of-serviceattacks. (Puzzle based schemes have been pre-viously used for providing DoS protection to

I

I1: <HIT(I), HIT(R)>

R1: <HIT(I), HIT(R), challenge>

I2: <HIT(I), HIT(R), response, authentication>

R2: <HIT(I), HIT(R), authentication>

R

Figure 2: The base exchange of the Host LayerProtocol

both authentication [3] and encryption [5] pro-tocols.) I solves the puzzle and sends inI2the HITs of itself andR as well as the solu-tion to the puzzle, and performs the authenti-cation. R2 now commits itself to the commu-nication, and responds with the HITs ofI anditself, and performs the authentication. Afterthis, I andR have performed the mutual au-thentication and established Security Associa-tions for ESP, and can now engage in securecommunications. Furthermore, reachability isverified by the fact that the protocol has morethan two rounds.

If I does not have any prior information ofR, itmay retrieve the information from a repository,such as DNS.I sends a lookup query to theDNS server, which replies withR’s address,HI, and HIT.

There are three other messages in the HLP. TheHIP New SPI Packet (NES) provides the peersystem with its new SPI, provides a new Diffie-Hellman key to produce new keying material,and provides any intermediate system with themapping of the old SPI to the new. The HIPReaddress Packet (REA) allows a host to no-tify its partners of a change of the IP address(e.g. as a result of mobility). The HIP Boot-strap Packet (BOS) is used when the initiator isunable to learn a responders information froma repository.

Page 92: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 92

3 Implementation

The protocol is implemented as a kernel mod-ule which uses a user space daemon process forsome cryptographic operations, such as com-putation and verification of DSA signatures.Since the protocol is implemented as a ker-nel module, the kernel can remain as intactas possible, with only minor modifications.The modifications are backwards compatibleso that normal TCP/IP connectivity withoutHIP can still be used. The state information re-quired by the protocol state machine is locatedwithin the kernel module. The user space dae-mon acts as a slave to the kernel module anddoes not contain state.

The implementation is based on the Linux ker-nel version 2.4.18 with USAGI [2] patches.The implementation supports HIP only overIPv6.

3.1 Network Socket API

The most important design goal in the networksocket API has been the transparent use of HIPby legacy applications. Thus, the legacy appli-cations do not need any changes in their sourcecode to utilize the benefits of HIP. On the otherhand, applications that are HIP aware shouldbe able to perform some additional tasks thatwill not be available to legacy applications. Forexample, a HIP aware application may requireor deny the use of HIP. A reason to require HIPwould be to benefit from the multihoming, se-curity, and mobility features of HIP. A reasonto deny the use of HIP might be to avoid theextra overhead caused by the cryptographic op-erations in a device with limited computing ca-pacity.

A typical network application does not usuallyestablish a network connection directly to anIPv6 address. Instead, the application is usu-ally given the hostname of the peer, which has

to be resolved to an IPv6 address from DNS.The connection can then be established to theIPv6 address.

When HIP is used, the network applicationneeds additional support in the resolver for twodifferent reasons. The first reason is that the re-solver should return HITs instead of IPv6 ad-dresses if HIP is being used transparently in alegacy application. The second reason is thata HIT to IPv6 address mapping should alwaysbe sent to the kernel as a side effect of the do-main name query. Otherwise the IPv6 layer inthe kernel does not have an address it can usefor routing packets.

The resolver interfaces are traditionally con-tained in libc . The USAGI project hasits own modified version oflibc which isalso used in the implementation. Only thegetaddrinfo resolver interface is currentlysupported in the implementation for experi-mentation purposes.

Most of the legacy IPv6 applications, such astelnet clients and web browsers, are able to useHIP in transparent mode if they can access theHIP enabled resolver. This means that theyshould be relinked against the HIP patched US-AGI libc . Firewalls, network address trans-lators and other applications that handle rawpackets may need changes in application codein order to utilize HIP.

3.2 Userspace daemon

A userspace daemon is required by the HIPmodule for several reasons, of which the mostimportant reason is that the protocol requiresDSA and Diffie-Hellman cryptographic algo-rithms which cannot easily be implementedwithin the kernel. Unfortunately, there are noknown kernel cryptographic libraries support-ing those algorithms, so those tasks have to bedone in user space libraries. The HIPL im-

Page 93: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 93

plementation uses the OpenSSL [1] library foruser space cryptographic operations.

The daemon is used by the kernel module toperform many small operations on data. Thekernel module can send queries such assignthe given data with this DSA keyor solvethis cookieto the daemon. The daemon cal-culates a response for the given query andthe response either contains the answer to thequery or an error message if something wentwrong. Messages between the kernel moduleand the daemon are exchanged synchronouslyin a request-response fashion.

The request-response communication is imple-mented using a common interface in the kernel.When a request is carried out in the kernel, thecurrent context of execution will be saved. Thecontents of the context depends on the opera-tion being executed, but a minimal context de-fines at least a reference to a callback function.The callback function is called after the requesthas been served in the daemon and the daemonhas sent a response message to the kernel mod-ule. The kernel module can restore its statebased on the information stored in the contextand then continue its execution where it left off.

The actual request-response communication isimplemented in a straightforward manner. Theimplementation can serve only one daemon re-quest at a time, and subsequent requests aresaved into a FIFO queue. An arriving responsemessage from the daemon triggers a new dae-mon request from the top of the FIFO queue.

3.3 Networking Stack

Transport layer communications are bound toHITs when HIP is used. When data is sentover the transport layer connections, packetsare created and received as if they were usingHITs as the source and destination addresses inthe transport layer headers. As the packets are

passed up and down the protocol stack, theywill encounter a number of hooks that may in-tercept the passing packet to the HIP modulefor modifications. Currently, the implementa-tion has three major entry points into the HIPmodule from the IPv6 stack.

IPv6 output functions. The hooks in the out-put functions are triggered after the packethas been built and the packet has passedIPsec ESP processing. If the packet be-longs to a HIP connection, it has a HIT in-stead of an IPv6 address as the destinationaddress in the IPv6 header. Such a packetwill be intercepted by the HIP module.Other packets are allowed through intact.

When the HIP output functions receive apacket, the module first checks whetherthe packet belongs to an established HIPconnection by searching its table of estab-lished connections using the source anddestination HITs as the key. If an exist-ing connection is not found, the packetis dropped and a HIP exchange is startedby sending an I1 packet. On the otherhand, if an existing connection is found,the source and destination HITs in theIPv6 header are replaced by the IPv6 ad-dresses that are stored in the mapping ta-ble in the kernel. Thus, even if connec-tions are maintained using HITs as identi-fiers in the transport layer, the actual pack-ets that are sent to the network will stillalways contain valid IPv6 addresses.

IPv6 ESP input functions. All receivedpackets that belong to an establishedHIP connection will have an ESP header.Therefore, it is only necessary to interceptHIP packets from within ESP. Packetsthat are received by ESP are classifiedto those that belong to a HIP connectionand those that do not. The reverse ofthe output mapping is performed. The

Page 94: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 94

correct mapping is located using the SPIfield in the ESP header, and if a mappingis found, the source and destinationaddresses in the packet are replaced bythe corresponding HITs before the packetis forwarded to the actual ESP processing.Again, the ESP processing only sees theHITs, and not the IPv6 addresses.

HIP protocol input. A special case is re-quired for HIP packets that are re-ceived during the connection establish-ment phase. The HIP module is registeredas a transport layer protocol and does notactually require a special hook for thisfunctionality. This intercept point is usedto receive I1, R1, I2 and R2 packets, aswell as other HIP negotiation packets.

Also, a few other hooks are required in or-der to, for example, make the neighbor discov-ery in IPv6 work correctly with HIP. However,these hooks are not relevant for this discussion.

3.4 Collaboration of Components

This section gives an overview of the collab-oration of the components through an exam-ple. Figure 3 represents an overview of thesystem architecture and the logical connectionsbetween the components.

For simplicity, only a minimalistic base ex-change is demonstrated. For example, mobil-ity and multihoming are not demonstrated here.The configuration in this example consists oftwo hosts which use legacy applications thathave not been designed for HIP, and thereforethe transparent mode is used. The host thatstarts the HIP exchange will be referred to asthe initiator while the peer is known as the re-sponder. The initiator is a host that wishes tobrowse a web page from another host, and theresponder has a web server listening for incom-ing requests. A DNS server in the domain of

HIP networkmodule

HIP daemon

USERSPACE

KERNEL

Figure 3: Collaboration of components

the responder is configured to return the IPv6address and the HIT of the responder.

When the HIP module is loaded into the ker-nel, it first queries a Host Identity from the dae-mon. This identity will be used in all signedHIP packets that are sent by the host. The ker-nel module also generates a list of prebuilt R1packets for quick sending. Finally, the moduleregisters its hooks into the kernel. HIP connec-tions can now be established.

When the user of the initiator host inputs theURL of the requested web page into the webbrowser to view the web pages in the respon-der’s web server, the browser queries DNS forthe name of the host using the resolver routinegetaddrinfo . Since the browser is linkedto the modifiedlibinet6 , the query is han-dled by the modified resolver.

The resolver queries the DNS for the respon-der’s hostname. When the resolver receives theresponse from the DNS, it finds IPv6 addressesas well as HITs in the reply. Two things aredone before the DNS reply is returned in a listfrom the resolver to the web browser. First,the resolver changes the order of the addressesin the DNS reply list before returning them tothe application. The HITs of the responder areplaced in the beginning of the list before theIPv6 addresses of the responder. Second, the

Page 95: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 95

kernel is notified about the mapping of the HITto the corresponding IPv6 address.

The result of the resolver call is a list contain-ing first the HIT and then the IPv6 address ofthe responder. The browser is assumed to makethe straightforward choice and select the firstaddress from the list, which is a HIT in thiscase. The HIT is then used in the socket calls.Because the browser uses the HTTP [4] proto-col which runs over TCP, the browser passesthe HIT to the TCPconnect call in the net-work socket API.

Theconnect call is handled by the TCP layerin the kernel. The TCP layer begins a hand-shake to establish a connection by generating aSYN packet to be delivered to the web server.When the SYN packet is encapsulated into anIPv6 packet in the IPv6 layer, the packet is cap-tured by the output hook of the HIP module.The HIP module then examines the packet anddiscovers that the addresses in the packets areHITs instead of regular IPv6 addresses. Themodule also attempts to lookup a previouslyestablished HIP connection from its table ofestablished HIP connections. The lookup failsbecause the connection attempt was a new oneand a HIP exchange is needed to establish thenew HIP connection. Since a HIP connectionbetween the client and the server does not ex-ist, the TCP SYN packet cannot immediatelybe delivered to the web server. For simplicity,the SYN packet will be dropped. Retransmis-sions will also be dropped until the base ex-change has been completed.

To start the base exchange, the initiator sendsan I1 packet to the web server. R1, I2 and R2packets are exchanged after this. The build-ing and parsing of each of the R1, I2, and R2packets requires the assistance of the HIP dae-mon. For example, the daemon verifies the va-lidity of the identity of the peer from DNS andcreates a symmetric Diffie-Hellman key for the

hosts during the base exchange.

Once the base exchange is completed, the hostswill have generated a common secret that theywill be able to use to secure their communica-tion. They will also have established IPsec Se-curity Associations that will be used to encryptthe communication between the hosts. TheTCP handshake can continue, and once it hasbeen completed, the initiator can receive webpages from the web server at the responder. Iffurther TCP connections need to be establishedbetween the two hosts, the HIP negotiation isnot needed to be performed again, but the ex-isting security associations are reused for thenew connections.

4 Conclusion

In this paper, we described the HIP architec-ture, which has been designed to overcomeproblems mainly with respect to security, mo-bility, and privacy in the current Internet. HIPadds a new layer, the Host Layer, between thenetworking and transport layer in the IP stack,and introduces a Host Identity (HI) to serve asan end-point identifier of the host. Typically,the HI is represented by a public key. Eachhost will have at least one HI assigned to itsnetworking kernel or stack. As the HI is usedto identify the hosts, the IP addresses are usedmerely for routing purposes.

HIP defines a Host Layer Protocol to be usedas a signaling protocol between end hosts. Thepurpose of the protocol is to perform mu-tual end-to-end authentication and to establishIPSec Security Associations. HLP consists ofseven message types, of which four are part ofthe HIP base exchange.

As part of this paper, we presented our IPv6based implementation of HIP for Linux. TheHost Layer Protocol is implemented as a ker-nel module, which uses a user space daemon

Page 96: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 96

process to perform some cryptographic opera-tions. The advantage of our approach is thatthe kernel can remain as intact as possible,with only minor modifications. Furthermore,the modifications are backwards compatible sothat the host is able to do networking with-out HIP. Our implementation is based on Linuxkernel version 2.4.18 with USAGI patches.

Acknowledgments

This research is funded by the National Tech-nology Agency of Finland (TEKES), ElisaCommunications, Ericsson, Nokia, TeliaSon-era, Creanor, and More Magic Software. Wethank Jukka Ylitalo, Jorma Wall, Petri Jokela,and especially Pekka Nikander from EricssonResearch for fruitful cooperation and interop-erability testing with their implementation ofHIP for BSD. Furthermore, we are grateful forthe valuable discussions we have had with An-drew McGregor and Thomas Henderson.

References

[1] Openssl: The open source toolkit forssl/tls.http://www.openssl.org/ .

[2] Usagi project.http://www.linux-ipv6.org/ .

[3] T. Aura, P. Nikander, and J. Leiwo.Dos-resistant authentication with clientpuzzles,.

[4] T. Berners-Lee, R. Fielding, H. Frystyk,J. Gettys, and J. Mogul. HypertextTransfer Protocol – HTTP/1.1. Technicalreport, Internet Engineering Task Force,January 1997. RFC 2068.

[5] Catharina Candolin, Janne Lundberg,and Pekka Nikander. Experimenting

with early opportunistic key agreement.In Proceedings of Workshop SEcurity ofCommunication on Internet, InternetCommunication Security, Tunis, Tunisia,September 2002.

[6] S. Deering and R. Hinden. InternetProtocol, Version 6 (IPv6) Specification.IETF Request for Comments 2460,December 1998.

[7] S. Kent and R. Atkinson. IPEncapsulating Security Payload (ESP).Request for Comments 2406, 1998.

[8] S. Kent and R. Atkinson. SecurityArchitecture for the Internet Protocol.IETF Request for Comments 2401, 1998.

[9] P. Mockapetris. Domain Names —Implementation and Specification. IETFRFC 1035, 1987.

[10] R. Moskowitz. Host identity payload andprotocol. Internet Draftdraft-moskowitz-hip-05.txt, work inprogress, 2001.

[11] R. Moskowitz. Host identity payloadarchitecture. Internet Draftdraft-ietf-moskowitz-hip-arch-02.txt,work in progress, 2001.

[12] R. Moskowitz. Host Identity PayloadImplementation. Internet Draftdraft-ietf-moskowitz-hip-impl-01.txt,work in progress, 2001.

[13] P. Nikander, J. Ylitalo, and J Wall.Integrating Security, Mobility, andMulti-homing in a HIP way. InProceedings of Network and DistributedSystems Security Symposium (NDSS’03),pages 87–99, San Diego, USA, February2003.

[14] B. Wellington. Domain Name SystemSecurity (DNSSEC) Signing Authority.

Page 97: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 97

IETF Request for Comments 3008,November 2000.

Page 98: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Improving enterprise database performance onIntel Itanium ® architecture

Ken Chen, Rohit Seth, Hubert NueckelIntel Corporation

Software and Solutions Group

Abstract

In this paper, we will present several operatingsystem features that improved database per-formance under OLTP1 workload significantly,such as Huge TLB2 page to reduce DTLB3

misses as database uses large amount of sharedmemory and asynchronous I/O to accommo-date high amount of random I/O without in-troducing the overhead of many I/O processes.We will also present many other kernel opti-mizations that were developed by Intel, RedHat and the Linux community that improvedthe scalability and performance of Linux ker-nel, specifically the areas are: raw vary I/O,kernel data structure footprint reduction, globalio_request_lock reduction, and storage devicedriver optimization.

1 Introduction

Linux has been receiving a great deal of at-tention in the past few years. The popular-ity is propelled by wide range of adoption ofLinux for enterprise computing. Major soft-ware vendors have been supporting their prod-ucts on Linux for many years. As the enter-prise software solution stack builds up every-day, it is crucial that Linux kernel develop-

1On-Line Transaction Processing2Translation Lookaside Buffer3Data TLB

ment takes this opportunity to ensure that ker-nel provides key necessary infrastructure forenterprise application to excel. This means de-veloping enterprise focused operating system(OS) features, improving performance by ex-tending the scalability, and many other areas.

Relational database management systems(RDBMS) are complex server applicationsthat solve the problems of information man-agement. The RDBMS reliably manages largeamount of data in a multi-user environmentsuch that users can concurrently access shareddata. While it is required to maintain con-sistent data between users, it is also requiredto deliver high performance. All these re-quirements need high-quality infrastructureprovided by the operating system. Some ofthe examples are virtual memory manage-ment for managing vast amount of physicalmemory, scalable I/O subsystem, robust / highperformance storage subsystem, light-weightinter-process communication, and robust /high performance networking subsystem.

Recent Linux kernel development has ad-dressed many of the areas with a focus toprovide key functionality for enterprise work-loads. The rest of the paper will discuss newkernel features as well as performance en-hancements in the context of database runningOLTP workload.

Page 99: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 99

2 Overview

On-line transaction processing refers to a classof applications that facilitates and managestransaction-oriented operation, typically fordata entry and retrieval transactions in a num-ber of industries. The basic skeleton ofOLTP environment consists of multi-tier soft-ware applications that allow thousands of usersto concurrently execute transactions against adatabase. Typically transactions are either ex-ecuted on-line or queued for deferred execu-tion and have certain characteristics on the dis-tribution between a mixture of different types.Because of the complexity and overall execu-tion behavior of OLTP workload, the workloadcharacteristics can be summarized as:

• Simultaneous execution of multiple typesof transactions that span a breadth of com-plexity

• On-line and deferred transaction execu-tion

• Significant random disk input/output

• Transaction integrity

• Unique distribution of data access

• Contention on data access and update

From system architecture perspective, theOLTP workload exercises a breadth of systemcomponents associated with the environment.Database server application and the underly-ing operating system software are the key soft-ware components to provide high performance.Earlier evaluation of Linux kernel under OLTPworkload revealed several hot spots or limita-tions from performance point of view, such aslarge execution time spent in low level TLBmiss handling, large number of process contextswitch due to blocking synchronous I/O, large

execution time on functions related to I/O ele-vator algorithm, and large execution time spenton spinning on a highly contended lock likeglobal io_request_lock. In the following sec-tions, we will examine how features like HugeTLB and asynchronous I/O allow database ap-plication to exploit maximum hardware capa-bility with minimum overhead from Linux ker-nel and how Linux I/O subsystem is improvedto reduce kernel execution time.

3 Huge TLB Support in Linux

3.1 Motivation of Huge TLB page

A TLB (Translation Lookaside Buffer) is ahardware structure for virtual-to-physical ad-dress translations that supports high perfor-mance paged virtual memory system. Typi-cally it is a scarce resource on a processor. Op-erating systems always try to make best useof the limited number of available TLB en-tries on a system. Orthogonally, with advance-ment of semiconductor technology that result-ing in ever growing memory capacity, it be-comes more and more feasible both technicallyand economically to populate tens of gigabytesof memory on a server. For example, HP serverrx5670 can be populated with 48 GB of mem-ory with 1GB DIMM4, or even 96 GB with lat-est 2GB DIMM.

Database server applications generally uselarge amounts of system memory in order toefficiently manage the actual databases thatare usually much larger than system memory.It typically utilizes shared memory segmentsamong multiple database processes. The firstarea in shared memory segments, usually thelargest, is the database buffer cache. It holdscopies of data blocks read from datafiles. Adata block is the smallest unit of storage space

4dual in-line memory module

Page 100: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 100

managed by database server. The RDBMS ac-tively manages the data blocks in the buffercache. When a user process requires a particu-lar piece of data, it searches through the buffercache. If the data is already in the cache (acache hit), it will read the data directly frommemory. Otherwise data block will be copiedfrom datafile on disk into memory (a cachemiss). It is well known that accessing datafrom memory is several orders of magnitudefaster than accessing data from disk. Thereforein production environment, system administra-tor will typically allocate as much memory aspossible for shared memory segments in orderto improve cache hit rate by maximizing buffercache size. However, accessing large amountof memory combined with random data accesspattern of OLTP workload, it puts lots of pres-sure on CPU’s TLB resource. For example, as-suming 16K page size for Linux-IA64, a 48 GBof process memory would need 3 million TLBtranslations. Or to look at from hardware pointof view, an Itanium 2 processor’s internal TLBresource would only cover 2 MB of virtual ad-dress space with 16 KB page size.

With vast amount of memory each applica-tion process access, there is a need to makeeach TLB mapping as large as possible to re-duce TLB pressure. Large contiguous regionsin a process address space, such as contigu-ous data, may be mapped by using small num-ber of large pages rather than large number ofsmall pages. It is also important to note herethat OS kernel cannot blindly pick up a largerpage size for all applications because it maycause lots of fragmentation and very poor uti-lization of large amount of physical addressspace. Thus a requirement for having a sepa-rate large page size facility from the operatingsystem becomes more and more important interms of functionality and performance.

3.2 Design and Implementation

To support large page size for user applicationto utilize processor’s capability, Intel workedwith the Linux community to introduce a newOS feature that exposes the hardware architec-ture for application to benefit from using hugepage size without affecting many other aspectsof the OS. This new feature is called Huge TLBpage. Specifically the Huge TLB support is at-tempting to solve the following problems:

• Increase CPU TLB coverage / Reducedata TLB miss rate

• Reduce process’s page table memory re-quirement

• Pin data pages in physical memory

The design goal of Huge TLB interface is toexpose the hardware architecture to applica-tion. Mapping the kernel, or specialized de-vices such as frame buffers by using large map-ping is a relatively straightforward exercise. Itonly affects very limited portions of the oper-ating system code. However, virtual memoryimplementation in Linux kernel makes the ba-sic assumption that there is only one page sizefor user applications. This one size is related toMMU page size supported by a specific archi-tecture. For example, on IA-32 this page sizeis 4K, and on Itanium-based system, user pagesize is configurable at kernel build time to beeither 4K, 8K, 16K or 64K. Itanium 2 proces-sor actually provides concurrent multiple pagesize support (4K, 8K, 16K, 64K, 256K, 1M,4M, 16M, 256M, 1G and 4G). The current VMsystem is not suited for supporting multipleuser page sizes because the knowledge of onepage size is ingrained in several subsystemswithin the kernel. It is important to note thatsupporting multiple page sizes affects both ar-chitecture dependent and independent portions

Page 101: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 101

of the Linux kernel. That is, a clean separa-tion of architecture dependent and independentcode in kernel is not enough to mitigate the dif-ficulties of supporting multiple page sizes.

The allocation of Huge TLB page is performedin two phases. First a system administrator re-quests the kernel to reserve a set of memoryin a special huge TLB page pool. The reser-vation of each huge TLB page is constrainedthat memory to be physically contiguous. Oncehuge TLB pages are reserved by the operat-ing system, they can be used by applicationthrough two well defined system interfaces, ei-ther by mmap interface or through the stan-dard System V shared memory interface. Note,application changes are required to use HugeTLB pages.

3.3 Application Benefit

To quantify the speed up of RDBMS underOLTP workload, we setup an experimental en-vironment similar to industry standard OLTPbenchmark on a Itanium 2 processor basedplatform.

First a baseline result is established with stan-dard 16K page size. We then ran experi-ment with 256 MB page size while holding to-tal memory in shared memroy segments con-stant. Throughput is then normalized to base-line. Figure 3.1 depicts the result.

We can easily see that with each incrementalincrease in page size used for data pages inshared memory segments, the speed up is no-tably at 11% overall for 256 MB huge TLBpage size.

To further study how various page size speedsup the overall OLTP throughput at hardwaremicro-architecture level, we used Itanium-processor’s hardware performance monitoringunit (PMU) to measure TLB pressure with var-ious page size. The usage model of PMU

0.00

0.25

0.50

0.75

1.00

1.25

16 KB 256 MB

page size

norm

aliz

ed th

roug

hput

Figure 3.1: Relative OLTP throughput withvarious page size while holding database buffercache size constant

are described in detail in several publications[1][2].

Again a baseline is established and data werecollected for each page size. To measure hard-ware TLB pressure, we measured with met-ric of DTLB miss rate, or inverse of averagenumber of data references per DTLB miss. Asshown from figure 3.2, there is significant re-duction in data TLB miss rate by using hugepage size. For 256 MB page size, the DTLBmiss rate is reduced by 65%, or inversely, thenumber of data references between successiveTLB misses increases by 280%.

1.00

0.35

0.00

0.20

0.40

0.60

0.80

1.00

1.20

16 KB 256 MB

page size

DTL

B m

iss

rate

Figure 3.2: DTLB miss rate comparison

It is also interesting to observe that TLB pres-

Page 102: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 102

sure for OLTP workload on Itanium 2 basedsystem does not vary much with respect to to-tal memory size, it is more or less a function ofI/O load. For example, two experiments wereconducted such that one with 16 GB databasebuffer cache while the other has 32 GB. Themicro-architecture DTLB miss rate for bothconfigurations are well within a couple of per-centage points. This experiment points out thateven at different OLTP throughput due to dif-ferent size of database buffer cache, the benefitof using larger page size is equally significant.With 256 MB page size, the hardware TLB re-source on Itanium 2 processor would be ableto cover up to 32 GB of memory and primarysource of TLB misses are shifted to data accessto process’s local data and task context switch-ing.

DTLB miss rate vs. memory size

0

0.2

0.4

0.6

0.8

1

1.2

16 GB 32 GB

Total Database Cache Size

DTL

B m

iss

rate

Figure 3.3: DTLB miss rate vs. memory size

A second benefit of using huge TLB feature isthat the memory usage for process’s page ta-ble is significantly reduced. Taking a 48 GBsystem as an example, if 45 GB is allocated as180 256MB huge TLB pages, memory for pagetable covers that 45 GB of vma is only 1440byte. For 100 processes that shares the 45 GBof shared memory segment, total memory forpage table is 1.6 MB considering each processround up 1440 byte to one page. In the caseof using normal 16 K page size, the memoryrequirement grows to 2250 MB (3 million en-tries * 8 bytes/entry * 100 processes, ignoring

first and second level page table structure forsimplicity). A secondary effect is that this 2.2GB of memory reduced from page table can bebetter utilized by application to further increaseapplication’s performance.

A third benefit of huge TLB feature is thatmemory allocated for huge TLB page is pinnedin physical memory and is not considered forswapping. This eliminates the chance of swap-ping physical pages that are being used forholding critical application data.

4 Linux I/O Subsystem

4.1 Dynamic vs. Static kiobuf allocation

Direct device access via raw devices partitionimproves database performance. A raw devicepartition is a contiguous region of a disk thatcan be accessed via a character device inter-face (/dev/raw on Linux). Such access typi-cally bypasses the file system buffering. SinceRDBMS does its own memory cache and I/Omanagement, there is no need to have operat-ing system to perform another level of cachingand buffering. In fact, it is better to leave thattask to application because it has much betterinformation to determine optimal I/O strategy.

In a large OLTP workload configuration, dueto sheer number of disk drives and the needto spread I/O load onto large number of diskdrives, a database server typically opens largeamount of data files where these data files re-side on raw devices. Independent processeswithin the database server application will eachopen same set of data files.

The existing raw I/O code will statically allo-cate one kiobuf and its associated structures(mainly buffer_head structure, abbreviated asbh hereafter) upon every raw device open.There are 1024 bh allocated for each kiobuf.In a benchmark configuration, the memory re-

Page 103: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 103

quirement just for the bh structure is calculatedas following:

150 raw devices * 120 db processes * 1024 bh* 192 byte/bh = 3534 MB

However, since each process can have only oneoutstanding synchronous I/O at any given time,the active memory required for 120 processesare:

120 db processes * 1024 bh * 192 byte/bh = 24MB

There are massive amount of memory beingset aside by the bh structure and only 0.67%of them are being actively used. This largeamount of under-utilized memory can be betterdevoted for other part of the system, for exam-ple, database buffer cache.

The cause of the issue is that each kiobuf struc-ture is associated with a file descriptor. Al-though in certain cases, static bh allocationavoids the overhead of dynamic allocation, thisstatic allocation scheme actually hurts perfor-mance for OLTP workload due to displacementof memory allocated for bh structure but other-wise can be used for database buffer cache.

To enable large number of raw devices tobe opened simultaneously, we removed thestatic kiobuf allocation in raw_open functionand at each invocation of rw_raw_dev func-tion, kiobuf is dynamically allocated and freedfor each raw I/O request. In order to re-duce the prohibitive amount of overhead withdynamic allocation of all the memory arraysin kiobuf, we treat kiobuf and its associatedmember arrays as one entity. With the aidof constructor and destructor API provided bythe kernel slab allocator, member arrays ofkiobuf are allocated and initialized upon thecreation of kiobuf object. Subsequent dy-namic allocation would only incur one levelof kmem_cache_alloc and kmem_cache_free

overhead for such large data structure. With theper-CPU slab allocation area, the cost of dy-namic allocation is even more affordable. Theoverhead of this dynamic kiobuf allocation ismeasured at 0.8 % for 2 KB I/O size and 0.1%for 128 KB I/O size.

It should be noted that even though the size ofkiobuf structure is small (128 byte on Linux-IA64), the entire kiobuf entity is fairly largeat 200KB. The per-CPU array for kiobuf slabcache should be managed pro-actively. Withdefault parameter that calculates the per-CPUarray size based on object size, there will bemaximum 252 objects allocated on per-CPUarray and on a 4 CPU system, this leads to 1008kiobuf entity, or 200MB memory allocation. Asmall burden to the system administrator.

4.2 Variable size Block I/O

A second enhancement made to the raw de-vice layer is to enhance the effectiveness forthe raw vary I/O on Linux-IA64. The exist-ing code restricts the sector combining to max-imum size of RAWIO_BLOCKSIZE (4KB).The user pointer is also restricted to be alignedon that boundary (4K aligned). Both restric-tions are sub optimal on Linux-IA64 becausethey reject many scenarios that can be put intospeed path.

The implementation can be modified to be runtime page size aware instead of hard codedconstant value. The concept is to combine allsectors within a page to send down to sub-mit_bh. For example, on a system with de-fault page size of 16KB, a raw I/O request with16KB size would be broken down to 4-4-4-4KB with existing code where it could be com-bined optimally as one 16KB request to sub-mit_bh. The user pointer should only be re-stricted to sector aligned. For example, againon a system with page size of 16KB, a rawI/O request of 4 KB I/O size with user pointer

Page 104: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 104

aligned on 2KB into a page would be rejectedby the existing code for fast path considerationwhere technically it could be in the fast path.We measured the speed up varies from 10% to280% with micro-benchmark depending on theI/O size and buffer alignment for this enhance-ment.

Again, using OLTP workload to measure howwell the raw vary I/O and the enhancementsmeasure up in production environment, we rantwo experiments, one with and one withoutraw vary I/O. It was measured that raw varyI/O gives 4% performance advantage over onewithout for OLTP workload.

0.98

0.99

1

1.01

1.02

1.03

1.04

1.05

without raw vary io with raw vary io

norm

aliz

ed th

roug

hput

Figure 4.1: Comparison of raw vary I/O underOLTP workload

4.3 Relieve global lock contention

Another area of improvement in blockI/O subsystem was the reduction of globalio_request_lock usage. Much work has beendone in this area [4] and the result of the workis incorporated in products released by severalmajor Linux OS distributors. In earlier releasesof Linux kernel 2.4, I/O requests are queuedone at a time while holding the global lockio_request_lock. The Linux community hasimplemented many iterations/versions to breakthe global lock to per device lock. With theoptimization, I/O requests are queued whileholding a lock specific to the queue associated

with the request. This improves concurrentI/O queuing and significantly improves I/Othroughput.

5 Asynchronous I/O

5.1 History of AIO implementation

Several asynchronous I/O implementations fol-lowing the POSIX standard were developedduring the Linux kernel 2.3 development cy-cle. The implementations were either in ker-nel space or user space. Both of them em-ployed an idea of an I/O queue with N numberof helper threads that issues synchronous I/Oto the underlying OS. However, there are sev-eral drawbacks with this approach for databaseapplication. First of all, even though the inter-face is asynchronous like, the I/O throughput isseverely limited due to another layer of queu-ing. The optimum number of helper threadsalso depends on the characteristics of I/O sub-system and thus not flexible for wide range ofproduction environment. A second issue is thatthe POSIX defined reap function aio_suspend()has a worst case of O(n) operation and tends tobreak down with large number of pending I/O.

5.2 A New paradigm

During the Linux 2.5 kernel development cycle, Red Hat kernel developers implemented a new AIO subsystem and API based on the concept of a completion queue [3]. Subsequently Intel worked with Red Hat to port and refine the AIO design for the Linux 2.4 kernel on Linux-IA64.

The core of this kernel AIO implementation is centered around the completion queue. It introduces 5 new system calls for asynchronous operation. The core is generic, so the operation is not restricted to disk I/O but also covers network and file system I/O. The completion queue is created by the system call io_queue_init and destroyed via io_queue_release. New I/Os are submitted via io_submit and are queued only if there is sufficient space in the completion queue to receive the resulting events. When an I/O completes, a corresponding event is put into the completion queue and can be reaped via io_get_events. RDBMS applications typically use unbuffered I/O; combined with the AIO infrastructure, disk I/Os are now queued directly at the block layer to exploit the maximum concurrency that the underlying hardware devices are capable of.
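As a rough illustration of how an application drives this interface, the sketch below uses the libaio user-space wrappers for these calls (io_queue_init, io_submit, io_getevents, which the text above refers to as io_get_events, and io_queue_release). The device path is a placeholder and error handling is minimal.

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;

	/* Placeholder device path; raw device I/O needs sector-aligned
	 * buffers, so allocate one page-aligned buffer. */
	int fd = open("/dev/raw/raw1", O_RDONLY);
	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;

	if (io_queue_init(128, &ctx))		/* create the completion queue */
		return 1;

	io_prep_pread(&cb, fd, buf, 4096, 0);	/* describe one 4KB read */
	if (io_submit(ctx, 1, cbs) != 1)	/* queue it at the block layer */
		return 1;

	/* reap at least one completion event */
	if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
		printf("completed, result=%ld\n", (long)ev.res);

	io_queue_release(ctx);			/* destroy the completion queue */
	return 0;
}

(Build with -laio.)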

5.3 AIO Evaluation and Optimization

We first turn our attention to evaluating how well kernel asynchronous I/O performs under a heavy disk I/O micro-workload. The system under test has 3 fibre channel host adaptors connected to 180-disk Clariion towers. The disk towers are configured as 10 hardware RAID-0 disk drives and each RAID-0 drive has 10 raw partitions. A micro-benchmark program then issues AIO randomly across the 180 raw partitions at random offsets (rounded to a multiple of the sector size). The I/O size is limited to 2KB and 16KB to limit the permutation of all other variables.

The micro-benchmark throttles I/O to keep the system busy with at least N I/Os pending at any given time. When the number of pending I/Os drops to N, the test program batches the next set of 'B' I/Os in one AIO io_submit call. Completed I/Os are also reaped at each AIO submit, i.e., the program reaps approximately 'B' I/Os in one io_get_events call.
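A skeleton of that throttling policy might look like the following. It assumes the libaio interface shown above; next_iocb() is a hypothetical helper that prepares the next random request, the batch size is assumed to be at most 1024, and error handling is omitted, so this is a sketch of the batching logic rather than the actual benchmark program.

#include <libaio.h>

/* Hypothetical helper that fills in the next random raw-partition request. */
extern struct iocb *next_iocb(void);

/* Keep the system loaded with at least n_floor pending I/Os; when the
 * pending count drops to the floor, submit a batch of b requests in one
 * io_submit() call and reap roughly a batch of completions. */
static void drive_io(io_context_t ctx, long n_floor, long b, long total)
{
	struct iocb *batch[1024];
	struct io_event events[1024];
	long pending = 0, submitted = 0;

	while (submitted < total) {
		if (pending <= n_floor) {
			long i, todo = (total - submitted < b) ? total - submitted : b;

			for (i = 0; i < todo; i++)
				batch[i] = next_iocb();
			io_submit(ctx, todo, batch);	/* error handling omitted */
			pending += todo;
			submitted += todo;
		}
		pending -= io_getevents(ctx, 1, b, events, NULL);
	}
}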

The first experiment measures the average CPU time spent processing one I/O in the AIO request array. We sweep the 'B' parameter from 32 to 1024 while holding N constant at 1000. The data reflects pure CPU cycles spent on processing I/O, excluding the wait time due to disk access latency. Figure 5.1 depicts the result.

[Figure 5.1: Average AIO cost per I/O. Average processing time per I/O (us) vs. number of I/Os submitted per syscall, for 2KB and 16KB I/O sizes.]

For unbuffered I/O via the raw device, the processing cost per AIO request should, in the ideal case, be insensitive to the size of the I/O. However, the large difference in average cost between the 2KB and 16KB sizes in Figure 5.1 indicates that some code path in the system breaks down badly with large I/O sizes. A kernel profiler showed that the elevator algorithm was responsible for the extra cost in the 16KB case. It was apparent that raw vary I/O is also needed for the asynchronous I/O path on the raw device. Enhancements in addition to raw vary I/O were also made in the generic AIO layer. Figure 5.2 illustrates the result of the optimizations.

With the raw vary I/O optimization, the cost of AIO on the raw device is now quite consistent across I/O sizes, which matches our expectation. The overall optimization improves the 16KB I/O size by 400% and the 2KB I/O size by 27%.

5.4 Application benefit

With all this analysis done using the micro-benchmark, the next question is: how do AIO and the optimizations measure up in a real-world production environment, such as an RDBMS with an OLTP workload?


[Figure 5.2: Average AIO cost per I/O with optimization. Average CPU time per I/O (us) vs. number of I/Os submitted per syscall, for the 16K-orig, 2K-orig, 16K-varyio, and 2K-varyio cases.]

Most I/O cache schemes employ deferred I/O operations and periodically sync memory contents with persistent data storage. A write-back process is typically woken up under various conditions. One condition is a database checkpoint, where the process writes modified database records to persistent media in order to bring the copies of those records on persistent media up to date.

At high transaction rates, and especially with a large percentage of update-intensive queries in the OLTP transactions, the amount of modified database records in the buffer cache is high when checkpointing initiates. It is essential that the write-back process write those records to disk as quickly as possible to minimize the CPU time consumed by the checkpoint task. Since the purpose of a checkpoint is to sync the persistent data files with the contents of memory, there is very little data dependency on when the blocks are written, as long as the database server is notified that the writes have completed. This requirement fits perfectly with the non-blocking semantics of asynchronous I/O.

There are two ways for the write-back process to submit I/O to the OS. One is an internal RDBMS facility that distributes I/O among multiple helper processes, for systems that lack a native AIO implementation; this facility is similar to the I/O queue and helper threads described earlier. The other is to submit I/O to the OS via the native AIO interface. Again, two experiments were conducted: first with the I/O helper thread configuration to establish a baseline result, and second with the native AIO configuration. Figure 5.3 depicts the result.

[Figure 5.3: Benefit of AIO for OLTP workload. Normalized throughput vs. run time (minutes) for the I/O helper threads and AIO configurations.]

Several points are worth noting here. At steady state, the configuration with AIO delivers 10% higher OLTP throughput than the one without AIO, due to a combination of reduced I/O process overhead and efficient OS I/O queuing and event reaping. Second, in the I/O helper thread configuration the system incurs extra overhead from context switching between the helper threads and other active processes. Not only does each helper thread pay the penalty of a process context switch, it also puts more pressure on the CPU's data cache because more processes are actively running on the system. In the asynchronous I/O case, disk I/Os are submitted directly to the OS, which reduces the number of context switches. As illustrated in Figure 5.4, the number of process context switches is reduced by 12% with asynchronous I/O.

[Figure 5.4: System context switch rate with and without AIO, normalized to the baseline.]

There are many other secondary effects that indirectly improve overall system performance when using AIO. System memory consumption is reduced because there are no I/O helper threads at all. The OS scheduler is under less pressure because it has fewer active tasks to manage, and the inter-process communication overhead between the helper threads is eliminated. All of this translates into a highly efficient, scalable asynchronous I/O layer and higher OLTP throughput.

6 Storage Device Driver Optimization

While most of the I/O enhancements outlined in the previous sections are more or less transparent to the storage device driver, some still require cooperation from each individual driver to enable a specific optimization. One example is HP's Smart Array family of disk controllers. Since this driver hooks directly into the Linux block I/O layer, it missed out on all the enabling infrastructure for raw vary I/O and the global io_request_lock optimization implemented for SCSI devices.

Both optimizations are fairly straightforward to enable. At controller initialization time we initialize a per-controller raw vary I/O capability array and then hook that array into the blkdev_varyio defined in the block layer. To enable a per-device request lock, two locks are added to the controller's data structure: one for the I/O request queue, and one for the controller itself. The locking primitives are then modified to use the corresponding request queue lock for I/O queuing and dequeuing; for operations that pertain to the controller, the controller lock is used.

Other optimizations that were also actively worked on for this particular storage device driver are interrupt coalescing, a 32-bit DMA command pool, and Itanium architecture-specific command structure alignment.

7 Conclusions

In this paper we have outlined some of the key operating system requirements for running a high performance database on Linux. Implementations of huge TLB support and asynchronous I/O have been described, along with how these features perform under OLTP workloads. The I/O subsystem of the Linux 2.4 kernel has been improved significantly to achieve high concurrency and efficiency for demanding I/O workloads. Storage device driver optimizations are also shown to be equally important for materializing optimizations made at the generic layer.

8 Acknowledgement

The authors of this paper would like to thank the following people, who enthusiastically contributed directly or indirectly to this paper, in no particular order: Asit Mallick, Arun Sharma, Tony Luck, Sunil Saxena, Mark Gross and several other groups at Intel Corporation; Red Hat kernel developers; and the Linux community around the world.

References

[1] Intel Itanium Architecture Software Developer's Manual, Volumes 1-3.

[2] David Mosberger and Stephane Eranian, IA-64 Linux Kernel: Design and Implementation, Prentice Hall, 1st edition, 2002.

[3] Benjamin LaHaise, An AIO Implementation and its Behavior, Ottawa Linux Symposium Proceedings, 2002.

[4] Peter Wai Yee Wong, et al., Improving Linux Block I/O for Enterprise Workloads, Ottawa Linux Symposium Proceedings, 2002.

Trademarks

Itanium is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.

Other names and brands are the property of their respective owners.


High Availability Data Replication

Paul Clements, SteelEye Technology, Inc.

http://www.steeleye.com/

[email protected]

James E.J. Bottomley, SteelEye Technology, Inc.

http://www.steeleye.com/

[email protected]

Abstract

This paper will identify some problems that were encountered while implementing a highly available data replication solution on top of existing Linux kernel drivers. It will also discuss future plans for implementing asynchronous replication and intent logging (which are requirements for performing disaster recovery over a WAN) in the Linux kernel.

1 Introduction

The first part of this paper (Section 2) will discuss some issues in the 2.2 and 2.4 Linux kernels that had to be overcome in order to implement a replication solution using raid1 over nbd.

The second part of the paper (Section 3) will present future plans for implementing asynchronous replication and intent logging in the md and raid1 drivers.

2 Fixing the existing problems

We've done considerable work over the past 3 years testing, debugging, and finally fixing several problems in the md, raid1, and nbd drivers of the Linux kernel. We ran into several bugs in these drivers, primarily because we use them in an unusual fashion, with one of the underlying disk devices of the raid1 mirror being accessed over the network via a network block device (see Figure 1). Our use of raid1 in conjunction with nbd led to the increased occurrence of several race conditions and also caused the error handling code of the drivers to be stressed much more than in a normal, internal-disk-only raid1 setup.

The following is a brief summary of some of the problems we've uncovered and solved:

• eliminating md retries in order to avoid massive stalls when a device (in our case, a network device) fails

• correcting SMP locking errors and allowing an nbd connection to be cleanly aborted when problems are encountered

• fixing various bugs in the raid1 driver:

– mistakes in the error handling code

– incorrect SMP locking

– IRQ enabling/disabling bugs

– non-atomic memory allocations in critical regions

– block number "off by one" error¹

At the time of this writing, patches for all of these problems have been accepted into the latest releases of the mainline 2.4 and 2.5 kernel series.

¹ This problem was actually corrected by Neil Brown after our initial bug report to him.


[Figure 1: Data replication using raid1 over nbd. On the replication source, the /dev/md0 raid1 device mirrors a local disk (/dev/sda1) and a network block device (/dev/nb0) that is backed by a local disk (/dev/sda1) on the replication target.]

For information about all of SteelEye's open source patches, or to download the source code for the patches, visit the SteelEye Technology Open Source Website [1].

2.1 Eliminating md retries

This was one of the first problems that we encountered, back in the fall of 2000. At the time, we were working with the 2.2 Linux kernel. The md resynchronization code (md_do_sync) was written, at the time, to always retry any failed I/O (read or write) no less than 4096 times!² On a network failure, this caused raid1 and nbd to spin in tight loops for several seconds, hanging the entire system. Our stopgap solution (read: hack) was to strategically insert schedule calls into the error handling code of those drivers.³ Needless to say, the md resynchronization code got a major overhaul before the 2.4 kernel was released and this issue was fixed.

² PAGE_SIZE * (1 << 1) * 2 / sizeof(struct buffer_head *) (md.c, c. 2.2.16 kernel); this is 4096 with 4KB pages and 4-byte pointers.

³ We did not have the option of modifying md, since it is compiled into the kernel in most Linux distributions, and we did not want to be in the business of distributing entire rebuilt kernels.

2.2 Allowing nbd connections to be aborted

After initially fixing a few trivial bugs in nbd having to do with missing or incorrect spin lock calls, we realized that we could not afford to wait for TCP socket timeouts when we needed to abort a network connection, or when the network went down. We needed the ability to terminate nbd connections at will, so that our high availability services could have complete control over the data replication process. To fix this, we unblocked SIGKILL during the network transmission phase of nbd so that we could send a signal from user space to terminate an nbd connection. We also needed to add code to ensure that nbd's request queue was cleared and all outstanding I/Os were marked as failed when a connection was terminated.

Our patches for nbd have been accepted into the mainline 2.5 kernel (c. 2.5.50) and were backported and accepted into the 2.4 kernel (c. 2.4.20-pre).

2.3 Fixing various bugs in the raid1 driver

The raid1 driver, by far, has been the biggest thorn in our side... we've made many fixes to raid1 over the past few years in order to increase its robustness. Neil Brown has simultaneously been performing a lot of cleanup and bugfix work in the md and raid1 drivers that was beneficial to our cause as well.

The first set of problems that we ran into with raid1 was related to handling failures of the underlying devices.⁴ To correct the problems, we added code to detect device failures during resync and during normal I/O operations. The additional code correctly marks the device "bad," fails all outstanding I/Os, and aborts resync activities, if necessary.

⁴ Since we use nbd underneath raid1, device "failures" are quite a common occurrence (e.g., when the network goes down).

After fixing this initial set of problems, we were able to stress the raid1 driver much more heavily than we previously had been able to (without it falling over dead). Unfortunately, this heavier stress uncovered a whole raft of new problems. We were able, however, to eventually pin-point and solve each of these new problems. Many of the problems turned out to be due to some fairly common kernel programming mistakes, such as:

• "nested" spin_lock_irq calls – Failure to use the save and restore versions of the spin lock macros with nested calls (i.e., spin_lock_irq called while another spin_lock_irq is already in force) leads to the CPU flags being improperly set. This means that interrupts could be enabled at inappropriate times, causing deadlocks to occur. The rule of thumb here is that it's best to avoid nesting spin locks whenever possible, and to always use the irqsave and irqrestore versions of the macros, instead of the simple irq versions, if deadlocks are a concern.

• sleeping with a spin lock held – There were cases where the driver was doing non-atomic memory allocations or calling schedule with a spin lock held, which caused deadlocks to occur. To avoid the deadlocks, the code was rearranged so that a spin lock was never held while calling schedule, and a few kmalloc calls were changed to use the GFP_ATOMIC flag rather than the GFP_KERNEL flag.

• "off by one" error – This was a simple case of differing block sizes being used in the md and raid1 drivers, resulting in one of the block counts used in the resync code being shifted incorrectly. This bug caused resyncs to hang, leaving the raid device in an unusable state.

Our patches for raid1 have been accepted into the Red Hat Advanced Server 2.1 kernel (2.4.9-ac based) and an alternate version of the fixes (authored by Neil Brown) has been accepted into the mainline 2.4 kernel (c. 2.4.19-pre). The raid1 driver in the 2.5 kernel is not believed to suffer from any of the aforementioned problems.

3 Future enhancements

We are planning to enhance the md and raid1 drivers of the Linux kernel to support asynchronous data replication and intent logging.


[Figure 2: In-memory and on-disk bitmap layout. Each in-memory entry consists of a 1-bit shadow of the on-disk bit plus an (n-1)-bit block counter; the on-disk intent log stores only the single bit.]

Our strategy for implementing these changes will be to place the bulk of the code into the md driver, in a manner that will allow all the underlying raid drivers to take advantage of it. We will also add the necessary code to raid1 to call the md driver hooks.

We plan to leverage some of the implementation and design of Peter T. Breuer's fr1 code [2], which was recently published [3]. The fr1 driver implements intent logging and asynchronous replication as an add-on to the raid1 driver. We will make the following additional changes to the fr1 code to produce a final solution:

• disk backing for the bitmap (intent log)

• addition of a daemon to asynchronously clear the on-disk bitmap

• conversion of single bits to 16-bit block counters (to track pending writes to a given block, so as not to prematurely clear a bitmap bit on disk)

• allow rescaling of the bitmap (i.e., allow one bit to represent various block sizes; the current code is restricted to one bit per 1024-byte data block only)

• make the code fully leverageable by all the raid personality drivers

• add some additional configuration interfaces for the new features

3.1 Intent Logging

In a data replication system, an intent log is used to keep track of which data blocks are out of sync between the primary device and the backup device. An intent log is simply a bitmap, in which a set bit (1) represents a data block that is out of sync, and a cleared bit (0) represents a data block that is in sync. The use of an intent log obviates the full resync that is normally required upon recovery of an array.

3.1.1 Bitmap Layout

We will store the bitmap both in memory and on disk, in order to be able to withstand failures (or reboots) of the primary server without losing resynchronization status information.

We will use a simple, one-bit-per-block bitmap for the on-disk representation of the intent log, while the in-memory representation will be slightly more complex. The reason for this additional complexity is the need to track pending writes, so as not to clear a bit in the bitmap until all pending writes for that data block have completed.⁵ The write tracking will be handled using a 16-bit counter for each data block. One bit in the counter will actually be used as a "shadow" of the corresponding on-disk bit, reducing the usable counter size by one bit (see Figure 2). The counter will be incremented when a write begins and decremented when one has completed. Only when the counter has reached zero can the on-disk bit be cleared.

In order to conserve RAM, the in-memory bitmap will be constructed in a two-level fashion, with memory pages being allocated and deallocated on demand (see Figure 3). This allows us to allocate only as much memory as is needed to hold the set bits in the bitmap. As a fail-safe mechanism, when a page cannot be allocated, the (pre-allocated) pointer for that page will actually be hijacked and used as a counter itself. This will allow logging to continue, albeit with less granularity,⁶ during periods of extreme memory pressure.

The bitmap will also be designed so that it is possible to readjust the size of the data "chunks" that the bits represent. This will be handled by translating from the default md driver I/O block size of 1KB to the chunk size whenever the bitmap is marked or cleared. So, with a chunk size of 64KB, for example, the I/O to 64 contiguous disk blocks will be tracked by a single bit in the on-disk bitmap (and the corresponding in-memory counter).

⁵ Clearing the bit prematurely could result in data corruption on the backup device if a network failure coincides.

⁶ On x86, with 32-bit pointers and 4KB pages, the granularity is reduced to roughly 1/1000 the normal level.
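As a rough user-space sketch of the structure just described (the names, the 4KB page size, and the lock-free, minimal-error-handling form are all assumptions for illustration, not the planned md code), each in-memory entry is a 16-bit counter whose top bit shadows the on-disk bit, and the counter pages hang off a pre-allocated pointer array filled in on demand:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SHADOW_BIT        0x8000u	/* mirrors the on-disk bit */
#define COUNTER_MASK      0x7fffu	/* pending-write count */
#define COUNTERS_PER_PAGE (4096 / sizeof(uint16_t))

struct intent_log {
	uint16_t **pages;	/* pre-allocated pointer array */
};

/* Find (allocating on demand) the counter for a given chunk;
 * allocation-failure fallback omitted for brevity. */
static uint16_t *get_counter(struct intent_log *log, unsigned long chunk)
{
	unsigned long page = chunk / COUNTERS_PER_PAGE;

	if (!log->pages[page])
		log->pages[page] = calloc(COUNTERS_PER_PAGE, sizeof(uint16_t));
	return &log->pages[page][chunk % COUNTERS_PER_PAGE];
}

/* A write to `chunk` begins: set the shadow bit and bump the counter.
 * Returns 1 if the on-disk bit was clear and must now be set. */
static int write_starts(struct intent_log *log, unsigned long chunk)
{
	uint16_t *c = get_counter(log, chunk);
	int set_disk_bit = !(*c & SHADOW_BIT);

	*c = (*c | SHADOW_BIT) + 1;
	return set_disk_bit;
}

/* A write completes: drop the counter; only when it reaches zero is the
 * shadow bit cleared and a lazy on-disk clear queued. */
static int write_completes(struct intent_log *log, unsigned long chunk)
{
	uint16_t *c = get_counter(log, chunk);

	if ((--*c & COUNTER_MASK) == 0) {
		*c &= ~SHADOW_BIT;
		return 1;
	}
	return 0;
}

int main(void)
{
	struct intent_log log = { calloc(16, sizeof(uint16_t *)) };

	printf("set on-disk bit: %d\n", write_starts(&log, 7));
	printf("set on-disk bit: %d\n", write_starts(&log, 7));
	printf("queue on-disk clear: %d\n", write_completes(&log, 7));
	printf("queue on-disk clear: %d\n", write_completes(&log, 7));
	return 0;
}

In the planned driver this bookkeeping would of course happen under the appropriate md/raid1 locks, and when a counter page cannot be allocated the pre-allocated pointer itself would be pressed into service as a coarse counter, as described above.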

3.1.2 Bitmap Manipulation

To make use of the bitmap, we will make modifications to two areas of the raid1 driver:

1. Ordinary write operations will require a bitmap entry to be made (and synced to disk) before the actual data is written; the bitmap entry will be cleared once the data has been written to the backup device.

2. Resynchronization operations will no longer involve a full resynchronization of the backup device, but rather a resync of just the "dirty" blocks (as indicated by the bitmap).

3.1.3 Write Operations

The sequence of events to write block n on a raid1 device with an intent log is as follows:

1. set the nth shadow bit in the in-memory bitmap and increment the counter for block n (both can be done as a single operation since the shadow bit and counter are contiguous)

2. increment the "outstanding write request" counter for the array⁷ (and disallow further writes to the device if the counter has exceeded the configured limit)

3. sync the shadow bit to disk, if the on-disk bit was not already set

4. duplicate the write request, including its data buffer

5. queue the write request to the primary device

6. queue the duplicate request to the backup device

⁷ This counter is really only used when the array is in asynchronous replication mode. For more details, see Section 3.2.

[Figure 3: Two-level, demand-allocated bitmap. Pre-allocated page pointers refer to bitmap pages that are allocated on demand.]

We then allow the writes to complete asynchronously. After each write is completed, the raid1 driver is notified with a call to its b_end_io callback function (raid1_end_request). This function is responsible for signalling the completion of I/O back to its initiator. In synchronous mode, we wait until the writes to both the primary and backup devices have completed before acknowledging the write as complete. In asynchronous mode, the write is acknowledged as soon as the data is written to the primary device.

After the write has been acknowledged, the callback function is responsible for decrementing the block counter and, if the counter's value is 0, clearing the shadow bit in the in-memory bitmap. Whenever a shadow bit is cleared, a request will also be placed in a queue to indicate that the on-disk bit needs to be cleared.

The bits in the on-disk bitmap will be cleared asynchronously, by a dedicated kernel daemon, mdflushd. The daemon will periodically awaken and flush all the queued updates to disk.⁸ The interval at which the daemon will awaken and flush its queue will be tunable (with a default value of 5 seconds).

⁸ Unless the shadow bit has been reset in the meantime, in which case the update is simply discarded and the on-disk bit is left set.

Clearing the bits in the on-disk bitmap in a lazy manner will help to reduce the number of disk writes, and will also ensure that any bits that happen to correspond to I/O "hotspots"⁹ simply remain dirty, rather than causing a constant stream of writes to the on-disk bitmap.

⁹ Areas of the disk that are continually written, such as an ext3 filesystem journal.

3.1.4 Resynchronization Operations

The resynchronization process of the md driver is fairly straightforward. Following recovery from a failure, the driver will attempt a complete resync of the backup device. We will modify this process slightly, so that for each data block that is to be resynchronized, we will first check the appropriate shadow bit in the in-memory bitmap and then, either:

• resync the block (if the bit is set), or

• discard the resync request and indicate success (if the bit is cleared)

Once a block has been resynced, its shadow bit will be cleared and its block counter zeroed. An update request will then be queued to tell mdflushd that the on-disk bit should be cleared.

3.2 Asynchronous Replication

In an asynchronous replication system, write requests to a mirror device are acknowledged as soon as the data is written to the primary device in the mirror. In contrast, in a synchronous replication system, writes are not acknowledged until the data has been written to all components of the mirror. Synchronous replication works well in environments where the mirror components are local. However, when the backup device is located on a network, the write throughput of a synchronous mirror decreases as network latency increases. An asynchronous mirror does not suffer this performance degradation, since a write operation can be completed without waiting for the write request and its acknowledgement to make a complete roundtrip over the network. To achieve reasonable write throughput in a WAN replication environment, an asynchronous mirror is generally employed.

3.2.1 Outstanding Write Request Limit

In an asynchronous mirror, there can be several outstanding (i.e., in-flight) write requests at any given time. In order to limit the amount of data that is out of sync on the backup device during normal mirror operation, it is necessary to keep the number of outstanding write requests fairly low. Therefore, we will place a limit on the number of outstanding write requests. However, to avoid degrading the write throughput of the mirror, this limit must be adequately high. Since the limit will need to be tuned appropriately for each environment, it will be made a user configurable parameter.¹⁰

¹⁰ To avoid overflowing the block counters in the in-memory bitmap, we will make it impossible to set this limit higher than the maximum value for those counters.

When the limit for outstanding writes has been exceeded, the driver will throttle writes to the mirror until another write acknowledgement returns from the remote system (i.e., the mirror will degrade to synchronous write mode). A message will be printed in the system log when this event occurs, to warn system administrators that they should adjust the relevant parameters. The outstanding write request limit will default to a reasonable value (which will be determined through testing).

3.2.2 Device Tagging

In synchronous replication mode, there is no real need to differentiate between primary and backup devices, since writes must be committed to all array components before being acknowledged. However, in asynchronous mode, the component devices of a raid1 array will need to be tagged as "primary" or "backup" to ensure that the bitmap is handled correctly, and to ensure that read requests are always satisfied from the primary device. To accomplish this, we will need an additional /etc/raidtab directive to enable a device to be tagged as a "backup." Devices tagged as backups will be placed into a special "write-only" mode that exists in md.

4 Conclusion

With the recent bugfix and cleanup work that has been done, and with the upcoming additional features that are in the works, the Linux kernel md driver will finally be an enterprise-class software RAID and data replication solution: robust, and capable of being used for many different applications, from simple internal disk mirroring and striping, to LAN data replication, and even disaster recovery over a WAN.

5 Acknowledgements

We would especially like to thank Peter T. Breuer and Neil Brown for their outstanding and ongoing work in the Software RAID (md) subsystem of the Linux kernel. Without their contributions, we would not have been able to undertake such a huge endeavor.

References

[1] SteelEye Technology, Inc. SteelEye Technology Open Source Website. http://licensing.steeleye.com/open_source/

[2] Peter T. Breuer. Fast Intelligent Software RAID1 Driver. http://www.it.uc3m.es/ptb/fr1/ and http://freshmeat.net/projects/fr1/

[3] Peter T. Breuer, Neil Brown, Ingo Molnar, Paul Clements. linux-raid mailing list discussions on raid1 bitmap and asynchronous writes, Jan-Apr 2003. http://marc.theaimsgroup.com/?l=linux-raid&b=200302


Porting NSA Security Enhanced Linux to Hand-held devices

Russell Coker

http://www.coker.com.au/

Abstract

In the first part of this paper I will describe how I ported SE Linux to User-Mode-Linux and to the ARM CPU. I will focus on providing information that is useful to people who are porting to other platforms as well. In the second part I will describe the changes necessary to applications and security policy to run on small devices. This will be focussed on hand-held devices but can also be used for embedded applications such as router or firewall type devices, and any machine that has limited memory and storage.

1 Introduction

SE Linux offers significant benefits for security. It accomplishes this by adding another layer of security in addition to the default Unix permissions model. This is achieved by firstly assigning a type to every file, device, network socket, etc. Then every process has a domain, and the level of access permitted to a type is determined by the domain of the process that is attempting the access (in addition to the usual Unix permission checks). Domains may only be changed at process execution time. The domain may automatically be changed when a process is executed, based on the type of the executable program file and the domain of the process that is executing it, or a privileged process may specify the new domain for the child process.

In addition to the use of domains and types for access control, SE Linux tracks the identity of the user (which will be system_u for processes that are part of the operating system, or the Unix user-name) and the role. Each identity will have a list of roles that it is permitted to assume, and each role will have a list of domains that it may use. This gives a high level of control over the actions of a user, which is tracked through the system. When the user runs SUID or SGID programs the original identity will still be tracked, and their privileges in the SE security scheme will not change. This is very different from standard Unix permissions, where after a SUID program runs another SUID program it's impossible to determine who ran the original process. Also of note is the fact that operations that are denied by the security policy [1] have the identity of the process in question logged.

I often run SE Linux demonstration machines on the Internet which provide root access to the world and an invitation to try and break the security. [2]

For a detailed description of how SE Linux works I recommend reading the paper Peter Loscocco presented at OLS in 2001 [3].

SE Linux has been shown to provide significant security benefits for little overhead on servers, desktop workstations, and laptops.


However it has not had much use in embedded devices yet.

Some people believe that SE Linux is only needed for server systems. I think that is incorrect, and I believe that in many situations laptops and hand-held devices need more protection than servers. A server will usually have a firewall protecting it, with a small number of running applications which are well maintained and easy to upgrade. Portable computers are often used in hostile environments that servers do not experience, they have no firewalls to protect them, and often they are connected to routers operated by potentially negligent or hostile organizations.

But there are two main factors that cause an increased need for security on portable devices. One is that it is usually extremely difficult and expensive to upgrade them if a new security fix is needed. This means that in commercial use portable computers tend to never have security fixes applied. Another factor is that often the person in possession of a hand-held computer is not authorised to access all the data it contains, and may even be hostile to the owner of the machine.

Naturally, for a full security solution for portable computers a strong encryption system will need to be used for all persistent file systems. There are various methods of doing this, but all aspects of such encryption are outside the scope of this project and can be implemented independently.

2 Kernel Porting

The current stable series of SE Linux is based on the 2.4.x kernels and uses the Linux Security Modules (LSM) [4] interface. The current LSM interface has a single sys_security() system call that is used to multiplex all the system calls for all of the security modules. SE Linux uses 52 different system calls through this interface. Due to problems in porting the kernel code to some platforms (particularly those that have a mixed 32 and 64 bit memory model) the decision was made to change the LSM interface for kernel 2.6.0. The new interface will make the code fully portable and remove the painful porting work that is currently required. However I needed to have SE Linux working with the 2.4.x kernels, so I couldn't wait for kernel 2.6.0.

The main difficulty in porting the code is the system call execve_secure(), which is used to specify the security context for the new process. This calls the kernel function do_exec() to perform the execution, and do_exec() needs a pointer to the stack, thus requiring architecture specific code in the sys_execve_secure() function. The sys_security_selinux_worker() function (which determines which SE Linux system call is desired and passes the appropriate parameters to it) calls sys_execve_secure() and therefore also needs architecture specific code, and so does the main system call sys_security_selinux().

My first port of SE Linux was to User-Mode Linux [5]. This was a practice effort for the main porting work. It is quite easy to debug kernel code under UML, and as it uses the i386 system call interface I could port the kernel code without any need to port application code.

The main architecture dependent code is in the source file security/selinux/arch/i386/wrapper.c, which has code to look on the stack for the contents of particular registers. This needs to be changed for platforms with different register names, and for UML which does not permit such direct access of registers.

The solution in the case of UML was to not have a wrapper function, as the current task structure had a pointer to the stack anyway that could be used inside the sys_execve_secure() function. So I renamed the sys_security_selinux_worker() function to sys_security_selinux() for the UML port and entirely removed all reference to the wrapper. Then I moved the implementation of sys_execve_secure() into the platform specific directory and implemented a different version for each port.

This was essentially all that was required to complete the port; the core code of SE Linux was all cleanly written and could just be compiled. The only other work involved getting the Makefiles correctly configured, and adding a hook to sys_ptrace().

One thing I did differently with my port to the ARM architecture was that I removed the code to replace the system call entry. When the SE Linux kernel code loads on UML and i386 it replaces the system call with a direct call to the SE Linux code (rather than using the option for LSM to multiplex between different modules). As there is currently no support for having SE Linux be a loadable module there seems to be no benefit in this, and it seems that on ARM there will be more overhead from adding an extra level of indirection for this. So I made the SE Linux patch hard-code the SE system call into the sys-call table.

3 iPaQ Design Constraints

The Compaq/HP iPaQ [6] computers are small hand-held devices. The most powerful iPaQ machines on sale have a 400MHz ARM-based CPU that is of comparable speed to a 300MHz Intel Celeron CPU, with 64M of RAM and 48M of flash storage.

An iPaQ is not designed for memory upgrades. There are some companies that perform such upgrades, but they don't support all models, and this will void your warranty. Therefore you are stuck with a memory limit of 64M.

The flash storage in an iPaQ can only be written a limited number of times; this, combined with the small amount of storage, makes it impossible to use a swap space for virtual memory unless you purchase a special sleeve for using an external hard drive. Attaching an external hard drive such as the IBM/Hitachi Microdrive is expensive and bulky. Therefore if you have a limited budget then storage expansion (for increased file storage or swap space) is not an option.

For storing files, the 32M file system can contain quite a lot. The Familiar distribution is optimised for low overheads (no documentation or man pages) and all programs are optimised for size, not speed. Also, the JFFS2 [7] file system used by Familiar supports several compression algorithms, including the Lempel-Ziv algorithm implemented in zlib, so more than 32M of files can fit in storage.

For a system such as SE Linux to be viable on an iPaQ it has to take up a small portion of the 32M of flash storage and 64M of RAM, and not require any long CPU intensive operations.

Finally, the screen of an iPaQ only has a resolution of 240x320 and the default input device is a keyboard displayed on the screen. This makes an iPaQ unsuitable for interactive tasks that involve security contexts, as it takes too much typing to enter them and too much screen space to display them. As a strictly end-user device this does not cause any problems.

4 CPU Requirements

Benchmarks that were performed on SE Linux operational overheads in the past show that trivial system calls (reading from /dev/zero and writing to /dev/null) can take up to 33% longer to complete when SE Linux is running, but that the overhead on complex operations such as compiles is so small as to be negligible [8]. The machines that were used for such tests had similar CPU power to a modern iPaQ.

One time consuming operation related to SE Linux installation is compiling the policy (which can take over a minute depending on the size of the policy and the speed of the CPU). This however is not an issue for an iPaQ, as the policy takes over a megabyte of permanent storage and 5 megs of temporary file storage, as well as requiring many tools that are not normally installed (make, m4, the SE Linux policy compilation program checkpolicy, etc). The storage requirements make it impractical to compile policy on the iPaQ, and typical use involves configuration being developed on other machines for deployment on the iPaQ. So the time taken to compile the policy database is not relevant.

The only SE Linux operation which can take a lot of time that must be performed on an iPaQ is labeling the file system. The file system must be relabeled when SE Linux is first installed, and after an upgrade. On my iPaQ (H3900 with 400MHz X-Scale CPU) it takes 29.7 seconds of CPU time to label the root file system, which contains 2421 files. For an operation that is only performed at installation or upgrade time, 29.7 seconds is not going to cause any problems. Also, the setfiles program that is used to label the file system could be optimised to reduce that time if it was considered to be a problem.

I conclude that for typical use of a hand-held machine SE Linux only requires the CPU power of an iPaQ. In fact the CPU use is small enough that even the older iPaQ machines (which had half the CPU power) should deliver more than adequate performance.

5 Kernel Resource Use

To compare the amounts of disk space and memory I compiled three kernels. One was 2.4.19-rmk6-pxa1-hh13 with the default config for the H3900 iPaQ. One was a SE Linux version of the same kernel with the options CONFIG_SECURITY, CONFIG_SECURITY_CAPABILITIES, and CONFIG_SECURITY_SELINUX. Another was the same SE Linux kernel with development mode enabled (which slightly increases the size and memory use).

For this project I have no need for the multi-level-security (MLS) functionality of SE Linux or the options for labelled networking and extended socket calls. This optional functionality would increase the kernel size. I am focussing on evaluating the choice of whether or not to use SE Linux for specific applications; once you have decided to use SE Linux you would then need to decide whether the optional functionality provides useful benefits for your use to justify the extra disk space and memory use.

The kernel binaries are 658648 bytes for a non-SE kernel, 704708 bytes for the base SE Linux kernel, and 705560 bytes for the development mode kernel. The difference between the kernel with development mode enabled and the regular one is that the development kernel allows booting without policy loaded, and booting in permissive mode (with the policy decisions not being enforced). For most development work a kernel with development mode enabled will be used; also, for this test it allowed me to determine the resource consumption of SE Linux without a policy loaded.

To test the memory use of the different kernels I configured an iPaQ to not load any kernel modules. My test method was to boot the machine, login at the serial console, wait 30 seconds to make sure that all daemons have started, and run free to see the amount of memory that is free. This is not entirely accurate, as random factors may result in different amounts of memory usage; however this is not as significant on the Familiar distribution due to the use of devfs for device nodes and tmpfs for /var and /tmp, which means that in the normal mode of operation almost nothing is written to the root file system, so two boots will be working on almost the same data.

From the results I looked at the total field (which gives the amount of RAM that is available for user processes after the kernel has used memory in the early stages of the boot process), and the used field, which shows how much of that has been used. The kernel message log gives a break-down of RAM that is used by the kernel for code and data in the early stages of boot; however that is not of relevance to this study, only the total amount that is used matters.

The total memory available was reported as 63412k for the non-SE kernel, 63308k for the SE Linux kernel, and 63300k for the development mode kernel. So SE Linux takes 104k of kernel memory early in the boot process, and 112k if you use the development mode option.

The memory reported as used varied slightly with each boot. For the vanilla kernel the value 18256k was reported in two out of four tests, with values of 18252k and 18260k also being reported. I am taking the value 18256k as the working value, which I consider accurate to within 8k.

For a standard SE Linux kernel the amount reported as used was 19516k in three out of six tests, with the values of 19532k, 19520k, and 19524k also being returned. So I consider 19516k as the working value and the accuracy to be within 16k.

For the SE Linux kernel with development mode enabled the memory used was 19516k in three out of four tests, and the other test was 19524k. So the difference between the development mode kernel and the regular SE Linux kernel is only 8K of kernel memory in the early stages of the boot process.

Finally I did a test of a development mode kernel with no policy loaded. The purpose of this test was to determine how much memory is used on a SE Linux kernel if the SE Linux code is not loading the policy. For this, the memory reported as used was 18292k in three out of five tests, with the values of 18296k and 18300k also being returned.

Kernel            Memory used
non-SE            18256k
SE, no policy     18292k
SE, with policy   19516k

So an SE Linux kernel without policy loaded uses approximately 36K more memory after boot than a non-SE kernel, in addition to the 104k or 112k used in the early stages of boot.

With a small policy loaded (360 types and 23,386 rules for a policy file that is 583771 bytes in size) the memory used by the kernel is about 1224k for the policy and other SE Linux data structures. The policy could be reduced in size, as there are many rules which would only apply to other systems (the sample policy is quite generic and was quickly ported to the iPaQ), although there may be other areas of functionality that are desired which would use any saved space.

So it seems that when using SE Linux the memory cost is 104k when the kernel is loaded, and a further 1260k for SE Linux memory structures and policy when the boot process is complete. The total is 1364k of non-swappable kernel memory out of 64M of total RAM in an iPaQ; this is about 2% of RAM.


All tests were done with GCC 3.2.3, a modified Linux 2.4.19, and an X-Scale CPU. Different hardware, kernel versions, and GCC versions will give different results.

6 Porting Utilities

The main login program used on the Familiar [9] distribution is gpe-login, which is an xdm type program for a GUI login. This program had to be patched to check a configuration file and the security policy to determine the correct security context for the user and to launch their login shell in that context. The patch for this functionality made the binary take 4556 bytes more disk space in my build (29988 bytes for the non-SE build compared to 34544 bytes for the version with SE Linux support).

The largest porting task was to provide SE Linux support in Busybox [10]. Busybox provides a large number of essential utility programs that are linked into one program. Linking several programs into one reduces disk space consumption by spreading the overhead of process startup and termination code across many programs. On ARM it seems that the minimum size of an executable generated by GCC 3.2.3 is 2536 bytes. In the default configuration of Familiar, Busybox is used for 115 commonly used utilities; having them in one program means that the 2.5K overhead is only paid once, not 115 times. So approximately 285K of uncompressed disk space is saved by using Busybox if the only saving is from this overhead. The amount of disk space used for initialisation and termination code would probably increase the space used by more than 80% if all the applets were compiled separately (my build of Busybox for the iPaQ is 337028 bytes).

The programs in Busybox that are of most immediate note are ls, ps, id, and login. ls needs the ability to show the security contexts of files, ps needs to show the security contexts of running processes, and id needs to show the context of the current process. Also, the /bin/login applet had to be modified in the same manner as the gpe-login program. These changes resulted in the binary being 5600 bytes larger (337028 bytes for a non-SE version and 342628 bytes for the version with SE Linux support).

7 Busybox Wrappers for Domain Transition

In SE Linux different programs run in different security domains. A domain change can be brought about by using the execve_secure() system call, or it can come from an automatic domain transition. An example of an automatic domain transition is when the init process (running in the init_t domain) runs /sbin/getty, which has the type getty_exec_t, causing an automatic transition to the domain getty_t. Another example is when getty runs /bin/login, which has the type login_exec_t and causes an automatic transition to the domain local_login_t. This works well for a typical Linux machine where /sbin/getty and /bin/login are separate programs.

When using Busybox, the getty and login programs will both be sym-links to /bin/busybox, and the type of the file as used for domain transitions will be the type of /bin/busybox, which is bin_t. SE Linux does not perform domain transitions based on the type of the sym-link, and it assigns security types to the Inodes, not file names (so a file with multiple hard links will only have one type). This means that we can't have a single Busybox program automatically transitioning into the different domains.

There are several possible solutions to this problem. One possible partial solution would be to have Busybox use execve_secure() to run copies of itself in the appropriate domain. Busybox already has similar code for determining when to change UID, so that some of the Busybox applets can be effectively SETUID while others aren't. The SETUID management of Busybox requires that it be SETUID root, and involves some risk (any bug in Busybox can potentially be exploited to provide root access). Providing a similar mechanism for transitioning between SE Linux security domains would have the same security problem, whereby if you crack one of the Busybox applets you could then gain full access to any domain that it could transition to. This does not provide adequate security. Also, it would only work for transitions between privileged domains (it would not work for transitions from unprivileged domains). I did not even bother writing a test program for this case as it is not worth considering due to a lack of security and functionality.

A better option is to split the Busybox program into smaller programs so transitions can work in the regular manner. With the current range of applets that would require one program for getty, one for login, one for klogd, one for syslogd, one for mount and umount, one for insmod, rmmod, and modprobe, one for ifconfig, one for hwclock, one for all the fsck type programs, one for su, and one for ping. Of course there would also be one final build of Busybox with all the utility programs (ls, ps, etc) which run with no special privilege. To test how this would work I compiled Busybox with all the usual options apart from modutils, and I did a separate build with only support for modutils. The non-modutils build was 323236 bytes and the build with only modutils was 37764 bytes. This gave a total of 361000 bytes compared to 342628 bytes for a single image, so an extra 18372 bytes of disk space was required for such a split.

Splitting the binary in such a simple fashion would likely cost 18K for each of the eleven extra programs. If we changed the policy to have syslogd and klogd run in the same domain (and thus the same program) and have hwclock run with no special privileges (i.e., the domain that runs it needs to have access to /dev/rtc) then there would only be nine extra programs, for a cost of approximately 162K of disk space. This disk space use could be reduced by further optimisation of some of the applets; for example, in the case of ifconfig the code to check argv[0] to determine the applet name could be removed. A simple split in this manner would also make it more difficult for an attacker to make the program perform unauthorized actions. When a single program has /bin/login functionality as well as /bin/sh there is potential for a buffer overflow in the login code to trigger a jump to the shell code under control of the attacker! When the shell is a separate program that can only be entered through a domain transition it is much more difficult to use an attack on the login program to gain further access to the system.

Finally, if we have a single Busybox program that includes applets running in different domains we need to make some significant changes to the policy. The default policy has assert rules to prevent compilation of a policy that contains mistakes which may lead to security holes. For the domains getty_t, klogd_t, and syslogd_t there are assertions to prevent them from executing other programs without a domain transition, and to prevent those domains being entered through executing files of types other than the matching executable type (this requires that each of those domains have a separate executable type, i.e., that they are not all the same program). Adding policy which requires removing these assertions weakens the security of the base domains and also makes the policy tree different from the default tree which has been audited by many people.


Another way of doing this which uses less disk space is to have a wrapper program such as the following:

#include <unistd.h>
#include <string.h>

int main(int argc, char **argv, char **envp)
{
	/* ptr is the basename of the executable that is being run */
	char *ptr = strrchr(argv[0], '/');

	if (!ptr)
		ptr = argv[0];
	else
		ptr++;

	/* basename must match one of the allowed applets,
	 * otherwise it's a hacking attempt and we exit */
	if (strcmp(ptr, "insmod")
	    && strcmp(ptr, "modprobe")
	    && strcmp(ptr, "rmmod"))
		return 1;

	return execve("/bin/busybox", argv, envp);
}

This program takes 2912 bytes of disk space. The idea would be to have a copy of it named /sbin/insmod with type insmod_exec_t, which has symlinks /sbin/rmmod and /sbin/modprobe pointing to it. Then when insmod, rmmod, or modprobe is executed an automatic domain transition to the insmod_t domain will take place, and then the Busybox program will be executed in the correct context for that applet.

This option is easy to implement; one advantage is that there is no need to change the Busybox program. The fact that the entire Busybox code base is available in privileged domains is a minor weakness. Implementing this takes about 2900 bytes of disk space for each of the nine domains (or seven domains, depending on whether you have separate domains for klogd and syslogd and whether you have a domain for hwclock). It will take less than 33K or 27K of disk space (depending on the number of domains). This saves about 130K over the option of having separate binaries for implementing the functionality.

A final option is to have a single program to act as a wrapper and change domains appropriately. Such a program would run in its own domain with an automatic domain transition rule to allow it to be run from all source domains. Then it would look at its parent domain and the type of the symlink to determine the domain of the child process. For example, I want to have insmod run in domain insmod_t when run from sysadm_t. So I have an automatic transition rule to transition from sysadm_t to the domain for my wrapper (bbwrap_t). Then the wrapper determines that its parent domain is sysadm_t, determines that the type of the symlink for its argv[0] is insmod_exec_t, and asks the kernel what domain should be entered when a process in sysadm_t executes a program of type insmod_exec_t; the answer is insmod_t. So the wrapper then uses the execve_secure() system call to execute Busybox in the insmod_t domain and tell it to run the insmod applet.

I implemented a prototype program for this. For my prototype I used a configuration file to specify the domain transitions instead of asking the kernel. The resulting program was 6K in size (saving 27K of disk space over the multiple-wrapper method, and 156K of disk space over the separate programs method), although it did require some new SE Linux policy to be written, which takes a small amount of disk space and kernel memory.

One problem with this method is that it allows security decisions to be made by an application instead of the kernel.


It is preferable that only the minimum number of applications can make such security decisions. In a typical configuration of SE Linux the only such applications will be login, an X login program (in this case gpe-login), cron (which is not installed in Familiar), and newrole (the SE Linux utility for changing the security context, which operates in a similar manner to su).

The single Busybox wrapper is more of a risk than most of these other programs. The login programs are only executed by the system and can not be run by the user with any elevated privileges, which makes them less vulnerable to attack. Newrole is well audited and the domains it can transition to are limited by the kernel to only include domains that might be used for a login process (dangerous domains such as login_t are not permitted).

Due to the risks involved with a single Busybox wrapper, and the fact that the benefits of using 6K on disk instead of 33K are very small (and are further reduced by an increase in kernel memory for the larger policy), I conclude that it is a bad idea.

I conclude that the only viable methods of using Busybox on a SE Linux system are having separate wrapper programs for each domain to be entered (taking 33K of extra disk space and requiring minor policy changes), or having entirely separate programs compiled from the Busybox source for each domain (taking approximately 162K of extra disk space with no other problems). Also, with some careful optimisation the 162K of overhead could be reduced for the option of splitting the Busybox program. If 162K of disk space can be spared (which should not be a problem with a 32M file system) then splitting Busybox is the right solution.

8 Removed Functionality

A hand-held distribution doesn't require all the features that are needed on bigger machines such as servers, desktop workstations, and laptops. Therefore we can reduce the size of the SE Linux policy and the number of support programs to save disk space and memory.

For a full SE Linux installation there are wrappers for the commands useradd, userdel, usermod, groupadd, groupdel, groupmod, chfn, chsh, and vipw. These can possibly be removed as there is less need for adding, deleting, or modifying users or groups on a hand-held device in the field. These programs would take 27K of disk space if they were included.

A default installation of Familiar does not include support for /etc/shadow, and therefore there is no need for the wrapper programs for the administrator to modify users' accounts. However I think that the right solution here is to add /etc/shadow support to Familiar rather than removing functionality from SE Linux. This will slightly increase the size of the login programs.

In a full install of SE Linux there are programs chsid and chcon to allow changing the security type of files. These are of less importance for a small device. There will be fewer types available, and the effort of typing in long names of security contexts will be unbearable on a touch-screen input device. A hand-held device has to be configured to not require changing the contexts of files, and therefore these programs can be removed.

In the Debian distribution there is support for installing packages on a live server and having the security contexts automatically assigned to the files. As iPaQs are used in a different environment I believe that there is less need for such upgrades, and such support could optionally be removed to save disk space.


I have not yet written the code for this, but I estimate it to be about 100K.

The default policy for SE Linux has separate domains for loading policy and for policy compilation. On the iPaQ we can't compile policy due to not having tools such as m4 and make, so we can skip the compilation program and its policy. Also the policy for a special domain for loading new policy is not needed, as the system administration domain sysadm_t can be used for this purpose. It is possible to even save 3500 bytes of disk space by not including the program to load the policy (a reboot will cause the new policy to take effect).

A server configuration of SE Linux (or a full workstation configuration) includes the run_init program to start daemons in the correct security context. On a typical install of Familiar there are only three daemons: a program to manage X logins, a daemon to manage bluetooth connections, and the PCMCIA cardmgr daemon. For restarting these daemons it should be acceptable to reboot the iPaQ, so run_init is not needed.

9 Disk Space and RAM Use

In the section on kernel resource usage I determined that the kernel was using 1364K of RAM for SE Linux, with a 583771 byte policy comprising 23,386 rules loaded. Since the time that I performed those tests I reduced the policy to 455,422 bytes and 18,141 rules, which would reduce the kernel memory use. I did not do any further tests as it is likely that I will add new functionality which uses the memory I have freed. So I can expect that 1.3M of kernel memory is taken by SE Linux.

The SE Linux policy that is loaded by the kernel takes 67K on disk when compressed. The file_contexts file (which specifies the security contexts of files for the initial installation and for upgrades) takes 24K. The kernel binary takes 64K more disk space for the SE Linux kernel. So the kernel code and SE Linux configuration data takes 156K of disk space (most of which is compressed data).

The program setfiles is needed to apply the file_contexts data to the file system. Setfiles takes 20K of disk space. The file_contexts file could be reduced in size to 1K if necessary to save extra disk space, but in my current implementation it can not be removed entirely. In Familiar a large number of important system directories (such as /var) are on a ramfs file system, and I am using setfiles to label /mnt/ramfs. So far it has not seemed beneficial to have a small file_contexts file for booting the system and an optional larger one for use when installing new packages or upgrading, but this is an option to save 23K. Another option would be to write a separate program that hard-codes the security contexts for the ramfs. It would be smaller than setfiles and would not require a file_contexts file, thus saving 30K or more of disk space. Currently this has not seemed worth implementing as I am still in a prototype phase, but it would not be a difficult task. Also, if such a program was written then the next step would be to use a jffs2 loop-back mount to label the root file system on a server before installation to the iPaQ (so that setfiles never needs to run on the iPaQ).

The patches for the gpe-login and busybox programs to provide SE Linux login support and modified ls, ps, and id programs cause the binaries to take a total of 10K extra disk space.

Splitting Busybox into separate programs for each domain will take an estimated 162K of disk space.

The total of this is approximately 348K of additional disk space for a minimal installation of SE Linux on an iPaQ.


Adding support for /etc/shadow and other desirable features may increase that to as much as 450K depending on the features chosen. However, if you use multiple Busybox wrappers instead of splitting Busybox then the disk space for SE Linux could be reduced to less than 213K. If you then replaced setfiles for the system boot labeling of the ramfs then it could be reduced to 190K.

10 Conclusion

Security Enhanced Linux on a hand-held device can consume less than 1.3M of RAM and less than 400K of disk space (or less than 200K if you really squeeze things). While the memory use is larger than I had hoped it is within a bearable range, and it could potentially be reduced by changing the kernel code to optimise for reduced memory use. The disk space usage is trivial and I don't think it is a concern.

I believe that the benefits of reducing repair and maintenance problems with hand-held devices that are deployed in the field through better security outweigh the disadvantage of increased memory use for many applications.

All source code and security policy code related to this article will be on my website [11].

References

[1] Configuring the SELinux Policy. Stephen D. Smalley, NAI Labs. http://www.nsa.gov/selinux/policy2-abs.html

[2] Details of SE Linux test machine, http://www.coker.com.au/selinux/play.html

[3] Meeting Critical Security Objectives with Security-Enhanced Linux. Peter A. Loscocco, NSA; Stephen D. Smalley, NAI Labs. http://www.nsa.gov/selinux/ottawa01-abs.html

[4] Linux Security Modules, http://lsm.immunix.org/

[5] User-Mode Linux, http://sourceforge.net/projects/user-mode-linux/

[6] HP Site for iPaQ Information, http://whp-sp-orig.extweb.hp.com/country/us/eng/prodserv/handheld.html/

[7] Journalled Flash File System 2, http://sources.redhat.com/jffs2/

[8] Integrating Flexible Support for Security Policies into the Linux Operating System. Peter A. Loscocco, NSA; Stephen D. Smalley, NAI Labs. http://www.nsa.gov/selinux/freenix01-abs.html

[9] Familiar Linux distribution for hand-held devices, http://familiar.handhelds.org/

[10] Busybox - Swiss Army Knife of Embedded Linux, http://busybox.net/

[11] My SE Linux Web Pages, http://www.coker.com.au/selinux/


Strong Cryptography in the Linux Kernel
Discussion of the past, present, and future of strong cryptography in the Linux kernel

Jean-Luc Cooke, CertainKey Inc.
[email protected]

David Bryson, Tsumego
[email protected]
PGP: 0x74B61620

Abstract

In 2.5, strong cryptography has been incorporated into the kernel. This inclusion was a result of several motivating factors: remove duplicated code, harmonize IPv6/IPSec, and the usual crypto-paranoia. The authors will present the history of the Cryptographic API, its current state, what kernel facilities are currently using it, which ones should be using it, plus the new future applications including:

1. Hardware and assembly crypto drivers

2. Kernel module code-signing

3. Hardware random number generation

4. Filesystem encryption, including swap space.

1 History of Cryptography inside the kernel

The Cryptographic API came about from two somewhat independent projects: the international crypto patch, last maintained by Herbert Valerio Riedel, and the requirements for IPv6.

The international crypto patch (or 'kerneli') was written by Alexander Kjeldaas and intended for filesystem encryption; it has grown to also optionally replace duplicated code in the UNIX random character device (/dev/*random). This functionality could not be incorporated into the main line kernel at the time because kernel.org was hosted in a nation with repressive cryptography export regulations. These regulations have since been relaxed to permit open source cryptographic software to travel freely from kernel.org's locality.

The 2.5 kernel, at branch time, did not include any built-in cryptography. But with the advent of IPv6 the killer feature of kernel space cryptography has shown itself. The IPv6 specification contains a packet encryption industry standard for virtual private network (VPN) technology. The 2.5 kernel was targeted to have a full IPv6 stack, and this included packet encryption. The IPv6 and kernel maintainers in their infinite wisdom (!) saw an opportunity to remove duplicated code and encouraged the kerneli.org people to play with others.

And so, strong cryptography is now at the disposal of any 2.5+ kernel hacker.

2 Why bring cryptography into our precious kernel?

Cryptography, in one form or another, has existed in the main line kernel for many versions.


The introduction of the random device driver by Theodore Ts'o integrated two well known cryptographic (digest) algorithms, MD5 and SHA-1. Other forms of cryptography were introduced with the loopback driver (also written by Theodore Ts'o); these included an XOR and DES implementation for primitive filesystem encryption.

The introduction of cryptography for filesystem encryption, coupled with the kerneli patches, allowed users to hook the loopback device up to a cipher of their choosing, thus providing a solution for secure hard disk storage on Linux.

With the advent of IPSec the introduction of crypto into the kernel makes setting up encrypted IP connections extremely easy. Previous implementations have used userspace hooks and required complicated configuration to set up properly. With IPSec being inside the kernel much of those tasks can be automated.

More advanced features for cryptography in the kernel will be explained throughout this paper.

2.1 Example Code

The use of the API is quite simple and straightforward. The following lines of code show a basic use of the MD5 hash algorithm on a scatterlist.

#include <linux/crypto.h>

struct scatterlist sg[2];
char result[128];
struct crypto_tfm *tfm;

tfm = crypto_alloc_tfm("md5", 0);
if (tfm == NULL)
    fail();

/* copy data into the scatterlists */

crypto_digest_init(tfm);
crypto_digest_update(tfm, &sg, 2);
crypto_digest_final(tfm, result);

crypto_free_tfm(tfm);

Ciphers are implemented in a similar fashion but must set a key value (naturally) before doing any encryption or decryption operations.

#include <linux/crypto.h>

int len;
char key[8];
char result[64];
struct crypto_tfm *tfm;
struct scatterlist sg[2];

tfm = crypto_alloc_tfm("des", 0);
if (tfm == NULL)
    fail();

/* place key data into key[] */
crypto_cipher_setkey(tfm, key, 8);

/* copy data into scatterlists */

/* do in-place encryption */
crypto_cipher_encrypt(tfm, sg[0], sg[0], len);

crypto_free_tfm(tfm);

The encryption and decryption functions are capable of doing in-place operations as well as in/out (separate source and destination) operations. This example shows in-place operation. By changing the encrypt line to:

crypto_cipher_encrypt(tfm, sg[0], sg[1], len);

the code then becomes an in/out operation.

3 Kernel module code-signing

Signing of kernel modules has been a desired addition to the kernel for a long time.


Many people have attempted to do some kind of authenticated kernel module signing/encryption, but usually by means of an external user-mode program. With the movement of the module loader into the kernel in the 2.5 series a truly secure module loader is possible. The authors would like to propose a method for trusted module loading.

To create the secure module structure we need a way of designating a module as trusted. At compile time, a token can be created for each module. The token contains two identifiers:

• A time stamp token, denoting module creation time.

• A secure hash of the module in its compiled state.

After these tokens are created they are encrypted by an internal private key (protected by a separate password, of course) bound to the kernel. The encrypted result is then stored in a file on the local disk.

Loading of the module occurs as follows.

1. A request to load module rot13 is made by the system.

2. The kernel reads the encrypted file for module rot13.

3. Using the kernel's public key the file is decrypted, and the tokens are placed in memory.

4. A hash is computed against the file on resident disk of module rot13 and compared against the signed token.

5. If the hashes are equal the module is trusted and the code is loaded into memory.

This allows for a large degree of flexibility. Anybody on the system who has access to the kernel's public key can verify the validity of the modules. Plus, the kernel does not need to have the private key in memory to authenticate, since the public key can do the decryption, thus reducing the time that the private key is stored in resident memory unencrypted.

However this approach can only protect a system to a point. If a malicious user is on your system and is at the stage where they can load modules (root access), this will only slow them down. Nothing prevents them from compiling a new kernel with a 'dummy' module loader that skips this check (solutions to this problem welcome!).

This system requires that the kernel contain functionality to support arbitrarily large integers and perform asymmetric cryptography operations. Currently, there is preliminary code that supports this functionality, but it has yet to be formally tested or introduced to the community.

4 Cryptographic Hardware and Assembly Code

A new exciting aspect of cryptography in the kernel is the ability to use hardware based cryptographic accelerators. Many vendors offer a variety of solutions for speeding up cryptographic routines for symmetric and asymmetric operations.

The chips provide cheap, efficient, and fast cryptographic processors. These can be purchased for as little as $80.00 USD and offer a considerable speedup for the algorithms they support. The proposed method of integrating hardware and assembly is to have the chip or card register its capabilities with the API.

This way the API can serve as a front end to the hardware driver module.


Figure 1: The proposed hardware interface model

Instead of the hardware registering "aes", it would register "aes-<chipset name>" with the API. Calls to the API can then specify which implementations of the ciphers are desired, depending on what performance is needed.

As of this writing (May 2003) there is also no way to query the API for the fastest method it has for computing a cipher. An asynchronous system for dynamically receiving requests and distributing them to the various pieces of hardware (or software) present on the system is in the design stage. The OpenBSD API and cryptography sub-system is being used as a reference model.

This method would allow users of the API to send queries to a scheduler with a call similar to the current interface, but using the cipher name aes_fastest or aes_hardware. The scheduler then sends the requested command to a piece of hardware that is waiting for requests and fulfills the requested hardware requirements.

Drivers that are currently finished and/or under development include:

• Hifn series processors 7751 and 7951.

• Motorola MPC190 and MPC184 series.

• Broadcom BCM582[0|1|3] series.

4.1 Hardware Random Number Generation

Most cryptographic hardware, and lately some motherboards, have been including a hardware random number generation chip. These are a wonderful source of entropy because they are both fast and produce very random data. A set of free-running oscillators usually generates the data. The oscillators' frequencies drift relative to each other and to the internal clock of the chipset, producing randomly flipped bits in a specified 'RNG' register.

Random number generation in the kernel uses interrupts from the keyboard, mouse, hard disk, and network interfaces to create an entropy pool. This entropy pool produces the output of /dev/random and the less secure /dev/urandom.

The current interface is missing a way to add random data from an arbitrary hardware source. By tying the random driver into the Cryptographic API the random driver can gain both extra sources of entropy and acceleration from making its MD5 and SHA-1 functions available for hardware execution. The result would be a faster and better entropy pool for random data generation.

5 Filesystem encryption

By far, the cryptographic API has the largest user-base with filesystem encryption. Several distributions have shipped with support for filesystem loopback encryption for over a year. Let us take a moment to explore the details of filesystem encryption. When a write or read request occurs in the kernel, the information destined for a device passes through the VFS layer, then down through the device driver layer and onto the physical media.

Figure 2: A diagram of the Linux VFS

Looking for a place to encrypt the data isn't easy. We could intercept the information at the VFS layer, but the result is encrypted data with plaintext metadata, giving an attacker an edge, for example being able to track down your /tmp/.SCO_source_code directory and begin attacking the encrypted data there. The next place to intercept the plaintext data would be at the filesystem level. But writing per-filesystem hooks to encrypt both filesystem data and filesystem metadata would be a nightmare to implement, not to mention a horrible design decision. So the only place left is somewhere outside of the filesystem code, but before the data is passed to the device driver for the media. Enter loopback drivers.

The loopback device driver in Linux allows us to send the data (plaintext) and metadata (plaintext) through a layer of memory copying before it is written to a device. Here is where the encryption will be done; this way all data written to the device can be encrypted instead of just the filesystem data.

Figure 3: A diagram of the loopback encryption layer

Using the loopback driver adds a level of flexibility. Users can have their home directories stored as large encrypted files on the primary drive. These would then be loaded via the cryptographic loopback driver upon login and unmounted when the user exits all sessions.

5.1 Swap memory Encryption

Encryption of swap is a difficult problem to approach. Any system in which the filesystem data is encrypted has the chance of the data being moved out to swap memory when the OS gets low on RAM. This can easily be solved by 'locking' all memory into RAM with the mlock() function and not allowing it to be swapped to physical media. However, this vastly reduces the usability of the system (Linux tends to kill processes when out of memory).


In the past, Linux has implemented encrypted swap with a loopback device running through the swap accesses. This approach works, but is slow and cumbersome to implement.

What is needed is a policy for encrypting all pages swapped to disk. The OpenBSD community has had a similar policy for a long time and feels that the performance loss (roughly 2.5 times longer to write a page to disk) is worth the added security.

The 2.4 series crypto patch (not part of the mainline kernel), named the "International Kernel Cryptography Patch," included a loopback encryption driver. The driver had limited features but did the job of encrypting data fairly well. In the 2.5 series driver there have been some performance improvements, like multi-threaded (and SMP) reading from loopback devices, and code readability improvements.

6 Userspace Access

Access to the API is not currently possible from user space. Discussions over how to implement this have come up with a variety of proposals. The current direction is to have a device provide access to the API via ioctls.

Compatibility with the already mature OpenBSD API has been suggested. This would decrease application porting time to almost nothing.

7 Final Comments

Thus far the 2.5 Cryptographic API has been under constant development, with new ciphers and functionality added since its inclusion in the kernel. The API is young and has promising plans to expand; hopefully the authors of this paper have given you an adequate intro to the capabilities of the API.



Porting drivers to the 2.5 kernel

Jonathan Corbet, LWN.net

[email protected]

Abstract

The 2.5 development series has brought with it the usual large set of changes to the internal driver API. The end result is a kernel that is far more pleasant to program for, and a more robust and reliable system. The cost of all these changes, of course, is that kernel code—including device drivers—must be updated to work under the new regime. This paper will give an overview of what has changed in the internal kernel API, why the changes were made, and what must be done to make drivers work again. Some familiarity with kernel programming is assumed.

1 Introduction

2.5 was a busy time for kernel developers. Much work was done to make kernel code more reliable and less susceptible to common problems. There has also been a large emphasis on improved performance on high-end systems. The end result is that almost no part of the kernel—and almost no internal API—was left untouched. The degree of change varies from relatively small (for network drivers, for example) to extreme (block drivers). An awareness of these changes is helpful for anybody who is interested in how kernel development is proceeding, and crucial for anybody who must make code work with the new kernel.

This paper will start with the basic changes which affect all drivers: module loading, memory allocation, etc. Later sections will get into the more advanced topics, including the block layer and memory management. Space constraints make it impossible to get into much detail here. A series of documents can be found at the web site listed at the end of this paper; those documents explore the topics found below in much greater depth, and will be kept current as the kernel evolves.

2 Loadable modules

The module loader was completely replaced in 2.5; the new implementation works almost entirely within the kernel. Interestingly, moving the module loader into the kernel resulted in a net reduction in kernel code. This development has forced a few changes in how modules work, however.

2.1 Hello world

The obvious place to start is the classic "hello world" program, which, in this context, is implemented as a kernel module. The 2.4 version of this module looked like:

#define MODULE
#include <linux/module.h>
#include <linux/kernel.h>

int init_module(void)
{
    printk(KERN_INFO "Hello, world\n");
    return 0;
}

void cleanup_module(void)


{
    printk(KERN_INFO "Goodbye cruel world\n");
}

One would not expect that something this simple and useless would require much in the way of changes, but, in fact, this module will not quite work in a 2.5 kernel. So what do we have to do to fix it up?

The first change is relatively insignificant; the first line:

#define MODULE

is no longer necessary, since the kernel build system defines it for you.

The biggest problem with this module, however, is that you have to explicitly declare your initialization and cleanup functions with module_init and module_exit, which are found in <linux/init.h>. You really should have done that for 2.4 as well, but you could get away without it as long as you used the names init_module and cleanup_module. You can still get away with it, but the new module code broke this way of doing things once, and could do so again. It's time to bite the bullet and do things right.

With these changes, "hello world" now looks like:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

static int hello_init(void)
{
    printk(KERN_ALERT "Hello, world\n");
    return 0;
}

static void hello_exit(void)
{
    printk(KERN_ALERT "Goodbye, cruel world\n");
}

module_init(hello_init);
module_exit(hello_exit);

This module will now work; the "Hello, world" message shows up in the system log file. At least, once you have succeeded in building the module properly...

2.2 Module compilation

One result of the module changes (combined with significant changes in the kernel build mechanism) is that compiling loadable modules has gotten a bit more complicated. In the 2.4 days, a makefile for an external module could be put together in just about any old way; the job of creating a loadable module was handled in a single, simple compilation step. All you really needed was a handy set of kernel headers to compile against.

With the 2.5 kernel, you still need those headers. You also, however, need a configured kernel source tree and a set of makefile rules describing how modules are built. All this is required because the new module loader needs some additional symbols defined at compilation time; because all modules must now go through a linking step (even single-file modules); and because the new modversions implementation requires a separate processing step.

One could certainly, with some effort, write a new, standalone makefile which would handle the above issues. But that solution, along with being a pain, is also brittle; as soon as the module build process changes again, the makefile will break. Eventually that process will stabilize, but, for a while, further changes are almost guaranteed.

So, now that you are convinced that you want to use the kernel build system for external modules, how is that to be done? The first step is to learn how kernel makefiles work in general; makefiles.txt from a recent kernel's Documentation/kbuild directory is recommended reading. The makefile magic needed for a simple kernel module is minimal, however. In fact, for a single-file module, a single-line makefile will suffice:


obj-m := module.o

(where module is replaced with the actual name of the resulting module, of course). The kernel build system, on seeing that declaration, will compile module.o from module.c, link it with vermagic.o from the kernel tree, and leave the result in module.ko, which can then be loaded into the kernel.

A multi-file module is almost as easy:

obj-m := module.o
module-objs := file1.o file2.o

In this case, file1.c and file2.c will be compiled, then linked into module.ko.

Of course, all this assumes that you can get the kernel build system to read and deal with your makefile. The magic command to make that happen is something like the following:

make -C /usr/src/linux SUBDIRS=$PWD modules

Where /usr/src/linux is the path to the source directory for the target kernel. This command causes make to head over to the kernel source to find the top-level makefile; it then moves back to the original directory to build the module of interest.

Of course, typing that command could get tiresome after a while. A trick posted by Gerd Knorr can make things a little easier, though. By looking for a symbol defined by the kernel build process, a makefile can determine whether it has been read directly, or by way of the kernel build system. So the following will build a module against the source for the currently running kernel:

ifneq ($(KERNELRELEASE),)
obj-m := module.o
else
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

default:
	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
endif

Now a simple "make" will suffice. The makefile will be read twice; the first time it will simply invoke the kernel build system, while the actual work will get done in the second pass. A makefile written in this way is simple, and it should be robust with regard to kernel build changes.

2.3 Module parameters

The old MODULE_PARM macro, which used to specify parameters which can be passed to the module at load time, is no more. The new parameter declaration scheme adds type safety and new functionality, but at the cost of breaking compatibility with older modules.

Modules with parameters should now include <linux/moduleparam.h> explicitly. Parameters are then declared with module_param:

module_param(name, type, perm);

Where name is the name of the parameter (and of the variable holding its value), type is its type, and perm is the permissions to be applied to that parameter's sysfs entry. The type parameter can be one of byte, short, ushort, int, uint, long, ulong, charp, bool or invbool. That type will be verified during compilation, so it is no longer possible to create confusion by declaring module parameters with mismatched types. The plan is for module parameters to appear automatically in sysfs, but that feature had not been implemented as of 2.5.69; for now, the safest alternative is to set perm to zero, which means "no sysfs entry."

If the name of the parameter as seen outside the module differs from the name of the variable used to hold the parameter's value, a variant on module_param may be used:

module_param_named(name, value, type, perm);

Where name is the externally-visible name and value is the internal variable.

String parameters will normally be declared with the charp type; the associated variable is a char pointer which will be set to the parameter's value. If you need to have a string value copied directly into a char array, declare it as:

module_param_string(name, string, len, perm);

Usually, len is best specified as sizeof(string).
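As a concrete illustration, here is a minimal sketch putting the three declaration forms together; the parameter names (debug, device, label) and the zero permission values are hypothetical choices for the example, not taken from any real driver:

#include <linux/module.h>
#include <linux/moduleparam.h>

static int debug = 0;              /* simple integer parameter */
static char *devname = "ttyS0";    /* string parameter; pointer set to the value given */
static char label[16] = "none";    /* string copied directly into this array */

/* visible as "debug", "device", and "label" at module load time */
module_param(debug, int, 0);
module_param_named(device, devname, charp, 0);
module_param_string(label, label, sizeof(label), 0);

With declarations like these, the module could then be loaded with something along the lines of insmod module.ko debug=1 device=ttyS1 label=test.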

2.4 The module use count

In 2.4 and prior kernels, modules maintained their "use count" with macros like MOD_INC_USE_COUNT. The use count, of course, is intended to prevent modules from being unloaded while they are being used. This method was always somewhat error prone, especially when the use count was manipulated inside the module itself. In the 2.5 kernel, reference counting is handled differently.

The only safe way to manipulate the count of references to a module is outside of the module's code. Otherwise, there will always be times when the kernel is executing within the module, but the reference count is zero. So this work has been moved outside of the modules, and life is generally easier for module authors.

Any code which wishes to call into a module (or use some other module resource) must first attempt to increment that module's reference count:

int try_module_get(&module);

It is also necessary to look at the return value; a zero return means that the try failed (perhaps the module is being unloaded), and the module should not be used.

A reference to a module can be released with module_put().

Again, modules will not normally have to manage their own reference counts. The only exception may be if a module provides a reference to an internal data structure or function that is not accounted for otherwise. In that (rare) case, a module could conceivably call try_module_get() on itself.
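In code, the pattern described above might look roughly like the following sketch; the my_ops structure and its owner field are hypothetical stand-ins for whatever registration structure a real subsystem would use:

#include <linux/module.h>
#include <linux/errno.h>

/* hypothetical operations structure registered by some other module */
struct my_ops {
    struct module *owner;          /* module providing these operations */
    int (*do_something)(void);
};

static int call_into_provider(struct my_ops *ops)
{
    int ret;

    /* take a reference; this fails if the module is on its way out */
    if (!try_module_get(ops->owner))
        return -ENODEV;

    ret = ops->do_something();

    /* drop the reference so the module can be unloaded again */
    module_put(ops->owner);
    return ret;
}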

2.5 Exporting symbols

In 2.5, module symbols are not exported by default. Chances are that change will cause few problems. When you get a chance, however, you can remove EXPORT_NO_SYMBOLS lines from your module source. Exporting no symbols is now the default, so EXPORT_NO_SYMBOLS is a no-op.

The 2.4 inter_module_ functions have been deprecated as unsafe. The symbol_get() function exists for the cases when normal symbol linking does not work well enough. Its use requires setting up weak references at compile time, and is beyond the scope of this document.

2.6 Kernel version checking

2.4 and prior kernels would include, in each module, a string containing the version of the kernel that the module was compiled against. Normally, modules would not be loaded if the compile version failed to match the running kernel.

In 2.5, things still work mostly that way. The kernel version is loaded into a separate, "link-once" ELF section, however, rather than being a visible variable within the module itself. As a result, multi-file modules no longer need to define __NO_VERSION__ before including <linux/module.h>.

The new "version magic" scheme also records other information, including the compiler version, SMP status, and preempt status; it is thus able to catch more incompatible situations than the old scheme did.

3 The device model

One of the more significant changes in the 2.5 development series is the creation of the integrated device model. The device model was originally intended to make power management tasks easier through the maintenance of a representation of the host system's hardware structure. A certain amount of mission creep has occurred, however, and the device model is now closely tied into a number of device management tasks—and other kernel functions as well.

The device model presents a bit of a steep learning curve when first encountered. The fact that the whole thing is still (as of 2.5.69) in a state of fairly serious flux doesn't help, especially considering that the documentation is, in many cases, a few revisions behind the actual code. But the underlying concepts are not that hard to understand, and driver programmers will benefit from a grasp of what's going on.

The fundamental task of the driver model is to maintain a set of internal data structures which reflect the architecture and state of the underlying system. Among other things, the driver model tracks:

• Which devices exist in the system, what power state they are in, what bus they are attached to, and which driver is responsible for them.

• The bus structure of the system; which buses are connected to which others (i.e., a USB controller can be plugged into a PCI bus), which devices each bus can potentially support (along with associated drivers), and which devices actually exist.

• The device drivers known to the system, which devices they can support, and which bus type they know about.

• What kinds of devices ("classes") exist, and which real devices of each class are connected. The driver model can thus answer questions like "where is the mouse (or mice) on this system?" without the need to worry about how the mouse might be physically connected.

• And many other things.

Underneath it all, the driver model works by tracking system configuration changes (hardware and software) and maintaining a complex "web woven by a spider on drugs" data structure to represent it all.

Many driver programmers will be able to get away with ignoring the device model altogether; most of the gory details are handled at the bus level. There are times, however, when an understanding of what's going on can be useful. A full discussion of the device model would require a talk of its own (indeed, there are two on the OLS schedule); suffice to say, for now, that details can be found on the web site.

4 Support interfaces

Now that we know how to compile a module, it's time to look at the various other changes it will need to be adapted for. We'll start with various low-level support interfaces used by many or most drivers (and other modules).

4.1 Memory allocation

The 2.5 development series has brought relatively few changes to the way device drivers will allocate and manage memory. In fact, most drivers should work with no changes in this regard. There are a few improvements that have been made, however, that are worth a mention. These include some changes to page allocation, and the new "mempool" interface.

4.1.1 Allocation flags

The old <linux/malloc.h> include file is gone; it is now necessary to include <linux/slab.h> instead.

The GFP_BUFFER allocation flag is gone (it was actually removed in 2.4.6). That will bother few people, since almost nobody used it. For reference, here is the full set of 2.5 allocation flags, from the most restrictive to the least:

GFP_ATOMIC: a high-priority allocation which will not sleep; this is the flag to use in interrupt handlers and other non-blocking situations.

GFP_NOIO: blocking is possible, but no I/O will be performed.

GFP_NOFS: no filesystem operations will be performed.

GFP_KERNEL: a regular, blocking allocation.

GFP_USER: a blocking allocation for user-space pages.

GFP_HIGHUSER: for allocating user-space pages where high memory may be used.

The __GFP_DMA and __GFP_HIGHMEM flags still exist and may be added to the above to direct an allocation to a particular memory zone. In addition, 2.5.69 added some new modifiers:

__GFP_REPEAT: This flag tells the page allocator to "try harder," repeating failed allocation attempts if need be. Allocations can still fail, but failure should be less likely.

__GFP_NOFAIL: Try even harder; allocations with this flag must not fail. Needless to say, such an allocation could take a long time to satisfy.

__GFP_NORETRY: Failed allocations should not be retried; instead, a failure status will be returned to the caller immediately.

The __GFP_NOFAIL flag is sure to be tempting to programmers who would rather not code failure paths, but that temptation should be resisted most of the time. Only allocations which truly cannot be allowed to fail should use this flag.
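As a small illustrative sketch (not from the original text), an allocation helper might choose between these flags like this; the function name and the in_interrupt_context argument are invented for the example:

#include <linux/slab.h>

static void *grab_buffer(size_t size, int in_interrupt_context)
{
    if (in_interrupt_context)
        /* must not sleep: use the high-priority, non-blocking flag */
        return kmalloc(size, GFP_ATOMIC);

    /* normal blocking allocation, asking the allocator to retry
     * on failure (the __GFP_REPEAT modifier added in 2.5.69) */
    return kmalloc(size, GFP_KERNEL | __GFP_REPEAT);
}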

4.1.2 Page-level allocation

For page-level allocations, the alloc_pages() and get_free_page() functions (and variants) exist as always. They are now defined in <linux/gfp.h>. There are a few new ones as well. On NUMA systems, the allocator will do its best to allocate pages on the same node as the caller. To explicitly allocate pages on a different NUMA node, use:


struct page *alloc_pages_node(int node_id,
                              unsigned int gfp_mask,
                              unsigned int order);

4.1.3 vmalloc_to_page()

Occasionally, it is necessary to find a struct page pointer for a page obtained from vmalloc(); usually this need arises in the implementation of nopage() methods. In the past, a driver had to walk through the page tables to find this pointer. As of 2.5.5 (and 2.4.19), however, all that is needed is a call to:

struct page *vmalloc_to_page(void *address);

This call is not a variant of vmalloc()—it allocates no memory. It simply returns a pointer to the struct page associated with an address obtained from vmalloc().
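For example, a driver that keeps a vmalloc()ed buffer around might look up the page behind one of its addresses as in the following sketch; the helper function and its arguments are hypothetical illustrations, not code from any real driver:

#include <linux/vmalloc.h>
#include <linux/mm.h>

/* vbuf is assumed to have come from vmalloc(); the returned struct page
 * could then be handed back from a nopage() method, for instance */
static struct page *page_for_offset(void *vbuf, unsigned long offset)
{
    return vmalloc_to_page((char *)vbuf + offset);
}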

4.1.4 Memory pools

Memory pools were one of the very first changes in the 2.5 series—they were added to 2.5.1 to support the new block I/O layer. The purpose of mempools is to help out in situations where a memory allocation must succeed, but sleeping is not an option. To that end, mempools pre-allocate a pool of memory and reserve it until it is needed. Mempools make life easier in some situations, but they should be used with restraint; each mempool takes a chunk of kernel memory out of circulation and raises the minimum amount of memory the kernel needs to run effectively.

A full discussion of mempools doesn't fit into this document; see the web site for details on their use.
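To give a flavour of the interface, here is a minimal sketch of setting up a mempool backed by a slab cache, assuming the 2.5 mempool interface (mempool_create() with the mempool_alloc_slab/mempool_free_slab helpers); the structure, cache and pool names, and the reserve of 16 objects, are arbitrary choices for the example:

#include <linux/slab.h>
#include <linux/mempool.h>

struct my_io_unit {
    int data[32];                  /* hypothetical per-request state */
};

static kmem_cache_t *my_cache;
static mempool_t *my_pool;

static int my_pool_setup(void)
{
    my_cache = kmem_cache_create("my_io_unit",
                                 sizeof(struct my_io_unit),
                                 0, 0, NULL, NULL);
    if (!my_cache)
        return -ENOMEM;

    /* keep at least 16 objects in reserve for low-memory situations */
    my_pool = mempool_create(16, mempool_alloc_slab,
                             mempool_free_slab, my_cache);
    if (!my_pool) {
        kmem_cache_destroy(my_cache);
        return -ENOMEM;
    }
    return 0;
}

Objects would then be obtained with mempool_alloc(my_pool, GFP_NOIO) and returned with mempool_free().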

4.2 Per-CPU variables

The 2.5 kernel makes extensive use of per-CPU data—arrays containing one object for each processor on the system. Per-CPU variables are not suitable for every task, but, in situations where they can be used, they do offer a couple of advantages:

• Per-CPU variables have fewer locking requirements since they are (normally) only accessed by a single processor.

• Restricting each processor to its own area eliminates cache line bouncing and improves performance.

Examples of per-CPU data in the 2.5 kernel include lists of buffer heads, lists of hot and cold pages, various kernel and networking statistics (which are occasionally summed together into the full system values), timer queues, and so on. There are currently no drivers using per-CPU values, but some applications (i.e., networking statistics for high-bandwidth adapters) might benefit from their use. See the web site listed at the end of this paper for a full description of how to use per-CPU data.
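A brief sketch of the statistics case mentioned above might look like the following; it assumes the DEFINE_PER_CPU()/get_cpu_var() interface present in late 2.5 kernels, and the counter name is invented for the example:

#include <linux/percpu.h>
#include <linux/smp.h>

/* one receive-packet counter per processor */
static DEFINE_PER_CPU(unsigned long, rx_packets);

static void count_rx_packet(void)
{
    /* get_cpu_var() disables preemption while we touch our copy */
    get_cpu_var(rx_packets)++;
    put_cpu_var(rx_packets);
}

static unsigned long total_rx_packets(void)
{
    unsigned long sum = 0;
    int cpu;

    /* sum the per-processor counters into a full system value */
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_online(cpu))
            sum += per_cpu(rx_packets, cpu);
    return sum;
}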

4.3 Timekeeping

One might be tempted to think that the basic task of keeping track of the time would not change that much from one kernel to the next. And, in fact, most kernel code which worries about times (and time intervals) will likely work unchanged in the 2.5 series. Code which gets into the details of how the kernel manages time may well need to adapt to some changes, however.


4.3.1 Internal clock frequency

One change which shouldn't be problematic for most code is the change in the internal clock rate on the x86 architecture. In previous kernels, HZ was 100; in 2.5 it has been bumped up to 1000. If your code makes any assumptions about what HZ really was (or, by extension, what jiffies really signified), you may have to make some changes now.

4.3.2 Kernel time variables

With a 1KHz clock, a 32-bit jiffies will overflow in just under 50 days, leading to occasional problems. So the 2.5 kernel has a new counter called jiffies_64. With 64 bits to work with, jiffies_64 will not wrap around in a time frame that need concern most of us—at least until some future kernel starts using a petahertz internal clock.

For what it's worth, on most architectures, the classic, 32-bit jiffies variable is now just the least significant half of jiffies_64. Note that, on 32-bit systems, a 64-bit jiffies value raises concurrency issues. It is deliberately not declared as a volatile value (for performance reasons), so the possibility exists that code like:

u64 my_time = jiffies_64;

could get an inconsistent version of the variable, where the top and bottom halves do not match. To avoid this possibility, code accessing jiffies_64 should use xtime_lock, which is the new seqlock type as of 2.5.60. In most cases, though, it will be easier to just use the convenience function provided by the kernel:

#include <linux/jiffies.h>
u64 my_time = get_jiffies_64();

Users of the internal xtime variable will notice a couple of similar changes. One is that xtime, too, is now protected by xtime_lock (as it is in 2.4 as of 2.4.10), so any code which plays around with disabling interrupts or such before accessing xtime will need to change. The best solution is probably to use:

struct timespec current_kernel_time(void);

which takes care of locking for you. xtime also now is a struct timespec rather than a struct timeval; the difference being that the sub-second part is called tv_nsec, and is in nanoseconds.

4.3.3 Timers

The kernel timer interface is essentially unchanged since 2.4, with one exception. The new function:

void add_timer_on(struct timer_list *timer, int cpu);

will cause the timer function to run on the given CPU when the expiration time hits.
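For readers who have not set up a kernel timer recently, a short sketch of arming one and targeting a specific CPU follows; the timer function and the one-second expiration are invented for the example:

#include <linux/timer.h>
#include <linux/jiffies.h>

static void my_timer_fn(unsigned long arg)
{
    /* runs one second after start_timer_on_cpu(), on the chosen CPU */
}

static struct timer_list my_timer;

static void start_timer_on_cpu(int cpu)
{
    init_timer(&my_timer);
    my_timer.function = my_timer_fn;
    my_timer.data = 0;
    my_timer.expires = jiffies + HZ;    /* one second from now */
    add_timer_on(&my_timer, cpu);
}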

4.3.4 Delays

The 2.5 kernel includes a new macro ndelay(), which delays for a given number of nanoseconds. It can be useful for interactions with hardware which insists on very short delays between operations. On most architectures, however, ndelay(n) is equal to udelay(1) for waits of less than one microsecond.

4.4 Delayed tasks and workqueues

The longstanding task queue interface was removed in 2.5.41; in its place is a new "workqueue" mechanism. Workqueues are very similar to task queues, but there are some important differences. Among other things, each workqueue has one or more dedicated worker threads (one per CPU) associated with it. So all tasks running out of workqueues have a process context, and can thus sleep. Note that access to user space is not possible from code running out of a workqueue; there simply is no user space to access. Drivers can create their own work queues—with their own worker threads—but there is a default queue (for each processor) provided by the kernel that will work in most situations.

See the web site for a detailed discussion of workqueues.
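As a rough sketch of the default-queue case, assuming the workqueue interface as it stood around 2.5.41 (INIT_WORK() taking a function and a data pointer, and schedule_work() submitting to the kernel's shared queue), deferred work might be set up like this; the function names are hypothetical:

#include <linux/workqueue.h>

static void my_deferred_work(void *data)
{
    /* runs later, in process context, from a kernel worker thread */
}

static struct work_struct my_work;

static void my_driver_setup(void)
{
    INIT_WORK(&my_work, my_deferred_work, NULL);
}

/* called, for example, from an interrupt handler */
static void my_driver_kick(void)
{
    schedule_work(&my_work);
}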

4.5 DMA support

The direct memory access (DMA) support layer has been extensively changed in 2.5, but, in many cases, device drivers should work unaltered. For developers working on new drivers, or for those wanting to keep their code current with the latest API, there are a fair number of changes to be aware of.

The most evident change is the creation of the new generic DMA layer. A new set of generic DMA functions has been added which is intended to provide a DMA support API that is not specific to any particular bus. The new functions look much like the older PCI-based ones; changing from one API to the other is a fairly automatic job. The full set of equivalences between old and new DMA functions may be found on the web site.
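As one small example of the resemblance, a streaming mapping that would once have used pci_map_single() might, with the generic layer, look roughly like the sketch below; the dev pointer is assumed to be the struct device embedded in the driver's bus-specific device structure, and the buffer and length are placeholders:

#include <linux/dma-mapping.h>

static dma_addr_t map_for_device(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle;

    /* map the buffer for a device-bound (write) transfer */
    handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    /* ... hand "handle" to the hardware and start the transfer ... */
    return handle;
}

static void unmap_after_io(struct device *dev, dma_addr_t handle, size_t len)
{
    dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
}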

There has been one significant change in the creation of scatter/gather streaming DMA mappings. The 2.4 version of struct scatterlist used a char * pointer (called address) for the buffer to be mapped, with a struct page pointer that would be used only for high memory addresses. In 2.5, the address pointer is gone, and all scatterlists must be built using struct page pointers.

Other developments of interest here include support for DAC (64-bit) PCI DMA, an interface for explicitly non-coherent mappings, and PCI pools.

5 Kernel preemption

One significant change introduced in 2.5 is the preemptible kernel. Previously, a thread running in kernel space would run until it returned to user mode or voluntarily entered the scheduler. In 2.5, if preemption is configured in, kernel code can be interrupted at (almost) any time. As a result, the number of challenges relating to concurrency in the kernel goes up. But this is actually not that big a deal for code which was written to handle SMP properly—most of the time. If you have not yet gotten around to implementing proper locking for your 2.4 driver, kernel preemption should give you yet another reason to get that job done.

The preemptible kernel means that your driver code can be preempted whenever the scheduler decides there is something more important to do. "Something more important" could include re-entering your driver code in a different thread. There is one big, important exception, however: preemption will not happen if the currently-running code is holding a spinlock. Thus, the precautions which are taken to ensure mutual exclusion in the SMP environment also work with preemption. So most (properly written) code should work correctly under preemption with no changes.

That said, code which makes use of per-CPU variables should take extra care. A per-CPU variable may be safe from access by other processors, but preemption could create races on the same processor. Code using per-CPU variables should, if it is not already holding a spinlock, disable preemption if the possibility of concurrent access exists. Usually, macros like get_cpu_var() should be used for this purpose.

Should it be necessary to control preemption directly (something that should happen rarely), some macros in <linux/preempt.h> will be helpful. A call to preempt_disable() will keep preemption from happening, while preempt_enable() will make it possible again. If you want to re-enable preemption, but don't want to get preempted immediately (perhaps because you are about to finish up and reschedule anyway), preempt_enable_no_resched() is what you need.
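In code, the critical-section pattern described above is simply a matter of bracketing the sensitive region; the function below is an invented example, not taken from any driver:

#include <linux/preempt.h>

static void touch_cpu_local_state(void)
{
    preempt_disable();     /* stay on this CPU; no preemption in here */
    /* ... manipulate per-CPU or otherwise CPU-local data ... */
    preempt_enable();      /* preemption (and rescheduling) allowed again */
}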

One interesting side-effect of the preemption work is that it is now much easier to tell if a particular bit of kernel code is running within some sort of critical section. A single variable in the task structure now tracks the preemption, interrupt, and softirq states. A new macro, in_atomic(), tests all of these states and returns a nonzero value if the kernel is running code that should complete without interruption.

6 Sleeping and waiting

Contrary to expectations, the classic functions sleep_on() and interruptible_sleep_on() were not removed in the 2.5 series. It seems that they are still needed in a few places where (1) taking them out is quite a bit of work, and (2) they are actually used in a way that is safe. Most authors of kernel code should, however, pretend that those functions no longer exist. There are very few situations in which they can be used safely, and better alternatives exist.

6.1 Safe sleeping

Most of those alternatives have been around since 2.3 or earlier. In many situations, one can use the wait_event() macros:

DECLARE_WAIT_QUEUE_HEAD(queue);
wait_event(queue, condition);
int wait_event_interruptible(queue, condition);

These macros work as they did in 2.4: condition is a boolean condition which will be tested within the macro; the wait will end when the condition evaluates true. It is worth noting that these macros have moved from <linux/sched.h> to <linux/wait.h>, which seems a more sensible place for them. There is also a new one:

int wait_event_interruptible_timeout(queue, condition, timeout);

which will terminate the wait if the timeout expires.

In many situations, wait_event() does not provide enough flexibility—often because tricky locking is involved. The longstanding "manual sleep" method can be used in these cases. In 2.5, however, a set of helper functions has been added which makes this task easier. The modern equivalent of a manual sleep looks like:

DECLARE_WAIT_QUEUE_HEAD(queue);
DEFINE_WAIT(wait);

while (! condition) {
        prepare_to_wait(&queue, &wait, TASK_INTERRUPTIBLE);
        if (! condition)
                schedule();
        finish_wait(&queue, &wait);
}

Use prepare_to_wait_exclusive() instead when an exclusive wait is needed. Note that the new macro DEFINE_WAIT() is used here, rather than DECLARE_WAITQUEUE(). The former should be used when the wait queue entry is to be used with prepare_to_wait(), and should probably not be used in other situations unless you understand what it is doing (which we'll get into next).

6.2 Wait queue changes

In addition to being more concise and less error prone, using prepare_to_wait() can yield higher performance in situations where wakeups happen frequently. This improvement is obtained by causing the process to be removed from the wait queue immediately upon wakeup; that removal keeps the process from seeing multiple wakeups if it doesn't otherwise get around to removing itself for a bit.

The automatic wait queue removal is implemented via a change in the wait queue mechanism. Each wait queue entry now includes its own "wake function," whose job it is to handle wakeups. The default wake function (which has the surprising name default_wake_function()) behaves in the customary way: it sets the waiting task into the TASK_RUNNING state and handles scheduling issues. The DEFINE_WAIT() macro creates a wait queue entry with a different wake function, autoremove_wake_function(), which automatically takes the newly-awakened task out of the queue.

And that, of course, is how DEFINE_WAIT() differs from DECLARE_WAITQUEUE()—they set different wake functions. How the semantics of the two differ is not immediately evident from their names, but that's how it goes. (The new runtime initialization function init_wait() differs from the older init_waitqueue_entry() in exactly the same way.)

If need be, you can define your own wake function—though the need for that should be quite rare (the only user, currently, is the support code for the epoll() system calls). See the web site for details on how this is done.

One other change that most programmers won't notice: a bunch of wait queue cruft from 2.4 (two different kinds of wait queue lock, wait queue debugging) has been removed from 2.5.

6.3 Completions

Completions are a simple synchronization mechanism that is preferable to sleeping and waking up in some situations. If you have a task that must simply sleep until some process has run its course, completions can do it easily and without race conditions. They are not strictly a 2.5 feature, having been added in 2.4.7, but they merit a quick summary here.

A completion is, essentially, a one-shot flag that says "things may proceed." Working with completions requires including <linux/completion.h> and creating a variable of type struct completion. This structure may be declared and initialized statically with:

DECLARE_COMPLETION(my_comp);

A dynamic initialization would look like:

struct completion my_comp;
init_completion(&my_comp);

When your driver begins some process whose completion must be waited for, it's simply a matter of passing your completion event to wait_for_completion():

void wait_for_completion(struct completion *comp);

When some other part of your code has decided that the completion has happened, it can wake up anybody who is waiting with one of:

void complete(struct completion *comp);
void complete_all(struct completion *comp);


The first form will wake up exactly one waiting process, while the second will wake up all processes waiting for that event. Note that completions are implemented in such a way that they will work properly even if complete() is called before wait_for_completion().

If you do not use complete_all(), you should be able to use a completion structure multiple times without problem. It does not hurt, however, to reinitialize the structure before each use—so long as you do it before initiating the process that will call complete()! The macro INIT_COMPLETION() can be used to quickly reinitialize a completion structure that has been fully initialized at least once.
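Putting the pieces together, a typical use might look like the following sketch (my_comp and the mydev_* helpers are hypothetical):

#include <linux/completion.h>

static DECLARE_COMPLETION(my_comp);

/* one thread starts the operation and sleeps until it is finished */
static void mydev_run_and_wait(void)
{
        mydev_start_operation();        /* hypothetical hardware kick-off */
        wait_for_completion(&my_comp);
}

/* called later, e.g. from the interrupt handler, when the work is done */
static void mydev_operation_done(void)
{
        complete(&my_comp);
}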

7 Interrupt handling

The kernel's handling of device interrupts has been massively reworked in the 2.5 series. Fortunately, very few of those changes are visible to the rest of the kernel; most well-written code should "just work" under 2.5. There are, however, two important exceptions: the return type of interrupt handlers has changed, and drivers which depend on being able to globally disable interrupts will require some changes for 2.5.

7.1 Interrupt handler return values

Prior to 2.5.69, interrupt handlers returned void. There is, however, one useful thing that interrupt handlers can tell the kernel: whether the interrupt was something they could handle or not. If a device starts generating spurious interrupts, the kernel would like to respond by blocking interrupts from that device. If no interrupt handler for a given IRQ has been registered, the kernel knows that any interrupt on that number is spurious. When interrupt handlers exist, however, they must tell the kernel about spurious interrupts.

So, interrupt handlers now return an irqreturn_t value; void handlers will no longer compile. If your interrupt handler recognizes and handles a given interrupt, it should return IRQ_HANDLED. If it knows that the interrupt was not on a device it manages, it can return IRQ_NONE instead. The macro IRQ_RETVAL(handled) can also be used; handled should be nonzero if the handler could deal with the interrupt. The "safe" value to return, if, for some reason, you are not sure, is IRQ_HANDLED.
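A 2.5-style handler therefore takes roughly the following form (a sketch; the mydev structure and helpers are hypothetical):

static irqreturn_t mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct mydev *dev = dev_id;

        if (!mydev_irq_pending(dev))    /* interrupt was not from our device */
                return IRQ_NONE;

        mydev_handle_irq(dev);          /* acknowledge and service the device */
        return IRQ_HANDLED;
}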

7.2 Disabling interrupts

In the 2.5 kernel, it is no longer possible to globally disable interrupts. In particular, the cli(), sti(), save_flags(), and restore_flags() functions are no longer available. Disabling interrupts across all processors in the system is simply no longer done. This behavior has been strongly discouraged for some time, so most code should have been converted by now.

The proper way to fix such code, of course, is to figure out exactly which resources were being protected by disabling interrupts. Those resources can then be explicitly protected with spinlocks instead. The change is usually fairly straightforward, but it does require an understanding of what is really going on.
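As a sketch of the usual conversion (the lock and the data it guards are hypothetical), a critical section that used to be bracketed with save_flags()/cli() and restore_flags() becomes:

static spinlock_t mydev_lock = SPIN_LOCK_UNLOCKED;  /* guards mydev state */

static void mydev_update_state(void)
{
        unsigned long flags;

        /* disables interrupts on the local processor only */
        spin_lock_irqsave(&mydev_lock, flags);
        /* ... touch the data that cli()/sti() used to protect ... */
        spin_unlock_irqrestore(&mydev_lock, flags);
}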

It is still possible to disable all interrupts locally with local_irq_save() or local_irq_disable(). A single interrupt can be disabled globally with disable_irq(). Some of the spinlock operations also disable interrupts on the local processor, of course.


8 Asynchronous I/O

One of the key "enterprise" features added to the 2.5 kernel is asynchronous I/O (AIO). The AIO facility allows user processes to initiate multiple I/O operations without waiting for any of them to complete; the status of the operations can then be retrieved at some later time. Block and network drivers are already fully asynchronous, and thus there is nothing special that needs to be done to them to support the new asynchronous operations. Character drivers, however, have a synchronous API, and will not support AIO without some additional work. For most char drivers, there is little benefit to be gained from AIO support. In a few rare cases, however, it may be beneficial to make AIO available to your users. The web site has details on how to do that.

9 Block drivers

The first big, disruptive changes to the 2.5 kernel came from the reworking of the block I/O layer. As 2.5 heads towards its final phases, the block layer is still seeing a lot of attention. As one might guess, the result of all this work is a great many changes as seen by driver authors—or anybody else who works with block I/O. The transition may be painful for some, but it's worth it: the new block layer is easier to work with and offers much better performance than its predecessor.

Space constraints make it impossible to cover all of the block layer changes here; they could easily justify an entire paper to themselves. These changes are covered in detail on the web site listed at the end of the paper. Here, however, we'll have to content ourselves with an overview.

So, what has changed with the block layer?

• A great deal of old cruft is gone. For example, it is no longer necessary to work with a whole set of global arrays within block drivers. These arrays (blk_size, blksize_size, hardsect_size, read_ahead, etc.) have simply vanished.

• As part of the cruft removal, most of the <linux/blk.h> macros (DEVICE_NAME, DEVICE_NR, CURRENT, INIT_REQUEST, etc.) have been removed, and the file itself should go away soon. It is still possible to implement a simple request loop for straightforward devices where performance is not a big issue, but the mechanisms have changed.

• The io_request_lock is gone; locking is now done on a per-queue basis.

• Request queues have, in general, gotten more sophisticated. There is simple support for tagged command queueing, along with features like request barriers and queue-time device command generation.

• Buffer heads are no longer used in the block layer; they have been replaced with the new "bio" structure. The new representation of block I/O operations is designed for flexibility and performance; it encourages keeping large operations intact. Simple drivers can pretend that the bio structure does not exist, but most performance-oriented drivers—i.e., those that want to implement clustering and DMA—will need to be changed to work with bios.

One of the most significant features of the bio structure is that it represents I/O buffers directly with page structures and offsets, not in terms of kernel virtual addresses. By default, I/O buffers can be located in high memory, on the assumption that computers equipped with that much memory will also have reasonably modern I/O controllers. Support operations have been provided for tasks like bio splitting and the creation of DMA scatter/gather maps; a sketch of walking a bio's segments appears after this list.

• Sector numbers can now be 64 bits wide, making it possible to support very large block devices.

• The rudimentary gendisk ("generic disk") structure from 2.4 has been greatly improved in 2.5; generic disks are now used extensively throughout the block layer. The most significant change for block driver authors may be the fact that partition handling has been moved up into the block layer, and drivers no longer need know anything about partitions. That is, of course, the way things should always have been.
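As promised above, here is a sketch (assuming the 2.5-era bio_vec fields and a struct bio *bio already in hand; mydev_transfer() is a hypothetical helper) of how a driver might walk the page/offset/length segments of a bio:

struct bio_vec *bvec;
int i;

bio_for_each_segment(bvec, bio, i) {
        /* each segment is a page/offset/length triple; the page may be in high memory */
        char *buf = (char *) kmap(bvec->bv_page) + bvec->bv_offset;

        mydev_transfer(buf, bvec->bv_len);   /* hypothetical data-transfer helper */
        kunmap(bvec->bv_page);
}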

The end result is a fair amount of short-term pain for maintainers of block drivers. It does not take long, however, to realize that the new interface is easier to program for and far more robust.

10 Network drivers

Here's the good news for people maintaining network drivers: once you have dealt with the basic changes that affect all drivers and loadable modules, your driver will likely work as it is. The API that was presented to drivers in 2.4 is essentially unchanged. There has been no gratuitous network driver breakage in 2.5.

The bad news is: a bunch of useful new stuff has been added in 2.5. If you want your driver to use the features of 2.5 to get the best performance out of your hardware, you will have to spend some time dealing with changes.

10.1 NAPI

The most significant change, perhaps, is the addition of NAPI ("New API"), which is designed to improve the performance of high-speed networking. NAPI works through interrupt mitigation (reducing the thousands of interrupts per second that accompany high network traffic) and packet throttling (keeping packets out of the kernel if they will be dropped anyway). NAPI was also backported to the 2.4.20 kernel.

Converting drivers to NAPI is essentially a two-step process:

• Your driver should no longer process incoming packets in its interrupt handler. Instead, it should disable further "packet available" interrupts and tell the networking system to begin polling the interface.

• A new poll() method must be created which processes all available incoming packets (up to a kernel-specified limit). A new function, netif_receive_skb(), has been set up to accept packets from poll() methods; a sketch of such a method follows.
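The following sketch shows the general shape of a 2.5-era poll() method; the mydev_* helpers are hypothetical, and details such as quota handling vary between drivers:

static int mydev_poll(struct net_device *dev, int *budget)
{
        int done = 0;
        int limit = min(*budget, dev->quota);

        while (done < limit && mydev_rx_pending(dev)) {
                struct sk_buff *skb = mydev_get_packet(dev);  /* hypothetical */
                netif_receive_skb(skb);
                done++;
        }
        *budget -= done;
        dev->quota -= done;

        if (!mydev_rx_pending(dev)) {
                netif_rx_complete(dev);           /* leave polling mode */
                mydev_enable_rx_interrupts(dev);  /* hypothetical */
                return 0;
        }
        return 1;   /* more packets remain; poll again */
}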

The web site has the inevitable details.

10.2 Receiving packets in non-interrupt mode

Network drivers tend to send packets into the kernel while running in interrupt mode. There are occasions where, instead, packets will be received by a driver running in process context. There is no problem with this mode of operation, but it is possible that the networking software interrupt which performs packet processing may be delayed, reducing performance. To avoid this problem, drivers handing packets to the kernel outside of interrupt context should use:

int netif_rx_ni(struct sk_buff *skb);


instead of netif_rx().

10.3 Other 2.5 features

A number of other networking features were added in 2.5. Here is a quick summary of developments that driver developers may want to be aware of.

Ethtool support. Ethtool is a utility which can perform detailed configuration of network interfaces; it can be found on the gkernel SourceForge page. This tool can be used to query network information, tweak detailed operating parameters, control message logging, and more. Supporting ethtool requires implementing the SIOCETHTOOL ioctl() command, along with (parts of, at least) the lengthy set of ethtool commands. See <linux/ethtool.h> for a list of things that can be done. Implementing the message logging control features requires checking the logging settings before each printk() call; there is a set of convenience macros in <linux/netdevice.h> which make that checking a little easier.

VLAN support. The 2.5 kernel has support for 802.1q VLAN interfaces; this support has also been working its way into 2.4, with the core being merged in 2.4.14.

TCP segmentation offloading. The TSO feature can improve performance by offloading some TCP segmentation work to the adaptor and cutting back slightly on bus bandwidth. TSO is an advanced feature that can be tricky to implement with good performance; see the tg3 or e1000 drivers for examples of how it's done.

11 User space access

The kiobuf abstraction was introduced in 2.3 as a low-level way of representing I/O buffers. Its primary use, perhaps, was to represent zero-copy I/O operations going directly to or from user space. A number of problems were found with the kiobuf interface, however; among other things, it forced large I/O operations to be broken down into small chunks, and it was seen as a heavyweight data structure. So, in 2.5.43, kiobufs were removed from the kernel. Direct access to user space remains possible, however, as we'll see.

The modern equivalent of map_user_kiobuf() is a function called get_user_pages():

int get_user_pages(struct task_struct *task,
                   struct mm_struct *mm,
                   unsigned long start,
                   int len,
                   int write,
                   int force,
                   struct page **pages,
                   struct vm_area_struct **vmas);

task is the process performing the mapping; the primary purpose of this argument is to say who gets charged for page faults incurred while mapping the pages. This parameter is almost always passed as "current". The memory management structure for the user's address space is passed in the mm parameter; it is usually current->mm. Note that get_user_pages() expects that the caller will have a read lock on mm->mmap_sem. The start and len parameters describe the user buffer to be mapped; len is in pages. If the memory will be written to, write should be non-zero. The force flag forces read or write access, even if the current page protection would otherwise not allow that access. The pages array (which should be big enough to hold len entries) will be filled with pointers to the page structures for the user pages. If vmas is non-NULL, it will be filled with a pointer to the vm_area_struct structure containing each page.


The return value is the number of pages actually mapped, or a negative error code if something goes wrong. Assuming things worked, the user pages will be present (and locked) in memory, and can be accessed by way of the struct page pointers. Be aware, of course, that some or all of the pages could be in high memory.

There is no equivalent put_user_pages() function, so callers of get_user_pages() must perform the cleanup themselves. There are two things that need to be done: marking of modified pages, and releasing them from the page cache. If your device modified the user pages, the virtual memory subsystem may not know about it, and may fail to write the pages to permanent storage (or swap). That, of course, could lead to data corruption and grumpy users. The way to avoid this problem is to call:

int set_page_dirty_lock(struct page *page);

for each page in the mapping.

Finally, every mapped page must be released from the page cache, or it will stay there forever; simply pass each page structure to:

void put_page(struct page *page);

After you have released the page, of course, you should not access it again.
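Putting the pieces together, the overall sequence might look like the following sketch (uaddr, nr_pages and the pages array belong to the caller; error handling is abbreviated and the actual I/O step is only hinted at):

int i, res;

down_read(&current->mm->mmap_sem);
res = get_user_pages(current, current->mm, uaddr, nr_pages,
                     1 /* write */, 0 /* force */, pages, NULL);
up_read(&current->mm->mmap_sem);

if (res > 0) {
        /* ... perform the device I/O into the mapped pages ... */

        for (i = 0; i < res; i++) {
                set_page_dirty_lock(pages[i]); /* the device wrote to them */
                put_page(pages[i]);            /* release from the page cache */
        }
}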

For a good example of how to use get_user_pages() in a char driver, see the definition of sgl_map_user_pages() in drivers/scsi/st.c.

12 Conclusion

This paper is drawn from the LWN.net "Porting Drivers to 2.5" series, which can be found at:

http://lwn.net/Articles/driver-porting/

Those articles contain much more detail than was possible to squeeze into this paper. They are also being maintained as kernel development continues; there are, beyond a doubt, things that have changed since this paper was written. If nothing else, the kernel developers will eventually figure out how they want to support larger device numbers. The driver-porting web site will have the latest information on kernel API changes.

13 Acknowledgments

Previous versions of this material have been improved by comments from Jens Axboe, Jamal Hadi Salim, Greg Kroah-Hartman, Andrew Morton, Rusty Russell, and others I have certainly forgotten. The creation of the "driver porting" series of articles was funded by LWN.net subscribers.


Class-based Prioritized Resource Control in Linux

Shailabh Nagar, Hubertus Franke, Jonghyuk Choi, Chandra Seetharaman,
Scott Kaplan,∗ Nivedita Singhvi, Vivek Kashyap, Mike Kravetz

IBM Corp.
{nagar,frankeh,jongchoi,chandra.sekharan}@us.ibm.com

{nivedita,vivk,kravetz}@us.ibm.com

[email protected]

Abstract

In Linux, control of key resources such as memory, CPU time, disk I/O bandwidth and network bandwidth is strongly tied to kernel tasks and address spaces. The kernel offers very limited support for enforcing user-specified priorities during the allocation of these resources.

In this paper, we argue that Linux kernel resource management should be based on classes rather than tasks alone and be guided by class shares rather than system utilization alone. Such class-based kernel resource management (CKRM) allows system administrators to provide differentiated service at a user or job level and prevent denial of service attacks. It also enables accurate metering of resource consumption in user and kernel mode. The paper proposes a framework for CKRM and discusses incremental modifications to kernel schedulers to implement the framework.

1 Introduction

With Linux having made rapid advances in scalability, making it the operating system of choice for enterprise servers, it is useful and timely to examine its support for resource management. Enterprise workloads typically run on two types of servers: clusters of 1-4 way SMPs and large (8-way and upward) SMPs and mainframes. In both cases, system administrators must balance the needs of the workload with the often conflicting goal of maintaining high system utilization. The balancing becomes particularly important for large SMPs as they often run multiple workloads to amortize the higher cost of the hardware.

∗on sabbatical from Amherst College, MA

A key aspect of multiple workloads is that they vary in the business importance to the server owner. To maximize the server's utility, the system administrator needs to ensure that workloads with higher business importance get a larger share of server resources. The kernel's resource schedulers need to allow some form of differentiated service to meet this goal. It is also important that the resource usage by different workloads be accounted accurately so that the customers can be billed according to their true usage rather than an average cost. Kernel support for accurate accounting of resource usage is required, especially for resource consumption in kernel mode.

Differentiated service is also useful to the desktop user. It would allow large file transfers to get reduced priority to the disk compared to disk accesses resulting from interactive commands. A kernel compile could be configured to run in the background with respect to the CPU, memory and disk, allowing a more important activity such as browsing to continue unimpeded.

The current Linux kernel (version 2.5.69 at the time of writing) lacks support for the above-mentioned needs. There is limited and varying support for any kind of performance isolation in each of the major resource schedulers (CPU, network, disk I/O and memory). CPU and inbound network scheduling offer the greatest support by allowing specification of priorities. The deadline I/O scheduler [3] offers some isolation between disk reads and writes but not between users or applications, and the VM subsystem has support for limiting address space size of a user. More importantly, the granularity of kernel supported service differentiation is a task (process), or infrequently the userid. It does not allow the user to define the granularity at which resources get apportioned. Finally, there is no common framework for a system administrator to specify priorities for usage of different physical resources.

The work described in this paper addresses these shortcomings. It proposes a class-based framework for prioritized resource management of all the major physical resources managed by the kernel. A class is a user-defined, dynamic grouping of tasks that have associated priorities for each physical resource. The proposed framework permits a better separation of user-specified policies from the enforcement mechanisms implemented by the kernel. Most importantly, it attempts to achieve these goals using incremental modifications of the existing mechanisms.

The paper is organized as follows. Section 2 introduces the proposed framework and the central concepts of classification, monitoring and control. Sections 3, 4, 5, 6 explore the framework for the CPU, disk I/O, network and memory subsystems and propose the extensions necessary to implement it. Section 7 concludes with directions for future work.

2 Framework

Before describing the proposed framework, we define a few terms.

Tasks are the Linux kernel's common representation for both processes and threads. A class is a group of tasks. The grouping of tasks into classes is decided by the user using rules and policies.

A classification rule, henceforth simply called a rule, is a method by which a task can be classified into a class. Rules are defined by the system administrator, generally as part of a policy (defined later) but also individually, typically as modifications or increments to an existing policy. Attributes of a task, such as real uid, real gid, effective uid, effective gid, path name, command name and task or application tag (defined later) can be used to define rules. A rule consists of two parts: a set of attribute-value tuples (A,V) and a class C. If the rule's tuple values match the corresponding attributes of a task, then it is considered to belong to the class C.

A policy is a collection of class definitions and classification rules. Only one policy is active in a system at any point of time. Policies are constructed by the system administrator and submitted to a CKRM kernel. The kernel optionally verifies the policy for consistency and activates it. The order of rules in a policy is important. Rules are applied in the order they are defined (one exception is the application tags as described in the notes below).

An Application/Task Tag is a user-defined attribute associated with a task. Such an attribute is useful when tasks need to be classified and managed based on application-specific criteria. In such scenarios, an application's tasks can specify its tag to the kernel using a system call, ioctl, /proc entry etc. and trigger its classification using a rule that uses the task tag attribute. Since the task tag is opaque to the kernel, it allows applications and system administrators additional flexibility in grouping tasks.

A Resource Manager is the entity which determines the proportions in which resources should be allocated to classes. This could be either a human system administrator or a resource management application middleware.

With all the new terms of the framework defined, we can now describe how the framework operates to provide class-based resource management. Figure 1 illustrates the lifecycle of tasks in the proposed framework with an emphasis on the three central aspects of classification, monitoring and control.

2.1 Initialization

Sometime after system boot up, the Resource Manager commits a policy to the kernel. The policy defines the classes and it is used to classify all tasks (pre-existing and new) created and all incoming network packets. A CKRM-enabled Linux kernel also contains a default policy with a single default class to which all tasks belong. The default policy determines classification and control behaviour until the Resource Manager's policy gets loaded. New policies can be loaded at any point and override the existing policy. Such policy loads trigger reclassification and reinitialization of monitoring data and are expected to be very infrequent.

2.2 Classification

Classification refers to the association of tasks to classes and the association of resource requests to a class. The distinction is mostly irrelevant as most resource requests are initiated by a task, except for incoming network packets which need to be classified before it is known which task will consume them. Classification is a continuous operation. It happens on a large scale each time a new policy is committed and all existing tasks get reclassified. At later points, classification occurs whenever (1) a new task is created e.g. fork(), exec(); (2) the attributes of a task change e.g. setuid(), setgid(), application tag change (initiated by the application) and (3) explicit reclassification of a specific task by the Resource Manager. Scenarios (1) and (2) are illustrated in Figure 1.

Classification of tasks potentially allows all work initiated by the task to be associated with the class. Thus the CPU, memory and I/O requests generated by this task, or by the kernel on behalf of this task, can be monitored and regulated by the class-aware resource schedulers. Kernel-initiated resource requests which are on behalf of multiple classes, e.g. a shared memory page writeout, need special treatment, as do application initiated requests which are processed asynchronously. Classification of incoming network connections and/or data (which are seen by the kernel before the task to which they are destined) is discussed separately in Section 5.

2.3 Monitoring

Resource usage is maintained at the class level in addition to the regular system accounting by task, user etc. The system administrator or an external control program with root privileges can obtain that information from the kernel at any time. The information can be used to assess machine utilization for billing purposes or as an input to a future decision on changing the share allotted to a class. The CKRM API provides functions to query the current usage data as shown in Figure 1.


Figure 1: CKRM lifecycle

2.4 Control

The system administrator, as part of the initial policy or at a later point in time, assigns a per-resource share to each class in the system. Each class gets a separate share for CPU time, memory pages, I/O bandwidth and incoming network I/O bandwidth. The resource schedulers try their best to respect these shares while allocating resources to a class, e.g. the CPU scheduler tries to ensure that tasks belonging to Class A with a 50% CPU share collectively get 50% of the CPU time. At the next level of the control hierarchy, the system administrator or a control program can change the shares assigned to a class based on their assessment of application progress, system utilization etc. Collectively, the setting of shares and share-based resource allocation constitute the control part of the resource management lifecycle and are shown in Figure 1. This paper concentrates on the lower level share-based resource allocation since that is done by the kernel.

The next four sections go into the details of classification, monitoring and control aspects of managing each of the four major physical resources.

3 CPU scheduling

The CPU scheduler is central to the operation of the computing platform. It decides which task to run, when, and for how long. In general, realtime and timeshared jobs are distinguished, each with different objectives. Both are realized through different scheduling disciplines implemented by the scheduler. Before addressing the share based scheduling schemes, we describe the current Linux scheduler.

3.1 Linux 2.5 Scheduler

To achieve its objectives, Linux assigns a static priority to each task that can be modified by the user through the nice() interface. Linux has a range of [0 . . . MAX_PRIO] priority classes, of which the first MAX_RT_PRIO (=100) are set aside for realtime jobs and the remaining 40 priority classes are set aside for timesharing (i.e. normal) jobs, representing the [−20 . . . 19] nice value of UNIX processes. The lower the priority value, the higher the "logical" priority of a task, i.e. its general importance. In this context we always assume the logical priority when we are talking about priority increases and decreases. Realtime tasks always have a higher priority than normal tasks.

The Linux scheduler in 2.5, a.k.a. the O(1) scheduler, is a multi-queue scheduler that assigns to each CPU a run queue, wherein local scheduling takes place. A per-CPU runqueue consists of two arrays of task lists, the active array and the expired array. Each array index represents a list of runnable tasks at their respective priority level. After executing for a period, tasks move from the active list to the expired list to guarantee that all tasks get a chance to execute. When the active array is empty, expired and active arrays are swapped. More detail is provided further on. The scheduler simply picks the first task of the highest priority queue of the active array for execution.

Occasionally a load balancing algorithm rebalances the runqueues to ensure that a similar level of progress is made on each CPU. Realtime issues and load balancing issues are beyond this description here, hence we concentrate on the single CPU issue for now. For more details we refer to [12]. It might also be of interest to contrast this against the Linux 2.4 based scheduler, which is described in [10].

As stated earlier, the scheduler needs to decide which task to run next and for how long. Time quantums in the kernel are defined as multiples of a system tick. A tick is defined by the fixed delta (1/HZ) of two consecutive timer interrupts. In Linux 2.5, HZ=1000, i.e. the interrupt routine scheduler_tick() is called once every msec, at which time the currently executing task is charged a tick.

Besides the static priority (static_prio), each task maintains an effective priority (prio). The distinction is made in order to account for certain temporary priority bonuses or penalties based on the recent sleep average sleep_avg of a given task. The sleep average, a number in the range of [0 . . . MAX_SLEEP_AVG ∗ HZ], accounts for the number of ticks a task was recently descheduled. The time (in ticks) since a task went to sleep (maintained in sleep_timestamp) is added on task wakeup, and for every time tick consumed running, the sleep average is decremented.

The current design provisions a range of PRIO_BONUS_RATIO=25% ([−12.5% . . . 12.5%]) of the priority range for the sleep average. For instance a "nice=0" task has a static priority of 120. With a sleep average of 0 this task is penalized by 5, resulting in an effective priority of 125. On the other hand, if the sleep average is MAX_SLEEP_AVG = 10 secs, a bonus of 5 is granted, leading to an effective priority of 115. The effective priority determines the priority list in the active and expired array of the run queue. A task is declared interactive when its effective priority exceeds its static priority by a certain level (which can only be due to its accumulating sleep average). High priority tasks reach interactivity with a much smaller sleep average than lower priority tasks.

The timeslice, defined as the maximum time a task can run without yielding to another task, is simply a linear function of the static priority of normal tasks projected into the range of [MIN_TIMESLICE . . . MAX_TIMESLICE]. The defaults are set to [10 . . . 200] msecs. The higher the priority of a task, the longer its timeslice. A task's initial timeslice is deducted from the parent's remaining timeslice. For every timer tick the running task's timeslice is decremented. If decremented to "0", the scheduler replenishes the timeslice, recomputes the effective priority and either reenqueues the task into the active array if it is interactive or into the expired array if it is non-interactive. It then picks the next best task from the active array. This guarantees that all other tasks will execute first before any expired task will run again. If a task is descheduled, its timeslice will not be replenished at wakeup time, however its effective priority might have changed due to any sleep time.

If all runnable tasks have expired their timeslices and have been moved to the expired list, the expired array and the active array are swapped. This makes the scheduler O(1) as it does not have to traverse a potentially large list of tasks as was needed in the 2.4 scheduler. Due to interactivity the situation can arise that the active queue continues to have runnable tasks. As a result tasks in the expired queue might get starved. To avoid this, if the longest expired task is older than STARVATION_LIMIT=10 secs, the arrays are switched again.

3.2 CPU Share Extensions to the Linux Scheduler

We now examine the problem of extending the O(1) scheduler to allocate CPU time to classes in proportion of their CPU shares. Proportional share scheduling of CPUs has been studied in depth [7, 21, 13] but not in the context of making minimal modifications to an existing scheduler such as O(1).

Let C_i, i = [1 . . . N], be the set of N distinct classes with corresponding CPU shares S^cpu_i such that Σ_{i=1}^{N} S^cpu_i = 1. Let SHARE_TIME_EPOCH (STE) be the time interval, measured in ticks, over which the modified scheduler attempts to maintain the proportions of allocated CPU time. Further, we use Φ(a) and Φ(a, b) to denote functions of parameters a and b.

Weighted Fair Share (WFS): In the first approach considered, henceforth called weighted fair share (WFS), a per-class scheduling resource container is introduced that accounts for timeslices consumed by the tasks currently assigned to the class. Initially, the timeslice TS_i, i = [1 . . . N], of each class C_i is determined by TS_i = S^cpu_i × STE. The timeslice allocated to a task ts(p) remains the same as in O(1). Every time a task consumes one of its own timeslice ticks, it also consumes one from the class' timeslice. When a class' ticks are exhausted, the task consuming the last tick is put into the expired array. When the scheduler picks other tasks from the same class to run, they immediately move to the expired array as well. Eventually the expired and active arrays are switched, at which time all resource containers are refilled to TS_i = S^cpu_i × STE. Since the array switch occurs as soon as the active list becomes empty, this approach is work conserving (the CPU is never idle if there are runnable tasks). A variant of this approach was initially implemented by [17] based on a patch from Rik van Riel for allocating equal shares to all users in the system.
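As an illustrative calculation (the numbers are ours, not the paper's): with STE = 10000 ticks and three classes holding shares of 0.5, 0.3 and 0.2, the per-epoch class timeslices would be TS_1 = 5000, TS_2 = 3000 and TS_3 = 2000 ticks.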

However, WFS has some problems. If the tasks of a class are CPU bound and Σ_{p∈C_i} ts(p) > TS_i, then a class could exhaust its timeslice before all its tasks have had a chance to run at least once. Therefore the lower priority tasks of the class could perpetually move from the active to expired lists without ever being granted execution time. Starvation occurs because neither the static priority (sp) nor the sleep average (sa) of the tasks is changed at any time. Hence each task's timeslice ts(p) = Φ(sp) and effective priority ep(p) = Φ(sp, sa) remain unchanged. Hence the relative priority of tasks of a class never changes (in a CPU bound situation) nor does the amount of CPU time consumed by the higher priority tasks.

To ensure a fair share for individual tasks within classes, we need to ensure that the rate of progress of a task depends on the share assigned to its class. Three approaches to achieve this are discussed next.

Priority modifications (WFS+P): Let a switching interval be defined as the time interval between consecutive array switches of the modified scheduler, ∆_j be its duration, and t_j and t_{j+1} be the timestamps of the switches. In the priority modifications approach to alleviating starvation in WFS, henceforth called WFS+P, we track the number of array switches se at which a task got starved due to its class' timeslice exhaustion and increase the task's effective priority based on se, i.e. ep(p) = Φ(sp, sa, se). This ensures that starving tasks eventually get their ep high enough to get a chance to run, at which point se is reset. The drawback of this approach is that the increased scheduler overhead of tasks being selected for execution and moving directly to the expired list due to class timeslice exhaustion remains unchanged.

Timeslice modifications (WFS+T1, WFS+T2): Recognizing that starvation can occur in WFS for class C_i only if Σ_{p∈C_i} ts(p) > TS_i, the timeslice modification approaches attempt to change one or the other side of the inequality to convert it to an equality. In WFS+T1, the left hand side of the inequality is changed by reducing the timeslices of each task of a starved class as follows. Let exh_i = Σ_{p∈C_i} ts(p) − TS_i when class timeslice exhaustion occurs. At array switch time, each ts(p) is multiplied by λ_i = TS_i / (TS_i + exh_i), which results in the desired equality. WFS+T1 is slow to respond to starvation because task timeslices are recomputed in O(1) before they move into the expired array and not at array switch time. Hence any task timeslice changes take effect only one switching interval later, i.e. two intervals beyond the one in which starvation occurred. One way to address this problem is to treat a task as having exhausted its timeslice when ts(p) gets decremented to (1 − λ_i) × ts(p) instead of 0.
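As an illustrative calculation (our numbers, not the paper's): if TS_i = 3000 ticks but the class' tasks together hold 4500 ticks of timeslice, then exh_i = 1500 and λ_i = 3000/4500 = 2/3, so each task's timeslice is scaled to two-thirds of its O(1) value.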

A bigger problem with WFS+T1 is that smaller timeslices for tasks could lead to increased context switches with potentially negative cache effects.

To avoid reducing ts(p)'s, WFS+T2 increases TS_i of a starving class to make Σ_{p∈C_i} ts(p) = TS_i, i.e. the class does not exhaust its timeslice until each of its tasks has exhausted its individual timeslice. To preserve the relative proportions between class timeslices, all other class timeslices also need to be changed appropriately. Doing so would disturb the same equality for those classes and hence WFS+T2 is not a workable approach.

Two-level scheduler: Another way to regulate CPU shares in WFS is to take tasks out of the runqueue upon timeslice exhaustion and return them to the runqueue at a rate commensurate with the share of the class. A prototype implementation of this approach was described in [17] in the context of user-based fair sharing. This approach effectively implements a two-level scheduler and is illustrated in Figure 2. A modified O(1) scheduler forms the lower level and a coarse-grain scheduler operates at the upper level, replenishing class timeslice ticks. In the modified O(1), when a task expires, it is moved into a FIFO list associated with its class instead of moving to the expired array. At a coarse granularity determined by the upper level scheduler, the class receives new time ticks and reinserts tasks from the FIFO back into O(1)'s runqueue. Class time tick replenishment can be done for all classes at every array switch point but that violates the O(1) behaviour of the scheduler as a whole. To address this problem, [17] uses a kernel thread to replenish 8 ms worth of ticks to one class (user) every 8 ms and round robin through the classes (users). A variant of this idea is currently being explored.


Figure 2: Proposed two-level CPU scheduler

4 Disk scheduling

The I/O scheduler in Linux forms the interface between the generic block layer and the low level device drivers. The block layer provides functions which are used by filesystems and the virtual memory manager to submit I/O requests to block devices. These requests are transformed by the I/O scheduler and made available to the low-level device drivers (henceforth only called device drivers). Device drivers consume the transformed requests and forward them, using device specific protocols, to the device controllers which perform the I/O. Since prioritized resource management seeks to regulate the use of a disk by an application, the I/O scheduler is an important kernel component that is sought to be changed. It is also possible to regulate disk usage in the kernel layers above and below the I/O scheduler. Changing the pattern of I/O load generated by filesystems or the virtual memory manager (VMM) is an important option. A less explored option is to change the way specific device drivers or even device controllers consume the I/O requests submitted to them. The latter approach is outside the scope of general kernel development and this paper.

Class-based resource management requires two fundamental changes to the traditional approach to I/O scheduling. First, I/O requests should be managed based on the priority or weight of the request submitter with disk utilization being a secondary, albeit important, objective. Second, I/O requests should be associated with the class of the request submitter and not a process or task. Hence the weight associated with an I/O request should be derived from the weight of the class generating the request.

The first change is already occurring in the development of the 2.5 Linux kernel with the development of different I/O schedulers such as deadline, anticipatory, stochastic fair queueing and complete fair queueing. Consequently, the additional requirements imposed by the second change (scheduling by class) are relatively minor. This fits in well with our project goal of minimal changes to existing resource schedulers.

We now describe the existing Linux I/O schedulers followed by an overview of the changes being proposed.

4.1 Existing Linux I/O schedulers

The various Linux I/O schedulers can be abstracted into a generic model shown in Figure 3. I/O requests are generated by the block layer on behalf of processes accessing filesystems, processes performing raw I/O, and from the virtual memory management (VMM) components of the kernel such as kswapd, pdflush etc. These producers of I/O requests call __make_request() which invokes various I/O scheduler functions such as elevator_merge_fn. The enqueuing functions generally try to merge the newly submitted block I/O unit (bio in 2.5 kernels, buffer_head in 2.4 kernels) with previously submitted requests and sort it into one or more internal queues. Together, the internal queues form a single logical queue that is associated with each block device. At a later point, the low-level device driver calls the generic kernel function elv_next_request() to get the next request from the logical queue. elv_next_request() interacts with the I/O scheduler's dequeue function elevator_next_req_fn and the latter has an opportunity to pick the appropriate request from one of the internal queues. The device driver then processes the request, converting it to scatter-gather lists and protocol-specific commands that are then sent to the device controller. As far as the I/O scheduler is concerned, the block layer is the producer of I/O requests and the device drivers are the consumers. Strictly speaking, the block layer includes the I/O scheduler but we distinguish the two for the purposes of our discussion.

Figure 3: Abstraction of Linux I/O scheduler

Default 2.4 Linux I/O scheduler: The 2.4 Linux kernel's default I/O scheduler (elevator_linus) primarily manages disk utilization. It has a single internal queue. For each new bio, the I/O scheduler checks to see if it can be merged with an existing request. If not, a new request is placed in the internal queue, sorted by the starting device block number of the request. This minimizes disk seek times if the disk processes requests in FIFO order from the queue. An aging mechanism limits the number of times an existing request can get bypassed by a newer request, preventing starvation. The dequeue function is simply a removal of requests from the head of the internal queue. Elevator_linus also has the welcome property of improving request response times averaged over all processes.

Deadline I/O scheduler: The 2.5 kernel's default I/O scheduler (deadline_iosched) introduces the notion of a per-request deadline which is currently used to give a higher preference to read requests. Internally, it maintains five queues. During enqueueing, each request is assigned a deadline and inserted into queues sorted by starting sector (sort_list) and by deadline (fifo_list). Separate sort and fifo lists are maintained for read and write requests. The fifth internal queue contains requests to be handed off to the driver. During a dequeue operation, if the dispatch queue is empty, requests are moved from one of the four sort or fifo lists in batches. Thereafter, or if the dispatch queue was not empty, the head request on the dispatch queue is passed on to the driver. The logic for moving requests from the sort or fifo lists ensures that each read request is processed by its deadline without starving write requests. Disk seek times are amortized by moving a large batch of requests from the sort_list (which are likely to have few seeks as they are already sector sorted) and balancing it with a controlled number of requests from the fifo list (each of which could cause a seek since they are ordered by deadline and not sector). Thus, deadline_iosched effectively emphasizes average read request response times over disk utilization and total average request response time.


Anticipatory I/O scheduler: The anticipatory I/O scheduler [9, 4] attempts to reduce per-process read response times. It introduces a controlled delay in dispatching any new requests to the device driver, thereby allowing a process whose request just got serviced to submit a new request, potentially requiring a smaller seek. The tradeoff between reduced seeks and decreased disk utilization (due to the additional delays in dispatch) is managed using a cost-benefit calculation. The anticipatory I/O scheduling method is an additional optimization that can potentially be added to any of the I/O schedulers mentioned in this paper.

Complete Fair Queueing I/O scheduler: Two new I/O schedulers recently proposed in the Linux kernel community introduce the concept of fair allocation of I/O bandwidth amongst producers of I/O requests. The Stochastic Fair Queueing (SFQ) scheduler [5] is based on an algorithm originally proposed for network scheduling [11]. It tries to apportion I/O bandwidth equally amongst all processes in a system using 64 internal queues and one output (dispatch) queue. During an enqueue, the process ID of the currently running process (very likely to be the I/O request producer) is used to select one of the internal queues and the request is inserted in FIFO order within it. During dequeue, SFQ round-robins through the non-empty internal queues, picking requests from the head. To avoid too many seeks, one full round of requests is collected, sorted and merged into the dispatch queue. The head request of the dispatch queue is then passed to the device driver. Complete Fair Queueing is an extension of the same approach where no hash function is used. Hence each process in the system has a corresponding internal queue and can get a fair share of the I/O bandwidth (an equal share if all processes generate I/O requests at the same rate). Both CFQ and SFQ manage per-process I/O bandwidth and can provide fairness at a process granularity.

Cello disk scheduler: Cello is a two-level I/O scheduler [16] that distinguishes between classes of I/O requests and allows each class to be serviced by a different policy. A coarse grain class-independent scheduler decides how many requests to service from each class. The second level class-dependent scheduler then decides which of the requests from its class should be serviced next. Each class has its own internal queue which is manipulated by the class-specific scheduler. There is one output queue common to all classes. Enqueuing into the output queue is done by the class-specific schedulers in a way that ensures individual request deadlines are met as far as possible while reducing overall seek time. Dequeuing from the output queue occurs in FIFO order as in most of the previous I/O schedulers. Cello has been shown to provide good isolation between classes as well as the ability to meet the needs of streaming media applications that have soft realtime requirements for I/O requests.

4.2 Costa: Proposed I/O scheduler

This paper proposes that a modified version of the class-independent scheduler of the Cello I/O scheduling framework can provide a low-overhead class-based I/O scheduler suitable for CKRM's goals.

The key difference between the proposed scheduler, called Costa, and Cello is the elimination of the class-specific I/O schedulers, which may add unnecessary overhead for CKRM's goal of I/O bandwidth regulation. Fig 4 illustrates the Costa design. When the kernel is configured for CKRM support, a new internal queue is created for each class that gets added to the kernel. Since each process is always associated with a class, I/O requests that they generate can also be tagged with the class id and used to enqueue the request in the class-specific internal queue.


Figure 4: Proposed Costa I/O scheduler

The request->class association cannot be done through a lookup of the current process' class alone. During dequeue, the Costa I/O scheduler picks up requests from several internal queues and sorts them into the common dispatch queue. The head request of the dispatch queue is then handed to the device driver.

The mechanism also allows internal queues to be associated with system classes that group I/O requests coming from important producers such as the VMM. By separating these out, Costa can give them preferential treatment for urgent VM writeout or swaps.

In addition to a weight value (which determines the fraction of I/O bandwidth that a class will receive), the internal queues could also have an associated priority value which determines their relative importance. At a given priority level, all queues could receive I/O bandwidth in proportion of their weights, with the set of queues at a higher level always getting serviced first. Some check for preventing starvation of lower priority queues could be used, similar to the ones used in deadline_iosched.

5 QoS Networking in Linux

Many research efforts have been made in net-working QoS (Quality of Service) to providequality assurance of latency, bandwidth, jitter,and loss rate. With the proliferation of mul-timedia and quality-sensitive business traffic,it becomes essential to provide reserved qual-ity services (IntServ [23]) or differentiated ser-vices (DiffServ [2]) for important client traffic.

The Linux kernel has been offering a well es-tablished QoS network infrastructure for out-bound bandwidth management, policy-basedrouting, and DiffServ. Hence, Linux is beingwidely used for routers, gateways, and edgeservers, where network bandwidth is the pri-mary resource to differentiate among classes.

When it comes to Linux as an end server OS,on the other hand, networking QoS has notbeen given as much attention because QoS isprimarily governed by the system resourcessuch as CPU, memory, and I/O and less bythe network bandwidth. However, when weconsider the end-to-end service quality, wealso should require networking QoS in the endservers as exemplified by the fair share admis-sion control mechanism proposed in this sec-tion.

In the rest of the section, we first briefly introduce the existing network QoS infrastructure of Linux. Then, we describe the design of the fair share admission control in Linux and preliminary performance results.


5.1 Linux Traffic Control, Netfilter, DiffServ

The Linux traffic control [8] consists of queueing disciplines (qdisc) and filters. A qdisc consists of one or more queues and a packet scheduler. It makes traffic conform to a certain profile by shaping or policing. A hierarchy of qdiscs can be constructed jointly with a class hierarchy to make different traffic classes governed by proper traffic profiles. Traffic can be attributed to different classes by the filters that match the packet header fields. The filter matching can be stopped to police traffic above a certain rate limit. A wide range of qdiscs ranging from a simple FIFO to classful CBQ or HTB are provided for outbound bandwidth management, while only one ingress qdisc is provided for inbound traffic filtering and policing [8]. The traffic control mechanism can be used in various places where bandwidth is the primary resource to control. For instance, in service providers, it manages bandwidth allocation shared among different traffic flows belonging to different customers and services based on service level agreements. It also can be used in client sites to reduce the interference between upstream and downstream traffic and to enhance the response time of the interactive and urgent traffic.

Netfilter provides sophisticated filtering rules and targets. Matched packets can be accepted, denied, marked, or mangled to carry out various edge services such as firewall, dispatcher, proxy, NAT, etc. Routing decisions can be made based on the netfilter markings so packets may take different routes according to their classes. The qdiscs would enable various QoS features in such edge services when used with Netfilter. Netfilter classification can be transferred for use in later qdiscs by markings or mangled packet headers.

The Differentiated Service (DiffServ) [2] provides a scalable QoS by applying per-hop behavior (PHB) collectively to aggregate traffic classes that are identified by a 6-bit code point in the IP header. Classification and conditioning are typically done at the edge of a DiffServ domain. The domain is a contiguous set of nodes compliant to a common PHB. The DiffServ PHB is supported in Linux [22]. Classes, drop precedence, code point marking, and conditioning can be implemented by qdiscs and filters. At the end servers, the code point can be marked by setting the IP_TOS socket option.
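
As a small end-server illustration of the last point, a process can set the DiffServ code point on a socket with the standard IP_TOS socket option; the sketch below marks a socket with the Expedited Forwarding code point (46) purely as an example.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	/* The DSCP occupies the upper 6 bits of the former TOS byte,
	 * so code point 46 (EF) becomes 46 << 2 = 0xb8. */
	int tos = 46 << 2;

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
		perror("setsockopt(IP_TOS)");
	/* ... connect() and send traffic that now carries the marking ... */
	return 0;
}
```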

In policy based networking [18], a policy agent can configure the traffic classification of edge and end servers according to predefined filtering rules that match layer 3/4 or layer 7 information. Netfilter, qdisc, and application layer protocol engines can classify traffic for differentiated packet processing at later stages. Classifications at prior stages can be overridden by transaction information such as URIs, cookies, and user identities as they become known. It has been shown that a coordination of server and network QoS can reduce the end-to-end response time of important client requests significantly by virtual isolation from the low priority traffic [15].

5.2 Prioritized Accept Queues with Proportional Share Scheduling

We present here a simple change to the existing Linux TCP accept mechanism to provide differentiated service across priority classes. Recent work in this area has introduced the concept of prioritized accept queues [19] and accept queue schedulers that use adaptive proportional shares for self-managing web servers [14].

Under certain load conditions [14], the TCP accept queue of each socket becomes the bottleneck in network input processing. Normally, listening sockets fork off a child process to handle an incoming connection request. Some optimized applications such as the Apache web server maintain a pool of server processes to perform this task.


Figure 5: Proportional Accept Queue Results.

When the number of incoming requests exceeds the number of static pool servers, additional processes are forked, up to a configurable maximum. When the incoming connection request load is higher than the level that can be handled by the available server processes, requests have to wait in the accept queue until one is available.

In the typical TCP connection, the client initiates a request to connect to a server. This connection request is queued in a single global accept queue belonging to the socket associated with that server's port. Processes that perform an accept() call on that socket pick up the next queued connection request and process it. Thus all incoming connections to a particular TCP socket are serialized and handled in FIFO order.

We replace the existing single accept queue per socket with multiple accept queues, one for each priority class. Incoming traffic is mapped into one of the priority classes and queued on the accept queue for that priority. There are eight priority classes in the current implementation.

The accepting process schedules connection acceptance according to a simple weighted deficit round robin to pick up connection requests from each queue according to its assigned share. The share, or weight, can be assigned via the sysctl interface.
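
A minimal user-space sketch of the weighted deficit round robin idea follows; the queue structure, quantum, and weights are hypothetical and stand in for the per-priority accept queues and sysctl-assigned shares described above.

```c
#include <stdio.h>

#define NR_PRIO 8
#define QUANTUM 1

/* Hypothetical per-priority accept queue. */
struct accept_queue {
	int weight;   /* share assigned (e.g., via sysctl) */
	int deficit;  /* deficit counter for WDRR          */
	int pending;  /* queued connection requests        */
};

/* One WDRR pass: each non-empty queue earns weight * QUANTUM of
 * credit and may accept as many connections as its credit allows. */
static void wdrr_round(struct accept_queue *q)
{
	for (int i = 0; i < NR_PRIO; i++) {
		if (!q[i].pending) {
			q[i].deficit = 0;   /* empty queues keep no credit */
			continue;
		}
		q[i].deficit += q[i].weight * QUANTUM;
		while (q[i].pending && q[i].deficit > 0) {
			q[i].pending--;
			q[i].deficit--;
			printf("accept from priority class %d\n", i);
		}
	}
}

int main(void)
{
	struct accept_queue q[NR_PRIO] = {
		[0] = { .weight = 7, .pending = 20 },
		[1] = { .weight = 3, .pending = 20 },
	};

	for (int r = 0; r < 3; r++)
		wdrr_round(q);
	return 0;
}
```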

In the basic priority accept queue design proposed earlier in [6], starvation of certain priority classes was a possibility, as the accepting process picked up connection requests in descending priority order. With the proportional share scheme in this paper, it is easier to avoid starvation of particular classes and to give share guarantees to low priority classes.

The efficacy of the proportional accept queue mechanism is demonstrated by an experiment. In the experiment, we used Netfilter with mangle tables and MARK options to characterize traffic into priority classes based on source IP address. httperf clients on two machines send requests to an Apache web server running on a single server over independent gigabit Ethernet connections. The only Apache parameter changed from the default was the maximum number of httpd threads, which was set to 50 in the experiment.

Figure 5 shows the throughput of Apache for two priority classes, sharing inbound connection bandwidth 7:3. We can see that the throughput of the priority class 0 requests is slightly higher than that of the priority class 1 requests when the load is low. As load increases, the acceptance rates of priority classes 0 and 1 are constrained in proportion to their relative share, which in turn determines the processing rate of the Apache web server and the connection request queueing delay. Under a severe load, the priority class 0 requests are processed at a considerably higher throughput.


6 Controlling Memory

While many other system resources can be managed according to priorities or proportions, virtual memory managers (VMM) currently do not allow such control. The resident set size (RSS) of each process—the number of physical page frames allocated to that process—will determine how often that process incurs a page fault. If the RSS of each process is not somehow proportionally shared or prioritized, then paging behavior can overwhelm and undermine the efforts of other resource management policies.

Consider two processes, A and B, where A is assigned a larger share than B with the CPU scheduler. If A is given too small an RSS and begins to page fault frequently, then it will not often be eligible for scheduling on the CPU. Consequently, B will be scheduled more often than A, and the VMM will have become the de facto CPU scheduler, thus violating the requested scheduling proportions.

Furthermore, it is possible for most existing VMM policies to exhibit a specific kind of degenerative behavior. Once process A from the example above begins to page fault, its infrequent CPU scheduling prevents it from referencing its pages at the same rate as other, frequently scheduled processes. Therefore, its pages will become more likely to be evicted, thereby reducing its RSS. The smaller RSS will increase the probability of further page faults. This degenerative feedback loop will cease only when some other process either exits or changes its reference behavior in a manner that reduces the competition for main memory space.

Main memory use must be controlled just as any other resource is. The RSS of each address space—the logical space defined either by a file or by the anonymous pages within the virtual address space of a process—must be calculated as a function of the proportions (or priorities) that are used for CPU scheduling. Because this type of memory management has received little applied or academic attention, our work in this area is still nascent. We present here the structural changes to the Linux VMM necessary for proportional/prioritized memory management; we also present previous, applicable research, as well as future research directions that will address this complex problem. While the proportional/prioritized management of memory is currently less well developed than the other resources presented in this paper, it is necessary that it be comparably developed.

6.1 Basic VMM changes

Consider a function that explicitly calculates the desired RSS for each address space—the target RSS—when the footprints of the active address spaces exceed the capacity of main memory. After this function sets these targets, a system could immediately bring the actual RSS into alignment with these targets. However, doing so may require a substantial number of page swapping operations. Since disk operations are so slow, it is inadvisable to use aggressive, pre-emptive page swapping. Instead, a system should seek to move the actual RSS values toward their targets in a lazy fashion, one page fault at a time. Until the actual RSS of an address space matches its target, it can be labeled as being either in excess or in deficit of its target.
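
The following self-contained sketch illustrates the lazy convergence described above: targets are set by some policy, and on each page fault a frame is reclaimed from whichever space is most in excess. All names are hypothetical; this is only the shape of the bookkeeping, not kernel code.

```c
#include <stdio.h>

/* Hypothetical per-address-space RSS accounting. */
struct as_rss {
	long actual;  /* page frames currently resident */
	long target;  /* target RSS set by the policy   */
};

/* When a free frame is needed, lazily reclaim from the address
 * space that is most in excess of its target. */
static struct as_rss *pick_victim(struct as_rss *v, int n)
{
	struct as_rss *victim = NULL;

	for (int i = 0; i < n; i++)
		if (v[i].actual > v[i].target &&
		    (!victim || v[i].actual - v[i].target >
				victim->actual - victim->target))
			victim = &v[i];
	return victim;
}

/* A page fault in space 'a' grows its RSS by one frame, taken from
 * a space in excess if memory is tight. */
static void on_page_fault(struct as_rss *a, struct as_rss *all, int n,
			  int memory_tight)
{
	if (memory_tight) {
		struct as_rss *victim = pick_victim(all, n);
		if (victim)
			victim->actual--;
	}
	a->actual++;
}

int main(void)
{
	struct as_rss as[2] = {
		{ .actual = 130, .target = 100 },  /* in excess  */
		{ .actual =  70, .target = 100 },  /* in deficit */
	};

	for (int i = 0; i < 30; i++)
		on_page_fault(&as[1], as, 2, 1);
	printf("A: %ld/%ld  B: %ld/%ld\n",
	       as[0].actual, as[0].target, as[1].actual, as[1].target);
	return 0;
}
```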

As the VMM reclaims pages, it will do so from address spaces with excess RSS values. This approach to page reclamation suggests a change in the structure of the page lists—specifically, the active and inactive lists that impose an order on pages.1

1 In the Linux community, these are known as the LRU lists. However, they only approximate an LRU ordering, and so we refer to them only as page lists.


The current Linux VMM uses global page lists. If this approach to ordering pages is unchanged, then the VMM would have to search the page lists for pages that belong to the desired address spaces that have excess RSS values. Alternatively, one pair of page lists—active and inactive—could exist for each address space. The reclamation of pages from a specific address space would therefore require no searching.

By ordering pages separately within each address space, we also enable the VMM to more easily track reference behavior for each address space. While the information to be gathered would depend on the underlying policy that selects target RSS values, we believe that such tracking may play an important role in the development of such policies.
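
Structurally, this amounts to giving every address space its own pair of page lists and some per-space statistics. The sketch below, with hypothetical field names and a minimal stand-in for the kernel's list head type, is only meant to illustrate that layout.

```c
/* Minimal stand-in for the kernel's struct list_head so that the
 * sketch is self-contained. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical per-address-space page bookkeeping: each space keeps
 * its own active and inactive lists, so reclaiming from a specific
 * space requires no search of global lists, and reference behavior
 * can be tracked per space. */
struct as_pagelists {
	struct list_head active;      /* recently referenced pages */
	struct list_head inactive;    /* reclaim candidates        */
	unsigned long nr_active;
	unsigned long nr_inactive;
	unsigned long target_rss;     /* set by the sharing policy  */
	unsigned long recent_refs;    /* sampled reference activity */
};
```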

Note that the target RSS values would need to be recalculated periodically. While the period should be directly proportional to the memory pressure—some measure of the current workload's demand for main memory space—it is a topic of future research to determine what that period should be. By coupling the period for target RSS calculations to the memory pressure, we can ensure that this strategy only incurs significant computational overhead when heavy paging is occurring.

6.2 Proportionally sharing space

Waldspurger [20] describes a method of proportionally sharing the space of a system running VMware's ESX Server between a number of virtual machines. Specifically, as with a proportional share CPU scheduler, memory shares can be assigned to each virtual machine, and Waldspurger's policy will calculate target RSS values for each virtual machine.

Proportionally sharing main memory space can result in superfluous allocations to some virtual machines. If the target RSS for some virtual machine is larger than necessary, and some of the main memory reserved for that virtual machine is rarely used,2 then its target RSS should be reduced. Waldspurger addresses this problem with a taxation policy. In short, this policy penalizes each virtual machine for unused portions of its main memory share by reducing its target RSS.
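
Waldspurger's actual formulation is given in [20]; as a greatly simplified illustration of the shape of such a policy (not the real formula), a share-derived target could be shrunk in proportion to the fraction of the space found idle, scaled by a configurable tax rate:

```c
#include <stdio.h>

/* Greatly simplified idle-memory "tax": only the shape of the idea,
 * not Waldspurger's formula. A share-proportional target is reduced
 * according to the fraction of the space found idle. */
static unsigned long taxed_target(unsigned long share_target,
				  double idle_fraction,
				  double tax_rate)
{
	double keep = 1.0 - tax_rate * idle_fraction;

	if (keep < 0.0)
		keep = 0.0;
	return (unsigned long)(share_target * keep);
}

int main(void)
{
	/* A space entitled to 1000 pages, with 40% of them idle,
	 * taxed at a 75% rate, ends up with a smaller target. */
	printf("target = %lu\n", taxed_target(1000, 0.40, 0.75));
	return 0;
}
```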

Application to the Linux VMM. This approach to proportionally sharing main memory could easily be generalized so that it can apply to address spaces within Linux instead of virtual machines on an ESX Server. Specifically, we must describe how shares of main memory space can be assigned to each address space. Given those shares, target RSS values can be calculated in the same manner as for virtual machines in the original research.

The taxation scheme requires that the system be able to measure the active use of pages in each address space. Waldspurger used a sampling strategy where some number of randomly selected pages for each virtual machine were access protected, forcing minor page faults to occur upon the first subsequent reference to those pages, and therefore giving the ESX Server an opportunity to observe those page uses. The same approach could be used within the Linux VMM, where a random sampling of pages in each address space would be access protected. Alternatively, a sampling of the pages' reference bits could be used to monitor idle memory.

Waldspurger observes that, within a normal OS, the use of access protection or reference bits, taken from virtual memory mappings, will not detect references that are a result of DMA transfers. However, since such DMA transfers are scheduled within the kernel itself, those references could also be explicitly counted with help from the DMA scheduling routines.

2 Waldspurger refers to such space as being idle.



6.3 Future directions

The description above does not address a particular problem: shared memory. Space can be shared between threads or processes in a number of ways, and such space presents important problems that we must solve to achieve a complete solution.

The problem of shared spaces. The assignment of shares to an address space can be complicated when that address space is shared by processes in different process groups or service classes. One simple approach is for each address space to belong to a specific service class. In this situation, its share would be defined only by that service class, and not by the processes that share the space. Another approach would be for each shared address space to adopt the highest share value of the tasks or processes sharing it. In this way, an important process will not be penalized because it is sharing its space with a less important process.
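
Under the second approach, the effective share of a shared space is simply the largest share among its sharers; a trivial, hypothetical helper makes the rule explicit:

```c
/* Hypothetical rule: a shared address space adopts the largest share
 * among the tasks or classes that map it, so sharing with a less
 * important process never penalizes an important one. */
static unsigned int shared_space_share(const unsigned int *sharer_shares,
				       int nr_sharers)
{
	unsigned int share = 0;

	for (int i = 0; i < nr_sharers; i++)
		if (sharer_shares[i] > share)
			share = sharer_shares[i];
	return share;
}
```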

Note that this problem does not apply only to memory mapped files and IPC shared memory segments, but to any shared space. For example, the threads of a multi-threaded process share a virtual address space. Similarly, when a process calls fork(), it creates a new virtual address space whose pages are shared with the original virtual address space using the copy-on-write (COW) mechanism. These shared spaces must be assigned proportional shares even though the tasks and processes using them may themselves have differing shares.

Proportionally sharing page faults. The goal of proportional share scheduling is to fairly divide the utility of a system among competing clients (e.g., processes, users, service classes). It is relatively simple to divide the utility of the CPU because that utility is linear and independent. One second of scheduled CPU time yields a nearly fixed number of executed instructions.3 Therefore, each additional second of scheduled CPU time yields a nearly constant increase in the number of executed instructions. Furthermore, the utility of the CPU does not depend on the computation being performed: every process derives equal utility from each second of scheduled CPU time.

Memory, however, is a more complex resource because its utility is neither independent nor linear. Identical RSS values for two processes may yield vastly different numbers of page faults for each process. The number of page faults is dependent on the reference patterns of each process.

To see the non-linearity in the utility of memory, consider two processes, A and B. Assume that for A, an RSS of p pages will yield m misses, where m > 0. If that RSS were increased to p + q pages, the number of misses incurred by A may take any value m′ where 0 ≤ m′ ≤ m.4 Changes in RSS do not imply a constant change in the number of misses suffered by an address space.

The proportional sharing of memory space, therefore, does not necessarily achieve the stated goal of fairly dividing the utility of a system. Consider that A should receive 75% of the system, while B should receive the remaining 25%. Dividing the main memory space by these proportions could yield heavy page faulting for A but not for B.

3 We assume no delays to access memory, as memory is a separate resource from the CPU.

4 We ignore the possibility of Belady's anomaly [1], in which an increase in RSS could imply an increase in page faults. While this anomaly is likely possible for any real, in-kernel page replacement policy, it is uncommon and inconsequential for real workloads.


Note also that none of the 25% assigned to B may be idle, and so Waldspurger's taxation scheme will not reduce its RSS. Nonetheless, it may be the case that a reduction in RSS by 5% for B may increase its misses only modestly, and that an increase in RSS by 5% for A may reduce its misses drastically.

Ultimately, a system should proportionally share the utility of main memory. We consider this topic a matter of significant future work. It is not obvious how to measure online the utility of main memory for each address space, nor how to calculate target RSS values based on these measurements. Balancing the contention between fairness and throughput for virtual memory must be considered carefully, as it will be unacceptable to achieve fairness simply by forcing some address spaces to page fault more frequently. We do, however, believe that this problem can be solved, and that the utility of memory can be proportionally shared just as with other resources.

7 Conclusion and Future Work

In this paper we make a case for providing kernel support for class-based resource management that goes beyond the traditional per process or per group resource management. We introduce a framework for classifying tasks and incoming network packets into classes, monitoring their usage of physical resources and controlling the allocation of these resources by the kernel schedulers based on the shares assigned to each class. For each of four major physical resources (CPU, disk, network and memory), we discuss ways in which proportional sharing could be achieved using incremental modifications to the corresponding existing schedulers.

Much of this work is in its infancy and the ideas proposed here serve only as a starting point for future work and for discussion in the kernel community. Prototypes of some of the schedulers discussed in this paper are under development and will be made available soon.

8 Acknowledgments

We would like to thank team members from the Linux Technology Center, particularly Theodore Ts'o, for their valuable comments on the paper and the work on individual resource schedulers. Thanks are also in order for numerous suggestions from the members of the kernel open-source community.

References

[1] L. A. Belady. A study of replacement algorithms for virtual storage. IBM Systems Journal, 5:78–101, 1966.

[2] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An Architecture for Differentiated Services. RFC 2475, Dec 1998.

[3] Jonathan Corbet. A new deadline I/O scheduler. http://lwn.net/Articles/10874.

[4] Jonathan Corbet. Anticipatory I/O scheduling. http://lwn.net/Articles/21274.

[5] Jonathan Corbet. The Continuing Development of I/O Scheduling. http://lwn.net/Articles/21274.

[6] IBM DeveloperWorks. Inbound connection control home page. http://www-124.ibm.com/pub/qos/paq_index.html.


[7] Pawan Goyal, Xingang Guo, and Harrick M. Vin. A Hierarchical CPU Scheduler for Multimedia Operating Systems. In Usenix Association Second Symposium on Operating Systems Design and Implementation (OSDI), pages 107–121, 1996.

[8] Bert Hubert. Linux Advanced Routing & Traffic Control. http://www.lartc.org.

[9] Sitaram Iyer and Peter Druschel. Anticipatory scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O. In 18th ACM Symposium on Operating Systems Principles, October 2001.

[10] M. Kravetz, H. Franke, S. Nagar, and R. Ravindran. Enhancing Linux Scheduler Scalability. In Proc. 2001 Ottawa Linux Symposium, Ottawa, July 2001. http://lse.sourceforge.net/scheduling/ols2001/elss.ps.

[11] Paul E. McKenney. Stochastic Fairness Queueing. In INFOCOM, pages 733–740, 1990.

[12] Ingo Molnar. Goals, Design and Implementation of the new ultra-scalable O(1) scheduler. In 2.5 kernel source tree documentation (Documentation/sched-design.txt).

[13] Jason Nieh, Chris Vaill, and Hua Zhong. Virtual-time round-robin: An O(1) proportional share scheduler. In 2001 USENIX Annual Technical Conference, June 2001.

[14] P. Pradhan, R. Tewari, S. Sahu, A. Chandra, and P. Shenoy. An Observation-based Approach Towards Self-Managing Web Servers. In IWQoS 2002, 2002.

[15] Quality of Service White Paper. Integrated QoS: IBM WebSphere and Cisco Can Deliver End-to-End Value. http://www-3.ibm.com/software/webservers/edgeserver/doc/v20/QoSwhitepaper.pdf.

[16] Prashant J. Shenoy and Harrick M. Vin. Cello: A disk scheduling framework for next generation operating systems. In ACM SIGMETRICS 1998, pages 44–55, Madison, WI, June 1998. ACM.

[17] Antonio Vargas. fairsched+O(1) process scheduler. http://www.ussg.iu.edu/hypermail/linux/kernel/0304.0/0060.html.

[18] Dinesh Verma, Mandis Beigi, and Raymond Jennings. Policy based SLA Management in Enterprise Networks. In Policy Workshop, 2001.

[19] T. Voigt, R. Tewari, D. Freimuth, and A. Mehra. Kernel Mechanisms for Service Differentiation in Overloaded Web Servers. In 2001 USENIX Annual Technical Conference, Jun 2001.

[20] Carl A. Waldspurger. Memory resource management in VMware ESX Server. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, December 2002.

[21] Carl A. Waldspurger and William E. Weihl. Stride scheduling: Deterministic proportional-share resource management. Technical Report MIT/LCS/TM-528, 1995.

[22] Werner Almesberger, Jamal Hadi Salim, and Alexey Kuznetsov. Differentiated Services on Linux. In Globecom, volume 1, pages 831–836, 1999.


[23] J. Wroclawski. The Use of RSVP with IETF Integrated Services. RFC 2210, Sep 1997.

Trademarks and Disclaimer

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM is a trademark or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

Linux is a trademark of Linus Torvalds.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other trademarks are the property of their respective owners.


Linux Support for NUMA Hardware

Matthew Dobson, Patricia Gaughen, Michael Hohnbaum
IBM LTC, Beaverton, Oregon, USA

[email protected], [email protected], [email protected]

Erich Focht

NEC HPCE, Stuttgart
[email protected]

Abstract

New large CPU-count machines are being designed with non-uniform memory architecture (NUMA) characteristics. The 2.5 Linux® kernel includes many enhancements in support of NUMA machines. Data structures and macros are provided within the kernel for determining the layout of the memory and processors on the system. These enable the VM subsystem to make decisions on the optimal placement of memory for processes. This topology information is also exported to user-space via sysfs. In addition to items that have been incorporated into the mainline Linux kernel, there are NUMA features that have been developed that continue to be supported as patchsets. These include NUMA enhancements to the scheduler, multipath I/O, and a user-level API that provides user control over the allocation of resources in respect to NUMA nodes.

1 Introduction

1.1 Non-Uniform Memory Architecture

Demand for greater computing capacity has led to the increased use of multi-processor computers. Most multi-processor computers are considered Symmetric Multi-Processors (SMP), as each processor is equal and has equal access to all system resources (e.g., memory and I/O busses). SMP systems generally are built around a system bus that all system components are connected to and that is used to communicate between the components. As SMP systems have increased their processor count, the system bus has increasingly become a bottleneck. One solution that is gaining in use by hardware designers is Non-Uniform Memory Architecture (NUMA).

NUMA systems co-locate a subset of the system's overall processors and memory into nodes, and provide a high speed and high bandwidth interconnect between the nodes; see Figure 1. Thus there are multiple physical regions of memory, but all memory is tied together into a single cache-coherent physical address space. The resulting system has the property that for any given region of physical memory, some processors are closer to it than other processors. Conversely, for any processor, some memory is considered local (i.e., it is close to the processor) and other memory is remote. Similar characteristics may also apply to the I/O busses—that is, I/O busses may be associated with nodes.

While the key characteristic of NUMA systems is the variable distance of portions of memory from other system components, there are numerous NUMA system designs.


At one end of the spectrum are designs where all nodes are symmetrical—they all contain memory, CPUs, and I/O busses. At the other end of the spectrum are systems where there are different types of nodes—the extreme case being separate CPU nodes, memory nodes, and I/O nodes. All NUMA hardware designs are characterized by regions of memory being at varying distances from other resources, thus having different access speeds.

To maximize performance on a NUMA platform, Linux must take into account the way the system resources are physically laid out. This includes information such as which CPUs are on which node, which range of physical memory is on each node, and what node an I/O bus is connected to. This type of information describes the topology of the system.

There are several challenges Linux must address to provide NUMA support. These include:

• discovery and internal representation of the system topology

• minimization of traffic over the interconnect between nodes

• localization of memory references

• fair access to locks

• I/O locality

• synchronization of time between nodes

• the location of low address memory (e.g., memory with physical address under 4 GB) all on the first node, on i386™ processor (and potentially other 32-bit processor architectures) based machines

• scheduling of processes and groups of processes on the same node.

Figure 1: Simple view of NUMA system

Figure 2: NUMA system with supernodes

Linux running on a NUMA system obtains optimal performance by keeping memory accesses to the closest physical memory. For example, processors benefit by accessing memory on the same node (or closest memory node), and I/O throughput gains by using memory on the same (or closest) node to the bus the I/O is going through. At the process level, it is optimal to allocate all of a process's memory from the node containing the CPU(s) the process is executing on. However, this also requires keeping the process on the same node.

This paper looks at how Linux addresses these NUMA challenges, focusing on the NUMA support that is available in the 2.5 development kernel. In addition, some discussion is included about additional NUMA support that is under development for future Linux releases.

1.2 Hardware Implementations

There are many design and implementation choices that result in a wide variety of NUMA platforms. This variety creates additional challenges for the Linux OS developer, as a single solution is desired to support the many types of NUMA hardware.


This section discusses hardware implementations and provides examples and descriptions of NUMA hardware implementations.

Types of nodes

The most common implementation of NUMA systems consists of interconnecting symmetrical nodes. In this case, the node itself is an SMP system that has some form of high speed and high bandwidth interconnect linking it to other nodes. Each node contains some number of processors, physical memory, and I/O busses. Typically, there is a node-level cache. This type of NUMA system is depicted in Figure 1.

A variant on this design is to only put the processors and memory on the main node, and then have the I/O busses separate. Another design option is to have separate nodes for processors, memory, and I/O busses which are all interconnected.

It is also possible to have nodes which contain nodes, resulting in a hierarchical NUMA design. This is depicted in Figure 2.

Types of interconnects

There is no standardization of interconnect technology. More relevant to Linux is the topology of the interconnect. NUMA machines exist that use the following interconnect topologies:

• ring topology – each node is connected to the node on either side of it. Memory access latencies can be non-symmetric; that is, accesses from node A to node B might take longer than accesses from node B to node A.

• crossbar interconnect – all nodes connect into a common crossbar.

• point to point – each node has a number of ports to connect to other nodes. The number of nodes in the system is limited to the number of connection ports plus one, and each node is directly connected to each other node.

• mesh topologies – more complex topologies that, like point to point topologies, are built upon each node having a number of connection ports. But unlike point to point topologies, there is not a direct connection between each node. Hypercube and torus topologies are examples of mesh topologies.

The topology provided by the interconnect affects the distance between nodes. This distance needs to be accounted for when Linux is making resource placement decisions.

Latency ratios

One important measurement for determining the "NUMA-ness" of a system is the latency ratio. This is the ratio between memory latency for on-node memory accesses versus off-node memory accesses. Depending upon the topology of the interconnect, there might be multiple off-node latencies. This latency can be used to analyze the cost of memory references to different parts of the physical address space, and thus be used to influence decisions affecting memory usage.


Specific NUMA implementations

Several hardware vendors are building NUMA machines that run the Linux operating system. This section briefly describes some of these machines, but is not an all-inclusive survey of the existing implementations.

One of the earlier commercial NUMA machines is the IBM® NUMA-Q® box. This machine is based upon nodes which contain 4 processors (i386), memory, and PCI busses. Each node also contains a management module to coordinate booting, monitor environmentals, and communicate with the system console. The nodes are interconnected using a ring topology. Up to 16 nodes can be connected for a maximum of 64 processors and 64 GB of memory. Remote to local memory latency ratios range from 10:1 to 20:1. Each node has a large remote cache which helps compensate for the large remote memory latencies. Much of the Linux NUMA development has been on these boxes due to their availability.

NEC builds NUMA boxes using Intel™ Itanium™ processors. The most recent system in this line is the NEC TX7. The TX7 supports up to 32 Itanium2 processors in nodes of 4 processors each. The nodes are connected by a crossbar and grouped in two supernodes of four nodes each. The crossbar provides fast access to non-local memory with low latency and high bandwidth (12.8 GB/second per node). The memory latency ratio for remote to local memory in the same supernode is 1.6:1. The remote to local memory latency ratio for outside the supernode is 2.1:1. There is no node-level cache. I/O devices are connected through PCI-X busses to the crossbar interconnect and thus are all the same distance to any CPU/node.

The large IBM xSeries® boxes use Intel processors and the IBM XA-32™ chipset. This chipset provides an architecture that supports four processors, memory, PCI busses, and three interconnect ports. These interconnect ports allow point to point connection of up to 4 nodes for a 16 processor system. Also supported is a connection to an external box with additional PCI slots to increase the I/O capacity of the system. The IBM x440 is built on this architecture with Intel Xeon™ processors.

2 Linux NUMA Support

The basic infrastructure for supporting NUMA hardware has been incorporated into the Linux 2.5 development kernel. This support includes topology discovery and internal representation, memory allocation, process scheduling, and timer support. In addition there are kernel extensions in support of NUMA that are not yet included in the mainline kernel, but are being maintained as separate patchsets. These include NUMA-aware multi-path I/O, application-level directed binding of memory to nodes, and scheduler extensions.

2.1 CONFIG Options

Most NUMA support options within the kernel are enabled by the CONFIG_NUMA option. This includes the scheduler extensions, NUMA memory allocation, and topology support. There is also an option, CONFIG_DISCONTIGMEM, that is used for enabling a portion of the NUMA memory support.

2.2 Linux Architecture Support

Linux supports many types of processors and many hardware architectures. Within Linux, when reference is made to an architecture it typically refers to the processor type (e.g., i386, Power4™, alpha, etc.). Subarchitectures are used to refer to a substantially different hardware implementation of a particular architecture. Also, within an architecture there are platforms which are an implementation of the architecture.



For example, the x440 is part of the i386 architecture within Linux. However, the x440 is also a unique subarchitecture within the i386 architecture. Another example is the IA64 architecture, which has DIG-64 and HP platforms, and the SGI-SN1 subarchitecture.

Throughout this paper, references are made to architectures which are more correctly subarchitectures. References to architectures are meant to refer to a specific hardware implementation of a NUMA system.

3 Topology

There are performance penalties involved in accessing hardware devices (CPUs, memory, disks, network cards, etc.) that are remote to the currently executing CPU. These can be significantly reduced by having knowledge of the system's topology, and using that information to make good scheduling, allocation, migration, and I/O decisions. Topology information is crucial to the kernel for making good decisions on a NUMA machine, but it is also important to some user-space applications.

Topology information is currently used in the kernel to schedule processes and allocate memory. This has contributed to performance improvements for NUMA architectures throughout the 2.3, 2.4, and 2.5 kernel series.

3.1 Topology Elements

The topology of a system includes all hardware components that make up the system. However, for the context of this paper, topology is restricted to those physical elements which are directly affected by the NUMA characteristics of the system. These elements are nodes, processors, memory, and I/O busses. Physical components not considered here consist of the actual I/O devices which are connected into the system through the I/O busses.

CPUs provide the computing power in the system. The location of the individual CPUs in the overall topology is extremely important for scheduling decisions. The current in-kernel topology API exposes 4 CPU related functions: cpu_to_node(), node_to_cpumask(), node_to_first_cpu(), and pcibus_to_cpumask(). Information about these functions is provided in the next section. The two-node system in Figure 1 contains 8 CPUs, 4 on node 0, and 4 on node 1.

A Memory Block, or memblk for short, represents a physically contiguous piece of memory. It is typically used to represent all the memory in a particular memory bank on a particular node. A node is permitted to have either 0 or 1 memory blocks. For example, in Figure 1, we have a two-node system. Its total memory is split into two memblks, one on node 0, one on node 1. In UP/SMP systems, all the memory in the system is represented by a single memory block.

I/O bus elements represent physical I/O busses in the underlying system. These are important elements for operations like scheduling disk I/O, networking tasks, or any I/O intensive process. By utilizing knowledge about I/O locality, processes can ensure they run efficiently by constraining themselves to CPUs and memory blocks on or close to the I/O bus they utilize.

When discussing NUMA, the word node is often overloaded. For the purposes of a Linux topology discussion, a node is solely an abstract container. Nodes are not meant to represent any physical element of the underlying architecture. All elements, including nodes themselves, are children of a node in the system.


The node element is designed to be a medium through which queries can be made. For example, in Figure 1, we see a simple illustration of a two node system. Each node has 4 CPUs in it, as well as a block of memory. To find out which memory block is closest to CPU 3, a process can determine that CPU 3 is a child of node 0, and that memblk 0 is also a child of node 0. Thus, for efficiency, a process running on CPU 3 would want to be sure that its memory is allocated from memblk 0.

On some systems nodes can be nested, as seen in Figure 2. This is important with things like hyperthreading and multi-core processors becoming more available, which can be easily represented as a small node with processors, but no memory or I/O busses. Another usage of nested nodes is for hierarchical NUMA systems, such as the NEC TX7, which is built with supernodes that contain nodes.

There exists a strong possibility that there will be a need to introduce new topology elements in the future. Due to the simplicity of the design of the topology subsystem, adding new elements is a straightforward procedure. As long as there is a parent-child relationship between the new element and nodes, the new element should drop right in to the existing infrastructure.

3.2 Topology Kernel Functions

The following is a list of topology-related kernel calls that form the basis of the current topology framework. Along with the description of each call is a default return value for the non-NUMA case. Architecture specific definitions of these kernel calls are provided for each architecture that has NUMA topology support. Most architectures simply use the default asm-generic/topology.h version, usually because the architecture does not support NUMA.

• cpu_to_node(int cpu) – Returns the number of the node containing CPU cpu. For non-NUMA, defaults to 0.

• memblk_to_node(int memblk) – Returns the number of the node containing memory block memblk. For non-NUMA systems, defaults to 0.

• parent_node(int node) – Returns the number of the node containing node node. If the node number returned is node, node is a top-level node. For non-NUMA, defaults to 0.

• node_to_cpumask(int node) – Returns a bitmask of the CPUs on node node. For non-NUMA, defaults to cpu_online_map.

• node_to_first_cpu(int node) – Returns the number of the first CPU on node node. For non-NUMA, defaults to the lowest-numbered CPU in the system.

• node_to_memblk(int node) – Returns the number of the Memory Block, if any, on node node. For non-NUMA, defaults to 0.

• pcibus_to_cpumask(int bus) – Returns a bitmask of the CPUs closest to PCI bus bus. For non-NUMA, defaults to cpu_online_map.

• numa_node_id() – Returns the number of the node containing the current CPU. For non-NUMA systems, defaults to 0.
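
To illustrate how these calls compose, the following self-contained example models a hypothetical two-node, eight-CPU machine with stand-in definitions (the real ones come from the architecture's topology headers, or the asm-generic/topology.h fallbacks mentioned above) and uses them the way kernel code might:

```c
#include <stdio.h>

/* Stand-ins for the in-kernel topology calls described above, here
 * modeling a hypothetical two-node, eight-CPU machine so that the
 * example is self-contained. */
static int cpu_to_node(int cpu)              { return cpu / 4; }
static unsigned long node_to_cpumask(int n)  { return 0xfUL << (n * 4); }
static int node_to_first_cpu(int n)          { return n * 4; }
static int numa_node_id(void)                { return 0; }

static int cpus_share_node(int a, int b)
{
	return cpu_to_node(a) == cpu_to_node(b);
}

int main(void)
{
	int cpu = 5;
	int node = cpu_to_node(cpu);

	printf("CPU %d is on node %d (first CPU %d, mask 0x%lx)\n",
	       cpu, node, node_to_first_cpu(node), node_to_cpumask(node));
	printf("CPUs 1 and 5 share a node? %s\n",
	       cpus_share_node(1, 5) ? "yes" : "no");
	printf("current node: %d\n", numa_node_id());
	return 0;
}
```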

3.3 Closest Element versus Distance Matrix

The current implementation of the topology system is very helpful if the caller is looking for information relating to the closest element. This choice was made primarily because it made the code small and compact, but also because the majority of consumers of this information simply want the closest element.


The other option, and possible future method, is to use a distance matrix approach. Using this approach, each machine type would build a latency matrix representing the distance from element X to element Y. The distance matrix approach allows us much more flexibility when retrieving information, and much more complicated queries can be satisfied. We decided against this approach, however, because it was determined that the added complexity did not offer that much benefit, and likely would have few consumers. However, in hierarchical NUMA systems this type of approach is more likely to be required for optimal performance benefits.

Exporting Topology Information to User Space

Topology information is important to multi-threaded user-space applications. With a large parallel NUMA machine, threads can be coordinated across, but more importantly within, nodes. This can yield significant speedups over a standard SMP version run on the same machine. There is also a proposal to facilitate the sharing of memory regions by establishing bindings for those regions. This would allow multi-threaded applications to specify memory blocks close to the set of CPUs the group of threads is executing on, and guarantee pages faulted into specific memory regions (likely shared) come from those memory blocks.

User-space applications currently have access to NUMA topology information through sysfs. The information is laid out following the normal directory structure. Node directories contain CPU and memory block directories, as well as other node directories on machines that take advantage of nested nodes. Each of these directories has files in it, with those files containing various bits of information that can be read and/or written. For example, the nodes contain a cpumap file that contains a bitmap of CPUs on that node.

Currently I/O busses are not represented in the topology directory of sysfs. Adding this is a future work item.

There is currently no way for a user-space application to determine the CPU it is currently executing on. This data is inherently volatile, as it requires going into kernel-space to get it, and while returning from the kernel it is possible for the process to be switched to another CPU, thus invalidating the information that is about to be returned. There are other ways to give user-space access to this information; for example, by mapping a page that is shared between user-space and kernel-space and having the kernel store the CPU that the process is currently executing on at a set location within that page.

4 Memory

This section describes some of the issues encountered during the development of 2.5 to support NUMA memory allocation, and the Linux implementation to address the issues. The purpose of this section is not to provide an in-depth look at the Linux memory subsystem—there is documentation available on the net for that [1].

4.1 Discontiguous Memory Support

Each architecture needs to describe its physical layout to the kernel. This includes specifying which address ranges belong to which node, and whether there are holes in between those ranges (a hole is a physical address range for which there is no real memory). CONFIG_DISCONTIGMEM is currently used to represent a solution to some of these problems. The name of the config option is a bit of a misnomer because the memory may not be discontiguous.


In the case of the IBM x440 the memory is contiguous, except for a large hole on the first node.

The core data structure for describing the physical layout is the pg_data_t. This data structure currently has a 1:1 mapping to nodes. For each node in the system, there exists one pg_data_t. The pg_data_t describes the start and end of memory for the node, a pointer to the zones for the node, and related information. Support for multiple pg_data_t's has been in the kernel since 2.4 (although several fixes and optimizations have occurred since then). It is up to each architecture to populate these correctly for their system.
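
In outline, each node's descriptor carries the node's zones, the start and size of its memory, its node id, and a pointer to its boot-time allocator data. The structure below is a heavily simplified illustration in the spirit of pg_data_t, not the actual kernel definition:

```c
/* Heavily simplified, illustrative per-node memory descriptor in the
 * spirit of pg_data_t; field names are made up and incomplete. */

#define MAX_NR_ZONES 3    /* e.g., DMA, NORMAL, HIGHMEM */

struct zone_stub    { unsigned long free_pages; };
struct bootmem_stub { unsigned long start, end; };

struct node_data {
	struct zone_stub    zones[MAX_NR_ZONES]; /* per-node zones       */
	unsigned long       start_pfn;           /* first page frame     */
	unsigned long       nr_pages;            /* size of node memory  */
	int                 node_id;
	struct bootmem_stub *bdata;              /* boot-time allocator  */
	struct node_data    *next;               /* list of all nodes    */
};
```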

The config option CONFIG_DISCONTIGMEM turns on the functionality for creating multiple pg_data_t's. CONFIG_NUMA turns on the code (i.e., scheduling decisions, allocation decisions) that makes use of the per-node pg_data_t's.

Zone Normal Memory on 32-bit Systems

During the setup of memory in system initialization a special allocator is used—the bootmem allocator. This allocator only allocates memory out of what will later become ZONE_NORMAL. Once memory is set up, the bootmem allocator is no longer used. The bootmem_data_t represents the address range used by the bootmem allocator. The pg_data_t for the node containing the memory has a pointer to the bootmem_data_t (bdata).

One of the early issues encountered during the development of i386 CONFIG_DISCONTIGMEM support was that not all pg_data_t's will have a portion of ZONE_NORMAL, or a bootmem_data_t.

On i386, ZONE_NORMAL is limited to the first 896 MB of physical memory because of limitations of the 32-bit architecture. So, a system populated with 1 GB of RAM per node will only have a ZONE_NORMAL (and ZONE_DMA) on node 0; the rest of the nodes will only contain ZONE_HIGHMEM. Because the slab allocator only allocates memory from ZONE_NORMAL, and the kernel uses the slab allocator to allocate memory for internal data structures, most kernel related memory will be on node 0. This also means that during early boot only the first node's memory will be used by the bootmem allocator.

Two changes were made to make ZONE_NORMAL only on node 0 work: (1) a null pointer dereference bug was fixed in __alloc_pages(), which didn't check that the node had ZONE_NORMAL before using it; (2) alloc_bootmem_node() and friends needed to be made to only use the first node's pg_data_t, because the other nodes would not have a bootmem_data_t. Since alloc_bootmem_node() is architecture independent, it was important to not put arch specific requirements in the code, so the changes were made in the header files. Thus the creation of CONFIG_HAVE_ARCH_BOOTMEM_NODE.

Page to Node Translation

Finding the node a memory address belongs to is needed throughout the kernel. The core routine doing the translation from address to node id is pfn_to_nid(). Because it is called often, it is important that the translation is fast. On some architectures, the physical layout is such that the first 64 GB of address space belongs to the first node (whether or not there really is 64 GB of RAM), the second 64 GB to the second, and so on. This makes the algorithm for figuring out what node an address belongs to very simple, and very fast: if the address is between 0 and 64 GB it's node 0.


But on the i386 NUMA architectures that have been tested, it is not as clear. Because the memory is contiguous, there isn't a simple GB-to-node translation. The solution was to create a mapping of addresses to node IDs on 256 MB address ranges. The map is created during memory setup and allows for fast translations.
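
Conceptually the lookup works as in the sketch below: physical memory is covered in 256 MB chunks, each chunk records its owning node, and the translation reduces to a shift and an array index. The table size and initial layout here are hypothetical, not the actual i386 implementation:

```c
#include <stdio.h>

#define PAGE_SHIFT    12
/* Each map element covers 256 MB = 2^28 bytes of physical address
 * space, i.e. 2^(28 - PAGE_SHIFT) pages. */
#define ELEMENT_SHIFT (28 - PAGE_SHIFT)
#define MAX_ELEMENTS  256               /* enough for 64 GB here */

/* Filled in once during memory setup from the firmware's memory map. */
static signed char physnode_map[MAX_ELEMENTS];

static int pfn_to_nid(unsigned long pfn)
{
	return physnode_map[pfn >> ELEMENT_SHIFT];
}

int main(void)
{
	/* Hypothetical layout: first 1.5 GB on node 0, next 1 GB on node 1. */
	for (int i = 0; i < MAX_ELEMENTS; i++)
		physnode_map[i] = i < 6 ? 0 : (i < 10 ? 1 : -1);

	unsigned long pfn = (2UL << 30) >> PAGE_SHIFT;   /* page at 2 GB */
	printf("pfn %lu is on node %d\n", pfn, pfn_to_nid(pfn));
	return 0;
}
```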

4.2 Node Aware Memory Allocation

One of the features enabled by CONFIG_NUMA is that the system makes NUMA-aware memory allocation decisions. The current policy is that when memory is allocated, the kernel tries to allocate from the local node. If that fails, the allocator will allocate from the other nodes. The exception is in the case of a system with ZONE_NORMAL only on the first node; in an i386 NUMA box, for example, memory allocated from ZONE_NORMAL will only be allocated from the first node.
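
The policy amounts to preferring the local node and falling back to the remaining nodes in order; the self-contained sketch below (with hypothetical free-page counters, not the real zonelist machinery) shows that ordering:

```c
#include <stdio.h>

#define MAX_NUMNODES 4

/* Hypothetical per-node free page counters and current-node value,
 * standing in for the real allocator state. */
static long free_pages[MAX_NUMNODES] = { 0, 3, 0, 5 };
static int this_node = 0;

/* Try the local node first; if it has no free pages, fall back to
 * the other nodes. Returns the node allocated from, or -1 if the
 * whole machine is out of memory. */
static int alloc_page_node_aware(void)
{
	for (int i = 0; i < MAX_NUMNODES; i++) {
		int candidate = (this_node + i) % MAX_NUMNODES;
		if (free_pages[candidate] > 0) {
			free_pages[candidate]--;
			return candidate;
		}
	}
	return -1;
}

int main(void)
{
	for (int i = 0; i < 4; i++)
		printf("allocated from node %d\n", alloc_page_node_aware());
	return 0;
}
```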

The allocation policies only apply to new memory. Should a process migrate across nodes, the memory related to the process will not be migrated. However, if the memory is swapped out, when the pages are swapped back in they will be swapped in to the node the process has migrated to. One thing to keep in mind is that migrating the memory is expensive; however, if a page is being accessed often, it would be a performance benefit to move it to the local node.

When the kernel is attempting to allocate memory and the system is low on pages, kswapd will be woken to address the low memory issue. Without NUMA awareness, kswapd may free up lots of memory by swapping pages out, but it may not make available memory local to the node that is in need of memory. The solution was to make kswapd per node. kswapd monitors the memory on the local node. When memory needs to be freed, it is freed from the local node. Rmap [3] made this change to kswapd possible, because of the ability to find the virtual address(es) associated with a physical address (local to the node).

4.3 Node Local Kernel Data Structures

For kernel data structures that are frequently accessed and have node specific information, it makes sense to have their data structures in node local memory. On most architectures, when the bootmem allocator is available, it is possible to allocate memory on a specific node through the use of alloc_bootmem_node(). However, on i386 the bootmem allocator only allocates from node 0, so alloc_bootmem_node() doesn't work for allocating per node and all memory is allocated from node 0. Because of the limited lifespan of the bootmem allocator, alloc_bootmem_node() is not a complete solution. No other generic mechanism is available at this time for allocating data structures on a per node basis. A possible solution for a generic mechanism is currently in development by Bill Irwin [2].

To work around this 32-bit architecture limitation, for the specific case of the mem_map and pg_data_t, Martin Bligh has successfully made these two types of data structures reside in node local memory. That is, these structures are located on the node for the memory that they are describing.

The first phase was to make mem_map per node. This was done by reserving pages at the top of the node and decrementing the size of the address space by the size of mem_map, and then making use of that reserved space when mem_map was set up. Nothing special had to be done for node 0, because it is where the bootmem allocator gets its memory for node 0's data. So, for node 0 the normal bootmem allocator can be used. Phase two was to make pg_data_t per node.


This was done using the same method as for the mem_map.

4.4 Replication

Since kernel text is read-only on production systems, there is little downside to replicating it and placing a copy on each node. This does consume extra memory, but kernel text is relatively small and the memories of NUMA machines relatively large. Kernel-text replication requires special handling from debuggers when setting breakpoints and from /dev/mem in cases where users are sufficiently insane to modify kernel text on the fly. In addition, kernel-text replication means that there is no longer a single "well-known" offset between a kernel-text virtual address and the corresponding physical address.

This functionality is present in some architectures (e.g., sgi-ip27) in the 2.4 kernel. Also, the IA64 discontigmem patch provides kernel text replication support for IA64. It is not likely to show up in the i386 tree because of the limitations of the architecture.

4.5 Memory Binding

As mentioned in other sections of this paper, writing code to run on a NUMA machine can require changes to take advantage of the interesting hardware configurations these machines offer. Memory Binding is one API that we feel large user-space programs will be able to use to make significant performance improvements for NUMA. The idea behind Memory Binding is that processes can selectively bind ranges of their virtual memory space to particular blocks of memory, according to different allocation policies. For example, a large database program that has many threads could bind its threads to CPUs on two nodes, and also bind a large section of its shared memory to the memory that belongs on those two nodes. By setting a policy that enforces an equal distribution of pages, the database could be sure that all its shared pages are at least on the same set of nodes as its processes, and that the memory is evenly spread across those nodes. The Memory Binding API is available as a patch from:

http://www-124.ibm.com/linux/patches/?patch_id=753
http://www-124.ibm.com/linux/patches/?patch_id=754

5 NUMA scheduler

5.1 Introduction

As explained in the introductory section, accessing the memory of a remote node implies taking penalties in memory access latency and bandwidth. Therefore, it is desirable to keep processes on or near the node on which their memory (or most of it) is allocated.

The old (pre-2.5) Linux scheduler wasn't aware of the NUMA structure of a machine. Processes could migrate to any CPU in the system if that CPU was less loaded. On NUMA machines with many CPUs, two scalability problems additionally limited performance: the CPUs were competing for the runqueue lock, and the time needed for selecting the task to be scheduled next grew linearly with the length of the runqueue.

The scalability problems were mostly solvedby the O(1) scheduler[4]. Like other ap-proaches [5, 6] it implements per CPU run-queues1 avoiding the lock starvation problem.Additionally it implements anO(1) search al-gorithm for the task to be scheduled next.The scalability problems for SMP machineswere solved, but theO(1) scheduler was not

1There are actually two runqueues per priority levelper CPU.

Page 179: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 179

NUMA-aware, either. An idle CPU could eas-ily steal a task from the node where its mem-ory was allocated letting it run with degradedmemory performance.

5.2 NUMA scheduler approaches

The first notable Linux NUMA scheduler was the one Andrea Arcangeli made on top of the old Linux scheduler [7]. It implemented per-node runqueues and scheduled across node boundaries only after failing to find an optimal CPU within the same node. Being built on top of the old Linux scheduler, this approach suffered from very similar scalability limitations.

Another approach was the extension of the IBM MQ scheduler [5] to allow rescheduling only inside pools of CPUs [8]. A load-balancing module was added which allowed periodic rebalancing across the pool boundaries.

The first NUMA scheduler on top of the O(1) scheduler was designed and implemented by Erich Focht [9]. Tasks were assigned a home node at creation time (either at fork() or at exec(), selectable for each task), allocated their memory on (or near) the home node, and were attracted by the home node's CPUs; the tasks were node-affine. Because the scheduler changes were too complex for inclusion into the 2.5 kernel baseline, Erich Focht, Michael Hohnbaum, Martin Bligh, and Andrew Theurer collaborated with the goal of stripping down and rewriting the node-affine scheduler into a slim NUMA variant acceptable for inclusion. The result was included in the 2.5.59 kernel and is described in the following section.

5.3 NUMA scheduler in the 2.5 kernel: implementation

When stripping down the node-affine scheduler, the goals were to keep the changes to the O(1) scheduler as small as possible, and to add NUMA awareness by making it difficult for a task to change nodes while trying to keep the node load well-balanced. This was achieved by three patches.

Initial load balancing at exec()

The NUMA support for the memory subsystem described in section 4 ensures that memory pages are allocated from the node on which the page-faulting task is running (provided the current node has sufficient free memory). Normally processes allocate most of their memory right after creation; therefore, the choice of the initial node and CPU is very important for getting well-balanced nodes.

Initial load balancing implies some overhead because it involves scanning the current node loads and determining the best CPU on which the freshly created task should be scheduled. This can be done either at fork() or at exec(). Doing it at fork() (and clone()) has the advantage that multi-threaded jobs lead to a balanced machine as well. This might be desirable on machines with good latency ratios between the nodes. On the other hand, every small and short-lived thread picks up the initial balancing overhead, unnecessarily migrates pages to other nodes by copy on write (COW), and finds a cold instruction cache.

Doing initial balancing at exec() avoids the COW problem because all pages are dropped at that stage. Short-lived threads which don't exec() benefit from a warm instruction cache. But long-running, memory-intensive multi-threaded programs might pick up performance penalties due to the unbalanced nodes.

The implementation adds the array static atomic_t node_nr_running[MAX_NUMNODES] to keep track of the number of tasks running on each node. kernel/sched.h is extended by three functions:

• sched_best_cpu(): Finds the least loaded CPU on the least loaded node, using the current runqueue lengths.

• sched_migrate_task(): Migrates a task to a certain CPU.

• sched_balance_exec(): Called by do_execve(), it moves the current task to the least loaded CPU. (A minimal sketch of this path is shown below.)
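The following is an illustrative sketch of how these pieces fit together, not the actual 2.5.59 code; node_ncpus(), node_cpu(), nr_running_on() and migrate_to_cpu() are hypothetical stand-ins for the real topology and runqueue accessors.

/* Illustrative sketch of exec-time initial balancing. */
static atomic_t node_nr_running[MAX_NUMNODES];

static int sched_best_cpu(void)
{
        int node, i, best_node = 0, best_cpu = 0;
        int best_node_load = INT_MAX, best_cpu_load = INT_MAX;

        /* Least loaded node, normalized by the number of CPUs on it. */
        for (node = 0; node < numnodes; node++) {
                int load = atomic_read(&node_nr_running[node])
                           / node_ncpus(node);
                if (load < best_node_load) {
                        best_node_load = load;
                        best_node = node;
                }
        }

        /* Least loaded CPU on that node, by current runqueue length. */
        for (i = 0; i < node_ncpus(best_node); i++) {
                int cpu = node_cpu(best_node, i);
                if (nr_running_on(cpu) < best_cpu_load) {
                        best_cpu_load = nr_running_on(cpu);
                        best_cpu = cpu;
                }
        }
        return best_cpu;
}

void sched_balance_exec(void)
{
        /* Called from do_execve(): move the current task to the best CPU. */
        migrate_to_cpu(current, sched_best_cpu());
}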

Intra-node load balancing

The load_balance() and find_busiest_queue() functions of the O(1) scheduler have been modified to restrict the search for the busiest CPU to the set given by the new cpumask argument. In the NUMA scheduler this mask uses topology information and usually limits the search to the current node. To be precise: all calls to load_balance() except the one from the timer interrupt are balanced only within the current node.

Cross-node load balancing

Even with perfect initial load balancing, a machine can easily end up with poorly balanced nodes, e.g. nodes with more running tasks than available CPUs alongside idle nodes. In such cases, it is preferable to use the idle CPUs for doing real work even if the tasks running on them need to access memory from other nodes. It is better if a task runs more slowly on a remote node than if it waits for a CPU on its own node. The cross-node balancing occurs periodically during the timer interrupt; with the current settings (kernel 2.5.67) this means: an idle CPU will try node-rebalancing every 5 ticks (5 ms on a HZ=1000 system); a busy CPU will do it every 20 s.

There continues to be debate as to the frequency of the busy rebalance, with some believing the busy rebalance is occurring much too infrequently. It is felt that the current frequency, while showing advantages on simple benchmarks, is not optimal for real-world conditions.

Node rebalancing is achieved by a change in scheduler_tick() and three additional routines:

• rebalance_tick(): Decides when to balance within the node and when across the node boundaries. In the latter case it will first try an intra-node rebalance.

• balance_node(): Calls load_balance() with cpumask set to the least-loaded node plus the current CPU.

• find_busiest_node(): Finds the busiest node, using a geometrically decaying weight for the load measure: load_t = load_{t-1}/2 + nr_node_running_t. This flattens out sudden load peaks. (A minimal sketch of this calculation follows below.)
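The decaying average can be illustrated with a short sketch, reusing the node_nr_running[] array introduced above; it shows the formula only and is not the exact kernel routine.

/* Decaying node-load measure: load_t = load_{t-1}/2 + nr_node_running_t. */
static int prev_node_load[MAX_NUMNODES];

static int find_busiest_node(int this_node)
{
        int node, busiest = -1, max_load = 0;

        for (node = 0; node < numnodes; node++) {
                /* Halve the old estimate and add the instantaneous load. */
                int load = prev_node_load[node] / 2
                           + atomic_read(&node_nr_running[node]);
                prev_node_load[node] = load;
                if (node != this_node && load > max_load) {
                        max_load = load;
                        busiest = node;
                }
        }
        return busiest;
}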

5.4 Current limitations and future developments

The NUMA scheduler currently implemented in the Linux kernel is far from complete. The degree of NUMA-awareness of the scheduler gives clear performance boosts for “simple” load situations like parallel kernel compiles or an arbitrary but more or less constant number of similar and long-running exec'd processes. The limitations show up in environments with long-running jobs, with suddenly varying loads, or with long-running multithreaded applications like OpenMP.


The stripped-down version of the node-affine scheduler strongly reflects the influence of former IBM work [8]. Some of the useful features of [9] were lost, among them the capability of a process to remember the node on which its memory resides and to return to that node. A scheduler with such features is in production on NEC TX7 IA64 servers and shows significant benefits in production environments. Thus possible extensions of the 2.5 NUMA scheduler could be:

• An option to allow particular tasks to initially balance their children at fork().

• A method of keeping track of where one task's memory is.

• A method of pushing tasks to the node where most of their memory resides.

6 Locking

In contrast to much of the other NUMA work, NUMA-aware locking is not about making a per-node lock, but rather it is about preventing lock starvation on highly-contended locks. Lock starvation occurs when the contention on a given lock is so high that by the time a CPU releases the lock, at least one other CPU on that same node is requesting it again. NUMA latencies mean that these local CPUs can acquire the lock when it becomes available faster than remote CPUs can. On some architectures, the CPUs on the node where the lock is located can monopolize the lock, completely starving CPUs on other nodes.

The best solution is to reduce lock contention, but NUMA-aware locks can be an interim fix while the locking design is reworked.

Two fair locking primitives are:

• mcs locks [10]

MCS locks are queued locks. The primitive enforces fairness because requesters are queued. The queuing ensures that the local CPU does not have an advantage in getting the lock; it is first come, first served. (A minimal sketch of such a queued lock follows after this list.)

• NUMA-aware locks [11]

NUMA-aware locks enforce fairness by using a round-robin system amongst nodes waiting for a lock. The implementation was written so that the fairness algorithm could be modified to fit the need. This means that, if round robin proves inefficient, another method can be inserted.
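As an illustration of the queued-lock idea, here is a minimal MCS-style lock using GCC atomic builtins; memory barriers and architecture details are omitted, and this is not the code from [10].

#include <stddef.h>

/* Each waiting CPU spins on its own node, so the lock is handed over
 * in strict FIFO order regardless of NUMA distance. */
struct mcs_node {
        struct mcs_node *next;
        volatile int locked;
};

void mcs_lock(struct mcs_node **tail, struct mcs_node *self)
{
        struct mcs_node *prev;

        self->next = NULL;
        self->locked = 1;
        /* Atomically append ourselves at the tail of the waiter queue. */
        prev = __sync_lock_test_and_set(tail, self);
        if (prev) {
                prev->next = self;
                while (self->locked)
                        ;       /* spin on our own cache line */
        }
}

void mcs_unlock(struct mcs_node **tail, struct mcs_node *self)
{
        if (!self->next) {
                /* No visible successor: try to mark the lock free. */
                if (__sync_bool_compare_and_swap(tail, self, NULL))
                        return;
                while (!self->next)
                        ;       /* a successor is still enqueueing itself */
        }
        self->next->locked = 0;         /* hand the lock to the next waiter */
}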

The work in the area of NUMA-aware locking is currently not active. As previously mentioned, the Linux solution for a highly-contended lock is to break the lock up, and so far lock starvation has not been seen to be a problem. Therefore a need for a NUMA-aware lock has not been established.

7 I/O

As with many aspects of writing software to run efficiently on NUMA platforms, I/O code benefits from fine-tuning for these machines. The following section goes into more detail about why I/O subsystems require NUMA considerations, the current state of Linux support of I/O on NUMA hardware, and where it might be going.

7.1 I/O Locality

As discussed in the topology section, on NUMA machines I/O busses are usually spread across nodes. When scheduling I/O we attempt to ensure that the memory being used for the I/O is close to the specific I/O bus we are using. Cross-node I/O requests suffer a performance penalty when compared to I/O requests that are constrained to a single node. Cross-node I/O travels across the node interconnect busses and has the potential to consume interconnect bandwidth, thus degrading the performance of other processes. If the memory buffers used for I/O are physically located in memory far from the I/O bus, there will also be delays for cross-node memory access. Ideally, the requesting process executes on a CPU on the node with the memory and I/O bus, thus eliminating any inter-node accesses.

7.2 Multi-Path I/O

While Multi-Path I/O, or MPIO for short, is not a new concept, it can be a particularly powerful tool on a NUMA platform. MPIO involves using multiple I/O adaptors (i.e., SCSI cards, network cards) to gain multiple paths to the underlying resource (i.e., hard disks, the network), thus increasing overall bandwidth. On SMP platforms, potential speedups due to MPIO are limited by the fact that all CPUs and memory typically share a bus, which has a maximum bandwidth. On NUMA platforms, however, different groups of CPUs, memory, and I/O busses have their own distinct busses. This allows potentially achieving larger aggregate I/O throughput by allowing each node to independently reach its maximum bandwidth. An ideal MPIO-on-NUMA setup consists of an I/O card (SCSI, network, etc.) on each node connected to every I/O device, so that no matter where the requesting process runs, or where the memory is, there is always a local route to the I/O device. With this hardware configuration, it is possible to saturate several PCI busses with data. This is even further assisted by the fact that many machines of this size will be using RAID or other MD devices, thus further increasing the potential bandwidth by using multiple disks.

There is a patch, currently against 2.5.59, that implements MPIO for the SCSI mid-layer in Linux. The SCSI layer is in the midst of many changes right now, some of which affect algorithms this patch was based on. This patch [12] is maintained by Patrick Mansfield, and is discussed in vastly more detail in another presentation at OLS.

7.3 Interrupt Routing and Balancing

Interrupt handling is another area where ignoring NUMA locality issues can be costly. When dealing with interrupts, it is important that they are handled locally. Some architectures and APIC setups prevent interrupts from being handled remotely by their design, but for those that don't, we must make sure that interrupts are kept local. What this means is that if, for example, an I/O device raises an interrupt, it should be handled by a CPU on the same node as the I/O device. At the same time, we don't want every interrupt occurring on a particular node to be handled by the same CPU. Currently the Linux kernel takes advantage of the balance-IRQ functionality, which changes the destination of individual IRQs to a different CPU after a certain number of ticks. This code is not aware of NUMA topology, though, and thus may sometimes make poor IRQ destination decisions. There is significant work still to be done in this area for NUMA support.

On some chipsets, IRQ balancing is provided by the hardware, for example the 460GX-related chipsets (used by the NEC TX7). This chipset provides either a fixed redirection or can redirect within a target node based on priorities.

8 Timers

On UP systems, the processor has a time source that is easily and quickly accessible, typically implemented as a register. On SMP systems, the processors' time source is usually synchronized, as all of the processors are clocked at the same rate, and thus synchronization of the time register between processors is a straightforward task.

On NUMA systems, synchronization of the processors' time source is not practical: not only does each node have its own crystal providing the clock frequency, but there tend to be minute differences in the frequencies that the processors are driven at, which leads to time skew.

On multi-processor systems it is imperative that there is a consistent system time. Otherwise time stamps provided by different processors cannot be relied upon for ordering, and if a process is dispatched on a different processor it is possible that there are unexpected jumps (backward or forward) in time.

Ideally, the hardware provides one global time source with quick access times. Unfortunately, global time sources tend to require off-chip access and often off-node access, which tends to be slow. Clock implementations are very architecture-specific, with no clear leading implementation amongst the NUMA platforms. On the x440, for example, the global time source is provided by node 0 and all other nodes must go off-node to get the time.

In Linux 2.5, the i386 timer subsystem has an abstraction layer that simplifies the addition of a different time source provided by a specific machine architecture. For standard i386 architecture machines, the TSC is used, which provides a very quick time reference. For NUMA machines, a global time source is used (e.g., on the x440 the cyclone timer).
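A hypothetical sketch of such an abstraction is shown below; the names are illustrative only and do not match the actual 2.5 i386 timer code, they merely show how a per-platform time source (TSC, cyclone timer, ...) can be plugged in behind a single interface.

/* Hypothetical per-platform time-source abstraction (illustration only). */
struct time_source {
        const char *name;
        int (*probe)(void);                     /* is this source usable here? */
        unsigned long (*get_offset_us)(void);   /* microseconds since last tick */
};

/* One instance would read the per-CPU TSC on standard i386 machines,
 * another would read a global chipset counter (e.g. the x440 cyclone
 * timer) on NUMA machines; setup code picks whichever probes first. */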

9 Summary

Much work has been done to provide NUMA support for the Linux kernel. At this point, the basic infrastructure is in place. Performance testing has shown measurable improvements, though they tend to be widely variable depending upon the workload and the NUMA hardware. As Linux gets used on more NUMA hardware platforms, there are bound to be additional areas exposed which will benefit from additional NUMA optimizations.

Some areas that are actively being worked on or considered for future work are:

• I/O busses in sysfs topology

• MPIO

• scheduler enhancements

• interrupt routing and balancing

• kernel data structure placement

• memory binding

• page migration

• timers

Legal Statement

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM, NUMA-Q, Power4, and xSeries are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Intel, i386, Itanium and Xeon are trademarks of Intel Corporation in the United States, other countries, or both.

Linux is a registered trademark of Linus Torvalds.


Other company, product, and service names may be trademarks or service marks of others.

References

[1] M. Gorman: “Understanding the Linux Virtual Memory Manager,” April 2003, http://www.csn.ul.ie/~mel/projects/vm/guide/html/understand/

[2] W. Irwin, March 2003, http://marc.theaimsgroup.com/?l=linux-kernel&m=104660383911943&w=2

[3] LWN.net: “Speeding up reverse mapping,” http://lwn.net/Articles/9555/

[4] Ingo Molnár, http://people.redhat.com/mingo/O(1)-scheduler/

[5] M. Kravetz, H. Franke: “Implementation of a Multi-Queue Scheduler for Linux,” April 2001, http://lse.sourceforge.net/scheduling/mq1.html

[6] Davide Libenzi, October 2001, http://www.xmailserver.org/linux-patches/lnxsched.html

[7] Andrea Arcangeli: “NUMA,” presentation at the UKUUG Manchester conference, June 2001, http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-numa.html

[8] H. Franke et al.: “PMQS: Scalable Linux Scheduling for High-End Servers,” http://lse.sourceforge.net/scheduling/als2001/pmqs.ps

[9] Erich Focht: “Node-Affine NUMA Scheduler,” Feb. 2002, http://home.arcor.de/efocht/sched

[10] John Stultz: “Nodeless MCS Lock,” http://www-124.ibm.com/linux/patches/?patch_id=218

[11] “NUMA Aware Locks,” http://lse.sourceforge.net/numa/locking

[12] Patrick Mansfield: “SCSI Mid-Level Multi-path/port storage,” http://www-124.ibm.com/storageio/multipath/scsi-multipath/index.php


Kernel configuration and building in Linux 2.5

Kai Germaschewski, University of Iowa

[email protected]

Sam Ravnborg, Ericsson Diax A/S

[email protected]

Abstract

The development phase of Linux 2.5 brought substantial changes to the kernel configuration process, the actual kernel build and, in particular, the implementation and building of loadable modules.

The first part of this paper will give an overview of the user-visible changes which occurred in Linux 2.5, on the one hand for users who build kernels themselves, on the other hand for developers who maintain drivers or other parts of the kernel, in order to help porting to Linux 2.5/2.6.

The second part of the paper deals with the actual design and implementation of the current kbuild, showing how GNU make is actually flexible enough to allow for nice, condensed Makefile fragments which, per subdirectory, describe which objects to build into the kernel or as loadable modules. The paper ends with an outlook showing possible approaches for implementing additional features.

The paper also explains the improvements in handling loadable kernel modules, including symbol versioning, and the necessary build system changes.

1 Introduction and history

Why is a kernel build system necessary at all, and why does the Linux kernel use its own special solution?

As the Linux kernel evolved from a student's terminal emulation program towards a full-featured UNIX-like kernel, changes to the way it was built became necessary and were integrated, so the kernel build system basically followed the evolutionary development of the kernel itself.

In the science world, in particular among people running numerical simulations, many consider a build system completely unnecessary; they just run

f77 code.f
./a.out

However, this approach obviously doesn't scale to large projects. To keep projects maintainable, some kind of modularization occurs: the code is divided into a number of source files, and as the project grows further, a directory hierarchy is introduced which helps organize the code even more.

During development, normally only one or a few files are edited and then the developer wants to rebuild the program, in this case the kernel vmlinux, be it just for compile-time checks or testing.

First of all, one does not want to enter all the commands manually for each build, so some type of script is necessary to record those commands. Next, it is actually a waste to recompile every file if only a few have changed. Smart programmers recognized this a long time ago and invented a tool called make, which is still the most popular build tool used today. We assume in this paper that the audience is familiar with the basics of make.

So even Linux 0.01 already came with a Makefile which took care of building the kernel.

As time passed and Linux matured, new features were incorporated into the build system, such as:

• Automatic generation of dependency information. make only handles simple dependencies, like the dependency of an object file on the corresponding source, automatically; other prerequisites, for example included header files, need to be added to the Makefile explicitly, a task which can be (and was) automated.

• Configurability. As the code base for the Linux kernel expanded, a need for a user-selectable configuration became apparent and was introduced before the release of Linux 1.0. This system allows the user to answer questions with respect to which components are desired, and then only builds those components into the kernel.

• Different architectures and cross-compilation. Linux introduced support for different architectures, which means the kernel is built from a large arch-independent code base as well as some machine-specific low-level code. It is also often necessary to cross-compile the kernel, i.e. do the compilation on a different platform than the one it actually runs on.

• Loadable modules. With Linux 1.3, support for loadable kernel modules was introduced, which again needed special support in configuration and building of those objects.

In particular the high configurability and support for loadable modules distinguish the Linux kernel from most other projects, and it thus comes as no surprise that its build system also evolved away from a standard Makefile. However, make is still the underlying tool used for building the kernel. In fact, the extensibility of the GNU version of make [1] in conjunction with some support scripts / C code renders it possible to meet the goals listed above.

2 A dummy’s guide to kbuild

This section is addressed to users and will explain how to use the kernel build system in Linux 2.5/2.6. “Users” (as opposed to “developers”) here means people who download the Linux kernel source tree, possibly apply patches, and then build and install their own kernels. Of course, since kernel developers need to build and run kernels too, this section is of relevance for them as well.

The build system is based on GNU make, i.e. all commands are given to make by invoking it as

make <target>

Contrary to many userspace packages which use autoconf/automake, there is no preceding ./configure necessary; the necessary configuration process is embedded into the build process.

The actual targets are in part platform-specific; for example, on i386 one typically wants to build the boot image bzImage and modules. A list of supported targets for the platform can be obtained from make help.


Arch maintainers should set up their arch-specific Makefile in a way that invoking make without parameters will build the commonly used boot target for the architecture; for example, on i386 just typing

make

will build bzImage and modules (the latter only when CONFIG_MODULES is selected, of course), which is what is typically needed.

If one just runs make after unpacking the kernel source tarball, make will actually just error out, asking you to configure your kernel first by running make *config. (In Linux 2.4 and before, it would invoke make config for you, but this is the wrong choice in 99% of the cases, since nobody likes answering a straight sequence of a couple of hundred questions...)

To generate a new kernel configuration, it is recommended to use make menuconfig, make xconfig (which uses Qt now) or make gconfig (which uses GTK).

However, in most cases it is easier to adapt an existing kernel configuration to the current kernel than to create a new one from scratch. This is done by copying the .config file into the top-level directory of the source tree. kbuild will recognize that the .config file may need adaptation for the current kernel source and automatically run make oldconfig for you, which makes sure that .config is consistent with the current rules and asks the user about the value of previously non-existing options.

So the normal sequence for building a kernel is just

cp /my/old/.config .config
make

where one could insert a make *config between those two steps if a change of configuration options is desired.

The last remaining step is the installation of the newly built kernel. The procedure to install the boot image depends, of course, on the bootloader used.

For lilo, the kernel boot image bzImage should be copied to a certain location (typically /boot), then /etc/lilo.conf may need an appropriate entry, and finally /sbin/lilo must be run.

For grub, copying the kernel image to /boot and possibly editing /etc/grub.conf should suffice.

An important change is that on i386, bzImage/zImage can no longer be booted directly from a floppy disk. Instead, the targets zdisk and fdimage create a boot floppy disk and a boot disk image, respectively. These targets now require mtools and syslinux to be installed.

Since the actual installation of the boot image varies as described above, one can give the install target to make, which will invoke a user- or distro-provided script, ~/bin/installkernel or /sbin/installkernel, which can be customized for the local setup.

Installing modules is simpler: just invoking make modules_install will do the necessary work. By default this will install into /lib/modules/`uname -r`/, though this can be customized by setting INSTALL_MOD_PATH, e.g. if one wants to collect the modules for transfer onto a different machine.

This is basically all the knowledge which is needed to build a Linux kernel—everything else is handled automatically by the build system. Applying patches, editing files, changing configuration options or adding compiler flags—the build system will notice the change and rebuild whatever is needed. The one exception to this rule is changing the architecture (by setting the ARCH variable), which needs an explicit make distclean to work correctly.

3 kbuild for kernel developers

3.1 kbuild in the daily work

Since developers tend to build kernels and modules a lot, the previous section of course also applies to them, in particular the fact that just running make will recognize all changes and rebuild whatever is necessary to generate a consistent vmlinux and modules.

Some additional features exist to support the development / debugging process:

• make some/path/file.o will rebuild the single file given, using compiler flags (e.g. -DMODULE) according to the current .config.

• make some/path/file.i will generate a preprocessed version of some/path/file.c, again using compiler flags for the current configuration.

• make some/path/file.s will generate a file containing the raw assembler code for some/path/file.[cS].

• make some/path/file.lst (little known but very useful) gives interspersed assembler code with the C source, relocated to the correct virtual address when a current System.map exists.

Another useful feature for the daily work, which has existed for a long time, is the ability to override the SUBDIRS variable on the command line, which will force make to only descend into the given subtree. This can be very useful for faster build times, but it bypasses some dependencies and thus does not guarantee to result in a consistent state.

So while e.g. working on the hisax ISDN driver, it's useful to call make as

make SUBDIRS=drivers/isdn/hisax \
     modules

for compile checks etc. However, before installing a new kernel and modules, the authors advise to always run a full make bzImage/vmlinux/modules (or otherwise, do not complain ;).

3.2 Integrating a driver

Basically each subdirectory in the Linux kernel tree contains a file called Makefile, which is included by make during the kernel build process. However, these Makefiles are different from regular Makefiles in that they normally don't have any targets or rules, but only set variables which tell the build process what should be built; the latter takes control of the actual compiling and linking.

In conjunction with the Makefile there normally exists a Kconfig file; these files were introduced with the configurator rewrite by Roman Zippel and replace the old Config.in/Config.help files used during the configuration phase of the kernel build.

This paper does not intend to elaborate on the new kernel configuration system; however, the following examples will provide some basic usage guidance.

The most common case is adding a new driver which is built from a single source file.


config TIGON3
	tristate "Broadcom Tigon3 support"
	depends on PCI
	help
	  This driver supports Broadcom Tigon3 based gigabit Ethernet cards.

	  If you want to compile this driver as a module ( = code which can be
	  inserted in and removed from the running kernel whenever you want),
	  say M here and read <file:Documentation/modules.txt>. This is
	  recommended. The module will be called tg3.

Figure 1: Kconfig fragment for the Tigon3 driver.

Figure 1 shows the Kconfig fragment for the Tigon3 driver. It defines a config option TIGON3 (the corresponding variable will be given the name CONFIG_TIGON3), which is a tristate, i.e. it can have the values y, m, or n with the obvious meanings (a config option which has been turned off actually has the value "", to be correct here). The line depends on PCI states that this option is only selectable when the option PCI is also set (that is, if the kernel supports the PCI bus).

The Makefile fragment for the Tigon3 driver

obj-$(CONFIG_TIGON3) += tg3.o

is very short and, though a little awkward at first, a very elegant way to quickly express which files are supposed to be built. The idea dates back to Michael Elizabeth Chastain's dancing Makefiles [2] and was globally introduced into the kernel by Linus shortly before the release of kernel 2.4.

What happens is that, depending on the config option CONFIG_TIGON3, the value tg3.o is appended to one of the variables obj-y, obj-m or obj-.

The meaning of those special variables is as follows:

• obj-y. All objects listed in obj-y will be compiled as built-in objects, and will finally be linked into vmlinux.

• obj-m. All objects listed in obj-m and not listed in obj-y will be compiled as modules (so they actually end up being called e.g. tg3.ko in 2.5/2.6).

• obj-. All objects listed in obj- and not in obj-y or obj-m will be ignored by kbuild.

Since the build system does not have any further information on tg3.o, it will try to build it from a source file called tg3.c (or an assembler source tg3.S, which only happens in the architecture-dependent part of the kernel, though).

This is all that is needed to integrate a simple driver into the kernel build, other than of course writing the driver (tg3.c) itself.

It is also possible to list more than one object to be built in the Makefile statement. The Makefile line dealing with the eepro100 driver looks like the following:

obj-$(CONFIG_EEPRO100) += \
	eepro100.o mii.o

If this driver is selected, the mii.o support module also needs to be compiled, which is achieved by simply appending it to the statement.


Other network drivers will, if selected, also add mii.o to the list of objects to be built—this is fine, the build system handles this case. It is even possible that a support module like mii.o gets added to both the list of built-in objects obj-y and to obj-m—again, the build system recognizes this fact and just compiles the built-in version, which will also be usable for the drivers compiled as modules.

The new e100 driver exemplifies two more features. drivers/net/Makefile only contains the line

obj-$(CONFIG_E100) += e100/

which tells kbuild that it should descend into the e100/ subdirectory if the option CONFIG_E100 is set. What to do there will then be determined by drivers/net/e100/Makefile (Figure 2):

The first line after the comment looks familiar: it advises the build system to build e100.o built-in/modular depending on the value of CONFIG_E100. When CONFIG_E100 equals “m”, the e100 driver is built as a module and will be named e100.ko.

The next line then states that e100.o is a composite object which should be linked from the listed individual object files—these object files will automatically be compiled with the appropriate flags.

As a last point, instead of using the variable <modname>-objs to declare the components of the module <modname>.o, the variant <modname>-y can be used, which allows for easy definition of optional parts of composite modules, as seen in the example in Figure 3.

4 What is new in Linux-2.5/2.6's kbuild?

In this section, we describe some of the steps in the evolution of the kernel build system during the development phase of Linux 2.5. One purpose is to show how this evolution could actually be divided into small, Linus-compatible “piece-meal” patches without the famous “flag-day” patches and with only little breakage along the way.

We will also show how the extensions provided by GNU make were exploited to provide a better build system while still using a standard tool instead of creating a specialized build solution for the kernel from scratch.

We start by comparing drivers/isdn/Makefile in 2.4 and 2.5 (Figure 4), where many of the improvements are easily seen. (a) shows the Makefile as it is present in Linux 2.4.20, and (b) shows the simpler variant present in 2.5. kbuild has been adapted incrementally to allow the more concise syntax. The following sections will describe the internals that eventually allowed for the layout seen in (b).

4.1 O_TARGET / linking objects in subdirectories

First of all, we start with a short description of what the kbuild-internal implementation, which is hidden in the top-level Makefile and scripts/Makefile.*, typically does in a subdirectory: From the kbuild Makefile located in the subdirectory we obtain a list of what to build from the variables obj-y (built-in) and obj-m (modular), as explained in the previous section. The default target in scripts/Makefile.build is __build, and the corresponding rule, shown in Figure 5, defines what work needs to be done. Important here is that we build O_TARGET or L_TARGET, respectively, when building vmlinux, and obj-m when compiling modules. As opposed to 2.4, in 2.5 O_TARGET is a kbuild-internal variable


#
# Makefile for the Intels E100 ethernet driver

obj-$(CONFIG_E100) += e100.o

e100-objs := e100_main.o e100_config.o e100_phy.o \
	     e100_eeprom.o e100_test.o

Figure 2: /drivers/net/e100/Makefile

#
# Makefile for the Linux X.25 Packet layer.
#

obj-$(CONFIG_X25) += x25.o

x25-y := af_x25.o x25_dev.o x25_facilities.o x25_in.o \
	 x25_link.o x25_out.o x25_route.o x25_subr.o \
	 x25_timer.o x25_proc.o

x25-$(CONFIG_SYSCTL) += sysctl_net_x25.o

Figure 3: net/x25/Makefile

and no longer needs to be defined in the kbuild makefiles. Except for the rare case of building an actual library, O_TARGET is used in the built-in case, and we find the rule for how to make it as

$(O_TARGET): $(obj-y) FORCE
	$(call if_changed,link_o_target)

So O_TARGET is linked from the objects listed in obj-y, which contains files compiled locally in the current directory as well as objects which are built in subdirectories by descending. In Figure 6, we see how, going from the leaves to the root, the O_TARGET in each subdirectory (here always called built-in.o) accumulates the objects built below that directory until we finally end up with vmlinux at the root of the hierarchy containing all built-in objects generated throughout the tree (this example only shows a small fraction of the objects linked in a normal build).

vmlinux
|-- drivers/built-in.o
|   `-- isdn/built-in.o
|       |-- isdn.o
|       |   |-- isdn_common.o
|       |   `-- isdn_net.o
|       |
|       |-- hisax/built-in.o
|       |   |-- hisax.o
|       |   |   |-- config.o
|       |   |   `-- isdnl*.o
|       |   `-- hisax_fcpcipnp.o
|       |
|       `-- icn/built-in.o
|           `-- icn.o
|
`-- fs/built-in.o

Figure 6: The hierarchy for linking vmlinux


(a)

O_TARGET := vmlinux-obj.o
export-objs := isdn_common.o

list-multi := isdn.o
isdn-objs := isdn_net.o isdn_tty.o isdn_v110.o isdn_common.o
isdn-objs-$(CONFIG_ISDN_PPP) += isdn_ppp.o
isdn-objs += $(isdn-objs-y)

obj-$(CONFIG_ISDN) += isdn.o
obj-$(CONFIG_ISDN_PPP_BSDCOMP) += isdn_bsdcomp.o

mod-subdirs := hisax
subdir-$(CONFIG_ISDN_HISAX) += hisax
subdir-$(CONFIG_ISDN_DRV_ICN) += icn

obj-y += $(addsuffix /vmlinux-obj.o, $(subdir-y))

include $(TOPDIR)/Rules.make

isdn.o: $(isdn-objs)
	$(LD) -r -o $@ $(isdn-objs)

(b)

obj-$(CONFIG_ISDN) += isdn.o
obj-$(CONFIG_ISDN_PPP_BSDCOMP) += isdn_bsdcomp.o

isdn-y := isdn_net_lib.o isdn_fsm.o isdn_tty.o \
	  isdn_v110.o isdn_common.o

isdn-$(CONFIG_ISDN_PPP) += isdn_ppp.o

obj-$(CONFIG_ISDN_DRV_HISAX) += hisax/
obj-$(CONFIG_ISDN_DRV_ICN) += icn/

Figure 4: drivers/isdn/Makefile in (a) 2.4.20 and (b) adapted for the build system in 2.5

In Linux 2.4, the name for the object which accumulates all built-in objects in and below the current subdirectory was chosen by setting the variable O_TARGET in the local Makefile. In Linux 2.5, it is instead just set to built-in.o by the build system. This allows getting rid of the assignment of O_TARGET in every subdirectory Makefile and, more importantly, allows for further clean-up:

In kbuild-2.4, we need to explicitly add the subdirectories to descend into to the variables subdir-y/m, and then also add the subdir-generated built-in objects to obj-y so that they get linked. This is redundant and error-prone; in 2.5 it is sufficient to just add the objects generated in the subdirectories to the list of objects to be linked, and the build system will deduce from there that it needs to descend into the named subdirectories. To simplify things further, the name of the O_TARGET (now always being built-in.o) itself is left out and only the trailing slash is kept:


__build: $(if $(KBUILD_BUILTIN),$(O_TARGET) $(L_TARGET) $(extra-y)) \
	 $(if $(KBUILD_MODULES),$(obj-m)) \
	 $(subdir-ym) $(always)
	@:

Figure 5: The default __build rule from scripts/Makefile.build

obj-$(CONFIG_...HISAX) += hisax/
obj-$(CONFIG_...ICN) += icn/

4.2 Multi-part modules

As can be seen from Figure 4, a number of statements were necessary for generating a multi-part module, isdn.o in that example. First of all, the parts constituting the module need to be declared by assigning them to the variable isdn-objs. This step is of course essential and was kept in 2.5.

However, it was also necessary to declare that isdn.o is a multi-part module by listing it in the variable list-multi. This information is redundant, as it can be deduced by checking for the existence of <module>-objs, which is now done in 2.5.

Furthermore, in 2.4 a link rule has to be explicitly given for each multi-part object, which was annoying and error-prone. In the new build system, this link rule is generated by make, hitting just about the limits of what GNU make is capable of. We use a feature called “static pattern rules,” and the code looks like the following:

cmd_link_multi-m = $(LD) ... \
		   -o $@ $(link_multi_deps)

$(multi-used-m) : %.o: $(multi-objs-m) FORCE
	$(call if_changed,link_multi-m)

multi-used-m contains all multi-part modules we want to be built in the current directory, and multi-objs-m contains all of the individual objects those are built of. This makes each multi-part module in the directory depend on the set of all components for all multi-part modules in that directory, which is actually too large, as each module of course only depends on its own components; however, the latter is not implementable within the restrictions of GNU make. When doing the link, the variable link_multi_deps recovers the right list of components from the target $@, so that the linker is invoked correctly.

Another interesting detail is that here, as well as in other places, we need to uniquify the prerequisites, so that listing a component multiple times doesn't lead to a link error. GNU make offers the sort function, which throws away duplicates; however, it is unfortunately not usable for this purpose since it sorts, i.e. reorders its arguments, and thus changes the link/init order. The workaround here is to use the variable $^, which actually uniquifies the list of prerequisites exactly as needed. Finally, since $^ lists all prerequisites, which as mentioned above exceeds the list of components for the current module, we filter the uniquified list with that list of components to get the information we need.

4.3 Including Rules.make

Each subdirectory Makefile in kbuild-2.4 needed to include $(TOPDIR)/Rules.make explicitly. In 2.5, when descending into subdirectories, the build system now always calls make with the same Makefile, scripts/Makefile.build, which then again includes the local subdirectory Makefile, so the statement to include Rules.make could be dropped.


Furthermore, in 2.5 the build is still organized in a recursive way, i.e. make is only invoked to build the objects in the local subdirectory and other instances of make are spawned for the underlying directories. However, it does not actually descend into the subdirectories; it always does its work from the top-level directory and prepends the path as necessary. One of the advantages is that the output includes the correct paths, so a compiler warning will not show “inode.c: Warning ...”, but “fs/ext2/inode.c: ...”, which makes it easier to recognize where the problem occurs. More importantly, it allows the use of relative paths throughout the build, so that paths like “BUG in /home/kai/src/kernel/v2.5/linux-2.5.isdn/include/linux/fs.h” are history. Renaming/moving a kernel tree will no longer cause spurious rebuilds due to changing paths as seen above, and tools like ccache can work more effectively.

4.4 Objects exporting symbols

The old module symbol versioning scheme used with Linux 2.4 needed the Makefiles to declare which objects export symbols to modules, which was done by listing them in the variable export-objs. In 2.5, module versioning was completely redesigned, removing the need for this explicit declaration. The changes are so complex that they are awarded their own section in this paper.

Here we conclude the comparison between a 2.4 and a 2.5 subdirectory Makefile, where we have shown that all the redundant and deducible information has been removed and the necessary information is revealed to the build system in a very compact form.

Two additional important internal changes, which did not affect the subdirectory Makefile layout, will be described in the following:

4.5 Compiling built-in objects and modules in a single pass, recognizing changed command line arguments

The major performance issue for the kernel build is the number of invocations of make (most of the time is of course normally spent compiling / linking, but this cost is independent of the build system used). make has to read the local Makefile, the general rules and all the dependencies, and figure out the work to be done from there. An obvious way to optimize the performance of the build system is thus to avoid unnecessary invocations. In 2.4, make needs to do separate passes for modules and built-in objects, and within each directory it will even call itself again, so roughly a four-fold performance increase is possible by just combining those invocations into a single pass.

The primary reason why kbuild-2.4 needs two passes for built-in objects and modules lies in its flags handling. This means that it tries to check not only whether prerequisites have changed (e.g. the C source for an object), but also whether the compiler flags have changed.

This objective was achieved by generating a .<target>.flags file like the following (simplified) for each target built:

ifeq (-D__KERNEL__ -DMODULE -DEXPORT_SYMTAB, \
      $(CFLAGS) -DEXPORT_SYMTAB)
FILES_FLAGS_UP_TO_DATE += config.o
endif

On a rebuild, the Makefile would read all those .*.flags fragments and force all files which are not listed in FILES_FLAGS_UP_TO_DATE to be rebuilt.

The flaw of this method is that it cannot handle differing flags for different groups of files, so make needs to be invoked twice, once for the targets to be built in with the normal CFLAGS, and again for the modular targets with -DMODULE added to CFLAGS. In the example above it is also visible that the handling of -DEXPORT_SYMTAB is broken: this method cannot detect when a file was added to / removed from the list of files exporting symbols, since -DEXPORT_SYMTAB was hardcoded on both sides of the comparison and thus useless—the only way to fix this within the old framework would have been to invoke make four times, for all combinations of built-in/module and export/no-export.

A more flexible scheme to handle changing command lines within GNU make was created:

As an example, we present the rule which is responsible for linking built-in objects into a library in Figure 7. The actual use is pretty simple: instead of writing the command directly into the command part of the rule, it is assigned to the variable cmd_link_l_target, and the build system takes care of executing the command as necessary, keeping track of changes to the command line itself.

The implementation works as follows: After executing the command, the macro if_changed records the command line into the file .<target>.cmd. As make is invoked again during a rebuild, it will include those .*.cmd files. As it tries to decide whether to rebuild L_TARGET, it will find FORCE in the prerequisites, which actually forces it to always rerun the command part of the rule.

However, the command part of the rule now does the actual work: it checks whether any of the prerequisites changed, i.e. $? is non-empty, or whether the command line changed, which is achieved by the two filter-out statements. Only if either of those two conditions is met does if_changed expand to a command rebuilding the target; otherwise it is empty and the target will not be rebuilt.

The advantage of this method, apart from the easier use in a rule as shown above, is that all the checking is done within the context of the actual rule and not in an unrelated place later in the Makefile. This allows for the use and correct checking of GNU make's per-target variables, e.g.

modkern_cflags := $(CFLAGS_KERNEL)
$(real-objs-m) : \
	modkern_cflags := $(CFLAGS_MODULE)

which sets modkern_cflags to $(CFLAGS_KERNEL) by default, but to $(CFLAGS_MODULE) for objects listed in $(real-objs-m), i.e. for objects compiled as modules. The compilation rule can then just use $(modkern_cflags) to get the right flags for the current object, where the mechanism described above will take care of recognizing changes and acting accordingly.

4.6 Dependencies

Between configuration and building of a kernel, the old kernel build needed the user to run “make dep”, which served two purposes: it generated dependency information for the C source files on headers and other included files, and it generated the version checksums for exported symbols.

Both of these tasks have become unnecessary in 2.5, so the reliance on the user to rerun “make dep” as needed is gone (additionally, the system in 2.4 is broken in that in some modversions cases it is not even sufficient to rerun “make dep”; the only solution then is to do “make distclean” and start over).

2.4 used a small tool called mkdep to generate dependencies for C sources. This tool basically extracted the names of the included files out of the source, but did not actually recursively scan those includes. So, if foo.c includes foo.h, which itself includes bar.h, mkdep would only pick up the dependency of foo.c on foo.h, not the one on bar.h, even though foo.c also needs recompiling when bar.h changes.


cmd_link_l_target = rm -f $@; $(AR) $(EXTRA_ARFLAGS) rcs $@ $(obj-y)

$(L_TARGET): $(obj-y) FORCE
	$(call if_changed,link_l_target)

targets += $(L_TARGET)

[...]

if_changed = $(if $(strip $? \
		$(filter-out $(cmd_$(1)),$(cmd_$@)) \
		$(filter-out $(cmd_$@),$(cmd_$(1)))), \
	@set -e; \
	$(cmd_$(1)); \
	echo 'cmd_$@ := $(cmd_$(1))' > $(@D)/.$(@F).cmd)

Figure 7: Checking for a changed command line

This problem was solved in 2.4 by assuming that foo.h would reside in include/* (which is mostly, but not always, true). For those files it would generate another set of dependencies, basically:

foo.h: bar.h
	@touch $@

So as bar.h changes, this rule will update the timestamp on foo.h, which will then be seen by the rule for foo.c and cause foo.c to be rebuilt.

This method has several disadvantages:

• Changing the timestamp on files which have not actually been modified confuses a number of source management systems.

• It only works for header files in the include/* subdirectories.

• As foo.h is changed to also include baz.h, the dependency information does not get updated, so a subsequent change to baz.h will erroneously not cause foo.c to be recompiled.

• Starting from a clean tree, the user has to wait for the dependency information to be created (for all files, even for entire subsystems which may not be selected in the configuration at all), even though this information is totally useless for a first build—it is only useful for deciding whether a file needs to be rebuilt.

The build system in Linux 2.5 instead uses gcc's -MD flag to generate the dependency information during the build. This flag generates the full list of all files included during the compile, so in the example above it would generate “foo.o: foo.c foo.h bar.h” (and “baz.h” as that gets added). This procedure is much simpler, and it gets around all the disadvantages listed above.

The only quirk, which is applied similarly in 2.4 and 2.5, is related to the high configurability of the Linux kernel.

Using the gcc-generated list of dependencies as-is has the drawback that virtually every file in the kernel includes <linux/config.h>, which then again includes <linux/autoconf.h>.

If a user reruns make *config to change a configuration option, linux/autoconf.h will be regenerated. make will notice this and rebuild every file which includes autoconf.h, i.e. basically all files. This is correct, but extremely annoying if the user just changed some option CONFIG_THIS_DRIVER from n to m.

So we use the same trick that mkdep applied before. We replace the dependency on linux/autoconf.h by a dependency on every config option which is mentioned in any of the listed prerequisites.

The effect is that if a user changes the CONFIG_THIS_DRIVER option, only the objects which (themselves, or in any of their included files) reference CONFIG_THIS_DRIVER will be rebuilt, which most likely is only this one driver.

5 Modules and the kernel build process

The implementation of loadable kernel modules has been substantially rewritten by Rusty Russell in the 2.5 development cycle. These changes are so complex that this paper will not attempt to describe them in detail. Instead, we concentrate on the changes which were done in the build system to accommodate the new concepts.

5.1 Module symbol versions

Loadable modules need to interface with the kernel. They do this by accessing certain data structures and functions which have been marked as exported symbols in the source. That means not all global symbols in the kernel are accessible to modules, but only an explicitly exported API.

These symbols remain unresolved in the loadable module objects at build time and are then resolved at load time, either by an external program, modutils, in 2.4, or by an in-kernel loader in 2.5. A common problem is that Linux does not guarantee a stable binary interface to modules; in fact, the binary interface often changes between releases in a stable kernel series and even depending on the configuration of the kernel. One simple example is struct net_device, which embeds a spinlock_t. If the kernel is configured for uni-processor operation, this lock expands to nothing, so the layout of struct net_device changes. When calling register_netdev(struct net_device *), where the in-kernel function register_netdev() assumes the SMP layout while the module set up the argument in the UP layout, we have an obvious mismatch which often leads to hard-to-explain kernel crashes.
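The layout problem can be illustrated with a simplified example; this is not the real struct net_device, and it relies on GCC's zero-size empty struct extension, which is what makes the UP spinlock disappear.

/* Simplified illustration of the configuration-dependent layout change. */
#ifdef CONFIG_SMP
typedef struct { volatile unsigned int lock; } spinlock_t;
#else
typedef struct { } spinlock_t;  /* zero bytes on UP (GCC extension) */
#endif

struct netdev_like {
        char name[16];
        spinlock_t lock;        /* 4 bytes on SMP, 0 bytes on UP ...        */
        unsigned long state;    /* ... so this member shifts, and a module
                                   built with the other config corrupts it */
};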

Other operating systems solve this problem by prescribing a stable ABI between kernel and modules; in Linux, however, it is preferred not to carry around binary compatibility layers and cope with inflexible interfaces. Instead, since the source is openly accessible, one just needs to recompile the modules so that they match the kernel.

Now, it is easily possible for users to get this wrong, and we thus want a way to detect version mismatches and refuse to load the modules, or at least warn. This is what “module symbol versioning” accomplishes. The basic idea is to analyze the exported symbols, including the types of the arguments for function calls, and generate a checksum for a specific layout. If anything changes in the ABI, the versioning process will generate a different checksum and thus detect the mismatch. The main work in this scheme is done by the program genksyms, which is basically a C parser that reads a preprocessed source file and finds the definitions for the exported symbols from there.

This procedure has caused trouble in the build system for a long time. In Linux 2.4, the “make dep” stage, apart from building dependency information, preprocesses all source files which export symbols (that is why they need to be specifically declared in the Makefiles) and then generates include/linux/modversions.h, which mangles the exported symbols with the generated checksum, using the C preprocessor. The kernel will then not export the symbol register_netdev, but instead register_netdev_R43d2381. A module referencing register_netdev will end up with an unresolved symbol register_netdev_R43d2381, so loading it into the kernel will work fine. Had the module, however, been built against a different kernel or a different configuration, the checksum would have changed and any attempt to load it would result in an error about unresolved symbols.

This implementation was rather fragile, as it relied on the user to rerun “make dep” whenever the version information had possibly changed, and even if only one symbol changed, that basically forced a recompilation of every file. In addition, some of the optimizations made in 2.4’s build system were actually broken, leading to the well-known fact that it can get into a state where not even running “make dep” will recover from generating inconsistent version information, and starting over from “make mrproper/distclean” is needed.

Module versioning is still a challenge to the build system in 2.5; the underlying reason is that it introduces cross-directory dependencies, which a recursive build system cannot easily handle. For example, the ISDN module drivers/isdn/hisax/hisax.ko uses register_isdn(), which is exported by drivers/isdn/isdn_common.o. So building hisax.ko needs knowledge of the checksum generated from drivers/isdn/isdn_common.o, but it has no way to make sure that it is up-to-date since it is located in a different subdirectory.

Module versioning is instead implemented as a two-stage process: the first stage is the normal build, which also generates all the checksums. After this stage is completed, we can be sure that all checksums are up-to-date, and then just record this up-to-date information into the modules. This is one of the reasons why modules have been renamed with a “.ko” extension: the first stage just builds the normal “.o” objects, and afterwards a postprocessing step follows, which builds “.ko” modules adding version checksums for unresolved symbols and other information.

In more detail, the following steps are executed:

• Compiling

Knowledge of which source files export symbols is not required up front. When an EXPORT_SYMBOL(foo) is encountered, the definition of EXPORT_SYMBOL from include/linux/module.h will generate special sections with tables containing the name of the symbol, its address and its checksum. Actually, since the checksum is not known at this time, the value of the checksum is set to a symbol called __crc_foo. This is a trick which makes it possible to use the linker to record the checksum even after the object file has already been compiled.
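A simplified sketch of the idea (not the exact 2.5 macro; the struct layout and section names only follow the general scheme described above):

    /* The checksum slot is initialised with the address of an undefined
     * symbol __crc_foo, so the linker can supply the real value later. */
    struct kernel_symbol {
            unsigned long value;
            const char *name;
    };

    #define EXPORT_SYMBOL_SKETCH(sym)                               \
            extern unsigned long __crc_##sym;                       \
            static const struct kernel_symbol __ksymtab_##sym       \
                    __attribute__((section("__ksymtab"))) =         \
                    { (unsigned long)&sym, #sym };                  \
            static const unsigned long __kcrctab_##sym              \
                    __attribute__((section("__kcrctab"))) =         \
                    (unsigned long)&__crc_##sym;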

Once the object file has been generated, we check it for the existence of the special section mentioned above. If it exists, the source file did export symbols and genksyms is run to obtain the checksums for those symbols. Finally, these checksums are entered into the object using the linker in conjunction with a small linker script.
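A hedged sketch of this per-file step; the file names and exact command lines are illustrative rather than the literal temporaries kbuild uses:

    # preprocess the source and let genksyms compute the checksums
    gcc $CFLAGS -E -D__GENKSYMS__ isdn_common.c | genksyms > isdn_common.ver
    # isdn_common.ver now holds linker-script assignments such as
    #   __crc_register_isdn = 0x66d136e2;
    # a relocatable link merges the absolute __crc_* symbols into the object
    ld -r -o isdn_common.o isdn_common.tmp.o -T isdn_common.ver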


$ nm drivers/isdn/i4l/isdn.ko | grep __crc
86849dd0 A __crc_isdn_ppp_register_compressor
843d2381 A __crc_isdn_ppp_unregister_compressor
66d136e2 A __crc_register_isdn

Figure 8: Examining the checksums for exported symbols

The checksums can easily be examined by running the command shown in Figure 8.

• Postprocessing

After stage one, we have the checksums for the exported symbols embedded within vmlinux and the modules. What is yet to be done is recording the checksums into the consumers, that is, adding the checksums for unresolved symbols into the modules.

This step was initially handled by a small shell script but is now done by a C program for performance reasons, which also handles other postprocessing needs like generating aliases.

This program basically reads all the exported symbols and their checksums from all modules, and then scans the modules for unresolved symbols. For each unresolved symbol, an entry in a table associating the symbol string with the checksum is made; this table is output as C source module.mod.c and compiled and linked into the final .ko module object.

Figure 9 shows an excerpt from drivers/isdn/hisax/hisax.mod.c, which calls register_isdn(). The checksum obviously matches the checksum for the exported symbol in drivers/isdn/i4l/isdn.ko, so that the module will load without complaint.

An additional advantage of the new way of handling module version symbols, apart from being cleaner from a build system point of view, is that the actual symbols are not mangled, so it became possible to force a module load even if the checksums do not match, though the kernel will set the taint flag in that case.

The module postprocessing step, introduced mainly for the module symbol versioning, allowed for a number of additional features, such as module alias / device table handling, additional version checks, as well as recognition of unresolved symbols during the build stage.

6 Conclusion and Outlook

This paper presented an introduction to using the kernel build system for the Linux kernel 2.5 and 2.6 for users who want to compile their own kernels and developers working on kernel code. We also showed how in the transition from kbuild-2.4 to 2.5, features of GNU make could be applied to remove redundant information and allow for simpler Makefile fragments as well as a more consistent and foolproof build system.

Additionally, parts of the internal implementation have been described and an overview of changes related to the new module loader and new module versioning system has been given.

The kernel build system in 2.5 has been improved significantly, but some features remain to be implemented.


static const struct modversion_info ____versions[]
        __attribute__((section("__versions"))) = {
        { 0xfa7bbba7, "struct_module" },
        { 0x66d136e2, "register_isdn" },
        { 0x1a1a4f09, "__request_region" },
        [...]

Figure 9: Excerpt from drivers/isdn/hisax/hisax.mod.c, generated by the postprocessing stage

Separate source and object directories

As opposed to kernel 2.4, source files are not altered or touched during the build in 2.5 anymore, enhancing interoperability with source management systems. The next step is to allow for completely separate source and object directory trees, so that the source can be completely read-only and multiple builds at the same time from the same source are possible. The current code in 2.5 has taken preparatory steps for this feature but work is not completed yet.

Non-recursive build

It is an open question whether it is actually advisable to switch to a non-recursive build system. Obviously, distributing build information with the source files is desirable; this trend is visible in, e.g., the split of the global Configure.help file into per-directory fragments which eventually were unified with the new Kconfig configuration info. Of course it is essential to keep the build information in the per-subdirectory Makefiles distributed as it is currently; it would be a step back to collapse it into one big file.

However, this does not preclude collecting the distributed information when starting a build and generating a global Makefile, which is then used as the main stage. The advantage of this method is that it can handle cross-directory dependencies more easily, whereas the current system has to resort to a two-stage process for module post-processing. On the other hand, a global Makefile, which also needs to incorporate dependencies for all files, will use a significant amount of memory and may turn out to be problematic on low-end systems.

There are two ways to implement a global Makefile: One possibility is using GNU make itself, replacing the rules to actually compile / link objects by dummy routines recording the necessary actions into a global Makefile. The second possibility is, as the subdir Makefiles have a very consistent form by now, to write a specialized parser for those files and have that generate a global Makefile.

Whether switching to a non-recursive build system is worth the tradeoffs will be investigated in the Linux 2.7 development cycle.



Device discovery and power management in embedded systems

David Gibson
OzLabs, IBM Linux Technology Center

[email protected], [email protected]

Abstract

This paper covers issues in device discovery and power management in embedded Linux systems. In particular, we focus on the IBM® PowerPC® 405LP (a “system-on-chip” CPU designed for handheld applications) and IBM’s PDA reference design based upon it. Peripherals in embedded systems are often connected in an ad-hoc manner and are not on a bus which can be scanned or probed. Thus, knowledge of what devices are present must be built into the kernel at compile time. We examine how the new unified device model provides a clean method for representing this information, while allowing good re-use of code from machine to machine. The 405LP includes a number of novel power management features, in particular the ability to very rapidly change CPU and bus frequencies. We also examine how the device model provides a framework for representing constraints the peripherals and their interconnections place upon allowable frequencies and other information relevant to power management.

1 Introduction: the device discovery problem

Device discovery is the process the kernel and its device drivers use to determine what peripheral devices are present in a machine and how to communicate with them. Generally this means determining what IO addresses, interrupt lines and/or other bus specific addresses and resources are associated with each device.

Usually there are a few peripherals that are present in every machine of a particular type. Then there are optional devices that may or may not be installed in a particular machine. Some of these may be added or removed only from one boot to the next, and some may be hot-pluggable, added or removed while the machine is running.

The peripheral devices in an embedded machine often look very different to those in a conventional desktop or server. Even when a similar peripheral is used, differences in the way it is connected into the system can mean that it must be accessed and initialised quite differently. Many assumptions that are made about devices in a “normal” machine cannot be made in embedded machines, and the hardware and firmware of embedded machines generally provides much less assistance to the kernel for device discovery. All these things require different approaches to device discovery to be used.


[Figure 1 is a block diagram of the PowerPC 405LP PDA Reference Design (eLAP). It shows the PowerPC 405 core and the on-chip peripherals of the 405LP attached to the Device Control Register (DCR) bus, the Processor Local Bus (PLB) and the On-Board Peripheral Bus (OPB); the devices on the external and I2C buses, including 32MB SDRAM, 32MB NOR Flash, USB, MDOC, SDIO, FPGA, PCCF and TCPA; and the debug sled with its Ethernet, PCMCIA, USB host and RS232 connections. The diagram itself cannot be reproduced here; its legend follows.]

APM              Power management unit, implements the CPU sleep/suspend states
                 (NB: this is not related to the APM BIOS in PC laptops)
Audio Codec      Texas Instruments TLV320AIC23 stereo audio codec
Battery ADC      ADS7823 ADC monitoring battery voltage
CPM              Clock and Power Management unit
CSI              Codec Serial Interface (interface to audio codec devices)
DMAC             DMA controller (can DMA to/from devices on both the PLB and OPB)
EBC              External Bus Controller
Ethernet MAC     RTL8109 Ethernet controller
Frontlight Ctrl  DS1050 LCD frontlight controller
GPIO             General Purpose IO interface
IIC              I2C bus interface (both master and slave capable)
LED Ctrl         BU8770KN tricolour LED driver
LCDC             LCD display controller
MDOC             64M of M-Systems Millennium Plus Disk-on-Chip
                 (NAND flash with specialised controller)
PCCF             PCMCIA and Compact Flash controller
POB              PLB-to-OPB bridge
RTC              Real Time Clock
SDIO             Toshiba TC6380AF Secure Digital and SDIO interface
SDRAM            SDRAM controller
SLA              Speech Label Accelerator (hardware implementation of a
                 particular speech recognition algorithm)
TCPA             Atmel AT97SC3201 TPM
TDES             Triple-DES accelerator
TS Ctrl          Semtech UR7HCT52_S840L touchscreen controller
UARTx            NS16550 compatible UARTs
UIC              Universal Interrupt Controller
USB              Phillips ISP1161 USB interface (includes both host controller
                 and gadget-side interface)

Figure 1: Block diagram of the eLAP

2 The PowerPC 405LP PDA reference design

Most of the issues we discuss in this paper apply to many different embedded machines. However, for simplicity we focus on one example machine, the PowerPC 405LP PDA reference design, also known as the Embedded Linux Application Platform or eLAP. As the name suggests, this is a prototype reference design for a PDA based on the PowerPC 405LP CPU.

The IBM PowerPC 405LP is a CPU from the PowerPC 4xx family. This series of CPUs is designed for “system-on-chip” embedded applications. As the name suggests these processors are implementations of the PowerPC Architecture™, however they have some notable differences from “classic” PPC CPUs (as used in IBM pSeries™ servers and Apple workstations). The 4xx CPUs operate at much lower clock rates (and hence are cooler and cheaper), although they are in the high end by embedded standards. They have a much simpler MMU (just a software loaded TLB) and they have no FPU. More interestingly, they include a number of peripheral devices built into the CPU die itself (hence the term “system-on-chip”).

Different CPUs in the 4xx family are designed for different applications and have different collections of on-chip peripherals. The 405LP is designed for handheld, battery-powered applications. Figure 1 shows a block diagram of the eLAP, including the various built-in peripherals of the 405LP. The chip includes no less than three internal buses: the high-bandwidth Processor Local Bus (PLB) connected directly to the CPU core, the slower On-Board Peripheral bus (OPB), and the special DCR bus. The latter is used to implement Device Control Registers: rather than using normal memory-mapped IO, some of the on-chip devices use these special registers, which are accessed using special machine instructions. As shown, the 405LP’s peripherals include, amongst other things, an LCD controller, a real-time clock, an I2C interface and two serial ports. Other 4xx chips can include devices such as Ethernet controllers, HDLC interfaces, PCI host bridges, IDE and USB controllers. The 405LP also includes a number of novel power management features, which we’ll examine in §5.

In addition to the devices within the 405LP, the eLAP includes 32MB of RAM, 32MB of NOR Flash and a number of additional peripherals, also shown in Figure 1. Most of these are connected via a minimal bus driven by the 405LP’s on-chip External Bus Controller (EBC) unit. An extra debug and development sled can be attached to the eLAP, again shown in Figure 1. It includes an Ethernet controller, the physical PCMCIA slot driven by the 405LP’s PCCF core and the physical connectors for the USB host port and serial port. Of course, as well as the peripherals shown, further devices can be attached via the PCMCIA and SDIO slots and the USB host interface.

3 Current approaches

3.1 Conventional machines

On normal server or workstation machines, device discovery is mostly quite straightforward.1

Nearly all modern machines are based on PCI, which (like most modern buses) is designed so that devices can be queried and configured in a standard way. This makes it easy for the kernel to scan the PCI bus (or buses), determine what devices are present and their addresses, and pass this information to the appropriate device drivers. USB devices provide similar functionality, as do PCMCIA and ISA/PnP devices.2
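For comparison with the embedded case discussed later, here is a hedged sketch of what driver-side discovery looks like on such a scannable bus; the vendor/device numbers, the driver name and the probe/remove bodies are placeholders, not a real driver:

    static struct pci_device_id example_ids[] = {
            { .vendor = 0x1234, .device = 0x5678,
              .subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID },
            { 0, }
    };

    /* the PCI core has already assigned addresses and an IRQ by the time
     * probe() is called with a matching device */
    static int example_probe(struct pci_dev *dev,
                             const struct pci_device_id *id)
    {
            return pci_enable_device(dev);
    }

    static void example_remove(struct pci_dev *dev)
    {
            pci_disable_device(dev);
    }

    static struct pci_driver example_driver = {
            .name     = "example",
            .id_table = example_ids,
            .probe    = example_probe,
            .remove   = example_remove,
    };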

The few remaining devices (including the PCI host bridge itself) are usually standard: to be found on all machines of this type and often nearly all machines of this architecture. They can be found at well-known addresses, so the drivers for these devices simply hardcode this information. On PCs, non-PnP ISA devices do introduce some problems. In fact they demonstrate a subset of the problems with embedded hardware that we will examine in the next section.

Many non-x86 machines make device discovery even simpler with firmware support. Open Firmware (on IBM and Apple PowerPC machines) and likewise its ancestor OpenPROM (on Sun machines) builds a tree with information about each of the peripherals on the machine. At boot time the kernel queries this information, making a copy of the device tree which can later be used by drivers to find devices. The ACPI BIOS found on recent Intel® machines provides some similar information, although neither the ACPI implementations nor Linux’s use of them is very well established as yet.

1 Although on big servers, keeping track of the devices once they’re discovered can be another matter.

2 At least in theory; many ISA/PnP implementations are buggy in practice.

3.2 Embedded weirdness

On embedded machines all the assumptions that are made on “normal” machines break down. Embedded machines can and do have arbitrarily peculiar combinations of peripherals connected in a more-or-less ad-hoc manner.

Often, many of an embedded machine’s peripherals are connected via an unconventional bus which provides no facilities for systematic scanning or probing of devices. On the 405LP this is true of both the on-chip buses and the main external bus. Devices can appear essentially anywhere within the CPU’s physical address space. Sometimes the address or other behaviour of a device is affected by a custom FPGA or other programmable logic device with its own control registers. Device interrupt lines introduce even more problems, being routed in complex and arbitrary ways that are often controlled or masked by a board-specific FPGA or CPLD. Devices can sometimes have multiple dependencies on other devices: for example on the eLAP, audio is driven by the TI codec, which is controlled and configured via the I2C bus. However, the actual audio data is delivered to the codec via a serial connection to the on-chip Codec Serial Interface (CSI). The CSI in turn depends on the on-chip DMA controller to supply data from RAM.

Sometimes machines also have a more conventional bus such as PCI or PCMCIA, but it may have to be accessed via a bridge which is not configured in the same way as one would expect on a conventional machine. Worst of all, there are dozens or hundreds of different types of embedded machine, each with its own completely different set of devices and connections.

Under these circumstances, it is tempting to turn to each board’s firmware to provide information about which devices are present. Unfortunately the firmware on most embedded machines is very primitive, providing little more than a boot loader. Usually it will provide a few useful pieces of information, such as the amount of RAM on the system or the board revision, but it certainly won’t give comprehensive device information. Furthermore, what information the firmware does provide usually can’t be used without already knowing something about the machine in question, since embedded firmwares are almost as varied as embedded machines themselves.

Since embedded machines can and do break any assumption one might care to make about how devices are attached, there is no magical way that the kernel can detect what devices are present. So, the only approach is to have the kernel “just know” the device setup for a particular board by building the knowledge into the kernel at compile time.

Although it is impossible to completely avoid hardwired knowledge of boards in the kernel, we do want to keep this information in as clean a way as possible. Specifically, we want to isolate the direct knowledge of board specific details to as small a section of the kernel as possible, and we want to make it easy to add the details of new boards and their peripherals.

In this paper, we generally assume that the kernel must be configured for one particular type of embedded machine, since this is the simplest case. Building a kernel which will support multiple machines is certainly possible, and most of the methods we discuss can still be applied. In this case the kernel needs to include device information about all supported boards. Early in boot, the kernel will identify the machine it is running on (by some ad-hoc method), and select which information to use on that basis.

Now, we examine some of the existing methods by which embedded devices are supported in the kernel.

3.3 Hardcoded hacks

The naïve approach to handling embedded devices that aren’t on a conventional bus is to treat them all like the “system” devices on a PC or other conventional machine. That is, simply hardcode knowledge of the device into the relevant device driver or into the kernel initialisation code for the machine in question. This approach is currently used for quite a number of embedded devices, unsurprisingly since, as we’ll see, a better, comprehensive approach has yet to be implemented.

This method has some serious shortcomings. The most obvious problems come with embedded peripherals that are similar to ones also found in conventional machines. For obvious reasons, it is normal in this case to adapt the existing conventional driver for use in the embedded machine.

Sometimes this has been done by adding #ifdefs to the driver for the board specific code. For example, this has been done with the cs89x0 driver for the CrystalLAN CS8900 Ethernet chip. This chip is used on some ISA cards, but is also found on the “Beech” embedded board (another IBM reference board based on the 405LP). Apart from the fact that #ifdefs make the driver code ugly, this clearly causes a problem if the embedded machine can also have a normal ISA or PCMCIA version of the device: the kernel can’t support both versions of the peripheral on the same machine.

Another method is to copy, then modify the existing driver to make a version specific to a particular embedded machine. This approach was taken for the arctic_enet driver for the Ethernet on the eLAP’s debug sled. The sled’s Ethernet is based on an RTL8019 chip, which is used in a number of ISA cards, as well as several other embedded machines. This method allows multiple versions of the device to be simultaneously supported, but incurs the obvious maintenance problems of having several almost-but-not-quite identical drivers present in the kernel. The situation is aggravated as more embedded machines are supported.

The fundamental problem with this approach is that there are many more types of embedded machine than there are of normal machines. Indeed there is often only one dominant type of conventional machine per architecture (PC for x86, CHRP for PowerPC, Sun server/workstation for SPARC, etc.). With the large number of different types of embedded machine, direct hardcoding quickly becomes messy: there is a lot of duplicated code, and it is inconvenient to add new machines and peripherals.

3.4 The OCP subsystem

Since we can’t entirely avoid hardcoded information, the obvious approach to isolating the messiness is to encode information about devices into a data structure which is statically compiled into the kernel, but parsed to provide data to the drivers at runtime. This solution is conceptually similar to using device information from firmware except that the device tree is supplied by the kernel itself, rather than read at boot time.

The “OCP” subsystem (standing for On-Chip Peripheral) is a partial implementation of this approach. It only covers the on-chip devices on PPC 4xx chips, and is quite limited in the sorts of device information it can represent, but it is still a substantial improvement over hardcoded drivers.


The subsystem has been through several significant rewrites before reaching its present form. The initial implementation in the linuxppc_2_4_devel BK tree had a number of serious design and interface problems. It was then rewritten in 2.5 based on the new Linux unified device model, and using the PCI subsystem as a reference. This version is considerably cleaner, but still contains some poorly thought out elements; in particular, some things have been copied from PCI which make little sense in their new context. It has now been rewritten again by Benjamin Herrenschmidt in the linuxppc-2.4 BK tree. This latest rewrite is conceptually similar to the 2.5 version, but considerably simpler and cleaner. This final version still needs to be forward ported to 2.5, re-introducing the integration with the unified device model.

For each CPU with OCP devices, there is a table of definitions like that in Figure 2. This is found in a C file specific to the particular CPU, along with any initialisation or support code for that CPU. The example in Figure 2 doesn’t include all the 405LP’s on-chip devices, since not all the drivers have been adapted to use the OCP infrastructure yet. The table consists of ocp_def structures, shown in Figure 3. At boot time, the OCP system scans the core_ocp table to produce a list of OCP devices present, making an ocp_device structure (also in Figure 3) for each to keep track of it at runtime.

The vendor and function fields between them identify the type of device. This mimics the vendor/function pairs used to identify PCI and USB devices. However in this case the ID values are not built into the device but are simply arbitrary values allocated in include/asm/ocp_ids.h. Drivers for the on-chip peripherals register themselves when loaded, using the ocp_register_driver function and a table of OCP device IDs like that shown in Figure 4.

from arch/ppc/platforms/ibm405lp.c

struct ocp_def core_ocp[] __initdata = {
        { .vendor       = OCP_VENDOR_IBM,
          .function     = OCP_FUNC_OPB,
          .index        = 0,
          .irq          = OCP_IRQ_NA,
          .pm           = OCP_CPM_NA,
        },
        { .vendor       = OCP_VENDOR_IBM,
          .function     = OCP_FUNC_16550,
          .index        = 0,
          .paddr        = UART0_IO_BASE,
          .irq          = UART0_INT,
          .pm           = IBM_CPM_UART0 },
        { .vendor       = OCP_VENDOR_IBM,
          .function     = OCP_FUNC_16550,
          .index        = 1,
          .paddr        = UART1_IO_BASE,
          .irq          = UART1_INT,
          .pm           = IBM_CPM_UART1 },
        { .vendor       = OCP_VENDOR_IBM,
          .function     = OCP_FUNC_IIC,
          .paddr        = IIC0_BASE,
          .irq          = IIC0_IRQ,
          .pm           = IBM_CPM_IIC0 },
        { .vendor       = OCP_VENDOR_IBM,
          .function     = OCP_FUNC_GPIO,
          .paddr        = GPIO0_BASE,
          .irq          = OCP_IRQ_NA,
          .pm           = IBM_CPM_GPIO0 },
        { .vendor       = OCP_VENDOR_INVALID }
};

Figure 2: OCP device table for 405LP


from include/asm/ocp.h

struct ocp_def {
        unsigned int    vendor;
        unsigned int    function;
        int             index;
        phys_addr_t     paddr;
        int             irq;
        unsigned long   pm;
        void            *additions;
};

struct ocp_device {
        struct list_head        link;
        char                    name[80];
        const struct ocp_def    *def;
        void                    *drvdata;
        struct ocp_driver       *driver;
        u32                     current_state;
};

Figure 3: OCP device structure

from drivers/i2c/i2c-ibm_iic.c

static struct ocp_device_id ibm_iic_ids[] = {
        { .vendor = OCP_ANY_ID, .function = OCP_FUNC_IIC },
        { .vendor = OCP_VENDOR_INVALID }
};

static struct ocp_driver ibm_iic_driver = {
        .name           = "iic",
        .id_table       = ibm_iic_ids,
        .probe          = iic_probe,
        .remove         = iic_remove,
};

...

ocp_register_driver(&ibm_iic_driver);

Figure 4: OCP driver registration (for the IIC driver)

This again is analogous to a PCI driver. The OCP subsystem matches the driver against the list of OCP devices, calling the driver’s probe routine for each relevant device.
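A hedged sketch of what such a probe routine might do with the static description; the register-window size and the body are illustrative, not the actual i2c-ibm_iic code:

    static int iic_probe(struct ocp_device *ocp)
    {
            void *regs;

            /* ocp->def carries the entry from the core_ocp table (Figure 2) */
            regs = ioremap(ocp->def->paddr, IIC_REGISTER_EXTENT);
            if (!regs)
                    return -ENOMEM;

            ocp->drvdata = regs;            /* remembered for iic_remove() */
            /* the interrupt line, if any, is available as ocp->def->irq */
            return 0;
    }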

The index field is used to distinguish between multiple devices of the same type. The paddr and irq fields, unsurprisingly, give the device’s physical base address (on PPC all IO is memory-mapped) and its IRQ line. The pm field is used for power management; we’ll look at it in §5.3. Finally, the additions field is a hack used to supply extra device-specific information. It is not needed for any of the devices on the 405LP but it is used on some other chips: for example, some 4xx CPUs, such as the 405GP and NPe405H, include one or more Ethernet MAC controller (EMAC) units. These make use of a specialised DMA controller known as the Memory Access Layer (MAL). The additions field is used to identify which MAL channels are associated with each EMAC, another piece of information that the kernel has to “just know.”

The 2.5 version of the OCP system is integrated with the unified device model. At bootup the OCP code registers an OCP bus_type and one instance of it; all the OCP devices are registered as devices on this bus.3 The ocp_device and ocp_driver structures become wrappers around the device model’s device and device_driver structures.

3 This ignores the distinction between the two on-chip buses, PLB and OPB. We can get away with this because the POB is always enabled and has a fixed configuration, so in practice we can ignore the distinction for the purposes of device discovery.

4 Future approaches

As yet, there is no really convenient and comprehensive way of dealing with embedded “unprobeable” peripherals. The OCP system is probably the closest thing to such a system, but it has significant limitations: it only covers PPC 4xx on-chip devices, and its data structure is a flat table, so it cannot represent peripherals behind a bus-to-bus bridge or other more complex interconnections.

The fact that some peripherals are built into the CPU chip is interesting from a hardware point of view. However, for the purposes of device discovery, there is little inherent difference between on-chip devices and devices which are on a separate chip, but which still can’t be straightforwardly scanned or probed. It seems worthwhile, then, to extend the idea of the OCP system to cover embedded devices more generally.

The Linux unified device model provides the obvious place to represent information about these devices at runtime: it already provides the code for matching devices to drivers and its tree structure allows multiple buses with configurable bridges between them to be represented.

It is not immediately clear how to represent everything that’s needed in the device tree, though. While for many devices the physical address and IRQ number is all the information that is needed, some devices have multiple IRQs and/or IO windows, at discontinuous addresses. Some buses require different resource addresses: for example many of the 4xx on-chip devices need DCR numbers, and I2C devices need I2C addresses rather than physical IO addresses. Hence, devices on different buses are likely to need different wrappers around struct device providing different address information.
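Purely as an illustration of that last point (these are hypothetical types, not structures that exist in the kernel), such wrappers might look like:

    struct opb_device {
            struct device   dev;        /* generic device-model node */
            phys_addr_t     paddr;      /* memory-mapped register base */
            int             irq;
    };

    struct dcr_device {
            struct device   dev;
            unsigned int    dcr_base;   /* first Device Control Register number */
            unsigned int    dcr_count;
    };

    struct i2c_attached_device {
            struct device   dev;
            unsigned int    i2c_addr;   /* address on the I2C bus, not an IO address */
    };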

Even less obvious is how to represent devices with multiple connections, such as the audio codec on the eLAP, connected both to the I2C bus and to the CSI.4 As yet, the device model does not have a clear way to represent this.

4 Note that this is a different problem to multipath IO. That deals with the case where a device can be accessed by any of several routes; here we have devices that require several connections simultaneously.

Another as yet unanswered question is how the device information should be represented in the kernel image. In fact, we gain a little flexibility if this information is removed from the kernel proper by having it as a blob of data which is passed to the kernel by the bootstrap loader (the shim between the firmware bootloader and the kernel proper which handles decompressing the kernel and moving it to the correct address in memory). This has the advantage that on those machines which do have a reasonably sophisticated firmware or bootloader, such as PPCBoot/U-Boot (see [7]), the device information can be taken from there.

from include/asm/bootinfo.h

struct bi_record {
        unsigned long   tag;
        unsigned long   size;
        unsigned long   data[0];
};

#define BI_FIRST                0x1010
#define BI_LAST                 0x1011
#define BI_CMD_LINE             0x1012
#define BI_BOOTLOADER_ID        0x1013
#define BI_INITRD               0x1014
#define BI_SYSMAP               0x1015
#define BI_MACHTYPE             0x1016
#define BI_MEMSIZE              0x1017
#define BI_BOARD_INFO           0x1018

Figure 5: Boot info records

On PPC systems, there already exists a flexible method of passing data from the bootstrap to the kernel proper through “boot info records.” The bootstrap passes to the kernel a list of bi_record structures, shown in Figure 5. Each bi_record is a blob of data with a length and tag, the internal format of the information being determined by the tag (some tag values are also shown in Figure 5). Currently this method is used for passing information such as the size of memory and the board and bootloader versions. This system could be extended to pass an entire set of device information to the kernel (there is no reason a particular bi_record couldn’t contain a list of further bi_records, giving a tree structure).
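A hedged sketch of that idea; BI_DEVICE_TREE and BI_DEVICE_NODE are hypothetical tag values, and the helper below glosses over alignment and the exact size convention of real bi_records:

    #include <string.h>

    #define BI_DEVICE_TREE  0x1020      /* hypothetical: wraps a whole tree */
    #define BI_DEVICE_NODE  0x1021      /* hypothetical: one device, possibly
                                           containing further nested records */

    /* append one record at *pos and advance the write pointer; a
     * BI_DEVICE_TREE record's payload would itself be a sequence of
     * BI_DEVICE_NODE records, giving the tree structure */
    static void append_record(void **pos, unsigned long tag,
                              const void *payload, unsigned long len)
    {
            struct bi_record *rec = *pos;

            rec->tag  = tag;
            rec->size = sizeof(*rec) + len;
            memcpy(rec->data, payload, len);
            *pos = (char *)rec + rec->size;
    }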

A related question is how to represent the device information in the kernel source. The simplest approach would be to directly include the data structure used to represent the information at boot time. However, that’s likely to be quite inconvenient to edit and extend, especially if the format includes length fields (like bi_records) or internal pointers. It might be worthwhile, then, to create a device tree compiler: a program used during the kernel build to take a text file describing the device layout and generate code or data to be included in the kernel image.

5 Power management

In a battery powered device such as the eLAP, it is clearly important to minimise power consumption. The most obvious way to do this is to power down sections of the system when they are not in use. Obviously, this means that the kernel needs to know when a device is in use, including when it is in use indirectly because another device relies on it.

The topics of power management and device discovery are therefore related: device discovery is about providing exactly the sort of information about the interconnection of devices that effective power management requires.

As yet the integration of power management techniques with detailed device information is very much a work-in-progress, even on conventional systems and doubly so on embedded machines. So, we can only give an overview here of what the major issues are: most of the hard cases remain to be investigated, let alone implemented.

Again we will use the eLAP as our example: the 405LP’s power management features introduce some new (and largely unsolved) problems in providing device information for power management. So, we first examine these features, then in §5.2, §5.3 and §5.4 we examine several different methods of reducing power consumption and the device information problems they introduce.

5.1 eLAP power management features

The 405LP CPU is designed especially for low power operation and as such it has some novel and interesting power management features. Most of the on-chip peripherals can be powered on and off under software control. Some also provide more detailed power control to allow power savings when only parts of the peripheral are in use, or when it is in use intermittently. It also allows for several methods of saving the CPU state while shutting down the chip as a whole (i.e., “sleep” modes).

More interestingly, the 405LP includes a clock generation core that allows the clock frequency of the CPU core and also the PLB, OPB and EBC buses to be altered dynamically. The ratios between the CPU and various bus frequencies are not fixed, so the chip can be adjusted differently for IO versus compute performance. While a number of different CPUs allow the frequency to be changed while running, the 405LP can change frequency exceptionally quickly (microseconds), which enables new power management techniques based on dynamically adjusting frequency based on workload and idle periods. The 405LP can also operate at a variety of different voltages, which can provide much greater power savings than just adjusting the frequency (power consumption varies roughly linearly with frequency and cubically with voltage; maximum frequency is roughly linear with voltage). Another novel feature of the 405LP is that it can continue to operate, albeit slowly in some cases, while a voltage transition is in progress.

As well as supporting the 405LP’s features, the eLAP board has some extra power management features of its own. A number of the on-board devices, such as the audio codec, include the ability to power down some or all of their operations while not in use. Other devices, such as the USB and SDIO chips, can be powered down under the control of an FPGA register.

5.2 Static power management

We use the term static power management for the process of suspending or sleeping a machine, i.e. saving the machine’s state while turning most or all of the machine off, then restoring the state when the machine is powered on again. This of course is normal in everyday laptops, and handling this for embedded machines is not a great deal different.

Embedded devices do introduce some extra complexities, though. On PC laptops, the BIOS (either APM or ACPI) provides some support to the kernel on how to properly suspend the machine (indeed, in the APM case the BIOS handles most of the work itself). Embedded machines, on the other hand, usually require the kernel to know how to suspend and resume the machine directly. For example the suspend code for the eLAP knows how to use the 405LP’s features to save the CPU state, how to configure the RAM to enter self-refresh mode, how to use the board’s FPGA registers to turn off the board, and how to rebuild the state when the machine is resumed.

In addition, static power management on all machines requires knowledge of what the devices’ dependencies on each other are, so they can be shut down and later restarted in the correct order; this was one of the major motivations for the creation of the unified device model. Hence, all the complexities of obtaining detailed device information for embedded systems impact on static power management. However, static power management doesn’t really add further difficulties beyond those we have already discussed for device discovery.

5.3 Peripheral power management

We use the term peripheral power management to refer to disabling and powering down peripheral devices when they are not in use. This is often relatively straightforward, since it can be handled directly by the driver for the device in question. This also delegates the question of when the device is “in use” to the driver. Sometimes it is sufficient to enable power to the device when it is open, and disable it when closed; other times more fine-grained control of the power is desirable, e.g. to take advantage of idle periods.

When the device depends on other devices being enabled, the situation is a little more complex. However the driver will generally know what the other devices are and their drivers, so it is usually quite simple to create an ad-hoc interface whereby one driver can ask the other to enable the device it requires.

Difficulties do arise where power to one device is controlled by another: for example the 4xx on-chip devices are controlled by a central clock and power management unit (CPM). Similarly many boards have devices which are powered on and off by board-specific FPGA registers.


The 4xx CPM has a simple interface, allowing the OCP subsystem to support peripheral power management quite easily. The pm field in the OCP device definitions (see Figures 2 & 3) is a mask describing which bit in the CPM registers controls power to this peripheral. Drivers can then use functions from the OCP subsystem to switch the device on and off.
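A hedged sketch of how such OCP helpers could be built on the pm mask; the function names, the DCRN_CPMFR register name and the bit polarity are assumptions made for illustration, not the exact OCP API:

    /* assumption: setting a bit in the CPM force register gates that
     * peripheral's clock off, clearing it powers the peripheral back up */
    void ocp_device_power_on(struct ocp_device *dev)
    {
            if (dev->def->pm != OCP_CPM_NA)
                    mtdcr(DCRN_CPMFR, mfdcr(DCRN_CPMFR) & ~dev->def->pm);
    }

    void ocp_device_power_off(struct ocp_device *dev)
    {
            if (dev->def->pm != OCP_CPM_NA)
                    mtdcr(DCRN_CPMFR, mfdcr(DCRN_CPMFR) | dev->def->pm);
    }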

Obviously, for peripherals that are unique to a particular board it is also easy for the driver to directly control power to the device. So far, however, little work has been done on the general case where common devices may be powered on and off by board specific controls.

5.4 Dynamic power management

Dynamic power management (DPM) refers to dynamically adjusting CPU frequencies and voltage during operation. This approach is quite new, at least in Linux, and very much under development. The details of the motivations and approaches to dynamic power management are outside the scope of this paper; for more information see [4]. IBM and MontaVista Software are collaborating on further development in this area.

DPM does introduce some new problems related to device information. To work properly, devices in operation may have to impose constraints on what frequency or other settings are allowed. For example, a device may require a certain amount of bus bandwidth, and hence impose a minimum bus frequency, or it may require interrupts to be handled without too high a latency, and hence impose a minimum CPU frequency.

These constraint details are somewhat like the basic information about device interconnection that we have already examined, but clearly require even more detailed information about the devices. Since these constraints may well depend on details of the hardware interconnects, this is yet more information which the kernel must “just know.”

Again, the unified device model provides an obvious framework in which to represent this information. [4] discusses some methods for setting constraints within drivers. A general approach to representing constraints in a way that is easily extensible to new boards is yet to be implemented, and is likely to take considerable further investigation and development.

References

[1] IBM Corporation, PowerPC® 405LP Embedded Processor User’s Manual, Preliminary, 2002.

[2] IBM Corporation, PowerPC® 405GP Embedded Processor User’s Manual, Seventh Preliminary Edition, 2000.

[3] IBM Corporation, PowerNP NPe405H Network Processor User’s Manual, Preliminary, 2002.

[4] IBM and MontaVista Software, Dynamic Power Management for Embedded Systems, Version 1.0, http://www.research.ibm.com/arl/projects/papers/DPM_V1.1.pdf, 2002.

[5] linuxppc_2_4_devel kernel tree, bk://[email protected]/linuxppc_2_4_devel.

[6] linuxppc-2.4 kernel tree, bk://[email protected]/linuxppc-2.4.

[7] PPCBoot homepage, http://ppcboot.sourceforge.net/.


About the author

David Gibson is an employee of the IBM Linux Technology Center, working from Canberra, Australia. Most recently he has been working on board and device bringup for Linux on embedded PowerPC machines, along with various bits of kernel infrastructure for cleanly supporting PowerPC 4xx and other system-on-chip CPUs. He is also the author and maintainer of the orinoco driver for Prism II based 802.11b NICs. In the past he has worked on ramfs (as included in the -ac kernel tree), and “esky,” a userspace implementation of checkpoint/resume.

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

IBM, PowerPC, PowerPC Architecture and pSeries are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Intel is a registered trademark of Intel Corporation in the United States, other countries, or both.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.


Gnumeric – Using GNOME to go up against MS Office

Jody Goldberg
[email protected]

Abstract

MS Office is popular for a reason. Microsoft and its massive user base have kicked it hard, and polished the roughest edges along the way. The hidden gotcha is that MS now holds your data hostage. Their applications define if your data can be read, and how you can manipulate it. The Gnumeric project began as a way to ensure the GNOME platform could support the requirements of a major application. It has evolved into the core of a spreadsheet platform that we hope will grow past the limitations of MS Excel. Gnumeric has taught us a lot about spreadsheets and, for the purpose of this talk, about what types of capabilities MS has put into its libraries and applications to provide the UI that people are familiar with. I’d like to discuss the tools (available and unwritten) necessary to produce a competitor and eventually a replacement.

1 Introduction

History

• Aug 17 1997 GNOME Created

• July 2 1998 Gnumeric Created

• Dec 31 2001 version 1.0.0 released

• Aug ?? 2003 version 1.2.0 released

Miguel de Icaza began work on Gnumeric in July 1998 with the stated purpose of building a large, real-world application to validate the GNOME libraries. Without much background in spreadsheets he persevered, learning as he went, writing maintainable code, and testing such core libraries as gnome-canvas, gnome-print, libgsf, and Bonobo.

Status

At this writing Gnumeric now has basic support for 100% of the spreadsheet functions shipped with Excel and can write MS Office 95 through XP, and read all versions from Excel 2 upwards. It covers most major features, but is still lacking pivot tables and conditional formats, which are planned for the 1.3 release cycle.

2 Container File Format

The first step in interoperability with MS Office is to read and write its files. MS Office 95 was an epochal event for the suite. Before then each application had its own format. As of Office 95 it shares a single container format used to wrap each application’s native representation. This is convenient for embedding because it allows a user to transfer the entire content of a document, including all of its components, as a single file.


2.1 OLE2, The MS solution

The wrapper format used throughout MS Office is called OLE2. That title actually encompasses the container and a sub-format used to store metadata. GNOME had no support for reading or writing either one. For Gnumeric, Michael Meeks used earlier work in laola by Arturo Tena and Caolan McNamara, and scraps of documentation, to create libole2. In the last few years the set of available documentation has improved greatly. Apache’s POIFS project has generated some nice coherent write-ups for their Java implementation. Additionally, although the Chicago project has not produced any code, they googled extremely well, and collected enough documentation to illuminate the remaining corners.

OLE2 is quite literally a filesystem in a file. It has a directory tree for the offsets and file allocation tables to store the layout of the blocks. Unfortunately for libole2 it also has a meta file allocation table, which libole2 could not export correctly. As a result, after approximately 6.8 MB of data, the library and all of its derivatives produced invalid results. Given the new documentation, we’ve written libgsf (the GNOME Structured File library) to solve that. The new code has been quite stable, and now that libole2 has been deprecated it has made its way into koffice, abiword, and several other applications, including potentially OOo via the WordPerfect library.

2.2 The Future

• Format should be easy to create and manipulate without special tools.

• Existing filemagic type sniffing should be able to ID the documents as documents, not their container type

• Support for ‘filesystem in a file’ structured files to facilitate embedding and split data from metadata.

• Space & time efficient for reading and generation

• Portable to as many platforms as feasible

• signing

• encryption

• embedded object handling

• possible dual/multiple version support

• meta data

Things we do not need

Transaction support is probably overkill. Given the general size of documents, it seems simpler to just generate the entire file each time it is saved.

In-place update (rewrite one element without touching the rest) is also probably unnecessary. This is useful for things like presentation programs with large data blobs (images, sound) and relatively little content; it is also conceivably useful in situations where an external app is editing just metadata. However, the implementation costs for these are high in comparison to available manpower.

At the time of this writing, OOo’s container format is the leading contender. There are discussions at this time to potentially adopt the container format (not the underlying xml content).

3 File Format

3.1 XLS

Like a Russian doll, parsing the wrapper format is just the beginning. Within the OLE2 files are sub-files called Book or Workbook (depending on version) that are in the BIFF format, with several different variants. There is actually some reasonably good documentation for this due to some MS greed (tm) and hard work on the part of the OOo xls filter team. BIFF is fairly similar in structure to earlier Lotus formats, and thanks to Microsoft padding one of its books (The Excel 97 Developers Kit) with their internal file format docs we know a fair amount about how things are structured. Contrary to the common Slashdot wisdom, the hard part is not in knowing the broad details of the records. The devil is in the details: implementers need to understand what Microsoft means for each flag and have a strict superset of the corresponding application to avoid information being lost when round-tripping data to a proprietary format.

An odd truth of open-source spreadsheets is that they generally interoperate better via xls than their native formats. E.g., Gnumeric can read OpenCalc, but not write it, and OpenCalc has no support for Gnumeric at all. This speaks volumes for the ubiquitous nature of the Microsoft formats. The resource expenditure to support xls fully limits the amount of development time for other formats.

Within the BIFF records is nested yet another format to store the details of the expressions, which is also reasonably documented.

3.2 Escher

A major change in Office97 was the addition of a shared drawing layer for all of the applications. This allows you to draw an org chart in Excel and paste it into Powerpoint. Its high level format was reasonably documented on the web until that was pulled a few years ago. This content is nested within BIFF records within OLE2 records. Escher is a format in the classic ‘specification by implementation’ genre. Although both OOo and Gnumeric have decent parsers for the records themselves, parsing the content is tricky. One rarely knows what attributes correspond to visible properties. Exporting Escher is even more complex. Both Gnumeric and OOo appear to have adopted a monkey see, monkey do approach to work around Microsoft’s reluctance to use ‘conventional’ values for flags, requiring things like 0xAAAAFFFF as true for some of the boolean flags.

3.3 WMF/EMF

Users expect their word/clip art to appear faithfully. That content is usually stored using a set of drawing primitives in a series of MetaFile formats (Windows and Extended). The primitives are slightly higher level than a serialized set of X protocol requests. There is an existing rasterizer for wmf in libwmf, but it cannot currently handle emf. OOo has a parser for both wmf and emf, but it is tightly coupled to the OO platform and of marginal quality. Proper handling of this is still an open question. There have been discussions with the maintainers of libwmf and libEMF to combine efforts to fill this niche, but nothing concrete has materialized as yet.

3.4 Security

Unfortunately each element of MS Office handles this slightly differently, and approaches vary from version to version. There is no secure notion of authentication within OLE2, or xls. Microsoft apparently assumes that is handled at a higher level. Within xls there are 3 main forms of encryption. The first is sheet-level protection that is little more than an XOR of the records with a 16-byte hash of the password. This is tissue-thin, and can be supported trivially. Workbook-level encryption is significantly more secure and uses md5 hashed passwords and rc4 to encrypt the BIFF record content. Workbook-level protection is rather strange. It uses the secure workbook-level encryption, with a hard coded password. Gnumeric is the only open source application that can handle all three.
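To illustrate just how thin the sheet-level scheme is, here is a hedged sketch of the idea; the real BIFF XOR obfuscation derives and rotates its key in a more involved, record-position-dependent way, so this is only the high-level shape, not a working filter:

    #include <stddef.h>

    /* XOR the record bytes against a repeating 16-byte key derived from the
     * password; running it a second time recovers the plaintext. */
    static void xor_obfuscate(unsigned char *buf, size_t len,
                              const unsigned char key[16], size_t offset)
    {
            size_t i;

            for (i = 0; i < len; i++)
                    buf[i] ^= key[(offset + i) % 16];
    }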

3.5 VBA

This is still largely uncharted territory for open source applications. MS Office appears to store the VBA code in at least 3 formats.

1. Compressed source code. libgsf, libole2, and openoffice can all decompress the code with varying degrees of accuracy. What none of us knows is how to locate the offset to the start of the compressed stream. OO and libole2 both have kludges in place to guess, but neither is reliable. There is clearly documentation on the subject available to the anti-virus manufacturers, but its licensing precludes its use in open source libraries. This is the most likely route to support importing VBA in the near term. It is not immediately obvious that source code is what we really want, because it requires a lexer, parser, and libraries to back it up: a rather significant amount of work. There is some hope that the Mono project and its emulation of the .Net API will provide support for this.

2. P-Code. A preparsed set of tokens to be interpreted by the VB engine. The format of this has no open documentation, although it is fairly amenable to parsing. On the positive side this is the holy grail in many ways. Having pre-parsed code removes the need for a lexer and parser, and allows us to map the content to more modern languages such as python. The down side is that the p-code comes in many variants, and depends on versioning.

3. S-Code. Like P-Code, but different in some unspecified way. No known parsers or documentation exist.

4 Expansion

In addition to their core functionality, office applications are expandable. Organizations have some limited ability to customize their installations, and third party developers can use them as a development platform for their niche applications. From tasks as simple as work-flow macros, up to massive extensions such as FEA’s @Risk, or DeltaGraph, people are extending MS Excel.

4.1 XLLs & XLMs

Early versions of Excel offered 2 forms of extension. The first was a kludge that grafted a pseudo-procedural language into the functional format of a spreadsheet, called XLM. This is no longer widely used.

There is also the opportunity to load an XLL, a DLL shared library with special entry points, directly into the process' address space. Although its interface is only partially documented, somewhat byzantine, and deprecated, this is the most popular form of extension. The primary benefit is that a developer's code can be written in C/C++, and compiled to link the external libraries fairly easily.

To the best of my knowledge there are no open source or proprietary applications that support XLMs or XLLs other than MS Excel. This hinders transition. XLMs could potentially be supported, but the limited remaining user base does not warrant the expense. XLLs might be feasible under win32, but due to the nature of the interface, would probably require WINE under *nix.



4.2 VBA & OLE

With Office97 came a unification of embedding and scripting via VBA and the OLE component model. VBA as a language was a significant improvement over XLM and quickly supplanted it. OLE was more of a mixed blessing. The increased complexity of the interface required Dev Studio wizards to generate the wrapper code, which was fairly unmaintainable. As a result most installations fell back on VBA external declaration support to add new capabilities. This worked well for tasks amenable to scripting, but was painful when linking to external analytics. Adding a new spreadsheet function involved writing it in C, then creating a dummy wrapper in VBA that links to it.

4.3 Gnumeric

Worksheet function management is an area where Gnumeric is well ahead of MS Office. Adding new functions to Gnumeric is trivial. An abstract interface for loading modules has been implemented for shared libraries (via glib's g_module utilities) and python. Coupled with an xml-based configuration mechanism and just-in-time loading, the vast majority of the worksheet functions are in plugins.
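
As a rough sketch of the shared-library side, the fragment below opens a plugin with GLib's g_module API and resolves an entry point. The plugin_init symbol and the surrounding function are placeholders for this example, not Gnumeric's actual plugin interface.

#include <gmodule.h>

/* Hypothetical entry point each plugin is assumed to export. */
typedef void (*plugin_init_fn) (void);

static gboolean load_function_plugin (const char *path)
{
        GModule *module;
        plugin_init_fn init;

        if (!g_module_supported ())
                return FALSE;

        /* Lazy binding defers symbol resolution until first use. */
        module = g_module_open (path, G_MODULE_BIND_LAZY);
        if (module == NULL)
                return FALSE;

        if (!g_module_symbol (module, "plugin_init", (gpointer *) &init)) {
                g_module_close (module);
                return FALSE;
        }

        init ();        /* the plugin registers its worksheet functions */
        return TRUE;
}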

Less well defined are the scripting interfaces. Building on GNOME's strong set of language bindings there have also been experiments in scripting in guile, perl, CORBA, VB, and python. The only clear result thus far has been that the scripting interface is fairly language-agnostic. Defining a clear and coherent api is on the short list of extensions to be made during the 1.3-to-2.0 development cycle.

5 Preferences

Storing user preferences is another area where GNOME technology has the advantage over its Windows counterpart. GConf attempts to apply the lessons of the Windows registry while learning from its failures. By storing content in several distinct user readable xml files, gconf offers the convenience of a global structured storage, while retaining a flexibility in the face of file errors or corruption not found in the more monolithic Windows registry. Work remains though. There is still some thought necessary to implement lockdown features, and to address logical paths (HOME, PREFIX, etc.).
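
For orientation, a minimal sketch of reading and writing a key through the GConf client API follows. The key path /apps/example/last-directory is a made-up placeholder, not an actual Gnumeric preference.

#include <gconf/gconf-client.h>

static void remember_last_directory (const char *dir)
{
        GConfClient *client = gconf_client_get_default ();
        GError *err = NULL;

        gconf_client_set_string (client,
                                 "/apps/example/last-directory",
                                 dir, &err);
        if (err != NULL) {
                g_warning ("saving preference failed: %s", err->message);
                g_error_free (err);
        }
        g_object_unref (client);
}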

6 GUI Toolkit

Over time, the initial separation between the gnome libraries as extensions to gtk has been largely removed. With the addition of pango to handle advanced text, and extensions to Gtk's rendering model that produced the foocanvas, gtk+ now supports the primary display needs of Gnumeric. Coupled with libglade for easy maintenance and configurability, it is relatively painless to produce extremely usable dialogs. There are, however, a few lingering issues to address.
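
As a sketch of the libglade workflow mentioned above, the fragment below instantiates a dialog from a UI description and connects its handlers. The file name dialog.glade and the widget name are placeholders for this example.

#include <gtk/gtk.h>
#include <glade/glade.h>

static GtkWidget *build_dialog (void)
{
        GladeXML *xml;

        /* Load the interface description built with Glade. */
        xml = glade_xml_new ("dialog.glade", "my_dialog", NULL);
        if (xml == NULL)
                return NULL;

        /* Hook up the signal handlers named in the .glade file. */
        glade_xml_signal_autoconnect (xml);

        return glade_xml_get_widget (xml, "my_dialog");
}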

Configurable UI

The current gtk+ api for menus and toolbars makes no distinction between the actions and their layout. Applications are forced to hard-code their menu/toolbar layouts in order to modify them. This removes the ability of a user to reorganize things. The limitations of the gtk+ path-based API prompted the creation of GNOME_UI app helpers, which simplified creation and added stock items to improve consistency between GNOME applications. However, it did nothing to solve the issue of hard-coded layouts. In an effort to solve the problem of merging menus and toolbars for components, Bonobo attempted to separate the layout from the actions. Unfortunately, the API was produced without enough review, and it was insufficient for large applications.



The hopefully final rendition is now in its evaluation phase in libegg menu/toolbar. This code allows effective management of different action groups, and the creation of new action types such as combos and accelerators. The main remaining question is how to store a user's edits. KDE has long had support for this sort of editing, but they don't appear to have a good solution for storing the edits as yet.

File Selector

The gtk+ file selector has long been a source of ridicule and disgust. It is functional, but too barebones for a modern desktop. The fact that it has no solid support for network addresses or histories has greatly hindered the adoption of gnome-vfs. There have been several write-ups, and Owen Taylor has apparently completed a replacement version that will be included in gtk+-2.4.

7 Spreadsheet-specific Functionality

Spreadsheets are an ideal testing ground for all those obscure datastructures we all learned back in school. In most instances there is not enough data to make using something esoteric worthwhile. With an apparent size of 256×64k (Gnumeric can scale considerably larger), it is very easy to quickly operate on significant swaths of data.

As an example, Gnumeric uses an asymmetric quadtree to store style information. This allows us to easily handle someone doing a "Select all, Bold" (explode kspread). It also supports "Select all minus one row and one col" (explode MS Excel, and OpenCalc).

8 Acknowledgements

The Gnumeric development team. You know who you are, as do the CVS logs.

Ximian employees for their continuing personal contributions to Gnumeric.

9 References

http://www.gnome.org/projects/gnumeric


DMA Hints on IA64/PARISC

Optimizing DMA performance for HP Chip sets

Grant Grundler
Hewlett Packard

[email protected]

Abstract

Modern IO subsystems implement complex DMA transaction parameters, called DMA hints, which are not explicitly supported by the Linux DMA API. This paper investigates the benefits of using non-default DMA hints and thus whether such hints should be abstracted into the DMA API. My conclusion is that the implementation investigated (ZX1) does not warrant changing the DMA API. Other implementations need to be compared before proposing any changes.

HP PA-RISC (Astro[1]/Elroy[2]) and IA64 IO Controllers (ZX1) both support several types of DMA hints, and both are commercially available. My primary interest was the ability to prefetch cache lines for PCI devices. The benefit is the same as for a CPU: bring the data closer to the consumer. But to my surprise, cache line prefetching is not the most important hint since default prefetching works well for all devices. Relaxing the PCI ordering rules turns out to be more important since firmware can't know when it's safe to do so.

Updated versions of this paper will be available from http://iou.parisc-linux.org/ols2003/

1 Introduction

DMA performance seems like such an obvious thing. Drivers just need to tell the device where to fetch something from memory, poke it, and life is good. Unfortunately, those days are over.

Modern SMP servers require multiple levels of bridges in order to support PCI-X bandwidth (peak burst rate 133MHz/64-bits). In order to work well with CPUs and memory controllers, IO devices participate in the CPU cache coherency protocols. They also need to minimize the number of transactions used and use the appropriate type of transaction in order to optimally utilize available bandwidth.

Throughout this paper (and even in the title!) I use the word Hint, which implies an "informational only" parameter. This isn't strictly accurate. Some platforms depend on certain parameters for correct operation; i.e., incorrect results may occur for some combinations of DMA hints. The DMA hints discussed in this paper should always provide correct results, though I've crashed the ZX1 with some hints as noted.

And I recycled the ZX1 block diagram used in my 2002 OLS talk, "Porting Drivers to ZX1."[3] The diagram is useful to understand the routing of data between PCI devices, Memory, and CPU. [4]

Page 220: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 220

[Figure 1: HP ZX1 Block Diagram. The diagram shows two CPUs and Memory on the McKinley Bus, the SBA with its IO MMU below it, and four LBAs leading to SCSI, LAN, and USB devices.]

2 Overview of DMA

The following sections introduce some of the key concepts relating to Direct Memory Access.

2.1 Consistent vs. Streaming DMA Mappings

The Linux DMA mapping interface differentiates between two memory access patterns. A short summary of DMA-mapping.txt [5] follows.

Consistent DMA mappings are intended for data which is concurrently accessed by both CPU and PCI device(s) (i.e. host RAM based device control structures like mailbox rings). The key feature is that updates (writes) from either must be visible to the other based on PCI ordering rules. In short, fairly strict R/W ordering rules, and transactions are typically less than a cache line in length.

Streaming DMA mappings are intended for memory regions exclusively accessed by the PCI device(s). This "exclusive" access begins when a host memory region is mapped and ends when the same region is unmapped.

The Streaming DMA interface provides two explicit hints: DMA direction and DMA length. From the length, we know the block size on which the DMA will terminate. But as noted in the introduction, DMA direction is required for correct operation on some platforms, but not on ZX1 or PARISC IOMMUs.1 The ZX1 System Bus Adapter (aka SBA; see sba_iommu.c) code does optionally use direction to optimize VM bits. Other hints regarding PCI ordering compliance and DMA read data consumption rate are not specified.
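
For reference, here is a minimal sketch of how a driver uses the two mapping types through the PCI DMA API summarized in DMA-mapping.txt. The function, the buffer, and RING_BYTES are placeholders invented for this example; only the pci_* calls and the direction argument (one of the explicit hints mentioned above) come from the real interface.

#include <linux/pci.h>
#include <linux/errno.h>

#define RING_BYTES 4096         /* placeholder size for this sketch */

static int example_dma_setup(struct pci_dev *pdev, void *buf, size_t len)
{
        dma_addr_t ring_dma, buf_dma;
        void *ring;

        /* Consistent mapping: long-lived, accessed by both CPU and
         * device (e.g. mailbox/descriptor rings). */
        ring = pci_alloc_consistent(pdev, RING_BYTES, &ring_dma);
        if (!ring)
                return -ENOMEM;

        /* Streaming mapping: the device "owns" buf until unmap.  The
         * direction argument is the DMA direction hint. */
        buf_dma = pci_map_single(pdev, buf, len, PCI_DMA_TODEVICE);

        /* ... hand ring_dma/buf_dma to the device and run the I/O ... */

        pci_unmap_single(pdev, buf_dma, len, PCI_DMA_TODEVICE);
        pci_free_consistent(pdev, RING_BYTES, ring, ring_dma);
        return 0;
}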

2.2 PCI DMA

A single DMA operation is fairly straightforward at the PCI bus level. The PCI device asks the PCI arbiter for bus ownership. The arbiter eventually grants the PCI device ownership of the bus. The PCI device accepts ownership and sends the target address (possibly two cycles' worth for 64-bit addressing) followed by data. The transaction ends when either the PCI controller asks the device to give up ownership or the PCI device volunteers to do so.

PCI supports several different types of commands. Here are the ones relating to DMA along with summaries of their PCI Local Bus definition:

• Memory Read command is used to read data from an agent mapped in the Memory Address Space. The target is free to do an anticipatory read for this command only if it can guarantee that such a read will have no side effects. Furthermore, the target must ensure the coherency (which includes ordering) of any data retained in temporary buffers after this PCI transaction is completed.

• Memory Read Line command is semantically identical to the Memory Read command. Use of MRL indicates the intention to read a full cache line of data.

1 PARISC platforms without an IOMMU do require the R/W direction hint.

• Memory Read Multiple command is semantically identical to the Memory Read command. Use of MRM indicates the intention to read more than one cache line of data before disconnecting. MRM is intended to be used with bulk sequential data transfers where the memory system (and the requesting master) might gain some performance advantage by sequentially reading ahead one or more additional cache line(s) when a software transparent buffer is available for temporary storage.

• Memory Write command is used to write data to an agent mapped in the Memory Address Space. When the target returns "ready," it has assumed responsibility for the coherency (which includes ordering) of the subject data.

• Memory Write and Invalidate command is semantically identical to the Memory Write command. The difference is MWI requires the device to write at least one complete cache line, and the Host Cache controller can invalidate existing contents without having to send the contents (just reassign ownership) to the IO or Memory Controller. This avoids unnecessary cycles on the Front Side Bus for DMA writes. MWI requires the CACHELINE_SIZE register in the device configuration space to (a) be implemented and (b) be programmed by BIOS or PCI initialization code to a suitable value.

The number of bytes transferred is constrained by the transfer type (MR, MRL, MRM, MW, MWI) and the LATENCY_TIMER value. The LATENCY_TIMER is described briefly later in this paper and by the PCI Local Bus Specification.

Memory Writes are typically the simpler case from a software performance perspective. DMA writes are buffered by the chip set and routed to the memory controller[6] at whatever rate the internal interconnect supports. Data throughput is typically limited by the PCI bus controller[7] or memory controller.

Reads are more complicated because the memory controller latency is harder to hide. All data handling systems (like disk IO) deal with this problem by "Read Ahead" (a.k.a. prefetching) or caching. However, large caches like those implemented in a CPU are expensive in many ways. And large caches don't help much since the read and write access patterns for IO devices typically aren't for the same cache lines repeatedly. Or in the case of shared data, the IO device competes at times with the CPU for the cache line.

Successive requests for bulk data can be prefetched by the I/O controller. I was expecting prefetching to make a big difference for PCI devices. But the individual devices I tested (except 53c1010 Consistent DMA) did not perform better or worse for different levels of prefetching. Like disk IO, the effectiveness of the prefetching really depends on the data access pattern. I suspect this is because the PCI devices are designed to work well without any prefetching and buffer enough data to keep the IO device from stalling.

2.3 DMA on PCI-X

PCI-X obsoletes much of what I was trying to accomplish in this paper. The cache line prefetching hint only applies to PCI. But three new PCI-X-only features are of interest:

• Attributes are part of the PCI-X command.

• Split Transactions replace the goofy "retry forever" schemes used when read data is not available.

• Burst Transactions replace MRM, MRL, and MWI.

2.3.1 Command Attributes

The PCI-X Specification[8] (drafts are available for free) has more details about command attributes in section 2.3, PCI-X Command Encoding. I'll try to summarize below. Two attributes are currently defined for PCI-X commands.

Relaxed Ordering is also described in Section 4.1. In a nutshell, ignoring the PCI ordering rules regarding Programmed I/O (CPU R/W) and DMA (PCI R/W) yields measurable performance gains without sacrificing correct operation. The handful of drivers I've used under PARISC-Linux and IA64-Linux run just fine with outbound data2 ordering rules relaxed.

Note the PCI-X spec doesn't require the PCI-X device to use the Relaxed Ordering attribute when the Relaxed Ordering bit is set in the command register. The ZX1 chip can override the Command Attribute for Relaxed Ordering behavior. And the ZX1 chip set only implements this optimization for outbound data flow (PIO Write/DMA Read return). The PCI-X spec defines optimizations in both directions of data flow.

The No Snoop attribute isn't relevant for most IO devices. Most Linux drivers expect DMA transactions under control of DMA mapping services to be coherent. "No Snoop" means the host driver guarantees the latest copy of a cache line is in the memory controller. And it implies the chip set can perform better if it doesn't have to snoop. My understanding is non-coherent transactions are interesting for graphics devices, not the LAN/Storage devices I work with.

2 Inbound data reordering causes Bad Things™ to happen. Discussed on the ia64-linux mailing list.

2.3.2 Split Transactions

Split Transactions just means the request for information and the completion (reply) of that request are separate transactions on the bus. This is a good thing for several reasons:

• Up to 32 transactions can be pending at the same time. Only 5 bits are defined in the Tag field of the PCI-X command. But the exact number of outstanding transactions supported depends on chip set implementation. This is identical in concept to Tagged Queues defined for the SCSI protocol.

• More efficient: eliminates the need to poll (aka "retry forever") when the PCI controller disconnects in the middle of a (e.g.) read transaction.

• Acts more like a memory bus rather than an IO bus. Thus, it's easier for HW designers to route transactions across a larger fabric.

2.3.3 Burst Transactions

Memory Read Block (MRB) and Memory Write Block (MWB) are replacements for MRL/MRM and MWI respectively. The key difference between the above PCI and PCI-X commands is the addition of a Byte Count field. By making the transfer length visible to the PCI-X controller, the chipset can prefetch cache lines appropriately. This is significant since not involving driver writers for 4 or 5 different OSs to program DMA hint bits is a good thing.



3 DMA Parameters

Two additional parameters defined by the PCI Local Bus specification affect DMA behavior. Both reside in the PCI device configuration space header: LATENCY_TIMER and CACHELINE_SIZE.

3.1 LATENCY_TIMER

LATENCY_TIMER constrains how long the PCI device will burst DMA before volunteering to give up the PCI bus. It should be long enough to transfer several cache lines of data if the device is capable. While this is a "tunable" parameter, I didn't feel it was necessary to experiment with this value since LATENCY_TIMER is pretty well understood and described elsewhere.

3.2 CACHELINE_SIZE

CACHELINE_SIZE is only required by the PCI MWI command. CACHELINE_SIZE has to be the IO cache line size and not the CPU size. Typically this is either the line size of the memory controller or the line size of the outermost CPU cache (e.g. L2 or L3 cache). The CPU can use smaller lines for first and/or second level caches.

Since firmware is expected to program CACHELINE_SIZE with the appropriate value, it's not really a parameter or a tunable.
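
Both fields live in the standard configuration header, so they can be inspected with the generic config-space accessors. The sketch below only dumps them; the function name is made up, and the commented-out write is just an example value, since firmware normally programs both registers.

#include <linux/kernel.h>
#include <linux/pci.h>

static void dump_dma_parameters(struct pci_dev *dev)
{
        u8 lat, clsize;

        pci_read_config_byte(dev, PCI_LATENCY_TIMER, &lat);
        pci_read_config_byte(dev, PCI_CACHE_LINE_SIZE, &clsize);

        /* CACHELINE_SIZE is specified in 32-bit words, hence the *4
         * to report bytes. */
        printk(KERN_INFO "latency timer %u, cache line size %u bytes\n",
               lat, clsize * 4);

        /* A driver would normally only ever touch LATENCY_TIMER, e.g.:
         * pci_write_config_byte(dev, PCI_LATENCY_TIMER, 64);
         */
}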

4 HP ZX1 DMA Hint bits

The HP ZX1 chip set implements several bits for DMA hints. None of these are supported by the current Linux DMA mapping API. My original goal was to propose extensions to the Linux DMA mapping API. But I've concluded it's not absolutely necessary, and I don't know which hints are of interest to other chip sets. Hopefully this paper will precipitate more discussion and comparison of chips and capabilities.

The primary reason it's not absolutely necessary is that IA64 firmware is compensating for an ignorant OS. Firmware errs on the overly aggressive side in setting the cache line prefetching (for simple, single unit tests) and errs conservatively on the correctness case (Relaxed Ordering is disabled). Though it's not optimal, ZX1 IOMMU code could blindly set the Relaxed Ordering hint bit (i.e. not enforce ordering) and some hacks can take care of the others.

And because PCI-X obsoletes the key PCI-only hint, I can't argue HP needs them. My pet architecture (PA-RISC) would benefit since HW shipped to date only supports PCI, but that's not of commercial interest. And PA-RISC alone is not a justification for extensions to the interface.

Even if I had HW descriptions for other IOMMUs, it would be a lot of work to independently abstract the DMA hints. Understanding platform IOMMU support well enough to abstract what's important is non-trivial. And since I'm fundamentally lazy (or "good at optimizing" as Bdale Garbee puts it), I'll pass for now.

Lastly, assuming hints are chip set specific (since no one has abstracted them), introducing hints for each chip set is the path to hell for driver writers. A relatively small number of people (per OS) understand how one IOMMU on one platform works. Trying to get a broader audience to understand several platform chip sets is unrealistic. Been there, done that.



4.1 Relaxed Ordering

Relaxed Ordering tells the HW it can ignore one PCI ordering rule. The PCI-X specification offers this optimization in each direction (not just outbound) with its definition of Relaxed Ordering. HP's chip set is sufficiently well implemented in the inbound (DMA writes) path that the inbound optimization isn't helpful and thus not implemented.

HP also calls this optimization "PIOW/DMAR Ordering." The cryptic acronym means "Programmed IO Writes/DMA Reads" Ordering. Setting this hint indicates the driver and device don't depend on ordering of DMA read returns and PIO writes for correct operation. This hint allows DMA read returns to bypass PIO writes in order to prevent an in-progress DMA burst from disconnecting on the PCI bus and forcing retries.

I didn't have any expectations for this hint, mostly because of my ignorance when I started this investigation.

4.2 Read Current

Read Current transactions get the most recent copy of cache line data without changing the cache line state. The key thing is the CPU can keep cache line ownership. By not giving up ownership, the CPU can continue to modify cache line contents without having to fight with the IO controller (ping pong) for the cache line. However, the copy of the cache line is not maintained and can become stale. It should be consumed immediately (for some finite definition of NOW; parents will understand). Thus it's most useful when data is leaving the cache coherency domain (i.e. DMA reads).

Most driver writers will not need (or want) to worry about the Read Current hint. First, different chip sets have minor variations in implementation which may in fact still ping-pong the cache line. Secondly, the Read Current hint has no effect on chip sets which already implement DMA reads by issuing Read Current transactions. And third, to date, none of the PCI devices that interest me obviously benefited by explicitly setting or clearing this hint bit on the platforms I've tested.

Read Current is implemented on both the PARISC Runway[9] and McKinley bus.

4.3 Cache line Prefetching (PCI Only)

Cache line prefetching for IO devices serves the same purpose as prefetching does for CPUs: avoid stalling by bringing data closer before it's needed. The amount of prefetch needed is a function of the device's data consumption rate and the actual memory controller latency.

For example, if the memory controller can deliver a cache line in 120ns and the device can consume a cache line in 120ns (8 * 15ns), we need to prefetch 1 cache line at any given moment in time. In other words, 2 cache lines of data will be in flight at any given moment in time. But as system workload increases, the average memory controller latency usually goes up too. It might really take 200 or 300ns to deliver a cache line. We need to compromise and pick the number of cache lines to prefetch so things work OK under worst case but perform optimally in the "expected workload" range.
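
The sizing rule above can be restated as "prefetch roughly as many lines as the device drains while one fetch is outstanding." The toy program below just reproduces that arithmetic for the figures quoted in the text; it is an illustration, not anything the chip set computes.

#include <stdio.h>

/* Lines to prefetch ~= fetch latency / drain time, rounded up. */
static unsigned int prefetch_lines(unsigned int fetch_latency_ns,
                                   unsigned int drain_time_ns)
{
        return (fetch_latency_ns + drain_time_ns - 1) / drain_time_ns;
}

int main(void)
{
        /* 120ns fetch vs. 120ns drain (8 * 15ns): prefetch 1 line,
         * i.e. 2 lines in flight at any moment. */
        printf("light load: %u line(s)\n", prefetch_lines(120, 120));

        /* A loaded memory controller at 300ns against the same drain
         * rate wants 3 lines prefetched. */
        printf("heavy load: %u line(s)\n", prefetch_lines(300, 120));
        return 0;
}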

The ZX1 chip set can deliver a 128 byte cache line in about 110 ns.[10] PCI devices can only consume 128 bytes every 240 ns (16 * 15ns) at best. The PCI device probably stalls less than 1/3 of the time waiting for data. This is substantially different than for the PARISC chip set, which can only deliver a 64 byte cache line in about 180 ns.[11]



The PCI-X mode of operation does NOT support cache line prefetching hints. It's not necessary because with split transactions, the device can have much more IO outstanding and, in effect, perform its own prefetching.

4.4 DMA Block Size (PCI Only)

DMA Block Size tells the chip set when to stop prefetching. Prefetching will continue up to the block size boundary and resume when the first cache line of the next block is requested.

This hint also does not apply to PCI-X busses. It's not necessary because the PCI-X Burst Transactions specify the number of bytes being transferred and the chip set (or OS code) doesn't have to guess when to stop prefetching.

I didn't expect block size to matter much in single unit testing. It would be interesting to know how much it matters when testing a fully loaded system.

5 Case Study: BCM5701 (PCI)

In 2002, around the same time HP ZX1 products became available, HP started shipping Tigon3 NICs (designed and tested by HP). The BCM5701 NIC supported by HP-UX is shipped operating in PCI mode.

The test used is:

/opt/netperf/netperf -l 60 \
    -H 10.0.30.0 -t UDP_STREAM -- \
    -m 1024 -s 131072 -S 131072

UDP_STREAM is useful for testing output if the host networking stack only sends what the NIC can consume. I'm told this is the case for Linux. And while some applications really do run on top of UDP, I also ran the TCP_STREAM test to get an idea of the workloads I'm familiar with.

Client (HP RX2600, 1GHz) was running 2.4.20-em19 + tg3 v1.5 over the built-in BCM5701. Server (HP RX2600, 900MHz) was running the same kernel, same built-in BCM5701.

NICs were connected via a cross-over cable and set to either 1500 or 9000 byte MTU.

Firmware sets the default PCI Command Hint to 3 cache lines prefetch, Relaxed Ordering disabled, 4k block size, Read Current enabled.

I then varied the DMA Hints on the Client who was sending packets. While this sounds backwards, it's the netperf point of view. We want to observe the netperf "client" send performance.

5.1 Relaxed Ordering

The Relaxed Ordering Hint is (on) enforced by default. I've turned it off selectively for MR, MRL and MRM PCI transactions in Table 1. Runs with 2.4.20+tg3 v1.5 only showed about a 4% improvement with 1k messages.

Previous experience with UDP_STREAM testing on RHAS 2.1 (IA64, e.25?, using tg3 v0.99) demonstrated nearly 10% performance improvement with 1080 byte messages. Without ordering enforced, netperf reported 862 Mb/s3 vs. around 775 Mb/s when ordering was enforced (default behavior).

The TCP Stream test showed a smaller, but similar, sensitivity to this parameter. Clearing the Relaxed Ordering hint in the MRM hint for Streaming DMA resulted in 778 Mb/s (vs. ~758 normally) using 1024 byte messages and 1500 byte MTU.

3 Or 848 Mb/s when the netperf client ran on the same CPU as the one interrupts were directed at.



PCI Cmd    Consistent    Streaming
NONE         759.61        758.74
MR           759.09        760.67
MRL          759.99        759.70
MRM          759.68        797.19
ALL          758.99        800.65

Table 1: BCM5701 UDP_STREAM Relaxed Ordering, 1k Msg

5.2 Cache line Prefetching

Neither TCP nor UDP showed any statistically significant differences as I varied the cache line prefetching for MR, MRL, or MRM commands. This was true for both 1k (1500 byte MTU) and 8k (9000 byte MTU) message sizes.

I suspect this is primarily because I'm measuring what the card is buffering, not PCI bus utilization. The card's ability to buffer is not affected by how inefficiently the PCI bus is used. Unfortunately, I don't have the tools to measure PCI bus utilization on the RX2600.

Again, like learning PCI-X obsoletes cache line prefetching, this is a disappointing but useful result.

5.3 DMA Block Size

Unlike cache line prefetching, I didn't expect much difference between the various block sizes. Once I knew prefetching makes no difference for the BCM5701, the fact that varying the Block Size hint also makes no difference was no surprise.

5.4 Read Current

Disabling the Read Current hint for Consistent mappings will crash the system. It's not clear to me why. I talked with the HW designers and it is clearly not an expected result given how the read/write paths are implemented.

I wasn't expecting any measurable performance difference with (vs. without) Read Current for Streaming DMA. And in fact, I didn't see any.

6 Case Study: BCM5704S (PCI-X)

Since I only have one BCM5704S,4 I ended up connecting both ports of the BCM5704S (tg3 v1.5) to the 82546EB (e1000 4.4.12-k1) in the other machine.

It’s worth mentioning the BCM5704S sitsbehind an IBM PCI-X to PCI-X bridge:

...
+-[80]-+-01.0-[81]--+-04.0  QLA2312
|      |            +-04.1  QLA2312
|      |            +-06.0  BCM5704S
|      |            \-06.1  BCM5704S
|      \-1e.0  PCI Bus Controller
...

The bridge plays a bigger role in performance than people expect. For grins, I sent 4k messages out the client through both BCM5704S ports at the same time using default hints. Throughput was 990 ± 0.1 Mb/s for each port (total 1980 Mb/s). vmstat reports ~25% CPU (dual CPU systems) on the server (e1000 driver) and about 33% on the client (tg3 driver). Why was throughput so good? The IBM PCI-X bridge is prefetching data for the chip and also supports split transactions. The prefetching caused some heartburn for the IOMMU code since the IBM bridge ended up prefetching past page boundaries on early prototypes. Changes were made to the ZX1 PCI Bus Controller (aka LBA) to stop the prefetching behavior.

4 HP has no plans for productizing anything with BCM5704 on IA64 at this time. It happens to work and provides a nice comparison to the BCM5701 case study (PCI-X vs. PCI).



PCI Cmd    Consistent    Streaming
ALL          949.35        950.93
MR           952.97        951.79
MRL          952.64        952.66
MRM          953.97        952.58
NONE         951.19        945.10

Table 2: BCM5704 TCP_STREAM 8k Msg


6.1 Relaxed Ordering

With 4k messages, running TCP_STREAM gave consistent results around 990 ± 0.1 Mb/s. This wasn't the case for 8k messages. I'm not sure why, since the MTU should have been 9k when running the tests. Table 2 is included for your amusement only.

It's irritating that I don't know why the results are lower than with 4k messages or what's causing the variability. For the record, the UDP_STREAM test was able to send 984.18 ± 0.01 Mb/s using 4k messages and 992.04 ± 0.01 Mb/s for 8k messages.

7 Case Study: 82546EB (PCI-X)

This is Intel’s “4th Generation Gigabit MACdesign with fully integrated, physical-layer cir-cuitry to provide two standard IEEE 802.3 Eth-ernet interfaces. . . .”5 Same setup as with theBCM5701 except an Intel add-on NIC is inboth netperf client and server. Both NICs areconfigured to use 9000 byte MTU.

Table 3 shows results for the driver Intel of-5HP has no plans for productizing 82546EB on IA64

at this time. 82546EB only happens to work under Linuxbecause e1000 driver uses I/O Port space. This chip hasserious bugs when using MMIO space to access regis-ters.

Msg Size    UDP TX    UDP RX
1024        937.77    492.27
1024        937.76    477.22
4096        983.70    959.55
4096        983.70    959.59
8192        991.80    991.80
8192        991.80    991.80

Table 3: e1000 v4.3.15 UDP (Mb/s)

Msg Size    TCP       UDP TX    UDP RX
1024        976.75    937.76    501.25
4096        978.16    983.72    983.70
8192        907.69    991.80    991.80

Table 4: e1000 v4.4.12 TCP/UDP (Mb/s)

Table 3 shows results for the driver Intel offered on their web site download area: the e1000 v4.3.15 driver. But as Table 3 shows, TCP results varied from 639 to 660 Mb/s (1k messages) and got worse (540-565 Mb/s) for 8k messages. UDP results for smaller messages were very poor as well. Something is clearly wrong.

In contrast, TCP Streaming performance for the v4.4.12-k1 e1000 driver was quite good. With default hints, both ports combined could send about 1770 Mb/s using 8k messages.

7.1 Relaxed Ordering

Disabling ordering enforcement did not change performance in any statistically significant way. In fact, UDP results were identical to Table 4 except for a slightly higher UDP RX result. TCP results also showed the same 70 Mb/s drop for 8192 byte messages. And combined port throughput stayed around 1770 Mb/s for the 8k message size.

For grins, combined throughput with 4k message size achieved 1936 ± 1 Mb/s. Definitely an impressive result given both ports are sharing the PCI-X bus.



               Consistent           Streaming
PCI Cmd     Order    Current     Order    Current
MR           223       NA         221       220
MRL          220       NA         221       214
MRM          220       NA         221       208
NONE6        222       NA         219       217

Table 5: 53c1010 Ordering/Current Hints (MB/s)


7.2 Read Current

Clearing the Read Current bit for either Consistent or Streaming DMA resulted in a slight drop (890 Mb/s) for the TCP Streaming test compared to Table 4. I'm suspicious of this result because afterwards, I could consistently only get 960 Mb/s (8 Mb/s less) for TCP Streaming using 4K messages.

I didn’t run UDP tests for Read Current Hint.

8 Case Study: LSI 53c1010 (PCI)

LSI’s 53c1010 (Ultra3 LVD) is pretty widelyused along with 53c896 (Ultra2 LVD). Bothare driven by the sym53c8xx_2 SCSI driver.

Since parallel SCSI busses are not duplex, testing this was fairly straightforward. I set up a MD RAID0 across both channels (alternating disks) with 10 odd-ball Ultra3 disks (9, 18, 36GB, mix of vendors). Then ran:

dd if=/dev/zero of=/dev/md4 \
    bs=64k count=200000

I learned later that running RAID0 was not such a good idea. More on this in the u320 (53c1030) case study.

Consistent DMA
Prefetch Depth     0      1      2      3
MR                223    219    219    224
MRL               224    223    221    224
MRM               104    168    220    223
Block Size        512   1024   2048   4096
MR                223    223    222    223
MRL               220    218    218    221
MRM               220    221    219    220

Streaming DMA
Prefetch Depth     0      1      2      3
MR                218    214    223    219
MRL               221    223    216    219
MRM               216    221    214    220
Block Size        512   1024   2048   4096
MR                216    220    208    219
MRL               215    221    221    222
MRM               218    210    224    223

Table 6: 53c1010 Cache line Prefetching, MB/s

8.1 Relaxed Ordering and Read Current

I’ve globbed both Relaxed Ordering & ReadCurrent into Table 5 only because they areboth boolean values. Differences of less than3 MB/s are probably not significant.

Disabling Read Current for Streaming DMAclearly reduces performance for MRL andMRM transactions. I thought the “NONE”(217 MB/s) result is a weighted average of allthree types of transactions but that is a logi-cal fallacy. This result can’t be better than theworst case unless some other interaction is tak-ing place.

8.2 Cache line Prefetching

Of particular interest in Table 6 is the extent to which Consistent MRM prefetching affects throughput. I guessed this is because the 53c1010 "scripts" are kept in host memory (but cached locally) and all IO grinds to a halt when an uncached portion of the script is not available. James Bottomley suggested the entire script fits in on-board RAM and is loaded under Host CPU control at init time. If true, then the scripts themselves are sequentially fetching control data and getting hurt badly by not having the control data available immediately.

6 Well, this should really be "ALL" for the Relaxed Ordering hint since all the bits are set.

9 Case Study: LSI 53c1030 (PCI-X)

Using the same methods (and the same u160 disk drives) as for the 53c1010 didn't work. The results varied from 160 MB/s to 185 MB/s regardless of hint settings. I expected at least equivalent performance to the 53c1010 and suspect whatever is causing the variability is also limiting performance.

Trying a different method suggested by James Bottomley led to an interesting result. He was appalled I was using RAID0 because of issues with the MD layer not coalescing IO requests again at the disk level. But using a 64k chunk, aka stride, I thought would provide big enough blocks.

To avoid RAID0, James suggested checking whether multiple copies of sg_dd (one per disk) would work. Well, I'd like to see multiple IOs outstanding per spindle. And fortunately the sgp_dd man page suggests exactly that. Nice.

While the advantage of this method is it bypasses lots of kernel code related to the buffer cache, the drawback is it also bypasses all the statistics gathering in the kernel. Neither vmstat nor iostat sees any of this disk activity. The solution is to measure the throughput of each disk individually (23 to 48MB/s) and then adjust the number of blocks transferred such that all 10 disks finished their sgp_dd process within about 1 second of each other.

PCI Cmd    Consistent    Streaming
ALL          268421        266666
MR           266666        266666
MRL          266666        264935
MRM          268421        266666
NONE         264935        266666

Table 7: 53c1030 Relaxed Ordering, KB/s

Then date +%s could time the cumulative I/O. Add up how much data each sgp_dd copied and divide by the total time. This worked better than I expected, and Table 7 shows how consistent that data was. The accuracy of the data is ± 1 second of 153 second (average, 266666 KB/s) run times. In retrospect, this is clearly a better method than using RAID0 and suggests roughly the performance RAID0 should be getting.

The bad news is that despite contortions to collect reliable data, neither Read Current nor Relaxed Ordering hints made a statistical difference for the configuration I had. I still wonder if I misunderstand what the hint bits mean in the context of PCI-X. But I couldn't find anything in the chip set documentation to indicate otherwise. I worry the ZX1 chip set might "allow" (logical And) the ZX1 DMA Hint and PCI-X command attribute bits vs. "forcing on" (logical Or). My expectation was the latter based on documentation.

10 Case Study: qla2312 (PCI-X)

The qla2312 is a Qlogic PCI-X, dual port, 2Gb/s FC chip. Qlogic sells this chip for both dual port and single port FC HBAs. A single port is theoretically capable of 2 Gb/s (about 200MB/s) output and input (full duplex). The dual port HBA is theoretically capable of 800 MB/s throughput. I tested the qla2312 in two configurations: with the IBM PCI-X bridge and again without (thinking the PCI-X bridge was substantially impacting results).



I used the same methodology as for the 53c1030 Case Study with one of the two ports. Unfortunately, I didn't get a second DS2405 enclosure until much too late. And then I found out the second FC port on the card with the PCI-X Board was dysfunctional. I was only able to run a few tests through both ports on a QLA2342 FC HBA (uses the qla2312 chip).

The qla2312 HBA was running in PCI-X mode with Firmware version 3.01.18. Same 2.4.20-em19 kernel as before with the qla2300 v6.04.00 driver.

10.1 Outbound vs. Inbound IO

To cut to the chase, setting Relaxed Ordering or disabling Read Current hints did not affect performance. With 8 disks, sgp_dd was consistently writing 190 ± 1 MB/s. Two things might have contributed to this result: no inbound load was saturating the IO path, or the PCI-X to PCI-X bridge was "hiding" the effect.

Alone, the sgp_dd inbound (IO reads) workload would get 198 MB/s consistently. Combined with the same outbound (IO writes) workload as above, the inbound rate drops to about 145 MB/s and the outbound workload hovers around 51 MB/s ± 1 MB/s.7 Again, Relaxed Ordering and Read Current hints made no difference.

Switching to the other RX2600 (900MHz), which had a qla2312 connected to the same set of disks, I reproduced the 198412 KB/s on the inbound-only workload as well. Bidirectional throughput was about the same: 143 MB/s in and 53 MB/s out (9% CPU utilization).

7 Given the 3:1 bias of inbound:outbound throughput, I tried 6:7 (inbound:outbound) and 5:8 disks; both yielded basically the same results.

Having spent several days on this, I started to doubt this HBA was operating in full duplex mode despite all the marketing literature making such a claim. Scrounging through the 300 line qla2x00_nvram_config() function suggests full duplex mode is intentionally disabled:

...
/*
 * Setup driver firmware options.
 */
icb->firmware_options.enable_full_duplex = 0;
icb->firmware_options.enable_target_mode = 0;
...

Setting enable_full_duplex to 1 did not help.

10.2 Dual Port

I tried the same sgp_dd workload on both ports. Unfortunately, the 7 disks in the DS2405 I was loaned were ST336605FC (10k RPM) and not ST336753FC (15K RPM). This meant I had to compensate by adjusting the amount of data written to various disks again.

The bottom line is varying Read Current and Relaxed Ordering hints didn't matter for this workload. The outbound sgp_dd tasks managed 370 MB/s consistently.8

8 I can't help but wonder if I've got some piece of the puzzle wrong. But I've reviewed everything several times and if something is wrong, it's not obvious to me. I'll update the paper if I learn otherwise.

10.3 Summary of Lessons learned

The quote about the journey being more important than the destination comes to mind. Several things were learned on this journey:

1. Firmware teams will compensate for stupid OSs. In this case performance gains aren't what I expected because firmware was already setting aggressive cache line prefetching. On fully loaded systems performance could be worse... but I haven't measured that yet. However, firmware couldn't use unsafe hints (e.g. Relaxed Ordering).

2. IO Card Vendors will compensate for stupid chip sets. It didn't initially occur to me that high performance IO cards would buffer IO in order to compensate. But the tradeoff is latency.

3. PCI-X is a different bus protocol compared to PCI, not just a speedup. The differences in bus protocol obsoleted the key thing I was hoping to measure (cache line prefetching for DMA Reads).

4. Don't start by testing an adaptive driver. An adaptive driver will adjust its operating parameters after a period of time to optimize for the given workload. I wasted time trying to figure out why my tg3 performance measurements varied in unpredictable ways. Adding "sleep 30" between scripted test runs helped solve that problem.

5. The major weakness of this paper is methodology. I didn't know what I was measuring until I started investigating why I didn't get expected results. I need a PCI/PCI-X logic analyzer which can accurately measure the bus utilization. I believe HP has several such analyzers on site; they just won't fit in the RX2600. I would have to chop open the sheet metal so IO cards could stick out. I'm not willing to do that because the airflow would be, uhm, dramatically altered.

10.4 Future Work

Several things come to mind that are still outstanding:

1. Test fully loaded systems. The busier the memory controller is, the higher the latency of memory fetches will be (2x-4x). We don't want to waste memory bandwidth (prefetching too much) or IO bandwidth (prefetching too little). Just enough to compensate for average latency.

2. PARISC implementation only supports PCI. Memory controller latencies are slower, as is the IO MMU. It should benefit more from DMA Hints than IA64 does. I will update this paper (and remove this "Future work" item) with PARISC results when I have them.

3. PCI-X DMA Hints. Not as much to do here but still worth exploring. Understand how different chip sets implement PCI-X DMA support.

4. PCI/PCI-X Logic Analyzer. Perhaps in the future I can get access to an RX5670 with a logic analyzer card installed and re-run the tests. Logistically it's non-trivial since the RX5670 is not a machine I can walk around with under my arm.

5. More 2Gb/s FC disks would be useful. Need to figure out how to stress input and output at the same time. Maybe stripe across both controllers, two RAID0 md devices; one for reading and the other for writing.

10.5 And thanks to. . .

A fair number of people contributed to this paper. They provided support, ideas, or reviewed content. In no particular order:



Alan C. Meyer, James Bottomley, Erin Handgen, Thomas Bogendörfer, Kevin Carson, Stephane Eranian, David Mosberger, Alex Williamson, Dave Miller, Joe Cowan, Fred Worley, Mike Krause, Matthew Wilcox.

My apologies if I omitted other contributors.

References

[1] http://ftp.parisc-linux.org/docs/astro_intro.ps

[2] http://ftp.parisc-linux.org/docs/elroy_ers.ps

[3] http://iou.parisc-linux.org/ols2002/

[4] http://www.hp.com/products1/itanium/chipset/index.html

[5] http://cvs.parisc-linux.org/*checkout*/linux/Documentation/DMA-mapping.txt?rev=HEAD&content-type=text/plain

[6] http://h21007.www2.hp.com/dspp/files/unprotected/linux/zx1-mio.pdf

[7] http://h21007.www2.hp.com/dspp/files/unprotected/linux/zx1-ioa-mercury_ers.pdf

[8] http://www.pcisig.com/

[9] http://ftp.parisc-linux.org/docs/astro_runway.ps

[10] http://www.hp.com/products1/itanium/performance/architecture/lmbench.html

[11] http://lists.parisc-linux.org/pipermail/parisc-linux/2002-March/015966.html


A 2.5 Page Clustering Implementation

William Lee Irwin III
IBM Linux Technology Center

[email protected] | [email protected]

Abstract

Page clustering is a form of "large pages" that increases the kernel's minimum allocation unit for physical memory (base page size). There are several good reasons to do this. One is a form of prefaulting accomplished by instantiating groups of PTE's mapping a given base page. Another is a constant factor reduction of the number of objects the kernel must traverse in order to manipulate a given collection of pages. The increase in PAGE_SIZE also implies an increase in PAGE_CACHE_SIZE, which enables the use of filesystems with larger blocksizes. Last, but not least, the constant factor reduction of lowmem consumed by mem_map is crucial for the performance of 64GB i386 machines.

Page clustering has a number of technical challenges involved in a 2.5 counterpart of the 2.4.7 implementation. First, highpte poses unusual difficulties, as neither sub-PAGE_SIZE highmem allocations nor sub-PAGE_SIZE kmapping were supported in the original implementation. Rmap also poses challenges, as it makes direct assumptions about PTE's being of size PAGE_SIZE. Finally, arch code above all makes many assumptions about PAGE_SIZE's relationship to the area mapped by PTE's, particularly in arch support and VM initialization code.

In summary, the author will describe the problems that arose during his implementation of page clustering for 2.5 along with their solutions, for an audience of kernel programmers.

1 What is page clustering?

1.1 Background

Memory present in a system is described by physical addresses. Most (if not all) modern machines are byte addressable, but the MMU usually operates at a lower "resolution," and its finest resolution is what page clustering refers to as MMUPAGE_SIZE. When the MMU's translations are set up, be they in hardware-interpreted data structures or in software-programmable TLB's, they refer to "page frames" of that size or larger, which effectively are a unit of measurement for memory. Similarly one may refer to virtual memory in those units, and arbitrary relationships between virtual page frames and physical page frames are constructible with a combination of hardware and software translation tables.

Demand-paged virtual memory systems, when a task takes a TLB miss not resolvable via the kernel's software translation tables (which may be interpreted directly by hardware), are then faced with the task of finding a physical page frame to back a virtual page frame with.

Without page clustering, the kernel maintains a data structure, the struct page, representing each physical page frame, and another, the page table entry, to represent each virtual page frame. When not constrained by hardware, the kernel is free to make ridiculous choices of structures for the page tables.



For instance, each virtual page frame could (in principle) be represented by a node in a binary search tree or a linked list. Linux® uses radix trees, as mandated by hardware on i386, on all architectures; these are somewhat more efficient than various other choices, though some architectures natively use other structures such as inverted page tables for them.

The page tables are assisted by a binary search tree of virtual extents representing either extents of files or zero-filled regions, whose nodes are called vma's. The physical page is chosen so as to arrange virtual contiguity of file pages in tandem with file offset contiguity, or otherwise to fetch largely arbitrary pages and zero them out before mapping them. When the relationship of a pagetable entry to a physical page frame, and its corresponding struct page data structure, is restricted to within a single vma, it is a 1:1 relationship.

A system for tracking memory in use and not in use is built around this relationship, and so the MMUPAGE_SIZE became the allocation unit for memory. PAGE_SIZE is used to simultaneously refer to the notion of the MMU's finest granularity and the memory allocator's finest granularity in preexisting Linux® code.

1.2 How page clustering differs

First, one should observe that if the MMU's finest granularity is MMUPAGE_SIZE, one may simulate an MMU with a granularity of any power of two multiple (PAGE_MMUCOUNT) by simply instantiating PAGE_MMUCOUNT PTE's at a time, and making each struct page refer to PAGE_MMUCOUNT contiguous and aligned native page frames. Also, if the pagetables are not constrained by hardware, one can easily alter their structure to only have one PTE for each PAGE_SIZE instead of MMUPAGE_SIZE and by so doing reduce their space consumption. Additionally, if the MMU supports translations of size PAGE_SIZE, one can simply perform one TLB insertion (or PTE instantiation if hardware-interpreted) for each PAGE_SIZE area mapped by the pagetables.

One could say this is a "weak form" of page clustering. It has the undesirable side-effect of breaking binary compatibility and hence not being transparent, but has several advantages. The port of Linux® to the IA64 processor already uses this low code impact technique for performance reasons, as it reduces TLB misses and the overhead of manipulating large collections of pages by a factor of PAGE_MMUCOUNT. Some performance benefits for I/O are possible, as physical contiguity is better preserved so larger scatter/gather lists are possible, though this is offset by a larger cost of preparing buffers for small I/O transactions. BSD's VAX port did it this way.

The binary incompatibility inherent in the above approach makes it unsuitable for practical deployment on systems with significant preexisting userbases. For instance, ELF executables are linked in ways mandating differing protections within what could potentially be a single PAGE_SIZE virtual region, and mmap() is often performed at offsets that are not PAGE_SIZE-aligned or in lengths divisible by PAGE_SIZE. To address the mmap() granularity issue, the 1:1 relationship between virtual page frames and accounting structures for physical memory must be extended to PAGE_MMUCOUNT:1. There is also a very invasive audit required to enforce the newly introduced distinction between MMUPAGE_SIZE and PAGE_SIZE by programming dimensional analysis into various address and index calculations. This could be called the "strong form" of page clustering.



The solution, in high-level terms, essentially has two cases for userspace. The first, which is easier, is file-backed memory. The unit of memory cacheing file contents is PAGE_CACHE_SIZE, which (for 2.5) is identical to PAGE_SIZE. An index in units of PAGE_SIZE is usable for recovering the struct page representing the area of the file that would need to be faulted in. However, to preserve mapping semantics one must also recover an offset into the area represented by the struct page in units of MMUPAGE_SIZE. The second case is anonymous memory, which is not forced to be simultaneously virtually and physically contiguous by virtue of its contents. Userspace demands one MMUPAGE_SIZE unit of memory but receives a PAGE_SIZE unit of memory, and so to prevent very noticeable amounts of waste, one scans nearby PTE's for other virtual pages that anonymizing faults, that is, write faults on COW file pages or on the zero page, could be taken on. These are candidate pages for copying (the zero page is special cased to use faster zeroing algorithms on most architectures). The anonymizing case results in a complex relationship between the virtual pages in a process and the anonymous page.
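
The file-backed case boils down to a pair of divisions. The fragment below is only a sketch of that arithmetic, not code from the patch: MMUPAGE_SHIFT is assumed to exist as the MMUPAGE_SIZE analogue of PAGE_SHIFT, and vm_pgoff is assumed to be kept in MMUPAGE_SIZE units.

#include <linux/mm.h>

/* Recover the page cache index (PAGE_SIZE units) and the sub-page
 * offset (MMUPAGE_SIZE units) for a file-backed fault. */
static void fault_coordinates(struct vm_area_struct *vma,
                              unsigned long address,
                              unsigned long *index,
                              unsigned long *suboffset)
{
        unsigned long mmupfn;

        mmupfn = ((address - vma->vm_start) >> MMUPAGE_SHIFT)
                        + vma->vm_pgoff;
        *index = mmupfn / PAGE_MMUCOUNT;        /* which struct page */
        *suboffset = mmupfn % PAGE_MMUCOUNT;    /* which MMU page in it */
}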

In summary, page clustering divorces the kernel's internal allocation unit, or the size of an area represented by a struct page, from the notion of the MMU's mapping granularity, with the constraint that the allocation unit be larger.

1.3 Why page clustering?

Page clustering introduces several advantages. The first is that by using a larger unit for cacheing file pages, one can support filesystems with larger block sizes. The second is that the additional physical contiguity introduced by the larger allocation unit allows one to construct larger scatter gather lists for I/O (again with the proviso about preparing write buffers). The third is that the number of objects in various collections of pages is reduced for a linear speedup of the algorithms. The fourth is that the page faults may be batched, reducing the page fault rate.

The fifth, which is the primary reason why this project to resurrect the 2.4.7 page clustering patch was carried out, is largely specific to i386 PAE, though possibly also applicable to 32-bit kernels running on large memory 64-bit machines. sizeof(struct page)/PAGE_SIZE is the constant of proportionality for the fraction of memory consumed by the struct page's required to account for all the physical memory in the system. On 32-bit systems with extended addressing, or when the kernel runs in 32-bit mode, this is irrespective of virtualspace, and the total memory consumed may be larger than kernel virtualspace. For instance, with a fully-populated 40-bit physical address space, a 32-bit virtualspace, a 4KB PAGE_SIZE, and a 64B sizeof(struct page), the coremap is 16GB in size, which is infeasible to simultaneously map. Page clustering reduces this space overhead by a factor of PAGE_MMUCOUNT, which is arbitrary (within the constraints of the quality of implementation), and so renders the coremap's space overhead O(1) with respect to physical memory.
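
The 16GB figure follows directly from the quantities quoted above; the toy program below just shows the arithmetic.

#include <stdio.h>

int main(void)
{
        unsigned long long phys_bytes = 1ULL << 40;     /* 40-bit space */
        unsigned long long pages = phys_bytes / 4096;   /* 4KB pages   */
        unsigned long long coremap = pages * 64;        /* 64B apiece  */

        /* Prints 16: 2^28 struct pages at 64 bytes each is 2^34 bytes. */
        printf("coremap size: %llu GB\n", coremap >> 30);
        return 0;
}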

2 Implementation

2.1 Early boot

The issues encountered in early boot were largely simple, but widespread. Early boot debugging was done on a 16 processor NUMA-Q® with 16GB RAM. First, pagetables and various fragments of memory that were formerly assumed to be 4KB but described with PAGE_SIZE needed to be updated, including mappings for the IO-APIC, numerous pagetables, and structures like the idle threads' stacks. Then memory detection required various kinds of dimensional analysis to properly calculate coremap indices from page frame numbers and vice-versa.

Numerous index calculations and indexing operations into the coremap were broken. They had no counterpart in 2.4.x, but didn't require much thought to correct:

#define pfn_to_page(pfn)	(mem_map + (pfn))
#define page_to_pfn(page)	((unsigned long)((page) - mem_map))
#define pfn_valid(pfn)		((pfn) < max_mapnr)

became

#define pfn_to_page(pfn)	(&mem_map[(pfn)/PAGE_MMUCOUNT])
#define page_to_mapnr(page)	((unsigned long)((page) - mem_map))
#define page_to_pfn(page)	(PAGE_MMUCOUNT*page_to_mapnr(page))
#define pfn_valid(pfn)		((pfn) < max_mapnr*PAGE_MMUCOUNT)

and so on.

Of the issues, kmap_pte and pkmap_page_table were particularly troublesome; to get booting, they were removed in favor of walking kernel pagetables, but are to be reinstated in the near future. The issue was that they were allocated 4KB at a time using the bootmem allocator, but were assumed to point at contiguous pagetables capable of mapping the entire permanent kmap and atomic kmap arenas, which had grown to where they required multiple pagetables each.

An unusual issue arose from maintaining an 8KB stack size while raising PAGE_SIZE to arbitrary sizes. fork_init() first received a divide by zero from the following code fragment:

/* create a slab on which task_structs can be allocated */
task_struct_cachep =
	kmem_cache_create("task_struct",
			  sizeof(struct task_struct), 0,
			  SLAB_MUST_HWCACHE_ALIGN, NULL, NULL);
if (!task_struct_cachep)
	panic("fork_init(): cannot create task_struct SLAB cache");

/*
 * The default maximum number of threads is set to a safe
 * value: the thread structures can take up at most half
 * of memory.
 */
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 8;

This was clearly due to THREAD_SIZE/PAGE_SIZE vanishing. But then unusual errors arose for unclear reasons. As it turned out, I'd changed kernel stacks to be slab allocated, but the kernel stacks were so small compared to PAGE_SIZE that they used on-slab slab management and so failed to be 8KB-aligned. Changing the threshold to use off-slab slab management for objects larger than MMUPAGE_SIZE, in addition to the other criteria, sufficed, with zero runtime impact on the PAGE_SIZE == MMUPAGE_SIZE case.
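
A hedged sketch of the adjusted threshold in kmem_cache_create() (only the intent is taken from the text; the exact form of the check is assumed):

/*
 * Sketch only: besides the usual size criterion, also manage slabs
 * off-slab once objects exceed MMUPAGE_SIZE, so that 8KB kernel stacks
 * remain 8KB-aligned even when PAGE_SIZE is much larger.  With
 * PAGE_SIZE == MMUPAGE_SIZE the extra test changes nothing.
 */
if (size >= (PAGE_SIZE >> 3) || size > MMUPAGE_SIZE)
	flags |= CFLGS_OFF_SLAB;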

Next, the placement of vmallocspace relative to fixmapspace, and the overrunning of vmallocspace by fixmapspace for unusually large values of PAGE_SIZE, became issues. This was resolved by some painful compile-time mechanics to shove the kmap and permanent kmap windows into vmallocspace, dynamically size them with respect to PAGE_SIZE, and make poor guesses in assembly as to the boundaries between vmallocspace and ZONE_NORMAL. The boundaries turned out to be safe because the assumptions were not truly used apart from an indirect reference to them via MAXMEM. Specifically:


#define VMALLOC_END	(FIXADDR_START - 2*MMUPAGE_SIZE)

#define __VMALLOC_START \
	(VMALLOC_END - VMALLOC_RESERVE - 2*MMUPAGE_SIZE)

#define VMALLOC_START \
	(high_memory \
		? max(__VMALLOC_START, (unsigned long)high_memory) \
		: __VMALLOC_START)

#define __MAXMEM \
	((VMALLOC_START - 2*MMUPAGE_SIZE - __PAGE_OFFSET) \
		& LARGE_PAGE_MASK)

#define MAXMEM \
	__pa((VMALLOC_START - 2*MMUPAGE_SIZE) & LARGE_PAGE_MASK)

This is actually a side effect of a design decision which makes the virtualspace layout change dynamically with PAGE_SIZE. That is, virtual mapping windows raise an issue now that the area callers want to map is usually PAGE_SIZE in size, and to make it the size they expect, fixmapspace must grow. There is an alternative design possible, which is to keep fixmapspace a fixed size, or at most some fixed size, and to have "sliding windows into a partial page." Using such an API would proceed something like the following:


int k;
void *old, *new;

/* map the first MMUPAGE_SIZE-sized piece of each page */
old = kmap_atomic_start(old_page, KM_USER0);
new = kmap_atomic_start(new_page, KM_USER1);
for (k = 0; k < PAGE_KMAP_COUNT; k++) {
	if (k) {
		/* slide both windows to the k-th piece of the page */
		old = kmap_atomic(old_page, KM_USER0, k);
		new = kmap_atomic(new_page, KM_USER1, k);
	}
	memcpy(new, old, KMAP_SIZE);
}
kmap_atomic_end(old_page, old, KM_USER0);
kmap_atomic_end(new_page, new, KM_USER1);

This is somewhat more invasive, but more efficient with respect to virtualspace. As PAGE_SIZE grows in a 32-bit environment with progressively more extended physical memory, such measures become progressively more prudent. However, the need to take such measures may be significantly mitigated by eliminating the permanent kmap pool in combination with using per-cpu pagetables for the kmap windows, as the typical targets needing the largest PAGE_SIZE values have a maximum of 32 or 64 cpus. Eliminating the permanent kmap pool has the additional advantage of preventing deadlocks caused by a number of tasks each attempting to acquire multiple permanent kmaps but acquiring fewer than desired by the time the pool is exhausted.

2.2 coremap initialization

The coremap initialization is worthy of its own discussion. First, in order to satisfy the early boot setup code, the coremap must be laid out so it maps PAGE_SIZE units of memory to properly aligned positions in the coremap. Simultaneously, most (if not all) of the calculations are done with pfn's, so various bits of dimensional analysis must be programmed in.

First, zone->zone_start_pfn and zone->spanned_pages need to be treated consistently. Then, zones_sizes[] needs to be converted to pass PAGE_SIZE units, and free_area_init_core() fixed up to increment its pfn counter by PAGE_MMUCOUNT. bad_range() must then be adjusted for unit conversion before doing its bounds checks. Finally, the page allocator need not keep so many orders around to satisfy allocations of a given size, so use MAX_ORDER - PAGE_MMUSHIFT instead of MAX_ORDER.

Bootmem also needed large adjustments; they were largely done to make its own internal accounting based on MMUPAGE_SIZE and then interface it with the page allocator, which has a PAGE_SIZE granularity. This ended up being rather invasive.

2.3 Kernel pagetables

vmalloc() usage was too widespread to undergo a full audit for space conservation. The choice was between using PAGE_SIZE or MMUPAGE_SIZE as the unit of vmalloc() mapping and allocation, and I chose MMUPAGE_SIZE. This is transparent to userspace, so either would be legitimate, but on i386 PAE vmallocspace is too constrained to take internal fragmentation hits. This meant that instead of a full audit for space conservation, a full audit for misuse of PAGE_SIZE was needed. This turned out not to be very problematic at all, as few drivers needed the conversion, and those that did had mild failure modes, failing only to probe.
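
The typical conversion looked something like the following (the ring-buffer framing and constant here are purely illustrative, not from any particular driver in the patch):

#include <linux/vmalloc.h>

#define NR_RING_ENTRIES	256	/* hypothetical driver constant */

static void *alloc_ring(void)
{
	/*
	 * Before: the driver really wanted 4KB buffers but sized them in
	 * PAGE_SIZE units, which balloons once PAGE_SIZE is raised:
	 *
	 *	return vmalloc(NR_RING_ENTRIES * PAGE_SIZE);
	 *
	 * After: size the area in MMUPAGE_SIZE (4KB) units instead.
	 */
	return vmalloc(NR_RING_ENTRIES * MMUPAGE_SIZE);
}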

page_table_range_init() and relatives greatly disliked the change of units and the ambiguous location of kmap_pte and pkmap_page_table. It proved infeasible to rapidly bring up the system while preserving them intact, so they were removed and the code greatly simplified, at the expense of a very large diff.

2.4 Process pagetables

User pagetable manipulations consisted largely of straightforward substitutions in pagetable code. Of course, something was missed. It appeared that an unusual binary compatibility bug arose with respect to shared libraries that was very difficult to trigger. The cause of this was that there was only one caller of pte_modify() in the core VM, in a corner case of mprotect(). This passed a first pass of inspection because _PAGE_CHG_MASK didn't trip grep, but it turned out to rely on PTE_MASK, which, by virtue of the macro indirection, also slipped past grep. After over a week of chasing it, the substitution that slipped through my fingers was finally carried out.

The next "interesting" binary compatibility bug was that core dumps were corrupted. The get_user_pages() calling convention had become lossy. It was returning struct page's to refer to the areas mapped by PTE's, and it, along with follow_page(), was the only area of the kernel exhibiting this particular kind of confusion. The solution was to return pfn's and not struct page's, and was highly successful. Badari Pulavarty assisted in implementing the portion relevant to direct I/O.
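
The loss of information is easy to see in a small illustration (pte_pfn() is the standard helper; the arithmetic below is illustrative rather than taken from the patch):

/*
 * With PAGE_SIZE > MMUPAGE_SIZE, a struct page names a whole PAGE_SIZE
 * group of mmupages, while the pfn still names the exact 4KB piece the
 * PTE mapped.  Returning struct page's from get_user_pages() discards
 * the sub-page offset computed here; returning pfn's does not.
 */
static unsigned long subpage_offset(pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);

	return (pfn % PAGE_MMUCOUNT) * MMUPAGE_SIZE;
}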

The most interesting bug of all was actually the first, which prevented userspace from running at all. /sbin/init would be stuck in a loop somewhere in userspace, and it could only very rarely be caught in the kernel. What eventually had to be done to track down the issue was to log all page faults. What was eventually discovered was that pid 1 has a special status in the kernel, and loops when taking invalid faults instead of being delivered SIGSEGV. After some poking around, it became evident they were always anonymous pagefaults.

So at first, the workaround was to fragment anonymous pages. But this could only be temporary in order to meet the performance goals. The issue was resurrected when it came time to attempt to fully utilize anonymous pages for performance reasons. It took some time to come around to examining the contents of the purportedly zeroed memory, but eventually the page pointed to by the PTE taking the fault was divined, and it pointed to the zero page. The fact that a nonzero address was being fetched from the zero page prompted the examination of its contents. By an unusual coincidence the author had been implementing the GDT setup for an i386 executive earlier that day, and noticed a very clear resemblance to the contents of the supposed zero page. Very shortly thereafter it was discovered that the empty_zero_page[] used on i386 as backing memory for the zero page was a 4KB array followed immediately by the kernel's GDT. The bug was resolved by using a custom-allocated and zeroed page instead of the struct page tracking empty_zero_page[].
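
A hedged sketch of the fix (function and variable names are assumed; only the idea of backing the zero page with a freshly allocated, fully zeroed PAGE_SIZE page is taken from the text):

/*
 * Sketch only: allocate a dedicated, zeroed PAGE_SIZE page at boot
 * rather than pointing the coremap at the 4KB empty_zero_page[] array,
 * which is immediately followed in the kernel image by the GDT.
 */
static struct page *zero_page;

void __init setup_zero_page(void)
{
	unsigned long addr = get_zeroed_page(GFP_KERNEL);

	if (!addr)
		panic("setup_zero_page: out of memory");
	zero_page = virt_to_page((void *)addr);
}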

Finally, userspace pagetables required fixups in order to prevent extremely wasteful fragmentation. The code turned out to be somewhat hairy, as it required reference counting pagetable pages and some scanning of PMD entries in an aligned PAGE_MMUCOUNT-sized group. Furthermore, in order to interoperate with highpte, significantly more complex definitions of pte_offset_map(), pmd_populate(), and relatives were required.

#define pte_offset_map(dir, address) \
	((pte_t *)kmap_atomic(pmd_page(*(dir)), KM_PTE0) \
		+ (PTRS_PER_PTE \
			* ((pmd_val(*(dir))/MMUPAGE_SIZE) % PAGE_MMUCOUNT) \
		   + pte_index(address)))

pmd_populate() became too large to paste here because it had to deal with several issues to recover from partial unmappings of the PAGE_MMUCOUNT PMD group and PTE page refcounting. For the wary, it collapses to its prior size when PAGE_MMUCOUNT == 1.

2.5 file-backed memory

Handling userspace faulting semantics for file-backed memory was actually trivial. The most unsophisticated fault handling scheme imaginable suffices.

There was a small issue with sys_remap_file_pages(), where the populate methods used the install_page() API internally to perform the dirty work of walking the pagetables down to the PTE to edit; as install_page() referred to the location to map by the struct page, it lost the offset into the page to map. This was trivially corrected with an additional argument carrying the offset.
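
A hedged sketch of the adjusted helper (install_page() is the real 2.5 API named in the text, but the name and position of the extra parameter are assumed):

/*
 * Sketch only: the populate path also passes the MMUPAGE_SIZE-granular
 * offset into the page, so the PTE can be pointed at the correct 4KB
 * piece rather than merely at the struct page.
 */
int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
		 unsigned long addr, struct page *page,
		 unsigned long offset, pgprot_t prot);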

2.6 Swap-backed memory

Swap faults are not truly worth optimizing with pagetable scanning; they don't fragment like freshly zeroed anonymous pages, because the swapcache is an effective lookup structure and userspace can fetch things from it just fine. Instead they are faulted in one by one, and that simplified things, at least temporarily, while the scanning code wasn't in place.

The organization of the swap map differs from the 2.4.7 patch, which created a swap map entry for each MMUPAGE_SIZE piece of a page, and so had to account for reference counts on the page held by multiple swap entries. The 2.5 page clustering implementation instead uses a single swap map entry for every PAGE_SIZE-sized page, which simplifies swap reference count semantics, reduces the vmallocspace consumption of the swap map by a factor of PAGE_MMUCOUNT, and reduces the search space for swapoff. Some differences are also visible in the encoding of swp_entry_t's, which now directly encode swap map indices and offsets into pages at various points throughout the core VM where beforehand they didn't need to.


2.7 anonymizing faults

There is a problem to solve caused by the fact that a process faulting on anonymous memory requests MMUPAGE_SIZE bytes of memory but is granted PAGE_SIZE bytes of memory. Again, there is more than one way to deal with this.

The first, not used here, is to maintain one PAGE_SIZE area as a "ready list" and service anonymous faults from it until it's exhausted.

The second is to speculatively prefault neighboring anonymous pages in order to utilize the entire anonymous page, scanning neighboring PTE's for zero-mapped or COW pages (i.e. pages to be anonymized). This has the potential to reduce the fault rate for some loads, at the cost of not guaranteeing full utilization. Initial indications are that even heuristics that appear relatively weak in comparison to those of the 2.4 patch suffice.

The logic is relatively complex, and some additional complexity as compared to the 2.4.x code was added by simultaneously scanning PTE's both upward and downward. Some additional code is required to cross vma boundaries and detect whether a given page is anonymous or COW. Crossing pagetable page boundaries was not implemented, for the basic reason that PAGE_MMUCOUNT*PMD_SIZE is enough virtualspace to scan to mitigate most of the fragmentation, and also to remain future-compatible with pagetable sharing, which is somewhat adverse to crossing pagetable pages. Additionally, totally unbounded scanning could result in some overhead.

When the scanning code is done, what it has done is assemble a vector of pfn's for all the mmupages it has to copy, and they are by no means contiguous. In order not to be grossly TLB-inefficient, an interface is provided to map vectors of pfn's, kmap_atomic_sg(). The use of it is obvious, as it maps each component of the pfn vector to a virtually contiguous PAGE_SIZE virtual area in its corresponding piece of the virtual page, and the only non-straight-line code in copying is checking for the zero page.
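
A hedged usage sketch (kmap_atomic_sg() is named in the text, but its exact signature and the surrounding helper handling are assumptions made for illustration):

/*
 * Sketch only: map the gathered, non-contiguous pfn vector into one
 * virtually contiguous PAGE_SIZE window, then copy each 4KB piece into
 * the new anonymous page, special-casing pieces backed by the zero page.
 */
static void copy_into_anon_page(struct page *new_page, unsigned long *pfns,
				unsigned long zero_pfn)
{
	void *dst, *src;
	int i;

	dst = kmap_atomic(new_page, KM_USER0);
	src = kmap_atomic_sg(pfns, PAGE_MMUCOUNT, KM_USER1);
	for (i = 0; i < PAGE_MMUCOUNT; i++) {
		if (pfns[i] == zero_pfn)
			memset(dst + i*MMUPAGE_SIZE, 0, MMUPAGE_SIZE);
		else
			memcpy(dst + i*MMUPAGE_SIZE,
			       src + i*MMUPAGE_SIZE, MMUPAGE_SIZE);
	}
	kunmap_atomic(src, KM_USER1);
	kunmap_atomic(dst, KM_USER0);
}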

2.8 I/O

I/O by and large had relatively simple issues. The i386 PCI DMA API had some address calculations in need of minor substitutions, and the block layer was largely immune to the whole affair apart from direct I/O and SCSI ioctl's using get_user_pages(). An unfortunate limitation exists in that the block layer is incapable of dealing with 512*q->max_sectors < PAGE_SIZE. I didn't produce a fix for this, as it's a somewhat obscure condition that can only occur when particularly crippled devices meet particularly large values of PAGE_SIZE. I feel that it should eventually be handled as part of the implementation.

IDE had a small issue in that its PRD tables were sized in terms of PAGE_SIZE, which it appears to expect not to vary from 4KB. AGP also had an unusual issue involving mapping its aperture. But most drivers that failed simply performed a vmalloc() or ioremap() of an area sized proportionally to PAGE_SIZE during initialization and failed to probe, which was harmless apart from failing to provide functionality (i.e. no data corruption) and very easy to correct. The starfire ethernet adapter fell into this category.

3 Trademarks

This work represents the view of the author and does not necessarily represent the view of IBM.

IBM is a trademark of International Business Machines Corporation.

Linux® is a trademark of Linus Torvalds.

Other company, product or service names may be trademarks or service marks of others.


Ugly Ducklings – Resurrecting unmaintained code

Dave Jones
SuSE Labs

[email protected]

Abstract

Throughout the development of the 2.5 kernel, a number of drivers and pieces of infrastructure that had been left to stagnate finally got a long overdue cleanup. In some cases, code that hadn't been touched for several years got overhauled. Each time another area got the cleanup treatment, patterns started to emerge.

This paper attempts to document some of these patterns so that hopefully, by keeping them in mind, future driver authors don't fall into the pitfalls that some of these cleanups have fixed up, such as over-abstracting and massive duplication. By way of examples, it covers several areas that got cleaned up in the 2.5 series, but focuses on the bulk of the work the paper's author did on the agpgart driver.

1 Introduction

It has been a long-standing philosophy that bending an existing driver to work on a new piece of hardware is much favoured over a new implementation which ends up with 99% the same code as the old. The typical life-cycle of a driver is as follows.

• Driver is written for hardware vendor A's new widget.

• Vendor B makes a compatible widget.

• Id's for Vendor B's product get added to the driver.

• (Repeat for several other vendors/other register-compatible widgets.)

• Slightly different widgets start appearing, which are still mostly compatible. Driver starts to take on a new form where it needs to special-case certain widgets in different code paths.

• Repeat for several more new widgets.

• Driver is now 100K+ of spaghetti.

• Original driver maintainer moves on to a new project, leaving driver in its current state.

• New widget Id's get added.

• Most people are too scared to change too much of the code for fear of subtly breaking support for other widgets.

2 Cleanups overview

Throughout the development of the 2.5 kernel, a number of drivers and pieces of infrastructure that had been left to stagnate finally got a long overdue cleanup. Each time another area got the cleanup treatment, patterns started to emerge.


2.1 The splitting up of multiple instances

The "support all hardware all in the same driver.c file" approach is flawed. If you want to change how vendor B's products work, you shouldn't be touching any code for other vendors' devices. With hundreds or thousands of lines of irrelevant code, it's also a pain to navigate your way around the source. By splitting the driver into multiple vendor.c files, you also start to notice patterns such as "this function is duplicated in all vendor files, so belongs in a generic.c file." Sometimes, however, things go the other way. In 2.4, we have several separate RNG (Random Number Generator) drivers. Jeff Garzik found that by merging all of these into the same file, lots of code duplication got removed. Whilst the number of RNG drivers is quite low, if there comes a day when the driver supports many more, it may make sense to abstract them back out into separate files again.

2.2 Reorganising directory structures

The whole idea of directories is to keep similar things together. One simple cleanup that happened in 2.5 was the introduction of the drivers/char/watchdog directory. Previously, drivers/char contained over 200 .c and .h files. By introducing the watchdog subdirectory, you can instantly find all relevant drivers. This is useful when you have to make changes that affect all watchdog drivers (as was the case in 2.4 when a security bug had been copied through all drivers).

2.3 Simplification of abstraction layers

Sometimes, after introducing support for multiple widgets to a driver, people over-abstract. A prime example of this was the agpgart cache flush routine, which ended up calling through 4–5 function pointers before it actually got to do anything useful.

2.4 Moving to new APIs to decrease LOC

With code lying dormant and unmaintained for several years, it tends to miss the opportunity to take advantage of easier or faster ways of calling kernel-supplied functionality. As helper functions get continually added, the number of lines of code needed to be duplicated in drivers goes down.

3 Case study #1: IA32 CPU setup routines

This started out life as arch/i386/kernel/setup.c, and in 2.4 currently stands at 84KB of code which handles setting up of the CPU in terms of working around errata, enabling CPU-specific features, and doing some detective work such as finding out the cache sizes. Initially supporting Intel CPUs, the clones started to follow. Today it supports dozens of different types of x86 CPU, from 10 different vendors. In 2.5, Patrick Mochel split this file up into per-vendor support files and a few generic files. arch/i386/kernel/cpu/ is a much simpler place to navigate, and is a lot nicer to hack on than its predecessor as a result.

4 Case study #2: IA32 MTRR driver

This monster has been around a few years, and it shows: 71KB of monolithic code, with multiple implementations built on top of each other, each time with the abstraction layer bent and twisted into something new. Initially supporting generic Intel MTRRs, it was bent into shape to deal with the AMD K6's variants. Then Cyrix's ARRs. Then a myriad of other clones which did things slightly differently. Again, chopping this into per-vendor pieces makes things a lot simpler, and reduces the chance of breaking one vendor when fixing something for another (which has been the case in the past on more than one occasion).

5 Case study #3: Bluesmoke

Bluesmoke is the IA32 machine check exception handling support. As usual, it first only supported Intel P5 and P6 CPUs. Over time, things were changed to support AMD processors, the Intel Pentium 4, IDT Winchips, and some additional features such as background checking. This all started to blow up the file size, and it became a pain to find your way around a file with a half dozen similarly named functions. Time for the split-up treatment. 2.5 now has 7 separate C files for the implementations, with a central 'generic' file which calls the specific per-vendor/model implementations.

Whilst it's theoretically possible that you could hack the Makefiles now to only build in (for example) the Intel code if you don't own any non-Intel parts, the added complexity and reduction in functionality for the net gain of just a few KB of object wasn't deemed worth it. A bigger challenge which would benefit from this change came in the form of the final case study.

6 Case study #4: AGPGART

6.1 History of AGPGART

AGP support was added to Linux back in 1999. Subsequent updates were somewhat infrequent. The bulk of the code never really changed much. Each update just added PCI IDs of new devices, or occasionally a new agpgart implementation when things were just too different from the existing agpgarts.

6.2 How I got involved

During 2002, I was asked by SuSE to implement AGP support for the AMD x86-64. Thinking this would be easy, basing assumptions on what I'd previously seen happen to agpgart (thinking it would be just adding some new PCI idents or the like), I (foolishly?) agreed to do it. Shortly afterwards, I discovered the GART I was writing support for was unlike anything Linux currently supported. Firstly, the north bridge was on-CPU, which meant on an SMP box there would be more than one of them, and they would have to be kept coherent with each other. Secondly, it was the first GART to support version 3.0 of the AGP standard. Whilst this is backwards compatible for the most part, there are some additional features that need to be taken care of (such as the transfer speed selector working in completely different ways to how it did in previous versions of the standard). This was quite a lot to take on board, so I started staring at the 134KB of agpgart back-end code (there's also 25KB of front-end code).

6.3 More problems...

Getting up to speed on a driver of this size, which supports over 50 different AGP chipsets, is not a task that happens overnight. Lots of those implementations are either the same or very similar, but it still leaves around a dozen or so separate code paths. Now to find out which one is most similar to the GART I'm writing for. I eventually gave up trying to find one similar enough, and just started from scratch. My mails for "help" to the original maintainer of agpgart went to /dev/null, which meant I had to figure out how a lot of it worked the hard way. After finally getting things working, I had decided that enough was enough, and for 2.5 I was going to give this code a major overhaul.


6.4 How things were cleaned up

• As usual, first things first: split drivers/char/agp/agpgart_be.c (134KB) into lots of smaller source files, one per chipset vendor. This was an instant cleanup, which had no problems being merged. Shortly afterwards, Greg Kroah-Hartman converted the chipset probing routines over partially to some of the 'new' PCI API, killing off a bunch more useless, ugly code.

• With everything in per-vendor files now, things were a lot cleaner, but there was still some real bad mess that needed cleaning. The agpgart_be.c file still existed, which acted as a generic part which had all the bits to call the routines in the per-vendor files. One particular ugly that stuck out was the 350-line struct that matched known PCI IDs to init routines. The redundancy in this struct was really bad.

static struct {
	unsigned short device_id;
	unsigned short vendor_id;
	enum chipset_type chipset;
	const char *vendor_name;
	const char *chipset_name;
	int (*chipset_setup) (struct pci_dev *pdev);
} agp_bridge_info[] __initdata = {
#ifdef CONFIG_AGP_ALI
	{ PCI_DEVICE_ID_AL_M1541_0,
	  PCI_VENDOR_ID_AL,
	  ALI_M1541,
	  "ALi",
	  "M1541",
	  ali_generic_setup },

	... (Continue for dozens more entries) ...

With this wasteful struct, if 20 out of those 50 entries are for Intel GARTs, we duplicate the vendor ID, vendor name string, and in a lot of cases the setup routine too. This was cleaned up in several steps.

– Split the structs out from agpgart_be.c to $vendor.h (i.e., move all the ALi entries to ali.h, AMD entries to amd.h, etc.)

– Remove all duplication from each of these structs.

– Replace the duplication with a 'header struct' containing the vendor ID, vendor name string, and a pointer to the remaining data.

– Replace the struct in agpgart_be.c with a struct that points to the various split-out structures in the $vendor.h files (a sketch of the resulting layout follows this list).
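
The end result is shaped roughly as follows (the struct and array names here are assumptions made for illustration, not copied from the actual agpgart patches):

/* per-vendor entries, e.g. in ali.h: only the chipset-specific data */
struct agp_device_ids {
	unsigned short device_id;
	enum chipset_type chipset;
	const char *chipset_name;
	int (*chipset_setup) (struct pci_dev *pdev);
};

static struct agp_device_ids ali_agp_device_ids[] __initdata = {
	{ PCI_DEVICE_ID_AL_M1541_0, ALI_M1541, "M1541", ali_generic_setup },
	/* ... */
};

/* in agpgart_be.c: one 'header' entry per vendor */
static struct {
	unsigned short vendor_id;
	const char *vendor_name;
	struct agp_device_ids *ids;
} agp_bridge_info[] __initdata = {
	{ PCI_VENDOR_ID_AL, "ALi", ali_agp_device_ids },
	/* ... */
};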

6.5 The “new” PCI API

Somewhat pleased with myself, I mailed off the changes to Linus, who told me to start again, this time using the pci_driver functionality. As GregKH had done part of the work here already, it wasn't actually that much work to bend what I had already into shape. This did, however, bring about a big change over 2.4's agpgart. With each of the per-chipset drivers now containing a pci_driver struct, they worked independently of the agpgart core as stand-alone modules. I wasn't initially happy with this, but Linus liked it, so it stayed that way. It did, however, mean a rewrite of module locking was needed, which nicely coincided with Rusty Russell rewriting how module locking worked.
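
A hedged sketch of the shape each per-chipset module took (the table, callback, and module names below are illustrative; only the use of a pci_driver per chipset module is taken from the text):

/*
 * Sketch only: each chipset module registers its own pci_driver and can
 * therefore be loaded independently of the agpgart core.
 */
static int agp_ali_probe(struct pci_dev *pdev, const struct pci_device_id *ent);
static void agp_ali_remove(struct pci_dev *pdev);

static struct pci_device_id agp_ali_pci_table[] = {
	/* ALi PCI IDs ... */
	{ }
};

static struct pci_driver agp_ali_pci_driver = {
	.name		= "agpgart-ali",
	.id_table	= agp_ali_pci_table,
	.probe		= agp_ali_probe,
	.remove		= agp_ali_remove,
};

static int __init agp_ali_init(void)
{
	return pci_module_init(&agp_ali_pci_driver);
}

static void __exit agp_ali_exit(void)
{
	pci_unregister_driver(&agp_ali_pci_driver);
}

module_init(agp_ali_init);
module_exit(agp_ali_exit);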

6.6 Maintainership

By this point, I had completely gutted the way the agpgart backends worked. I felt I had made significant enough change to adopt the code, and make an entry for myself in the MAINTAINERS file. Which was probably my second biggest mistake so far. Within just a few days of doing so, my mailbox was flooded with bug reports, stagnant patches, thank-you's, and insults. One thing that I hadn't anticipated was just how far-reaching this code was. Not only did I now have to follow and understand what was going on in the agpgart code, but I also found myself digging further into DRI to follow its interaction with AGPGART. Subsequently, even parts of XFree86 came under scrutiny, and even FreeBSD (which interestingly did the 'separate-file-per-vendor' thing from Day One), to see just how much I could or couldn't change without breaking things too much from a userspace point of view.

6.7 Taking AGPGART forward: AGP 3.0 support

After getting on top of the various patches, and fixing the various problems the new code brought about, AGPGART had been dragged kicking and screaming into something that resembled a modern driver. Well, almost. I then moved on to start tackling the next big thing for agpgart: generic AGP 3.0 support. Matthew Tolentino from Intel had come up with a patch for Intel's AGP 3.0 chipset (the I7505), and had re-implemented a bunch of code that I had written for the x86-64. After factoring out the common parts, this got to a state where things looked just fine.

6.8 Return of the previous maintainer

Just when things were beginning to go quiet (apart from additional AGP 3.0 GARTs turning up needing implementing), Jeff Hartmann, the original maintainer of the 2.4 AGPGART, reappeared with a 130KB patch against the original 2.4 code. It offered various functionality, supporting AGP 3.0, and cleaning up a lot of code in the process. In a lot of other ways, however, it was a huge step backwards. Splitting Jeff's huge patch into smaller pieces was a massive job. Bits of it went in, and Linus rejected a bunch of them, but there was worse to come (more diffs). At the time of writing, Jeff's outstanding diffs vs. 2.5.59 are around 380KB. A lot of this is unlikely to be merged before 2.6 without considerable rewriting.

6.9 Useless abstractions

Furthering the cleanup mantra: agpgart code has been described in many ways by many people (including as shit by Linus himself). Pre-cleanup, however, my pet name for this monster was "abstraction hell." As an excellent HOWNOTTO in abstraction, here's how agpgart used to flush the cache (a rough reconstruction in code follows the list).

• At strategic parts of the code there are CACHE_FLUSH(); calls.

• CACHE_FLUSH turns out to be a macro which expands to agp_bridge.cache_flush.

• agp_bridge.cache_flush, in 99% of cases, points to global_cache_flush. The remaining case could have been special-cased in global_cache_flush.

• On SMP, global_cache_flush is a define for smp_flush_cache. On UP, it's a define for flush_cache.

• smp_flush_cache just does an smp_call_function on flush_cache.

• Finally, flush_cache does a "wbinvd" on IA32/X86_64, "mb" on IA64, or #errors on anything else.
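
Reconstructed roughly in code (the macro and function bodies below are inferred from the description above, not copied from the old driver):

static void flush_cache(void)
{
	asm volatile ("wbinvd" : : : "memory");	/* "mb" on IA64 */
}

static void ipi_flush_cache(void *ignored)
{
	flush_cache();
}

static void smp_flush_cache(void)
{
	/* run flush_cache on the other cpus, then locally */
	smp_call_function(ipi_flush_cache, NULL, 1, 1);
	flush_cache();
}

#ifdef CONFIG_SMP
#define global_cache_flush	smp_flush_cache
#else
#define global_cache_flush	flush_cache
#endif

/* agp_bridge.cache_flush is set to global_cache_flush almost everywhere */
#define CACHE_FLUSH()		agp_bridge.cache_flush()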


7 Future directions

7.1 AGPGART

There is still a lot of work to be done on AGPGART. All the work so far concentrated on the back end (which is where all the chipset magic happens).

• The front-end of the driver (ioctl interface, etc.) is almost as crufty, and needs a lot of work to rid it of silly things like open-coded list handling routines instead of using the generic list.h routines. (Yet more proof that duplicating functionality is a bad thing: it gets its double-linked list implementation horribly wrong.)

• More work on making the AGP 3.0 support transparent.

• Inevitably more support for additional chipsets.

• Multiple AGP bridge support.

• sysfs migration to get away from the horrible ioctl interface. This will unfortunately make the Linux AGPGART completely incompatible with the FreeBSD implementation. The only people this causes concern for are XFree86 developers, who have to support an additional interface.

7.2 Other kernel work

• APIC drivers. The IA32 APIC code is quite horrible, and quite fragile. It supports a lot of different types of setup, from lots of different generic PCs, to the weird and wonderful bigger machines like NUMA-Q, Summit, and more. The x86 sub-architecture support cleaned up some of this by introducing the possibility for each sub-arch to implement their own APIC code, but it hasn't really improved readability or maintainability of the APIC code to any great length.

• Watchdogs. Small-scale cleanup occurred already in 2.5, which was to just group all the watchdog drivers from drivers/char into a new subdirectory called, imaginatively, watchdog/. A lot of these drivers are duplicating lots of code, sometimes subtly differently, when they should be using the same code. For 2.7 a nice cleanup would be to abstract out the generic parts of this to a layer above the watchdog drivers, in a similar way to what happened with AGPGART. In 2.4 there was a security hole which meant every single watchdog driver needed to be audited and fixed. By moving all this functionality out of the drivers, this could have been fixed in a single place.

8 Summary

• Split out multiple implementations to their own files unless they are small and/or similar enough to the existing implementation.

• Don't re-implement code unnecessarily, even if you think you may need something extra that the generic code doesn't give you. Build on top of the generic code rather than re-implementing.

• Use modern interfaces where possible. This isn't always easy if you want your driver to compile on earlier kernel versions as well (especially true for out-of-tree drivers).

• Before abstracting something out, think about why you actually need it abstracted. What will the callers of the abstraction do in the common case?


• Directories are there to keep similar things together. Use them. (Obviously, only when they make sense; a directory for 2–3 drivers is perhaps going too far.)

• Don't disappear for four years and reappear with a 380KB patch against the last code you maintained. You may find that a lot has changed whilst you were gone, and merging will be a nightmare, especially if you didn't keep individual per-change changesets.


udev – A Userspace Implementation of devfs

Greg Kroah-Hartman∗

IBM Corp.
Linux Technology Center

[email protected], [email protected]

Abstract

Starting with the 2.5 kernel, all physical and virtual devices in a system are visible to userspace in a hierarchical fashion through sysfs. /sbin/hotplug provides a notification to userspace when any device is added to or removed from the system. Using these two features, a userspace implementation of a dynamic /dev is now possible that can provide a very flexible device naming policy.

This paper will discuss udev, a program that replaces the functionality of devfs (only providing /dev entries for devices that are in the system at any moment in time), and allows for features that were previously not able to be done through devfs alone, such as:

• Persistent naming for devices when they move around the device tree.

• Notification of external systems of device changes.

• A flexible device naming scheme.

• Allow the kernel to use dynamic major and minor numbers.

• Move all naming policy out of the kernel.

∗This work represents the view of the author and does not necessarily represent the view of IBM.

This paper will describe why such a userspace program is superior to a kernel-based devfs, and detail the design decisions that went into its creation. The paper will also describe how udev works, how to write plugins that extend the functionality of it (different naming schemes, etc.), and different trade-offs that were made in order to provide a working system.

1 Introduction

The /dev directory on a Linux machine is where all of the device files for the system should be located.[2] A device file is how a user program can access a specific hardware device or function. For example, the device file /dev/hda is traditionally used to represent the first IDE drive in the system. The name hda corresponds to both a major and a minor number, which are used by the kernel to determine what hardware device to talk to. Currently a very wide range of names that match up to different major and minor numbers have been defined.

All major and minor numbers are assigned a name that matches up with a type of device. This allocation is done by The Linux Assigned Names And Numbers Authority (LANANA)[4], and the current device list can always be found on their web site at http://www.lanana.org/docs/device-list/devices.txt


As Linux gains support for new kinds of devices, they need to be assigned a major and minor number range in order for the user to be able to access them through the /dev directory (one alternative to this is to provide access through a filesystem [3]). In kernel versions 2.4 and earlier, the valid range of major numbers was 1–255, and of minor numbers 0–255. Because of this limited range, a freeze was placed on allocating new major and minor numbers during the 2.3 development cycle. This freeze has since been lifted, and the 2.6 kernel should see an increase in the range of major and minor numbers available for use.

2 Problems with current scheme

2.1 What /dev entry is which device

When the kernel finds a new piece of hardware, it typically assigns the next major/minor pair for that kind of hardware to the device. So, on boot, the first USB printer found would be assigned the major number 180 and minor number 0, which is referenced in /dev as /dev/usb/lp0. The second USB printer would be assigned major number 180 and minor number 1, which is referenced in /dev as /dev/usb/lp1. If the user rearranges the USB topology, perhaps adding a USB hub in order to support more USB devices in the system, the USB probing order of the printers might change the next time the computer is booted, reversing the assignment of the different minor numbers to the two printers.

This same situation holds true for almost any kind of device that can be removed or added while the computer is powered up. With the advent of PCI hotplug enabled systems, and hot-pluggable busses like IEEE 1394, USB, and CardBus, almost all devices have this problem.

With the advent of the sysfs filesystem in the 2.5 kernel, the problem of determining which device minor is assigned to which physical device is now much easier to solve. For a system with two different USB printers plugged into it, the sysfs /sys/class/usb directory tree can look like Figure 1. Within the individual USB device directories pointed to by the lp0/device and lp1/device symbolic links, a lot of USB-specific information can be determined, such as the manufacturer of the device, and the (hopefully unique) serial number.

As can be seen by the serial files in Figure 1, the /dev/usb/lp0 device file is associated with the USB printer with serial number HXOLL0012202323480, and the /dev/usb/lp1 device file is associated with the USB printer with serial number W09090207101241330.

If these printers are moved around, by placing them both behind a USB hub, they might get renamed, as they are probed in a different order on startup.

In Figure 2, /dev/usb/lp0 is assigned to the USB printer with the serial number W09090207101241330 due to this different probing order.

sysfs now enables a user to determine which device has been assigned by the kernel to which device file. This is a very powerful association that has not previously been easily available. However, a user generally does not care that /dev/usb/lp0 and /dev/usb/lp1 are now reversed and should be changed in some configuration file somewhere; they just want to always be able to print to the proper printer, no matter where it is in the USB device tree.


/sys/class/usb/
|-- lp0
|   |-- dev
|   |-- device -> ../../../devices/pci0/00:09.0/usb1/1-1/1-1:0
|   `-- driver -> ../../../bus/usb/drivers/usblp
`-- lp1
    |-- dev
    |-- device -> ../../../devices/pci0/00:0d.0/usb3/3-1/3-1:0
    `-- driver -> ../../../bus/usb/drivers/usblp

$ cat /sys/class/usb/lp0/device/serial
HXOLL0012202323480
$ cat /sys/class/usb/lp1/device/serial
W09090207101241330

Figure 1: Two USB printers plugged into different USB busses

$ tree /sys/class/usb/
/sys/class/usb/
|-- lp0
|   |-- dev
|   |-- device -> ../../../devices/pci0/00:09.0/usb1/1-1/1-1.1/1-1.1:0
|   `-- driver -> ../../../bus/usb/drivers/usblp
`-- lp1
    |-- dev
    |-- device -> ../../../devices/pci0/00:09.0/usb1/1-1/1-1.4/1-1.4:0
    `-- driver -> ../../../bus/usb/drivers/usblp

$ cat /sys/class/usb/lp0/device/serial
W09090207101241330
$ cat /sys/class/usb/lp1/device/serial
HXOLL0012202323480

Figure 2: Same USB printers plugged into a USB hub

2.2 Not enough numbers

The current range of allowed major and minor numbers is 8 bits (0–255). Currently there are very few major numbers left for new character devices, and about half the number of major numbers available for block devices (block and character devices can use the same numbers, but the kernel treats them separately, giving the whole range to both types of devices).

This seems like a lot of free numbers, but there are users who want to use a very large number of disks all at the same time, for which the 8-bit scheme is too small. A common goal of some companies is to connect about 4,000 disks to a single system. For every disk, the kernel reserves 16 minor numbers, due to the possibility of there being up to 16 partitions on every disk. So 4,000 disks would require 64,000 different device files, needing at least 250 major numbers to handle all of them. This would work in the 8-bit scheme, as they all would fit, but the majority of the current major numbers are already reserved. The disk subsystem can not just steal major numbers from other subsystems very easily. And if it did so, userspace would have to know which previously reserved major numbers are now being used by the SCSI disk subsystem.

Because of this limitation, a lot of people are pushing for an increase in the size of the major and minor number range, which would be required to be able to support more than 4,000 disks (some users talk of connecting 10,000 disks at once). It looks like this work will go into the 2.6 kernel; however, the large problem remains of how to notify userspace which major and minor numbers are being used.

Even if the major number range is increased, the requirement of reserving major number ranges for different types of subsystems is still present. It still requires an external naming authority, and possibly we could run out of ranges sometime in the far future. If the kernel were to switch to a dynamic method of allocating major and minor numbers, for only the devices that were currently connected to the system, this authority would no longer be needed, and the range of numbers would never run out. The biggest problem with dynamic allocation is, again, that userspace has no idea which devices are assigned to which major and minor numbers.

A few kernel subsystems currently do allocate minor numbers dynamically. The USB-to-serial subsystem has been doing this since the 2.2 kernel series, with great success. The biggest problem still is that the user does not know what device is assigned to what number, and has to rely on looking in a kernel log to make that determination. With sysfs in the 2.5 kernel, this is much easier.

2.3 /dev is too big

Not all device files in the /dev directory of most distributions match up to a physical device that is currently connected to the computer at that time. Instead, the /dev directory is created when the operating system is initialized on the machine, populating the /dev directory with all known possible names. On a machine running Red Hat 9, the /dev directory holds over 18 thousand different entries. This large number of different entries soon becomes very unwieldy for users trying to determine exactly what devices are currently present.

Because of the large numbers of device files in the /dev directory, a number of operating systems have moved over to having the kernel itself manage the /dev directory, as the kernel always knows exactly what devices are present in the system. It does this by creating a ram-based filesystem called devfs. Linux also has this option, and it has become popular over time with a number of different distributions (the Gentoo distribution being one of the more notable ones).

2.4 devfs

A number of other Unix-like operating systems have solved a lot of the previously mentioned problems by using a kernel-based devfs filesystem. Linux also has a devfs filesystem, and for a number of people this solves their immediate needs. However, the Linux-based devfs implementation still has a number of problems unsolved.

devfs does only show exactly what devices are currently in the system at any point in time, solving the "/dev is too big" issue. However, the names used by devfs are not the names that the LANANA authority has issued. Because of this, switching between a devfs system and a static /dev system is a bit difficult, due to the number of different configuration files that need to be modified. The devfs authors have tried to address this problem and have provided some compatibility layers to emulate the /dev names.

Even with devfs running in compatibility mode, the Linux kernel is imposing a set naming policy on userspace. It is saying that the first IDE drive is going to be called /dev/hda or /dev/ide/hd/c0b0t0u0, and there is nothing that a user can do about this. Generally, the Linux kernel developers do not like forcing any policies on userspace when it can be helped. This naming policy should be moved out of the kernel, so that kernel driver developers can focus on things other than naming arguments (of which the devfs naming arguments consumed many man-years of time). In short, the kernel should not care what a user wants to call a device, but if devfs is used, this is not possible.

devfs also does not allow devices to be bound to major and minor numbers dynamically. The current devfs implementation still uses the same major and minor numbers that are assigned by LANANA. devfs can be modified to do dynamic allocation; however, no one has done so yet.

devfs also forces all of the device names and the naming database into kernel memory. Kernel memory can not be swapped out, and is always resident. For very large numbers of devices (like the previously mentioned 4,000 disks), the overhead of keeping all of the device names in kernel memory is not insubstantial. During some testing of a wider major number range, one developer ran into memory starvation issues on a 32-bit Intel processor, just with a static /dev system. Add the overhead of 4,000 different disk names and structures to manage those names, and even less memory would be available for user programs to use.

3 udev's goals

So, in light of all of the previously mentioned problems, the udev project was started. Its goals are the following:

• Run in userspace.

• Create a dynamic /dev.

• Provide consistent device naming, if wanted.

• Provide a userspace API to access info about current system devices.

The first item, "run in userspace," is easily done by harnessing the fact that /sbin/hotplug generates an event for every device that is added to or removed from the system, combined with the ability of sysfs to show all needed information about all devices.

The rest of the goals enable the udev project to be split into three separate subsystems:

1. namedev – handles all device naming.

2. libsysfs – a standard library for accessing device information on the system.

3. udev – a dynamic replacement for /dev.

3.1 namedev

Due to the need for different naming schemes for devices, the device naming portion of udev has been moved into its own subsystem. This was done to move the naming policy decision out of the udev binary, allowing pluggable naming schemes to be developed by different groups. This device naming subsystem, namedev, presents a standard interface that udev can call to name a specific device.


With the initial releases of udev, the namedev logic is still provided in a few source files that get linked into the udev binary. There is currently only one naming scheme implemented, the one specified by LANANA[4]. This scheme is quite simple, as generally the sysfs representation of the device uses the same name, and it will be suitable for the majority of current Linux users.

As the current kernel devfs provides a naming scheme competing with LANANA's, there has been some interest in providing a module that implements it, but this is currently unavailable due to lack of interest by the primary developers.

Part of the goal for the udev project is to provide a way for users to name devices based on a set of policies. The current version of namedev provides the user with a five-step sequence for determining the name of a given device. These steps are consulted in order, and if the device's name can be determined at any step, that name is used. The existing steps are as follows:

1. label or serial number

2. bus device number

3. topology on bus

4. replace name

5. kernel name

In the first step, the device that is added to the system is checked to see if it has a unique identifier, based on that type of device. For example, on USB devices, the USB serial number is checked; for SCSI devices, the UUID is checked; for block devices, the filesystem label is checked. If this matches an identifier provided by the user (in a configuration file), the resulting name (again specified in the configuration file) is used.

The second step checks the device's bus number. For a lot of busses, this number generally does not change over time, and all bus numbers are guaranteed to be unique at any one point in time in the system. A good example of this is PCI bus numbers, which rarely change on the majority of systems (however, BIOS upgrades, or hotplug PCI controllers, can renumber the PCI bus the next time the machine is booted). Again, if the bus number matches an identifier provided by the user, the resulting name is assigned to the device.

The third step checks the position of the device on the bus. For example, a USB device can be described as residing in the 3rd hub port of the hub plugged into the first port on the root hub. This topology will not change unless the user physically moves the devices around, and is independent of any bus numbering changes that might occur between reboots of a machine. If the topology position on the bus matches the position provided by the user, the requested name is assigned to the device.

The fourth step is a simple string replacement. If the kernel name for a device matches the name specified here, the requested new name will be used in its place. This is useful for devices that users know will always have the same kernel name, but wish to name something different.

The fifth step is the catch-all step. If none of the previous steps have provided a name for this device, the default kernel name will be used for this device. For the majority of devices in a system, this is the rule that will be used, as it matches the way devices are named on a Linux system without devfs or udev.

Figure 3 shows an example namedev configuration file. This configuration file shows how the four different ways of overriding the default kernel naming scheme can be used.


# USB Epson printer to be called lp_epson
LABEL, BUS="usb", serial="HXOLL0012202323480", NAME="lp_epson"

# USB HP printer to be called lp_hp
LABEL, BUS="usb", serial="W09090207101241330", NAME="lp_hp"

# sound card with PCI bus id 00:0b.0 to be the first sound card
NUMBER, BUS="pci", id="00:0b.0", NAME="dsp"

# sound card with PCI bus id 00:07.1 to be the second sound card
NUMBER, BUS="pci", id="00:07.1", NAME="dsp1"

# USB mouse plugged into the third port of the first hub to be
# called mouse0
TOPOLOGY, BUS="usb", place="1.3", NAME="mouse0"

# USB tablet plugged into the second port of the second hub to be
# called mouse1
TOPOLOGY, BUS="usb", place="2.2", NAME="mouse1"

# ttyUSB1 should always be called visor
REPLACE, KERNEL="ttyUSB1", NAME="visor"

Figure 3: Example namedev configuration file

The first two entries show how to specify a serial number of a device to control what that device should be named. The third and fourth entries show how to override the bus probing order and name a device based on the specific bus id. The fifth and sixth entries show how the USB topology can be used to specify a device name, and the seventh entry shows how to do a simple name substitution.

3.2 libsysfs

There is a need for a common API to access device information in sysfs by a number of varied programs, not just the udev project. The device naming subsystem and the udev subsystem need to query a wide range of device information from a sysfs-represented device. Instead of duplicating this logic around in different projects, splitting this logic of sysfs calls into a separate library that will sit on top of sysfs makes more sense. sysfs representations of different devices are not standard (PCI devices have different attributes from USB devices, etc.), so this is another reason for creating a common and standard library interface for querying device information.

Right now the current udev codebase is using an initial version of libsysfs, and the libsysfs codebase is under active development.

3.3 udev

The udev program will be responsible for talking to both the namedev and libsysfs libraries to accomplish the device naming policy that has been specified. The udev program is run whenever /sbin/hotplug is called by the kernel. It does this by adding a symlink to itself in the /etc/hotplug.d/default directory, which is searched by the /sbin/hotplug multiplexer script.

The /sbin/hotplug invocation by the kernel exports a lot of device-specific information on what action just happened (add or remove), what device type the action took place for (USB, PCI, etc.), and what device in the sysfs tree did the action. udev takes this information and calls namedev to determine the name it should give for this device (or the name that has already been given to this device, if it is a remove event). If this is a new device that has been added, udev uses libsysfs to determine the major and minor number that should be used for the device file for this device, and then creates the device file in the /dev directory with the proper name and major/minor number. If this is a device that has been removed, then the device file in the /dev directory that had previously been created for this device will be removed.
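
As a rough illustration of how that information reaches the helper (the handling below is a sketch, not udev's actual code; ACTION and DEVPATH are the standard environment variables exported by the kernel's /sbin/hotplug invocation, and the subsystem arrives as the first argument):

/* Sketch only: pick up the hotplug event data from argv and the
 * environment, as any /etc/hotplug.d/ agent would. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	const char *subsystem = argc > 1 ? argv[1] : "(none)";
	const char *action = getenv("ACTION");		/* "add" or "remove" */
	const char *devpath = getenv("DEVPATH");	/* directory under /sys */

	if (!action || !devpath)
		return 1;	/* not invoked from /sbin/hotplug */

	printf("%s event for %s device at /sys%s\n",
	       action, subsystem, devpath);
	return 0;
}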

4 Enhancements

There are a number of different enhancements that users have asked for which can be added to the existing udev implementation.

A lot of userspace programs want to be notified when a new device has been added to or removed from the system. GNOME and KDE both want to add a new icon if a disk has been added, or possibly launch a sync program if a USB Palm device has been attached. The D-BUS project [1] has been created to help provide a simple way for applications to talk to one another using messages. It has been proposed that the udev program emit a D-BUS message after it has created or removed a device file, so that any listening applications can act upon this event.

Currently, namedev uses a very simple configuration file, creating a simple RAM-based database that it uses to store all current device information and device naming rules. It has been proposed that this database (if it can even really be called such a thing) be moved to a real, backing-store type database, in order to store a persistent view of the system, or to provide a more complex naming scheme for udev to use.

5 Thanks

The author would like to thank Daniel Stekloff of IBM, who has helped shape the design of udev in many ways. Without his perseverance, udev might not even be working. He also provided the initial design documents for how udev could be split up into different pieces, allowing pluggable naming schemes, and has been instrumental in the development of libsysfs. Also, without Pat Mochel's sysfs and driver model core, udev would not even have been possible to implement. The author is indebted to him for undertaking what most thought was an impossible task, and for allowing others to easily build on his common framework, allowing all users to see the "web woven by a spider on drugs" [5] that the kernel keeps track of.

6 Legal Statement

IBM is a registered trademark of International Business Machines in the United States and/or other countries.

Linux is a registered trademark of Linus Torvalds.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Intel is a registered trademark of Intel Corporation in the United States and/or other countries.

Other company, product, and service names may be trademarks or service marks of others.


References

[1] D-BUS project. http://www.freedesktop.org/software/dbus/

[2] Linux Filesystem Hierarchy Standard. http://www.pathname.com/fhs/2.2/

[3] Greg Kroah-Hartman. Putting a filesystem into a device driver. In Linux.conf.au, Perth, Australia, January 2003.

[4] The Linux Assigned Names And Numbers Authority. http://www.lanana.org/

[5] Linux Weekly News. http://lwn.net/Articles/31185/


Reliable NAS from Dirt Cheap Commodity Hardware

Benjamin C.R. LaHaise
[email protected]

Abstract

To many of us, the integrity of our data is paramount. Unfortunately, the most cost effective commodity hardware available tends to omit important features like ECC memory, or only provides them at a substantial cost premium. To address this, an approach is used that makes use of concepts taken from NAS and clustering, combined with a novel trick to obtain a CRC on data blocks from existing ethernet NICs. The resulting code is built on top of the existing Linux Network Block Device, network drivers and async IO. Details ranging from design considerations to performance tuning and implementation provide an interesting story on how to attain a much easier, and cheaper, reliable storage solution.

1 Introduction

Anyone who uses computers can tell you that they will inevitably break. From software bugs, to hardware errata, or just plain old age, the components of systems will always fail at some point in their deployed lifespan. Hardware vendors try to address various subsets of these failures, but frequently only in high end systems. Even in the case of hardware specifically designed to cope with a common mode of failure, RAID controllers are lost when a failed SCSI device locks the bus.

Over the last few years, low end hardware has come to provide capabilities well beyond the needs of many server systems. Relatively cheap IDE drives cost less than a third the amount of a comparable SCSI device. As an end user, that difference in the price performance ratio raises more than a few eyebrows. Several companies sell products into this niche, yet there are a number of failure modes which their devices fail to address. Data corruption is still possible when disks make use of non-error-correcting memory, or when firmware bugs exist (what body of software is bug-free?). So what is a user that wishes to make use of the price point of commodity hardware to do if their data is truly valuable?

Take one kernel hacker, random bits of hardware, several discussions over beer at conferences, this problem, and mix. The result is netmd.o.

2 What is Netmd?

Netmd at its core is a hybrid between the Linux RAID implementation (known as md) and the network block device (known as nbd), and is quite similar to DRBD. DRBD allows users to mirror disks over a network, yet it runs under the assumption that the underlying hardware is reliable and will detect any errors which corrupt user data. Netmd changes this by making use of an end to end CRC on sectors to check if any of the back ends (BEs) have unknowingly corrupted read or write data. All told, netmd is much easier to use than either drbd or nbd while providing guarantees about data integrity.

3 Design Criteria

The front end system runs the netmd kernel module. To provide the greatest likelihood of the software being correct, simplicity is a strong goal for the resulting code.

suffered from a simple single bit error, or more gross errors resulting in so-called fractured blocks (i.e., the case where the sector number itself is corrupt and the wrong data is read back from disk). The basic assumption netmd makes is that your front end hardware is reliable and the bulk of errors will come from the back end systems which house all of the disks. A front end system running NetMD looks very much like an NBD, but provides reliability guarantees.

The simplest configuration for NetMD (see Figure 3) consists of a front end system and three back end systems networked via a switch or hub. A minimum of three back ends are required for NetMD to establish quorum and provide the ability to hot swap one back end at a time. Additional BEs can be added to improve the performance and reliability of a NetMD setup (such as in Figure 3).

Unlike nbd, which operates over TCP, NetMD instead operates over UDP to take advantage of the packet nature of the underlying network. This allows writes to be optimized using multicast to deliver data to all BEs simultaneously. BEs may also snoop read requests to rapidly reply if the data is still available in their local caches.
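A hedged sketch of that multicast write path is shown below: the front end sends a single UDP datagram to a multicast group so every back end sees the write at once. The group address, port, and payload layout are invented for illustration and are not netmd's actual wire protocol.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Send one write request to all back ends via a multicast group. */
    static int send_write_request(const void *sector, size_t len)
    {
        struct sockaddr_in group;
        ssize_t n;
        int s = socket(AF_INET, SOCK_DGRAM, 0);

        if (s < 0)
            return -1;

        memset(&group, 0, sizeof(group));
        group.sin_family = AF_INET;
        group.sin_port   = htons(9999);                    /* hypothetical port  */
        inet_pton(AF_INET, "239.1.2.3", &group.sin_addr);  /* hypothetical group */

        n = sendto(s, sector, len, 0,
                   (struct sockaddr *)&group, sizeof(group));
        close(s);
        return n == (ssize_t)len ? 0 : -1;
    }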

In fact, the main data integrity feature of NetMD comes from the ethernet packets which are used to transmit read and write requests over the network. All ethernet frames are protected from transmission errors by a trailing 32 bit CRC. Many ethernet cards are able to record the CRC of an incoming packet. By compensating for the header of each packet, the data CRC can be extracted at low CPU cost.
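For reference, the checksum being exploited is the standard IEEE 802.3 CRC-32. The routine below is a plain software implementation of that polynomial over a buffer; it only shows what value the NIC-assisted trick recovers cheaply, not how netmd extracts it from the frame.

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected form). */
    static uint32_t crc32_ieee(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t crc = 0xFFFFFFFFu;
        int i;

        while (len--) {
            crc ^= *p++;
            for (i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
        }
        return ~crc;
    }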


Improving the Linux Test Project with Kernel Code Coverage Analysis

Paul Larson
IBM Linux Technology Center

Nigel Hinds, Rajan Ravindran, Hubertus Franke
IBM Thomas J. Watson Research Center

{plars, nhinds, rajancr, frankeh}@us.ibm.com

Abstract

Coverage analysis measures how much of the target code is run during a test and is a useful mechanism for evaluating the effectiveness of a system test or benchmark. In order to improve the quality of the Linux® kernel, we are utilizing GCOV, a test coverage program which is part of GNU CC, to show how much of the kernel code is being exercised by test suites such as the Linux Test Project. This paper will discuss the issues that make code coverage analysis a complicated task and how those issues are being addressed. We will describe tools that have been developed to facilitate analysis and how these tools can be utilized for kernel, as well as application, code coverage analysis.

1 Introduction

The Linux Test Project (LTP) is a test suite for testing the Linux kernel. Over the years, the LTP has become a focal point for Linux testing and Linux test tools. Just as the Linux kernel is constantly under development, the LTP test suite must also be constantly updated, modified, and improved upon to keep up with the changes occurring in the kernel. As the Linux Test Project test suite matures, its developers constantly look for ways to improve its usefulness. Some of the considerations currently being addressed are what areas of the Linux kernel test development efforts should focus on, and proving that tests added to the LTP are testing more than what was there before. In order to gather data to support decisions for these questions, analysis should be performed to show what areas of the kernel are being executed by any given test.

The design of a process for doing this analysis took the following features into consideration:

1. Use existing tools as much as possible

2. Take a snapshot of coverage data at any time

3. Clear coverage counters before a test execution so just the coverage from the test could be isolated

4. Provide output that looks nice and is easy to read and compare

5. Show coverage percentages at any level of directory or file


6. Show execution counts for every instrumented line of every file

This paper is outlined as follows: Section 2 provides a general description of code coverage analysis. Sections 3 and 4 discuss GCOV and other code coverage tools. Section 5 describes Linux kernel modifications necessary to provide kernel code coverage. Test results are presented in Section 6. Section 7 presents the LCOV toolkit for processing and displaying GCOV results. Section 8 describes how the results of kernel code coverage analysis can be used to improve the Linux Test Project. Future work planned on this project is discussed in Section 9. Finally, conclusions are presented in Section 10.

2 How code coverage analysis works

Before examining possible methods for the task of kernel code coverage analysis, it is important to first discuss some general concepts of code coverage analysis.

Code coverage analysis shows what percentage of an application has been executed by the test process. The metrics derived from code coverage analysis can be used to measure the effectiveness of the test process [Perry].

Statement coverage analysis [Cornett] breaks the code down into basic blocks that exist between branches. By design, it is clear that if any line in a basic block is executed a given number of times, then every instruction within the basic block would have been executed the same number of times. This provides a convenient way of showing how many times each line of code in a program has been executed.

There are, however, a number of deficiencies in this approach. The following piece of code exhibits one of the main issues with statement coverage analysis:

    int *iptr = NULL;
    if(conditional)
        iptr = &i;
    *iptr = j*10;

The above code may show 100% code coverage but will obviously fail miserably if the conditional is ever false.

To deal with situations such as the one demonstrated above, branch coverage analysis [Marick] is required. Rather than focusing on the basic blocks and the lines of code executed, branch coverage looks at possible paths a conditional can take and how often each path through the conditional is taken. This will account for problems such as those described above because it will clearly show that the conditional was never evaluated to be false. Another advantage of branch coverage analysis is that knowledge of how many times each line of code was executed can be derived by knowing how many times each conditional was evaluated for each possible outcome.

Although branch coverage analysis provides a more accurate view of code coverage than statement coverage, it is still not perfect. To gain a better understanding of the path taken through the entire conditional, the outcome of each test in a complex conditional must be determined. For instance:

    struct somestructure* p = NULL;

    if(i == 0 || (j == 1 && p->j == 10))
        printf("got here\n");

If i is ever non-zero when j is 1, then the NULL pointer p will be dereferenced in the last part of the conditional. There are paths through this branch in both the positive and negative case that will never expose the potential segmentation fault.

Branch coverage would show how many times the entire conditional was evaluated to true or false, but it would not show the outcome of each test that led to the decision.

There are many other types of coverage for dealing with a variety of corner cases. However, statement coverage and branch coverage are the two types that will be focused on in this paper, since they are the only ones supported by GCOV.

3 Methods considered

This section discusses some of the more popular techniques for collecting statement and branch coverage data in a running system.

3.1 Estimation

At a most basic level, estimating techniques could be used to collect code coverage within a given program. Obviously, this method has no redeeming value at all with regards to accuracy or reproducibility. The advantage of this method, and the only reason it is mentioned, is that it requires the least amount of effort. Given a developer with sufficient knowledge of the code, a reasonable estimate of code coverage may be produced in a short amount of time.

3.2 Profiling

Performance profilers are well known, easy to use, and can be used to track coverage; however, these uses are not the purpose for which performance profilers were intended. Performance profilers use statistical profiling rather than exact profiling.

Readprofile is a simple statistical profiler that stores the kernel PC value into a scaled histogram buffer on every timer tick. Readprofile profiles only the kernel image, not user space or kernel modules. It is also incapable of profiling code where interrupts are disabled.

Oprofile, another statistical profiler, leverages the hardware performance counters of the CPU to enable the profiling of a wide variety of interesting statistics, which can also be used for basic time-spent profiling. Oprofile is capable of profiling hardware and software interrupt handlers, the kernel, shared libraries, and applications.

All of these statistical profilers can be used indirectly to gather coverage information, but because they approximate event distribution through periodic sampling via an interrupt, they cannot be considered accurate for true coverage analysis.

3.3 User-Mode Linux

User-mode Linux is an implementation of the Linux kernel that runs completely in user mode. Because it runs as a user mode application, the possibility exists to utilize analysis tools that were normally intended to run against applications. These tools would not normally work on kernel code, but User-mode Linux makes it possible. To make use of this feature in User-mode Linux, the kernel needs to be configured to compile with the necessary flags to turn on coverage tracking. To do this, the "Enable GCOV support" option must be selected under "Kernel Hacking" in the configuration menu.

When GCOV support is turned on for User-mode Linux, the necessary files for using GCOV are created, and coverage is tracked from boot until shutdown. However, intermediate results cannot be gathered, and counters cannot be reset during the run. So the resulting coverage information will represent the boot and shutdown process and everything executed during the UML session. It is not possible to directly gather just the coverage results of a single test on the kernel.

3.4 GCOV-Kernel Patch

In early 2002, Hubertus Franke and Rajan Ravindran published a patch to the Linux kernel that allows the user space GCOV tool to be used on a real, running kernel. This patch meets the requirement to use existing tools. It also provides the ability to view a snapshot of coverage data at any time. One potential issue with this approach is that it requires the use of a kernel patch that must be maintained and updated for the latest Linux kernel before coverage data can be gathered for that kernel.

The GCOV-kernel patch and GCOV user mode tool were chosen to capture and analyze code coverage data. The other methods listed above were considered; however, the GCOV-kernel patch was chosen because it allowed platform specific analysis as well as the ability to easily isolate the coverage provided by a single test.

4 GCOV Coverage Tool

Having decided to use the GCOV tool for LTP, we first describe how GCOV works in user space applications and then derive what modifications need to be made to the kernel to support that functionality.

GCOV works in four phases:

1. Code instrumentation during compilation

2. Data collection during code execution

3. Data extraction at program exit time

4. Coverage analysis and presentation post-mortem (GCOV program)

In order to turn on code coverage information, GCC must be used to compile the program with the flags "-fprofile-arcs -ftest-coverage". When a file is compiled with GCC and code coverage is requested, the compiler instruments the code to count program arcs. The GCC compiler begins by constructing a program flow graph for each function. Optimization is performed to minimize the number of arcs that must be counted. Then a subset of program basic blocks are identified for instrumentation. With the counts from this subset GCOV can reconstruct program arcs and line execution counts. Each instrumented basic block is assigned an ID, blockno. GCC allocates a basic block counter vector counts that is indexed by the basic block ID. Within each basic block the compiler generates code to increment its related counts[blockno] entry. GCC also allocates a data object struct bb bbobj to identify the name of the compiled file, the size of the counts vector, and a pointer to the vector. GCC further creates a constructor function _GLOBAL_.I.FirstfunctionnameGCOV that invokes a global function __bb_init_func(struct bb*) with the bbobj passed as an argument. A pointer to the constructor function is placed into the ".ctors" section of the object code. The linker will collocate all constructor function pointers that exist in the various *.o files (including those from other programming paradigms, e.g. C++) into a single ".ctors" section of the final executable. The instrumentation and transformation of code is illustrated in Figure 1.

In order to relate the various source code line information with the program flow and the counter vector, GCC also generates two output files for each sourcefile.c compiled:


---------------------- file1.c ----------------------

→ static ulong counts[numbbs];
→ static struct bbobj =
→   { numbbs, &counts, "file1.c" };
→ static void _GLOBAL_.I.fooBarGCOV()
→   { __bb_init_func(&bbobj); }

void fooBar(void)
{
→   counts[i]++;
    <bb-i>
    if (condition) {
→       counts[j]++;
        <bb-j>
    } else {
        <bb-k>
    }
}

→ SECTION(".ctors")
→   { &_GLOBAL_.I.fooBarGCOV }
------------------------------------------------------

"→" indicates the lines of code inserted by the compiler, and <bb-x> denotes the x-th basic block of code in this file; italic/bold code is added by the GCC compiler. Notice that the arc from the if statement into bb-k can be derived by subtracting counts[j] from counts[i].

Figure 1: Code modification example

(i) sourcefile.bb, which contains a list of source files (including headers), functions within those files, and line numbers corresponding to basic blocks in the source file; and (ii) sourcefile.bbg, which contains a list of the program flow arcs for each function and which, in combination with the *.bb file, enables GCOV to reconstruct the program flow.

The glibc (C runtime library) linked with the executable provides the glue and initialization invocation and is found in libgcc2.c. More specifically, it provides the __bb_init_func(struct bb*) function that links an object passed as an argument onto a global bbobj list, bb_head. The runtime library also invokes all function pointers found in the ".ctors" section, which results in all bbobjs being linked onto the bb_head list, as part of the _start wrapper function.

At runtime, the counter vector entries are incremented every time an instrumented basic block is entered. At program exit time, the GCOV-enabled main wrapper function walks the bb_head list and, for each bbobj encountered, creates a file sourcefile.da (note the source file name was stored in the structure) and populates the file with the size of the vector and the counters of the vector itself.
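The sketch below (in C) outlines the two pieces just described: the constructor-time registration onto bb_head and the exit-time walk that dumps each counter vector. The structure fields and the write_da_file() helper are simplified stand-ins for illustration, not the actual libgcc2.c code.

    /* Simplified view of the per-file coverage object. */
    struct bb {
        long         ncounts;    /* number of counters for this file    */
        long        *counts;     /* counter vector updated at run time  */
        const char  *filename;   /* e.g. "file1.c", written to file1.da */
        struct bb   *next;
    };

    static struct bb *bb_head;

    /* Called from each object file's .ctors constructor. */
    void __bb_init_func(struct bb *b)
    {
        b->next = bb_head;
        bb_head = b;
    }

    /* Called on the exit path: dump every registered counter vector. */
    static void dump_all_counters(void)
    {
        struct bb *b;

        for (b = bb_head; b; b = b->next)
            write_da_file(b->filename, b->counts, b->ncounts); /* hypothetical helper */
    }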

In the post-mortem phase, the GCOV utility integrates and relates the information in the *.bbg, *.bb, and *.da files to produce the *.gcov files containing per-line coverage output, as shown in Figure 2. For instrumented lines that are executed at least once, GCOV prefaces the text with the total number of times the line was executed. For instrumented lines that were never executed, the text is prefaced by the string ######. For any lines of code that were not instrumented, such as comments, nothing is added to the line by GCOV.

5 GCOV Kernel Support

The GCOV tool does not work as-is in conjunction with the kernel. There are several reasons for this. First, though the compilation of the kernel files also generates the bbobj constructors and the ".ctors" section, the resulting kernel has no wrapper function to call the constructors in .ctors. Second, the kernel never exits, so the *.da files are never created. One of the initial overriding requirements was to not modify the GCOV tool and the compiler.


-------------------- example.c --------------------

            int main() {
     1          int i;
     1          printf("starting example\n");
    11          for(i=0;i<10;i++) {
    10              printf("Counter is at %d\n",i);
    10          }
                /* this is a comment */
     1          if(i==1000) {
######              printf("This line should "
                        "never run\n");
######          }
     1          return 0;
     1      }

Figure 2: GCOV output

The first problem was solved by explicitly defining a ".ctors" section in the kernel code. The new symbol __CTOR_LIST__ is located at the beginning of this section and is made globally visible to the linker. When the kernel image is linked, __CTOR_LIST__ holds the array of constructor function pointers. This entire array is delimited by the variables ctors_start = &__CTOR_LIST__ and ctors_end = &__DTOR_LIST__, where __DTOR_LIST__ is the starting address of the ".dtors" section, which in turn contains all destructor functions. By default the counter vectors are placed in zero-filled memory and do not need to be explicitly initialized. Data collection starts immediately with the OS boot. However, the constructors are not called at initialization time; their invocation is deferred until the data is accessed (see below).

The second problem is the generation of the *.da files. This problem is solved by the introduction of a /proc/gcov file system that can be accessed from user space at any time. The initialization, creation, and access of the /proc/gcov filesystem is provided as a loadable module, "gcov-proc", in order to minimize modifications to the kernel source code itself. When the module is loaded, the constructor array is called and the bbobjs are linked together. After that the bb_head list is traversed and a /proc/gcov filesystem entry is created for each bbobj encountered. This is accomplished as follows: the kernel source tree's basename (e.g. /usr/src/linux-2.5.xx) is stripped from the filename indicated in the bbobj. The program extension (e.g. *.c) of the resulting relative file path (e.g. arch/i386/kernel/mm.c) is changed to *.da to form the path name of the entry. Next, this pathname is traversed under the /proc/gcov filesystem and, for newly encountered levels (directories), a directory node is created. This process is accomplished by maintaining an internal tree data structure that links bbobjs and proc_entries together. Because GCOV requires all *.c, *.bb, *.bbg, and *.da files to be located in the same directory, three symbolic links for the *.{c | bb | bbg} files are created pointing into their appropriate kernel source path. This is shown in Figure 3.
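A schematic of that deferred constructor walk is shown below; the declarations are simplified, but the idea is simply to iterate over the pointers bracketed by ctors_start and ctors_end and call each one, which in turn links its bbobj onto bb_head.

    typedef void (*ctor_t)(void);

    /* Set up from &__CTOR_LIST__ and &__DTOR_LIST__ at link time. */
    extern ctor_t *ctors_start, *ctors_end;

    static void call_gcov_constructors(void)
    {
        ctor_t *ctor;

        for (ctor = ctors_start; ctor < ctors_end; ctor++)
            (*ctor)();    /* each constructor calls __bb_init_func() */
    }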

Reading from a *.da file gives access to the underlying vector of the corresponding file in the format expected by the GCOV tool. Though the data is accessed on a per-file basis, a /proc/gcov/vmlinux file is provided to dump the entire state of all vectors. However, reading the full data from /proc/gcov/vmlinux would require a change to GCOV. Resetting the counter values for the entire kernel is supported through writing to the /proc/gcov/vmlinux file using 'echo "0" > /proc/gcov/vmlinux'. Individual file counters can be reset as well by writing to the respective *.da file under the /proc/gcov filesystem.
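From user space, the reset is just a write to that /proc entry. The minimal sketch below does in C what the echo command above does from a shell.

    #include <fcntl.h>
    #include <unistd.h>

    /* Zero all kernel GCOV counters by writing "0" to /proc/gcov/vmlinux. */
    static int gcov_reset_all(void)
    {
        ssize_t n;
        int fd = open("/proc/gcov/vmlinux", O_WRONLY);

        if (fd < 0)
            return -1;
        n = write(fd, "0", 1);
        close(fd);
        return n == 1 ? 0 : -1;
    }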

5.1 Dynamic Module Support

Dynamic modules are an important feature that reduces the size of a running kernel by compiling many device drivers separately and only loading code into the kernel when a specific driver is needed.


[Figure 3 (diagram): Data structures maintained by the gcov-proc loadable module. The /proc/gcov directory tree of proc_entries (e.g. kernel/, net/unix/) is cross-referenced with the bb_head list of bbobjs; leaf entries hold the *.da files and the *.c, *.bb, *.bbg symbolic links.]

Like the kernel, the module compile process inserts the coverage code, making counter collection automatic through code execution. In order to access the GCOV results through /proc, the ".ctors" section of the module code needs to be executed during the dynamic load. As stated above, the GCOV-enabled kernel locates the array of constructor routine function pointers starting from ctors_start to ctors_end. Due to the dynamic nature of a module, its constructors are not present in this section. Instead, to obtain the constructor address of the dynamic module, two more variables (i.e., ctors start and end) were added to "struct module", the data structure which is passed along by the modutils to the kernel each time a module is loaded. When the dynamic module is loaded, the kernel invokes the constructor, which results in the linkage of the bbobj to the tail of bb_head. If the gcov-proc module has already been loaded, then the /proc/gcov filesystem entries are added. If the gcov-proc module is not loaded, their creation is deferred until gcov-proc is loaded. Once gcov-proc is loaded, the /proc/gcov filesystem is created by traversing the bb_head list.

Dynamic modules will create their proc filesystem entries under /proc/gcov/module/{path to the dynamic module source}. A module's .da entry and symbolic links *.{c | bb | bbg} will be automatically removed at the time of module unloading.

Up to Linux 2.5.48, all of the section address manipulation and module loading was performed by modutils. Accordingly, we modified the modutils source to store the constructor address of the module being loaded and pass that to the kernel as an argument. Starting with Linux 2.5.48, Rusty Russell implemented an in-kernel module loader, which allows the kernel to obtain the constructor itself instead of requiring modifications to modutils.

5.2 SMP Issues

One of the first concerns with this approach was data accuracy on symmetric multiprocessors (SMPs). SMP environments introduce the possibility of update races leading to inaccurate GCOV counter data. The machine-level increment operation used to update a GCOV counter is not atomic. In CISC and RISC architectures such as Intel® and PowerPC®, a memory increment is effectively implemented as load, add, and store. In SMP environments, this can result in race conditions where multiple readers will load the same original value, then each will perform the addition and store the same new value. Even if the operation is atomic on a uniprocessor, caching on SMPs can lead to unpredictable results. For example, in the case of three readers, each would read the value n, increment, and store the value n+1. After all three readers were complete the value would be n+1; the correct value, after serializing the readers, should be n+3. This problem exists on SMPs and in multi-threaded environments using GCOV in kernel or user space. Newer versions of GCC are experimenting with solutions to this issue for user space by duplicating the counter array for each thread. The viability of this approach is questionable in the kernel, where GCOV would consume about 2 megabytes per process. We discuss this issue further in the Future Work section (Section 9).

A test was constructed to observe the anomaly. A normally unused system call was identified and used so the exact expected value of the coverage count would be known. Two user space threads running on different CPUs made a total of 100K calls to the sethostname system call. This test was performed 35 times and the coverage count was captured. On average, the coverage count for the lines in the sethostname kernel code was 3% less than the total number of times the system call was made.

The simplest solution was to ensure that counter addition was implemented as an atomic operation across all CPUs. This approach was tested on the x86 architecture, where a LOCK instruction prefix is available. The LOCK prefix uses the cache coherency mechanism, or locks the system bus, to ensure that the original operation is atomic system-wide. An assembler pre-processor was constructed to search the assembly code for GCOV counter operands (LPBX2 labels). When found, the LOCK prefix would be added to these instructions. A GCC machine dependent addition operation was also constructed for x86 to accomplish the same task.
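The effect of the change is easiest to see in a small hedged illustration: the plain C increment compiles down to a load/add/store sequence, while the LOCK-prefixed instruction performs the read-modify-write atomically. The counter below stands in for one GCOV basic-block counter (an LPBX2 slot); the actual patch rewrote the compiler-generated assembly rather than using C wrappers like these.

    static unsigned int counter;

    static void count_plain(void)
    {
        counter++;                           /* load, add, store: racy on SMP */
    }

    static void count_locked(void)
    {
        __asm__ __volatile__("lock; incl %0"
                             : "+m" (counter));   /* atomic system-wide */
    }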

The results of the sethostname test on the new locking GCOV kernel (2.5.50-lock-gcov) showed no difference between measured coverage and expected coverage.

5.3 Data Capture Accuracy

Another effect on data accuracy is that capturing GCOV data is not instantaneous. When we start capturing the data (the first item on bb_head) it may have value v_orig. By the time the last item on bb_head is read, the first item may now have value v_new. Even if no other work is being performed, reminiscent of the Heisenberg effect, the act of capturing the GCOV data modifies the data itself. In Section 8 we observed that 14% of the kernel is covered just from running the data capture, so 14% coverage is the lowest result we would expect to see from running any test.

5.4 GCOV-Kernel Change Summary

In summary, the following changes were made to the Linux kernel:

(a) modified head.S to include symbols which identify the start and end of the .ctors section

(b) introduced a new file, kernel/gcov.c, where the __bb_init_func(struct bb*) function is defined

(c) added two options to the configure file, (i) to enable GCOV support and (ii) to enable GCOV compilation for the entire kernel. If the latter is not desired, then it is the responsibility of the developer to explicitly modify the Makefile to specify the file(s). For this purpose we introduced GCOV_FLAGS as a global macro.

(d) modified kernel/module.c to deal with dynamic modules

(e) wrote an assembler pre-processor to address issues with inaccurate counts on SMP systems

Everything else is kept in the gcov-proc module.

6 GCOV Patch Evaluation

This section summarizes several experiments we conducted to evaluate the GCOV-kernel implementation. Unless otherwise stated, all tests were conducted on a generic hardware node with a two-way SMP, 400 MHz Pentium® II, 256 MB RAM. The Red Hat 7.1 distribution (with a 2.5.50 Linux kernel) for a workstation was running the typical software (e.g. inetd, Postfix), but remained lightly loaded. The GCC 2.96 compiler was used.

The total number of counters in all the .da files produced by GCOV was calculated for 2.5.50: 336,660 counters * 4 bytes/counter ⇒ 1.3 Mbytes of counter space in the kernel.¹ The average number of counters per file is 718. drivers/scsi/sym53c8xx.c had the most lines instrumented, with 4,115, and include/linux/prefetch.h was one of 28 files that had only one line instrumented.

6.1 LOCK Overhead

A micro-benchmark was performed to estimate the overhead of using the LOCK prefix. From two separate loops, multiple increment and locked increment instructions were called.

¹ In GCC 3.1 and higher, counters are 8 bytes.

Test kernel          runtime (sec)
2.5.50               596.5
2.5.50-gcov          613.0
2.5.50-gcov-lock     614.0

Table 1: System and user times (in seconds) for 2.5.50 kernel compiles using the three test kernels.

The runtime for the loops was measured and an estimate of the average time per instruction was calculated. The increment instruction took 0.015 µs, and the locked increment instruction took 0.103 µs. This represents a 586% instruction runtime increase when using the LOCK prefix. At first this might seem like a frightening and unacceptable overhead. However, if the actual effect on sethostname runtime is measured, it is 19.5 µs per system call with locking and 13.9 µs per call without locking, a more acceptable 40% increase. However, further testing is required to ensure results do not vary depending on lock contention.

6.2 Performance Results

Finally, a larger scale GCOV test was performed by comparing compile times for the Linux kernel. Specifically, each test kernel was booted and the time required to compile (execute make bzImage) the Linux 2.5.50 kernel was measured. Table 1 shows runtimes (in seconds) as measured by /bin/time for the three kernels tested. These results show a 3% overhead for using the GCOV kernel and a very small additional overhead from using the locking GCOV kernel.


Figure 4: Partial LCOV output for fs/jfs

Figure 5: Example LCOV output for a file

7 Using GCOV and LCOV to generate output

Once the .da files exist, GCOV is capable of analyzing the files and producing coverage output. To see the coverage data for a user space program, one may simply run 'gcov program.c' where program.c is the source file. The .da file must also exist in the same directory.

The raw output from GCOV is useful, but not very readable. In order to produce nicer looking output, a utility called LCOV was written. LCOV automates the process of extracting the coverage data using GCOV and producing HTML results based on that data [Lcovsite].

The LCOV tool first calls GCOV to generate the coverage data, then calls a script called geninfo to collect that data and produce a .info file to represent the coverage data. The .info file is a plain-text file with the following format:

TN: [test name]
SF: [path and filename of the source file]
FN: [function start line number],[function name]
DA: [line number],[execution count]
LH: [number of lines with count > 0]
LF: [number of instrumented lines]
end_of_record
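As a concrete illustration of that format, a single hypothetical record might look like the following; the file name, line numbers, and counts are invented, but the LH and LF fields are consistent with the DA lines shown.

    TN: ltp-run
    SF: /usr/src/linux-2.5.50/kernel/fork.c
    FN: 112,do_fork
    DA: 112,5
    DA: 118,5
    DA: 131,0
    LH: 2
    LF: 3
    end_of_record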

Once the .info file has been created, the genhtml script may be used to create HTML output for all of the coverage data collected. The genhtml script will generate both output at a directory level, as illustrated in Figure 4, and output on a file level, as illustrated in Figure 5.


[Figure 6 (diagram): LCOV flow. Running code produces .da files, geninfo.pl collects them into a .info file, and genhtml.pl generates the HTML output.]

The basic process flow of the LCOV tool is illustrated in Figure 6. The commands used to generate kernel results are as follows:

1. Load the gcov-proc kernel module.

2. lcov -zerocounters
   This resets the counters, essentially calling 'echo "0" > /proc/gcov/vmlinux'.

3. Run the test.

4. lcov -c -o kerneltest.info
   This produces the .info file described above.

5. genhtml -o [outputdir] kerneltest.info
   The final step produces HTML output based on the data collected in step 3.

The original intent of the LCOV tool was kernel coverage analysis, so it contains some features, such as resetting the counters through /proc, that are only for use with the gcov-proc module. However, the LCOV tool can also be used for analyzing the coverage of user space applications and producing HTML output based on the data collected. The process for using LCOV with applications is mostly the same. In order for any coverage analysis to take place, the application must be compiled with GCC using the -fprofile-arcs and -ftest-coverage flags. The application should be executed from the directory it was compiled in, and for GCOV to work on user space programs, the application must exit normally. The process for using LCOV with an application is as follows:

1. Run the program following your test procedure, and exit.

2. lcov -directory [dirname] -c -o application.info

3. genhtml -o [outputdir] application.info

LCOV can also be used to manipulate .info files. Information describing coverage data about files in a .info file may be extracted or removed. Running lcov -extract file.info PATTERN will show just the data in the .info file matching the given pattern. Running lcov -remove file.info PATTERN will show all data in the given .info file except for the records matching the pattern, thus removing those records. Data contained in .info files may also be combined by running lcov -a file1.info -a file2.info -a file3.info ... -o output.info, producing a single file (output.info) containing the data from all files passed to lcov with the -a option.

8 Applying results to test development

The real benefit of code coverage analysis is that it can be used to analyze and improve the coverage provided by a test suite. In this case, the test suite being improved is the Linux Test Project. 100% code coverage is necessary (but not sufficient) for complete functional testing. It is an important goal of the LTP to improve coverage by identifying untested regions of kernel code. This section provides a measure of LTP coverage and shows how the GCOV-kernel can improve future versions of the LTP suite.

Results were gathered by running the LTP under a GCOV-enabled kernel. There are a few things that must be taken into consideration.

It is obviously not possible to cover all of the code. This is especially true for architecture and driver specific code. Careful consideration and planning must be done to create a kernel configuration that will accurately describe which kernel features are to be tested. This usually means turning on all or most things and turning off drivers that do not exist on the system under test.

ID      Test             KC     DC     TC
LTP     LTP              35.1    0.0  100.0
CP      Capture          14.4    0.3   97.7
MK      Mkbench          17.0    0.0   99.9
LM      Lmbench          22.0    0.3   98.5
SJ      SPECjAppServer   23.6    1.2   95.1
X       Xserver          24.7    1.7   93.0
XMLS    X+MK+LM+SJ       29.4    3.1   89.5
B       Boot             40.1   16.2   59.5

Table 2: Benchmarks. KC: kernel coverage; DC: delta coverage; TC: test coverage

In order to take an accurate measurement for the whole test suite, a script should be written to perform the following steps:

1. Load the gcov-proc kernel module

2. 'echo "0" > /proc/gcov/vmlinux' to reset the counters

3. Execute a script to run the test suite with desired options

4. Gather the results using LCOV

Using a script to do this will minimize variance caused by unnecessary user interaction. Resetting the counters is important for isolating just the coverage that was caused by running the test suite.

Alternatively, a single test program can be analyzed using the same method. This is mostly useful when validating that a new test has added coverage in an area of the kernel that was previously uncovered.

Coverage tests were performed on an IBM® Netfinity 8500R 8-way x 700 MHz SMP running the 2.5.50 Linux kernel. LTP is a test suite for validating the Linux kernel. The version of LTP used in this test was ltp-20030404. LTP hits 47,172 of the 134,396 lines of kernel code instrumented by GCOV-kernel (35%).² This figure alone suggests significant room for LTP improvement. However, to get an accurate picture of LTP usefulness it is also important to make some attempt to categorize the kernel code. This way we can determine how much of the important kernel code is being covered by LTP. For the purposes of this work we have collected coverage information from a set of benchmarks. This set is intended to represent one reasonable set of important kernel code; the code is important because it is used by our set of popular benchmarks. Table 2 shows the coverage for LTP and our suite of tests. Boot captures the kernel coverage during the boot process. Xserver captures the kernel coverage for starting the X server. Capture is the coverage resulting from a counter reset immediately followed by a counter capture. The column KC (kernel coverage) specifies the percentage of kernel code covered by the test. DC (delta coverage) is the percentage of kernel code covered by the test but not covered by LTP.

Although LTP coverage is not yet complete, we see that it exercises all but 3.1% of the kernel code used by the combined test XMLS. Another useful view of the data is the code covered by LTP divided by the code covered by the test, the percent of test coverage covered by LTP. The column TC (test coverage) shows these results. For example, LTP covered (hit) 99.9% of the kernel code covered (used) by mkbench. This figure is particularly useful because it does not depend on the amount of kernel code instrumented. LTP provides almost 90% coverage of the important code used by our combined test, XMLS.

² Results are based on one run of the LTP test suite. Although the number of times any one line is hit can vary randomly, we would expect much smaller variance when considering hits greater than zero.

This result shows very good LTP coverage for the given test suite. Boot should be considered separately because much of the code used during boot is never used again once the operating system is running. But even here, LTP covers 60% of the code used during the boot process.

Another benefit of code coverage analysis in the Linux Test Project is that it provides a method of measuring the improvement of the LTP over time. Coverage data can be generated for each version of LTP released and compared against previous versions. Great care must be taken to ensure stable test conditions for comparison. The same machine and same kernel config options are used each time; however, the most recent Linux kernel and the most recent version of LTP are used. Changing both is necessary to prove that the code coverage provided by LTP is increasing with respect to the changes that are occurring in the kernel itself. Since features, and thus code, are more often added to the kernel than removed, this means that tests must constantly be added to the LTP in order to improve the effectiveness of LTP as a test suite.

9 Future Work

Obviously, the tools and methods described in this paper are useful in their current state. However, there is some functionality that is still required. LCOV should be modified to process and present the branch coverage data currently produced by GCOV. This information would give a better picture of kernel coverage. Many other enhancements are currently being designed or developed against the existing set of tools.

It would be very useful to know kernel coverage for a specific process or group of processes. This could be done by introducing a level of indirection, so process-specific counters can be accessed without changing the code. This approach could utilize Thread Local Store (TLS), which is gaining popularity [Drepper]. TLS uses an entry in the Global Descriptor Table and a free segment register as the selector to provide separate initialized and uninitialized global data for each thread. GCC would be modified to reference the TLS segment register when accessing global variables. For a GCOV kernel, the storage would have to be in kernel space (Thread Kernel Storage). The challenge here is to efficiently provide the indirection. Even if the cost of accessing TLS/TKS is low, it is probably infeasible to gather coverage data for all processes running on a system; the storage required would be considerable. So a mechanism for controlling counter replication would be necessary.

Some parts of the kernel code are rarely used. For example, init code is used once and removed from the running image after the kernel boots. Also, a percentage of the kernel code is devoted to fault management and is often not tested. The effect of this rarely used code is to reduce the practically achievable coverage. Categorizing the kernel code could isolate this code, provide a more realistic view of LTP results, and help focus test development on important areas of the kernel. The kernel code could be divided into at least three categories, with init and error path code being two separate categories; the rest of the kernel code could be considered mainline code. Coverage would then be relative to the three categories. A challenge to some of this categorization work would be to automate code identification, so new kernels could be quickly processed. In Section 8 we attempted to define important kernel code based on lines hit by some popular benchmarks. This approach could also be pursued as a technique for categorizing kernel code.

Finally, a nice feature to have would be the ability to quickly identify the coverage specific to a particular patch. This could be accomplished with a modified version of the diff program that is aware of the GCOV output format. Such a system would perform normal code coverage on a patched piece of code, then do a reversal of the patch against the GCOV output to see the number of times each line of code in the patch was hit. This would allow a patch developer to see how much of a patch was being tested by the current test tools and, if necessary, develop additional tests to prove the validity of the patch.

10 Conclusions

This paper described the design and implementation of a new toolkit for analyzing Linux kernel code coverage. The GCOV-kernel patch implements GCOV functionality in the kernel and modules, and provides access to results data through a /proc/gcov filesystem. LCOV processes the results and creates user friendly HTML output. The test results show minimal overhead when using the basic GCOV-kernel. The overhead of an SMP-safe extension can be avoided when precise counter data is not necessary. This paper also described how this toolkit is being used to measure and improve the effectiveness of a test suite such as the Linux Test Project.

This system is already sufficient for use in producing data that will expose the areas of the kernel that must be targeted when writing new tests. It will then help to ensure that the new tests written for the Linux Test Project are successful in providing coverage for those sections of code that were not covered before. The actual implementation of the methods and tools described above as a means of improving the Linux Test Project is just beginning as of the time this paper was written. However, several tests, such as mmap09, as well as LVM and device mapper test suites, have been written based solely on coverage analysis. Many other tests are currently in development.

References

[Cornett] Code Coverage Analysis. http://www.bullseye.com/coverage.html (2002)

[Drepper] Ulrich Drepper, ELF Handling For Thread-Local Storage. http://people.redhat.com/drepper/tls.pdf (2003)

[Lcovsite] LTP Gcov Extension. http://ltp.sourceforge.net/coverage/lcov.php (2002)

[Marick] Brian Marick, The Craft of Software Testing, Prentice Hall, 1995.

[Perry] William E. Perry, Effective Methods for Software Testing, Second Edition, Wiley Computer Publishing, 2000.

Legal Statement

This work represents the view of the authors and does not necessarily represent the view of IBM.

The following terms are registered trademarks of International Business Machines Corporation in the United States and/or other countries: PowerPC.

Intel and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.


Effective HPC Hardware Management and Failure Prediction Strategy Using IPMI

Richard Libby
[email protected]

Abstract

The Intelligent Platform Management Interface (IPMI) defines common interfaces to "intelligent" hardware used to monitor a server's physical health characteristics, such as temperature, voltage, fans, power supplies and chassis. These capabilities provide information that enables system management, recovery, and asset tracking, which help drive down the total cost of ownership (TCO) and increase reliability in today's HPC market. The new interfaces in IPMI v1.5 facilitate the management of rack-mounted HPC servers and systems in remote environments over serial, modem and LAN connections. New capabilities combined with the remote management functionality allow HPC IT managers to manage their servers and systems regardless of system health, power state or supported communication media. IPMI compliant servers essentially eliminate the need for external hardware to perform the same functions, thus saving costs. This paper will introduce the specification, the benefits of IPMI with respect to HPC and other clusters, and how it could be used to generate alarms to a monitoring system before hardware failures become severe enough to cause cluster failure.

1 Introduction

If we were to look at what hinders large scale cluster deployments, only a handful of barriers come to mind. There is, of course, always the barrier of limited bandwidth, be it in the front side bus, the interconnect or any other form of I/O. Others include space, power, cooling, monitoring and management. This paper will focus on monitoring, management and predictive analysis: essentially overall HPC health and 'sickness prevention.' The Intelligent Platform Management Interface (hereafter referred to as IPMI) is an abstracted hardware layer that provides power control, alerting, sensor monitoring, Field Replaceable Unit (FRU) storage, Sensor Data Record (SDR) storage, customization, and configuration, all through multiple methods of secure access. The total cost of a hardware and software package has come down from US$30/node to sub-US$10, making management a viable solution to reduce the total cost of ownership (TCO).

The IPMI specification itself is a mere 450 pages, and the other supporting documents (four of them [Reference section: C, D, F & H], plus 23 referenced documents or RFCs) make for a good, long weekend of reading. This paper is a brief summary of how to use IPMI to get the most out of managing large scale HPC deployments, and how to use it to predict failures before they happen, thereby reducing the TCO of the entire system. The assumption is made in this paper that IPMI is implemented with strict consideration to the specifications.


2 Power Control

Hardware management of large scale HPC deployments today is quite possible if you have access to many people, infinite resources, or a magic wand. Since most of us have either none of these (in the case of the magic wand), or limited amounts of them (in the case of many people and infinite amounts of money), managing large scale deployments of anything is quite a chore. What do we really mean by 'managing?' Most think of managing a cluster as two isolated systems: hardware and software. I would submit that it is a combination of both. If hardware fails, so does software. If software fails, hardware becomes a really expensive room heater. So, management becomes the ability to control software by looking at hardware as one failure point, or to have the software control the hardware if thresholds increase beyond set limits. These control methods include powering off a server, powering on a server, resetting a server, and obtaining the status of the power condition.

Resetting the server from a remote location seems trivial to most. "Why not just execute a command such as a remote 'init 6' (soft reset) or something similar, depending on the operating system?" some might ask. Ah, but what happens if the operating system is hung? Oh, just send one of your people down to hit the reset button. In the case of the TeraGrid project, that may not be possible, since there are at least four sites where the HPC might be. So a system reset function has gone from being a trivial action to a not-so-trivial action. Using IPMI, a system reset is trivial with a mere command.

Powering a system on and off is not quite as trivial to achieve. Sure it is, you say: you go out and purchase a network power switch, and spend about US$500.00 in the process. OK, so you can achieve power control this way, but not cost effectively in a large scale HPC deployment. Let's say, for example, that at best you could control eight machines with one network power switch. At US$63 per outlet, 256 outlets would cost you US$16k, 512 outlets would cost US$32k, and 1000 outlets would cost US$63k. This adds only a little to the overall cost, but if power control is offered with an IPMI compliant server, you could use this additional money to purchase more servers. There's more on powering on and off, but it will be covered in another section.

In addition to being able to reset and power on and off, having the ability to query the server power status is also valuable. Instead of having to force power off (if on, and vice versa), potentially losing a job or data, you can query to see if the server's power is on or off. This seems rather trivial at times, but in some instances, when there is no OS available and the server is not in any proximity to you, it is nice to have this feature.

Lastly, one nice control feature is telling the UID light what to do, and how long to do it for. This is the little blue light found on the front and rear of most servers that turns on to distinguish it from other servers in cases of large scale HPC build-outs.

3 Alerting

Power control is essential, but from a predictive failure analysis standpoint, alerting is the driving force of determining if any one point of the HPC is suffering. Think of the alerting component as pain identification. This identification breaks down into event filtering, trapping (via SNMP), and policy management. Some aspects will not be mentioned fully, as they will be covered in other sections.

Without going into too much depth, when the Baseboard Management Controller (BMC, a.k.a. the service processor) observes an event that occurs (see the sensor section for types of events that might be interesting), it logs it into non-volatile space in the System Event Log (SEL). Events generated either externally or internally are run through a filter within the BMC that can perform actions based on policies (more on policies later). These types of events might indicate impending danger of catastrophic events on any one machine, or in more serious cases, could indicate rack-wide implications, such as a fire. The ability to filter events combined with policy management gives the HPC administrator a powerful tool in predicting impending failure.

One of the keys to being able to predict (and later analyze) whether there is an impending failure is knowing about it. The trapping component forwards messages on via SNMP to a “health master” that could look at the event (and compare it against a set of policies), determine if it was critical enough to ruin jobs on that node, and take appropriate action(s).

Policies can be set up to read either discrete or threshold-based values and define an action based on the type of event it is. For example, if a server in your HPC had a fan failure, the most likely thing you would notice from an end user point of view is the other fans ramping up. That action is an event that was driven by the BMC through a policy that was set either by a factory setting, or specified by an administrator programmatically. Once the other fans were ramped up to compensate for the missing fan, alerts can be trapped and sent via any number of methods to the “health master,” which in turn can determine (possibly by its own set of policies) if the event warrants moving data or jobs from one machine to a spare. Data can be recorded and analyzed later for recurring trends or even external environment conditions that might be causing failures. One example of this might be as follows: one rack might be sitting directly under a cooling source. The racks adjoining it would read higher ambient chassis temperature, and show skewed temperature data, but with the information, a stealthy sysadmin could correlate the HVAC ‘on’ times with a drop in ambient temperature inside the rack that gets hit with a blast of cool air. Although this condition in and of itself is not bad, one set of servers in the HPC might run faster due to being slightly cooler and cause other bottlenecks.
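
To make the policy idea concrete, here is a minimal C sketch of the kind of threshold check a “health master” might apply to an incoming fan-speed reading. The structure, limits, and actions are hypothetical illustrations, not anything defined by IPMI.

/* Hypothetical "health master" policy: compare a reported sensor value
 * against a threshold and choose an action. Illustrative only. */
#include <stdio.h>

enum action { ACTION_NONE, ACTION_ALERT, ACTION_MIGRATE_JOBS };

struct policy {
    const char *sensor;   /* e.g. "chassis fan 1" */
    double low_limit;     /* act when the reading drops below this */
};

static enum action evaluate(const struct policy *p, double reading)
{
    if (reading >= p->low_limit)
        return ACTION_NONE;
    /* A reading near zero usually means the fan has failed outright. */
    return reading < p->low_limit / 2 ? ACTION_MIGRATE_JOBS : ACTION_ALERT;
}

int main(void)
{
    struct policy fan = { "chassis fan 1", 2000.0 /* RPM */ };

    printf("action = %d\n", evaluate(&fan, 300.0)); /* migrate jobs */
    return 0;
}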

4 Common Sensor Model

So, what can IPMI in its present state touch and feel? Presently, it includes voltage, temperature, fan speeds, and presence.

Servers based on IPMI give the ability to monitor voltages on the baseboard and power supply. In cases where there is a lot of solar spot activity, this data could be useful. For example, too many voltage changes, and memory might start to have single bit errors, and too many of those might generate a multi-bit error that is not recoverable. From an industrial and commercial perspective, if the server suffered this, jobs would either be lost, or potentially have silent data corruption. Having the ability to monitor and log this data could be useful to see if there were external environment conditions that would impact server stability. In this particular example, a sysadmin might want to spec out Telco grade hardware that runs on DC power, thereby eliminating the external condition and increasing reliability.

With the tight thermal tolerances these days, a fan failure in a server chassis can result in a catastrophic loss, especially in the smaller form factors (i.e., 1U). Having the ability to monitor fan speeds can greatly increase the chance of keeping jobs that are running on one member of the HPC and moving them to another, provided the event deems it worthy of doing so. This does not just include chassis fans, but can be expanded to include processor, memory, and hard drive bay fans.

Temperature is a key aspect to monitor for predicting failure in an HPC. Logging events, using stats to show trends, and reacting to generated alerts are the ways a sysadmin can decrease the possibility of downtime. It will also increase the ability to determine external environment conditions that would affect HPC performance, since heat is a critical aspect.

The last aspect of the common sensor model is the ability to detect presence. At times it would be nice to detect if the server chassis lid was opened. The system might log an entry into the SEL based on chassis intrusion. This might be useful information later on to determine causes of ambient temperature changes, or even to determine what time the processors and memory disappeared out of the best server.

5 Access Methods

OK, so we are able to monitor all these cool things on our server, but how does the sysadmin or developer really get to them? There are multiple methods to get the information you need from an IPMI-based server. They include KCS, LAN, serial, modem, I2C, IPMB, and ICMB. None of these methods will be discussed in too much detail here due to space and scope constraints.

One of the most common methods to get to the BMC is through an interface called Keyboard Controller Style (KCS). For IPMI laymen, this means a driver that is loaded and controlled through the OS. This is the preferred method to access the Intelligent Platform Management Bus (IPMB), but certainly not the only one, and certainly not an option in cases where the OS has gone south for the winter.

If the OS does go south, another nice method is using the Remote Management Control Protocol (RMCP) over the LAN. IPMI-based servers have 3.3 volt standby power that provides the BMC with enough juice to live on. With the cost of network interface cards (NICs) coming down, and many board manufacturers placing them on their server boards, this becomes the avenue on which to send UDP packets directed to port 623 that can issue IPMI commands. This becomes a close second to the KCS in that it replaces legacy serial concentration servers, allowing the sysadmin or IT manager to spend their US$50K (for every 40 ports, and US$1.2M for 1000 servers) elsewhere on the HPC.
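
The LAN path is plain UDP. The C fragment below is only a transport sketch: it opens a UDP socket and aims datagrams at port 623 on the target BMC; the RMCP and IPMI session framing that would fill the payload is deliberately omitted.

/* Sketch: the UDP transport used for IPMI-over-LAN (RMCP). Datagrams go
 * to port 623 on the managed server's BMC; building the RMCP/IPMI
 * session payload is not shown. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int open_rmcp_socket(const char *bmc_ip, struct sockaddr_in *dest)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(dest, 0, sizeof(*dest));
    dest->sin_family = AF_INET;
    dest->sin_port = htons(623);            /* RMCP primary port */
    inet_pton(AF_INET, bmc_ip, &dest->sin_addr);
    return fd;
}

/* Usage: sendto(fd, packet, len, 0, (struct sockaddr *)&dest, sizeof(dest)); */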

Other methods of access include direct serial using a serial concentration server. The downside to this interface is the cost of the serial concentration server. One nice feature worth mentioning in a Linux environment is Serial Over LAN (SOL). This allows the HPC sysadmin to enable serial redirection and have it piped over the network interface, thereby eliminating the need for a Keyboard-Video-Mouse controller (KVM), whose costs are exorbitant and whose cabling becomes a nightmare for large scale HPC deployments. (Other features include a private management bus (private I2C bus), and the ability to add above-board remote management cards.)

One nice feature of IPMI is the ability to forward information on to other IPMI compliant servers. Essentially, each IPMI server can become a proxy to relay messages to another IPMB through the Intelligent Chassis Management Bus (ICMB). A sysadmin or network architect could design an ICMB network (really an RS-485 multiport serial bus) that would allow up to 64 nodes per network. Management packets could be routed over this network instead of tying up the LAN channel, or in cases when the LAN was down, routed over ICMB. RS-485 also allows broadcasts, so the sysadmin could invoke a broadcast to a bank of servers. Imagine it, a command that said: “Hello, rack of servers, shut down.”

6 Authentication

All the methods of access are excellent to have, but how are they protected against intrusion? This section briefly discusses the channel privilege levels and encryption methods used to authenticate to the BMC.

There are four privilege levels to the BMC as defined by the IPMI specification: callback, user, operator, and administrative.

Callback is essentially irrelevant these days. It calls only a predefined number via a modem. It also only supports enough commands to initiate a callback. If the user is not at that location, then callback is not highly useful.

One nice channel privilege is called user. It would allow subsystems to monitor HPCs without being intrusive, and potentially lethal from a job perspective. Think of this privilege as ‘sensor snag’ only: essentially, only read-only or otherwise benign commands are allowed. From a predictive analysis perspective, this level would give alerting mechanisms the peace of mind that power control could be assigned to alternate members of the HPC.

Occasionally though, job scalability demands delegation of authority, but not total relinquishing of control. In these cases, the operator privilege level should be chosen. It has all commands available to administration, except configuration commands that can change the behavior of the out-of-band interfaces.

And then you have full rights: the administrative privilege level. This privilege allows for full control over an IPMI-based server. In simple “health master” implementations, this would be the preferred privilege level to use. One quick word of caution: an administrative privilege level can even disable the channel it is coming in over.

Encryption is tied closely to authentication. There are basically three main points to remember about IPMI encryption. The first is that passwords can be sent clear-text, so from a security standpoint, it is something to consider. The second is that whatever the user initiates as an authentication algorithm is what will be returned, provided it falls into the third category: that there are generally two supported algorithms, MD2 and MD5.

7 Field Replaceable Units (FRU)

An enterprise-class system will typically have FRU information for each major system board (e.g. processor board, memory board, I/O board, etc.). The FRU data can include information such as serial number, part number, model, and asset tag. What is this really good for in a failure prediction situation? Well, the “health master” could be programmed such that it could detect that a remote server's part is going to have a failure (let's just say an overheating CPU), look up the part in a parts database, and alert the sysadmin which part they need to take with them to service the unit, what the unit serial number is, and possibly when it was last serviced. It would minimize downtime. By the way, FRU information can even be available when the system is powered down. Another useful example of FRUs being used is automated remote inventory. “IPMI does not seek to replace other FRU or inventory data mechanisms, such as those provided by SM BIOS, and PCI Vital Product Data. Rather, IPMI FRU information is typically used to complement that information or to provide information access out-of-band or under ‘system down’ conditions.”


8 Sensor Data Records

IPMI was created with extensibility and scalability in mind. Unfortunately, with that capability comes a myriad of different components that can be monitored and controlled in different fashions. Sensor Data Records (SDRs) provide system management software the ability to retrieve information from the platform, and automatically configure itself to meet the capabilities of the platform: essentially an abstraction layer.

SDRs exist primarily to describe to server management software what a sensor configuration should look like, and to tell software to pay special attention to certain sub-functions. SDRs are mostly not visible to end users, but could potentially be, especially if purchasing a board and chassis separately. For the most part, SDRs are made specifically for certain configurations of board/chassis combinations.

SDRs also define thresholds and actions based on those thresholds. For example, there might be an upper critical threshold on a board thermal sensor, and that might be tied via an SDR to an action of ramping up fans if excessive heat is detected. From a failure prediction standpoint, the “health master” could detect these changes and ramp fans accordingly. This is just one small case, but you can see that the permutations get quite large with more servers in your HPC. What kinds of sensor types are there? Two main categories exist, analog and digital, covering fan, voltage, temperature, and similar sensors. SDRs tell the BMC what the sensor type is in order to know how to process it. In an analog fan situation, for instance, you actually get a voltage back when polling it, and the software will need to convert it via a mathematical function (a slope function: logarithmic, square root, quadratic, sine, and many more are defined in the IPMI spec).
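
For the common linear case, the raw byte returned by a sensor read is converted with the M, B, and exponent factors stored in that sensor's SDR. The sketch below follows the reading-conversion formula in the IPMI v1.5 specification, y = (M × raw + B × 10^Bexp) × 10^Rexp, with the linearization function treated as the identity; the example fan factors are made up.

/* Sketch: converting a raw sensor reading using factors from its SDR.
 * Implements y = (M * raw + B * 10^Bexp) * 10^Rexp for linear sensors;
 * non-linear sensors would apply a linearization function afterwards. */
#include <math.h>
#include <stdio.h>

struct sdr_factors {
    int m;      /* multiplier */
    int b;      /* additive offset */
    int b_exp;  /* offset exponent */
    int r_exp;  /* result exponent */
};

static double sdr_convert(const struct sdr_factors *f, unsigned char raw)
{
    return (f->m * raw + f->b * pow(10, f->b_exp)) * pow(10, f->r_exp);
}

int main(void)
{
    /* Hypothetical fan tachometer: 60 RPM per count. */
    struct sdr_factors fan = { .m = 60, .b = 0, .b_exp = 0, .r_exp = 0 };

    printf("%.0f RPM\n", sdr_convert(&fan, 85)); /* prints 5100 RPM */
    return 0;
}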

Realistically, the types of information that SDRs can store configuration on are: CPU sensors, chassis intrusion, power supply monitoring, fan speeds, fan presence, board voltages, board temperatures, bus errors, memory errors, and even possibly ASF progress codes (where the subsystem is in coming up; essentially a POST code). “Sensor Data Records are kept in a single, centralized non-volatile storage area that is managed by the BMC. This storage is called the Sensor Data Record Repository (SDR Repository). Implementing the SDR Repository via the BMC provides a mechanism that allows SDRs to be retrieved via ‘out-of-band’ interfaces, such as the ICMB, a Remote Management Card, or other device connected to the IPMB. Like most Intelligent Platform Management features, this allows SDR information to be obtained independent of the main processors, BIOS, system management software, and the OS.”

9 System Event Log

Every IPMI compliant server has a System Event Log (SEL), which is a centralized, non-volatile repository for all events generated. Think of this as more than just a BMC journal: any authenticated user can enter a SEL entry.

Common events that might be stored are reboots, processors going offline, memory errors (both single and multiple bit), sensors that go beyond set thresholds, PCI parity errors (PERR), non-maskable interrupts (NMI), and many more. The “health master” could periodically poll select system event logs from the HPC, determine if there are potential problems that could be coming down the wire, and mitigate a response to those problems.

Alerting is done off the SEL as well. When an event is written to the SEL, it is checked against policies and threshold values, and actions are performed based on those policies. An overly health-conscious sysadmin could put tighter thresholds in place and decrease the possibility of downtime, while at the same time possibly spending more time checking out false alarms. A truly health-conscious sysadmin would determine standard thresholds to work with safely, while keeping in mind that a false alarm here and there might keep them on their toes and possibly prevent a catastrophic failure on a node (or more than one in some cases).

Clearing the SEL is possible as well; in fact, once a SEL is queried and its data is stored on the “health master,” it is advisable to clear the SEL. The main reason for this is that if the SEL fills up, new entries are dropped. This could cause a sysadmin to miss critical events that could lead either to extensive downtime, or a catastrophe.

10 Configuration

IPMI gives such flexibility that many aspects can be configured to fit the end-user's needs. These parameters include: BMC policies, LAN configuration (such as static or dynamic IP address, mask, and gateway), privilege levels, alerting functions (such as whom to forward alerts to), sensor polling rates, etc.

Imagine for a moment that your “health master” detects that one of the nodes in the HPC is going down due to some failure, and you are able to retrieve the critical data off the hard drive. When you bring up a spare node in its place, you can dump the hard drive data back down, right? Sure, but what about the BMC configuration? This is where IPMI shines through if implemented properly. With most IPMI compliant servers today, a sysadmin can retrieve the BMC information. But stuffing it back onto another machine is a little trickier, especially if it is over the LAN interface.

11 Questions Users Might Ask

A few questions might remain in the reader's mind after this brief whetting. One such question might be, “What is the relationship or difference between IPMI and the Alert Standard Format (ASF)?” “While somewhat of an oversimplification, ASF may be considered to be scoped for ‘desktop/mobile’ class systems, and IPMI for ‘servers’ where the additional IPMI capabilities such as event logging, multiple users, remote authentication, multiple transports, management extension busses, sensor access, etc., are valued. However there are no restrictions in either specification as to the class of system that the specification can be used. [I.e.,] you can use IPMI for desktop and mobile systems and ASF for servers if the level of manageability fits your requirements.”

Another might be, “I don't have IPMI capable servers today; can I add in an IPMI card?” The short answer to that question is possibly, but it depends on your server base board. Provided it has an interface to the IPMB, the possibilities increase. But the best thing to do is to weigh the costs associated with not having IPMI (as mentioned earlier: no KVM, network power switch, or serial concentration server) and see what benefits it can bring to predict an impending failure before it happens.

12 Summary

The key to using IPMI for failure prediction analysis is the polling and listening software, and how intelligent it is at analyzing the data to make predictions as to where problems lie.

As most HPC sysadmins know, large scale cluster deployment barriers include management, monitoring, fault isolation, and failure prediction. IPMI gives a sysadmin a great way to reduce the TCO of HPCs by monitoring overall HPC health and ‘preventing sickness’: it provides an abstracted hardware layer that offers power control, alerting, sensor monitoring, Field Replaceable Unit (FRU) storage, Sensor Data Record (SDR) storage, customization, and configuration, all through multiple methods of secure access, and all through software.

13 Call to Action

When writing RFQs, make sure to include IPMI as a required feature in order to reduce TCO and increase manageability in HPC deployments.

14 References

A. Alert Standard Format v1.0 Specification, ©2001, Distributed Management Task Force. http://www.dmtf.org

B. The I2C Bus And How To Use It, ©1995, Philips Semiconductors. This document provides the timing and electrical specifications for I2C busses.

C. Intelligent Chassis Management Bus Bridge Specification v1.0, rev. 1.2, ©2000 Intel Corporation. Provides the electrical, transport protocol, and specific command specifications for the ICMB and information on the creation of management controllers that connect to the ICMB. http://developer.intel.com/design/servers/ipmi

D. Intelligent Platform Management Bus Communications Protocol Specification v1.0, ©1998 Intel Corporation, Hewlett-Packard Company, NEC Corporation, and Dell Computer Corporation. This document provides the electrical, transport protocol, and specific command specifications for the IPMB. http://developer.intel.com/design/servers/ipmi

E. Intelligent Platform Management Interface Specification v1.5, rev. 1.1, ©2002 Intel Corporation, Hewlett-Packard Company, NEC Corporation, and Dell Computer Corporation. http://developer.intel.com/design/servers/ipmi

F. IPMI Platform Event Trap Format Specification v1.0, ©1998, Intel Corporation, Hewlett-Packard Company, NEC Corporation, and Dell Computer Corporation. This document specifies a common format for SNMP traps for platform events.

G. Proposal for Callback Control Protocol (CBCP), draft-ietf-pppext-callback-cp-02.txt, N. Gidwani, Microsoft, July 19, 1994. As of this writing, the specification is available via the Microsoft Corporation web site: http://www.microsoft.com

H. Platform Management FRU Information Storage Definition v1.0, ©1999 Intel Corporation, Hewlett-Packard Company, NEC Corporation, and Dell Computer Corporation. Provides the field definitions and format of Field Replaceable Unit (FRU) information. http://developer.intel.com/design/servers/ipmi

I. RFC 1319, The MD2 Message-Digest Algorithm, B. Kaliski, RSA Laboratories, April 1992.

J. RFC 1321, The MD5 Message-Digest Algorithm, R. Rivest, MIT Laboratory for Computer Science and RSA Data Security, Inc., April 1992.


K. System Management BIOS Specification, Version 2.3.1, ©1997, 1999 American Megatrends Inc., Award Software International, Compaq Computer Corporation, Dell Computer Corporation, Hewlett-Packard Company, Intel Corporation, International Business Machines Corporation, Phoenix Technologies Limited, and SystemSoft Corporation.

L. System Management Bus (SMBus) Specification, Version 2.0, ©2000, Duracell Inc., Fujitsu Personal Systems Inc., Intel Corporation, Linear Technology Corporation, Maxim Integrated Products, Mitsubishi Electric Corporation, Moltech Power Systems, PowerSmart Inc., Toshiba Battery Co., Ltd., Unitrode Corporation, USAR Systems.

M. The TeraGrid Project, National Center for Supercomputing Applications, San Diego Supercomputer Center, Argonne National Laboratory, and Pittsburgh Supercomputing Center; http://www.teragrid.org

N. Wired for Management Baseline Version 2.0 Release, ©1998, Intel Corporation. Attachment A, UUIDs and GUIDs, provides information specifying the formatting of the IPMI Device GUID and FRU GUID and the System Management BIOS (SM BIOS) UUID unique IDs.

15 Glossary of Terms

BMC Baseboard Management Controller

CMOS In terms of this specification, this describes the PC-AT compatible region of battery-backed 128 bytes of memory, which normally resides on the baseboard.

Diagnostic Interrupt A non-maskable interrupt or signal for generating diagnostic traces and ‘core dumps’ from the operating system. Typically NMI on IA-32 systems, and an INIT on Itanium®-based systems.

EFI Extensible Firmware Interface. A new model for the interface between operating systems and platform firmware. The interface consists of data tables that contain platform-related information, plus boot and runtime service calls that are available to the operating system and its loader. Together, these provide a standard environment for booting an operating system and running pre-boot applications.

FRB Fault Resilient Booting. A term used to describe system features and algorithms that improve the likelihood of the detection of, and recovery from, processor failures in a multiprocessor system.

FRU Field Replaceable Unit. A module or component which will typically be replaced in its entirety as part of a field service repair operation.

Hard Reset A reset event in the system that initializes all components and invalidates caches.

HPC High Performance Computing, commonly computationally intense.

I2C Inter-Integrated Circuit bus. A multi-master, 2-wire, serial bus used as the basis for the Intelligent Platform Management Bus.

ICMB Intelligent Chassis Management Bus. A serial, differential bus designed for IPMI messaging between host and peripheral chassis. Refer to [ICMB] for more information.

I/O Input / Output. Typically refers to I/O subsystems such as PCI (and variants), memory, and CPU buses.

IPM Intelligent Platform Management.

IPMB Intelligent Platform Management Bus. Name for the architecture, protocol, and implementation of a special bus that interconnects the baseboard and chassis electronics and provides a communications media for system platform management information. The bus is built on I2C and provides a communications path between ‘management controllers’ such as the BMC, FPC, HSC, PBC, and PSC.

NMI Non-maskable Interrupt. The highest priority interrupt in the system, after SMI. This interrupt has traditionally been used to notify the operating system of fatal system hardware error conditions, such as parity errors and unrecoverable bus errors. It is also used as a Diagnostic Interrupt for generating diagnostic traces and ‘core dumps’ from the operating system.


MD2 RSA Data Security, Inc. MD2 Message-Digest Algorithm. An algorithm for forming a 128-bit digital signature for a set of input data.

MD5 RSA Data Security, Inc. MD5 Message-Digest Algorithm. An algorithm for forming a 128-bit digital signature for a set of input data. Improved over earlier algorithms such as MD2.

PEF Platform Event Filtering. The name of the collection of IPMI interfaces in the IPMI v1.5 specification that define how an IPMI Event can be compared against ‘filter table’ entries that, when matched, trigger a selectable action such as a system reset, power off, alert, etc.

PERR Parity Error. A signal on the PCI bus that indicates a parity error on the bus.

PET Platform Event Trap. A specific format of SNMP Trap used for system management alerting. Used for IPMI Alerting as well as alerts using the ASF specification. The trap format is defined in the PET specification. See [PET] and [ASF] for more information.

POST Power On Self Test.

RFQ Request for Quote.

SDR Sensor Data Record. A data record that provides platform management sensor type, locations, event generation, and access information.

SEL System Event Log. A non-volatile storage area and associated interfaces for storing system platform event information for later retrieval.

SERR System Error. A signal on the PCI bus that indicates a ‘fatal’ error on the bus.

Soft Reset A reset event in the system which forces CPUs to execute from the boot address, but does not change the state of any caches or peripheral devices.

TCO Total Cost of Ownership. All of the possible costs involved in the purchase, installation, management, support, and use of the IT infrastructure within an organization throughout a product's life cycle, from acquisition to disposal.


Interactive Kernel Performance
Kernel Performance in Desktop and Real-time Applications

Robert Love
MontaVista Software

[email protected], http://tech9.net/rml

Abstract

The 2.5 development kernel introduced multiple changes intended to improve the interactive performance of Linux. Unfortunately, the term “interactive performance” is rather vague and lacks proper metrics with which to measure it. Instead, we can focus on five key elements:

• fairness

• scheduling latency

• interrupt latency

• process scheduler decisions

• I/O scheduler decisions

In short, these attributes help constitute the feel of Linux on the desktop and the performance of Linux in real-time applications. As Linux rapidly gains market share both on the desktop and in embedded solutions, quick system response is growing in importance. This paper will discuss these attributes and their effect on interactive performance.

Then, this paper will look at the responses to these issues introduced in the 2.5 development kernel:

• O(1) scheduler

• Anticipatory/Deadline I/O Scheduler

• Preemptive Kernel

• Improved Core Kernel Algorithms

Along the way, we will look at the current interactive performance of 2.5.

1 Introduction

Interactive performance is important to a wide selection of computing classes: desktop, multimedia, gaming, embedded, and real-time. These application types benefit from quick system response and deterministic or bounded behavior. They are generally characterized as having explicit timing constraints and are often I/O-bound. The range of applications represented in these classes, however, varies greatly: word processors, video players, Quake, cell phone interfaces, and data acquisition systems are all very different applications. But they all demand specific response times (although to varying degrees) to various stimuli (whether it is the user at the console or data from a device), and all of these applications find good interactive performance important.

But what is interactive performance? How can we say whether a kernel has good interactive performance or not? For real-time applications, it is easy: “do we meet our timing constraints or not?” For multimedia applications, the task is harder but still possible: for example, “does our audio or video skip?” For desktop users, the job is even harder. How does one express the interactive performance of a text editor or mailer? Worse, how does one quantify such performance? All too often, interactive performance is judged by the seat of one's pants. While perceiving good interactive performance is an important part of good interactive performance actually existing, a qualitative experience is not regimented enough for extensive study and risks suffering from the placebo effect.

For the purpose of this paper, we break down the vague term “interactive performance” and define five attributes of a kernel which benefit the aforementioned types of applications: fairness, scheduling latency, interrupt latency, process scheduler decisions, and I/O scheduler decisions.

We then look at four major additions to the 2.5 kernel which improve these qualities: the O(1) scheduler, the deadline and anticipatory I/O schedulers, kernel preemption, and improved kernel algorithms.

2 Interactive Performance

2.1 Fairness

Fairness describes the ability of all tasks not only to make forward progress but to do so relatively evenly. If a given task fails to make any forward progress, we say the task is starved. Starvation is the worst example of a lack of fairness, but any situation in which some tasks make a relatively greater percentage of progress than other tasks lacks fairness.

Fairness is often a hard attribute to justify maintaining because it is often a tradeoff between overall global performance and localized performance. For example, in an effort to provide maximum disk throughput, the 2.4 block I/O scheduler may starve older requests in order to continue processing newer requests at the current disk head position. This minimizes seeks and thus provides maximum overall disk throughput, at the expense of fairness to all requests.

Since the starved task may be interactive or otherwise timing-sensitive, ensuring fairness to all tasks (or at least all tasks of a given importance) is a very important quality of good interactive performance. Improving fairness throughout the kernel is one of the biggest changes made during the 2.5 development kernel.

2.2 Scheduling Latency

Scheduling latency is the delay between a task waking up (becoming runnable) and actually running. Assuming the task is of a sufficiently high priority, this delay should be quite small: an interrupt (or other event) occurs which wakes the task up, the scheduler is invoked and selects the newly woken task, and the task is executed. Poor scheduling latency leads to unmet timing requirements in real-time applications, perceptible lag in application response in desktop applications, and dropped frames or skipped audio in multimedia applications.

Both maximum and average scheduling latency are important, and both need to be minimized for superior interactive performance. Nearly all applications benefit from minimal average scheduling latency, and Linux provides exceptionally good average-case performance. Worst-case performance is a different issue: it is an annoyance to desktop users when, for example, heavy disk I/O or odd VM operations cause their text editor to go catatonic. If the event is rare enough, however, they may overlook it. Both real-time and multimedia applications, however, require a specific bound on worst-case scheduling latencies to ensure functionality.

The preemptive kernel and improved algorithms (reducing lock hold time or reducing the algorithmic upper bound) result in a reduction in both average and worst-case scheduling latency in 2.5.

2.3 Interrupt Latency

Interrupt latency is the delay between a hardware device generating an interrupt and an interrupt handler running and processing the interrupt. High interrupt latency leads to poor system response, as the actions of hardware are not readily perceived by the kernel.

Interrupt latency in Linux is basically a function of interrupt-off time: the time in which the local interrupt system is disabled. This only occurs inside the kernel and only for short periods of time, reflecting critical regions which must execute without risk of an interrupt handler running. Linux has always had comparatively small interrupt latencies; on modern machines, less than 100 microseconds.

Consequently, reducing interrupt latency was not a primary goal of 2.5, although it undoubtedly occurred as lock hold times were reduced and the global interrupt lock (cli()) was finally removed.

2.4 Process Scheduler Decisions

The behavior and decisions made by the process scheduler (the subsystem of the kernel that divides the resource of CPU time among runnable processes) are important to maintaining good interactive performance. This should go without saying: poor decisions can lead to starvation, and poor algorithms can lead to scheduling latency. The process scheduler also enforces static priorities and can issue dynamic priorities based on a rule-set or heuristic.

The process scheduler in 2.5 provides deterministic scheduling latency via an O(1) algorithm and a new interactivity estimator which issues priority bonuses for interactive tasks and priority penalties for CPU-hogging tasks.

2.5 I/O Scheduler Decisions

I/O scheduler (the subsystem of the kernel that divides the resource of disk I/O among block I/O requests) decisions strongly affect fairness. The primary goal of any I/O scheduler is to minimize seeks; this is done by merging and sorting requests. This maximizing of global throughput can directly lead to localized fairness issues. Request starvation, particularly of read requests, can lead to long application delays.

Two new I/O schedulers available for 2.5, the deadline and the anticipatory I/O schedulers, prevent request starvation by attempting to dispatch I/O requests before a configurable deadline has elapsed.

3 Process Scheduler

3.1 Introduction

The process scheduler plays an important role in interactive performance. The Linux scheduler offers three different scheduling policies, one for normal tasks and two for real-time tasks. For normal tasks, each task is assigned a priority by the user (the nice value). Each task is assigned a chunk of the processor's time (a timeslice). Tasks with a higher priority run prior to tasks with a lower priority; tasks at the same priority are round-robined amongst themselves. In this manner, the scheduler prefers tasks with a higher priority but ensures fairness to those tasks at the same priority.

The kernel supports two types of real-time tasks, first-in first-out (FIFO) real-time tasks and round-robin (RR) real-time tasks. Both are assigned a static priority. FIFO tasks run until they voluntarily relinquish the processor. Tasks at a higher priority run prior to tasks at a lower priority; tasks at the same priority are round-robined amongst themselves. RR tasks are assigned a timeslice and run until they exhaust their timeslice. Once all RR tasks of a given priority level exhaust their timeslice, the timeslices are refilled and they continue running. RR tasks at a higher priority run before tasks at a lower priority. Since real-time tasks can be scheduled unfairly, they are expected to have a sane design which properly utilizes the system.
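
For reference, a real-time job requests these policies through the standard POSIX interface. The short example below, which assumes sufficient privilege (root or CAP_SYS_NICE), puts the calling process into SCHED_FIFO at a mid-range priority.

/* Example: switch the calling process to the SCHED_FIFO real-time policy. */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 50 };

    if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
        perror("sched_setscheduler");
        return 1;
    }
    /* From here on this task runs ahead of all normal (SCHED_OTHER)
     * tasks until it blocks or voluntarily yields the processor. */
    return 0;
}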

All scheduling in Linux is done preemptively, except for FIFO tasks, which run until completion. New in 2.5, preemption of tasks can now occur inside the kernel.

For 2.5, the process scheduler was rewritten. The new scheduler, dubbed the O(1) scheduler, features constant-time algorithms, per-processor runqueues, and a new interactivity estimator. The Linux scheduling policy, however, is unchanged.

3.2 Interactivity Estimator

The 2.5 scheduler includes an interactivity estimator [mingo1] which dynamically scales a task's static priority (nice value) based on its interactivity. Interactive tasks receive a priority bonus while tasks which excessively hog the CPU receive a penalty. Tasks in some theoretical neutral position (neither interactive nor hoggish) receive neither a bonus nor a penalty. By default, up to five priority values are added or removed to reflect the degree of the bonus or the penalty; note this corresponds to 25% of the full -20 to 19 nice value range.

Interactivity is estimated using a running sleep average. The idea is that interactive tasks are I/O bound. They spend much of their time waiting for user interaction or some other event to occur. Tasks which spend much of their time sleeping are thus interactive; tasks which spend much of their time running (continually exhausting their timeslice) are CPU hogs. These rules are surprisingly simple; indeed, they are essentially the definitions of I/O-bound and CPU-bound. The fact that the heuristic basically follows the definition lends credibility to the estimator.

The heuristic determines the actual bonus or penalty based on the ratio of the task's actual sleep average against a constant “maximum” sleep average. The closer the task is to the maximum, the more of the five bonus priority levels it receives. Conversely, the closer the task is to the negation of the maximum sleep average, the larger its penalty is.
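
In rough terms the bonus is just that ratio scaled into the ±5 range. The following C fragment is a simplification of the idea, not the kernel's actual code; the constant names and the sleep-average units are invented for illustration.

/* Simplified sketch of the interactivity estimator: sleep_avg runs from
 * -MAX_SLEEP_AVG (pure CPU hog) to +MAX_SLEEP_AVG (highly interactive)
 * and maps linearly onto a -5..+5 priority adjustment.
 * Constant names are illustrative, not the kernel's. */
#define MAX_SLEEP_AVG 10000
#define MAX_BONUS     5

static int interactivity_bonus(int sleep_avg)
{
    return (sleep_avg * MAX_BONUS) / MAX_SLEEP_AVG;
}

/* effective priority = static priority - interactivity_bonus(sleep_avg);
 * since lower numeric values are scheduled first, a positive bonus moves
 * an interactive task ahead of its static position. */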

The results of the interactivity estimator are apparent:

USER  NI  PRI  %CPU  STAT  COMMAND
rml    0   15   0.0  S     vim
rml    0   18   0.4  S     bash
rml    0   25  91.7  R     infloop

First, note the kernel priority values (the PRI column) correspond to a mapping of the default nice value of zero to the value 20. The lowest priority nice value, 19, is priority 39. The highest priority nice value, -20, is priority zero. Thus, lower priority values are higher in priority. A text editor, vim, has received the full negative five priority bonus. Since it initially had a nice value of zero, it now has a priority of 15. Conversely, a program executing an infinite loop received the full positive five priority penalty; it now has a priority of 25. Bash, which is basically interactive but performs computation in scripts, has received a smaller bonus and now has a priority of 18.

Higher priority (that is, lower priority valued) tasks are scheduled prior to lower priority (higher priority valued) tasks. They also receive a larger timeslice. This implies that interactive tasks are usually runnable; they are scheduled first and generally have plenty of timeslice with which to run. This ensures that the text editor is capable of responding to a keypress instantly, even if the system is under load.

The interactivity estimator does not apply to real-time tasks, which occupy a fixed priority in a higher priority range than any normal task. The estimator benefits interactive desktop programs, such as a text editor or mailer.

3.3 Reinsertion of Interactive Tasks

All process schedulers implement a mechanism for recalculating and refilling process timeslices. In the most rudimentary of schedulers, this work occurs when all processes have exhausted their timeslice; the timeslice and priority of each process is then recalculated and reassigned. Scheduling then continues as before, until again all processes exhaust their timeslice and this work repeats.

The O(1) scheduler implements an O(1) algorithm for timeslice recalculation and refilling. Instead of performing a large O(n) recalculation when all processes exhaust their timeslice, the O(1) scheduler implements two arrays of tasks, the active array and the expired array. When a task exhausts its timeslice, it is moved to the expired array and its timeslice is refilled. When the active array is empty, the two arrays are switched (via a simple pointer swap) and the scheduler begins executing tasks out of the new active array. This algorithm guarantees a deterministic and constant-time solution to timeslice recalculation.
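
The recalculation therefore reduces to a pointer swap. The sketch below shows the essential move; the type and field names are invented for illustration rather than taken from the scheduler source.

/* Sketch of the active/expired array swap behind O(1) timeslice refill.
 * Type and field names are illustrative only. */
struct prio_array;

struct runqueue {
    struct prio_array *active;   /* tasks with timeslice remaining */
    struct prio_array *expired;  /* tasks waiting on a new timeslice */
};

static void switch_arrays(struct runqueue *rq)
{
    /* Each task already received a fresh timeslice when it was moved to
     * the expired array, so "recalculating" everything is just this swap. */
    struct prio_array *tmp = rq->active;

    rq->active = rq->expired;
    rq->expired = tmp;
}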

Another benefit of this approach is that it provides a simple prevention of processor starvation of interactive tasks. In the 2.4 scheduler, when a task exhausts its timeslice it does not have a chance to run again until the remaining tasks also exhaust their timeslice and timeslices are globally recalculated. This allows starvation of the task, which might lead to perceptible delays. To prevent this, the 2.5 scheduler will reinsert interactive tasks into the active array when they expire their timeslice. How interactive a task needs to be in order to be reinserted into the active array and not expired depends on the task's priority. To prevent indefinite starvation of non-interactive tasks in the expired array, interactive tasks are only reinserted into the active array so long as the tasks on the expired array have run recently.

3.4 Finegrained timeslice distribution

A final behavioral change in the O(1) scheduler is a more finegrained timeslice distribution and calculation [mingo2]. Currently, this behavior is only present in the 2.5-mm tree but will likely be merged into the mainline 2.5 tree soon.

Normally, tasks are round-robined with other tasks of the same priority (tasks with a higher priority are run earlier and tasks with a lower priority will run later). When they exhaust their timeslice, their priority is recalculated (taking into effect the interactivity estimator) and they are either placed on the expired list or inserted into the back of the queue for their priority level.

This leads to two problems. First, the scheduler may give an interactive task a large timeslice in order to ensure it is always runnable. This is good, but it also results in a long timeslice that may prevent other tasks from running. Second, since priority is recalculated only when a task exhausts its timeslice, a task with a large timeslice may go some time without a priority recalculation. The task's behavior may change in this time, reversing whether or not the task is deemed interactive. Recognizing this, the scheduler was modified to split timeslices into small pieces: by default, 20ms chunks. Tasks do not receive any less timeslice; instead, a task of equal priority may preempt the running task every 20ms. The task is then requeued to the end of the list for its priority and it continues to run round robin with other tasks at its priority level. In addition to this finer distribution of timeslices, the task's priority is recalculated every 20ms as well.

3.5 An O(1) Algorithm

An important property of real-time systems is deterministic behavior. Time-sensitive applications demand consistent behavior that they can understand a priori. Critical algorithms, therefore, need to operate in constant time or at least within predefined bounds.

It is important that scheduling behavior (especially process selection and process wake up) operate deterministically, as time-sensitive applications demand minimal latency from wake up to process selection to actual execution. If a scheduling algorithm is dependent on the total number of processes (or runnable processes), even a high priority task cannot make an assumption about scheduling latency. Worse, with a sufficiently large number of processes, the time required to wake up and schedule a task may be far larger than acceptable (e.g., your MP3 may skip or the nuclear power plant may melt down).

By introducing O(1), or constant-time, algorithms for all scheduler functions, the O(1) scheduler offers not only deterministic but constant scheduler performance. The scheduler can wake up a task, select it to run, and execute it in the same amount of time regardless of whether there are five or five hundred thousand processes on the system.

Moreover, since the O(1) scheduler has exceptionally quick O(1) algorithms, scheduling latency may be reduced for a given n over previous scheduling algorithms. Thus, the 2.5 scheduler offers deterministic, constant, and (perhaps) reduced scheduling latency over previous Linux kernel schedulers.

4 I/O Scheduler

4.1 Introduction

The primary job of any I/O scheduler (sometimes called an elevator) is to merge adjacent requests together (to minimize the number of requests) and to sort incoming requests seek-wise along the disk (to minimize disk head movement). Reducing the number of requests and minimizing disk head movement is critical for overall disk throughput. Disk seeks (moving the disk head from one sector to another) are very slow. If the number and distance of seeks are minimized by reordering requests, disk transfer rates are kept closer to their theoretical maximum.

In the interest of global throughput, however, I/O scheduler decisions can introduce local fairness problems. Sorting requests can lead to the starvation of requests that are not near other requests on the disk. If a heavy writeout is underway, the incoming write requests are inserted near each other in the request queue and dealt with quickly, minimizing seeks. A request to a far-off sector may not receive attention for some time. This request starvation is detrimental to system response as it is unfair. Request starvation is a shortcoming in the 2.4 I/O scheduler.

The general issue of request starvation leads to a more specific case of starvation, writes-starving-reads. Write operations can usually occur whenever the I/O scheduler wishes to commit them, asynchronously with respect to the submitting application or filesystem. Read operations, however, almost always involve a process waiting for the read to complete; that is, read requests are usually synchronous with respect to the submitting application or filesystem. Because system response is largely unaffected by write latency (the time required to commit a write) but is strongly affected by read latency (the time required to commit a read), prioritizing reads over writes will prevent write requests from starving read requests and increase the responsiveness of the system.

Unfortunately, minimizing seeks and preventing unfairness from request starvation are largely conflicting goals. With a proper solution, however, the fairness issues are resolvable without a large drop in global disk throughput.

4.2 Request Starvation

To prevent starvation of requests, a new I/O scheduler, the deadline I/O scheduler, was introduced [axboe1, axboe2]. The deadline I/O scheduler works by assigning requests an expiration time and trying to ensure (although not guaranteeing) that requests are dispatched before they expire.

The 2.4 I/O scheduler [arcangeli] implements a single queue, which is sorted ascendingly by sector. Requests are either merged with adjacent requests or sorted into the proper location in the queue; requests are appended to the tail if they have no proper insertion point. The I/O scheduler then dispatches requests from the head of the queue as the block devices request them.

The deadline I/O scheduler augments this sorted queue with two more queues, a first-in first-out (FIFO) queue of read requests and a FIFO queue of write requests. Each request in the FIFO queues is assigned an expiration time. By default, this is 500 milliseconds for read requests and 5 seconds for write requests. When a request is submitted to the deadline I/O scheduler, it is added to both the sorted queue and the appropriate FIFO queue. In the case of the sorted queue, the request is merged or otherwise inserted sector-wise where it fits. In the case of the FIFO queues, the request is assigned an expiration value and placed at the tail of the queue.

Normally, the deadline I/O scheduler services requests from the sorted queue, to minimize seeks. If a request expires at the head of either FIFO queue (the requests at the head are the oldest), however, the scheduler stops dispatching items from the sorted queue and begins dispatching from the FIFO queues. This behavior ensures that, in general, seeks are minimized and thus global throughput is maximized. Fairness is maintained, however, as the I/O scheduler attempts to dispatch requests within the specified expiration time. The deadline I/O scheduler provides an upper bound on request latency, ensuring fairness, at the expense of a small degradation in overall throughput.
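
The dispatch policy just described fits in a few lines. The C sketch below is illustrative only; the list type and helper functions are invented, and the real block-layer code is considerably more involved.

/* Sketch of the deadline I/O scheduler's dispatch decision: serve the
 * sector-sorted queue unless the oldest read or write has expired.
 * The list type and helpers are invented for illustration. */
struct request;
struct list;

struct deadline_queues {
    struct list *sorted;      /* all requests, ordered by sector */
    struct list *read_fifo;   /* reads, oldest first, 500 ms deadline */
    struct list *write_fifo;  /* writes, oldest first, 5 s deadline */
};

extern int fifo_expired(struct list *fifo);        /* head past its deadline? */
extern struct request *pop_head(struct list *q);   /* dequeue front request */

static struct request *deadline_next(struct deadline_queues *dq)
{
    if (fifo_expired(dq->read_fifo))
        return pop_head(dq->read_fifo);   /* fairness for reads */
    if (fifo_expired(dq->write_fifo))
        return pop_head(dq->write_fifo);  /* fairness for writes */
    return pop_head(dq->sorted);          /* common case: minimize seeks */
}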

4.3 Writes-Starving-Reads

Usually, read operations are synchronous while write operations are asynchronous. Basically, when an application issues a read request, it cannot continue until the operation completes and the application is given the requested value. The completion of write operations, on the other hand, usually has no bearing on the progress of the application. Aside from worrying about power failures, an application is unconcerned as to whether a write commits to disk in one second or five minutes. In fact, most applications are probably unaware if the data is ever committed! Conversely, an application usually needs the results of a read operation and will block until the data is returned. Worse, read requests are often issued en masse and each read is dependent on the previous. The application or filesystem will not submit read request N until read request N-1 completes.

In the 2.4 I/O scheduler, read and write requests are treated equally. The 2.4 I/O scheduler tries to minimize seeks by sorting requests on insert. If a request is issued that is between (seek-wise) two other requests in the queue, it is inserted there. If there is no suitable place to insert the request (perhaps because no other operations are occurring to the same area of the disk), the request is appended to the end of the queue. Consequently, something like

cat * > /dev/null

where there is even only a moderate number of files in the current directory results in hundreds of dependent read requests. If a heavy write is underway, each individual read request will be inserted at the tail of the queue. Assuming the queue can hold a maximum of about one second's worth of requests, each individual read request takes a second to reach the head of the queue. That is, the heavy write operation continually keeps the queue full with write operations to some part of the disk. When the read request is submitted, there is no suitable insertion point so it is appended to the tail of the queue. After a second, the read is finally at the head of the queue, and it is dispatched. This repeats for each and every individual read. Since each read is dependent on the next, the requests are issued serially. Thus the previous cat takes hundreds of seconds to complete in 2.4 when the system is also under write pressure.

Recognizing that the asynchrony and interdependency of read operations highlight their much stronger latency requirements over writes, various patches were introduced [akpm1] to solve the problem. Acknowledging that appending reads to the tail of the queue is detrimental to performance, these modifications insert reads (failing a proper insertion elsewhere) near the head of the queue. This drastically improves application performance, with more than ten-fold improvements, as it prevents writes from starving reads.

The deadline I/O scheduler, the current default I/O scheduler in 2.5, addresses this issue as well. The deadline I/O scheduler provides a separate (generally much smaller) expiration timeout for read requests. Consequently, the I/O scheduler tries to submit read requests within a rather short period, ignoring write requests that may be adjacent to the disk head's current location or that have been waiting longer. This prevents the starvation of reads.

Unfortunately, not all is well. While the deadline I/O scheduler solves the read latency problem, the increased attention to read requests results in a seek storm. For each submitted read request, any pending writes are delayed, the disk seeks to the location of the reads and performs the operation, and then it seeks back and continues with the writes. This results in two seeks for each read request (or group of adjacent read requests) that is issued during write operations.

Compounding the problem, reads are issued in groups of dependent requests, as discussed. Not long after seeking back and continuing the writes, another read request comes in and the whole mess is repeated.

The goal of a research interest in I/O schedulers, anticipatory I/O scheduling [iyer], is to prevent this seek storm. When an application submits a read request, it is handled within the usual expiration period, as usual. After the request is submitted, however, the I/O scheduler does not immediately return to handling any pending write requests. Instead, it does nothing at all for a few milliseconds (the actual value is configurable; it defaults to 6ms). In those few milliseconds, there is a good chance the application will submit another read request. If any read request is issued to adjacent areas of the disk, the I/O scheduler immediately handles it. In this case, the I/O scheduler has prevented another pair of seeks. It is important to note that the few milliseconds spent waiting are well worth the prevention of the seeks; this is the point of anticipatory I/O scheduling. If a request is not issued in time, however, the I/O scheduler times out and returns to processing any write requests. In that case, the anticipatory I/O scheduler loses and we lose a few milliseconds.
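
Stripped of detail, the anticipation step is a bounded wait for a follow-on read. The sketch below captures that control flow; the helper names are invented, and the real scheduler also consults per-process statistics before deciding whether waiting is worthwhile.

/* Sketch of the anticipation step: after completing a read, briefly hold
 * off the write stream in case a nearby dependent read arrives.
 * Helper names are invented for illustration. */
struct request;

#define ANTIC_EXPIRE_MS 6   /* default anticipation window */

extern struct request *wait_for_nearby_read(int timeout_ms); /* NULL on timeout */
extern void dispatch(struct request *rq);
extern void resume_pending_writes(void);

static void anticipate_after_read(void)
{
    struct request *next = wait_for_nearby_read(ANTIC_EXPIRE_MS);

    if (next) {
        /* Won the bet: service the nearby read with no extra seek pair. */
        dispatch(next);
        return;
    }
    /* Lost the bet: give the few milliseconds back to the writes. */
    resume_pending_writes();
}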

The key is properly anticipating the actions of applications and the filesystem. If the I/O scheduler can predict the actions of an application a sufficiently large percentage of the time, it can successfully limit seeks (which are terrible for disk performance) and still provide low read latency and high write throughput. A version of the deadline I/O scheduler, the anticipatory scheduler [piggin], is available in 2.5-mm which supports anticipatory I/O scheduling. The anticipatory I/O scheduler implements per-process statistics to raise the percentage of correct anticipations.

The results are very satisfactory. Under a streaming write, such as

while true; do
    dd if=/dev/zero of=file bs=1M
done

a simple read of a 200MB file completes in 45 seconds on 2.4.20, 40 seconds on 2.5.68-mm2 with the deadline I/O scheduler, and 4.6 seconds on 2.5 with the anticipatory I/O scheduler. In 2.4, the streaming write results in terrible starvation for the read requests. The anticipatory I/O scheduler results in nearly a ten-fold improvement in read throughput.

In 2.4, the effect of a streaming read upon a series of many small individual reads is also devastating. Perform a streaming read via:

while true; do
    cat big-file > /dev/null
done

and measure how long a read of every file in the current kernel tree takes:

find . -type f -exec \
    cat '{}' ';' > /dev/null

2.4.20 required 30 minutes and 28 seconds, 2.5.68-mm2 with the deadline I/O scheduler required 3 minutes and 30 seconds, and 2.5.68-mm2 with the anticipatory I/O scheduler required a mere 15 seconds. That is a 121-times improvement from 2.4 to 2.5.68-mm2 with the anticipatory I/O scheduler.

How much damage does this benefit to read latency do to global throughput, though? It is clear that read throughput is improved, but at what cost to write requests and global throughput? Consider the inverse, under a streaming read such as:

while true; do
    cat file > /dev/null
done

A simple write and sync of a 200MB file takes 7.5 seconds on 2.4.20, 8.9 seconds on 2.5.68-mm2 with the deadline I/O scheduler, and 13.1 seconds on 2.5.68-mm2 with the anticipatory I/O scheduler. The 2.5 I/O schedulers are slower, but not overly so (certainly not to the degree read latency is decreased). This test does not show global throughput, though, just write throughput in the presence of heavy reads. Since the streaming read above may operate much quicker, global throughput is often largely unchanged.

The anticipatory I/O scheduler is currently in the 2.5-mm tree. It is expected that it will be merged into the mainline 2.5 tree before 2.6.

5 Preemptive Kernel

5.1 Introduction

The addition of kernel preemption in 2.5 provides significantly lowered average scheduling latency and a modest reduction in worst-case scheduling latency. More importantly, introducing a preemptive kernel installs the initial framework for further lowering scheduling latency by allowing developers to tackle specific locks as the root of scheduling latency, as opposed to entire kernel call chains. Conveniently, reducing lock hold time is also a goal for large SMP machines.

5.2 Design of a Preemptive Kernel

Evolving an existing non-preemptive kernel into a preemptive kernel is nontrivial; the task is greatly simplified, however, if the kernel is already safe against reentrancy and concurrency. Therefore, in the case of the Linux kernel, the safety provided by existing SMP locking primitives was leveraged to provide similar protection from kernel preemption. SMP spin locks were modified to disable kernel preemption in a nested fashion; after n spin locks, kernel preemption is not re-enabled until the n-th unlock.
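A minimal sketch of this nesting behaviour is shown below. It assumes a simple counter and an invented schedule_if_needed() helper; it is illustrative only, not the actual 2.5 implementation (which keeps the count per task in struct thread_info and folds these operations into the spin lock primitives).

        /* Illustrative sketch of nested preemption disabling, not the 2.5 code. */
        static volatile int preempt_count;      /* zero means preemption is allowed */

        static void schedule_if_needed(void)
        {
                /* Stub: the real kernel would call schedule() here if a
                 * reschedule was requested while preemption was disabled. */
        }

        static void my_preempt_disable(void)    /* taken on every spin_lock()   */
        {
                preempt_count++;
        }

        static void my_preempt_enable(void)     /* taken on every spin_unlock() */
        {
                if (--preempt_count == 0)       /* only the n-th unlock re-enables */
                        schedule_if_needed();   /* run any pending preemption now  */
        }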

The ret_from_intr path (the architecture-dependent assembly which returns control from the interrupt handler to the interrupted code) was then modified to allow preemption even if returning to kernel mode. Thus, a task woken up in an interrupt handler (a common occurrence) can then run at the earliest possible moment, as soon as the interrupt handler returns. Consequently, a high priority task will preempt a lower priority task, even if the lower priority task is executing inside the kernel.

The preemption does not occur on return from interrupt, of course, if the interrupted task holds a lock. In that case, the pending preemption will occur as soon as all locks are released—again at the earliest possible moment.

5.3 Improved Core Kernel Algorithms

Changes to core kernel algorithms (primarily in the VM and VFS) were made to improve fairness, provide a better bound on time complexity (and thus a bound on scheduling latency), and reduce lock hold time to take advantage of kernel preemption and reduce latency.

Some of the most important changes were to fix fairness issues in the VM, in code paths such as the page allocator. These changes prevent VM pressure caused by one process from unfairly affecting VM performance of other processes.

Many small changes were made to functions in known high-latency code paths in the kernel. These changes involved modifying the algorithms to have a minimized or fixed bound on time complexity and to reduce lock hold time so as to allow kernel preemption sooner.

5.4 Reducing Scheduling Latency

Measurements of scheduling latency are highly dependent on both machine and workload (workload being a crucial element—one workload may show no perceptible scheduling latency while another may introduce horrid scheduling latencies). Nonetheless, worst-case scheduling latencies of under 500 microseconds are commonly observed in 2.5.

Even on a 2.4 kernel patched with the preemptive kernel (which undoubtedly does not benefit from some of the algorithmic improvements in 2.5), a recent whitepaper [williams] noted a five-fold improvement in worst-case scheduling latency and a 1.6-times improvement in average-case scheduling latency. A long-term test, part of the same whitepaper, testing the kernel for over 12 hours (to exercise many high scheduling latency paths) showed a reduction from over 200 milliseconds worst-case latency in the period to 1.5 milliseconds with a combination of the preemptive kernel and the low-latency patch [akpm2]. This drastic reduction in worst-case latency over a long period with a complex workload demonstrates the ability of the preemptive kernel and optimal algorithms to provide both excellent average and worst-case scheduling latency.

One useful benchmark is the Audio Latency Benchmark [sbenno], which simulates keeping an audio buffer full under various loads. A test of a 2.4 kernel vs. a 2.4 preemptive kernel shows a reduction in worst-case scheduling latency from 17.6 milliseconds to 1.5 milliseconds [rml]. The same test on 2.5.68 yields a maximum scheduling latency of 0.4 milliseconds.

On a modern machine, scheduling latency is low enough to prevent any perceptible stalls during typical desktop computing and multimedia work.

Further, the 2.5 kernel provides a base sufficient for guaranteeing sub-one-millisecond worst-case latency for demanding embedded and real-time computing needs.

6 Acknowledgments

I would like to thank the OLS program committee for providing the opportunity to write this paper and MontaVista Software for providing the means by which I work on the kernel.

Andrew Morton deserves credit for an abnormally large amount of the interactivity work which went into the 2.5 kernel. Jens Axboe was the primary developer of the deadline scheduler. Nick Piggin was the primary developer of the anticipatory scheduler, which is based on the deadline scheduler. Ingo Molnar was the primary developer of the O(1) scheduler. Various others played significant roles in the design and implementation of other related kernel bits.

References

[akpm1] Andrew Morton, Patch: read-latency2, http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre5/read-latency2.patch.

[akpm2] Andrew Morton, Patch: low-latency, http://www.zipworld.com.au/~akpm/linux/schedlat.html.

[arcangeli] Andrea Arcangeli and Jens Axboe, Source: 2.4 Elevator, linux/drivers/block/elevator.c.

[axboe1] Jens Axboe, Email: [PATCH] deadline io scheduler, http://www.cs.helsinki.fi/linux/linux-kernel/2002-38/0912.html.

[axboe2] Jens Axboe, Source: Deadline I/O Scheduler, linux/drivers/block/deadline-iosched.c.

[iyer] S. Iyer and P. Druschel, Anticipatory scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O, ACM Symposium on Operating System Principles (SOSP), 2001.

[mingo1] Ingo Molnar, Source: O(1) scheduler, linux/kernel/sched.c.

[mingo2] Ingo Molnar, Patch: sched-2.5.64-D3, http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.68/2.5.68-mm2/broken-out/sched-2.5.64-D3.patch.

[piggin] Nick Piggin and Jens Axboe, Source: Anticipatory I/O Scheduler, linux/drivers/block/as-iosched.c.

[rml] Robert Love, Lowering Latency in Linux: Introducing a Preemptible Kernel, Linux Journal (June 2002).

[sbenno] Benno Senoner, Audio Latency Benchmark, http://www.gardena.net/benno/linux/audio/.

[williams] Clark Williams, Linux Scheduler Latency, Whitepaper, Red Hat, Inc., 2002.


Machine Check Recovery for Linux on Itanium® Processors

Tony Luck
Intel Corporation

Software and Solutions Group

[email protected]

Abstract

The Itanium1 processor architecture provides a machine check abort mechanism for reporting and recovering from a variety of hardware errors that may be detected by the processor or chip set. Simple errors such as single bit ECC may be corrected transparently to the operating system by hardware and firmware, but more complex errors where data has been lost require OS intervention. In cases where the OS can reconstruct the lost data, execution can continue transparently to the application layer; otherwise the OS may decide to sacrifice affected user processes to allow the system to continue. This paper describes how Linux can recover from TLB errors without affecting applications, and also how Linux can recover from certain memory errors at the expense of terminating user processes.

1 Introduction

Server systems are not just about increased speed and capacity. They must also provide better reliability than their desktop and mobile cousins. The Intel Itanium architecture includes a machine check architecture that provides the mechanism to detect, contain and in many cases correct processor and platform errors.

1 Itanium is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others.

2 Source of errors

There are several sources of errors within a computer system:

1. Electrical supply line fluctuations

Can be mitigated with a high quality power supply with surge suppression capability.

2. Static electricity

Effects can be lessened by good enclosure design and special static reducing floor covering.

3. Heat

Reduced by good thermal design of the system enclosure and air-conditioning of the computer room. Hardware may also detect excess temperature and automatically switch to a mode where less power is dissipated (e.g. a lower clock speed and/or voltage, or a reduction in the number of available functional units for retiring instructions).

4. Interaction with high energy particles due to radioactive decay

Can be reduced by careful selection of the materials used to build the system (e.g. use of ultra pure dopants consisting of only stable isotopes), and addition of shielding.

5. Interaction with high energy particles from cosmic ray showers in the earth's atmosphere

Reduced by shielding (locate the computer room in the basement) and avoiding high altitude locations for computer systems (the intensity in Denver, altitude 5280 feet, is over four times higher than at sea level locations).

Why is this important? Other aspects of hardware are getting more reliable, but as feature size is reduced devices become more susceptible to particles (as lower energy particles are capable of flipping bits). Software is getting more reliable too: we cannot just blame crashes on the OS. Clusters of computers multiply the error rates. A mean time between failure of several years for a processor isn't too bad a problem if you only have one processor; when you build a cluster of several thousand processors, you have a big problem.

For each of the error sources listed above I have suggested methods by which the error rate may be reduced, but it may be impractical or too expensive to reduce all of these errors to insignificant levels, hence computer systems must be designed to detect, isolate and recover from errors when they do occur.

3 Itanium Machine Check Architecture

The Itanium machine check architecture provides a framework in which diverse types of errors can be handled in a logical and consistent way.

3.1 Error severity

Errors are divided into three categories:

1. Corrected errors are those that are repaired by hardware or by firmware. In either case an interrupt (CMCI2 for processor errors, CPEI3 for platform errors) may be raised for logging purposes.

Examples of this type of error are a correctable single-bit ECC error in the processor cache or a correctable single-bit ECC error on the system bus.

2. Recoverable errors involve some loss of state. They require operating system intervention to determine whether it is possible for the system to continue operation.

An example of a recoverable error is one where incorrect data is about to be passed to a processor register (e.g. from a load from memory with a multi-bit ECC error).

3. Fatal errors cannot be corrected. A system reboot is required.

On this kind of error, the processor generates a signal that is broadcast on the system bus (called BINIT#) that causes the processor to discard all in-flight transactions to prevent error propagation. The error is fatal because there is no way to recover the state of the discarded bus transactions, hence the need for a system reboot.

An example of a fatal error is a processor time-out (when the processor has not retired any instructions after a certain time period).

2 Corrected Machine Check Interrupt
3 Corrected Platform Error Interrupt


[Figure: Error severity. In order of increasing severity: error corrected by hardware, process continues (Corrected: CMCI or CPEI); error corrected by firmware, process continues (Corrected: local MCA, CMCI or CPEI); error correction dependent on OS analysis and capabilities, error signalled to the local processor only (Recoverable: local MCA); error correction dependent on OS analysis and capabilities, error signalled to all processors (Recoverable: multiple MCA); OS is not stable, reboot is required, error signalled to all processors (Fatal MCA).]

3.2 Control flow

This diagram shows how the hardware, processor, processor abstraction layer, system abstraction layer and operating system interact to handle machine checks:

[Figure: Machine check control flow. MCA hardware events flow up from the platform and processor hardware through the Processor Abstraction Layer (PAL) and the System Abstraction Layer (SAL) to the Operating System Software, with an MCA handoff at each boundary; the operating system can call back down the stack via the SAL MCA and PAL MCA procedure calls.]

At the hardware level, hardware redundancy (parity and ECC) is used to detect and possibly correct errors during program execution. In some correctable cases the hardware may simply fix the error on the fly and raise a corrected error interrupt to allow logging of the event. In other cases the processor will save all machine state and pass control to the PAL (Processor Abstraction Layer) code at the PAL_MCA entry point for analysis and further processing by firmware. Different members of the Itanium processor family may make different choices about whether to fix errors in hardware or pass responsibility up to firmware.

Entry to the PAL is made in physical mode with caches disabled. This provides the option for the PAL to handle some types of errors detected in the cache, which is useful since on-chip caches make up so much of the die area of modern processors that they are statistically one of the most likely sources of errors from the processor itself (memory errors from a large array of DIMMs are probably the most likely system-wide source of errors). If the error is corrected at this point, then execution can be resumed, but again an interrupt is raised to allow the error to be logged.

The next layer in the firmware stack is the SAL (System Abstraction Layer), which is responsible for components outside of the processor itself (e.g. chip set, memory, I/O bus bridges etc.). The processor is still executing in physical mode (MMU disabled); the caches may have been enabled by this point (depending on whether the SAL machine check entry point is located in cacheable or uncacheable memory). SAL code examines the details of the error and determines whether it can fix it without causing loss or corruption of any data. As above, if the error is fixed, then execution resumes with a pended interrupt to log the details.

At the highest level is the operating system. Errors that have not been corrected at lower levels, but which leave the processor in an internally consistent state, are still recoverable. These can be passed to the operating system if it is interested in trying to recover. An operating system indicates it is capable and willing to handle machine check errors by registering an entry point with the SAL using the SAL_SET_VECTORS call. Note that the operating system entry point must be a physical address because it is possible that the error to be handled is in the memory management H/W, hence the MMU must still be disabled when the SAL transfers control to the operating system entry point. The physical entry point means that the code to service a machine check abort must either be position independent, or at least very aware of relocation issues, since it will not be executing at the kernel's linked address. Also, the SAL_SET_VECTORS call allows the operating system to provide the length and a simple byte checksum of the code so that the SAL may validate that the routine has not been corrupted.
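As a small illustration, the byte checksum mentioned above could be computed along the following lines. This is a sketch only; the modulo-256 convention shown is an assumption made for illustration, and the exact rule is defined by the SAL specification.

        /* Sketch: compute a simple byte checksum over the OS machine check
         * handler, to be supplied to SAL_SET_VECTORS together with the
         * handler's physical address and length. */
        static unsigned char mca_handler_checksum(const unsigned char *code,
                                                  unsigned long len)
        {
                unsigned char sum = 0;

                while (len--)
                        sum += *code++;         /* wraps modulo 256 */
                return sum;
        }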

4 Reporting corrected errors

As mentioned above, errors that are corrected by hardware, PAL, or SAL may be reported to the operating system by means of a CMCI (Corrected Machine Check Interrupt) for errors inside the processor, or a CPEI (Corrected Platform Error Interrupt) for system errors outside of the processor. The operating system may choose to disable these interrupts and periodically poll the SAL to see if any errors have been corrected. It might do this to avoid being swamped by corrected error interrupts (e.g. a stuck data line causing hard single-bit memory errors across a wide range of physical addresses; we would like to ensure that the operating system can continue to make forward progress).

Whether the operating system takes interrupts or polls, once a corrected error record is found the OS can retrieve the whole record, parse it to find any useful information, and then output details to its own log. As a final step in error reporting the operating system must request the SAL to clear the error record from non-volatile memory (to ensure that space is available for future errors to be logged).

User level tools could be written to analyze the operating system logs to check for patterns that may be predictive of future hard errors in components that generate a high level of soft errors.

5 Poisoned memory

Another feature of the machine check architecture is the concept of "poisoned memory." This allows a platform the option of deferring error processing in some circumstances. Suppose a modified cache line is being written back to memory when an uncorrectable error is detected in the data contained in the cache line. The current execution state of the processor probably has no connection with this data, so signaling a machine check abort at this time may be an over-reaction to the situation. Instead, the data can be written to memory together with an indication that it is corrupt (the "poison" flag, typically indicated with bad ECC bits). A regular CMC interrupt is then raised, and the OS is allowed to examine the situation later. Deferring the error in this case may be useful, because it is not certain that the corrupted data will ever be needed (e.g., the page to which the cache line belongs may already have been freed by the operating system, or the word that was corrupted in the cache line may have only been present in the cache because of false sharing).

Note that in the "read" case data poisoning does not apply; the processor will immediately begin MCA processing.

If the operating system writer is concerned that the poisoned data may be consumed before the interrupt is processed, there is an option to promote CMCI to MCA to allow immediate action to be taken (though this option applies to all CMCI, not just to those caused by poison data).

6 Operating system examples

Here are some example cases of recoverable errors where the operating system can intervene to recover from an error that has been detected and reported using the above mechanism.

6.1 TLB translation register error

The translation lookaside buffer in the Itanium processor is not only divided into separate structures for instruction and data access, it is also conceptually divided into two types of entries.

1. Translation cache entries can freely be replaced as new mappings are added.

2. Translation register entries are locked into the TLB. These are used to "pin" translations for critical regions to ensure that a TLB miss will never occur for a virtual address mapped by a locked entry.

Errors in the translation cache can be trivially handled by H/W or PAL by simply discarding corrupted entries from the cache. This will only affect performance; if the discarded entry is needed, the processor will simply reload it using the normal TLB miss execution path. However, if an error is detected in one of the translation registers, then the fault cannot be handled by F/W (PAL or SAL), since the entry cannot just be dropped and the firmware does not have enough information to reconstruct the damaged entry. So the error is propagated up to the OS, which needs to reload the errant register. The ia64 Linux implementation uses the following TR registers:

ITR(0) maps one large kernel page4 as the kernel text (code)

DTR(0) maps one large kernel page as the kernel data

ITR(1) maps one granule5 for PAL

DTR(1) maps one page6 for the per-cpu area

DTR(2) maps one granule for the kernel stack

The first four of these are loaded during kernel initialization and are never changed, so it is a simple matter to add code to save the correct values for these registers in memory, so that the MCA handler can reload them when needed. The last, mapping the kernel stack, can potentially be reloaded on every context switch (actual reloads occur when the task structure for the new process is in a different large kernel page). The Linux kernel uses one of the supervisor mode registers (ar.k4) to keep track of which large kernel page is currently mapped, and this register can be used to reconstruct the DTR(2) value during the MCA handler.

Although the SAL error record generated for an error in a translation register provides information on which register(s) have errors, it would require a large amount of code to retrieve and parse the error record to determine exactly which registers need to be reloaded. All this code would have to run in physical mode (remember that SAL passes control to the operating system MCA entry point in physical mode, and it is not possible to transition to virtual mode for this particular error, since we know that one or more of the translation registers are corrupt). The simple solution to this issue is to have the MCA TLB recovery code purge and reload all of the TR registers in the physical mode code. Then we can safely transition to virtual mode to retrieve and examine the record from the SAL error log to report the actual register(s) that were affected by the error.

4 The kernel is always mapped with a single 64MB page
5 Another type of large kernel page, configurable as either 16MB or 64MB
6 PAGE_SIZE on 2.4, 64k on 2.5

6.2 Multi bit ECC error in memory

In this case some data has been irretrievably lost, so the operating system cannot escape from this situation unscathed. The basic strategy for the OS is to identify the address of the memory that reported the error. If the memory is owned by the kernel, this will currently be reported as a fatal error by the Linux MCA handler and the OS will reboot (to be strictly accurate, the Linux MCA handler will return to SAL with a request to reboot, and the error will be reported by Linux after the reboot when the SAL error record is retrieved from NVRAM). If the memory is allocated to one or more user processes, the processes can be sacrificed to allow the system to continue running (just as the OOM killer will terminate processes when Linux runs out of memory). Life is rarely that simple. In this case a complication is that machine checks are not reported synchronously to the instructions that trigger them; they may be deferred for a long time (in the worst case until the values are consumed). The processor precisely identifies the location at which the fault is detected, but does not provide information about the point at which the fault occurred. The processor does not automatically7 raise machine checks across privilege transitions from user to kernel mode and vice versa, so it is possible that an error caused in one privilege state will be reported in the other state. Easy cases:

a) fault triggered in user mode is reported in user mode—kill process

b) fault triggered in kernel mode is reported in kernel mode—system reboot

Harder cases:

c) fault triggered in kernel mode is reported in user mode—this is a very subtle case. At first sight it appears that our error handler cannot do the right thing. This case is indistinguishable from case 'a' above, since we only have precise information about where the error was detected. But if the error occurred in some kernel data, just killing the process is not the correct action. It would leave the kernel running with a corrupted data structure! However, we are saved by the fact that the error is detected when the user process tries to consume the data, and we know that only a buggy kernel would leak details of a kernel data structure to user mode. So we can assume that any such error that happened in kernel mode and was detected in user mode must have occurred when the kernel was restoring registers that belonged to the user process. Thus killing the process is sufficient to contain the error.

d) fault triggered in user mode is reported in kernel mode—sadly this will cause a reboot because it isn't possible to distinguish this from case 'b' above (though it might be possible to eliminate many of these cases by a special case for faults reported during the code that saves user registers).

7 There is a PAL call PAL_MC_DRAIN to do this, but it would be a major performance issue to use it on every transition.

6.3 PCI errors

The Itanium processor family machine check architecture provides a framework for reporting platform errors, such as PCI bus errors. From an OS perspective, these may be far more complex to handle. Issues are:

1. We would like a framework that is minimally invasive to the existing driver model.

2. Linux is just starting to get support for hot-add and hot-remove of devices, but it is a big step from there to support surprise removal of devices when errors occur.

3. Even in the case of transient errors, recovery may be complicated by the firmware on the card, which typically has been written with an expectation that the system will reboot after an error.

7 Acknowledgments

Thanks to all the people at Intel who spent time reviewing this paper and providing invaluable feedback.



Low-level Optimizations in the PowerPC Linux Kernels

Paul Mackerras
IBM Linux Technology Center OzLabs

[email protected]

Abstract

We examine three low-level optimizations in the Linux® kernel for 32-bit and 64-bit PowerPC®, relating to cache flushing, memory copying, and PTE (page table entry) management. Benchmarking and profiling were used to identify areas where optimizations could be performed and to identify whether the optimizations actually improved performance. The cache flushing and memory copying optimizations improved performance significantly, whilst the PTE management optimization did not.

1 Introduction

The optimizations presented in this paper represent some of the results of the continuing effort to make the Linux kernel run better and faster on PowerPC processors, both 32-bit and 64-bit. The optimizations here are low-level optimizations that are specific to PowerPC processors, aimed at decreasing the overhead of some of the fundamental operations relating to maintaining the consistency of the instruction cache with memory, copying memory, and managing page table entries.

Subsequent sections present measurements of performance using benchmarking and kernel profiling. Benchmarking is the process of measuring performance by running a specific program or set of programs to measure how quickly certain operations are performed. Two different benchmarks of different styles are used here:

• LMBench™ is a micro-benchmark suite originally written by Larry McVoy. It measures the speed of a broad range of individual kernel operations such as forking processes, reading and writing data to/from disk, transferring data over a socket, etc. See http://www.bitmover.com/lmbench/ for details.

• For an application-level benchmark to measure the overall speed of a process that involves a range of kernel activities, we use the process of compiling the Linux kernel, and measure the time taken using the time(1) command. We used the same source tree (from Linux version 2.5.25) and configuration for all tests so that the results are comparable with each other. A kernel compilation tends to exercise a range of kernel functions including forking processes, starting new processes, reading and writing files, mapping in pages of memory on demand, and so on.

Profiling measures the time spent in individual kernel procedures while the kernel is performing some tasks. The form of profiling used in this paper is one where a periodic interrupt is used to obtain a statistical sample of the total time spent executing each instruction in the kernel. When the periodic interrupt is taken, the handler examines the instruction pointer where the interrupt occurred, uses that to index into an array, and increments that array element. Over time this builds up a histogram of where the kernel is spending its time. A post-processing tool converts that histogram into a total count for each procedure in the kernel.
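The sampling step can be illustrated with a short sketch. The names and text-region bounds below are invented for the example; this is not the kernel's actual profiling code.

        /* Illustrative sketch of sample collection in the periodic interrupt. */
        #define TEXT_START  0xc0000000UL                 /* assumed start of kernel text */
        #define TEXT_END    0xc0100000UL                 /* assumed end of kernel text   */
        #define PROF_SHIFT  2                            /* one counter per 4-byte instruction */

        static unsigned int prof_buf[(TEXT_END - TEXT_START) >> PROF_SHIFT];

        void profile_tick(unsigned long interrupted_ip)  /* called from the timer interrupt */
        {
                if (interrupted_ip < TEXT_START || interrupted_ip >= TEXT_END)
                        return;                          /* only sample kernel text */
                prof_buf[(interrupted_ip - TEXT_START) >> PROF_SHIFT]++;
        }

A post-processing step then maps each counter back to the enclosing procedure using the kernel's symbol table.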

This can be a very powerful tool for analysing kernel performance provided that its limitations are kept in mind. First, because the data is a statistical sample, it can be quite noisy. Secondly, it does not measure the execution time for code that runs with interrupts disabled (unless some kind of non-maskable interrupt can be used). Instead, time spent with interrupts disabled tends to get attributed to the point where interrupts are re-enabled.

Three machines were used for the measurements reported here:

1. An Apple® PowerBook® G3 laptop with a 400MHz PowerPC 750™ processor, separate 32kB level 1 data and instruction caches, a unified 1MB level 2 cache, and 192MB of RAM. This is a 32-bit machine.

2. An IBM® pSeries™ model 650 computer with eight 1.45GHz IBM POWER4+™ processors, 32kB level 1 data cache and 64kB level 1 instruction cache per processor, 1.5MB level 2 cache per 2 processors, and 8GB of RAM. This is a 64-bit machine.

3. An IBM "Walnut" embedded evaluation board with a 200MHz IBM PowerPC 405GP processor, 8kB level 1 data cache, 16kB level 1 instruction cache and 128MB of RAM. This is a 32-bit machine.

2 Cache flushing optimizations

In the PowerPC architecture, the instruction cache is not required to snoop changes to the contents of memory, either by stores from this or another CPU, or by DMA from an I/O device. Instead, software is required to maintain the coherency of the instruction cache, using the dcbst (data cache block store) and icbi (instruction cache block invalidate) instructions. These instructions can be executed at user level, and self-modifying code is required to use them after it has written instructions to memory, before those instructions are executed.
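For a single cache block, the flush sequence looks roughly like the following. This is an illustrative inline-assembly sketch rather than the kernel's hand-written routine, and the exact barrier placement and cache block size vary between processors.

        /* Illustrative sketch: make one cache block's worth of newly written
         * instructions visible to the instruction fetcher. */
        static inline void flush_icache_block(void *addr)
        {
                __asm__ __volatile__(
                        "dcbst 0,%0\n\t"        /* push the modified data block to memory */
                        "sync\n\t"              /* wait for the store to complete          */
                        "icbi 0,%0\n\t"         /* invalidate any stale icache copy        */
                        "isync"                 /* discard already-fetched instructions    */
                        : : "r" (addr) : "memory");
        }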

The kernel uses these instructions extensively to make sure that pages that are mapped into a user process's address space can be executed safely. User code assumes that the instruction cache is consistent with memory for pages that are supplied by the kernel on demand. Thus it is the kernel's responsibility to perform the cache flushing instructions on a page of memory before mapping it into a process's address space if there is a possibility that the instruction cache is incoherent with memory for that page. If this is not done properly, the symptom is usually that the process will get a segmentation violation or illegal instruction exception, since it is not executing the instructions that it should.

Note that almost all PowerPC implementations have caches that are effectively physically addressed—usually virtually indexed, physically tagged, set associative, with the set size no greater than the page size, so no aliasing occurs. The IBM POWER4™ processor has a virtually indexed direct-mapped instruction cache, but obviates the potential problems that this could cause by having the icbi instruction clear all 16 cache blocks where a given block of memory could be cached. There are also some embedded PowerPC implementations that have virtually indexed and tagged instruction caches; these require quite different cache management and are not considered in this paper.

2.1 Initial implementation

The Linux generic virtual memory (VM) system provides a number of hooks that architecture code can define in order to do architecture-specific cache and TLB management where necessary. One of these is called flush_page_to_ram, and it is called at several points in the VM code when pages are mapped into a user process's address space (e.g., do_no_page, do_anonymous_page etc.). In older kernels the flush_page_to_ram hook was used on PowerPC to perform the cache flush on the page in order to ensure that the instruction cache was consistent for the page. This reliably ensured that the instruction cache did not contain stale data for the pages that processes see, but had a considerable performance penalty. The "Original" column of Table 1 shows the results of a kernel profile on a kernel which uses flush_page_to_ram to ensure instruction cache coherency. These results were obtained on the 400MHz G3 PowerBook machine compiling a kernel (3 times over). The kernel spends more time in flush_dcache_icache, which performs the flushing function of flush_page_to_ram, than any other kernel procedure. Clearly flush_page_to_ram is a good candidate for optimization.

2.2 Optimized implementation

A large part of the reason why the kernel is spending so much time in flush_dcache_icache is that it is doing unnecessary flushes. If the same program is executed many times in different processes, the kernel will call flush_page_to_ram on each page of the program executable each time it is mapped into a process's address space. However, once the flush has been done, the instruction cache is consistent for that page (provided that the page is not modified), and the flush doesn't need to be done when the page is subsequently mapped into other processes' address spaces.

Procedure                 Original   Optimized
flush_dcache_icache           6763        2974
ppc6xx_idle                   2238        2468
do_page_fault                  857         667
copy_page                      537         390
clear_page                     523         509
copy_tofrom_user               356         299
do_no_page                     231         129
add_hash_page                  220          92
flush_hash_page                195         191
do_anonymous_page              194         224

Table 1: Kernel profiles before and after cache-flush optimization

A solution to this problem was suggested by David Miller. He suggested using a bit, called PG_arch_1, in the flags field of the page_struct structure for each page, to indicate whether the instruction cache is consistent with memory for the page. This bit is cleared when the page is allocated. We use this to indicate that the instruction cache may be inconsistent. When the flush is done on the page, we set the bit, indicating that the instruction cache is now consistent for the page. Subsequently, if the bit is already set, the flush does not need to be done.

David Miller also requested that we use the flush_dcache_page and update_mmu_cache hooks rather than flush_page_to_ram, since more information is provided to the architecture code in the calls to flush_dcache_page and update_mmu_cache, and flush_page_to_ram is deprecated. The VM system calls flush_dcache_page when a page which may be mapped into a process's address space is modified by the kernel. update_mmu_cache is called when a page is mapped into a process's address space. In our implementation, flush_dcache_page clears the PG_arch_1 bit, and update_mmu_cache does the flush (by calling flush_dcache_icache) if the PG_arch_1 bit is clear, and then sets it.
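In outline, the two hooks behave as follows. This is a simplified sketch of the approach just described, not the exact arch/ppc source.

        /* Simplified sketch of the PG_arch_1-based flush avoidance. */
        void flush_dcache_page(struct page *page)
        {
                /* The kernel modified the page: its icache image may be stale. */
                clear_bit(PG_arch_1, &page->flags);
        }

        void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t pte)
        {
                struct page *page = pte_page(pte);

                if (!test_bit(PG_arch_1, &page->flags)) {
                        flush_dcache_icache(page);      /* flush once per page */
                        set_bit(PG_arch_1, &page->flags);
                }
        }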

The results are shown in the "Optimized" column in Table 1. Clearly the time spent flushing the cache has decreased dramatically, although it is still significant. The system time for the compilation decreased from 46.0 seconds to 29.9 seconds. The user time was not significantly different (301.9 seconds vs. 300.2 seconds). The overall speedup was 5.1%. (Note that kernel profile measurements are quite noisy, and the other differences between the columns in Table 1 are not necessarily significant.)

Table 2 shows an excerpt from the before-and-after LMBench results. (See ftp://ftp.samba.org/pub/paulus/ols2003/lmb-argo-flush for the full summary of results.) The optimization has produced worthwhile improvements in the fork, exec and shell process items. The first line shows the unoptimized results, and the second line shows the optimized results.

2.3 Further optimizations

The implementation in the previous section aimed at minimizing the number of flushes while still making sure that the instruction cache was consistent with memory for each page mapped into the process's address space. This includes anonymous pages and pages that are copied as a result of a write to a copy-on-write page, as well as page-cache pages. (A copy-on-write page is one which is mapped with a private writable mapping, including anonymous pages which are shared after a fork.)

Part of the reason that it is necessary to ensure consistency of the instruction cache is that the PowerPC architecture, as originally defined, doesn't provide any way to prevent a process from executing code from a readable page. That is, there is no execute permission bit in the page table entries (PTEs). If there were a way to trap attempts to execute from a page, it would be possible to defer the flush until a process first executed instructions from the page. That way, it would be possible to avoid the flush altogether on anonymous pages which are only used for data, not for code.

Embedded PowerPC implementations, such as the IBM PPC405, don't follow the original PowerPC memory management unit (MMU) architecture, but instead have a software-loaded TLB with a unique PTE format. The PTE format for the PPC405 includes an execute-permission bit. Also, the POWER4 processor uses one of the previously-unused bits in the PTEs as a no-execute bit.

We implemented an optimization on the PPC405 where the pages are not flushed in update_mmu_cache. Instead, if the PG_arch_1 bit is clear, we clear the execute-permission bit in the PTE mapping the page. There is an added check in do_page_fault for an attempt to execute from a page with the execute-permission bit clear. In that case, we do the flush on the page and then set the execute-permission bit.

The kernel profiles shown in Table 3 show that the number of counts recorded in flush_dcache_icache while compiling the test kernel twice dropped from 1685 to 31. Thus the time spent doing cache flushes for instruction cache consistency has become negligible. The system time for a kernel compile was reduced from 147.7 seconds to 139.0 seconds, a decrease of 5.9%. The user time was not significantly different: 1460.6 seconds for the optimized kernel vs. 1457.9 seconds for the unoptimized kernel. The overall time for the kernel compilation was reduced by 0.37%. The reduction is less than might have been expected because a significant amount of the system time was spent in the kernel floating-point emulation routines (the PPC405 does not implement floating point instructions in hardware).

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host   OS            Mhz  null  null        open  selct  sig   sig   fork  exec  sh
                          call  I/O   stat  clos  TCP    inst  hndl  proc  proc  proc
------ ------------- ---- ----- ----- ----- ----- ------ ----- ----- ----- ----- -----
argo   Linux 2.5.66   400  0.35  0.76  3.90  5.34   39.1  1.67  6.64  795.  5065  23.K
argo   Linux 2.5.66   400  0.35  0.76  3.88  5.33   40.3  1.67  6.13  659.  2254  11.K

Table 2: LMBench results before and after cache-flush optimization

Procedure                Orig.   Optim.
ProgramCheckException     4766     4685
do_mathemu                4475     4526
ide_intr                  1745     1729
flush_dcache_icache       1685       31
record_exception          1369     1448
copy_tofrom_user          1357     1325
do_page_fault             1027     1035
ret_from_except_full       774      775
fmul                       731      721
fsub                       658      629

Table 3: Kernel profiles before and after execute-permission optimization

3 Memory copying

Copying memory is a fundamental operation in the kernel, used for:

• Copying data to or from a user process (e.g., for a read or write system call)

• Copying pages of memory, in particular for write faults on copy-on-write pages

• Copying other data structures within the kernel.

Separate procedures are used for these three operations: copy_tofrom_user, copy_page and memcpy respectively. The profiles in Table 1 show that copy_tofrom_user and copy_page are among the top ten most time-consuming operations in the kernel. These routines are thus candidates for optimization.

However, these routines are already well optimized in the 32-bit PowerPC kernel for most 32-bit PowerPC implementations. In the 64-bit kernel, it becomes possible to use 64-bit loads and stores to move more data per instruction. This is easy in the copy_page case, since the source and destination addresses are page-aligned. However, in copy_tofrom_user and memcpy, where the source and destination are not necessarily 8-byte aligned, it becomes more complicated.

PowerPC processors generally handle most 2-byte and 4-byte unaligned loads and stores in hardware, without generating an exception. Older processors would generate an exception if the access crossed a page boundary, whereas most newer processors handle even that case in hardware. However, 64-bit PowerPC processors typically generate an exception on an 8-byte load or store if the address is not 4-byte aligned. The kernel has an alignment exception handler that emulates the load or store and allows the program to continue.


When copying memory where the source and/or destination addresses are misaligned, we generally copy a small number of bytes, one at a time, in order to get to an aligned destination address. If the source address is then misaligned (that is, the bottom 2 bits of the address are non-zero), there are two alternative strategies for handling the misalignment:

1. Use 32-bit or 64-bit loads and stores, ignoring the misalignment. In this case we will have misaligned load addresses and aligned store addresses.

2. Do loads with aligned addresses and use shift and OR instructions to shuffle the bytes into the correct positions to be stored to an aligned address.

For current PowerPC implementations, it turns out that while misaligned 32-bit loads are slower than aligned 32-bit loads, they are still faster than aligned 32-bit loads plus the extra instructions needed to shuffle the bytes into position. For this reason, copy_tofrom_user and memcpy in the 32-bit kernel use unaligned loads in a relatively simple loop. However, the situation is different for 64-bit loads. Since every misaligned 64-bit load will cause an exception, it is much faster to do the aligned loads and shuffle the bytes.
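The second strategy can be illustrated in C for the 32-bit, big-endian case. This is a simplified sketch, not the kernel's assembly; it assumes the destination is word-aligned, the length is a multiple of 4, and the aligned word containing the first source byte may safely be read in full (a real implementation handles the first and last partial words separately).

        #include <stdint.h>
        #include <stddef.h>

        /* Simplified big-endian sketch of the "aligned loads plus shift/OR" copy. */
        static void copy_shift_or(uint32_t *dst, const uint8_t *src,
                                  size_t n, unsigned off)    /* off = src & 3, in 1..3 */
        {
                const uint32_t *s = (const uint32_t *)(src - off);  /* aligned down */
                unsigned lsh = 8 * off;          /* bits of the first word to skip  */
                unsigned rsh = 32 - lsh;         /* bits taken from the next word   */
                uint32_t prev = *s++;

                while (n >= 4) {
                        uint32_t next = *s++;
                        *dst++ = (prev << lsh) | (next >> rsh);  /* aligned store */
                        prev = next;
                        n -= 4;
                }
        }

Every load and store in the loop is aligned, at the cost of the shift and OR instructions that the text describes.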

In fact, the behaviour of the processor on unaligned loads and stores is only one of many architectural and implementation characteristics that affect how an optimum memory copying routine should be written. Some of the others are:

• The number of levels in the storage hierarchy and the latency to each level;

• Presence or absence of automatic hardware prefetch mechanisms;

• Presence or absence of instructions to provide cache prefetch hints to the processor;

• Load-use penalty, that is, how many other instructions should be placed between a load and the store (or other operation) which uses the data from the load, so that the processor does not need to stall the store until the data from the load is available;

• Ability of the processor to issue instructions out of order, so that later instructions are not blocked by earlier instructions which do not have all their operands available;

• The penalty incurred for conditional branches (if this is large then there is an advantage to unrolling loops);

• Extended instruction sets such as Altivec on PowerPC, MMX/SSE on x86, or the VIS instructions on SPARC64, which provide the ability to operate on larger units of data (typically 128 bits).

Given how many factors can affect memory copying performance, it is not surprising that memory copy routines can become quite large and complicated, reaching tens of thousands of lines of assembly code on some architectures. Other factors that affect the performance of a memory copy routine include the size of the region to be copied, and whether the source and/or destination regions are already present in the processor's caches. Some optimizations, such as loop unrolling, might improve performance dramatically for larger copies (i.e., several cache lines or larger) but hurt performance for small copies by increasing the setup costs. Similarly, some optimizations, such as using extended instruction sets, might improve performance dramatically when all the source data is present in the level-1 data cache, but have no effect or actually reduce performance when the data has to be brought in from main memory.

Thus it is interesting to know whether the kernel routinely does large copies, and whether they are misaligned or not. To test this, we added histogramming functions to copy_tofrom_user and memcpy in a 64-bit kernel. The results can be summarized as follows:

• 98% of calls to memcpy were for less than 128 bytes (one cacheline).

• 13% of calls to memcpy were not 8-byte aligned.

• 84% of calls to copy_tofrom_user were for less than 128 bytes, and 95% were for less than 512 bytes. Of the remainder, most were page-sized (4096 bytes) and page-aligned.

• 43% of calls to copy_tofrom_user were not 8-byte aligned.

These results indicate that it is important to optimize for the small-copy case, particularly for memcpy but also for copy_tofrom_user, and that performance on unaligned copies is important for copy_tofrom_user. The one large-copy case which is worth optimizing for is the case of copying a whole page, both in copy_page and also to a lesser extent in copy_tofrom_user.

3.1 Optimized POWER4 memory copy

On POWER4 the factors that need to be taken into account include the following:

• POWER4 aggressively executes instructions out of order and uses register renaming to avoid false dependencies between instructions.

• POWER4 includes automatic prefetch hardware which detects sequential memory accesses and prefetches cache lines which are likely to be needed in the near future.

• Correctly predicted conditional branches incur a one cycle penalty.

• The level-1 data cache on POWER4 is a write-through cache, thus all stores go through to the level-2 cache. The L2 cache is organized as three interleaved banks. Each bank has its own store queue.

The effect of the first three points is that while some degree of loop unrolling is beneficial, it is not necessary to aggressively unroll the main copy loop or to make sure that many instructions intervene between a load and the instruction that uses the result.

Because of the interleaved nature of the L2 cache, the optimum pattern of stores is one where stores go successively to each bank of the level 2 cache. In fact the best performance for large copies is obtained with a loop that works on six cachelines at a time, so that the loads and stores are interleaved across six cachelines.

Based on these considerations, the author developed optimized copy_tofrom_user, copy_page and memcpy implementations for POWER4, with the following characteristics:

• copy_tofrom_user detects page-sized, page-aligned copies and calls a routine similar to copy_page for them. For other copies, it proceeds with an algorithm similar to memcpy below. (The main difference between copy_tofrom_user and copy_page or memcpy is that copy_tofrom_user has to cope gracefully in the case where the source or destination address cannot be accessed, for example if a bad address is given to a system call.)

• The main loop of copy_page works on 6 cachelines at once, and contains 18 load and 18 store instructions, in three groups of 6 stores followed by 6 loads.

• memcpy has three main loops, each of which does two aligned 64-bit loads and two aligned 64-bit stores. One loop is for the case where the source and destination are 8-byte aligned with respect to each other, and the other two are for the misaligned case. These two loops are slightly different, to handle an even or odd number of 64-bit doublewords to be transferred (excluding any bytes copied initially to get the destination 8-byte aligned).

These routines were tested on the 1.45GHz POWER4+ system using the same kernel compile test used in the previous section. The optimized copy routines were compared with the copy_tofrom_user and memcpy routines from the ppc32 kernel (thus using only 32-bit loads and stores) and a simple copy_page implementation which has two 64-bit loads and stores per iteration in its main loop.

Figure 1 shows selected LMBench results in graphical form. (See ftp://ftp.samba.org/pub/paulus/ols2003/lmb-power4-copy for the full summary of results.) From these it is evident that the optimized version is indeed noticeably faster than the unoptimized versions.

However, the results from the kernel compile test were more equivocal: the system time was reduced from 8.30 seconds to 8.19 seconds (a 1.3% improvement). When the user time (78.84 seconds in both cases) is added in, the overall improvement was only 0.13%.

Figure 1: Results of memory copy optimization on 64-bit POWER4+.

4 PTE management

4.1 Introduction

In the PowerPC architecture, page table entries (PTEs) are stored in a hash-table structure which is accessed by the memory management unit (MMU) hardware. The hash table is divided into groups of 8 entries. Each group has an index between 0 and N - 1, where N is the number of groups in the table (N must be a power of 2). When the MMU needs to find the PTE for a given virtual address, it first computes a hash index using an XOR-based hash function on the virtual address. The hash index identifies one group, called the primary group for the virtual address, which is fetched and searched for a PTE which matches the virtual address. The PTE includes part of the virtual address so that a match can be verified. If no matching PTE is found, a secondary hash value is formed by subtracting the original hash value from N - 1. This identifies the secondary group for the virtual address, which is also searched for a matching PTE.
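The group selection can be written out in a couple of lines. In the sketch below, hash_va_result stands for the output of the architecture's XOR-based hash of the virtual address, and ngroups is N; the function name is invented for this example.

        /* Sketch of primary/secondary hash group selection (N groups, N a power of 2). */
        static void pteg_indices(unsigned long hash_va_result, unsigned long ngroups,
                                 unsigned long *primary, unsigned long *secondary)
        {
                *primary   = hash_va_result & (ngroups - 1);
                *secondary = (ngroups - 1) - *primary;   /* same bits as *primary ^ (ngroups - 1) */
        }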

In contrast, the Linux virtual memory (VM)system uses a two-level or three-level treestructure for storing PTEs. It is the responsibil-ity of the PowerPC-specific parts of the kernelto keep the MMU hash table accessed by thehardware synchronized with the Linux page ta-ble trees. Essentially, the MMU hash tableis used as a large level-2 translation lookasidebuffer (TLB).

To avoid confusion, we use the term HPTE(Hashtable PTE) to refer to a PTE in the MMUhash table, and LPTE (Linux PTE) to refer to aPTE in the Linux page table trees. The HPTEand LPTE formats are different, although re-lated.

When a new LPTE is created, a correspondingHPTE must be created before the page can beaccessed. This can be done in theupdate_mmu_cache hook or on demand when thepage is first accessed.

Similarly, when an existing LPTE is invali-dated, the corresponding HPTE (if any) mustbe found and invalidated. One reasonableapproach is to do the HPTE invalidation inthe TLB flushing routines. There are fourTLB flushing routines called by the LinuxVM code: flush_tlb_page , flush_tlb_range , flush_tlb_mm andflush_tlb_kernel_range . In addition, thereare the__tlb_remove_tlb_entry andtlb_flush routines which are used in con-junction with themmu_gather structure usedwhen destroying page tables.

Of these, flush_tlb_page is relativelyeasy to implement efficiently since it only op-erates on a single page, whose address is given.The flushing routine just needs to search oneprimary HPTE group and possibly one sec-ondary group, and invalidate the HPTE if it isfound.

Implementing flush_tlb_range and flush_tlb_mm efficiently is more difficult, since the information about precisely which LPTEs have been changed or invalidated is not readily available. Searching the MMU hash table for every address in the range (for flush_tlb_range) or in the whole address space for a process (flush_tlb_mm) would be very time-consuming, and would be particularly inefficient if there were not actually HPTEs present for most of the addresses, as is typically the case for flush_tlb_mm.

Instead, we use a bit in the LPTE, called _PAGE_HASHPTE, to indicate whether an HPTE corresponding to this LPTE has been created. Then, flush_tlb_range and flush_tlb_mm can scan the Linux page tables looking for LPTEs with the _PAGE_HASHPTE bit set and only search the MMU hash table for the corresponding addresses. This scheme does however require some care in handling the _PAGE_HASHPTE bit:

• It must remain valid even if the LPTE is invalidated or used for a swap entry. Thus an atomic read-modify-write sequence must be used to invalidate or update an LPTE rather than a normal store instruction (see the sketch after this list).

• The page holding the LPTE must still be allocated and present in the page table tree at the time that any of the TLB flush routines are called.
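
The following is a hedged sketch of the kind of atomic update the first point calls for; it is not the kernel's actual pte_update() code, and pte_cmpxchg() is a hypothetical primitive standing in for the lwarx/stwcx. loop the 32-bit PowerPC kernel would use.

/* Clear an LPTE while preserving _PAGE_HASHPTE, using an atomic
 * read-modify-write so that a concurrently set _PAGE_HASHPTE bit
 * cannot be lost. */
static unsigned long lpte_clear_preserve_hashpte(pte_t *ptep)
{
	unsigned long old, new;

	do {
		old = pte_val(*ptep);
		new = old & _PAGE_HASHPTE;	/* keep only the hash-present bit */
	} while (pte_cmpxchg(ptep, old, new) != old);

	return old;	/* old contents may still be needed by the caller */
}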

The 64-bit PowerPC kernel uses an additional 4 bits in the LPTE to indicate whether the HPTE is in the primary or secondary group, and which of the 8 slots in the group it is in. With this information, the TLB flush routines can invalidate the HPTE directly without having to search the primary and secondary groups. This approach isn't used in the 32-bit PowerPC kernel since there are not 4 free bits in the LPTE.

4.2 Optimized implementation

Instead of having to scan the Linux page tables in flush_tlb_range and flush_tlb_mm, it would potentially be more efficient to invalidate the HPTE at the time that the LPTE is changed. Alternatively, it would be possible to make a list of virtual addresses when LPTEs are changed and then use that list in the TLB flush routines to avoid the search through the Linux page tables.

This approach was not possible in earlier kernels because there was not enough information supplied in the calls to the functions that update LPTEs (set_pte, ptep_get_and_clear, etc.) to determine the address space (represented by the mm_struct structure) and virtual address for the LPTE being modified. However, the infrastructure added for the reverse-mapping (rmap) support in the Linux VM system allows us to determine this information efficiently, since a pointer to the mm_struct for the address space and the base virtual address mapped by the LPTE page are now stored in the struct page for each LPTE page.

The results for the 32-bit kernel are shown in Table 4. The times shown are the system and user times in seconds for the kernel compilation test on the 400MHz PPC750 (G3) system, and are averages of at least two repetitions. The first row is for the original unoptimized kernel, the second row ("Immediate update") is for a kernel which invalidates the HPTE at the time when the LPTE is modified, and the third row ("Batched update") is for a kernel that records the virtual addresses when LPTEs are modified and invalidates the HPTEs in the TLB flush routines.

Kernel version      System time   User time   Total time
Original                  32.09      303.71       335.80
Immediate update          32.28      302.93       335.21
Batched update            32.40      303.22       335.62

Table 4: PTE optimizations, 32-bit kernel

Kernel version      System time   User time   Total time
Original                   8.51       78.13        86.64
Batched update             8.44       78.04        86.48

Table 5: PTE optimizations, 64-bit kernel

Clearly, neither optimization gives a significant increase in performance. The differences in total time of less than 0.6s are less than the standard deviation of the total time measurement for the original kernel, which was 1.42s (from 7 measurements).

The results in Table 5 for the 64-bit kernel on the 1.45GHz POWER4+ system paint a similar picture. The table compares the original kernel with one that records the virtual addresses when LPTEs are modified and invalidates the HPTEs in the TLB flush routine (the "Batched update" row). The times are in seconds and are averages of 6 measurements. Previous experience has shown that it is important to batch up changes to the MMU hash table in the ppc64 kernel, particularly on SMP systems. Consequently we did not test the variant which invalidates the HPTE at the time that the LPTE is modified.

The difference in total time is 0.16 seconds, about 0.2% of the total time. Even if the difference were statistically significant, it hardly represents an important optimization. However, the optimized code is actually simpler and shorter (by 66 lines) than the original code, and may be worth adopting for that reason alone.



5 Conclusions

The previous sections demonstrated performance improvements of varying magnitudes from the optimizations considered. Some of the individual LMBench numbers were more than doubled by the first optimization, that of avoiding the instruction cache flush on pages which had already been flushed. The 64-bit memory copy optimizations improved the unix-domain socket bandwidth by over 60%, with smaller improvements on other bandwidth-related measurements.

The overall improvements for the kernel compile benchmark were more modest. This is to be expected since the time spent in the kernel is only about a tenth of the time spent in user processes for this benchmark, and the optimizations considered here only reduce the time spent in the kernel.

In sum, the optimizations presented here provide a substantial performance boost for the Linux kernel on PowerPC machines.

The results illustrate some general principles about optimization work:

• Kernel profiling is a useful tool for determining what a profitable target for optimization is, and whether a given optimization is effective. However, the kernel profile results are usually quite noisy, and care must be exercised in interpreting them.

• Some optimizations may produce dramatic improvements on benchmarks but have almost no effect on the speed of actual application programs.

• Measurement is key; some optimizations might seem like an extremely good idea but not produce any significant performance gains, either because of unforeseen side-effects or because the thing being optimized doesn't consume a significant amount of time.

Acknowledgements

The author would like to thank Anton Blanchard for assistance with implementing and benchmarking the cache-flushing optimizations presented here.

Legal statements

This work represents the view of the author and does not necessarily represent the view of IBM.

IBM, pSeries, PowerPC, PowerPC 750, POWER4 and POWER4+ are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Linux is a registered trademark of Linus Torvalds.

LMBench is a trademark of BitMover, Inc.

Apple and PowerBook are trademarks of Apple Computer, Inc., registered in the U.S. and other countries.

Other company, product, and service names may be trademarks or service marks of others.


Sharing Page Tables in the Linux Kernel

Dave McCracken
IBM Linux Technology Center

Austin, TX

[email protected]

Abstract

An ongoing barrier to scalability has been the amount of memory taken by page tables, especially when large numbers of tasks are mapping the same shared region. A solution for this problem is for those tasks to share a common set of page tables for those regions.

An additional benefit to implementing shared page tables is the ability to share all the page tables during fork in a copy-on-write fashion. This sharing speeds up fork immensely for large processes, especially given the increased overhead introduced by rmap.

This paper discusses my implementation of shared page tables. It covers the areas that are improved by sharing as well as its limitations. I will also cover the highlights of how shared page tables were implemented and discuss some of the remaining issues around it.

This implementation is based on an initial design and implementation done by Daniel Phillips last year and posted to the kernel mailing list. [DP]

1 Introduction

The Linux® memory management (MM) subsystem has excellent mechanisms for minimizing the duplication of identical data pages by sharing them between address spaces whenever possible. The MM subsystem currently does not, however, make any attempt to share the lowest layer of the page tables, even though these may also be identical between address spaces.

Detecting when these page tables may be shared, and setting up sharing for them, is fairly straightforward. In the following sections the current MM data structures are described, followed by an explanation of how and when the page tables would be shared. Finally some issues in the implementation and some performance results are described.

2 Major Data Structures

To understand the issues addressed by shared page tables it is helpful to understand the structures that make up the Linux MM.

2.1 The mm_struct Structure

The anchor for the memory subsystem is the mm_struct structure. There is one mm_struct for each active user address space. All memory-related information for a task is found in the mm_struct or in structures it points to.

2.2 The vm_area_struct Structure

The address layout is contained in two parallel sets of structures. The logical layout is described by a set of structures called vm_area_structs, or vmas. There is one vma for each region of memory with the same file backing, permissions, and sharing characteristics. The vmas are split or merged as necessary in response to changes in the address space. There is a single chain of vmas attached to the mm_struct that describes the entire address space, sorted by address. This chain also can be collected into a tree for faster lookup.

Each vma contains information about a virtual address range with the same characteristics. This information includes the starting and ending virtual addresses of the range, and the file and offset into it if it is file-backed. The vma also includes a set of flags, which indicate whether the range is shared or private and whether it is writeable.

2.3 The Page Table

The address space page layout is described by the page table. The page table is a tree with three levels, the page directory (pgd), the page middle directory (pmd), and the page table entries (pte). Each level is an array of entries contained within a single physical page. The entry at each level contains the physical page number of the next level. The pte entries contain the physical page number of the data page.

For the architectures that use hardware page tables, the page table doubles as the hardware page table. This use of the page table puts constraints on the size and contents of the page table entries when they are in active use by the hardware.

2.4 The address_space Structure

In addition to the structures for each address space, there is a structure for each file-backed object called struct address_space (not to be confused with an address space, which is represented by an mm_struct). The address_space structure is the anchor for all mappings related to the file object. It contains links to every vma mapping this file, as well as links to every physical page containing data from the file.

All shared memory uses temporary files as backing store, so each shared memory vma is linked to an address_space structure. The only vmas not attached to an address_space are the ones describing the stack and the ones describing the bss space for the application and each shared library. These are called the 'anonymous' vmas and are backed directly by swap space.

2.5 The page Structure

All physical pages in the system have a structure that contains all the metadata associated with the page. This is called a "struct page." It contains flags that indicate the current usage of the page and pointers that are used to link the page into various lists. If the page contains data from a file, it also has a pointer to the address_space structure for that file and an index showing where in the file the data came from.

3 How Memory Gets Mapped

Memory areas are created, modified, and destroyed via one of the mapping calls. These calls can map new memory areas, change the protections on existing memory areas, and unmap memory areas.

3.1 Creating A Memory Mapping

New memory regions are created using the calls shmget/shmmap or mmap. Both calls create a mapped memory region, either backed by an open file passed as an argument or by a temporary file created for the purpose. The newly mapped memory region is represented by a new or modified vma that is attached to the task's mm_struct and to the address_space structure for the file. The page table is not touched during the mapping call.

Various flags are passed in when a page is mapped. These flags specify the characteristics of the mapped memory. Two of the key characteristics are whether the area is shared or private and whether the area is writeable or read-only. For read-only or shared areas, the file data is read from and written back to the file. Private writeable areas are treated specially in that while the data is read from the file, modified data is not written back to the file. The data is saved to swap space as an anonymous page if it needs to be paged out.

Pages are actually mapped when a task attempts to access a virtual address in the mapped area. The page fault code first finds the vma describing the area, then finds the pte entry that maps a data page for that address. A page is then found based on the information in the vma.

3.2 Locks and Locking

The MM subsystem primarily relies on three locks. Two locks are in the mm_struct. First is the mmap_sem, a read/write semaphore. This semaphore controls access to the vma chain within the mm_struct.

The second lock in the mm_struct is the page_table_lock. This spinlock controls access to the various levels of the page table.

The third lock is in the address_space structure. This lock controls the chain of vmas that map the file associated with that address_space. In 2.4, it is a spinlock and is called i_shared_lock. In 2.5, it was changed to a semaphore and is called i_shared_sem.

During a page fault the mmap_sem semaphore is taken for read at the beginning of the fault and is held until the fault is complete. The page_table_lock is taken early in the fault, but is released as necessary when the fault needs to block. Holding the mmap_sem semaphore for read allows other tasks with the same mm_struct to take page faults but not change their mappings.

4 Sharing the PTE Page

Normally the overhead taken up by page tables is small. On 32 bit architectures, there is typically a maximum of one pgd page, three pmd pages, and one pte page for every 512 or 1024 data pages. This ratio may be somewhat higher for sparsely populated address spaces, but virtual addresses are typically allocated in order so pte pages tend to be filled in fairly densely.

This ratio changes for shared memory areas. A shared memory region that covers an entire pte page may be shared among many address spaces. This sharing means there will be 1 pte page for each address space for every 512 or 1024 mapped data pages. Note that the pte pages, once allocated, are not freed even if the data pages have been paged out, so the pte page overhead is fixed even under memory pressure.

Shared memory is a common method for many applications to communicate among their processes. It is possible for a massively shared application to use over half of physical memory for its page tables. For example, with 4 KB pages and 1024 pte entries per pte page, a 2 GB shared region requires 512 pte pages (2 MB) in each address space; if 1000 processes map it, the pte pages for that one region consume roughly 2 GB, as much as the shared data itself.

Each address space in this scenario has an identical set of pte pages for its shared areas, with all its entries pointing to the same physical data pages. The premise behind shared page tables is to only allocate a single set of pte pages and set the pmd entry for each address space to point to it.

A beneficial side effect of sharing the pte page is that once a data page has been faulted in by one task, the page appears in the address space of all other tasks mapping that area. Without sharing, each task would have to take its own page fault to get access to that page.

There are clearly some constraints on when pte pages can be shared. The shared area must span the virtual memory mapped by the entire pte page. All address spaces must map the area at the same virtual address. While in theory the mapped areas only need to be aligned per pte page, the current implementation requires that the virtual addresses be the same.

4.1 Finding PTE Pages to Share

For sharing to work, it is necessary to find an existing pte page if one exists. Finding this page is accomplished at page fault time when there is not already a pte page for the faulting address.

First, the current vma is checked to see if it is eligible for sharing. It needs to be either shareable or read-only, and needs to span the entire address range for that pte page.

Next, the code searches for an existing pte page that can be shared. This search is done by going to the address_space for the current memory area, then walking its vma chain. Each vma is checked for compatibility. The vma needs to be at the same file offset and virtual address as the faulting vma and needs to also span the entire pte page. For each matching vma, its corresponding page table is checked for a pte page. If a pte page is found, it is installed in the pmd entry and its use count is incremented. While in theory it should be possible to share pte pages between any vmas whose mappings have the same pte page alignment, the current rmap implementation limits sharing to those that map to the same virtual address.
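
A hedged sketch of this lookup is shown below; all of the helper names are illustrative rather than the actual patch, but the flow follows the description above.

static struct page *find_shareable_pte_page(struct vm_area_struct *vma,
					    unsigned long address)
{
	struct address_space *mapping = vma_address_space(vma);
	struct vm_area_struct *other;

	/* the faulting vma must be shareable or read-only and must
	 * span the whole range mapped by the pte page */
	if (!vma_eligible_for_sharing(vma, address))
		return NULL;

	/* walk every vma that maps this file */
	for (other = first_mapping_vma(mapping); other;
	     other = next_mapping_vma(other)) {
		if (other == vma)
			continue;
		/* same file offset, same virtual address, and spanning
		 * the entire pte page */
		if (!vma_compatible(vma, other, address))
			continue;
		/* does this address space already have a pte page here? */
		if (pte_page_present(other->vm_mm, address))
			return pte_page_of(other->vm_mm, address);
	}
	return NULL;	/* nothing to share; allocate a new pte page */
}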

4.2 Copy On Write

On some architectures there is a second use for shared page tables. During fork, every pte entry is copied to the new page table. Data pages that can't be fully shared are marked as "copy on write." Marking a page as copy on write involves setting both the parent and the child pte entry to point to the same data page, but with write disabled. When either task tries to write to that page, the page is then duplicated and write access is enabled.

Some architectures (including x86) support disabling write access in the pmd entry, and interpret this to mean disabling write to all the data pages mapped by that entry. Disabling write allows the copy on write concept to be extended to shared page tables. Instead of copying each pte entry at fork time, each pmd entry is set to point to the same pte page and write access is disabled. When a write fault occurs, a new pte page is allocated and all the pte entries are copied in the same fashion as during fork in the non-shared version.
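
A minimal sketch of that idea, assuming an architecture whose pmd entries carry such a write-protect bit, might look like the following; pmd_set_pte_page() and pmd_wrprotect() are illustrative helpers, not the patch's actual interfaces.

static void share_pte_page_at_fork(pmd_t *parent_pmd, pmd_t *child_pmd,
				   struct page *pte_page)
{
	/* both address spaces point at the same pte page... */
	pmd_set_pte_page(child_pmd, pte_page);
	get_page(pte_page);		/* one more user of the pte page */

	/* ...but neither may write through it; a later write fault
	 * unshares it by allocating and filling a new pte page */
	pmd_wrprotect(parent_pmd);
	pmd_wrprotect(child_pmd);
}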

4.3 Locking Changes for Shared PTE Pages

When page tables have shared pte pages, the existing locking scheme becomes inadequate. The page_table_lock in the mm_struct can no longer protect the entire page table, since pte pages may be shared with other page tables.

There is a spinlock in the page struct that is normally used for pte_chains in data pages. Since pte pages have no pte_chains, the lock in the pte page's page struct can be used to control access to it. For pte pages this lock becomes the pte_page_lock. This change means the page_table_lock protects the pgd and pmd levels and the pte_page_lock protects the pte page.

Using this lock changes the locking protocol in the page fault path. The page_table_lock must still be taken until the pte page is found and selected. The pte_page_lock is taken for that page, at which point the page_table_lock can be released. The rest of fault resolution is done under the pte_page_lock.
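
In outline, the modified fault path looks roughly like the sketch below; locate_pte_page(), resolve_fault() and the pte_page_lock()/pte_page_unlock() wrappers are illustrative names for the steps just described.

static int fault_with_shared_pte(struct mm_struct *mm, unsigned long address)
{
	struct page *pte_page;

	spin_lock(&mm->page_table_lock);	/* covers the pgd and pmd levels */
	pte_page = locate_pte_page(mm, address);	/* find or allocate the pte page */
	pte_page_lock(pte_page);		/* per-pte-page lock from its struct page */
	spin_unlock(&mm->page_table_lock);	/* not needed once the pte page is held */

	/* the remainder of fault resolution runs under pte_page_lock only */
	resolve_fault(mm, address);

	pte_page_unlock(pte_page);
	return 0;
}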

5 Complications

While making pte pages shared seems like a simple task in theory, there are several things that complicated the task.

Part of pte_chain-based reverse mapping is a pointer in the pte page's struct page that points to the mm_struct the page belongs to. Sharing the pte page means it can belong to several mm_structs. It was necessary to add an mm_chain structure which points to a chain of mm_structs that use the pte page.

There are various memory management-related system calls that can modify existing mappings. These include mremap, mprotect, remap_file_pages, and mmap itself. These calls can all change the mappings for an address space such that the pte page can no longer be shared. Each of those system calls was modified to identify shared pte pages and unshare them as necessary.

6 Performance

Shared page tables are primarily intended to reduce the space overhead of page tables, but there are some performance benefits as well.

The primary performance gain is during fork because of copy-on-write sharing. Instead of duplicating a reference to each data page, fork only needs to duplicate a reference to each pte page. This can improve fork performance by up to a factor of 10.

Fork speedup is balanced, however, by the cost of unsharing each pte page when one of its data pages is written to. Typically, three pte pages are unshared after every fork due to the user space layout. Since small programs generally only have three pte pages, only larger programs benefit from the improvement. In fact, sharing the pte page and then unsharing it on page fault has a small but measurable cost compared to copying. The simple solution to this is to copy the pte pages on fork if the address space only has three pte pages.

There is a corresponding performance improvement for exit and exec when they tear down the address space. Any pte pages that are shared can be detached simply by decrementing their reference count. For each pte page that is not shared, the code must examine each entry to determine what to do with its data page or swap page.

7 Conclusion

Shared page tables achieve their primary objective of dramatically improving the scalability of massively shared applications, as well as improving the fork and exit performance of large tasks. While there is some cost in added complexity, the benefits far outweigh the cost.

Legal Statement

This paper represents the views of the author, and not the IBM Corporation.

IBM® is a registered trademark of International Business Machines Corporation.

Linux® is a registered trademark of Linus Torvalds.

Other company, product or service names may be the trademarks or service marks of others.



References

[DP] Daniel Phillips. http://nl.linux.org/phillips/page.table.sharing


Kernel Janitors: State of the Project

Arnaldo Carvalho de Melo
Conectiva S.A.

[email protected]

http://advogato.org/person/acme

Abstract

The Kernel Janitors Project has been cleaning up the kernel for quite some time. In this paper I'll present what has been done, tasks that kernel hackers have added to the TODO list, tools used to help in the process, 2.5 changes that need to be propagated through the tree, and other aspects of kernel janitoring.

1 What is the Kernel Janitors Project?

The Kernel Janitors Project grew out of our search for ways to help with the development of the Linux kernel. Learning from patches submitted by more experienced people, we saw that some of these patches indicated error patterns that could exist in other parts of the kernel. We looked and... yes, we discovered that some parts of the kernel suffered from the same problems, and in the process we found code bitrotting. I (acme) started maintaining a TODO list of things to fix or clean up, and people started submitting suggestions for things to fix, which I collected at http://bazar.conectiva.com.br/~acme/TODO.

2 A Way for Newbies to Start Hacking the Kernel

Looking at the httpd logs I discovered that lots of people accessed it, so this was indeed useful as a starting point for people also wanting to help in cleaning up or fixing parts of the kernel.

Several people have the goodwill to help with kernel programming, but many don't have the time to help on areas that require more knowledge and effort. The Kernel Janitors Project TODO list helps here, as there are lots of simple tasks that need work but are trivial enough to allow those people to help without requiring much time and effort.

3 The Project Moves to SourceForge

Dave Jones suggested that we move this to SourceForge, so that we could have it in CVS and allow more people to have admin rights to add more entries, add explanations about the tasks, etc.

I then registered the domain kerneljanitors.org and made it point to the SourceForge web page, which is very simple but has important pointers to resources for new janitors.



4 Project Committers

We're always accepting more people, preferably maintainers of parts of the kernel, so that we can improve the TODO list, as there is always some boring and repetitive work that the maintainers postpone because there are always more important things to do.

In fact even Linus has posted at least once to the mailing list asking for help with simple repetitive tasks, such as the irqreturn_t conversion for all the interrupt handling routines.

These are the current committers for this project:

• Arnaldo Carvalho de Melo

• Dave Jones

• Jeff Garzik

• Matthew Wilcox

• Randy Dunlap

• Tariq Shureih

• William Lee Irwin III

5 Mailing list

Using the SourceForge infrastructure we created the kernel-janitor-discuss mailing list, where we discuss aspects of the project and review janitor patches.

There has been a continuous arrival of people asking where to start helping, and a number of them became regular janitors.

6 Some Janitors

• William Stinson

– check_region removal

• Art Haas

– C99 struct init style

7 Tools

There are now several tools that help pinpoint most of the entries in the TODO list and even some more involved problems that need hands to fix. I'll briefly talk about several of these tools in the next sections.

7.1 Stanford Checker

Perhaps the first source-checking tool used on the Linux kernel, the Stanford Checker is based on a modified gcc that performs several verifications for different types of common problems. The Stanford guys then do some sanity checking and post the list to the linux-kernel mailing list, where people comment on it, checking whether the reports are false positives and most frequently fixing the bugs.

It is unfortunately an unreleased tool, but it has been of great help over the last years.

7.2 kj.pl

This was a very simple perl script that Dave Jones wrote, which searched for some very trivial problems, but Dave has stopped working on it now that we have smatch.

7.3 smatch

Dan Carpenter's smatch is a released tool that allows developers to write perl scripts to search for problems; it is also based on a modified gcc.

Dan has created the kbugs.org web site where he has several smatch scripts and a list of the results of those scripts, allowing janitors to pick real problems to work on.

The current scripts, for reference, are:

• ReleaseRegion

• ReleaseIRQ

• DoubleSpinlock

• Dereference

• GFP_DMA

• UnreachedCode

• SpinlockUndefined

• FunctionStack

• UncheckedReturn

• SpinSleepLazy

• UnFree

I'll quote Dan now just as an example of smatch results on kbugs.org:

“There were quite a few smatch related fixes in the 2.5.69 kernel. Someone fixed 4 SpinSleepLazy bugs, 3 UnreachedCode bugs and 6 unchecked calls of copy_to_user(). It’s not entirely clear who did what from the changelog, but thank you anonymous heros.”

7.4 cqual

Cqual is another tool that can help find problems in codebases such as the kernel. Here is the project description, from its authors:

“Cqual is a type-based analysis tool that provides a lightweight, practical mechanism for specifying and checking properties of C programs. Cqual extends the type system of C with extra user-defined type qualifiers. The programmer adds type qualifier annotations to their program in a few key places, and Cqual performs qualifier inference to check whether the annotations are correct. The analysis results are presented with a user interface that lets the programmer browse the inferred qualifiers and their flow paths.”

It is part of a bigger project at the University of California, Berkeley called Open Source Quality. This seems to be a project that deserves investigation by janitors, as it hasn't been mentioned on the kjp mailing list up to now.

7.5 sparse

And now for something unreleased at this time: Linus Torvalds's sparse tool:

“Sparse is a semantic parser of source files: it's neither a compiler (although it could be used as a front-end for one), nor is it a preprocessor (although it contains as a part of it a preprocessing phase).

It is meant to be a small—and simple—library. Scanty and meager, and partly because of that easy to use. It has one mission in life: create a semantic parse tree for some arbitrary user for further analysis. It's not a tokenizer, nor is it some generic context-free parser. In fact, context (semantics) is what it's all about—figuring out not just what the grouping of tokens are, but what the types are that the grouping implies.”

It is indeed very interesting that more and more people are working towards having tools that can help in improving the quality of Open Source projects such as Linux.

8 Documentation

• lwn.net Articles by Jonathan Corbet

• Tariq’s "Drivers DOs and DONTs"



• Other Articles for janitors

• Arjan's article about how not to write a driver

• Greg KH’s article.


Linux Kernel Power Management

Patrick Mochel
Open Source Development Labs

[email protected]

Abstract

Power management is the process by which the overall consumption of power by a computer is limited based on user requirements and policy. Power management has become a hot topic in the computer world in recent years, as laptops have become more commonplace and users have become more conscious of the environmental and financial effects of limited power resources.

While there is no such thing as perfect power management, since all computers must use some amount of power to run, there have been many advances in system and software architectures to conserve the amount of power being used. Exploiting these features is key to providing good system- and device-level power management.

This paper discusses recent advances in the power management infrastructure of the Linux kernel that will allow Linux to fully exploit the power management capabilities of the various platforms that it runs on. These advances will allow the kernel to provide equally great power management, using a simple interface, regardless of the underlying architecture.

This paper covers the two broad areas of power management—System Power Management (SPM) and Device Power Management (DPM). It describes the major concepts behind both subjects and describes the new kernel infrastructure for implementing both. It also discusses the mechanism for implementing hibernation, otherwise known as suspend-to-disk, support for Linux.

1 Overview

Benefits of Power Management

A sane power management infrastructure provides many benefits to the kernel, and not only in the obvious areas.

Battery-powered devices, such as embedded devices, handhelds, and laptops, reap most of the rewards of power management, since the more conservative the draw on the battery is, the longer it will last.

System power management decreases boot time of a system, by restoring previously saved state instead of reinitializing the entire system. This conserves battery life on mobile devices by reducing the annoying wait for the computer to boot into a useable state.

Recently, power management concepts have begun to filter into less obvious places, like the enterprise. In a rack of servers, some servers may power down during idle times, and power back up when needed again to fulfill network requests. While the power consumption of a single server is but a drop in the water, being able to conserve the power draw of dozens or hundreds of computers could save a company a significant amount of money.



Also, at the lower level, power management may be used to provide emergency reaction to a critical system state, such as crossing a predefined thermal threshold or reaching a critically low battery state. The same concept can be applied when triggering a critical software state, like an Oops or a BUG() in the kernel.

System and Device Power Management

There are two types of power management that the OS must handle—System Power Management and Device Power Management.

Device Power Management deals with the process of placing individual devices into low-power states while the system is running. This allows a user to conserve power on devices that are not currently being used, such as the sound device in my laptop while I write this paper.

Individual device power management may be invoked explicitly on devices, or may happen automatically after a device has been idle for a set amount of time. Not all devices support run-time power management, but those that do must export some mechanism for controlling it in order to execute the user's policy decisions.

System Power Management is the process by which the entire system is placed into a low-power state. There are several power states that a system may enter, depending on the platform it is running on. Many are similar across platforms, and will be discussed in detail later. The general concept is that the state of the running system is saved before the system is powered down, and restored once the system has regained power. This prevents the system from performing an entire shutdown and startup sequence.

System power management may be invoked for a number of reasons. It may automatically enter a low-power state after it has been idle for some amount of time, after a user closes a lid on a laptop, or when some critical state has been reached. These are also policy decisions that are up to the user to configure and require some global mechanism for controlling.

2 Device Power Management

Overview

Device power management in the kernel is made possible by the new driver model in the 2.5 kernel. In fact, the driver model was inspired by the requirement to implement decent power management in the kernel. The new driver model allows generic kernel code to communicate with every device in the system, regardless of the bus the device resides on, or the class it belongs to.

The driver model also provides a hierarchical representation of the devices in the system. This is key to power management, since the kernel cannot power down a device that another, still-powered device relies on for power. For example, the system cannot power down a parent device whose children are still powered up and depend on their parent for power.

In its simplest form, device power management consists of a description of the state a device is in, and a mechanism for controlling those states. Device power states are described as 'D' states, and consist of states D0-D3, inclusive. This device state representation is inspired by the PCI device specification and the ACPI specification [ACPI]. Though not all device types define power states in this way, this representation can map onto all known device types.

Each D state represents a tradeoff between the amount of power a device is consuming and how functional a device is. In a lower power state (represented by a higher digit following D), some amount of power to a device is lost. This means that some of the device's operating state is lost, and must be restored by its driver when returning to the D0 state.

D0 represents the state when the device is fully powered on and ready for, or in, use. This state is implicitly supported by every device, since every device may be powered on at some point while the system is running. In this state, all units of a device are powered on, and no device state is lost.

D3 represents the state when the device is off. This state is also implicitly supported by every device, since every device is implicitly powered off when the system is powered off. In this state, all device context is lost and must be restored before using the device again. This usually means the device must also be completely reinitialized.

The PCI Power Management spec goes on to define D3hot as a D3 state that is entered via driver control, and D3cold as one that is entered when the entire system is powered down. In D3hot, the device may not lose all operating power, requiring less restoration to take place. This is, however, device-dependent. The kernel does not distinguish between the two, though a driver theoretically could take extra steps to do so.

D1 and D2 are intermediate power states that are optionally supported by a device. In each case, the device is not functional, but not entirely powered off. In order to bring the device back to an operating state, less work is required than reviving the device from D3. In D1, more power is consumed than in D2, but more device context is preserved.

A device's power management information is stored in struct device_pm:

struct device_pm {
#ifdef CONFIG_PM
	dev_power_t	power_state;
	u8		*saved_state;
	atomic_t	depend;
	atomic_t	disable;
	struct kobject	kobj;
#endif
};

struct device contains a statically allocated device_pm object. The PM configuration dependency guarantees the overhead for the structure is nil when power management support is not compiled in.

The kernel defines the following power states in include/linux/pm.h:

typedef enum {
	DEVICE_PM_ON,
	DEVICE_PM_INT1,
	DEVICE_PM_INT2,
	DEVICE_PM_OFF,
	DEVICE_PM_UNKNOWN,
} dev_power_t;

When a device is registered, its initial power state is set to DEVICE_PM_UNKNOWN. The device driver may query the device and initialize the known power state using

void device_pm_init_power_state(struct device *dev,
				dev_power_t state);
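
As a hedged illustration of how a driver might use this, the probe routine below queries its (hypothetical) hardware and records the state it finds, so the PM core starts from a known value instead of DEVICE_PM_UNKNOWN; foodev and foodev_read_power_state() are made-up names.

static int foodev_probe(struct device *dev)
{
	/* ask the hardware which D state it is currently in */
	dev_power_t state = foodev_read_power_state(dev);

	device_pm_init_power_state(dev, state);
	return 0;
}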

Controlling a Device’s State

A device's power state may be controlled by the suspend() and resume() methods in struct device_driver:

int (*suspend)(struct device *dev, u32 state, u32 level);
int (*resume)(struct device *dev, u32 level);



These methods may be initialized by the low-level device driver, though they are typically initialized at registration time by the bus driver that the driver belongs to. The bus's functions should forward power management requests to the bus-specific driver, modifying the semantics where necessary.

This model is used to provide the easiest route when converting to the new driver model. However, a device driver's explicit initialization of these methods will be honored.

The same methods are called during individual device power management transitions and system power management transitions.

There are two steps to suspend a device and two steps to resume it. In order to suspend a device, two separate calls are made to the suspend() method—one to save state, and another to power the device down. Conversely, one call is made to the resume() method to power the device up, and another to restore device state.

These steps are encoded:

enum {
	SUSPEND_SAVE_STATE,
	SUSPEND_POWER_DOWN,
};

enum {
	RESUME_POWER_ON,
	RESUME_RESTORE_STATE,
};

and are passed as the 'level' parameter to each method.

During the SUSPEND_SAVE_STATE call, the driver is expected to stop all device requests and save all relevant device context based on the state the device is entering.

This call is made in process context, so the driver may sleep and allocate memory to save state. However, during system suspend, backing swap devices may have already been powered down, so drivers should use GFP_ATOMIC when allocating memory.

SUSPEND_POWER_DOWN is used only to physically power the device down. This call has some caveats, and drivers must be aware of them. Interrupts will be disabled when this routine is called. However, during run-time device power management, interrupts will be reenabled once the call returns. Some devices are known to cause problems once they are powered down and interrupts reenabled—e.g. flooding the system with interrupts. Drivers should be careful not to service power management requests for devices known to be buggy.

During system power management, interrupts are disabled and remain disabled while powering down all devices in the system.

The resume sequence is identical, though reversed, from the suspend sequence. The RESUME_POWER_ON stage is performed first, with interrupts disabled. The driver is expected to power the device on. Interrupts are then enabled and the RESUME_RESTORE_STATE stage is performed, and the driver is expected to restore device state and free memory that was previously allocated.

A driver may use the struct device_pm::saved_state field to store a pointer to device state when the device is powered down.
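
Putting the two suspend steps together, a driver's suspend() method might be structured as in the sketch below. The foodev driver and its helpers are hypothetical; the level handling follows the rules described above.

static int foodev_suspend(struct device *dev, u32 state, u32 level)
{
	struct foodev *fd = to_foodev(dev);	/* hypothetical container_of() wrapper */

	switch (level) {
	case SUSPEND_SAVE_STATE:
		/* process context: stop new requests and snapshot registers;
		 * use GFP_ATOMIC since swap may already be unavailable */
		fd->saved = kmalloc(sizeof(*fd->saved), GFP_ATOMIC);
		if (!fd->saved)
			return -ENOMEM;
		foodev_stop_and_save(fd, fd->saved);
		break;
	case SUSPEND_POWER_DOWN:
		/* interrupts are disabled here; just cut power */
		foodev_power_off(fd);
		break;
	}
	return 0;
}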

Power Dependencies

Devices that are children of other devices (e.g. devices behind a PCI bridge) depend on their parent devices to be powered up to either provide power to them and/or provide I/O transactions.

The system must respect the power dependencies of devices and must not attempt to power down a device which another device depends on being on. Put another way, all children devices must be powered down before their parent can be powered down. Conversely, the parent device must be powered up before any children devices may be accessed.

Expressing this type of dependency is simple, since it is easy to determine whether or not a device has any children. But there are more interesting power dependencies that are more difficult to express.

On a PCI Hotplug system, the hotplug controller that controls power to a range of slots may reside on the primary PCI bus. However, the slots it controls may reside behind a PCI-PCI bridge that is a peer of the hotplug controller. The devices in the slots depend on the hotplug controller being on to operate, but it is not the devices' parent. There are similar transversal relationships on some embedded platforms in which some I/O controller resides near the system root that some PCI devices, several layers deep, may depend on to communicate properly.

Both types of power dependencies are represented using the struct device_pm::depend field. Implicit dependencies, like parent-child relationships, are handled by the depend count being incremented when a child is registered with the PM core. When that child device is powered down or removed, its parent's depend count is decremented. Only when a device's depend count is 0 may it be powered down.

Explicit power dependencies can be imposed on devices using

int device_pm_get(struct device *);
void device_pm_put(struct device *);

device_pm_get() will increment a device's dependency count, and device_pm_put() will decrement it. It is up to the driver to properly manage the dependency counts on device discovery, removal, and power management requests.
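
A hedged example of an explicit dependency, in the spirit of the hotplug-controller case above: the (hypothetical) slot driver pins the controller so it cannot be powered down while a slot device needs it.

static int hp_slot_add_device(struct device *controller,
			      struct device *slot_dev)
{
	int error;

	/* the slot device cannot operate unless the controller is powered */
	error = device_pm_get(controller);
	if (error)
		return error;

	error = register_slot_device(slot_dev);	/* illustrative helper */
	if (error)
		device_pm_put(controller);	/* drop the dependency on failure */
	return error;
}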

Disabling Power Management

There are circumstances in which a driver must refuse a power management request. This is usually because the driver author does not know the proper reinitialization sequence, or because the user is performing an uninterruptible operation like burning a CD.

It is valid for a driver to return an error from a suspend() method call. For example, a driver may know a priori that it can't handle the request. This works to the system's benefit, since the PM core can check if any devices have disabled power management before starting a suspend transition.

To disable or enable power management, a device may call

int device_pm_disable(struct device *);
void device_pm_enable(struct device *);

The former increments the struct device_pm::disable count, and the latter decrements it. If the count is positive, system power management will be disabled completely, as will device power management on that device.

These calls should be used judiciously, since they have a global impact on system power management.
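
As a hedged example of the CD-burning case mentioned earlier, a driver could bracket the uninterruptible operation like this; do_burn() is a stand-in for the real work.

static int cd_burn_session(struct device *dev, void *job)
{
	int error;

	error = device_pm_disable(dev);	/* suspend requests are refused while held */
	if (error)
		return error;

	error = do_burn(job);		/* hypothetical long-running operation */

	device_pm_enable(dev);		/* allow power management again */
	return error;
}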



3 System Power Management

System power management (SPM) is the process of placing the entire system into a low-power state. In a low-power state, the system is consuming a small, but minimal, amount of power, yet maintaining a relatively low response latency to the user. The exact amount of power and response latency depends on the state the system is in.

Power States

The states a system can enter are dependent on the underlying platform, and differ across architectures and even generations of the same architecture. There tend to be three states that are found on most architectures that support a form of SPM, though. The kernel explicitly supports these states—Standby, Suspend, and Hibernate, and provides a mechanism for a platform driver (an architectural port of the kernel) to define new states.

typedef enum {
	POWER_ON        = 0,
	POWER_STANDBY   = 0x01,
	POWER_SUSPEND   = 0x02,
	POWER_HIBERNATE = 0x04,
} pm_system_state_t;

Standby is a low-latency power state that is sometimes referred to as "power-on suspend." In this state, the system conserves power by placing the CPU in a halt state and the devices in the D1 state. The power savings are not significant, but the response latency is minimal—typically less than 1 second.

Suspend is also commonly known as "suspend-to-RAM." In this state, all devices are placed in the D3 state and, with the exception of main memory, the system does not maintain power. Memory is placed in self-refresh mode, so its contents are not lost. Response latency is higher than Standby, yet still very low—between 3-5 seconds.

Hibernate conserves the most power by turning off the entire system, after saving state to a persistent medium, usually a disk. All devices are powered off unconditionally. The response latency is the highest—about 30 seconds—but still quicker than performing a full boot sequence.

Most platforms support these states, though some platforms may support other states or have requirements that don't match the assumptions above. For example, some PPC laptops support Suspend, but because of a lack of documentation, the video devices cannot be fully reinitialized and hence may not enter the D3 state. The hardware will supply enough power to devices for them to stay in the D2 state, which the drivers are capable of recovering from.

Instead of cluttering the code with a lot of conditional policy to determine the correct state for devices to enter, the PM subsystem abstracts system state information into dynamically registered objects.

struct pm_state {
	struct pm_driver	*drv;
	pm_system_state_t	sys;
	pm_device_state_t	dev;
	struct kobject		kobj;
};

The drv field is a pointer to the platform-specific object configured to handle the power state. The sys field is the low-level power state that the system will enter. The dev field is the lowest power state that devices may enter. The kobj field is the generic object for managing an instance's lifetime.

The kernel defines default power state objects representing the assumptions above:



struct pm_state pm_state_standby;
struct pm_state pm_state_suspend;
struct pm_state pm_state_hibernate;

Platform drivers may also define and register additional power states that they support using:

int pm_state_register(struct pm_state *);
void pm_state_unregister(struct pm_state *);

The PM sysfs Interface

The PM infrastructure registers a top-level subsystem with the kobject core, which provides the /sys/power/ directory in sysfs. By default, there is one file in the directory: /sys/power/state.

Reading from this file displays the states that are currently registered with the system; e.g.:

# cat /sys/power/state
standby suspend hibernate

By writing the name of a state to this file, the system will perform a power state transition, which is described next.

Each power state that is registered receives a directory in /sys/power/, and three attribute files:

# tree /sys/power/suspend/
/sys/power/suspend/
|-- devices
|-- driver
`-- system

The 'devices' file and the 'system' file describe which power state the devices in the computer are to enter and the state the computer itself is to enter, respectively. The 'driver' file displays which low-level platform PM driver is configured to handle the power transition. Writing to this file sets the driver internally.

Power Management Platform Drivers

The process of transitioning the OS into a low-power state is largely platform-agnostic. However, the low-level mechanism for actually transitioning the hardware to a low-power state is very platform specific, and even dependent on the generation of the hardware.

On some platforms, there may be multiple ways to enter a low-power state, presenting a policy decision for the user to make. Note this usually arises only in choosing whether to enter a minimal power state during a Hibernation transition, or turning the system completely off.

To cope with these variations, the PM core defines a simple driver model:

struct pm_driver {
	u32	states;
	int	(*prepare)(u32 state);
	int	(*save)   (u32 state);
	int	(*sleep)  (u32 state);
	int	(*restore)(u32 state);
	int	(*cleanup)(u32 state);
	struct kobject kobj;
};

int pm_driver_register(struct pm_driver *);
void pm_driver_unregister(struct pm_driver *);

The states field of struct pm_driver is a logical OR of the states the driver supports. The methods are platform-specific calls that the PM core executes during a power state transition. They are designed to perform the following:

prepare — Verify that the platform can enter the requested state and perform any necessary preparation for entering the state.

save — Save low-level state of the platform and the CPU(s).

sleep — Enter the requested state.

restore — Restore low-level register state of the platform and CPU(s).

cleanup — Perform any necessary actions to leave the sleep state.

A platform should initialize and register a driver on startup:

struct pm_driver acpi_pm_driver = {
	.states  = (POWER_STANDBY |
		    POWER_SUSPEND |
		    POWER_HIBERNATE),
	.prepare = acpi_enter_sleep_state_prep,
	.sleep   = acpi_pm_sleep,
	.cleanup = acpi_leave_sleep_state,
	.kobj    = { .name = "acpi" },
};

static int __init acpi_sleep_init(void)
{
	return pm_driver_register(&acpi_pm_driver);
}

Each registered PM driver receives a directory in sysfs in /sys/power/. Each driver receives one default attribute file named states, which displays the power states the driver supports. This file is not writable by userspace.

# tree /sys/power/acpi/
/sys/power/acpi/
`-- states
# cat /sys/power/acpi/states
standby suspend hibernate

Platform drivers may define and export their own attributes.

struct pm_attribute {
	struct attribute attr;
	ssize_t (*show)(struct pm_driver *, char *);
	ssize_t (*store)(struct pm_driver *, const char *, size_t);
};

int pm_attribute_create(struct pm_driver *, struct pm_attribute *);
void pm_attribute_remove(struct pm_driver *, struct pm_attribute *);

The semantics for pm_driver attributes follow the same semantics as other sysfs attributes. Please see the kernel sysfs documentation for more information.
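
As a hedged illustration, a platform driver could export an attribute of its own as sketched below; the "wakeup_source" attribute and the acpi_get/set_wakeup_source() helpers are invented for the example, and only the pm_attribute interface itself comes from the text above.

static ssize_t wakeup_show(struct pm_driver *drv, char *buf)
{
	return sprintf(buf, "%s\n", acpi_get_wakeup_source());
}

static ssize_t wakeup_store(struct pm_driver *drv, const char *buf, size_t n)
{
	return acpi_set_wakeup_source(buf, n);
}

static struct pm_attribute wakeup_attr = {
	.attr	= { .name = "wakeup_source", .mode = 0644 },
	.show	= wakeup_show,
	.store	= wakeup_store,
};

/* at driver initialization:
 *	pm_attribute_create(&acpi_pm_driver, &wakeup_attr);
 */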

Power State Transitions

Transitioning the system to a low-power state is, unfortunately, not as simple as telling the platform to enter the requested low power state. The file drivers/power/suspend.c contains the entire sequence, and should be used as reference material for the official process. A synopsis is provided here.

The first step is to verify that the system can enter the power state. The PM core must have a driver that supports the requested state, the driver must return success from its prepare() method, and the driver core must return success from device_pm_check(). Next, the PM core quiesces the running system by disabling preemption and 'freezing' all processes.

Next, system state is saved by calling device_suspend() to save device state, and the driver's save() method to save low-level system state. If we're entering a variant of the Hibernation state, the contents of memory must be saved to a persistent medium. pm_hibernate_save() is called to perform this, and is described in the section on Hibernation.

Once state is saved, the PM core disables interrupts and calls device_power_down() to place each device in the specified low power state. Finally, it calls the driver's sleep() method to transition the system to the low-power state.
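
Condensed into pseudo-C, the suspend side of the sequence looks roughly like the sketch below. The real code is in drivers/power/suspend.c; the argument conventions assumed for device_suspend(), device_power_down() and the freeze/thaw helpers are guesses, and the resume path and most error handling are omitted.

static int enter_power_state(struct pm_state *state, struct pm_driver *drv)
{
	int error;

	if (!(drv->states & state->sys))
		return -EINVAL;			/* no driver support for this state */

	error = drv->prepare(state->sys);	/* platform-level preparation */
	if (error)
		return error;
	error = device_pm_check();		/* any device refusing PM? */
	if (error) {
		if (drv->cleanup)
			drv->cleanup(state->sys);
		return error;
	}

	preempt_disable();			/* quiesce the running system */
	freeze_processes();

	device_suspend(state->dev);		/* save device state */
	if (drv->save)
		drv->save(state->sys);		/* save low-level system state */
	if (state->sys == POWER_HIBERNATE)
		pm_hibernate_save();		/* snapshot memory to disk */

	local_irq_disable();
	device_power_down(state->dev);		/* devices into their low-power states */
	error = drv->sleep(state->sys);		/* enter the state itself */
	local_irq_enable();

	if (drv->cleanup)
		drv->cleanup(state->sys);	/* leave the sleep state */
	thaw_processes();
	preempt_enable();
	return error;
}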

The resume sequence has two variants, depending on whether the system is returning from a Hibernation state or not. If it is not, the platform is responsible for returning execution to the correct place (after the return from the driver's sleep() method). This may be a function of the processor, the firmware, or the low-level platform driver.

If we're returning from Hibernation, the system detects it during the boot process in the function pm_resume(). pm_resume() is a late_initcall, which means it is called after most subsystems and drivers have been registered and initialized, including all non-modular PM drivers. It calls pm_hibernate_load(), which is responsible for attempting to read, load, and restore a saved memory image. Doing this replaces the currently running system with a saved one, and execution returns to after the call to pm_hibernate_save().

One way or another, the PM core proceeds to power on all devices and restore interrupts. The driver's restore() method is called to restore low-level system state, and device_pm_resume() is called to restore device context. Finally, the driver's cleanup() method is called, processes are 'thawed,' and preemption is reenabled.

A suspend transition is triggered by writing the requested state to the sysfs file /sys/power/state. Once the transition is complete, execution returns to the process that wrote the value.
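For example, assuming a registered driver that supports the standby state, a transition could be requested from a shell as follows (illustrative only):

# echo standby > /sys/power/state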

4 Hibernation

This section describes the Hibernate power state; specifically, the process the PM core uses to save memory to a persistent medium, and the model for implementing a low-level backend driver to read and write saved state from a specific medium.

As mentioned in the previous section, Hibernate is a low-power state in which system memory state is saved to a persistent medium before the system is powered off, and restored during the system boot sequence.

Hibernate is the only low-power state that can be used in the absence of any platform support for power management. Instead of entering a low-power state, the configured PM driver may simply turn the system off. This mechanism provides perfect power savings (by not consuming any power), and can be used to work around broken power management firmware or hardware. The PM core registers a default platform driver that supplies this mechanism. It is named 'shutdown' and supports the Hibernate state only.

Hibernation can also add value in situations which would otherwise ignore standard power management concepts. For example, system state can be saved and restored should a battery (either in a laptop or a UPS) become critically low. Or, system state could be saved when the kernel Oops'd or hit a BUG(); the system could then be rebooted and the state examined later.


Hibernation Backend Drivers

Hibernate is commonly referred to as "suspend-to-disk," implying that the medium that system state is saved to is a physical disk. This term does not allow for the possibility that another type of medium may be used to capture state, nor does it make any distinction about how the state is stored on disk, since it could theoretically be stored on a dedicated partition, in free swap space, or in a regular file on an arbitrary filesystem.

The PM subsystem offers the ability to configure a variable medium type to save state to.

struct pm_backend {
        int    (*open)(void);
        void   (*close)(void);
        int    (*read_image)(void);
        int    (*write_image)(void);
        struct kobject kobj;
};

int pm_backend_register(struct pm_backend *);

void pm_backend_unregister(struct pm_backend *);
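A minimal, hypothetical backend skeleton registering itself with this interface might look like the following. All names and return values are illustrative; a real backend must implement the image I/O described under "Backend Driver Semantics."

/* Hypothetical Hibernate backend skeleton; function bodies are stubs. */
static int  null_backend_open(void)        { return 0; }
static void null_backend_close(void)       { }
static int  null_backend_read_image(void)  { return -ENODEV; }
static int  null_backend_write_image(void) { return -ENODEV; }

static struct pm_backend null_backend = {
        .open        = null_backend_open,
        .close       = null_backend_close,
        .read_image  = null_backend_read_image,
        .write_image = null_backend_write_image,
};

static int __init null_backend_init(void)
{
        return pm_backend_register(&null_backend);
}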

The PM core provides a default backend driver named pmdisk that uses a dedicated partition type to save state. The internals of pmdisk are discussed later.

Backend drivers are registered as children of the Hibernate pm_state object, and are represented by directories in sysfs.

They may also define and export attributes using the following interface:

struct pm_backend_attr {
        struct attribute attr;
        ssize_t (*show)(struct pm_backend *, char *);
        ssize_t (*store)(struct pm_backend *, const char *, size_t);
};

int pm_backend_attr_create(struct pm_backend *, struct pm_backend_attr *);

void pm_backend_attr_remove(struct pm_backend *, struct pm_backend_attr *);

Snapshotting Memory

The Hibernate core 'snapshots' system memory by indexing and copying every active page in the system. Once a snapshot is complete, the saved image and index are passed to the backend driver to store persistently.

The snapshot process has one critical requirement: at least half of memory must be free. This imposes a strict limitation on the use of the current Hibernate implementation during periods of high memory usage. However, this design decision simplifies the requirements of the implementation itself.

The snapshot sequence is a three-step process. First, all of the active pages in the system are indexed; next, enough new pages are allocated to clone these pages; finally, each page is copied into its clone, or "shadow."

Active pages are detected by iterating over each page frame number ('pfn') in the system and determining whether it should be saved or not. A page's saveability is initially determined by whether or not the PageNosave bit is set, and then by whether the page is free or not. Reserved pages may not be saveable, depending on whether they exist in the '__nosave' data section.

Pages marked Nosave or declared in the __nosave section (with the '__nosavedata' suffix) are volatile data and variables internal to the Hibernate core. They are used and modified during the snapshot process, and are not saved.
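The saveability decision described above can be sketched as follows. This is a conceptual illustration, not the actual kernel code; declared_in_nosave_section() is a hypothetical helper.

/* Conceptual sketch of the per-pfn saveability check. */
static int saveable(unsigned long pfn)
{
        struct page *page = pfn_to_page(pfn);

        if (PageNosave(page))           /* volatile Hibernate-core data */
                return 0;
        if (PageReserved(page) &&
            declared_in_nosave_section(page))   /* hypothetical helper */
                return 0;
        if (!page_count(page))          /* free page, nothing to save */
                return 0;
        return 1;
}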

Saveable pages are indexed in page-sized arrays called pm_chapters:

#define PG_PER_CHAPT \
        (PAGE_SIZE / sizeof(pgoff_t))

struct pm_chapter {
        pgoff_t c_pages[PG_PER_CHAPT];
};

pm_chapters are dynamically allocated based on the number of saveable pages in the system. The addresses of the allocated chapters are stored in another page-sized array, called a pm_volume:

#define CHAPT_PER_VOL \
        (PAGE_SIZE / \
         sizeof(struct pm_chapter *))

struct pm_volume {
        struct pm_chapter *v_chapters[CHAPT_PER_VOL];
};

There are two static pm_volumes in the Hibernate core: one for the memory index (pm_mem_index), and one for the snapshot (pm_mem_shadow). This imposes an upper limit on the amount of memory that can be snapshotted by the Hibernate core:

CHAPT_PER_VOL * PG_PER_CHAPT * PAGE_SIZE / 2

is the number of bytes that can be saved, assuming half of memory must be free to store the snapshot. On a 32-bit x86 machine with 4K-sized pages, this works out to be:

1024 * 1024 * 4096 / 2 = 2,147,483,648 bytes = 2 GB

which is more than enough, since accessing memory above 1GB requires 4M-sized pages.

After memory has been indexed, but before it has been copied, the contents of pm_mem_index and pm_mem_shadow are copied to pm_mem_clone and pm_shadow_clone. The latter are also statically allocated objects, but are not declared "nosave." The purpose of the clones is to save the addresses of the dynamically allocated chapter pages so we can free them once the saved image has been restored.

At this stage, the Hibernate core calls a required architecture-specific function:

int pm_arch_hibernate(pm_system_state_t state);

The state parameter should be set to POWER_HIBERNATE. This call is responsible for saving low-level register state and calling pm_hibernate_save(), which copies each indexed page in pm_mem_index to its corresponding page in pm_mem_shadow.

Restoring Memory

During a resume sequence, the Hibernate core calls the backend's open() method, which is responsible for setting pm_num_pages, which the Hibernate core will use to pre-allocate pm_mem_index and pm_mem_shadow.

The backend's read_image() method is then called, which populates pm_mem_index with the target location of each saved page, and pm_mem_shadow, which contains the saved pages.

The saved image will replace the memory of a different running system. The pages that have been allocated to hold the saved image read in from the backend may conflict with pages in the saved image that are to be restored. The Hibernate backend must guarantee that none of the pages currently pointed to by pm_mem_shadow conflict with the pages indexed by pm_mem_index. To do this, it loops through each page address in pm_mem_shadow and compares it with each page address in pm_mem_index. If any matches are found, a new page is allocated and the contents copied.

To replace memory, the Hibernate core calls

pm_arch_hibernate(POWER_ON);

The architecture is responsible for iterating over the pages in pm_mem_shadow and copying each one to its destination, as indexed in pm_mem_index. It is also responsible for restoring low-level register state once memory has been replaced.

This burden is placed on the architecture so it can implement a replacement algorithm without using the stack for variable storage. The saved memory image contains the saved stack, while the current stack pointer register will point to a location on the stack in the memory being replaced. These will likely not match, causing the system to crash very quickly.

Once the memory image is restored, the architecture must restore register context to get the stack pointer pointing to the right place. This is the reason that the same function is called to both save and restore the low-level registers.

Returning from pm_arch_hibernate() once memory has been replaced will restore execution to the point in hibernate_write() where pm_arch_hibernate() was called in the saving sequence. To detect this, the Hibernate core declares:

static int in_suspend __nosavedata = 0;

and sets it to one during the save path. Since it is not saved, it will be 0 during the restore path, allowing the Hibernate core to behave appropriately. The cloned volumes are copied back into pm_mem_index and pm_mem_shadow, and the dynamically allocated pages are freed.
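Conceptually, the flag is used as follows (a sketch, not the actual code):

/* Save path: */
in_suspend = 1;
pm_arch_hibernate(POWER_HIBERNATE);
if (in_suspend) {
        /* still in the saving sequence: hand the image to the backend */
} else {
        /* execution resumed from a restored image; in_suspend is 0
         * because __nosavedata is excluded from the saved image */
}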

Backend Driver Semantics

The Hibernate core calls the backend driver's open() method before any Hibernate operation. It is the backend's responsibility to verify the existence of the medium and to open any necessary communication channels to it. The backend driver is responsible for reading image metadata from the medium and setting pm_num_pages to the number of saved pages if a saved image exists. The Hibernate core will use this value to pre-allocate storage for the saved pages.

It may also use this opportunity to verify there is enough free space on the device. The maximum requirement is the total amount of memory in the system, as indicated by:

num_physpages * PAGE_SIZE

This check is optional at this stage, since the size of the saved memory image may be much smaller than this, and may fit on a device with less free space than the total size of memory.

When the Hibernate core is done, it will call the backend's close() method. The backend is responsible for closing any communication channels to the storage medium and freeing any memory it had allocated.

After the Hibernate core has shadowed memory, it calls the backend's write_image() method. It does not pass any parameters; pm_mem_index and pm_mem_shadow must be used directly. The backend must save each page pointed to in each chapter of pm_mem_shadow. It must also save each chapter page of pm_mem_index. The exact format in which these are saved is up to the driver.

When restoring a memory image, after the Hibernate core has allocated storage for the saved memory, the backend's read_image() method is called. pm_mem_index contains enough allocated chapters to store the saved chapters, and pm_mem_shadow contains enough allocated chapters and pages to store all of the saved pages. The backend must populate all of these.

pmdisk

pmdisk is a simple Hibernate backend driver. It uses a dedicated partition with a custom format for storing system state. Internally, pmdisk uses the bio layer to read and write pages directly to/from the disk.

A pmdisk partition may be created using a utility called pmdisk. It simply writes a pmdisk header to the partition, which is defined as:

#define PM_HIBERNATE_SIG "PMHibernate"
#define PM_HIBERNATE_VER 1

#define PM_UNUSED_SPACE \
        (PAGE_SIZE - (4 * \
         sizeof(unsigned long) + 16))

struct pmdisk_header {
        char h_unused[PM_UNUSED_SPACE];
        unsigned long h_version;
        unsigned long h_chksum;
        unsigned long h_pages;
        unsigned long h_chapters;
        char h_sig[16];
} __attribute__((packed));

Internally, the pmdisk backend driver reads the header from the first page of the configured partition when its open() method is called. It verifies that it is a pmdisk partition, and sets pm_num_pages if there is an image stored on the disk.

On a close() call, pmdisk sets the h_pages, h_chapters, and h_chksum fields of the header and writes it to the first page on the disk. Note that on a memory restore operation, pm_num_pages will be 0, signifying that the memory image on the disk is no longer valid.

A saved memory image on a pmdisk partition is laid out as follows:

0        : pmdisk header
1 to Nc  : saved chapters of pm_mem_index
Nc to Np : saved pages from pm_mem_shadow

On a write_image() call, pmdisk will first initialize an internal checksum variable. It will then write each chapter from pm_mem_index to disk, then each page from pm_mem_shadow to disk. As it writes each page, it passes it to a checksum function. The checksum function is simple and definitely not cryptographically secure, but it does provide an easy verification that an image on disk is valid.
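A simple checksum of that kind might look like the following sketch; the function name and the exact algorithm are assumptions for illustration, not the actual pmdisk code.

/* Illustrative additive page checksum; not cryptographically secure. */
static unsigned long pm_checksum(unsigned long sum, void *page_addr)
{
        unsigned long *p = page_addr;
        unsigned int i;

        for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
                sum += p[i];
        return sum;
}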

On a read_image() call, pmdisk reads each chapter into pm_mem_index and each page into pm_mem_shadow. As it reads each page, it checksums it. Once all pages have been read, it compares the current checksum with the h_chksum field of the header. It returns success only if they match.

The pmdisk backend exports a sysfs attribute file named 'dev' which userspace must use to tell the kernel the correct pmdisk partition to use. There is currently no way for pmdisk to automatically detect any valid partitions in the system.

The value that userspace must write to dev is a 16-bit dev_t value in hexadecimal format containing the major/minor number pair of the device to use. This format is not favored, but is the only current method for obtaining a reference to a specific block device at the time of writing. This interface will change in the future.
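For example, assuming the backend's sysfs directory appears under /sys/power/ and assuming a partition with major/minor numbers 8:2 (both the path and the device are illustrative assumptions), userspace would write the hexadecimal value:

# echo 802 > /sys/power/pmdisk/dev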

5 Other Power Management Options

So far, much has been said about the internals of the new power management subsystem, but little about how the new infrastructure interacts with existing power management options. This section describes those relationships; although it focuses on options specific to ia32 platforms, the relationships should be extendable to other platforms.

ACPI

In terms of system power management, ACPI fits nicely into the new PM infrastructure. It behaves as a PM driver, and provides platform-specific hooks to transition the system into a low-power state. At the basic SPM level, this is all that is required, though ACPI offers a potentially much more powerful solution, since it exposes more intimate knowledge of the platform's power requirements than has ever been available on ia32 platforms (e.g., response latencies, power consumption, etc.). It is up to the ACPI platform driver to exploit this knowledge and expose these attributes via sysfs.

ACPI offers similar potential for device power management. Devices that appear in the firmware's DSDT (Differentiated System Description Table) may expose a very fine-grained level of detail about their power requirements and capabilities.

ACPI also stresses the capabilities of device Performance States. A performance state is a power state that describes a trade-off between the capabilities of a device and its power consumption. In each performance state that a device supports, the device is fully running, but different functional hardware units may be powered off to conserve power. The driver model does not explicitly recognize performance states, though the new PM extensions to the driver model provide a framework that could easily be extended to recognize them.

APM

APM power management does not appear on very many new systems, but the current Linux installed base includes a large number of APM-capable computers. The new PM model was not developed with APM, or any firmware-driven PM model, in mind. However, care was taken to ensure that it conceptually makes sense to use such mechanisms as low-level platform drivers for the PM model. No work has been done, however, to convert APM to act as a PM driver for the new model.

pm infrastructure

The original PM infrastructure was developed by Andrew Henroid and was very ground-breaking, since nothing like it had been done for the Linux kernel before. It exists in its entirety in:

kernel/pm.c
include/linux/pm.h

The general idea is that drivers can declare and register an object with the pm infrastructure that is accessed during a power state transition. The idea is very similar to what we have now, though the registration is now implicit when a device is registered with the system. And, based on the implementation, we can guarantee that each device is notified in proper ancestral order, which the old model cannot do.
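For reference, a driver using the old interface registered a callback roughly as follows (sketched from memory of the old pm.h interface; the exact types and constants may differ):

/* Old-style pm registration sketch (approximate 2.4 interface). */
static int mydev_pm_callback(struct pm_dev *dev,
                             pm_request_t rqst, void *data)
{
        switch (rqst) {
        case PM_SUSPEND:
                /* save device state and stop activity */
                break;
        case PM_RESUME:
                /* restore device state */
                break;
        }
        return 0;
}

/* at driver initialization time: */
/* pm_dev = pm_register(PM_UNKNOWN_DEV, 0, mydev_pm_callback); */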

Because the new model is far superior to the old-style pm infrastructure, the latter has been declared deprecated. All drivers that implement pm callbacks should be converted to use the hooks provided by the new driver model.

swsusp

swsusp is a mechanism for doing suspend-to-disk by saving kernel state to unused swap space. It was also a ground-breaking feature, as it was the first true suspend-to-disk implementation for Linux. Many people have raised questions about certain characteristics of swsusp, which its maintainers counter are frivolous concerns, and it currently exists as an alternative to the new PM model. However, I have withdrawn any philosophical objections to swsusp. It can be, and should be, ported to act as a backend driver for the generic Hibernate mechanism; its code base could then be reduced to a fraction of its current complexity.

6 Resources

The current power management kernel tree can be found in the BitKeeper repository:

bk://developer.osdl.org/linux-2.5-power

Information about the Linux power management infrastructure, including GNU diffs, documentation, and utilities like pmdisk, can be found at

http://developer.osdl.org/~mochel/power/

General OSDL developer resources can be found at:

http://developer.osdl.org/

7 Acknowledgements

Many people have contributed to this document, both explicitly and implicitly. First, Linus deserves a mention for encouraging me to look at implementing ACPI suspend-to-ram as my first kernel project. Andy Grover and Paul Diefenbaugh of Intel for many things: contributing ACPI to the kernel, talking with me, always arguing with and motivating me to do things better, and pushing me over the edge to write the finest OS driver model in existence. Andy Henroid for writing the first open-source power management model and providing a great base, despite its shortcomings, to learn and build from. Pavel Machek for constantly providing code and being energetic about the project. All the swsusp people for doing it in the first place and keeping it up, no matter how much I gripe about it.

Linux is a trademark of Linus Torvalds. BitKeeper is a trademark of BitMover, Inc.


Bringing PowerPC Book E Processors to Linux

Challenges in porting Linux to the first PowerPC Book E processor implementation

Matthew D. Porter
MontaVista Software, Inc.

[email protected] | [email protected]

Abstract

The PowerPC Book E1 architecture introduced the first major change to the PowerPC architecture since the original Green Book2 PowerPC processors were introduced. Central to the Book E architectural changes is the MMU, which is always in translation mode, even during exception processing. This presented some unique challenges for cleanly integrating the architecture into the Linux/PPC kernel.

In addition to the base PowerPC Book E architecture changes, the first IBM PPC440 core implementation included 36-bit physical addressing support. Since I/O devices are mapped above the native 32-bit address space, providing support for this feature illuminated several limitations within the kernel resource management and mapping system.

1 Overview of PowerPC Book E architecture

1.1 Book E MMU

It is important to note that Book E is a 64-bit processor specification that allows for a 32-bit implementation. Many of the register descriptions in the specification are written describing 64-bit registers where appropriate. In this paper, discussions of Book E architecture describe the 32-bit variants of all registers. Currently, all announced PowerPC Book E compliant processors are 32-bit implementations of the specification.

1 Full title of the specification is Book E: Enhanced PowerPC Architecture.

2 Full title is PowerPC Microprocessor Family: The Programming Environments for 32-Bit Microprocessors.

In order to understand the Book E architecture, it is useful to follow the history of the original PowerPC architecture. The original PowerPC architecture was defined at a very detailed level in the Green Book. This architecture provides fine details on how the MMU, exceptions, and all possible instructions should operate. The familiar G3 and G4 processor families are recent examples of implementations of the Classic PPC3 architecture.

The Book E architecture is the result of a collaboration between IBM and Motorola to produce a PowerPC extension which lends itself to the needs of embedded systems. One of the driving forces behind the specification was the desire for each silicon manufacturer to be able to differentiate their products. Due to this requirement, the specification falls short of providing enough detail to ensure that system software can be shared among Book E compliant processors.

3 An affectionate name bestowed upon all processors that conform to the Green Book specification.


With the bad news out of the way, it can be said that all Book E processors share some common architecture features. In Book E, foremost is the requirement that MMU translation is always enabled. This is in sharp contrast to the Classic PPC architecture, which uses the more traditional approach of powering up in real mode and disabling the MMU upon taking an exception.

Book E, on the other hand, powers up with a TLB entry active at the system reset vector. This ensures that the Initial Program Load (IPL) code can execute to the point of loading additional TLB entries for system software start up. The Book E architecture also defines several standard page sizes from 1KB through 1TB. In addition, Book E calls for the existence of two unique address spaces, AS0 and AS1. AS0 and AS1 are intended to facilitate the emulation of Classic PPC real mode on a Book E processor. This property can best be described by comparing the Classic PPC MMU translation mechanism to the manner in which Book E processors switch address spaces.

A Classic PPC processor has Instruction Translation (IR) and Data Translation (DR) bits in its Machine State Register (MSR). These bits are used to enable or disable MMU translation. A Book E processor has the same bits in the MSR, but they are called Instruction Space (IS) and Data Space (DS). The IS and DS bits are used a little differently, since they control the current 4GB virtual address space that the processor is executing within. Both Classic PPC and Book E processors set these bits to zero when an exception is taken. On a Classic PPC, this disables the MMU for exception processing. On a Book E processor, this switches to AS0. If the kernel and user space are run in the context of AS1, then TLB entries for AS0 can be used to emulate Classic PPC real mode operation.

1.2 Book E exception vectors

The Book E specification allows for a large number of exception vectors to be implemented. Sixteen standard exceptions are listed, and space is reserved for an additional 48 implementation-dependent exceptions.

The Book E exception model differs from the Classic PPC model in that the exception vectors are not at fixed memory offsets. Classic PPC exception vectors are each allocated 256 bytes. Using a bit in the MSR, the vectors can be located at the top or bottom of the PPC physical memory map.

Book E processors have an Interrupt Vector Prefix Register (IVPR) and Interrupt Vector Offset Registers (IVORs) to control the location of exception vectors in the system. The IVPR is used to set the base address of the exception vectors. Each IVORn register is used to set the respective offset from the IVPR at which the exception vector is located.

2 PPC440GP Book E processor

The first PowerPC Book E processor implementation was the IBM PPC440GP. This processor's PPC440 core was the next evolutionary step from the PPC405 cores, which were a cross between a Classic PPC and a Book E PPC design.

The PPC440 core has a 64-entry unified Translation Lookaside Buffer (TLB) as its major implementation-specific MMU feature. This TLB design relies on software to implement any desired TLB entry locking, determine appropriate entries for replacement, and perform page table walks and load TLB entries. This approach is very flexible, but can be performance limiting when compared to processors that provide a hardware table walk, Pseudo Least Recently Used (PLRU) replacement algorithms, and TLB entry locking mechanisms.

The PPC440 core also implements a subset of the allowed Book E page sizes. Implemented sizes range from 1KB to 256MB, but exclude the 4MB and 64MB page sizes.

3 Existing Linux/PPC kernel ports

Linux/PPC already has a number of sub-architecture families which require their own head.S implementation. head.S is used by Classic PPC processors, head_8xx.S is used by the Motorola MPC8xx family, and head_4xx.S is used by the IBM PPC40x family of processors. In order to encapsulate the unique features of a Book E processor, it was necessary to create an additional head_440.S.

Traditionally, PowerPC has required fixed exception vector locations, so all ports have followed the basic Classic PPC head.S structure of a small amount of code at the beginning of the kernel image which branches over the exception vector code that is resident at fixed vector locations. This is true even of the PPC405's head_4xx.S, even though the PPC405 offers dynamic exception vectors in the same manner as a Book E compliant processor. With the standard Linux/PPC linker script, head.S is guaranteed to be at the start of the kernel image, which must be loaded at the base of system memory.

4 Initial Book E kernel port

4.1 Overview

The first Book E processor kernel port in the community was done on the Linux/PPC 2.4 development tree.4 The PPC440GP was the first publicly available Book E compliant processor available to run Linux.

4 Information on Linux/PPC kernel development trees can be found at http://penguinppc.org/dev/kernel.shtml

4.2 MMU handling approaches

Several approaches were considered for implementing the basic handling of exception processing within the constraints of a Book E MMU. With the MMU always enabled, it is not possible for the processor to access instructions and data using a physical address during exception processing. At a minimum, it is necessary to have a TLB entry covering the PPC exception vectors to ensure that the first instruction of a given exception implementation can be fetched.

One implementation path is to create TLB entries that cover all of kernel low memory within the exception processing address space (AS0). These entries would be locked so they could not be invalidated or replaced by kernel or user space TLB entries. This is the simplest approach for the 2.4 kernel, where page tables are limited to kernel low memory. This allows task structs and page tables to be allocated via the normal __get_free_page() or __get_free_pages() calls using the GFP_KERNEL flag. Unfortunately, this approach has some drawbacks when applied to a PPC440 System on a Chip (SoC) implementation.

The PPC440 core provides large TLB sizes of 1MB, 16MB, and 256MB. A simple solution would be to cover all of kernel low memory with locked 256MB TLB entries. By default, Linux/PPC restricts maximum kernel low memory to 768MB. This would only require a maximum of 3 entries in the TLB to be consumed on a permanent basis. Unfortunately, this approach will not work, since the behavior of the system is undefined in the event that system memory is not a multiple of 256MB. In practice, this generates speculative cache line fetches past the end of system memory, which result in a Machine Check exception.

The next logical solution would be to use a combination of locked large TLB entries to cover kernel low memory. In this approach, we quickly run into a situation where the locked TLB entries consume too much of the 64-entry TLB. Consider a system with 192MB of system RAM. In this system, it would be necessary to lock twelve 16MB TLB entries permanently to cover all of kernel low memory. This approach would leave only 52 TLB entries available for dynamic replacement. Artificially limiting the already small TLB would put further pressure on the TLB and most likely adversely affect performance.

4.3 Linux 2.4 MMU Solution

A different approach is necessary because there does not seem to be a good method to lock all of kernel low memory into the PPC440 TLB. One possible approach is to limit the area in which kernel data structures are allocated by creating a special pool of memory. Implementing the memory pool approach involves the following steps:

1. Force all kernel construct allocation to occur within a given memory region.

2. Ensure that the given memory region is covered by a locked TLB within exception space.

The system is already required to maintain one locked TLB entry to ensure that instructions can be fetched from the exception vectors without resulting in a TLB miss. Therefore, the kernel construct memory region can simply be the pool of free memory that follows the kernel at the base of system memory. The locked TLB entry is then set to a size of 16MB to ensure that it covers both the kernel (including exception vectors) and some additional free memory. A TLB entry size of 16MB was chosen because it is the smallest amount of RAM one could conceivably find on a PPC440 system.

It was then necessary to create a facility to control allocation from a given memory region. The easiest way to force allocation of memory from a specific address range in Linux is to make use of the GFP_DMA flag to the zone allocator calls. The allocation of task structs, pgds, and ptes was modified to result in an allocation from the DMA zone. Figure 1 shows a code fragment demonstrating how this is implemented for PTE allocation.

The PPC memory management initialization was then modified to ensure that 16MB of memory is placed into ZONE_DMA and the remainder ends up in ZONE_NORMAL or ZONE_HIGHMEM as appropriate.

With this structure, all kernel stacks and page tables are allocated within ZONE_DMA. The single locked TLB entry for the first 16MB of system memory ensures that no nested exceptions can occur while processing an exception.

One complication that resulted from using the ZONE_DMA zone in this manner is that there can be many early consumers of low memory in ZONE_DMA. It was necessary to place an additional kludge in the early Linux/PPC memory management initialization to ensure that some amount of the 16MB ZONE_DMA region would be free after the bootmem allocator was no longer in control of system memory. This was encountered when a run with 1GB of system RAM caused the page structs to nearly consume all of the ZONE_DMA region. This, of course, is a fatal condition due to the allocation of all task structs and page tables from ZONE_DMA.


static inline pte_t *
pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
        pte_t *pte;
        extern int mem_init_done;
        extern void *early_get_page(void);

        if (mem_init_done)
#ifndef CONFIG_440
                pte = (pte_t *) __get_free_page(GFP_KERNEL);
#else
                /* Allocate from GFP_DMA to get entry in pinned TLB region */
                pte = (pte_t *) __get_free_page(GFP_DMA);
#endif
        else
                pte = (pte_t *) early_get_page();

        return pte;
}

Figure 1: pte_alloc_one() implementation

4.4 Virtual exception processing

One minor feature of the PPC440 port is the use of dynamic exception vectors. As allowed by the Book E architecture, exception vectors are placed in head_440.S using the following macro:

#define START_EXCEPTION(label) \
        .align 5; \
label:

This is used to align each exception vector entry to a 32-byte boundary, as required by the PPC440 core. The following code from head_440.S shows how the macro is used at the beginning of an exception handler:

/* Data TLB Error Interrupt */
START_EXCEPTION(DataTLBError)
mtspr   SPRG0, r20

This code fragment illustrates how each exception vector is configured based on its link location:

SET_IVOR(12, WatchdogTimer);
SET_IVOR(13, DataTLBError);
SET_IVOR(14, InstructionTLBError);

The SET_IVOR macro moves the label address offset into a Book E IVOR. The first parameter specifies which IVOR is the target of the move. Once the offsets are configured and the IVPR is configured with the exception base prefix address, exceptions will then be routed to the link-time specified vectors.

An interesting thing to note is that the Linux 2.4 Book E kernel actually performs exception processing at kernel virtual addresses, i.e., the exception vectors are located at an offset from 0xc0000000.

5 New Book E kernel port

5.1 Overview

Working to get the Book E kernel support into the Linux/PPC 2.5 development tree resulted in some discussions regarding the long-term viability of the ZONE_DMA approach used in the 2.4 port. One of the major issues is that the 2.5 kernel moved the allocation of task structs to generic slab-based kernel code. This move broke the current Book E kernel model, since it is no longer possible to force allocation of task structs to occur within ZONE_DMA. Another important reason for considering a change is that the current method is somewhat of a hack. That is, ZONE_DMA is used in a manner in which it was not intended.

5.2 In-exception TLB misses

The first method investigated to eliminate ZONE_DMA usage simply allows nested exceptions to be handled during exception processing. Exception processing code can be defined as the code path from when an exception vector is entered until the processor returns to kernel/user processing. On a lightweight TLB miss, this can happen immediately after a TLB entry is loaded. On heavyweight exceptions, this may occur when transfer_to_handler jumps to a heavyweight handler routine in kernel mode.

Upon examining the exception processing code, it becomes apparent that the only standard exception that can occur is the DataTLBError exception. This is because exception vector code must be contained within a locked TLB entry, so no InstructionTLBError conditions can occur. Further, early exception processing accesses a number of kernel data constructs. These include kernel stacks, pgds, and ptes. By writing a non-destructive DataTLBError handler, it is possible to safely process data TLB misses within exception processing code.

In order to make the DataTLBError handler safe, it is necessary not to touch any of the PowerPC Special Purpose General Registers (SPRGs) when a DataTLBError exception is taken. Instead, a tiny stack is created within the memory region covered by the locked TLB entry. This stack is loaded with the context of any registers that need to be used during DataTLBError processing. The following code fragment shows the conventional DataTLBError register save mechanism:

/* Data TLB Error Interrupt */
START_EXCEPTION(DataTLBError)
mtspr   SPRG0, r10
mtspr   SPRG1, r11
mtspr   SPRG4W, r12
mtspr   SPRG5W, r13
mtspr   SPRG6W, r14
mfcr    r11
mtspr   SPRG7W, r11
mfspr   r10, SPRN_DEAR

In the non-destructive version of the DataTLBError handler, the code looks like the following:

START_EXCEPTION(DataTLBError)
stw     r10,tlb_r10@l(0);
stw     r11,tlb_r11@l(0);
stw     r12,tlb_r12@l(0);
stw     r13,tlb_r13@l(0);
stw     r14,tlb_r14@l(0);
mfcr    r11
stw     r11,tlb_cr@l(0);
mfspr   r11, SPRN_MMUCR
stw     r11,tlb_mmucr@l(0);
mfspr   r10, SPRN_DEAR

Here, the tlb_* locations within the locked TLB region are used to save register state rather than the SPRGs.

If we were to continue to perform exception processing from native kernel virtual addresses, we would have a problem. The tlb_* locations allocated within head_44x.S would be at some offset from 0xc0000000. A store to any address with a non-zero most significant 16 bits would require that an intermediate register be used to load the most significant bits of the address.

This issue made it necessary to switch to emulation of Classic PPC real mode. This is accomplished by placing the dynamic exception vectors at a virtual address offset from address zero and providing a locked TLB entry covering this address space. By doing so, it became possible to access exception stack locations using zero-indexed loads and stores.

In the DataTLBError handler, each access to kernel data which may not have a TLB entry is protected. A tlbsx. instruction is used to determine if there is already a TLB entry for the address that is to be accessed. If a TLB entry exists, the access is made. However, if the TLB entry does not exist, a TLB entry is created before accessing the resource. This method is illustrated in the following code fragment based on the linuxppc-2.5 head_44x.S:

        /* Stack TLB entry present? */
3:      mfspr   r12,SPRG3
        tlbsx.  r13,0,r12
        beq     4f
        /* Load stack TLB entry */
        TLB_LOAD;
        /* Get current thread's pgd */
4:      lwz     r12,PGDIR(r12)

Using this strategy, the DataTLBError handler gains the ability to resolve any possible TLB miss exceptions before they can occur. Once it has performed the normal software page table walk and has loaded the faulting TLB entry, it can return to the point of the exception. Of course, that exception may now be either from a kernel/user context or from an exception processing context. A DataTLBError can now be easily handled from any context.

5.3 Keep It Simple Stupid

Sometimes one has to travel a long road to eventually come back to the simple solution. This project has been one of those cases. An implementation of the in-exception TLB miss method showed that the complexity of the TLB handling code had gone up by an order of magnitude. It is desirable (for maintenance and quality reasons) to keep the TLB handling code as simple as possible.

The KISS approach pins all of kernel low memory with 256MB TLB entries in AS0. The number of TLB entries is determined from the discovered kernel low memory size. A high water mark value is used to mark the highest TLB slot that may be used when creating TLB entries in AS1 for the kernel and user space. The remaining TLB slots are consumed by the pinned TLB entries.

This approach was previously thrown out due to the occurrence of speculative data cache fetches that would result in a fatal machine check exception. This situation occurs when the system memory is not aligned on a 256MB boundary. In these cases, the TLB entries cover unimplemented address space. The data cache controller will speculatively fetch past the end of system memory if any access is performed on the last cache line of the last system memory page frame.

The trick to make this approach stable is to simply reserve the last page frame of system memory so it may not be allocated by the kernel or user space. This could be done via the bootmem allocator, but in order to accomplish it during early MMU initialization it is necessary to utilize the PPC-specific mem_pieces allocation API. Using this trick allows for a simple (and maintainable) implementation of PPC440 TLB handling.

5.4 Optimizations

One clear enhancement to the low-level TLB handling mechanism is to support large page sizes for kernel low memory. This support is already implemented5 for the PPC405 family of processors, which implement a subset of the Book E page sizes. Enabling the TLB miss handlers to load large TLB entries for kernel low memory guarantees a lighter volume of exceptions taken from accesses of kernel low memory.

5 In the linuxppc-2.5 development tree.

Although this is a common TLB handling optimization in the kernel, a minor change to the KISS approach could eliminate the need to provide large TLB replacement for kernel low memory. The change is to simply modify the KISS approach to run completely from AS0. AS1 would no longer be used for user/kernel operation, since all code would run from the AS0 context. This yields the same performance gain from reduced TLB pressure as the large TLB replacement optimization. However, this variant leverages the TLB entries that are already pinned for exception processing.

6 36-bit I/O support

6.1 Overview of large physical address support

The PPC440GP processor implementation supports 36-bit physical addressing on its system bus. 36-bit physical addressing has already been supported on other processors with a native 32-bit MMU, as found in IA32 PAE implementations. However, the PPC440GP implements a 36-bit memory map with I/O devices above the first 4GB of physical memory.

The basic infrastructure by which large physical addresses are supported is similar to other architectures. In the case of PPC440GP, we define a pte to be an unsigned long long type. In order to simplify the code, we define our page table structure as the usual two-level layout, but with an 8KB pgd. Rather than allocating a single 4KB page for a pgd, we allocate two pages to meet this requirement.

In order to share some code between large physical address and normal physical address PPC systems, a new type is introduced:

#ifndef CONFIG_440
#include <asm-generic/mmu.h>
#else
typedef unsigned long long phys_addr_t;
extern phys_addr_t
fixup_bigphys_addr(phys_addr_t, phys_addr_t);
#endif

This typedef allows low-level PPC memory management routines to handle both large and normal physical addresses without creating a separate set of calls. On a PPC440-based core it is a 64-bit type, yet it remains a 32-bit type on all normal physical address systems.

6.2 Large physical address I/O kludge

The current solution for managing devices above 4GB is somewhat of a "necessary kludge." In a dumb bit of luck, the PPC440GP memory map was laid out in a way that made it easy to perform a simple translation of a 32-bit physical address (or Linux resource) into a 36-bit physical address suitable for consumption by the PPC440 MMU.

All PPC440GP on-chip I/O devices and PCI address spaces were neatly laid out so that their least significant 32 bits of physical address did not overlap.

A PPC440-specific ioremap() call is created to allow a 32-bit resource to be mapped into virtual address space. Figure 2 illustrates the ioremap() implementation.

This ioremap() implementation works by calling a translation function to convert a 32-bit resource into a 64-bit physical address. fixup_bigphys_addr() compares the 32-bit resource value to several PPC440GP memory map ranges. When it matches one distinct range, it concatenates the most significant 32 bits of the intended address range. This results in a valid PPC440GP 64-bit physical address that can then be passed to the local ioremap64() routine to create the virtual mapping.
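A sketch of what such a translation can look like is shown below; the range check and upper address bits are invented for illustration and do not reflect the actual PPC440GP memory map:

/* Illustrative only: the range and upper address bits are made up. */
phys_addr_t fixup_bigphys_addr(phys_addr_t addr, phys_addr_t size)
{
        /* hypothetical on-chip peripheral window that really lives
         * above the 32-bit boundary */
        if (addr >= 0xe0000000 && (addr + size) <= 0xffffffff)
                return 0x100000000ULL | addr;

        /* otherwise the 32-bit value is already the full physical address */
        return addr;
}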

This method works fine for maintaining compatibility with a large number of generic PCI device drivers. However, the approach quickly falls apart when a driver implements mmap().


void *ioremap(unsigned long addr, unsigned long size)
{
        phys_addr_t addr64 = fixup_bigphys_addr(addr, size);

        return ioremap64(addr64, size);
}

Figure 2: PPC440GP ioremap() implementation

The core of most driver mmap() implementations is a call to remap_page_range(). This routine is prototyped as follows:

int remap_page_range(unsigned long from,
                     unsigned long to,
                     unsigned long size,
                     pgprot_t prot);

The to parameter is the physical address of the memory region that is to be mapped. The current implementation assumes that a physical address size is always equal to the native word size of the processor. This is obviously now a bad assumption for large physical address systems, because it is not possible to pass the required 64-bit physical address.

Figure 3 shows the implementation of remap_page_range() for the mmap compatibility kludge.6

The physical address parameter is now passed as a phys_addr_t. On large physical address platforms, the fixup_bigphys_addr() call is implemented to convert a 32-bit value (normally obtained from a 32-bit resource) into a platform-specific 64-bit physical address. The remap_pmd_range() and remap_pte_range() calls likewise have their physical address parameters passed using a phys_addr_t.

This implementation allows device drivers implementing mmap() using remap_page_range() to run unchanged on a large physical address system. Unfortunately, this is not a complete solution to the problem.

6 Patches for this support can be found at ftp://source.mvista.com/pub/linuxppc/

The fixup_bigphys_addr() routine cannot necessarily be implemented for all possible large physical address space memory maps. Some systems will require discrete access to the full 36-bit or larger physical address space. In these cases, there is a need to allow the mmap() system call to handle a 64-bit value on a 32-bit platform.

6.3 Proper large physical address I/O support

One approach to resolving this issue is to change the parameters of remap_page_range() and friends, as was done in the mmap compatibility kludge. The physical address to be mapped would then be manipulated in a phys_addr_t. On systems with a native word size phys_addr_t there is no effect. The important piece of this approach is that all callers of remap_page_range() would need to do any manipulation of physical addresses using a phys_addr_t variable to ensure portability.

In the specific case of a fops->mmap implementation, a driver must now be aware that vma->pgoff can encode an address that is larger than the native word size. In any case, the vma->pgoff value would be shifted left by PAGE_SHIFT in order to yield a system-specific physical address in a phys_addr_t.


int
remap_page_range(unsigned long from, phys_addr_t phys_addr,
                 unsigned long size, pgprot_t prot)
{
        ...
        phys_addr = fixup_bigphys_addr(phys_addr, size);
        phys_addr -= from;
        ...
}

Figure 3: PPC440GP remap_page_range() implementation

Once we allow for 64-bit physical address mapping on 32-bit systems, it becomes necessary to expand the resource subsystem to match. In order for standard PCI drivers to remain portable across standard and large physical address systems, it is necessary to ensure that a resource can represent a 64-bit physical address on a large physical address system. Building on the approach of using phys_addr_t to abstract the native system physical address size, this can now be the native storage type for resource fields. In doing so, it is also important to extend the concept to user space to ensure that common applications like XFree86 can parse 64-bit resources on a 32-bit platform and cleanly mmap() a memory region.

7 Conclusion

The Linux kernel is remarkably flexible in handling ports to new processors. Despite significant architectural changes in the PowerPC Book E specification, it was possible to enable the PPC440GP processor within the existing Linux abstraction layer in a reasonable amount of time. As with all Linux projects, this one is still very much a work in progress. Development in the Linux 2.5 tree offers an opportunity to explore some new routes for better PowerPC Book E kernel support.

During this project, I have had the opportunity to learn which features of a Book E processor would be most useful in supporting a Linux kernel port. Clearly, the most important feature in this respect is an abundance of TLB entries. The PPC440 core's 64-entry TLB is the single most limiting factor for producing a simple Book E port. If the PPC440 core had 128 or 256 TLB entries, the work of porting Linux to the processor would have been far easier.

Although these new processors are now running the Linux kernel, this support does not yet address 100% of this platform's new architectural features. As with many areas in the Linux kernel, support for large physical address mapping needs to evolve with emerging processor technologies. Without a doubt, the increased number of processors implementing large physical address I/O functionality helps to make the Linux kernel community aware of the kernel requirements inherent in this technology.

8 Trademarks

IBM is a registered trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds.

MontaVista is a registered trademark of MontaVista Software, Inc.


Motorola is a registered trademark of Motorola Incorporated.

PowerPC is a registered trademark of International Business Machines Corporation. Other company, product, or service names may be trademarks or service marks of others.


Asynchronous I/O Support in Linux 2.5

Suparna Bhattacharya
Steven Pratt
Badari Pulavarty
Janet Morgan

IBM Linux Technology Center
[email protected], [email protected],
[email protected], [email protected]

Abstract

This paper describes the Asynchronous I/O (AIO) support in the Linux® 2.5 kernel, additional functionality available as patchsets, and plans for further improvements. More specifically, the following topics are treated in some depth:

• Asynchronous filesystem I/O

• Asynchronous direct I/O

• Asynchronous vector I/O

As of Linux 2.5, AIO falls into the common mainline path underlying all I/O operations, whether synchronous or asynchronous. The implications of this, and other significant ways in which the design for AIO in 2.5 differs from the patches that existed for 2.4, are explored as part of the discussion.

1 Introduction

All modern operating systems provide a variety of I/O capabilities, each characterized by their particular features and performance. One such capability is Asynchronous I/O, an important component of Enterprise Systems which allows applications to overlap processing with I/O operations for improved utilization of CPU and devices.

AIO can be used to improve application performance and connection management for web servers, proxy servers, databases, I/O-intensive applications and various others.

Some of the capabilities and features provided by AIO are:

• The ability to submit multiple I/O requests with a single system call.

• The ability to submit an I/O request without waiting for its completion and to overlap the request with other processing.

• Optimization of disk activity by the kernel through combining or reordering the individual requests of a batched I/O.

• Better CPU utilization and system throughput by eliminating extra threads and reducing context switches.


2 Design principles

An AIO implementation can be characterized by the set of design principles on which it is based. This section examines AIO support in the Linux kernel in light of a few key aspects and design alternatives.

2.1 External interface design alternatives

There are at least two external interface design alternatives:

• A design that exposes essentially the same interfaces for synchronous and asynchronous operations, with options to distinguish between modes of invocation [2].

• A design that defines a unique set of interfaces for asynchronous operations in support of AIO-specific requirements such as batch submission of different request types [4].

The AIO interface for Linux implements the second type of external interface design.

2.2 Internal design alternatives

There are several key features and possible approaches for the internal design of an AIO implementation:

• System design:

– Implement the entire path of the operation as fully asynchronous from the top down. Any synchronous I/O is just a trivial wrapper for performing asynchronous I/O and waiting for its completion [2].

– Synchronous and asynchronous paths can be separate, to an extent, and can be tuned for different performance characteristics (for example, minimal latency versus maximal throughput) [10].

• Approaches for providing asynchrony:

– Offload the entire I/O to thread pools (these may be either user-level threads, as in glibc, or kernel worker threads).

– Use a hybrid approach where initiation of I/O occurs asynchronously and notification of completion occurs synchronously using a pool of waiting threads ([3] and [13]).

– Implement a true asynchronous state machine for every operation [10].

• Mechanisms for handling user-context dependencies:

– Convert buffers or other such state into a context-independent form at I/O submission (e.g., by mapping down user pages) [10].

– Maintain dedicated per-address-space service threads to execute context-dependent steps in the caller's context [3].

The internal design of the AIO support available for Linux 2.4 and the support integrated into Linux 2.5 differ on the above key features. Those differences will be discussed in some detail in later sections.

Other design aspects and issues that are relevant to AIO but which are outside the main focus of this paper include ([8] describes some of these issues in detail):

• The sequencing of operations and steps within an operation that supports POSIX_SYNCHRONIZED_IO and POSIX_PRIORITIZED_IO requirements ([5]), as well as the extent of flexibility to order or parallelize requests to maximize throughput within reasonable latency bounds.

• AIO throttling: deciding on the queue depths and returning an error (-EAGAIN) when the depth is exceeded, or if resources are unavailable to complete the request (rather than forcing the process to sleep).

• Completion notification, queuing, and wakeup policies, including the design of a flexible completion notification API and optimization considerations like cache warmth, latency, and batching. In the Linux AIO implementation, every AIO request is associated with a completion queue. One or more application threads explicitly wait on this completion queue for completion event(s), where flexible grouping is determined at the time of I/O submission. An exclusive LIFO wakeup policy is used among multiple such threads, and a wakeup is issued whenever I/O completes.

• Support for I/O cancellation.

• User/kernel interface compatibility (per POSIX).

2.3 2.4 Design

The key characteristics of the AIO implementation for the Linux 2.4 kernel are described in [8] and available as patches at [10]. The design:

• Implements asynchronous I/O paths and interfaces separately, leaving existing synchronous I/O paths unchanged. Reuse of existing, underlying asynchronous code is done where possible, for example, raw I/O.

• Implements an asynchronous state machine for all operations. Processing occurs in a series of non-blocking steps driven by asynchronous waitqueue callbacks. Each stage of processing completes with the queueing of deferred work using "work-to-do" primitives. Sufficient state is saved to proceed with the next step, which is run in the context of a kernel thread.

• Maps down user pages during I/O submission. Modifies the logic to transfer data to/from mapped user pages in order to remove the user-context dependency for the copy to/from userspace buffers.

The advantages of these choices are:

• Synchronous I/O performance is unaffected by asynchronous I/O logic, which allows AIO to be implemented and tuned in a way that is optimal for asynchronous I/O patterns.

• The work-to-do primitive permits state to be carried forward to enable continuation from exactly where processing left off when a blocking point was encountered.

• The need for additional threads to complete the I/O transfer in the caller’s context is avoided.

There are, however, some disadvantages to the AIO implementation (patchset) for Linux 2.4:

• The duplication of logic between synchronous and asynchronous paths makes the code difficult to maintain.


• The asynchronous state machine is a complex model and therefore more prone to errors and races that can be hard to debug.

• The implementation can lead to inefficient utilization of TLB mappings, especially for small buffers. It also forces the pinning down of all pages involved in the entire I/O request.

These problems motivated a new approach for the implementation of AIO in the Linux 2.5 kernel.

2.4 The Linux 2.5 design

Although the AIO design for Linux 2.5 uses most of the core infrastructure from the 2.4 design, the 2.5 design is built on a very different model:

• Asynchronous I/O has been made a first-class citizen of the kernel. Now AIO paths underlie regular synchronous I/O interfaces instead of just being grafted on from the outside.

• A retry-based model replaces the earlier work-to-do state-machine implementation. Retries are triggered through asynchronous notification as each step in the process completes. However, in some cases, such as direct I/O, asynchronous completion notification occurs directly from interrupt context without requiring any retries.

• User-context dependencies are handled by making worker threads take on (i.e., temporarily switch to) the caller’s address space when executing retries.

In a retry-based model, an operation executes by running through a series of iterations. Each iteration makes as much progress as possible in a non-blocking manner and returns. The model assumes that a restart of the operation from where it left off will occur at the next opportunity. To ensure that another opportunity indeed arises, each iteration initiates steps towards progress without waiting. The iteration then sets up to be notified when enough progress has been made and it is worth trying the next iteration. This cycle is repeated until the entire operation is finished.
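The control flow of one such iteration might look roughly like the sketch below. This is an illustrative sketch only: the helper names and the exact division of work are assumptions based on the description above, not the actual 2.5 code.

    /* Illustrative sketch of one retry iteration (names are hypothetical). */
    void run_one_retry(struct kiocb *iocb)
    {
            ssize_t ret;

            /* Make as much progress as possible without blocking. */
            ret = iocb->ki_retry(iocb);

            if (ret == -EIOCBRETRY)
                    return;        /* an async wait was queued; its wakeup will
                                    * schedule this function again from an AIO
                                    * worker thread, continuing where we left off */

            /* No more retries needed: deliver the completion event. */
            aio_complete(iocb, ret, 0);
    }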

The implications and issues associated with the retry-based model are:

• Tuning for the needs of both synchronous and asynchronous I/O can be difficult because of the issues of latency versus throughput. Performance studies are needed to understand whether AIO overhead causes a degradation in synchronous I/O performance. It is expected that the characteristics are better when the underlying operation is already inherently asynchronous or rewritten to an asynchronous form, rather than just modified in order to be retried.

• Retries pass through some initial processing steps each time. These processing steps involve overhead. Saving state across retries can help reduce some of the redundant regeneration, albeit with some loss of generality.

• Switching address spaces in the retry thread can be costly. The impact would probably be experienced to a greater extent when multiple AIO processes are running. Performance studies are needed to determine if this is a problem.

Note that I/O cancellation is easier to handle in a retry-based model; any future retries can simply be disabled if the I/O has been cancelled.


Retries are driven by AIO workqueues. If a retry does not complete in a very short time, it can delay other AIO operations that are underway in the system. Therefore, tuning the AIO workqueues and the degree of asynchrony of retry instances each has a bearing on overall system performance.

3 AIO support for filesystem I/O

The Linux VFS implementation, especially as of the 2.5 kernel, is well-structured for retry-based I/O. The VFS is already capable of processing and continuing some parts of an I/O operation outside the user’s context (e.g., for readahead, deferred writebacks, syncing of file data, and delayed block allocation). The implementation even maintains certain state in the inode or address space to enable deferred background processing of writeouts. This ability to maintain state makes the retry model a natural choice for implementing filesystem AIO.

Linux 2.5 is currently without real support for regular (buffered) filesystem AIO. While ext2, JFS, and NFS define their aio_read and aio_write methods to default to generic_file_aio_read/write, these routines show fully synchronous behavior unless the file is opened with O_DIRECT. This means that an io_submit can block for regular AIO read/write operations while the application assumes it is doing asynchronous I/O.
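For reference, the relevant wiring in a 2.5-era filesystem looks roughly like the abridged sketch below (modeled on ext2; the exact set of fields varies between kernel revisions, so treat it as illustrative rather than authoritative):

    /* Abridged sketch of ext2's file_operations wiring in the 2.5 era;
     * only the entry points relevant to AIO are shown. */
    static struct file_operations ext2_file_operations = {
            .read      = generic_file_read,
            .write     = generic_file_write,
            .aio_read  = generic_file_aio_read,   /* synchronous unless O_DIRECT */
            .aio_write = generic_file_aio_write,
            /* ... other methods omitted ... */
    };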

Our implementation of the retry model for filesystem AIO, available as a patchset from [6], involved identifying and focusing on the most significant blocking points in an operation. This was followed by observations from initial experimentation and profiling results, and the conversion of those blocking points to retry exit points.

The implementation we chose starts retries at a very high level. Retries are driven directly by the AIO infrastructure and kicked off via asynchronous waitqueue functions. In synchronous I/O context, the default waitqueue entries are synchronous and therefore do not cause an exit at a retry point.

One of the goals of our implementation for filesystem AIO was to minimize changes to existing synchronous I/O paths. The intent was to achieve a reasonable level of asynchrony in a way that could then be further optimized and tuned for workloads of relevance.

3.1 Design decisions

• Level at which retries are triggered:

The high-level AIO code retries filesystem read/write operations, passing in the remaining parts of the buffer to be read or written with each retry.

• How and when a retry is triggered:

Asynchronous waitqueue functions are used instead of blocking waits to trigger a retry (to "kick" a dormant iocb into action) when the operation is ready to continue.

Synchronous routines such as lock_page, wait_on_page_bit, and wait_on_buffer have been given asynchronous variations. Instead of blocking, these routines queue an asynchronous wait and return with a special return code, -EIOCBRETRY.

The return value is propagated all the way up to the invoking AIO handler. For this process to work correctly, the calling routine at each level in the call chain needs to break out gracefully if a callee returns the -EIOCBRETRY exit code.

• Operation-specific state preserved across retries:


In our implementation [7], the high-level AIO code adjusts the parameters to read or write as retries progress. The parameters are adjusted by the retry routine based on the return value from the filesystem operation indicating the number of bytes transferred.

A recent patch by Benjamin LaHaise [11] proposes moving the filesystem API read/write parameter values to the iocb structure. This change would enable retries to be triggered at the API level rather than through a high-level AIO handler.

• Extent of asynchrony:

Ideally, an AIO operation is completely non-blocking. If too few resources exist for an AIO operation to be completely non-blocking, the operation is expected to return -EAGAIN to the application rather than cause the process to sleep while waiting for resources to become available.

However, converting all potential blocking points that could be encountered in existing file I/O paths to an asynchronous form involves trade-offs in terms of complexity and/or invasiveness. In some cases, this trade-off produces only marginal gains in the degree of asynchrony.

This issue motivated a focus on first identifying and tackling the major blocking points and less deeply nested cases, to achieve maximum asynchrony benefits with reasonably limited changes. The solution can then be incrementally improved to attain greater asynchrony.

• Handling synchronous operations:

No retries currently occur in the synchronous case. The low-level code distinguishes between synchronous and asynchronous waits, so a break-out and retry occurs only in the latter case, while the process blocks as before in the event of a synchronous wait. Further investigation is required to determine if the retry model can be used uniformly, even for the synchronous case, without performance degradation or significant code changes.

• Compatibility with existing code:

– Wrapper routines are needed for synchronous versions of asynchronous routines.

– Callers that cannot handle asynchronous returns need special care, e.g., making sure that a synchronous context is specified to potentially asynchronous callees.

– Code that can be triggered in both synchronous and asynchronous mode may present some tricky issues.

– Special cases, like code that may be called via page faults in asynchronous context, may need to be treated carefully.

3.2 Filesystem AIO read

A filesystem read operation results in a page cache lookup for each full or partial page of data requested to be read. If the page is already in the page cache, the read operation locks the page and copies the contents into the corresponding section of the user-space buffer. If the page isn’t cached, then the read operation creates a new page-cache page and issues I/O to read it in. It may, in fact, read ahead several pages at the same time. The read operation then waits for the I/O to complete (by waiting for a lock on the page), and then performs the copy into user space.


Based on initial profiling results, the crucial blocking points identified in this sequence were found to occur in:

• lock_page

• cond_resched

• wait_on_page_bit

Of these routines, the following were converted to retry exit points by introducing corresponding versions of the routines that accept a waitqueue entry parameter:

lock_page        --> lock_page_wq
wait_on_page_bit --> wait_on_page_bit_wq

When a blocking condition arises, these routines propagate a return value of -EIOCBRETRY from generic_file_aio_read. When unblocked, the waitqueue routine which was notified activates a retry of the entire sequence.
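A simplified sketch of what one of these waitqueue-aware variants might look like is shown below. The helpers is_sync_wait() and queue_page_wait(), and the use of an io_wait field on the task, are assumptions made for illustration; the actual patch differs in detail.

    /* Hypothetical sketch of a waitqueue-aware lock_page() variant.
     * In synchronous context it blocks as before; in AIO context it
     * queues an asynchronous wait and backs out with -EIOCBRETRY. */
    int lock_page_wq(struct page *page, wait_queue_t *wait)
    {
            if (!TestSetPageLocked(page))
                    return 0;                     /* lock acquired, no wait needed */

            if (!is_sync_wait(wait)) {
                    queue_page_wait(page, wait);  /* hypothetical helper */
                    return -EIOCBRETRY;           /* retry when the page unlocks */
            }

            __lock_page(page);                    /* synchronous path: block */
            return 0;
    }

    /* Each caller in the chain must then break out gracefully: */
    ret = lock_page_wq(page, current->io_wait);
    if (ret == -EIOCBRETRY)
            return ret;                           /* propagate to the AIO handler */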

As an aside, the existing readahead logic helps reduce retries for AIO just as it helps reduce context switches for synchronous I/O. In practice, this logic does not actually cause a volley of retries for every page of a large sequential read.

The following are other potential blocking points that may occur in a filesystem read path and that have not yet been converted to retry exits:

• cond_resched

• meta-data read (get block and read of block bitmap)

• request-queue congestion

• atime updates (corresponding journal updates)

Making the underlying readpages operation asynchronous by addressing the last three blocking points above might require more detailed work. Initial results indicate that significant gains have already been realized without doing so.

3.3 Filesystem AIO write

The degree of blocking involved in a synchronous write operation is expected to be less than in the read case. This is because (unless O_SYNC or O_DSYNC is specified) a write operation only needs to wait until file blocks are mapped to disk and data is copied into (written to) the page cache. The actual writeout to disk typically happens in a deferred way in the context of background kernel threads, or earlier via an explicit sync operation. However, for throttling reasons, a wait for pending I/O may also occur in write context.

Some of the more prominent blocking points identified in this sequence were found to occur in:

• cond_resched

• wait_on_buffer (during a get block operation)

• find_lock_page

• blk_congestion_wait

Of these, the following routines were converted to retry exit points by introducing corresponding versions of the routines that accept a waitqueue entry parameter:

down                --> down_wq
wait_on_buffer      --> wait_on_buffer_wq
sb_bread            --> sb_bread_wq
ext2_get_block      --> ext2_get_block_wq
find_lock_page      --> find_lock_page_wq
blk_congestion_wait --> blk_congestion_wait_wq


The asynchronous get block support has currently been implemented only for ext2, and is only used by ext2_prepare_write. All other instances where a filesystem-specific get block routine is involved use the synchronous version. In view of the kind of I/O patterns expected for AIO writes (for example, database workloads), block allocation has not been a focus for conversion to asynchronous mode.

The following are other potential blocking points that could occur in a filesystem write path and that have not yet been converted to retry exits:

• cond_resched

• other meta-data updates, journal writes

Also, the case where O_SYNC or O_DSYNC was specified at the time the file was opened has not yet been converted to be asynchronous.

3.4 Preliminary observations

Preliminary testing to explore the viability of the above-described approach to filesystem AIO support reveals a significant reduction in the time spent in io_submit (especially for large reads) when the file is not already cached (for example, on first-time access). In the write case, asynchronous get block support had to be incorporated to obtain a measurable benefit. For the cached case, no observable differences were noted, as expected. The patch does not appear to have any effect on synchronous read/write performance.

A second experiment involved temporarily moving the retries into the io_getevents context rather than into worker threads. This move enabled a sanity check using strace to detect any gross impact on CPU utilization.

Thorough performance testing is underway to determine the effect on overall system performance and to identify opportunities for tuning.

3.5 Issues and todos

• Should the cond_resched calls in read/write be converted to retry points?

• Are asynchronous get block implementations needed for other filesystems (e.g., JFS)?

• Optional: should the retry model be used for direct I/O (DIO), or should synchronous DIO support be changed to wait for the completion of asynchronous DIO?

• Should relevant filesystem APIs be modified to add an explicit waitqueue parameter?

• Should the iocb state be updated directly by the filesystem APIs, or by the high-level AIO handler after every retry?

4 AIO support for direct I/O

Direct I/O (raw and O_DIRECT) transfers data between a user buffer and a device without copying the data through the kernel’s buffer cache. This mechanism can boost performance if the data is unlikely to be used again in the short term (during a disk backup, for example), or for applications such as large database management systems that perform their own caching.

Direct I/O (DIO) support was consolidated and redesigned in Linux 2.5. The old scalability problems caused by preallocating kiobufs and buffer heads were eliminated by virtue of the new BIO structure. Also, the 2.5 DIO code streams the entire I/O request (based on the underlying driver capability) rather than breaking the request into sector-sized chunks.


Any filesystem can make use of the DIO support in Linux 2.5 by defining a direct_IO method in the address_space_operations structure. The method must pass a filesystem-specific get block function to the DIO code, but the DIO support takes care of everything else.
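As an illustration, a filesystem's direct_IO method in this scheme is typically a thin wrapper that hands its get block routine to the generic DIO code, roughly as in the sketch below (modeled on ext2; prototypes changed several times during 2.5, so the details are approximate):

    /* Sketch of an ext2-style direct_IO method; exact prototypes varied
     * across 2.5 revisions, so treat this as illustrative. */
    static ssize_t ext2_direct_IO(int rw, struct kiocb *iocb,
                                  const struct iovec *iov, loff_t offset,
                                  unsigned long nr_segs)
    {
            struct inode *inode = iocb->ki_filp->f_dentry->d_inode;

            /* The generic DIO code builds and submits the BIOs; the
             * filesystem only supplies its block-mapping routine. */
            return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
                                      iov, offset, nr_segs,
                                      ext2_get_block, NULL);
    }

    static struct address_space_operations ext2_aops = {
            /* ... readpage, writepage, prepare_write, etc. ... */
            .direct_IO = ext2_direct_IO,
    };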

Asynchronous I/O support for DIO was added in Linux 2.5. The following caveats are worth noting:

• Waiting for I/O is done asynchronously, but multiple points in the submission codepath can potentially cause the process to block (such as the pinning of user pages or processing in the filesystem get block routine).

• The DIO code calls set_page_dirty before performing I/O, since the routine must be called in process context. Once the I/O completes, the DIO code (operating in interrupt context) checks whether the pages are still dirty. If so, nothing further is done; otherwise, the pages are made dirty again via a workqueue run in process context.

5 Vector AIO

5.1 2.5 readv/writev improvements

In Linux 2.5, direct I/O (raw and O_DIRECT) readv/writev was changed to submit all segments or iovecs of a request before waiting for I/O completion. Prior to this change, DIO readv/writev was processed in a loop by calling the filesystem read/write operations for each iovec in turn.

The change to submit the I/O for all iovecs before waiting was a critical performance fix for DIO. For example, tests performed on an aic-attached raw disk using 4Kx8 readv/writev showed the following improvement:

Random writev:           8.7 times faster
Sequential writev:       6.6 times faster
Random readv:            system time improves 5x
Sequential readv:        system time improves 5x
Random mixed I/O:        5 times faster
Sequential mixed I/O:    6.6 times faster

5.2 AIO readv/writev

With the DIO readv/writev changes integrated into Linux 2.5, we considered extending the functionality to AIO. One problem is that AIO readv/writev ops are not defined in the file_operations structure, nor are readv/writev part of the AIO API command set. Further, the interface to io_submit is already an array of iocb structures analogous to the vector of a readv/writev request, so a real question is whether AIO readv/writev support is even needed. To answer the question, we prototyped the following changes [12]:

• added aio_readv/writev ops to the file_operations structure

• defined aio_readv/writev ops in the raw driver

• added 32- and 64-bit readv and writev command types to the AIO API

• added support for readv/writev command types to fs/aio.c:

 fs/aio.c                | 156 ++++
 include/linux/aio.h     |   1
 include/linux/aio_abi.h |  14 ++++
 include/linux/fs.h      |   2
 4 files changed, 173 insertions(+)
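To make the shape of these prototype changes concrete, the sketch below shows roughly what the new methods and command types could look like. The member names and the exact ABI constants are illustrative only; they are not necessarily the names used in the actual prototype patch [12].

    /* Illustrative only: possible shape of the prototype's additions. */

    /* New file_operations methods, alongside the existing aio_read/aio_write: */
    ssize_t (*aio_readv)(struct kiocb *iocb, const struct iovec *iov,
                         unsigned long nr_segs, loff_t pos);
    ssize_t (*aio_writev)(struct kiocb *iocb, const struct iovec *iov,
                          unsigned long nr_segs, loff_t pos);

    /* New command types in the AIO user/kernel ABI, e.g. something like
     * IOCB_CMD_PREADV / IOCB_CMD_PWRITEV, with 32- and 64-bit variants. */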

5.3 Preliminary results

With the above-noted changes, we were able to test whether an io_submit for N iovecs is more performant than an io_submit for N iocbs.

io_submit for N iocbs:

    io_submit --> iocb[0]   | aio_buf | aio_nbytes | read/write opcode |
                  iocb[1]   | aio_buf | aio_nbytes | read/write opcode |
                  ...
                  iocb[N-1] | aio_buf | aio_nbytes | read/write opcode |

io_submit for N iovecs:

    io_submit --> iocb[0]   | aio_buf | aio_nbytes=N | readv/writev opcode |
                                |
                                +--> iovec[0]   | iov_base | iov_len |
                                     iovec[1]   | iov_base | iov_len |
                                     ...
                                     iovec[N-1] | iov_base | iov_len |

Based on preliminary data [1] using direct I/O, an io_submit for N iovecs outperforms an io_submit for N iocbs by as much as two to one. While there is a single io_submit in both cases, AIO readv/writev shortens the codepath (i.e., one instead of N calls to the underlying driver method) and normally results in fewer bios/callbacks.

5.4 Issues

The problem with the proposed support for AIO readv/writev is that it creates code redundancy in the custom and generic filesystem layers by adding two more methods to the file_operations structure. One solution is to first collapse the read/write/readv/writev/aio_read/aio_write methods into simply aio_read and aio_write, and to convert those methods into vectored form [11].

6 Performance

6.1 System setup

All benchmarks for this paper were performed on an 8-way 700MHz Pentium™ III machine with 4GB of main memory and a 2MB L2 cache. The disk subsystem used for the I/O tests consisted of 4 IBM® ServeRAID-4H™ dual-channel SCSI controllers with 10 9GB disk drives per channel, totalling 80 physical drives. The drives were configured in sets of 4 (2 drives from each channel) in a RAID-0 configuration to produce 20 36GB logical drives. The software on the system was SuSE Linux Enterprise Server 8.0. Where noted in the results, the kernel was changed to 2.5.68 plus required patches [7]. For AIO benchmarking, libaio-0.3.92 was installed on the system.

6.2 Microbenchmark

The benchmark program used to analyze AIO performance is a custom benchmark called rawiobench [9]. Rawiobench spawns multiple threads to perform I/O to raw or block devices. It can use a variety of APIs to perform I/O, including read/write, readv/writev, pread/pwrite, and io_submit. Support exists for both random and sequential I/O, and the exact nature of the I/O request is dependent on the actual test being performed. Each thread runs until all threads have completed a minimum number of I/O operations, at which time all of the threads are stopped and the total throughput for all threads is calculated. Statistics on CPU utilization are tracked during the run.

The rawiobench benchmark will be run a number of different ways to try to characterize the performance of AIO compared to synchronous I/O. The focus will be on the 2.5 kernel.

The first comparison is designed to measure the overhead of the AIO APIs versus using the normal read/write APIs (referred to as the "overhead" test). For this test, rawiobench will be run using 160 threads, each doing I/O to one of the 20 logical drives, for both sequential and random cases. In the AIO case, an io_submit/io_getevents pair is submitted in place of the normal read or write call. The baseline synchronous tests will be referred to in the charts simply as "seqread", "seqwrite", "ranread", and "ranwrite", with an extension of either "ODIR" for a block device opened with the O_DIRECT flag or "RAW" for a raw device. For the AIO version of this test, "aio" is prepended to the test name (e.g., aioseqread).

The second comparison is an attempt to reduce the number of threads used by AIO and to take advantage of the ability to submit multiple I/Os in a single request. To accomplish this, the number of threads for AIO was reduced from 8 per device to 8 total (down from 160). Each thread is now responsible for doing I/O to every device instead of just one. This is done by building an io_submit with 20 I/O requests (1 for each device). The process waits for all I/Os to complete before sending new I/Os. This AIO test variation is called "batch mode" and is referred to in the charts by adding a "b" to the front of the test name (e.g., bseqaioread).

The third comparison improves upon the second by having each AIO process call io_getevents with a minimum number equal to 1, so that as soon as any previously submitted I/O completes, a new I/O will be driven. This AIO test variation is called "minimum batch mode" and is referred to in the charts by adding "minb" to the front of the test name (e.g., minbseqaioread).

In all sequential test variations, a global offset variable per device is used to make sure that each block is read only once. This offset variable is modified using the lock xadd assembly instruction to ensure correct SMP operation.
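As an illustration, the kind of atomic fetch-and-add this relies on can be written with GCC inline assembly roughly as follows. The wrapper below is a sketch for 32-bit x86, not the benchmark's actual code:

    /* Hypothetical sketch: atomically fetch and advance a shared offset
     * with lock xadd, so each block is handed to exactly one thread. */
    static inline long fetch_and_add(volatile long *offset, long block_size)
    {
            long old = block_size;

            __asm__ __volatile__("lock; xaddl %0, %1"
                                 : "+r" (old), "+m" (*offset)
                                 :
                                 : "memory");
            return old;            /* previous value = this thread's offset */
    }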

Figure 1: Sequential Read Overhead

6.3 Results, comparison and analysis

Raw and O_DIRECT performed nearly identically on all of the benchmark tests. In order to reduce redundant data and analysis, only the data from O_DIRECT will be presented here.

For the first comparison of AIO overhead, the results show that there is significant overhead to the AIO model for sequential reads (Figure 1). For small block sizes where the benchmark is CPU bound, the AIO version has significantly lower throughput values. Once the block size reaches 8K and we are no longer CPU bound, AIO catches up in terms of throughput, but averages about 20% to 25% higher CPU utilization.

In Figure 2 we can see that the performance of random reads using AIO is identical to synchronous I/O in terms of throughput, but averages approximately 20% higher CPU utilization. By comparing synchronous random read plus seeks with random pread calls (Figure 3), we see that there is minimal measurable overhead associated with having two system calls instead of one. From this we can infer that the overhead seen using AIO in this test is associated with the AIO internals, and not the cost of the additional API call.


Figure 2: Random Read Overhead

Figure 3: read vs. pread

This overhead seems excessive and probably indicates a problem in the AIO kernel code. More investigation is required to understand where this extra time is being spent in AIO.

For write performance we can see that AIO achieves approximately the same level of throughput as synchronous I/O, but at a significantly higher cost in terms of CPU utilization at smaller block sizes. For example, during the sequential write test at 2K block sizes, AIO uses 97% CPU while synchronous uses only 55%. This mirrors the behavior we see in the read tests and is another indication of problems within the AIO code.

Figure 4: Sequential Write Overhead

Figure 5: Random Write Overhead


Figure 6: Sequential Read Batch

Figure 7: Random Read Batch

Batch mode AIO performs poorly on the sequential and random read tests. Throughput is significantly lower, as is CPU utilization (Figures 6, 7). This is probably due to the fact that we can drive more I/Os in a single request than the I/O subsystem can handle, but we must wait for all the I/Os to complete before continuing. This results in multiple drives being idle while waiting for the last I/O in a submission to complete.

AIO batch mode also under-performs on the write tests. While CPU utilization is lower, it never achieves the same throughput values as synchronous I/O. This can be seen in Figures 8 and 9.

Figure 8: Sequential Write Batch

Figure 9: Random Write Batch


Figure 10: Sequential Read Minimum Batch

Minimum batch mode improves considerably on the overhead and batch mode tests; however, for O_DIRECT access AIO either lags in throughput or uses more CPU for all block sizes in the sequential read test (Figure 10). For the random read test, minimum batch mode AIO has identical throughput to synchronous reads, but uses from 10% to 20% more CPU in all cases (Figure 11). Sequential minimum batch mode AIO comes close to the performance (both throughput and CPU utilization) of synchronous I/O, but never performs better.

Minimum batch mode sequential writes, like reads, lag behind synchronous writes in terms of overall performance (Figure 12). This difference gets smaller and smaller as the block size increases. For random writes (Figure 13), the difference increases as the block size increases.

Since we are seeing lower CPU utilization for minimum batch mode AIO at smaller block sizes, we tried increasing the number of threads in that mode to see if we could drive higher I/O throughput. The results seen in Figure 14 show that indeed, for smaller block sizes, using 16 threads instead of 8 did increase the throughput, even beating synchronous I/O at the 8K block size.

Figure 11: Random Read Minimum Batch

Figure 12: Sequential Write Minimum Batch

Figure 13: Random Write Minimum Batch


For block sizes larger than 8K, the increase in the number of threads either made no difference or caused a degradation in throughput.

Figure 14: Sequential Read Minimum Batch with 16 threads

In conclusion, there appear to be no conditions for raw or O_DIRECT access under which AIO can show a noticeable benefit. There are, however, cases where AIO will cause reductions in throughput and higher CPU utilization. Further investigation is required to determine if changes in the kernel code can be made to bring the performance of AIO up to the level of synchronous I/O.

It should be noted that at the larger block sizes, CPU utilization is so low (less than 5%) for both synchronous I/O and AIO that the difference should not be an issue. Since using minimum batch mode achieves nearly the same throughput as synchronous I/O for these large block sizes, an application could choose to use AIO without any noticeable penalty. There may be cases where the semantics of the AIO calls make it easier for an application to coordinate I/O, thus improving the overall efficiency of the application.

6.3.1 Future work

The patches which are available to enable AIO for buffered filesystem access are not stable enough to collect performance data at present. Also, due to time constraints, no rawiobench testcases were developed to verify the effectiveness of the readv/writev enhancements for AIO [12]. Both items are left as follow-on work.

7 Acknowledgments

We would like to thank the many people on the linux-aio and linux-kernel mailing lists who provided us with valuable comments and suggestions during the development of these patches. In particular, we would like to thank Benjamin LaHaise, author of the Linux kernel AIO subsystem. The retry-based model for AIO, which we used in the filesystem AIO patches, was originally suggested by Ben.

This work was developed as part of the Linux Scalability Effort (LSE) on SourceForge (sourceforge.net/projects/lse). The patches developed by the authors and mentioned in this paper can be found in the "I/O Scalability" package at the LSE site.

8 Trademarks

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM and ServeRAID are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.

Pentium is a trademark of Intel Corporation in the United States, other countries, or both.


Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.

Linux is a trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

References

[1] AIO readv/writev performance data. http://osdn.dl.sourceforge.net/sourceforge/lse/vector-aio.data

[2] Asynchronous I/O on Windows® NT.

[3] Kernel Asynchronous I/O implementation for Linux from SGI. http://oss.sgi.com/projects/kaio

[4] POSIX Asynchronous I/O.

[5] Open Group Base Specifications Issue 6, IEEE Std 1003.1. http://www.opengroup.org/onlinepubs/007904975/toc.htm, 2003.

[6] Suparna Bhattacharya. 2.5 Linux Kernel Asynchronous I/O patches. http://sourceforge.net/projects/lse

[7] Suparna Bhattacharya. 2.5 Linux Kernel Asynchronous I/O rollup patch. http://osdn.dl.sourceforge.net/sourceforge/lse/aiordwr-rollup.patch

[8] Suparna Bhattacharya. Design Notes on Asynchronous I/O for Linux. http://lse.sourceforge.net/io/aionotes.txt

[9] Steven Pratt and Bill Hartner. rawiobench microbenchmark. http://www-124.ibm.com/developerworks/opensource/linuxperf/rawread/rawread.html

[10] Benjamin LaHaise. 2.4 Linux Kernel Asynchronous I/O patches. http://www.kernel.org/pub/linux/kernel/people/bcrl/aio/patches/

[11] Benjamin LaHaise. Collapsed read/write iocb argument-based filesystem interfaces. http://marc.theaimsgroup.com/?l=linux-aio&m=104922878126300&w=2

[12] Janet Morgan. 2.5 Linux Kernel Asynchronous I/O readv/writev patches. http://marc.theaimsgroup.com/?l=linux-aio&m=103485397403768&w=2

[13] Muthian Sivathanu, Venkateshwaran Venkataramani, and Remzi H. Arpaci-Dusseau. Block Asynchronous I/O: A Flexible Infrastructure for User Level Filesystems. http://www.cs.wisc.edu/~muthian/baio-paper.pdf


Towards an O(1) VM: Making Linux virtual memory management scale towards large amounts of physical memory

Rik van Riel, Red Hat, Inc.

[email protected]

Abstract

Linux 2.4 and 2.5 already scale fairly well towards many CPUs, large numbers of files, large numbers of network connections, and several "other kinds of big." However, the VM still has a few places with poor worst-case (or even average-case) behavior that needs to be improved in order to make Linux work well on machines with many gigabytes of RAM.

1 Introduction

In this paper I will explore the problem spaces and algorithmic complexities of the virtual memory subsystem. This paper will focus mostly on the page replacement code, which by definition has all of physical memory and parts of virtual memory as its search space. The following aspects of page replacement will be discussed:

• Page launder, the reclaiming of pages that are selected for pageout.

• Page aging, how to select which pages to evict.

• Balancing filesystem cache vs. anonymous memory.

• Reverse mapping, pte-based vs. object-based.

2 Page launder

Traditionally, the virtual memory management subsystems in Unix and Linux systems have had either a clock algorithm or Mach-style active and inactive lists to do both LRU aging and eviction of pages. Linux 2.4 and 2.5 have what amounts to simple Mach-style active and inactive lists (Figure 1), at least when it comes to the writeout and reclaiming of pages that aren’t mapped in processes. In this paper, the Mach VM pageout algorithm is used as an example because it is a decent approximation of what the different Linux VMs have done, and the Mach VM is quite possibly the best documented virtual memory subsystem.

In the Mach VM, pages get recycled once they reach the end of the inactive list and are clean, meaning they do not need to be written to disk. If the page needs to be written to disk, a so-called dirty page, disk IO is started and the page is moved to the beginning of the inactive list. Presumably the disk IO will have finished and the page will be clean by the time it gets to the end of the inactive list again.

Figure 1: Mach pageout lists

This organisation works reasonably well when dealing with filesystem cache pages, since those are usually clean pages which can be reclaimed the moment they reach the end of the inactive list. However, when the filesystem cache is small and the system is dealing mostly with dirty, swap or mmap backed pages from processes, this strategy has a big drawback on modern, large memory computers.

2.1 The problem with Mach-style page laundering

Memory used by processes is often dirty, meaning it needs to be written back to disk. The problem with this becomes obvious when we look at exactly what happens when all of the pages on the inactive list are dirty:

• The pageout code encounters a dirty page.

• Disk IO is started; the page is written to disk.

• The page is moved to the far end of the inactive list.

• The page reclaiming code encounters the next dirty page, starts writeout, etc.

• Since only a finite number of disk IO operations can be underway at any time, the page reclaiming code needs to wait for current IO operations to finish once it has started writeout on a certain number of pages.

• IO on the pages that were written out first finishes, meaning the pages are now clean and reclaimable.

• The page reclaiming code continues with the writeout of the other dirty pages on the inactive list.

Of course, this has a number of serious drawbacks. The most obvious one is that on large memory systems the system will wait for most pageout IO to have finished before it can even start the last IO. Worse yet, it won’t be able to free a page before all IO has been submitted.

In the early 1990s, when the Mach VM was popular, systems had up to a few megabytes of memory, with maybe a few hundred kilobytes of inactive pages, which could be written to disk in one or at most a few seconds. Modern systems, on the other hand, often have multiple gigabytes of memory. Since the speed of hard disks hasn’t increased nearly as much as the size of memory, the time needed to write out all of the inactive list can be unacceptably high, up to dozens of seconds.

2.2 Solutions

One obvious solution is to only write out part of the pages on the inactive list. After all, if the system needs to free ten megabytes of memory, there is little reason to write out one gigabyte of data. The implementation of this solution is a little less obvious, since there are various ways to approach this goal and there is a tradeoff to make between CPU usage and page freeing latency.

The first solution would be to simply write out a limited number of pages and skip the dirty pages on the list, scanning the list like usual and freeing all the clean pages encountered. In situations where the inactive list has both clean and dirty pages, this tactic will allow you to always free the clean pages, reaching your free target and allowing allocations to go on with as little latency as possible. Of course, if the list only has dirty pages, then the system could end up spending a lot of CPU time scanning the list over and over again.

For the rmap VM a different, hopefully more predictable and CPU-friendly solution (Figure 2) has been chosen. Instead of just one inactive list, there are various lists for the different stages of the pageout process a page can be in. Initially all rarely used pages are placed on the inactive_dirty list, regardless of whether or not they need to be written back to disk.

When a page reaches the end of the inactive_dirty list and wasn’t referenced, the VM will move it to the inactive_laundry list, starting disk IO if the page was dirty. Referenced pages get moved back to the active list.

At the other end of the inactive_laundry list, the VM removes clean pages until the system has enough immediately freeable and free pages. Referenced pages are moved back to the active list; cleaned pages are moved on to the inactive_clean list, from where they can be immediately reused by the page allocation code.

The inactive_clean list is just an extension of the free page list. It contains clean pages that were not referenced and can be immediately reclaimed by the page allocation code. The reason for having an inactive_clean list is that the free page list in a VM is never the right size. The list should be as large as possible in order to be able to satisfy allocations with low latency, but at the same time the list should be as small as possible so almost all of memory can be used for processes and the cache. Having a list of immediately reclaimable pages with useful data in them avoids most of this dilemma.

Figure 2: O(1) page launder
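In rough pseudocode, one pass of the page launder described above might look like the following sketch. This is illustrative only: the list names follow the text, but the helper functions are invented for the sketch and are not the actual rmap VM code.

    /* Illustrative pseudocode of one O(1) page-launder pass (not actual code). */
    void page_launder(int free_target)
    {
            struct page *page;
            int scan = MAX_LAUNDER_SCAN;

            /* Stage 1: move unreferenced pages from inactive_dirty to
             * inactive_laundry, starting writeback for the dirty ones. */
            while (scan-- && (page = take_from_tail(&inactive_dirty_list))) {
                    if (page_referenced(page)) {
                            add_to_list(&active_list, page);
                    } else {
                            if (PageDirty(page))
                                    start_async_writeback(page);
                            add_to_list(&inactive_laundry_list, page);
                    }
            }

            /* Stage 2: harvest clean, unreferenced pages until enough free
             * plus immediately freeable pages exist; no waiting on any
             * individual writeout is required. */
            while (free_and_clean_pages() < free_target) {
                    page = peek_tail(&inactive_laundry_list);
                    if (!page || PageDirty(page) || PageWriteback(page))
                            break;                 /* still being laundered */
                    take_from_tail(&inactive_laundry_list);
                    if (page_referenced(page))
                            add_to_list(&active_list, page);
                    else
                            add_to_list(&inactive_clean_list, page);
            }
    }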

3 Page aging

Since the performance penalty of evicting the wrong page from memory is so high, due to the enormous speed differential between memory and disk, any virtual memory subsystem needs to take great care in selecting which pages to evict and which pages to keep in memory. On the other hand, on systems with more than a few megabytes of memory you do not want to scan all the active pages every time the system is short on inactive memory.

While it is impossible to ensure this situation will never happen, because some applications just have access patterns you cannot tune a page replacement algorithm for, we can improve the situation a lot by pre-sorting the active pages into various lists (Figure 3), according to activity.

The pageout code will only look at the pages that most likely aren’t very active, meaning it has a better chance of finding the proper pages for eviction without needing to resort to a full scan of memory. If the list with the least-used pages is empty, the pageout code simply shifts down all of the active lists and starts looking at the pages that came from the next list up.

The page aging (sorting) code scans the active lists periodically and moves the pages that were accessed to higher lists. It only needs to age pages upwards, because the downwards movement is done by the pageout code shifting down whole lists at a time. The period with which the page aging code scans the active lists is varied in reaction to the amount of pageout activity. Ideally the system would do a similar number of up-aging scans as the number of times it shifts down active lists. The scan interval of the up-aging code is reduced if the VM did too much down-shifting of active pages, and increased if the VM was quiet in between two aging scans. The page aging interval has both a lower and an upper bound, to keep the overhead under control and to have some background aging in an otherwise idle system. The only time the page aging doesn’t run is when there are more active pages on the higher lists than on the lower lists.
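Expressed as pseudocode, the interaction between the pageout path and the aging scan is roughly the following. The helper names are invented for this sketch and do not correspond to actual rmap VM functions.

    /* Illustrative pseudocode for multi-list page aging (not actual code). */

    /* Pageout side: only the lowest active list is ever scanned; when it
     * runs dry, every list is shifted down one level in a single step. */
    struct list_head *active_list_for_pageout(void)
    {
            if (list_empty(&active_list[0]))
                    shift_active_lists_down();  /* active[i] <- active[i+1] */
            return &active_list[0];
    }

    /* Aging side: periodically promote referenced pages one level up. */
    void age_pages_up(void)
    {
            struct page *page, *next;
            int i;

            for (i = 0; i < NR_ACTIVE_LISTS - 1; i++)
                    list_for_each_entry_safe(page, next, &active_list[i], lru)
                            if (test_and_clear_page_referenced(page))
                                    move_to_list(&active_list[i + 1], page);
    }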

Figure 3: Multi-list page aging

4 Balancing cache vs. program memory

LRU-style page replacement algorithms have well-documented, known problems. There are several replacement algorithms available that improve the replacement of pages within one set of data, e.g., EELRU, SEQ, and LRFU; however, none of these addresses the problem of balancing replacement between various sets of data. Since all currently implemented page replacement algorithms for Linux have this problem, the replacement algorithm needs some help balancing the file cache with memory used for programs.

The rmap VM borrows a common trick from other systems here. There are separate active lists for file cache memory and program memory, active_cache and active_anon, respectively. While the cache is larger than a certain percentage of active memory, only the cache pages are candidates for pageout; this value is the so-called borrow percentage and is 15 by default. Below the borrow percentage, the VM will move both cache pages and pages belonging to processes to the inactive list, reclaiming the pages that haven’t been referenced again by the time they reach the far end of the inactive list. This gives cache and processes a chance to balance against each other by referencing pages. If the cache takes less than a predetermined minimum of the active list, one percent by default, the VM will only reclaim pages from processes.

The deeper reasons behind the need for these balancing hints are a little more complex than the reasons behind other design choices in the VM. One of the factors is that the amount of data on the filesystems tends to be several orders of magnitude larger than the amount of memory taken by the processes in the system. This means that the number of accesses to pages from the file cache could overwhelm the total number of accesses to the pages of the processes, even though the individual pages of the processes get accessed more frequently than most file cache pages. In other words, the system can end up evicting frequently accessed pages from memory in favor of a mass of recently but far less frequently accessed pages.

A replacement algorithm like LIRS, Low Inter-reference Recency Set, would probably do the right thing, since it replaces pages with a higher interval between references before pages that have a lower interval between references. However, for LIRS to work properly the VM would need to keep track of pages that have already been evicted from memory. Since Linux does not have an infrastructure to keep track of those, the rmap VM uses an LRFU-style page replacement algorithm with cache size hints.

Even if the direct value of LIRS over LRU/LFU for use as a primary cache wouldn’t be big enough to offset the overhead of the needed infrastructure, the facts that LIRS would make the file cache vs. process memory balancing automatic, and that LIRS would also do the right thing as a second-level cache (e.g., an NFS server, or the page cache on a web proxy where squid itself has the first-level cache), make the implementation of LIRS for Linux a promising future experiment.

5 Reverse mapping

Reverse mappings provide an inverse to the page tables of the processes; that is, they keep track of which processes are using the physical pages, at which virtual addresses. Using reverse mappings, the pageout code can:

• Unmap a page from all processes using it, without needing to search the virtual memory of all processes.

• Unmap only those pages it really wants to evict, instead of scanning the virtual memory of all processes and unmapping more pages than it wants to evict in order to be on the safe side. This could reduce the number of minor page faults.

• Evict pages in a certain physical address range, which is useful since Linux divides physical memory into various zones.

• Scan only the virtual mappings of known inactive pages, which means the pageout code has a smaller search space in virtual memory. Combined with smarter page aging and page laundering, this results in a smaller overall search space for the pageout code.

5.1 Page-based vs. object-based

There are pros and cons to doing reverse mapping on a per-page or a per-object basis. Reverse mapping on a per-page basis is more efficient for the pageout code, but the reverse mapping code affects more than just the pageout code path. The page fault, fork, exit, and mmap paths all modify the reverse mappings, so doing reverse mappings on objects larger than a page (like a vma) would reduce the reverse mapping overhead in those code paths, at the cost of the pageout code needing to search more space.

The big question here is how much the overhead and algorithmic complexities would change, especially under larger workloads. A quadratic increase in complexity in the pageout path is almost certainly more expensive than what could be offset by a linear speedup in the other code paths, even though the pageout path is rarely run.

Large workloads, with many gigabytes of memory and hundreds or thousands of large, active processes, are certainly able to bring out the worst of any VM; with the current implementations it doesn’t even matter which style of reverse mapping is used. Bad behaviour can be triggered in either case.

It appears that for both object-based and page-based reverse mappings, Linux is in need of smarter data structures that aren’t susceptible to quadratic algorithmic complexities anywhere. Once those are written, we will be able to make a proper comparison between both methods of reverse mapping. It is conceivable that Linux would end up using a hybrid of object-based and page-based reverse mapping, with each type being used where it is most appropriate.

6 Conclusions

Linux memory management has come a long way in the last few years, but at the same time users have deployed Linux in more and more demanding environments. In fact, demand always seems to be one step ahead of whatever stage kernel development is at.

Users have shown beyond any doubt that there are legitimate workloads that bring out the worst-case behaviour in any VM; because of this there is a constant need to bring the algorithmic complexity of any part of the virtual memory management subsystem closer to the holy grail of constant-time, or O(1), complexity. The author expects development of the Linux virtual memory management subsystem to remain challenging for years to come.

7 References

Richard P. Draves. Page Replacement and Reference Bit Emulation in Mach. In Proceedings of the USENIX Mach Symposium, Monterey, CA, November 1991.

Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: Simple and Effective Adaptive Page Replacement. In Proceedings of the 1999 ACM SIGMETRICS Conference, 1999.

Gideon Glass and Pei Cao. Adaptive Page Replacement Based on Memory Reference Behavior. In Proceedings of ACM SIGMETRICS 1997, June 1997.

D. Lee, J. Choi, J.-H. Kim, S.H. Noh, S.L. Min, Y. Cho, and C.S. Kim. LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies. IEEE Trans. Computers, vol. 50, no. 12, pp. 1352–1360, 2001.

S. Jiang and X. Zhang. LIRS: An efficient low inter-reference recency set replacement policy to improve buffer cache performance. In Proc. of SIGMETRICS 2002.


Developing Mobile Devices based on Linux

Tim Riker, Texas Instruments

[email protected], http://Rikers.org/

Abstract

This presentation will cover available components of embedded solutions that leverage Linux. We discuss bootloaders, Linux kernel configuration, BusyBox, glibc, uClibc, GUI choices such as Qtopia, TinyX, GPE, Konqueror, Dillo, and similar packages that make up a commercial-grade consumer device. This presentation is aimed at those who are just getting into Linux on mobile or other embedded devices.

1 Why Linux?

Linux is a stable, tested platform for production use. The most often-used environment for Linux is as a server for key business services. Is it well suited for embedded use on mobile devices? It is. Linux has a low total cost of ownership as compared to other options. There is a large pool of developers that are familiar with the environment, and this pool continues to grow rapidly. The use of an Open Source platform allows for flexible hardware and other product design choices that can save money. There is no dependency on a single vendor for support.

The largest advantage is quick time to market. The wealth of available projects that can be leveraged means that resources can be directed toward the specific value a product has to offer, without spending undue resources duplicating what is done on devices that are otherwise similar. This advantage truly comes to light when a community is formed around the product. This community can focus on enhancing the product without direct development cost to the manufacturer. When planning the product lifetime, thought should be given to seeding hardware to key community members so that community support is available early in the product life cycle.

2 Special Needs of Mobile Devices

Mobile devices have features not often seen in desktop or server systems. Most use flash storage instead of traditional hard disk media. NOR flash is a common choice for average-sized storage, but becomes more expensive with larger systems. It is relatively easy to handle from a software perspective, and is commonly available as a direct memory-mapped device. NAND flash is cheaper for larger (i.e., >32MB) devices, but is not normally mapped to a specific memory location. In addition, flash devices track bad sectors and have error-correction code. Special device drivers and filesystems are required.

Removable storage is also an option. MMC (MultiMediaCard) is one common small-form-factor storage card. These use a 1-bit-wide serial interface to access the flash. SD (Secure Digital) storage cards can have an enhanced 4-bit interface that allows for faster data transfer and I/O devices, but the licensing from http://SDCard.org/ prohibits releasing the source to any driver for these cards, so they are not currently recommended for Linux-based solutions.

Compact Flash (CF) storage cards are another popular option. Testing has shown that most, if not all, CF cards are unreliable when power-cycled during a write. Consider this strongly if power cycles are likely to happen. Batteries often run down on mobile devices.

2.1 Power needs in hardware

Power consumption is critical in a mobile device. If the device has a display, the light is often the single largest power-consuming component. Software should be configured to be aggressive about turning off the lighting. This is commonly user-configurable, and often has a different setting when the device is connected to external power. Wireless interfaces are likely the next largest power consumer. It is wise to spend time during product development tuning the wireless settings for maximum power conservation. Choosing a different wireless chipset can make a large difference in the power needs of the device.

Linux has support for CPU scaling on a number of architectures. Research the devices that are affected when the CPU speeds up or down on different platforms. For example, on typical StrongARM platforms, the LCD display must be turned off during any CPU speed changes. This may mean your product cannot leverage CPU scaling in a useful manner. Other devices that may be affected by CPU speed changes include audio, USB, serial, network, and many other timing-sensitive devices.

CompactFlash and other removable devices may still consume significant power while in a suspended state. It is wise to add support for removing power in software to most system devices before entering a suspend state. It is also wise to avoid polling of any hardware device. As a general rule, interrupt-driven devices will have a lower load on the CPU and therefore consume less power. Physical keys on the device are one common area where this is overlooked. If the device has a power button, it should be on a separate hardware interrupt from the other keys on the device.

3 The bootloader

Choosing a bootloader has a big impact on the development environment. Most embedded systems do not have a traditional BIOS on board. They do not have APM or ACPI interfaces. Some rely on the bootloader to take part in power-resume states; others just resume execution at the next address after where they entered suspend. Most bootloaders will initialize the RAM configuration and start up other devices in the system. Hardware designs may want to ensure that the Linux kernel can be booted with a minimum of hardware setup, so that driver code for the remaining hardware does not need to exist in the bootloader as well as in the kernel.

The most common interface to a bootloader is over a serial port. If the device has a keyboard and display, that may be another choice, but there is usually serial port support as well. The bootloader should support flashing new code on the device. New kernel images and filesystem images are loaded over this interface and stored to flash on the device. This requires that the bootloader is able to deal with whatever styles of flash are on the device. Xmodem, Ymodem, and even Zmodem code is available in existing open bootloaders.

If the device will have removable media such as CompactFlash storage cards, the bootloader can be configured to read images from there for flashing. This may be a good option for upgrading devices in the field. Installing from MMC or other types of storage devices may require significant work implementing device support in the bootloader.

Some embedded systems have built-in ethernet interfaces and can use that for bootp or tftp updates. This is not very common for a mobile device.

For products with a USB client connection, USB serial support can be implemented in the bootloader. This will likely use a different USB device ID than presented by the product in its normal running state. If the product also has USB host connectivity, then it can be imaged off another working device. The kernel and base filesystem should be in a separate area from the device configuration in order to support this.

The bootloader will need to be aware of any partitioning in flash on the device. The kernel will need access to this same information. This information could be shared with the kernel by dedicating an area of flash to store it. It could be compiled into both the bootloader and the kernel. The cleanest approach is for the bootloader to add it to the kernel command line when booting. There will likely be other kernel command-line options the bootloader will want to store and pass on to the kernel. All of these can be contained in one flash block if desired. Some bootloaders understand the filesystem, as well as the partitions on the device. This allows for one filesystem image that contains the base binaries and libraries as well as the kernel and its modules. This prevents the kernel and modules from ever being out of sync if they are all in the same image, which is a nice feature from a customer support perspective. If the bootloader can read files from the filesystem directly, then one can store the permanent kernel command-line options and other bootloader configuration inside the filesystem. This will likely not include flash partitioning information, as the partitioning would need to be known before the filesystem could be read.
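As one hypothetical example of the command-line approach, a bootloader might store and pass something like the line below, assuming the kernel's command-line MTD partition parser (cmdlinepart) is enabled; the console device, mtd-id, sizes, and partition names are illustrative only:

    # kernel command line handed over by the bootloader at boot time
    console=ttySA0,115200 root=/dev/mtdblock1 rootfstype=jffs2 mtdparts=flash0:128k(bootloader)ro,-(root)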

Erasing the bootloader is a Bad Thing. Steps should be taken to ensure that it is protected. This might mean putting it in ROM, protecting the flash block(s) it lives in, or other methods. OS updates in the field should not replace the bootloader as part of the normal procedure.

4 Filesystems

Workstation and server filesystems like ext2 do not scale well for embedded devices. Smaller filesystems include romfs and cramfs, which are read-only. cramfs includes compression. Flash or ROM is often slow to access, so the time it takes to read and decompress a file is often less than the time to read the same file stored uncompressed. cramfs may be both the smallest and fastest choice for a read-only filesystem. In most cases it is also important to conserve memory. This implies that filesystems should be used directly from flash rather than using an initial ramdisk (initrd) whenever possible.

Some applications on the device will create temporary files. If the filesystem is read-only, ramfs support should be added to store these files. One way to handle this is to link /tmp to /var/tmp, then mount ramfs on /var and unpack a tar archive into it, or just copy a tree from someplace else on the filesystem. You may want /dev linked in here as well if device files need to have permission or ownership changed at runtime.
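A minimal sketch of that arrangement follows; the archive name is made up, and on some kernels tmpfs may be preferred over ramfs:

    # at image build time: point /tmp (and optionally /dev) into the writable area
    ln -s /var/tmp rootfs/tmp
    # at boot time: mount a RAM-backed filesystem over /var and populate it
    mount -t ramfs none /var
    tar -C /var -xf /usr/share/var-skel.tar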

An alternative to /dev is to include devfs support in the kernel. You may be able to avoid using devfsd by considering the needs of all the applications that will be included on the device and configuring default device permissions in the kernel.


Many handheld portable devices use jffs2. This is a compressed journaled filesystem that understands NOR flash directly. It will need a few free blocks to use as workspace in each mounted partition. jffs2 has undergone a great deal of testing to ensure that it is always in a readable state even when writes are interrupted by a powerfail or hard reboot. jffs2 support on NAND flash chips is in progress and may be complete by the time you read this. There are other flash filesystems like YAFFS designed just for NAND flash devices. This is a reasonable option with a lot of flash and where compression at the filesystem level is not needed. Removable storage media will still need to use vfat if it is going to be moved to other devices.
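A jffs2 root image is normally built on the development host and then written to the flash partition; a sketch, assuming the mtd-utils tools are available and using illustrative partition numbers:

    # on the host: build a JFFS2 image from a staged root tree
    # (the -e erase-block size must match the target flash part)
    mkfs.jffs2 -r rootfs/ -e 0x20000 --pad -o root.jffs2
    # on the device: the same partition is mounted through the mtdblock driver
    mount -t jffs2 /dev/mtdblock1 /mnt/root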

Some systems have dedicated partitions in flash for diagnostics or additional code that handles reflashing the device in the case where the normal filesystem is not usable. The impact of requiring this extra flash space should be considered carefully. Reflashing code can be added to the bootloader in much less space. User-space tools can be used for this task as long as the device can get to that point using the existing filesystem. If the device includes removable media access, diagnostics can be shipped on the removable media and the bootloader configured to load a kernel and filesystem from there.

5 Kernel device drivers

Embedded devices often do not include components like PCI, ISA, or MCA busses. All devices that will not be present should be turned off in the kernel config to save space. These are not always obvious choices. For example, if the device includes a CompactFlash slot, all of the PCMCIA card drivers would be available as card options. There is no clear list of which drivers are exclusively for PCMCIA cards and do not need to be available because they would not be inserted in a CompactFlash slot.

5.1 Connectivity

Many mobile Linux devices include some form of networking. This is probably not a normal ethernet interface but can include WiFi, Bluetooth, IrDA, USB, ppp over serial, cell phone, etc. Some devices will have multiple options available. Very good IPv6 support is available under Linux, but be aware that the IPv6 stack is larger than IPv4 or other options.

6 Libraries

Larger Linux systems are commonly based on the GNU C Library. This is not well suited to small devices. To quote the maintainer:

. . . glibc is not the right thing for [an embedded OS]. It is designed as a native library (as opposed to embedded). Many functions (e.g., printf) contain functionality which is not wanted in embedded systems.

—Ulrich Drepper <[email protected]>

24 May 1999

Other alternatives will be smaller, but may require some modifications to source that is being ported. This should not be a large task when compared with the other aspects of the port. Alternatives include uClibc, dietlibc, newlib, and using pieces from things such as the Minix C library, libc5, *BSD libraries, etc. uClibc is the best choice today, as it is under the LGPL license to allow for commercial applications, and includes shared library support, pthreads, C++, and even most locale features. It is configurable so you can remove things you don't need. uClibc is tested for compliance with POSIX and ISO standards and passes more tests than glibc.

Other libraries included on the device should be reviewed closely to ensure that they do not include components that are not needed. Many of these have compile-time options to exclude large portions of the functionality.

If no applications are going to be added to the system at a later date, library reduction tools can be used to remove the unneeded portions. Some examples of these tools include mklibs, lipo, and libraryopt. The python script mklibs.py is used in the Debian project on multiple architectures and is the best place to start.

7 Applications

There are many basic tools that run on top of Linux and make up a basic distribution. Many of the GNU tools could be here. GNU fileutils, textutils, grep, modutils, and many others are found in most every Linux distribution, hence the GNU/Linux naming convention. While this is the norm for a desktop distribution, it is likely that none of the binaries or libraries on an embedded Linux system will be the GNU flavor. Applications like BusyBox and TinyLogin are not GNU packages, nor is uClibc.

BusyBox includes close to 200 different applications in one multicall binary that is under 600k. This allows you to hardlink or symlink to the single binary, and depending on how it is called, it acts as each of those different applications. Each applet is normally less feature-rich than the GNU equivalent, but much smaller in size. If you were to include the normal binaries for all of these from a GNU/Linux distribution, you could have over 5 MB of binaries. Each applet in BusyBox can be enabled or disabled, so it's likely that your version will be much smaller when you only include the programs you need.
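The multicall mechanism is simply dispatch on the name the binary was invoked under; a sketch of how a root image might wire it up (paths are illustrative):

    # install applets as links to the single binary when building the root image
    ln -s /bin/busybox rootfs/bin/ls
    ln -s /bin/busybox rootfs/bin/cp
    # on the device, BusyBox looks at the name it was invoked under
    ls /etc          # runs the BusyBox "ls" applet
    busybox ls /etc  # applets can usually also be invoked explicitly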

Your system will likely want to have some services and perhaps a way to get shell access remotely. A small inetd (24k) with a minimal telnetd (32k) could do the trick. The latest version of BusyBox includes both of these. Expansion capabilities might warrant including pcmcia-cs and hotplug code. Use caution in this area as well. Linux supports a lot of PCMCIA hardware but the size of included drivers can add up quickly.
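One hedged sketch of wiring the BusyBox telnetd behind inetd; the exact applet names and flags differ between BusyBox versions, so treat this as illustrative rather than a recipe:

    # /etc/inetd.conf: one line telling inetd to spawn telnetd for each connection
    echo 'telnet stream tcp nowait root /usr/sbin/telnetd telnetd -i' > /etc/inetd.conf
    inetd    # BusyBox inetd reads /etc/inetd.conf by default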

8 Crypto

For secure access, OpenSSH is the common choice on Linux desktops and servers. The OpenSSH client (215k) and server (254k) normally use OpenSSL (181k). When OpenSSL is compiled for a smaller set of supported hashes and algorithms, it can be smaller, but even then it is large compared to most embedded applications. The author is still searching for a small ssh/sshd solution.
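Trimming OpenSSL is done at configure time by disabling algorithms that are not needed; a hedged example, since the available no-* switches vary between OpenSSL releases:

    # build a reduced OpenSSL without some large or unneeded ciphers
    ./config shared no-idea no-rc5 no-mdc2
    make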

FreeSWAN is another secure access method. It is normally even larger, close to 2 MB in size. FreeSWAN supports Opportunistic Encryption, which can be very useful in an enterprise environment. Normal IPSec support is also valuable for many solutions. Adding hardware crypto support as provided by the Texas Instruments OMAP161x chips, and converting all crypto systems in the kernel and userland over to use it, can save memory as well as boost performance.

9 Package Management

For devices that are expandable, some form of package management is needed. Many desktop systems use RPM or dpkg for this purpose. Both of these are large binaries and, more importantly, they normally have a large amount of package information stored on the filesystem. BusyBox includes a stripped-down version of both rpm and dpkg which might be a good starting place. The Handhelds.org project has a package manager called ipkg that is small and better suited for embedded systems. The Debian installer uses .udebs for about the same purpose. ipkg packages and udebs do not normally include man pages or other supporting files, but have just the minimum files included. These packages are both the same format as a .deb file.
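Day-to-day use of ipkg looks much like its desktop counterparts; a sketch with illustrative package and file names:

    # refresh the package lists from the feeds configured in /etc/ipkg.conf
    ipkg update
    # install from a feed, or directly from a local .ipk file
    ipkg install gpe-calendar
    ipkg install ./hello_1.0_arm.ipk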

Note that the Linux Standards Base requires the ability to install rpm files. It also requires many things, like libgcc_s and glibc, that the embedded device may not provide.

10 Graphical User Interface

There are many choices of GUIs for a mobile device. The Sharp Zaurus models use Qtopia with Qt/Embedded. These are available under the GPL, but using these versions requires that all applications available for the platform are also under the GPL. These components are also available from Trolltech under a different license as royalty-bearing components. This allows applications to use licenses other than the GPL. These components might be the only ones on the device that are not free. The base files needed for a common Qtopia and Qt/E configuration would be about 3 MB in size. This does not include the applications.

The Handhelds.org distribution can use X11 and GTK as the basis for an environment called GPE (GPE Palmtop Environment). This uses TinyX from the XFree86 version of the X Window System. The X server itself will be around 700k, and the X11 and GTK libraries add up to about 3 MB. Note that this system retains all the remote display options of a normal X Window System desktop machine.

Smaller systems are available such as Pixel or NanoGUI. These might not include advanced web browsers or other large applications, but they are appropriate for smaller devices. A custom interface could be built on top of systems like DirectFB, GTKfb, Qt/E, SDL, or by writing directly to the Linux framebuffer.

The Qtopia interface has gotten more attention lately and is currently ahead of the GPE work. The OPIE project has enhanced Qtopia and is starting work on a next-generation library that would replace the Qtopia library and clean up the interface in the process. This new project should be available under the LGPL, and if it was configured to run on top of TinyX using Qt/X11 instead of Qt/E, it could avoid a per-unit royalty. Removing duplicate code between X and Qt/X11 could drop the storage size down very close to that of Qt/E and Qtopia. This retains the remote display options that the X Window System provides, while leaving a path to use much of the available Qt widgets and applications.

11 Web Browser

As time progresses, an embedded web browser will become more and more important. Browsers like Netscape and Mozilla are large. Other browsers might use the same engine but be much more space-constrained. Browsers like Dillo and ViewML are less feature-rich, but even smaller. Text-based browsers like lynx or links could be wrapped with a graphical front end, but this would be time-consuming and not in line with the goals of those projects. There are commercial options like Opera to consider as well. Konqueror Embedded, which uses Qt, is a good mix of size and features. Some minor interface tweaks could make it even better.


12 A Small Example

The TuxScreen project is very tightly storage-constrained. There is no web browser in the Linux-based base filesystem image. This device is a desktop phone with a StrongARM 1100, a 640x480x8 color touch screen, and only 4 MB of flash storage. In this space we have a 128k bootloader partition (with only 32k used) and a 4 MB minus 128k jffs2 root partition. In the root partition is the kernel and modules, uClibc, BusyBox, TinyLogin, pcmcia-cs, lrzsz, inetd, telnetd, XFree86 TinyX, rxvt, matchbox (a window manager), and more, with over 500k left in writable space. For devices that are only remote displays, this is a reasonable target size.

13 Where to Go from Here?

There are many Linux consulting companies that can help with a new product release. Monta Vista and Metroworks are two of the larger players in that space. Red Hat also has a group focused on embedded work. Some of these solutions might include other royalty-bearing components or have the option to spread out the engineering cost over the product lifecycle. There are plenty of individual consultants active in embedded systems work based on Linux. Some are looking for permanent placement and some work on a consulting basis. You should determine the finances of your project and pick a solution that matches with that plan.

The big benefits of going with Linux as the solution are this large available pool of resources to draw from and the ability to have all the source, so you retain the option of doing the work internally or through another vendor. Do not ignore the benefit of growing a community around your product. The work that active, qualified developers in the community do to improve your software solution is a valuable benefit. Much of it will be available to you for merely the time it costs to monitor those activities. You can effectively kill off this effort by making it difficult for others to get involved. This will likely have a detrimental effect on your product sales.

When working out the details of contract work, you will likely want to add a requirement that all work be pushed upstream. This implies that it will be peer-reviewed by other Linux developers and may need to be fixed to comply with existing Linux kernel code. This is a very important part of new work done in the kernel. Without it, you and your product will be stuck with an old version of the kernel which the community and other vendors will not be eager to support. When the changes are pushed upstream, they are much more likely to get fixed as the kernel develops. A few free units of fun hardware in the right hands can go a long way as far as overall software development expenses go.

Conclusion

We have touched on many projects that can be leveraged to offer rapid time to market when developing mobile and other embedded devices based on Linux. I hope this has been useful. There are many useful web sites that offer more information. The author plans to add links to all of the projects mentioned here, and others as well, in the wiki found at http://eLinux.org/wiki/ if you would like more details. Good luck with your projects, and welcome to the community.


Lustre: Building a File System for 1,000-node Clusters

Philip Schwan
Cluster File Systems, Inc.

[email protected], http://www.clusterfs.com/

Abstract

Lustre is a GPLed cluster file system for Linux that is currently being tested on three of the world's largest Linux supercomputers, each with more than 1,000 nodes. In the past 18 months we've tried many tactics to scale to these limits, and the first half of this paper will discuss some of our successes and failures. The second half will explore some of the changes that we plan to make over the next year, as we scale towards tens of thousands of clients and petabytes of data.

1 Introduction

The Lustre cluster file system has been designed and implemented with the goal of removing the bottlenecks traditionally found in such systems. Lustre runs on commodity hardware and provides a cluster storage layout that is efficient, scalable, and redundant. Metadata Servers (MDSs) contain the file system's directory layout, permissions, and extended attributes for each object. Object Storage Targets (OSTs) are responsible for the storage and transfer of actual file data, and already scale to many dozens of OSTs and hundreds of terabytes of data. Both types of service node can operate in pairs which automatically take over for each other in the event of failure. Each also runs an instance of the Lustre distributed lock manager, access to which forms the core of the Lustre protocols.

Although Lustre's design dates from 1999, development began in earnest in early 2002. In the time since, surprisingly few of the major points have changed from the original plan, and the implementation has undergone fewer false starts as a result. Our distributed lock manager has weathered the storm and remains largely as it was a year ago. The choice of a system designed around object protocols has proven to be correct, and Lustre has so far scaled to the limits of available hardware. Lustre's internal networking has shown itself to be relatively flexible and high-performance, and network abstraction layers exist for TCP/IP, Quadrics Elan, Myrinet, and SCI.

Not all of our original decisions were ideal, however. One source of bugs continues to be the interaction between Lustre and the Linux VFS layer, which is not very well suited to network file systems that want a great deal of control. This interaction had a significant impact on one of Lustre's major metadata architecture choices, the concept of "intent-based" locking operations, described in more detail later. Ultimately, we had to make significant changes to our intent-based metadata implementation.

The last year of working with government and industry has suggested which activities are most important to pursue in the next year. First, and most significantly, two major caching improvements will be made beginning this summer: a write-back metadata cache, and a persistent data/metadata cache. The write-back cache will be enabled in times of low concurrency, and allows for metadata updates which can be made in local memory and later replayed on the server. Making this cache persistent for both metadata and file data will enable features such as disconnected operation and server replication. Finally, our collaborative read cache will reduce the load on primary OSTs for the most frequently accessed files, removing a very common bottleneck in distributed systems.

2 Distributed Lock Manager

All of Lustre's consistency guarantees are enforced, in one way or another, by the Lustre distributed lock manager (DLM). Core operational decisions, such as when to switch between writeback caching and synchronous metadata updates, will be delegated to the DLM.

The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM's DLM [1]), experience thus far has seemed to indicate that we've made the correct choice: it's smaller, simpler and, at least for our needs, more extensible.

The Lustre DLM, at just over 4,000 lines of code, has proven to be an overseeable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, is nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.

In particular, Lustre's DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core functionality required by the file system.

Next, we feel that our extensions to the basic DLM API and protocol have been quite successful. File range locking is managed internally as part of the regular lock matching and compatibility functions. Through the use of a small policy function, the lock manager is able to grant larger locks than originally requested. In this way we avoid the bottleneck found in some other file systems, for which a client must lock each page or block individually. For the very common case of a file being accessed by only one user, Lustre's DLM will grant exactly one lock for the entire file.

Finally, intent locking is designed around the concept of allowing the lock manager to choose between different modes depending on its view of resource contention. In a directory with very little contention—a user's home directory, for example—the DLM can grant a write-back lock, allowing the client to cache a large number of metadata updates in memory. In this way it will avoid an interaction with the server for each request and batch them at some later time. In a directory with very high concurrency—such as /tmp—the DLM will refuse to grant any lock at all. Instead, it will perform the operation on the client's behalf, notify it of the result, and avoid bouncing the directory lock between hundreds or thousands of simultaneous users.


3 Object Protocols

Of all of the concepts that went into Lustre's architecture, the use of object protocols is by far the most pervasive. Although Lustre is certainly not unique in its use of storage objects, we have also designed many of the internal APIs to allow for additional layering (RAID 0 as one example) or short-circuiting (a client running on an OST with no networking layer between them). This symmetry between internal APIs and network protocols has served us well.

During the initial design it became quite clear that a shared block file system would absolutely not scale to the required limits for many reasons. First, shared disk arrays on anything but the smallest clusters quickly become cost ineffective for even the largest customers; this would certainly violate our goal of running on inexpensive commodity hardware. Second, high-level object protocols remove a key bottleneck for scaling beyond a dozen or two nodes: locking and allocation of metadata.

In a traditional shared-block file system, those blocks which store inode and block allocation information are subject to incredible contention. By organizing the protocol around objects instead of blocks, the OSTs remain responsible for the internal metadata allocation. Parallel file I/O to a single file has been shown to scale to more than 1,100 nodes, the limit of available hardware.

For those customers who have already invested in a large storage area network (SAN) based around shared disk, Lustre is still an option. In the SAN mode, OSTs are still responsible for managing the object locking and shared storage metadata, but clients can read and write individual data pages directly from the SAN.

4 Networking

Lustre's networking layer has not changed significantly from its original form more than a year ago. It uses a simple message-passing package called Portals [2], which has from an API standpoint served us fairly well. Importantly, it provides the right abstractions for enhancements such as remote DMA as supported by the networking hardware. We've made a few relatively minor API changes to accommodate the different needs of the filesystem, as opposed to the scientific community from which Portals emerged.

The original implementation of Portals, however, caused many serious problems. The port to run in the Linux kernel and userspace was fairly straightforward, but Portals had never been run in a multi-threaded environment and had absolutely no internal locking. Given that we had to rewrite more than 80% of the code and put up with serious race conditions for many months, it would likely have been a better choice to keep the API and start the implementation from scratch.

The API and the internal abstraction layers, however, have been both simple enough to understand and modify, and flexible enough to cope with the needs of many networking drivers. Lustre (and therefore Portals) needs to support a variety of interconnects, including kernel TCP/IP, TCP/IP offload cards, Quadrics Elan 3, Myrinet, and SCI.

For each network type we have a Portals Network Abstraction Layer (NAL), approximately one to two thousand lines of code each. Although they are small, they are generally quite complicated, and may depend on a fair bit of wizardry to get the most out of a particular interconnect. Nevertheless, the Lustre network regression test running on our Elan NAL between two nodes is bottlenecked by the PCI bus at more than 300 MB/s.

5 Intent Operations Explained

Most distributed file systems perform metadata operations in the same way all the time, regardless of contention. Some systems choose to give locks on objects to clients, and some choose to perform all operations synchronously on the server.

In the first mode, a client wanting to perform metadata operations will first take a lock on the parent directory, download the applicable directory information, and make many changes locally. This is extremely efficient in times of low contention; it can perform as many operations as it wants locally without contacting the server, as long as no other nodes try to acquire the lock. In times of high contention, however, it is a disaster: imagine users on 1,000 nodes all running touch /tmp/foo. The lock on /tmp will have to be given to each node in turn, and the cluster will grind to a halt.

In the second mode, the client sends a message to the server for each operation, and the server performs the operation without giving any locks out. Not only is this much simpler to code properly, it also avoids the problem with lock ping-pong. When this mode is used, however, even directories with no contention have this behaviour, and you suffer the effects of a server round-trip for each operation.

Lustre currently executes all operations as if there were high concurrency, with exactly one RPC per metadata operation. With the completion of the writeback metadata cache later this year, the DLM will be able to make the choice between giving the client a writeback lock on a subtree or performing one RPC per op.

6 Intents Gone Wrong

Our first attempt at writing the client-side VFS code to support the intent mechanism was roughly as follows. Consider the case of a mkdir operation: normal filesystems will lock the parent directory, look up the new directory to see if it already exists, create it, then release the lock. Lustre added a lookup intent structure to each lookup call, to tell the lock manager on the server why we asked for the lock (in this case, to mkdir). If the server decided not to give out the lock, it would perform the operation on the client's behalf and return a success or error code.

When the Lustre client received this reply, it would do complicated things to cooperate with the VFS. If the mkdir succeeded, for example, it needed to create a negative directory entry (dentry) before returning from lookup (if we returned a new positive dentry, the VFS would return -EEXIST). Later, the VFS would call us back to do the "actual" mkdir, at which time we would instantiate the dentry based on the reply stored in the lookup intent.

This turned out to be a disaster of race conditions, both on the server and on the client. On the server, the lock manager would perform these operations before the lock was granted, so that it could give the client a lock on the new file. By the time the lock was actually granted, however, anything could have changed. On the client, our ability to control the dcache, particularly in the window between lookup and final creation, proved insufficient.

Our final solution was to make two fairly major changes to both sides. Instead of the lock manager performing an operation before locks are granted, the metadata server is able to specify an already-granted lock to give to the client. This allows the MDS to perform the operation and then return the still-granted lock on the new file without races. The client has been simplified to call directly into the filesystem and return the result immediately: no dcache, no VFS code, no races.

7 Client Metadata Caching

When a node asks to lock a directory for reading or writing, the Lustre DLM will soon be able to grant a subtree lock, if the directory has not recently seen conflicting activity. This allows the client to keep a cache which can be filled and selectively revalidated as necessary.

As a client with a subtree lock fills its cache from the MDS, the MDS may revoke locks on other objects. If during this process the MDS encounters an opened file or a file with hard links, it flags this file for special attention by the client. Specifically, the client also flags these as shared objects which cannot be cached locally and must use the intent path for updates.

Once a client has a subtree lock, it can begin to keep a local journal of updates. Each update is a short record which describes one logical filesystem operation on an object, for example "create directory, mode 0755, parent inode 12, new directory name foo". Because all update operations are now reduced to the creation of a single record in client memory, they are incredibly fast.

When another client attempts to perform a conflicting operation beneath a subtree lock, that lock must be found and revoked. The MDS server code can easily walk the dentry tree, looking at each path component of the affected object, and revoke subtree locks as necessary. It is now easy to see why we flag hard-linked files for special handling, as they have more than one path by which they can be reached.

When a subtree lock is revoked, any accumulated updates must be flushed to the MDS and replayed on stable storage. This is exactly the same mechanism which already exists for intent locks, except that the updates are grouped into pages of operations which are all guaranteed to succeed by virtue of the subtree lock. Now a single network exchange can contain hundreds of records.

8 Persistent Caching

To this design there is one particularly attractive extension, which is a persistent cache in the style of AFS [3], Coda [4], and InterMezzo [5]. Two new pieces are required for such an extension: the local cache itself, and a way to revalidate its contents after locks are lost and re-acquired.

Lustre's stackable object protocols allow a very symmetric design for the persistent cache by adding a metadata server to the client. Today the file system code interacts with a metadata client (MDC) via function calls, which executes network commands to an MDS. In this new model, the MDC can be replaced with a caching MDC which can talk to both a local and a remote MDS. The local MDS is responsible for maintaining the local cache, either in memory or on disk; the caching MDC resolves cache misses and replays updates to the real MDS as before.

At some point, after a client has lost and re-acquired a lock, we need a way to validate the data that already exists in the cache. Unix filesystems already provide the concept of a change time (ctime), which is updated whenever the inode changes. For Lustre directories we will add a new subtree change time (stctime) which will be updated whenever any inode in the subtree is changed. These stctimes have a nanosecond granularity and will allow a client to very quickly establish whether large portions of a cache are up to date.


[Figure 1: Persistent Caching. Metadata operations from the file system pass through a caching MDC, which first queries a local MDS cache; on a hit it performs local updates, and on a miss it sends the lock request with its intent to the remote MDS and populates the local MDS with the returned locks and attributes.]

Updates to the stctime will of course dirty more (possibly many more) inodes during each update. Customers pushing their metadata server to the limits have the option of disabling the stctime and revalidating each object individually, or of going without a persistent cache.

9 Collaborative Caching

A very common load pattern found in industry filesystem installations is one where read traffic vastly outnumbers write traffic. One such example is a cluster of web servers serving mostly static content.

Unless some effort is made to distribute the load in these situations, the servers will be completely overwhelmed, based purely on the raw bandwidth that a single server can provide. Consider the load placed on central servers if workstations have remote root filesystems and are all booted simultaneously following a power outage; or a lab of students all loading large applications at the beginning of a class.

[Figure 2: Collaborative Caching. A client's read request is referred to a dedicated cache server, which services the request and fetches the requested data from the target OST only if it is not already in its cache.]

Lustre is once again leveraging the object protocols into a symmetric collaborative caching device. Instead of working directly with an OST, a client communicates with one or more caching devices which can take locks on the client's behalf, locally service cache misses, and provide an alternative path to the high-demand data. These caching devices can also run on the clients themselves, which is the collaborative part of the collaborative cache.

With the addition of caching devices, recovery becomes somewhat more complicated. Normally the decision to give up on a particular OST following a network timeout is a very simple one. However, with a cache in the middle it's important to distinguish between a failure of the cache node and a failure of the OST itself.

10 Conclusion

After more than a year of development, the Lustre framework has deviated surprisingly little from the original architecture. The lock manager and object protocols have served us well and will continue to form the centre of the design. Despite the fairly serious setbacks with the VFS and client caching, our intent locking strategy has been shown to be successful on very large clusters and will be the primary mechanism for dealing with high-contention directories.

Today Lustre scales comfortably to more than 1,000 nodes and is running on 3 of the 8 largest clusters in the world [6] (at Lawrence Livermore and Pacific Northwest National Laboratories). We have every reason to believe that today's Lustre code will scale to 2,000 nodes without serious difficulties; the next two years of development are planned to address scalability and performance issues on a completely new scale of clusters which are only just beginning to be designed.

11 Acknowledgments

Lustre has benefitted significantly from the experience, guidance, and funding of several US Government national laboratories, notably Lawrence Livermore National Laboratory and Pacific Northwest National Laboratory. We have also received the support of the National Nuclear Security Administration ASCI PathForward Program, which provides funding for many of the advanced features in Lustre's future. Amongst our partners we are grateful for the support of and opportunities with Hewlett-Packard and Dell. The views and conclusions contained in this paper are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of our partners or the US Government.

12 Availability

Lustre is released under the terms and conditions of the GNU General Public License, and can be downloaded from our FTP site or checked out of our public CVS repository. More information can be found at http://www.lustre.org/

Founded in 2001 by Dr. Peter Braam, Cluster File Systems, Inc. is a privately held company headquartered on the internet, with developers in five countries. CFS is focused on high-end storage solutions, including the development of advanced file systems, novel architectures, and the storage industry as a whole. More information about how we can improve your storage offering, business, or laboratory can be found at http://www.clusterfs.com/ or by writing to [email protected].

References

[1] IBM, Programming Locking Applications, Version 4.3.1, Second Edition, 1999.

[2] Brightwell et al., The Portals 3.1 Message Passing Interface, Revision 1.0, 1999.

[3] J. Howard, An Overview of the Andrew File System, In Proceedings of the USENIX Winter Technical Conference, 1988.

[4] Satyanarayanan, M., Kistler, J. J., Kumar, et al., Coda: a Highly Available File System for a Distributed Workstation Environment, IEEE Trans. on Computers, 39(4): 447–459, 1990.

[5] Braam, Callahan, and Schwan, The InterMezzo Filesystem, In Proceedings of the Ottawa Linux Symposium, 1999.

[6] http://www.top500.org/list/2003/06/


OSCAR Clusters

John Mugler, Thomas Naughton∗, and Stephen L. Scott†

Oak Ridge National Laboratory, Oak Ridge, TN

Brian Barrett‡, Andrew Lumsdaine and Jeffrey M. Squyres§

Indiana University, Bloomington, IN

Benoît des Ligneris, Francis Giraldeau¶

Université de Sherbrooke, Québec, Canada

Chokchai Leangsuksun‖

Louisiana Tech University, Ruston, LA

Abstract

The Open Source Cluster Application Resources (OSCAR) is a cluster software stack providing a complete infrastructure for cluster computing. The OSCAR project started in April 2000 with its first public release a year later as a self-installing compilation of "best practices" for high-performance classic Beowulf cluster computing. Since its inception approximately three years ago, OSCAR has matured to include cluster installation, maintenance, and operation capabilities and as a result has become one of the most popular cluster computing packages worldwide. In the past year, OSCAR has begun to expand into other cluster paradigms including Thin OSCAR, a diskless cluster solution, and High-Availability OSCAR, embracing fault tolerant capabilities. This paper will cover the current status of OSCAR including its two latest invocations—Thin OSCAR and High-Availability OSCAR—as well as the details of the individual component technology used in the creation of OSCAR.

∗Contact author: Thomas Naughton, <[email protected]>
†This work was supported by the U.S. Department of Energy, under Contract DE-AC05-00OR22725.
‡Supported by a Department of Energy High Performance Computer Science Fellowship.
§Supported by a grant from the Lilly Endowment.
¶Supported by Centre de Calcul Scientifique.
‖Supported by the Center for Entrepreneurship and Information Technology (CEnIT), Louisiana Tech University.

1 Introduction

The growth of cluster computing in recent years has primarily been fueled by the price/performance quotient. The hardware investment to obtain a supercomputer-caliber system extends the capabilities to a much broader audience. This advancement enables individuals to have exclusive access to their own supercomputer.

The capability of individual computing clusters involves the well known exchange of hardware costs for software costs. This software is necessary to build, configure and maintain the ever growing number of distributed heterogeneous machines that make up these clusters. The Open Cluster Group (OCG) was formed to address cluster management needs. The first working group formed by OCG was the Open Source Cluster Application Resources (OSCAR) project. The OSCAR group began by pooling "best practice" techniques for High Performance Computing (HPC) clusters into an easy to use toolkit. This included the integration of HPC components with a basic wizard to drive the installation and configuration.

The initial HPC-oriented OSCAR has evolved, with the base cluster toolkit being distilled into a more general cluster framework. The separation of the HPC-specific aspects enables the framework to be used for other cluster installation and management schemes. This has led to the creation of two additional OCG working groups that leverage the basic OSCAR framework. The "Thin-OSCAR" working group uses the framework for diskless clusters. The "HA-OSCAR" group is focusing on high availability clusters. The relationship of OCG working groups and the framework is depicted in Figure 1.

[Figure 1: The Open Cluster Group (OCG) working groups share the OSCAR framework for cluster installation and management. The figure shows OSCAR (HPC, disk based), Thin-OSCAR (diskless), and HA-OSCAR (high availability) built on the common OSCAR framework and OPD repository.]

The following sections will briefly discuss the OCG umbrella organization and the current working groups. The basic OSCAR framework will be discussed, followed by a brief summary of the HPC-specific components. Then the diskless (Thin-OSCAR) and high availability (HA-OSCAR) working groups will be covered, followed by concluding remarks.

2 Background

The OSCAR working group was the first working group that was formed under the umbrella Open Cluster Group (OCG) organization. The OCG was formed after a handful of individuals met in April 2000 to discuss their forays into cluster construction and management [1]. The consensus was that much effort was being duplicated, and a toolkit to assist with the integration and configuration of current "best practices" would be beneficial to the participants and the HPC community in general.

This initial meeting led to subsequent discussions and ultimately a public release of the cluster installation, configuration, and management toolkit OSCAR v1.0 in April 2001. This initial release addressed OCG's mission statement to make cluster computing simpler while making use of the commonly used open source software solutions. In the years that have followed, the OSCAR working group has enhanced the toolkit with features such as modular OSCAR packages and improved cluster management facilities.

The OSCAR project is comprised of a mixture of industry and academic/research members. The overall project is directed by a steering committee that is elected every two years from the current "core organizations." This "core" list is composed of those actively contributing to project development. The 2003 core organizations include: Bald Guy Software (BGS), Dell, IBM, Intel, MSC.Software, Indiana University, the National Center for Supercomputing Applications (NCSA), Oak Ridge National Laboratory (ORNL), and Université de Sherbrooke.

There have also been new OCG working groups created to address other cluster environments. These new working groups are named "Thin-OSCAR" and "HA-OSCAR." The "Thin-OSCAR" project provides support for diskless clusters. The "HA-OSCAR" group is focused on high availability clusters. The different groups focus on different cluster environments but leverage much of the core facilities offered by the base OSCAR framework.

3 OSCAR Framework

The standard OSCAR release targets usage in a High Performance Computing (HPC) environment. The base OSCAR framework is not necessarily tied to HPC. Currently the framework and toolkit are simply referred to as OSCAR and are used to build HPC clusters. To avoid confusion throughout this paper, a distinction will be made by using OSCAR to refer to the entire toolkit and OSCAR framework for the base facilities.

The framework includes the set of base or "core" packages needed to build and maintain a cluster. There are two other package classifications: "included" and "third-party." The included class of packages includes commonly used HPC applications for OSCAR. These packages are closely maintained and tested for compatibility with each OSCAR release. The third-party distinction is provided for all other OSCAR packages (see Section 3.5).

The core components enable a user to construct a virtual image of the target machine using the System Installation Suite (SIS). There is also an OSCAR database (ODA) that stores cluster information. The final two components include a parallel distributed "shell" tool set called C3 and an environment management facility called Env-Switcher.

The fundamental function of OSCAR is to build and maintain clusters. This largely amounts to software package management.

The guiding principle behind OSCAR and OCG is to use "best practices" when available. Thus, the Red Hat Package Manager (RPM) [2] is leveraged by OSCAR.¹ RPM files are pre-compiled binary versions of the software with metadata that is used to manage the addition, deletion, and upgrade of the package. RPM handles the conflict and dependence analysis. This add/delete/upgrade capability is a key strength of RPM and is made use of by the framework in "OSCAR Packages."

3.1 System Installation Suite

The System Installation Suite (SIS) is based on the well known SystemImager tool [3, 4]. SystemImager is used to build an image—a directory tree that comprises an entire filesystem for a machine—that is used to install cluster nodes. The suite has two additional components: System Installer and System Configurator. These two components extend the standard SystemImager to allow for a description of the target to be used to build an image on the head node. This image has certain aspects generalized for on-the-fly customization via System Configurator. This dynamic configuration phase enables the image to be more general, so items such as the network interface card are not in the SIS image. This capability allows for heterogeneity within the cluster nodes, while leveraging the established SystemImager management model.

SIS is used to "bootstrap" the node installs—kernel boot, disk partitioning, filesystem formatting, and base OS installation. The image used during the installation can also be used to maintain the cluster nodes. Modifying the image is as straight-forward as modifying a local filesystem. Once the image is updated, rsync² is used to update the local filesystem on the cluster nodes. This method can be used to install and manage an entire cluster, if desired. This image-based cluster management is especially useful for maintaining diskless clusters and is used by the Thin-OSCAR working group (see Section 5).

¹The underlying framework is designed to be as distribution-agnostic as possible. The RPM name is slightly misleading, but the system is available on distributions other than Red Hat.
²rsync is a tool to transfer files, similar to rcp/scp [5].

3.2 C3 Power Tools

The distributed nature of clusters introduces a need to execute commands and exchange files throughout the cluster. The Cluster, Command and Control (C3) tool set offers a comprehensive set of commands to perform parallel command execution across cluster(s) as well as file scatter and gather operations [6, 7]. The tools are useful at both administrative and user levels. The tool set is the product of scalable systems research being performed at ORNL [8].

C3 includes commands to execute (cexec) across the entire cluster—or a subset of nodes—in parallel. File scatter and gather (cpush/cget) operations are also available. The C3 power tools have been developed to span multiple clusters. This multi-cluster capability is not fully harnessed by OSCAR currently, but is available for administrators or standard users.
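A few hedged examples of the scatter/gather side of the tool set; the node range and file names are illustrative:

shell$ cpush /etc/ntp.conf /etc/ntp.conf     # push a file to every node
shell$ cexec :1-4 uptime                     # run a command on nodes 1-4 only
shell$ cget /var/log/messages /tmp/logs/     # gather a file back from each node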

C3 is used internally throughout the OSCAR toolkit to distribute files and perform parallel operations on the cluster. For example, the user management commands, e.g., useradd, are made cluster-aware using the C3 tools. Since C3 enables standard Linux commands to be run in parallel, administrators can use the tools to maintain clusters. A common example is the installation of an RPM on all nodes of the cluster, e.g.,

shell$ cexec rpm -ivh foo-1.0.i386.rpm

3.3 Environment Switcher

Managing the shell environment—both at the system-wide level as well as on a per-user basis—has historically been a daunting task. For cluster-wide applications, system administrators typically need to provide custom, shell-dependent startup scripts that create and/or augment the PATH, LD_LIBRARY_PATH, and MANPATH environment variables. Alternatively, users could hand-edit their "dot" files (e.g., $HOME/.profile, $HOME/.bashrc, and/or $HOME/.cshrc) to create/augment the environment as necessary. Both approaches, while functional and workable, typically lead to human error—sometimes with disastrous results, such as users being unable to log in due to errors in their "dot" files.

Instead of these models, OSCAR provides the env-switcher OSCAR package. env-switcher forms the basis for simplified environment management in OSCAR clusters by providing a thin layer on top of the Environment Modules package [9, 10]. Environment Modules provide an efficient, shell-agnostic method of manipulating the environment. Basic primitives are provided for actions such as: adding a directory to PATH-like environment variables, displaying basic information about a package, and setting arbitrary environment variables. For example, a module file for setting up a given application may include directives such as:

setenv       FOO_OUTPUT  $HOME/results
append-path  PATH        /opt/foo-1.2.3/bin
append-path  MANPATH     /opt/foo-1.2.3/man

The env-switcher package installs and configures the base Modules package and creates two types of modules: those that are unconditionally loaded, and those that are subject to system- and user-level defaults.


Many OSCAR packages use the unconditional modules to append to PATH, set arbitrary environment variables, etc. Hence, all users automatically have these settings applied to their environment (regardless of their shell) and are guaranteed to have them executed even when executing on remote nodes via rsh/ssh.

Other modules are optional, or provide a one-of-many selection methodology between multiple equivalent packages. This allows the system to provide a default set of applications that optionally can be overridden by the user (without hand-editing "dot" files). A common example in HPC clusters is having multiple Message Passing Interface (MPI) [11, 12] implementations installed. OSCAR installs both the LAM/MPI [13] and MPICH [14] implementations of MPI. Some users prefer one over the other, or have requirements only met by one of them. Other users wish to use both, switching between them frequently (perhaps for performance comparisons).

The env-switcher package provides trivial syntax for a user to select which MPI package to use. The switcher command is used to select which modules are loaded at shell initialization time. For example, the following command shows a user selecting to use LAM/MPI:

shell$ switcher mpi = lam-6.5.9

3.4 OSCAR Database

The OSCAR database, or ODA, is used to describe the software in an OSCAR cluster. As of OSCAR version 2.2.1, ODA has seven tables in a MySQL database named oscar which runs on the head node. Every OSCAR package has an XML meta file named config.xml that is used to populate the package database. This XML file contains a description of the package and the RPM names associated with the package. Thus the database enables a user to find information on packages quickly and easily.

ODA provides a command line interface into the database via the oda command. This offers easier access into the database without having to use a MySQL client. ODA also provides an abstraction layer between the user and the actual backing store. The internal organization of data is masked through the use of a uniform interface, irrespective of the underlying database engine. The retrieval and update of cluster information is aided by ODA “shortcuts” that assist with common tasks.

ODA is intended to provide OSCAR package developers with a central repository for information about the installed cluster. This alleviates the problem of every package developer having to keep an independent data store, and makes the addition and removal of packages easier.

3.5 OSCAR Packages

An OSCAR package is a simple way to wrap software and a given configuration for a cluster. The most basic OSCAR package is an RPM in the appropriate location in the package directory structure. The modularity of this facility allows for easy addition of new software to the framework. OSCAR packages are most useful, however, when they also provide supplemental documentation and a meta file describing the package. The packaging API provides authors the ability to use scripts to configure the cluster software outside of the RPM itself. The scripts fire at different stages of the installation process, and test scripts can be added to verify the process. Additionally, an OSCAR Package Downloader (OPD) (see Section 3.6) is provided to simplify acquisition of new packages.

With this modular design, packages can be updated independently of the core utilities and core packages. The OSCAR toolkit is freely redistributable and therefore requires all “selected” packages to adhere to this constraint. However, any software which does not fulfill this requirement can be made available via an OSCAR repository with access via OPD.

The contents and directory structure of a typical OSCAR package are listed below. For further details on the creation of packages for the toolkit, see the OSCAR Architecture document in the development repository [15].

config.xml – meta file with description, version, etc. (see the sketch after this list)

RPMS/ – directory containing binary RPM(s) for the package

SRPMS/ – directory containing source RPM(s) used to build the package

scripts/ – set of scripts that run at particular times during the installation/configuration of the cluster

testing/ – unit test scripts for the package

doc/ – documentation and/or license information
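
As an illustration, a meta file for the hypothetical foo package used in the earlier module example might look roughly like the sketch below. The element names and structure shown here are assumptions for illustration only; the authoritative schema is described in the OSCAR Architecture document [15].

<oscar>
  <name>foo</name>
  <version>1.2.3</version>
  <summary>Example application package</summary>
  <description>
    Installs the (hypothetical) foo application and its
    man pages on the head node and all cluster nodes.
  </description>
  <rpmlist>
    <rpm>foo</rpm>
    <rpm>foo-devel</rpm>
  </rpmlist>
</oscar>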

3.6 OSCAR Package Downloader

The OSCAR Package Downloader (OPD) provides the capability to download and install OSCAR software from remote package repositories. A package repository is simply an FTP or web site. Given the ubiquitous access to FTP and web servers, any organization can host its own OSCAR package repository and publish its packages on it. There is no central repository; the OPD network was designed to be distributed such that no central authority is required to publish OSCAR packages. Although the OPD client program downloads an initial list of repositories from the OSCAR Working Group web site,3 arbitrary repository sites can be listed on the OPD command line.

Since package repositories are FTP or web sites, any traditional FTP client or web browser can also be used to obtain OSCAR packages. Most users prefer to use the OPD client itself, however, because it provides additional functionality over that provided by traditional clients. OPD offers two interfaces: a simple menu-based mechanism suitable for interactive use and a command-line interface suitable for use by higher-level tools (or automated scripts).

Partially inspired by the Comprehensive Perl Archive Network (CPAN), OPD provides the following high-level capabilities:

• Automating access to a central list of repositories

• Browsing packages available at each repository

• Providing detailed information about packages

• Downloading, verifying, and extracting packages

While the job that OPD performs is actually fairly simple and could be performed manually, having an automated tool for these functions provides ease of use for the end-user, performs multiple checks to ensure that packages are downloaded and extracted properly, and lays the groundwork for higher-level OSCAR package retrieval tools.

3The centralized repository list is maintained by the OSCAR working group. Upon request, the list maintainers will add most repository sites.



4 OSCAR Toolkit

The integration of common HPC packages is a key feature of the OSCAR toolkit. The focus of this paper, however, is the framework and its use by the various OCG working groups. A brief highlight of the packages is provided, with further details available in other articles [16, 17].

The “selected” packages that are included enable a user to install and configure the head node and cluster nodes to run HPC applications. These HPC packages include parallel libraries such as LAM/MPI, MPICH, and PVM. A batch queue system and scheduler—OpenPBS and MAUI—are included and set up with a reasonable set of defaults. The toolkit sets up common cluster services such as the Network File System (NFS) and the Network Time Protocol (NTP). Security packages like Pfilter [18] are set up, as well as OpenSSH with non-interactive access to all nodes in the cluster from the head node. The testing scripts provided with the packages are executed by the OSCAR framework to validate the cluster installation.

5 Thin-OSCAR

5.1 Goals

Thin-OSCAR4 is a workgroup dedicated to integrating diskless clustering techniques into OSCAR so that the OSCAR infrastructure can use diskless nodes. There are three classes of nodes in the thin-OSCAR perspective: diskless, systemless (a disk is present in the node but there is no operating system on the disk), and diskfull (a regular OSCAR node with a disk). At the time of this writing, only diskless and diskfull nodes are supported out of the box. Systemless nodes are supported to some extent, but some configuration files have to be edited manually.

4Workgroup web site: http://thin-oscar.ccs.usherbrooke.ca/

Moreover, a generic abstraction of a node is necessary to build a generic but comprehensive interface.

5.2 Principle of operation

The thin-OSCAR model is the following: the thin-OSCAR package is a collection of Perl scripts and libraries that are used to transform a regular SIS image (as used by OSCAR) into the two ram disks necessary for diskless and systemless nodes. The first ram disk to be transferred is called the “BOOT image” and is used to ensure that the node has network connectivity and NFS client capabilities, and to create a raid0 array of ram disks [19]. Once this minimal image (less than 4 MB) has booted, the RUN image from the second ram disk can be transferred. The RUN image is built directly from the SIS image and contains the complete system that will run on the node. Some directories are copied from the SIS image while others are NFS exports (read-only) exported directly from the SIS image directory [20].

In order to build the BOOT image, the SIS image name, the names of the modules to integrate into this ram disk (including the modules used by the NIC adapter), and the kernel version to use are needed. The root-raid-in-ram technique is necessary because one of the thin-OSCAR goals is to support the regular kernel (only 4 MB of ramdisk) without recompilation.

5.3 mini howto

In order to download thin-OSCAR, use the OSCAR Package Downloader, select the Université de Sherbrooke repository, and download thin-OSCAR. Before actually using thin-OSCAR, you have to complete a regular installation from stage 1 (selection of packages) to stage 6 (setup networking). Assign node IPs to MAC addresses and don’t forget to click on the “Setup Network Boot” button.

Once this is done, you are ready to use the thin-OSCAR package. Go to /usr/lib/oscar/packages/thin-oscar/ and run the ./oscar2thin.pl script. You will enter an interface where you define your diskless model, link the model to an OSCAR node, and then generate all the necessary ram disks (Configure All). Once this is done, you are finished and can reboot all the diskless nodes. You must know which kernel modules are necessary for your NIC before starting this process in order to set up network connectivity.

5.4 thin-OSCAR future

We will certainly move to a more modern RAM filesystem (tmpfs), as it is now available in regular kernels, and toward automatic detection of NIC modules so that BOOT ramdisk creation can be fully automated. Multicast transfer of the ramdisk is under study, as it will shorten the boot time and reduce network usage during boot (especially for big clusters!).

The proof of concept of thin-OSCAR has been done and, at the time of this writing, thin-OSCAR is used on a 180-node production diskless cluster [21]. An improved integration into the OSCAR framework is under development and will be available as soon as the OSCAR GUI enables it.

6 HA-OSCAR

High-Availability (HA) computing, once thought of as important only to industry applications such as telecommunications, has become critically important to the fundamental mission of high-performance computing. This is because very large and complex application codes are being run on increasingly large-scale distributed computing environments. Since COTS (commercial off-the-shelf) hardware is typically employed to construct these environments (clusters), quite often the application code’s runtime exceeds the hardware’s aggregated mean-time-between-failure rate for the entire cluster. Thus, in order to efficiently run these very large and complex applications, high-availability computing techniques must be employed in the high-performance computing (HPC) environment.

The current HPC release of OSCAR is not fully suitable for mission-critical systems, as it contains several individual system elements that exhibit a single-point-of-failure trait. In order to support HA requirements, clustered systems must provide ways to eliminate single points of failure. HA-OSCAR is a focus group with the goal of adding HA features to the original HPC OSCAR distribution. While HA-OSCAR is still a work in progress, the scope of this effort has been defined as three incremental steps: the creation of the HA-OSCAR whitepaper [22], an Active-Hot-Standby distribution, and an n + 1 Active-Active distribution.

Hardware duplication and network redundancy are common techniques utilized for improving the reliability and availability of computer systems. To achieve the HA-OSCAR cluster system, we must first provide a duplicate of the cluster head node. There are different ways of implementing such an architecture, including Active-Active, Active-Hot Standby, and Active-Cold Standby [23]. Currently, the Active-Hot Standby configuration is the initial model of choice. Figure 2 shows the HA-OSCAR cluster system architecture. We have experimented with, and plan to incorporate, Linux Virtual Server and Heartbeat mechanisms in our initial Active-Hot Standby HA-OSCAR distribution. However, we will extend the initial architecture to support Active-Active HA after release of the Hot-Standby distribution. The Active-Active architecture will provide better resource utilization since both head nodes will be simultaneously active and providing services. The dual master nodes will run redundant OpenPBS, MAUI, DHCP, NTP, TFTP, NFS, rsync, and SNMP servers. In the event of a head node outage, all functions provided by that node will fail over to the second redundant head node and all service requests will continue to be served, although at a reduced performance rate (i.e., in theory, 50% at peak or busy hours).

An additional HA functionality to support in HA-OSCAR is that of providing a high-availability network via redundant Ethernet ports on every machine, in addition to duplicate switching fabrics (network switches, cables, etc.) for the entire network configuration. This will enable every node in the cluster to be present on two or more data paths within its networks. Backed by this Ethernet redundancy, the cluster will achieve higher network availability. Furthermore, when both networks are up, improved communication performance may be achieved by using techniques such as channel bonding of messages across the redundant communication paths.

7 Conclusion

The Open Cluster Group (OCG) has formed several working groups focused on improving cluster computing management. The first working group, OSCAR, has evolved over time, as has the toolkit by the same name. The OSCAR toolkit has also been distilled in order to allow for more general usage. This more general OSCAR framework is being leveraged by subsequent working groups seeking to extend support for new cluster environments. These new environments include the areas of diskless and high-availability clusters, and are being pursued by the Thin-OSCAR and HA-OSCAR working groups, respectively.

Figure 2: Diagram of HA-OSCAR architecture. (An active head node and a hot-standby node for redundant recovery are joined by a heartbeat link and dual channels to shared RAID storage (NAS or SAN); two private-network switches connect clients 1 through N, and the active node connects to the Internet.)

These OCG working groups are providing the cluster community with sound tools to simplify and speed cluster installation and management.

References

[1] Richard Ferri. The OSCAR revolution. Linux Journal, (98), June 2002. http://www.linuxjournal.com/article.php?sid=5559

[2] Edward C. Bailey. Maximum RPM: Taking the Red Hat Package Manager to the Limit. Red Hat Software, Inc., 1998.

[3] Sean Dague. System Installation Suite, Massive Installation for Linux. In The 4th Annual Ottawa Linux Symposium (OLS'02), Ottawa, Canada, June 26-29, 2002.

[4] System Installation Suite (SIS), http://www.sisuite.org/

[5] A. Tridgell and P. Mackerras. The rsync algorithm. Technical Report TR-CS-96-05, Australian National University, Department of Computer Science, June 1996. (See also: http://rsync.samba.org/)

[6] M. Brim, R. Flanery, A. Geist, B. Luethke, and S. Scott. Cluster Command & Control (C3) tools suite. To be published in Parallel and Distributed Computing Practices, DAPSYS Special Edition, 2002.

[7] Cluster Command & Control (C3) Power Tools, http://www.csm.ornl.gov/torc/C3

[8] Al Geist et al. Scalable Systems Software Enabling Technology Center, March 7, 2001. http://www.csm.ornl.gov/scidac/ScalableSystems/

[9] John L. Furlani. Modules: Providing a flexible user environment. In Proceedings of the Fifth Large Installation Systems Administration Conference (LISA V), pages 141–152, San Diego, CA, September 1991. http://modules.sourceforge.net/

[10] John L. Furlani and Peter W. Osel. Abstract yourself with modules. In Proceedings of the Tenth Large Installation Systems Administration Conference (LISA '96), pages 193–204, Chicago, IL, September 1996. http://modules.sourceforge.net/

[11] Al Geist, William Gropp, Steve Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, William Saphir, Tony Skjellum, and Marc Snir. MPI-2: Extending the Message-Passing Interface. In Luc Bouge, Pierre Fraigniaud, Anne Mignotte, and Yves Robert, editors, Euro-Par '96 Parallel Processing, number 1123 in Lecture Notes in Computer Science, pages 128–135. Springer Verlag, 1996.

[12] Message Passing Interface Forum. MPI: A Message Passing Interface. In Proc. of Supercomputing '93, pages 878–883. IEEE Computer Society Press, November 1993.

[13] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI. In John W. Ross, editor, Proceedings of Supercomputing Symposium '94, pages 379–386. University of Toronto, 1994.

[14] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, September 1996.

[15] Open Cluster Group: OSCAR Working Group. OSCAR: A packaged cluster software for High Performance Computing. http://www.OpenClusterGroup.org/OSCAR

[16] Thomas Naughton, Stephen L. Scott, Brian Barrett, Jeff Squyres, Andrew Lumsdaine, and Yung-Chin Fang. The Penguin in the Pail – OSCAR Cluster Installation Tool. In Proceedings of SCI'02: Invited Session – Commodity, High Performance Cluster Computing Technologies and Applications, Orlando, FL, USA, 2002.

[17] Benoît des Ligneris, Stephen L. Scott, Thomas Naughton, and Neil Gorsuch. Open Source Cluster Application Resources (OSCAR): design, implementation and interest for the [computer] scientific community. In Proceedings of the 17th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2003), pages 241–246, Sherbrooke, Canada, May 11-14, 2003.

[18] Neil Gorsuch. PFILTER in OSCAR – Industrial Strength Cluster Firewalls in an Open Source Environment. In Proceedings of the 1st Annual OSCAR Symposium (OSCAR 2003), Sherbrooke, Canada, May 11-14, 2003.

[19] Mehdi Bozzo-Rey, Michel Barrette, Benoît des Ligneris, and Francis Giraldeau. Root raid in ram how to. In Proceedings of the 17th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2003), pages 241–246, Sherbrooke, Canada, May 11-14, 2003.

[20] Benoît des Ligneris, Michel Barrette, Francis Giraldeau, and Michel Dagenais. Thin-OSCAR: Design and future implementation. In Proceedings of the 1st Annual OSCAR Symposium (OSCAR 2003), pages 261–265, May 11-14, 2003.

[21] M. Barrette, X. Barnabé-Thériault, M. Bozzo-Rey, C. Gauthier, F. Giraldeau, B. des Ligneris, J.-P. Turcotte, P. Vachon, and A. Veilleux. Development, installation and maintenance of Elix-II, a 180-node diskless cluster running thin-oscar. In Proceedings of the 1st Annual OSCAR Symposium (OSCAR 2003), pages 267–271, May 11-14, 2003.

[22] I. Haddad, F. Rossi, C. Leangsuksun, and S. L. Scott. Telecom/High Availability OSCAR Suggestions for the 2nd Generation OSCAR. Technical Report TR-LTU-12-2002-01, Louisiana Tech University, Computer Science Program, December 2002.

[23] P. S. Weygant. Clusters for High Availability: A Primer of HP Solutions. Hewlett-Packard Company, Prentice-Hall, Inc., second edition, 2001.


Porting Linux to the M32R processor

Hirokazu Takata
Renesas Technology Corp., System Core Technology Div.

4-1, Mizuhara, Itami, Hyogo, 664-0005, Japan

[email protected]

Naoto Sugai, Hitoshi Yamamoto
Mitsubishi Electric Corp., Information Technology R&D Center

5-1-1, Ofuna Kamakura, Kanagawa 247-8501, Japan

{sugai,hitoshiy}@isl.melco.co.jp

Abstract

We have ported a Linux system to the Renesas1 M32R processor, a 32-bit RISC microprocessor designed for embedded systems, with an on-chip-multiprocessor feature.

So far, both UP (Uni-Processor) and SMP (Symmetric Multi-Processor) kernels (based on 2.4.19) have been ported and are operating on the M32R processor. A Debian GNU/Linux based system has also been developed in a diskless NFS-root environment, and more than 300 unofficial .deb packages have already been prepared for the M32R target.

In this paper, we describe this new architecture port in detail and explain the current status of the Linux/M32R project.

1Renesas Technology Corp. is a new joint semiconductor company established by Hitachi Ltd. and Mitsubishi Electric Corp. on April 1, 2003. It is expected to be the industry's largest microcontroller (MCU) supplier in the world. The M32R family microcontroller and its successors will continue to be supplied by Renesas.

1 Introduction

A Linux platform for the Renesas M32R processor has been newly developed. The Renesas M32R processor is a 32-bit RISC microprocessor designed for embedded systems. It is suitable for a System-on-a-Chip (SoC) LSI due to its compactness, high performance, and low power dissipation. So far, the M32R family microcomputers have been widely used in products in a variety of fields—for example, automobiles, digital still cameras, digital video camcorders, cellular phones, and so on.

Recently, the Linux system has begun to be widely used, even in embedded systems. Embedded systems will become more software-oriented hereafter. The more complex and larger an embedded system is, the more complicated the software becomes and the harder it is to develop. In order to build these kinds of embedded systems efficiently, it will be more important to utilize generic OSes such as Linux to develop software.

This is the first Linux architecture port to the M32R processor. This porting project, called the "Linux/M32R" project, has been active since 2000. Its goal is to prepare a Linux platform for the M32R processor. At first, this Linux porting was just a feasibility study for the new M32R processor development, and it was started by only a few members of the M32R development team. The project has since grown into a lateral project among Renesas Technology Corp., Renesas Solutions Corp., and Mitsubishi Electric Corp.

In this feasibility study, we have ported not only the Linux kernel, but also the whole GNU/Linux system, including GNU tools, libraries, and other software packages—the so-called "userland." We also enhanced the M32R processor to add an MMU (Memory Management Unit) facility in order to port the Linux system. And we have also developed an SMP kernel to investigate multiprocessing using the M32R's on-chip-multiprocessor feature [1]. At present, the Linux/M32R system can operate on dual M32R cores in SMP mode.

In this paper, we describe this new architecture port in detail and explain the current status of the Linux/M32R project.

2 Linux/M32R Platform for Embedded Systems

Recently, due to the continuous evolution of semiconductor technologies, it has become possible to integrate a whole system into one LSI chip, a so-called "System-on-a-Chip (SoC)."

In an SoC, microprocessor core(s), peripheral I/O functions, internal memories, and user logic can be integrated into a single chip.

By making use of wide internal buses, an LSI can achieve high performance which cannot be realized by a combination of several general-purpose LSIs. In other words, we can optimize system performance and cost by using an SoC, because we can employ the optimum hardware architecture and circuit configuration.

In such an SoC, the microprocessor core is a key part; therefore, a more compact and higher performance microprocessor is strongly required. To make such a high performance embedded processor core, an architectural breakthrough is necessary, not only advances in circuit and process technology. In particular, multiprocessor technology is important even for embedded processors, because it can improve processor performance and lower system power dissipation by scalably increasing the number of processors without increasing the operating clock frequency.

For an SoC with an embedded microprocessor, software is also a key point. The more highly functional a system is, the more complex the software becomes, and there is an increasing demand for shortening the development time of SoCs. Under such circumstances, the Linux OS has recently come to be adopted for embedded systems. A Linux platform makes it easy to port application programs developed on a PC/EWS to the target system. We believe that a full-featured Linux system will come into wide use in embedded systems, because embedded systems will become more functional and higher performance systems will be required.

In the development of embedded systems, it is important to tune system performance from both the hardware and software points of view. Therefore, we used the M32R softmacro and FPGA (Field Programmable Gate Array) devices to implement an evaluation board for rapid system prototyping. FPGA devices are slow, but they make it possible to develop a system with a short turn-around time.

In this feasibility study, to construct a Linux platform, we ported Linux to the M32R architecture, validated the hardware system architecture through the porting, and developed the software development environment.



Figure 1: M32R register architecture. (Sixteen 32-bit general purpose registers R0–R15, with R14 used as the link register and R15 as the stack pointer; the program counter PC; the control registers CR0 (PSW: processor status word), CR1 (CBR: condition bit register), CR2 (SPI: interrupt stack pointer), CR3 (SPU: user stack pointer), CR5 (EVB: EIT vector base register), and CR6 (BPC: backup PC); and the two accumulators A0 and A1.)

2.1 M32R architecture

The M32R is a 32-bit RISC microprocessor and employs a load-store architecture like other RISC processors. Therefore, memory is accessed only by load and store instructions, and logical and arithmetic operations are executed among registers. Except for the multiply and divide instructions, most instructions can be executed in one clock, and instruction completion does not depend on the instruction issuing order (out-of-order completion). The M32R supports DSP operation instructions such as multiply-and-accumulate instructions.

Figure 1 shows the M32R register architecture. The M32R has sixteen 32-bit general purpose registers (R0–R15) and 56-bit accumulators for the multiply-and-accumulate operations. R14 is also used as a link register (LR), which keeps the return address for a subroutine call. There are two stack pointer registers, SPI (interrupt stack pointer) and SPU (user stack pointer). The CPU core selects one of them as the current stack pointer (R15; SP) according to the SM (stack mode) bit of the PSW (processor status word).

2.2 M32R softmacro

The M32R softmacro is a compact microprocessor developed for integration into an SoC. It is a fully synthesizable Verilog-HDL model, and it has the excellent feature that the core does not depend on a specific process technology. Due to its synchronous, edge-triggered design, it has good affinity with EDA tools.

This M32R softmacro is so compact that it can be mapped onto one FPGA. Utilizing such an M32R softmacro, we developed software on prototype hardware and co-designed hardware and software simultaneously.

To port Linux to the M32R, some enhancement of the M32R softmacro core was needed; processor modes (user mode and supervisor mode) were introduced and an MMU module was newly supported.

• TLB (Translation Lookaside Buffer): instruction/data TLBs are full-associative, 32 entries each.

• Page size: 4 kB / 16 kB / 64 kB (user page), 4 MB (large page)

2.3 Integrated debugging function and SDI

The integrated debugging function is a significant characteristic of the M32R family microcomputer. The M32R common debugging interface, called SDI (Scalable Debug Interface), is used via five JTAG pins; the internal debug functions are controlled through these debug pins.

Using the JTAG interface defined in IEEE 1149.1, the internal debug function can be used. No on-chip or on-board memory to store monitor programs is necessary, because such monitor programs can be provided and executed via the JTAG pins.



3 Porting Linux to the M32R

The Linux system consists of not only the Linux kernel, but also the GNU toolchain and libraries. Of course, a target hardware environment is also necessary to execute Linux.

Therefore we had to accomplish the following tasks:

• Porting the Linux kernel

• Development of Linux/M32R platforms (M32R FPGA board, etc.)

• Enhancement of the GNU toolchain

• Porting libraries (GNU C library, etc.)

• Userland; preparing software packages

Actually, in the Linux/M32R development, these tasks have been carried out concurrently.

3.1 Porting Kernel to the M32R

In the Linux kernel, the architecture-dependent portion is clearly separated. Therefore, for a new architecture port, all that needs to be prepared is the architecture-dependent code, which is far less than the whole Linux code. In the M32R case, include/asm-m32r/ and arch/m32r/ are needed.

To be more precise, we can prepare the architecture-dependent portion with reference to other architecture implementations. However, rewriting these portions has some difficulties:

asm functions: It was very difficult to port some headers in which the asm statement is used extensively, because an insufficient or inadequate rewrite easily causes bugs which are very hard to debug.

function inlining: In the Linux source code, function inlining is heavily used. We cannot disable the compiler's inline optimization to compile the kernel source, but a buggy toolchain sometimes causes severe problems in optimization.

In this porting, we started with a minimum configuration kernel. We prepared stub routines, built and checked the kernel operation, and gradually added files under kernel/, mm/, fs/, and other directories. When we started the kernel porting, there was no evaluation board, so we made use of the GNU M32R simulator to port the kernel at first.

The GNU simulator was very useful at the initial stage of kernel porting, though it does not support an MMU. It also had the good characteristics that downloading was quite fast compared with the evaluation board and that C source-level debugging was possible.

Employing the simulator and an initrd image of a romfs root filesystem, it is possible to develop and debug the kernel's basic operation, such as scheduling, initialization, and memory management. Indeed, demand loading is only performed once /sbin/init is executed by an execve() system call at the end of the kernel boot sequence in init().

At first, we started to port the kernel 2.2.16. The current stable version of the M32R kernel is 2.4.19, and we are now developing a 2.5 series kernel for the M32R.

3.1.1 System Call Interface

In the M32R kernel, as on other processors, a system call is handled by a system call trap interface. In the system call interface (syscall I/F), all general purpose registers and accumulators are pushed onto the kernel stack to save the context.



Figure 2: System call interface. (A system call is invoked with TRAP#2; R7 carries the system call number and R0..R6 carry up to seven arguments; on entry to the kernel the stack pointer is switched from the user stack (SPU) to the kernel stack (SPI).)


In Fig. 2, the syscall ABI (Application Binary Interface) for the M32R is shown. The two stack pointers, the kernel stack (SPI) and the user stack (SPU), are switched by software at the entry point of the syscall I/F routine, because the stack pointers do not change automatically on a TRAP instruction. In order to switch stack pointers without a working register and to avoid multiple TLB miss exceptions, the CLRPSW instruction was newly introduced.
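
To make the convention concrete, a hand-written user-space invocation of write() might look roughly like the following sketch; the __NR_write symbol, the message label, and the instruction selection are illustrative assumptions, not code taken from the actual kernel or C library.

    ldi   r0, #1              ; arg0: file descriptor (stdout)
    ld24  r1, #message        ; arg1: buffer address
    ldi   r2, #13             ; arg2: byte count
    ldi   r7, #__NR_write     ; system call number goes in R7
    trap  #2                  ; enter the kernel; the result comes back in R0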

The stack frame formed by the syscall I/F routine is shown in Fig. 3. It should be noted that there are special system calls in Linux, such as sys_clone(), that have a particular interface passing the stack top address as the first argument. Therefore, we employ a parameter passing method in which the stack top address (*pt_regs) is always put onto the stack as an implicit stack parameter, like the SuperH implementation.

According to the ABI of the M32R gcc compiler, the first 4 arguments are passed in registers and the following arguments are passed on the stack. Therefore, the *pt_regs parameter can be accessed as the eighth parameter on the kernel stack.

Figure 3: Stack frame formed by a system call. (From the stack top, SPI = pt_regs, toward higher addresses: R4, R5, R6, *pt_regs, R0, R1, R2, R3, R7, R8, R9, R10, R11, R12, syscall_nr, ACC0H, ACC0L, ACC1H, ACC1L, PSW, BPC, SPU, R13, LR, SPI, ORIG_R0, at offsets +0x00 through +0x64.)


The syscall_nr and ORIG_R0 field areused for the signal operations. When a sys-tem call is issued, its system call number isstored intoR7 and trap instruction is executed.syscall_nr also holds the system call num-ber in order to determine if a signal handler iscalled from a system call routine or not. Be-cause theR0 field might be changed to the re-turn value of a system call,ORIG_R0 keepsthe original value ofR0 in preparation to restartthe system call.

3.1.2 Memory Management

Linux manages the system memory by paging. In the M32R kernel, the page size is 4 kB, as on most other architectures.



Figure 4: Exception handling for the demand-loading operation. ((1) A TLB miss exception invokes the TLB miss handler; (2) an access exception invokes the access exception handler, which calls do_page_fault()/handle_mmu_fault(); the faulting access is re-executed after the execution of the exception handlers.)

In demand loading and copy-on-write operations, a physical memory page can be newly mapped when a page fault happens. Such a page fault is handled by both the TLB miss handler and the access exception handler (Fig. 4).

demand loading: If an instruction fetch or operand access is made to an address which is not registered in the page table, an MMU exception happens. In the case of a TLB miss exception, the TLB miss handler is called. To lighten the TLB miss handling operation, the TLB miss handler only sets TLB entries; page mapping and page-table setup are handled by the access exception handler. For an access to a page which does not exist in the page table, the TLB miss handler first sets the TLB entry's attribute to not-accessible. After that, since the memory access causes an access exception due to the not-accessible attribute, the access exception handler deals with the page table operations.

copy-on-write: In Linux, copying a process by fork() and writing to a page mapped in read-only mode are handled as copy-on-write operations to avoid needless copying. For such a copy-on-write operation, the TLB miss handler and access exception handler are used as in the demand loading operation.

The M32R's data cache (D-cache) is physically indexed and physically tagged, so cache aliasing does not have to be taken care of. Therefore the D-cache is flushed only for signal handler generation and trampoline code generation.

To simplify and speed up the cache flushing operations for trampoline code, a special cache flush trap handler (trap #12) is provided in the M32R kernel.

3.1.3 SMP support

In Linux 2.4, multiprocessing performance is significantly improved compared with Linux 2.2 and earlier, because kernel locking for accessing resources is finer grained.

To implement such kernel locking in an SMP kernel, spinlocks are generally used for mutual exclusion. But since the M32R has no atomic test-and-set instruction, the spinlock operations are implemented with the LOCK and UNLOCK instructions in the M32R kernel.

The LOCK and UNLOCK instructions are load and store instructions for mutual exclusion operations, respectively. LOCK acquires and keeps the privilege to access the CPU bus until an UNLOCK instruction is executed. By accessing a lock variable with a LOCK/UNLOCK instruction pair while interrupts are disabled, we implemented atomic access.
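
A hedged sketch of an atomic exchange built from this primitive is shown below; the register assignments and the assumption that interrupts are already disabled are illustrative, and this is not the actual arch/m32r spinlock code.

    ; exchange the word at @r1 with the value in r2 (interrupts disabled)
    lock    r0, @r1     ; load the old value and hold the bus privilege
    unlock  r2, @r1     ; store the new value and release the privilege
    ; r0 now holds the previous value; non-zero means the lock was already held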

Figure 5 shows an M32R on-chip-multiprocessor prototype chip. The Linux SMP kernel can be executed on this on-chip multiprocessor system. On-chip multiprocessors might become mainstream in the near future even in embedded systems, because a multiprocessor system can enhance CPU performance without increasing the operating clock frequency and power dissipation. In this chip, two M32R cores are integrated and each has its own cache for good scalability.



Figure 5: A micrograph of an on-chip-multiprocessor M32R prototype chip: (a) chip, (b) CPU.

3.2 Development of Linux/M32R Platform

To execute a full-featured Linux OS, an MMU is necessary; therefore, we developed a new M32R softmacro core with an MMU and made an evaluation board, "Mappi," which uses FPGAs to hold the M32R softmacro core, as a Linux/M32R platform.

As shown in Fig. 6, the Mappi evaluation board consists of two stacked boards. The upper board is a CPU board and the lower board is an extension board. The CPU board has no real CPU chip; instead it has two large FPGAs on it. We employ the M32R softmacro core and map it onto the FPGAs.

The Mappi board is a very flexible system for prototyping. If we have to modify the CPU or other integrated peripherals, we can immediately change and fix them by modifying their Verilog-HDL model.

At first, we could only use initrd and busybox on it, because the Mappi system had only a CPU board with just 4 MB of SRAM. After the extension board was developed, more memory (SDRAM), Ethernet, and a PC-card I/F became available, so we introduced NFS and improved the porting environment. It was in December 2001 that we succeeded in booting via a network using the extension board.

Utilizing the M32R's SDI function and a JTAG-ICE, as mentioned before, we can download and debug a target program via the JTAG port. This is much faster than a serial connection because the debug DMA function is used for downloading and for referring to internal resources. Of course, it is also possible to set hardware breakpoints for PC breaks and access breaks via SDI.

Generally speaking, it is very difficult to develop and debug software programs on unsteady hardware which is itself under development. But we could debug and continue to develop the system by using the SDI debugger, because the SDI debugger made it possible to access the hardware resources directly, which was very useful for kernel debugging.

Finally, we constructed an SMP environment to execute the SMP kernel, mapping the M32R softmacro cores to the two FPGAs on the Mappi CPU board.



Figure 6: Mappi: the M32R FPGA evaluation board; it has the M32R softmacro on FPGA (CPU, MMU, Cache, SDI, SDRAMC, UART, Timer), FPGA Xilinx XCV2000E ×2, SDRAM (64MB), Flash ROM, 10BaseT Ethernet, Serial 2ch, PC-card slot ×2, and Display I/F (VGA). (The CPU board carries the two FPGAs, with the CPU, memory, and bus interface unit in FPGA#0 and I/O plus user logic in FPGA#1; the extension board provides the display controller, LAN, PC-card slots, 4MB Flash ROM, and 64MB SDRAM.)

Concretely, we replaced the user logic portion in FPGA#1 shown in Fig. 6 with another M32R core and a bus arbiter, and modified the ICU (Interrupt Control Unit) to support inter-processor interrupts for the multiprocessor.

After the M32R prototype on-chip-multiprocessor chip was developed, the Linux/M32R system, including the userland applications, has mainly been developed using the real chip, because the operating clock frequency of the M32R FPGA is 25 MHz while the M32R chip can run more than 10 times faster.

3.3 M32R GNU toolchain enhancement

The GNU toolchain is necessary to develop the Linux kernel and a variety of Linux application programs. When we started the Linux porting, we had only Cygnus's (now Red Hat, Inc.) GNUPro™ m32r-elf toolchain. It was sufficient for kernel development; however, it could not be applied to user application development on Linux, because in a modern UNIX system a dynamic linking method is strongly required to build a compact system and achieve higher runtime performance. (A statically linked program is much faster than a dynamically linked program if the program size is small; the bigger a program becomes, the larger the cache miss penalty would be.)

We enhanced the M32R GNU toolchain to support shared libraries:

• Changed the BFD library to support dynamic linking; some relocations were added for dynamic linking.

• Changed GCC and Binutils to support PIC (Position Independent Code).

Because the GNUPro m32r-elf gcc was version 2.8 and too old, we had to upgrade it and develop a new m32r-linux toolchain. We applied the GNUPro patch to the FSF version of gcc and developed GCC (v2.95.4, v3.0, v3.2.2) and Binutils (v2.11.92, v2.13.90).

In the prologue portion of a C function, the following code is generated when the -fPIC option is specified.



; PROLOGUE
push r12
push lr
bl .+4        ; get the next instruction's PC address to lr
ld24 r12,#_GLOBAL_OFFSET_TABLE_
add r12,lr

We also modified the BFD libraries to support dynamic linking. We referenced the i386 implementation and supported ELF dynamic linking. In the ELF object format [3], the GOT (Global Offset Table) and PLT (Procedure Linkage Table) are used for dynamic linking. In the M32R implementation, the GOT is referred to by R12-relative addressing and the RELA type of relocation is employed. As in the IA-32 implementation, the code fragment of the PLT refers to the GOT to determine the symbol address, because this is suitable and efficient for the M32R's cache, which is simply flushed in its entirety.

As for GDB, we enhanced it to support a new remote target, m32rsdi, which uses the SDI remote connection. By using this gdb, we can do remote debugging of the kernel at the C source level. In the latest version of m32r-linux-gdb / m32r-linux-insight (v5.3), we have employed an SDI server that engages in accessing the JTAG port of the ICE/emulator connected to the parallel port of the host PC. This gdb makes it possible to debug using SDI, communicating with the SDI server in the background. Though the SDI server requires privileged access to use the parallel port, gdb itself can be used in user mode.

3.4 Porting GNU C library

The GNU C library is the most fundamental library and is necessary to execute a wide variety of application programs. So we decided to port it in order to implement a full-featured Linux system for this study, though its footprint is too large for a tiny embedded system.

We started to port glibc-2.2.3 (v2.2.3) at an early stage of the Linux/M32R porting, at about the same time that the kernel's scheduler began to work.

Then, the glibc for the M32R was developed step by step:

• Check with a statically linked program, for example hello.c (newlib version → glibc version).

• Build a shared library version of glibc and check with dynamically linked programs: hello.c, busybox, etc.

• Port the LinuxThreads library to support Pthreads (POSIX threads).

The latest version of the glibc for the M32R is glibc-2.2.5 (v2.2.5). It also supports the LinuxThreads library, which implements POSIX 1003.1c kernel threads for Linux. In this LinuxThreads library, we implemented fast user-level mutual exclusion using Lamport's algorithm [2], because a system-call-based implementation was quite slow due to context switching.

After the glibc porting was finished, we started to build various kinds of software. But it took several months to implement and debug the following:

• Fixup operations of the user_copy routines in the kernel

• Resolution of relocations by the dynamic linker ld-linux.so

• Signal handling

The dynamic linking operation in particular was one of the most difficult portions of this GNU/Linux system porting, because the dynamic linker/loader resolves global symbols and subroutine function addresses at runtime. Furthermore, the dynamic linker itself is also a shared library, so we could not debug it at the C source level. However, we debugged the linker making use of a simulator, the SDI debugger, and all kinds of things.

3.5 Userland

For the sake of preparing software packages and making Linux/M32R distributable, we built the major software packages.

We chose Debian GNU/Linux as the base distribution, because it is well managed and all of the package sources are open and published. In Debian, using commands such as dpkg and apt, it is possible to manage abundant software packages easily.

To build a binary package for the M32R, we did the following (a concrete command sequence is sketched after these steps):

1. Expand the source tree from the Debian source package (*.dsc and *.orig.tar.gz)

2. Rebuild the binary package by using the dpkg-buildpackage command, specifying the target architecture as m32r (dpkg-buildpackage -a m32r -t m32r-linux).
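
Concretely, on the build host the two steps above might look like the following; the package name and version are placeholders for illustration.

shell$ dpkg-source -x foo_1.2.3-1.dsc
shell$ cd foo-1.2.3
shell$ dpkg-buildpackage -a m32r -t m32r-linux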

So far, more than 300 unofficial .deb packages have been prepared for the M32R target, including basic commands such as self tools and shells, utilities, package management tools (dpkg, apt), and application programs as follows:

adduser, anacron, apt, base-files, bash, bc, binutils, bison, boa, bsdgames, bsdutils, busybox, coreutils, cpp-3.2, cvs, debianutils,

Figure 7: A snapshot of the desktop image of X; the window manager is AfterStep

devfsd, diff, dpkg, e2fsprogs, elvis-tiny, expect, file, fileutils, findutils, flex, ftp, g++-3.2, gcc-3.2, grep, gzip, hostname, klogd, less, libc6, locales, login, lynx, m4, make, mawk, modutils, mount, nbd-client, net-tools, netbase, netkit-inetd, netkit-ping, passwd, perl-base, perl-modules, portmap, procps, rsh-client, rsh-server, samba, sash, sed, strace, sysklogd, tar, tcl8.3, tcpd, tcsh, telnet, textutils, util-linux, wu-ftpd, . . .

Most of these packages were built under the cross environment, except for some software packages, such as Perl, the X server, and gcc, which had to be configured in the target environment. Therefore, self tools were necessary in order to build packages under the self-hosted environment. Fortunately, by using .deb packages and the dpkg-cross command, the same package files can be completely shared between the self and cross environments.

Regarding the GUI environment, we have ported some window systems (X, Microwindows, Qt/Embedded). We ported them easily using a framebuffer device. Figure 7 is a sample screen snapshot of an X desktop; gears and bounce are demonstration programs of the Mesa-3.2 3D-graphics library.

4 Evaluation

The Linux/M32R system's conformance has been checked and validated using the LSB (Linux Standard Base) test suites, which are open test suites based on the LSB Specification 1.2 [5]. In this validation, we compared the following two environments.

Linux/M32R
    based on Debian GNU/Linux (sid)
    linux-2.4.19 (m32r_2_4_19_20030109)
    glibc-2.2.5 (libc6_2.2.5-6.4_m32r.deb)
    gcc-3.0 (self gcc; m32r-20021112)

RedHat 7.3 2
    linux-2.4.18-10 (kernel-2.4.18-10.i686.rpm)
    glibc-2.2.5 (glibc-2.2.5-42.i686.rpm)
    gcc-2.96

The result of the validation is shown in Table 1. Judging from the result, the LSB conformance of Linux/M32R is no worse than that of Red Hat Linux 7.3, because the original Debian distribution has basically good LSB conformance and quality.

5 Future work

To apply Linux/M32R to embedded systems, it is indispensable to tune and shrink the whole system further. As for the kernel in particular, tuning and improving realtime performance will be strongly required.

At present, we are porting the Linux 2.5 series kernel to the M32R in order to support state-of-the-art kernel features, such as the O(1) scheduler, the preemptible kernel, no-MMU support, NUMA support, and so on.

2The kernel and glibc are upgraded and differ from those in the original Red Hat 7.3 distribution.


We are also planning to utilize the DMA function and the internal SRAM to increase Linux system performance. And for high-end embedded systems, we intend to continue to focus on the SMP kernel for the on-chip multiprocessor.

6 Summary

To build a Linux platform, we have ported a GNU/Linux system to the M32R processor. In this work, a hardware/software co-design methodology was employed, using a fully synthesizable M32R softmacro core to accelerate the Linux/M32R development. To develop an SoC in a short time, such a hardware and software co-design and co-debugging methodology will become more important hereafter.

Linux will play a great role not only in the field of PC servers but also in embedded systems in the near future. Through this feasibility study, we believe that Open Source will have a quite large impact on embedded system design and development. If we have the opportunity, we hope to publish the Linux/M32R port and the M32R GNU toolchain.

7 Acknowledgements

The authors greatly acknowledge the collaboration and valuable discussions with the M32R development team [1] and thank Takeo Takahashi, Kazuhiro Inaoka, and Takeshi Aoki for their special contributions. We would also like to thank Dr. Toru Shimizu and Hiroyuki Kondo for their promotion of the M32R processor development project.



Section       ANSI.hdr  ANSI.os F  ANSI.os M  POSIX.hdr  POSIX.os F  POSIX.os M  LSB.os F  LSB.os M  Total  RedHat7.3 Total
Expect             386       1244       1244        394        1600        1600       908       908   8284             8284
Actual             386       1244       1244        394        1600        1600       908       908   8284             8284
Succeeded          176       1112         86        207        1333           0       695         0   3609             3583
Failed               4          0          0          5           2           0        49         0     60               45
Warnings             0         12          0          0           5           0         2         0     19               18
FIP                  2          0          0          2           2           0         1         0      7                7
Unresolved           0          0          0          0           0           0         5         0      5                4
Uninitiated          0          0          0          0           0           0         0         0      0                0
Unsupported        203          0          0        179          72           0        59         0    513              513
Untested             0          4          0          0           7           0        39         0     50               43
NotInUse             1        116       1158          1         179        1600        58       908   4021             4021

Key: F: function, M: macro; FIP: Further Information Provided

Table 1: LSB 1.2 testsuites result

References

[1] Satoshi Kaneko, Katsunori Sawai, Norio Masui, Koichi Ishimi, Teruyuki Itou, Masayuki Satou, Hiroyuki Kondo, Naoto Okumura, Yukari Takata, Hirokazu Takata, Mamoru Sakugawa, Takashi Higuchi, Sugako Ohtani, Kei Sakamoto, Naoshi Ishikawa, Masami Nakajima, Shunichi Iwata, Kiyoshi Hayase, Satoshi Nakano, Sachiko Nakazawa, Osamu Tomisawa, Toru Shimizu. A 600MHz Single-Chip Multiprocessor with 4.8GB/s Internal Shared Pipelined Bus and 512kB Internal Memory. Proceedings of the 2003 International Solid-State Circuits Conference, 14.5.

[2] Leslie Lamport. A Fast Mutual Exclusion Algorithm. ACM Trans. on Computer Systems, Vol. 5, No. 1, Feb. 1987, pp. 1-11.

[3] Executable and Linkable Format (ELF), http://www.cs.northwestern.edu/~pdinda/ics-f01/doc/elf.pdf

[4] Debian GNU/Linux, http://www.debian.org/

[5] LSB testsuites, http://www.linuxbase.org/test/, ftp://ftp.freestandards.org/pub/lsb/test_suites/released-1.2.0/runtime/


Linux ∗ in a Brave New Firmware Environment

Matthew Tolentino
Intel Corporation

Enterprise Products & Services Division

[email protected]

Abstract

Initially included exclusively on Intel®1 Itanium®2 platforms, the Extensible Firmware Interface (EFI) architecture will soon be supported on IA-32 server, workstation, and desktop systems. This paper provides insight into the design and composition of an EFI-enabled IA-32 Linux kernel capable of booting on legacy-free platforms. An overview of the EFI development environment is provided, including the specifications, development tools, and software development kits available for development today.

The design and prototype implementation of the kernel initialization sequence, from the instantiation of the EFI-enabled Linux boot loader to the login prompt, is detailed with an emphasis on maintaining backward compatibility with existing legacy platforms. The legacy-free VGA replacement, the Universal Graphics Adapter (UGA), is introduced in the context of Linux, including the requirements for its use within the kernel. Additionally, details of a prototype implementation of the Universal Graphics Adapter driver stack, including an EFI Byte Code Interpreter and Virtual Machine (EBC VM), are presented and analyzed.

∗Linux is a trademark of Linus Torvalds
1Intel is a registered trademark of the Intel Corporation
2Itanium is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries

This paper concludes with a call for kernel developers to review and provide feedback on the design and implementation presented.

1 Introduction

This paper begins with an overview of the Extensible Firmware Interface and the Universal Graphics Adapter. The architecture of an EFI-enabled Linux kernel is presented, as well as the design and implementation details of the EFI Linux boot loader, kernel initialization changes, and support for the Universal Graphics Adapter. Future enabling work is outlined and conclusions are presented.

2 EFI Overview

The EFI Specification defines a consistent, architecturally neutral interface between platform firmware and operating systems. Designed to address the limitations and issues inherent in legacy BIOS support for PC-AT systems, EFI provides a core set of services and protocol interfaces used to initialize platform hardware, as well as common interfaces to access platform capabilities. Additionally, EFI aggregates platform configuration information that operating systems require for initialization—information that has traditionally been obtained through BIOS calls. This includes details ranging from the system memory map and the number of processors in the system to the parameters of the current video console. Further, the structure of EFI is modular, such that as new technologies are incorporated into systems, the interfaces to the operating system for these technologies will not require significant modifications. The system-level view of the conceptual design of EFI is depicted in Figure 1.

Figure 1: EFI System View. (The operating system kernel and the EFI OS boot loader sit above the EFI Boot Services and EFI Runtime Services, which in turn sit above the hardware; interfaces from other specifications, such as ACPI and SMBIOS, are exposed alongside.)

Once the system is powered on, the system firmware initializes and owns all hardware resources. Before an operating system is loaded, EFI provides a pre-boot environment and manages platform resources through the Boot Services and Runtime Services tables. These tables, encapsulated in the EFI System Table, provide access to system resources. Unlike legacy BIOS, which executes in 16-bit real mode, EFI provides a 32-bit protected mode operating environment.

2.1 EFI Boot Services

Accessed through the EFI System Table, EFI Boot Services provide platform-independent functionality that is only available before an operating system is loaded. This includes services such as memory allocation routines, device access protocols, time services, and others detailed in [1]. Once control of the system is transferred to the operating system, EFI Boot Services are terminated. The OS loader is required to call ExitBootServices() as part of the transition of control to the operating system.

2.2 EFI Runtime Services and Drivers

The EFI Runtime Services provide OS-neutral, platform-specific functionality that persists into OS runtime. One example of an EFI Runtime driver is the floating point software assist driver (fpswa.efi) on Itanium platforms discussed in [4]. Another example, supported by both IA-32 and Itanium platforms, is the Universal Graphics Adapter driver. EFI implementations may provide any number of EFI drivers for use during OS runtime in order to take advantage of the generic functionality. Details of these drivers are passed to the operating system via the EFI Configuration Table, and the memory locations occupied are specified as Runtime Memory.

2.3 EFI Configuration Table

Leveraging existing standards and technologies, the EFI Configuration Table provides an interface to commonly used tables that describe platform resources. Each entry in the Configuration Table comprises two elements: a Globally Unique Identifier, as defined in [9], and a pointer to the respective table. Examples of configuration table entries include ACPI tables, UGAs discovered by the firmware, and SMBIOS tables.

2.4 EFI Boot Loader

Generally, loading and initializing an operating system involves several steps. The first step is the determination of the kernel image to load into memory, as well as any additional components, such as a RAM disk, that are required. The second is the act of loading the chosen images into memory. The third and final step involves transferring control to the loaded kernel, thus enabling the kernel initialization sequence to commence.

Several boot loaders have been developed to load Linux on IA-32 platforms, including the popular lilo and grub loaders described in [7] and [8] respectively. Each of these provides the capability to boot Linux kernels from a legacy BIOS and offers varying degrees of functionality, as discussed in [3]. However, these loaders do not provide the capability to boot from the EFI environment. In order to boot a legacy-free Linux kernel from EFI, an EFI native application is necessary to set up and effect the transfer of control from the firmware to the kernel.

3 UGA Overview

The UGA is a software abstraction that provides an interface for simple graphical output that does not require specific implementation knowledge of the video hardware. UGA serves as a replacement for VGA hardware and video BIOS, providing graphical display capabilities in an image that can be used by both system firmware and operating systems. Unlike VGA, UGA enables several significant features, including support for controlling multiple output devices and higher default screen resolutions. UGA ROMs may reside anywhere in memory, alleviating the need for static reservation of specific video RAM and ROM memory regions. Additionally, multiple UGA ROMs may be used in a single system without conflict.

A key design consideration of UGA is the processor and platform independent nature of the driver image. Compilation of driver images into EFI Byte Code (EBC) enables a single image to execute on multiple architectures. The resultant byte code image must be interpreted and executed in the context of an EBC Virtual Machine capable of translating EBC instructions to native instructions.

Because the UGA is an EBC image and operating systems are expected to use the same image as the pre-boot firmware environment, an OS resident EBC Virtual Machine is required. This Virtual Machine provides an EFI execution environment within the OS, interfaces with the console layer and the PCI subsystem of the operating system, and provides the means to interpret and execute the EBC instruction stream of the UGA. Figure 2 depicts this generic design.

Figure 2: UGA Conceptual Design

Although UGA does provide graphics display capabilities, it is not intended to replace high-performance, operating system specific graphics drivers. Rather, the UGA is specified for use in the absence of a performance-oriented graphics driver. For example, a back-end server may not require extensive graphics-oriented functionality during normal operation. There may also be cases where an installation kernel requires the user to provide a graphics driver. In these cases, a UGA driver may be used as a default console driver. Further advantages of UGA can be found in section 10 of [1].


3.1 OS Console Driver

Support for UGA at OS runtime requires an interface to the OS specific console subsystem. This driver is responsible for effecting change to the video controller via the UGA protocols outlined in section 10 of [1].

3.2 OS EFI Byte Code Virtual Machine

The OS EBC Virtual Machine, as depicted in Figure 2, serves two functions. The first is to provide an EFI emulation environment functionally equivalent to the EFI pre-boot environment. This enables the use of the same UGA driver as the firmware through emulation of the EFI Boot Services.

The second function is to provide the facility to decode and execute EBC instructions. UGA drivers compiled into EBC cannot be directly executed. Therefore, an OS-present mechanism is needed to translate the EBC instruction stream of the UGA image into instructions of the native processor.

3.3 OS PCI Driver Interface

In pre-boot space, the UGA driver uses the PCI I/O Protocol, a Boot Services structure, to access the video controller. Because this mechanism is no longer available when the operating system takes control, PCI access must be provided via an OS level implementation of the PCI I/O Protocol interface structure. This enables the UGA to use the standard operating system interface to access PCI space.

3.4 UGA EBC ROM

The OS support for UGA is capable of executing any UGA EBC ROM discovered by firmware that is compliant with the EFI Driver Model. In order to facilitate the transfer of control from EFI to the operating system, UGA drivers used in pre-boot space will be halted upon termination of Boot Services. The image must be re-initialized to be used during OS runtime.

4 Architecture of an EFI enabled Linux Kernel

The advent of EFI on legacy free, IA-32 systems necessitates additional support in several key areas of the Linux kernel. This section presents an overview of the architectural differences in booting the kernel from EFI versus legacy BIOS as well as the impact of the changes. The details of the changes specific to these areas are discussed in the remainder of this paper.

4.1 Key Differences

One of the key differences in booting from EFI versus legacy BIOS on IA-32 systems is the capability to launch the kernel in 32-bit protected mode. The firmware no longer invokes the boot loader in 16-bit real mode. As a result, the kernel no longer needs to effect the transition to 32-bit protected mode.

Loading an EFI enabled kernel requires an EFI Linux boot loader. The loader is a native EFI application, which is responsible for obtaining platform configuration information from EFI and loading the kernel into memory. All platform configuration information is collected by the loader and passed to the kernel as boot parameters. This permits the loader to transfer control directly to the architecturally specific entry point of the kernel. This difference alleviates the need for the kernel code that collects system information via BIOS calls.

On EFI platforms, EFI defined services are employed to invoke firmware functionality as opposed to legacy BIOS calls. The EFI Runtime Services provide a standard interface for accessing firmware and hardware in a platform independent manner. For example, the real time clock is accessed via a firmware function in the Runtime Services table as opposed to directly reading CMOS. The practice of scanning memory to find the signatures of hardware description tables is no longer necessary because tables such as ACPI are included as part of the EFI interface.

The advent of the Universal Graphics Adapter eliminates the need for static reservation of fixed memory regions and presents the opportunity for kernel level multi-head console support.

The design of an EFI enabled kernel requires modifications to three crucial areas:

• Boot loader

• Kernel initialization sequence

• Console subsystem

4.2 Impact of EFI Kernel Changes

The changes to the kernel to support booting from EFI are relatively straightforward. The modifications to the boot loader involve extending the functionality of the IA-32 aspects of the Elilo loader to pass EFI data structures to the kernel.

The changes to the IA-32 kernel initialization sequence provide the capability to initialize kernel data structures, such as the memory manager, using EFI tables and data structures versus those obtained via BIOS calls. Primarily isolated in the architecturally specific directory of the kernel source tree, EFI support provides additional initialization options without affecting existing functionality. In other words, the changes to the kernel initialization sequence do not radically alter the architecture of the kernel.

Inclusion of UGA support necessitates several new Linux kernel drivers; however, this merely provides an additional console display option. Collectively, these drivers serve as an alternative to the legacy VGA console driver, but may also coexist with it operationally.

4.3 Architectural Influence

Itanium Linux kernels have included support for EFI for several years. Accordingly, much of the design discussed in this paper for IA-32 kernels has been leveraged from the Itanium port of Linux. Additionally, the prototype implementation has reused EFI related code to also support IA-32.

5 EFI Linux Boot Loader

Launching an operating system from EFI requires an EFI aware boot loader. This section provides background on the Elilo Linux Boot Loader, design considerations for improved IA-32 support, and details of the new boot parameters structure used to convey platform configuration information to the kernel.

5.1 Elilo Background

Elilo is the predominant loader used to launch Linux on Itanium platforms (on which EFI is the only pre-boot firmware solution) and has been included in numerous Itanium specific Linux distributions. Despite the pointed focus on supporting the Itanium architecture, a framework for booting self-extracting compressed IA-32 kernels has been incorporated into Elilo. This support permits booting IA-32 kernels without passing EFI information to the kernel. Instead, there is an implicit assumption that legacy mechanisms, such as BIOS calls, still exist for traditional platform configuration information retrieval. Essentially, Elilo manually fabricates the legacy boot parameters data structure without any semblance of EFI awareness.

5.2 Elilo Design Considerations

The primary consideration for modifying the Elilo loader is to provide adequate EFI information to the kernel to ensure proper initialization and functionality in a legacy free environment. The information required by the kernel consists of:

• Kernel Location and Size

• RAM disk (initrd) Location and Size

• Kernel Command Line

• EFI Memory Map

• Console Information

• EFI System Table

• ACPI Tables

• Other EFI Configuration Table Entries (HCDP, SMBIOS, etc.)

In addition to collecting and passing salient platform configuration information to the kernel, Elilo is also responsible for setting up the environment for passing control to the kernel. Further information regarding the features and capabilities of Elilo can be found in [6].

5.3 EFI Aware Boot Parameters

A new boot parameter structure is introduced that encapsulates the information the kernel requires. The form of the structure is as follows:

struct ia32_boot_params {
    UINTN command_line;
    UINTN efi_systab;
    UINTN efi_memmap;
    UINTN efi_memmap_size;
    UINTN efi_memdesc_size;
    UINTN efi_memdesc_version;
    UINTN initrd_start;
    UINTN initrd_size;
    UINTN loader_addr;
    UINTN loader_size;
    UINTN kernel_start;
    UINTN kernel_size;
    struct {
        UINT16 num_cols;
        UINT16 num_rows;
        UINT16 orig_x;
        UINT16 orig_y;
    } console_info;
} boot_parameters;

This data structure provides all necessary information to enable kernel initialization. Note that inclusion of the EFI System Table permits access to the EFI Runtime Services and Configuration Tables that contain further platform configuration information and provide access to runtime firmware functionality. For example, the location of the ACPI tables is presented to the kernel as an entry in the EFI Configuration Table.
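
As a concrete illustration, the loader-side population of this structure might look like the following minimal sketch. Only the structure and field names come from the paper; the function name and the parameters it receives are hypothetical placeholders for values Elilo has already gathered from EFI.

/* Hypothetical sketch: the loader fills in the boot parameter
 * structure before transferring control to the kernel.  Only the
 * structure and its fields are taken from the paper; everything
 * else is illustrative. */
static void fill_boot_params(struct ia32_boot_params *bp,
                             UINTN systab, UINTN memmap,
                             UINTN memmap_size, UINTN desc_size,
                             UINTN desc_version,
                             UINTN initrd, UINTN initrd_size,
                             UINTN kernel, UINTN kernel_size,
                             UINTN cmdline)
{
    bp->command_line        = cmdline;      /* kernel command line          */
    bp->efi_systab          = systab;       /* access to Runtime Services
                                               and the Configuration Table  */
    bp->efi_memmap          = memmap;       /* EFI memory map               */
    bp->efi_memmap_size     = memmap_size;
    bp->efi_memdesc_size    = desc_size;
    bp->efi_memdesc_version = desc_version;
    bp->initrd_start        = initrd;       /* RAM disk location and size   */
    bp->initrd_size         = initrd_size;
    bp->kernel_start        = kernel;       /* where the kernel was loaded  */
    bp->kernel_size         = kernel_size;
}

The loader and console fields are filled in the same manner before control is passed to the kernel entry point.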

6 EFI Kernel Initialization

The Linux kernel initialization sequence requires modification to utilize the EFI data structures that describe the platform hardware configuration. This section presents the details of the kernel modifications and contrasts these with the existing kernel initialization. The methodology for dynamic, runtime determination of the appropriate structures to use is presented, as are details on the EFI support routines necessary for proper kernel initialization.


6.1 Existing Kernel Initialization

The Linux kernel initialization sequence was developed to boot in 16-bit real mode and obtain platform configuration information via BIOS calls. Figure 3 outlines the current initialization sequence of the Linux kernel.

Figure 3: Existing Kernel Initialization

6.2 Kernel Initialization on EFI Platforms

Booting from EFI simplifies the kernel initialization sequence, but requires modifications to parse EFI data structures. Figure 4 depicts the proposed modified kernel initialization sequence.

Figure 4: EFI Kernel Initialization

In this model, the processor is already in 32-bit protected mode when the boot loader is invoked, hence the real mode to protected mode transition code in the kernel is not necessary. Also, all platform configuration information traditionally obtained through BIOS calls is collected by the EFI Elilo loader and passed via the boot parameter structure. This alleviates the need for the code resident in setup.S, video.S, and bootsect.S. Consequently, control is transferred directly from the loader to the architecturally specific startup_32() routine in 32-bit protected mode.

6.3 Dynamic Configuration Detection

Because the structure of platform configuration information on EFI platforms differs from the structure of BIOS provided information, the kernel must determine which to use to initialize kernel data structures. This determination is based on the location of the boot parameters in memory.

The boot parameters of legacy kernels are placed in a designated page in memory. Leveraging the re-use of this memory area, the boot parameters of EFI aware kernels are placed within the same page, but at a different offset. The correct kernel initialization code path to follow is determined by verifying which boot parameters have been passed to the kernel. A new global flag, efi_enabled, is set if the kernel was loaded from EFI. This flag is used during the kernel initialization sequence to determine the appropriate data structures to use during initialization. This capability enables a single kernel image to load on platforms with either legacy BIOS or EFI.
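
A minimal sketch of the idea follows. The offset of the EFI boot parameters within the page and the use of a non-zero efi_systab field as the discriminator are illustrative assumptions, not details taken from the paper.

/* Hypothetical sketch of boot-parameter detection.  The offset
 * and the test on efi_systab are illustrative assumptions. */
#define EFI_PARAMS_OFFSET 0x180   /* hypothetical offset within the page */

int efi_enabled;

static void __init detect_boot_params(void *boot_param_page)
{
    struct ia32_boot_params *bp = boot_param_page + EFI_PARAMS_OFFSET;

    if (bp->efi_systab != 0)
        efi_enabled = 1;    /* initialize from EFI data structures   */
    else
        efi_enabled = 0;    /* fall back to legacy BIOS information  */
}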

6.4 EFI Data Structure Mapping

Initially, only a limited kernel virtual address space is available through the temporary, statically initialized page global directory swapper_pg_dir. This maps the first eight megabytes of physical memory to kernel virtual address space starting at 3 GB.

This limited mapping is not sufficient because EFI drivers and related structures may be located anywhere in memory (below 4 GB); thus the kernel requires the capability to dynamically map EFI pages into kernel virtual address space for use during kernel initialization. Examples of these structures include the EFI memory map, RAM disk details, etc. Because these structures are only used during early initialization, the memory is only required temporarily. Once the kernel memory manager is initialized, these memory regions will be available for normal kernel memory allocations.

6.5 EFI Memory Map

The EFI memory map provides a snapshot of system memory usage before control is passed to the kernel. Consisting of memory descriptor entries that describe contiguous ranges of physical memory by type, attribute, and size, the EFI memory map serves as a replacement for the e820h memory map obtained via the legacy INT 15h (ax=0xe820) BIOS call. Each EFI memory map descriptor has the following structure:

struct efi_memory_desc {
    u32 type;
    u64 phys_start;
    u64 virt_start;
    u64 num_pages;
    u64 attribute;
};

In order to properly initialize the kernel memory manager as well as the EFI Runtime Drivers and Services, the EFI memory map is required. Several routines from the Itanium kernel have been adapted for the IA-32 kernel to support EFI memory map traversal and parsing.

6.5.1 EFI Memory Map Walking Routine

The prototype for the EFI memory map descriptor traversal routine is:

void efi_memmap_walk(efi_freemem_callback_t callback, void *arg);

This routine employs a callback mechanism used to discern further information about memory regions during memory map traversal. For example, used in concert with the find_max_pfn() callback routine, the kernel is capable of discovering the maximum page frame number.
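
As a sketch of how such a callback might be used (the start/end/argument callback signature is assumed here rather than quoted from the paper):

/* Sketch: find the highest page frame number by walking the EFI
 * memory map.  The (start, end, arg) callback signature is an
 * assumption for illustration. */
static int __init find_max_pfn(unsigned long start, unsigned long end,
                               void *arg)
{
    unsigned long *max_pfn = arg;
    unsigned long pfn = end >> PAGE_SHIFT;

    if (pfn > *max_pfn)
        *max_pfn = pfn;
    return 0;
}

static unsigned long __init discover_max_pfn(void)
{
    unsigned long max_pfn = 0;

    efi_memmap_walk(find_max_pfn, &max_pfn);
    return max_pfn;
}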

6.5.2 Other Memory Map Related Routines

Several additional routines have been added to accommodate initialization using the EFI memory map. The following routine is used to determine the memory type of the region described by a given memory descriptor.

static int is_available_memory(struct efi_memory_desc *md);

Differentiation between memory descriptor types is necessary to determine which regions are usable by the kernel. The following types constitute memory regions available for kernel use; a sketch of the corresponding check follows the list.

• EFI_LOADER_CODE

• EFI_LOADER_DATA

• EFI_BOOT_SERVICES_CODE

• EFI_BOOT_SERVICES_DATA

• EFI_CONVENTIONAL_MEMORY
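
A minimal sketch of the check, written against the descriptor structure shown earlier, might look as follows; the EFI_* type constants are assumed to be provided by the kernel's EFI header.

/* Sketch: report whether a descriptor describes memory the
 * kernel may use.  The EFI_* constants are assumed to come from
 * the kernel's EFI header. */
static int is_available_memory(struct efi_memory_desc *md)
{
    switch (md->type) {
    case EFI_LOADER_CODE:
    case EFI_LOADER_DATA:
    case EFI_BOOT_SERVICES_CODE:
    case EFI_BOOT_SERVICES_DATA:
    case EFI_CONVENTIONAL_MEMORY:
        return 1;       /* usable by the kernel */
    default:
        return 0;       /* runtime, reserved, MMIO, ... */
    }
}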

The following two functions are convenience functions used to determine the type and attributes of memory regions in the EFI memory map, given an address.

u32 efi_mem_type(unsigned long phys_addr);

u64 efi_mem_attributes(unsigned long phys_addr);

6.6 Persistent EFI Drivers and Services

Unlike EFI Boot Services, which are terminated when control is transitioned to the kernel, several EFI drivers and services are available for use during OS runtime. The following sections detail these services and describe the support framework for maintaining access to these drivers and services as well as managing the mappings into kernel virtual address space.

6.6.1 EFI Runtime Drivers and Services

The IA-32 Linux kernel requires the capability to call the EFI Runtime Drivers and Services to take advantage of platform specific functionality, such as access to the real time clock (RTC), persistent EFI NVRAM environment variables, and the capability to reset the system. The following structure, included in the include/linux/efi.h header, constitutes the kernel data structure through which calls to the EFI Runtime Services and Drivers are managed.

struct efi_runtime_services {
    struct efi_table_hdr hdr;
    unsigned long get_time;
    unsigned long set_time;
    unsigned long get_wakeup_time;
    unsigned long set_wakeup_time;
    unsigned long set_virtual_address_map;
    unsigned long convert_pointer;
    unsigned long get_variable;
    unsigned long get_next_variable;
    unsigned long set_variable;
    unsigned long get_next_high_mono_count;
    unsigned long reset_system;
};

Additional EFI Runtime Drivers may be employed to exploit OS independent functionality. For example, the floating point software assist driver (fpswa.efi) discussed in [4] is an EFI runtime driver on Itanium platforms, and the Universal Graphics Adapter driver is an EFI compliant runtime driver used on both IA-32 and Itanium platforms. Details of runtime drivers are passed to the operating system via the EFI Configuration Table.

The memory occupied by runtime drivers is reserved by the kernel to prevent the memory manager from viewing the area as free memory. Additionally, these memory regions are mapped into kernel virtual address space to avoid the overhead of invoking these services in flat, physical addressing mode. The mapping of runtime services and drivers into kernel virtual address space is provided by the following routine:

void efi_enter_virtual_mode(void);

This routine walks the EFI memory map and maps all regions described by memory descriptors of the following types:


• RunTimeServicesCode

• RunTimeServicesData

Once mapped, the VirtualStart field of the memory descriptor is updated with the virtual address returned by the mapping function, ioremap(). After all memory descriptors have been updated with virtual addresses, the EFI Runtime routine SetVirtualAddressMap is invoked and passed the updated EFI memory map. SetVirtualAddressMap must be called in physical mode, requiring the capability to transition to physical mode before invocation and return. This function updates all EFI runtime images with virtual addresses and completes all necessary fix-ups to enable EFI Runtime Services to be called in virtual mode. Should the call to SetVirtualAddressMap fail to complete or return an error status code, the kernel will panic.

During the mapping process, each memory descriptor is also checked to determine if the address of the EFI System Table is included in the range. Once SetVirtualAddressMap returns, the EFI System Table pointer is updated with the newly assigned kernel virtual address, as are the kernel's EFI data structures.

Because the ioremap capability is not available until the kernel memory manager is initialized, the efi_enter_virtual_mode() function must be called after the mem_init() and kmem_cache_sizes_init() functions in start_kernel().
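
A condensed sketch of the routine's core loop is shown below. The bookkeeping variables describing the memory map (memmap, nr_desc, desc_size, and so on) and the phys_efi_set_virtual_address_map() wrapper are illustrative assumptions; only the overall behavior is taken from the text.

/* Condensed sketch of efi_enter_virtual_mode().  The memory map
 * bookkeeping variables and the physical-mode call wrapper are
 * illustrative assumptions. */
static void *memmap;                     /* saved from boot parameters */
static unsigned long memmap_size, desc_size, desc_version, nr_desc;

extern int phys_efi_set_virtual_address_map(unsigned long memmap_size,
                                            unsigned long desc_size,
                                            unsigned long desc_version,
                                            void *memmap);

void __init efi_enter_virtual_mode(void)
{
    struct efi_memory_desc *md;
    unsigned long i;

    for (i = 0; i < nr_desc; i++) {
        md = memmap + i * desc_size;

        /* Only runtime services regions need to stay mapped. */
        if (md->type != EFI_RUNTIME_SERVICES_CODE &&
            md->type != EFI_RUNTIME_SERVICES_DATA)
            continue;

        /* Record the virtual address the firmware should use. */
        md->virt_start = (u64) (unsigned long)
            ioremap(md->phys_start, md->num_pages << PAGE_SHIFT);
    }

    /* SetVirtualAddressMap must be invoked in physical mode with
     * the updated map; the kernel panics on failure. */
    if (phys_efi_set_virtual_address_map(memmap_size, desc_size,
                                         desc_version, memmap))
        panic("EFI: SetVirtualAddressMap failed");
}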

6.7 ACPI Initialization

In order to support device discovery and power management, kernel support for ACPI is required. The ACPI tables contain vital platform configuration information necessary for proper kernel initialization, such as the number of processors in a system, PCI interrupt routing, etc. Inclusion of ACPI support in kernel builds requires the kernel configuration flag CONFIG_ACPI_EFI to be defined in order to enable inspection of the kernel's EFI data structure for the ACPI tables. An additional requirement for proper ACPI table discovery is to update the following ACPI initialization function:

unsigned long __init acpi_find_rsdp(void);

On legacy systems this function scans memory looking for the Root System Description Pointer (RSDP). On EFI based systems, the address of the ACPI tables is included in the EFI Configuration Table. This function has been updated to examine the EFI Configuration Table for the address of the RSDP.
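
The updated function might be sketched as follows. The efi.acpi20 and efi.acpi fields holding the cached Configuration Table pointers, and the legacy scanning helper, are assumptions for illustration; the paper only states that the Configuration Table is consulted.

/* Sketch: return the RSDP address.  On EFI systems the pointer
 * comes from the EFI Configuration Table (cached in a kernel
 * 'efi' structure whose field names are assumed here); legacy
 * systems keep the original memory scan. */
unsigned long __init acpi_find_rsdp(void)
{
    if (efi_enabled) {
        if (efi.acpi20)                 /* ACPI 2.0 table, if any */
            return (unsigned long) efi.acpi20;
        if (efi.acpi)                   /* otherwise ACPI 1.0     */
            return (unsigned long) efi.acpi;
        return 0;
    }
    /* Legacy path: scan the BIOS areas for the RSDP signature
     * (hypothetical helper name). */
    return acpi_scan_rsdp_legacy();
}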

7 Kernel UGA Architecture

The Linux kernel requires UGA support in order to provide console display functionality on legacy free platforms. Because the UGA is compiled into EFI Byte Code and is programmatically designed for execution in the EFI environment, kernel level support for UGA requires the addition of new driver functionality. These drivers, also depicted in Figure 5, include:

• EFI Boot Service Emulation Driver

• EBC VM & Interpreter Driver

• UGA Console Driver

Each of these components may be built as kernel driver modules, although all are required for proper console display. Because these drivers have minimal architectural dependencies, all EBC drivers are included in a new drivers/ebc directory of the kernel tree.


Figure 5: Kernel UGA Architecture

7.1 EFI Boot Services Emulation Driver

As is the case with all EFI drivers, the UGA requires the EFI System Table, as well as a pointer to itself, as initialization parameters. All requests from the UGA for services to allocate memory, bind to the respective controller, and other routines are based on EFI Boot Services functionality. Because EFI Boot Services are terminated when ExitBootServices() is invoked, a minimal framework must be fabricated within the kernel in order to simulate EFI Boot Services. The Boot Services Emulation Driver fulfills this requirement by providing kernel implementations of the Boot Service routines. For example, the EFI memory allocation routine AllocatePages() is implemented with the kernel's kmalloc() service.
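
For instance, the emulated allocation routine could be little more than a thin wrapper around the kernel allocator, along the lines of the sketch below; the parameter list and status values follow the general EFI style but are assumptions here, not the prototype's actual code.

/* Sketch of a Boot Services emulation routine: back the EFI
 * AllocatePages() service with kmalloc().  The parameter list
 * and return codes are illustrative assumptions. */
static unsigned long emu_allocate_pages(int alloc_type, int memory_type,
                                        unsigned long num_pages,
                                        u64 *memory)
{
    void *buf = kmalloc(num_pages << PAGE_SHIFT, GFP_KERNEL);

    if (buf == NULL)
        return 1;               /* report "out of resources"   */

    *memory = (u64) (unsigned long) buf;
    return 0;                   /* report success (EFI_SUCCESS) */
}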

7.2 EBC VM & Interpreter Driver

A kernel implementation of an EBC Virtual Machine and interpreter is also required. This driver provides the framework and execution engine to:

• Interpret and execute all EBC instructions

• Provide an interface to handle calls between a native environment and the VM

• Provide an interface to fix up calls to EBC driver images.

Details on the full EBC instruction set can be found in Chapter 19 of [1]. However, details on two instructions that require particular attention are included in the following sections.

7.2.1 BREAK 5 Instruction

The BREAK 5 instruction enables the transition of the instruction stream from native to EBC instructions. This technique is referred to as thunking in [1]. During compilation of the UGA ROM image, the EBC compiler inserts BREAK 5 instructions in the initialization instruction sequence to handle the transition for each of the image's protocol entry points. Functionally, the BREAK 5 instruction introduces a level of indirection, as the VM/Interpreter must replace the address of every entry point in an image with the address of a thunk. The thunk is an area in memory that contains the encoding of native instructions that transition control to the interpreter with an entry point of the image. This enables the seamless interpretation and execution of the EBC instruction stream via the UGA protocol interfaces. Figure 6 depicts the conceptual calling mechanism and an example implementation used to invoke the VM and interpreter at the appropriate UGA protocol entry point.
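
The fragment below sketches the indirection in C terms: each EBC protocol entry point is replaced by a pointer to a small native stub that re-enters the interpreter with the original EBC address. The structure layout and helper names are purely illustrative assumptions; the real thunk is a run-time generated block of native machine code.

/* Illustrative sketch of a thunk.  All names are hypothetical. */
struct ebc_thunk {
    unsigned char native_stub[32];  /* native code that calls
                                       ebc_interpret(ebc_entry, ...) */
    unsigned long ebc_entry;        /* original EBC entry point      */
};

/* Hypothetical helpers provided by the VM/interpreter driver. */
extern void ebc_emit_stub(struct ebc_thunk *t);
extern unsigned long ebc_interpret(unsigned long ebc_entry, void *args);

/* Replace an EBC entry point with the address of a native stub. */
static void *ebc_create_thunk(unsigned long ebc_entry)
{
    struct ebc_thunk *t = kmalloc(sizeof(*t), GFP_KERNEL);

    if (t == NULL)
        return NULL;
    t->ebc_entry = ebc_entry;
    ebc_emit_stub(t);               /* fill native_stub[]           */
    return t->native_stub;          /* native callers jump here     */
}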

7.2.2 CALLEX Instruction

The CALLEX instruction provides the mechanism to facilitate calls outside the context of the EBC instruction stream or VM. For example, during compilation, when the compiler observes a call to a Boot Service function, it inserts a CALLEX instruction, such that the transition of the stack, IP, return value, etc. are handled gracefully.

Figure 6: Thunking Mechanism

7.2.3 Architectural Dependencies

This driver supports both the IA-64 and IA-32 architectures. Architecture-specific features are provided through the inclusion of appropriate header files. For example, the inline assembly routines used to manipulate the stack pointer and obtain the entry point from a processor register are included in architecturally specific headers.

7.3 UGA Console Driver

The UGA console driver provides an interface between the Linux kernel console subsystem and the UGA ROM. Because the UGA is conceptually encapsulated (i.e., registered) with the EFI Boot Services Emulation Driver, the console driver will effect changes to the display through the use of the UGA_IO_PROTOCOL and UGA_DRAW_PROTOCOL protocols. As a result, the console driver must obtain the UGA protocol interface structures from the EFI Boot Services Emulation Driver.

Unlike VGA, multiple UGAs may be used in a single system. Therefore, the UGA console driver must maintain data structures to handle multiple UGAs simultaneously. For early boot message display, the console driver will initially employ the UGA used by the firmware, the details of which are included in the boot parameters.

Similar to the VGA console driver, the UGA console driver supports the generic console routines through the consw structure. Once the console driver obtains the protocol interfaces from the EFI Boot Services Emulation Driver and the driver is initialized, the UGA may be used for console display.

7.3.1 Firmware to OS Handoff Structure Parsing

Details of the UGAs discovered during firmware initialization are passed via the EFI Configuration Table. Each entry in the Configuration Table consists of a GUID and pointer pair. During kernel initialization, the pointer in the Configuration Table is stored in the kernel's efi structure. In the case of UGA, this points to the firmware-to-OS handoff header, which is of the following form:

struct efi_os_handoff_hdr {
    u32 version;
    u32 hdr_size;
    u32 entry_size;
    u32 num_entries;
};

This header is immediately followed by the driver handoff entries for all UGAs discovered by the firmware. Each driver handoff entry is of the following form:

struct efi_driver_handoff {
    int type;
    struct efi_dev_path *dev_path;
    void *pci_rom;
    u64 pci_rom_size;
};

The UGA console driver parses each of these driver handoff structures to obtain device information as well as the address of the PCI Expansion ROM that has been copied into memory by the firmware. The PCI Expansion ROM is then parsed to locate the EBC UGA image. Once the UGA image is located, the console driver updates its internal data structures with the image address and device information. Figure 7 shows the organization of the information passed from the firmware to the kernel.

Figure 7: UGA Configuration Table Parsing

Details on the header information and layout of the ROM image may be found in [2].
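
Walking the handoff area described above might look like the following sketch; the use of hdr_size as the offset to the first entry and the uga_register_device() helper are illustrative assumptions.

/* Sketch: walk the firmware-to-OS handoff area.  The header is
 * followed by num_entries driver handoff records of entry_size
 * bytes each; uga_register_device() is a hypothetical helper
 * that records the device and its ROM address. */
extern void uga_register_device(struct efi_dev_path *path,
                                void *rom, u64 rom_size);

static void parse_uga_handoff(struct efi_os_handoff_hdr *hdr)
{
    void *entry = (void *) hdr + hdr->hdr_size;
    u32 i;

    for (i = 0; i < hdr->num_entries; i++) {
        struct efi_driver_handoff *dh = entry;

        uga_register_device(dh->dev_path, dh->pci_rom,
                            dh->pci_rom_size);
        entry += hdr->entry_size;
    }
}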

7.3.2 Executable Image Parsing

After the EBC image is located, the console driver initializes a list of UGAs in the system with device information as well as pointers to the actual PE32+ UGA EBC image in memory. In order to use the driver, the image must be effectively loaded. Although the image is already in memory, it must be parsed to correctly identify the entry point and image specific information. Once the entry point is obtained, the EBC VM and Interpreter is invoked through the EFI Boot Services Emulation Driver and the UGA is initialized. The effective loading and parsing of PE32+ EFI images requires that PE header structure information be included in the kernel.

8 Future Challenges

The possibility exists to offload kernel decompression to the Elilo loader. This loader extension would provide the capability to boot both compressed and uncompressed IA-32 kernel images, similar to existing functionality on Itanium platforms. This is an area that is still under investigation.

Additional work to enable the use of the more advanced features of UGA is ongoing. A prototype implementation has been developed that supports the UGA functionality discussed. However, this support is currently limited to a single console and does not account for possible UGA related changes to XFree86. Further, the current implementation does not address issues involved in supporting multi-head console within the kernel through the use of UGA.

9 Conclusions

EFI provides a standard interface to platform firmware from which to launch operating systems on legacy free platforms. The IA-32 Linux kernel requires several key changes to initialize properly on EFI supported platforms. The first change involves the inclusion of additional IA-32 support in the Elilo Linux boot loader. Modification of the kernel initialization sequence to enable the use of EFI constructs such as the EFI memory map and EFI Runtime Drivers and Services is required. The kernel also requires several additional driver modules in order to use the legacy-free UGA for console display.

As platform hardware evolves and legacy hardware and legacy BIOS support is phased out, operating systems must adapt. Inclusion of the capabilities discussed in this paper constitutes a significant step towards enabling Linux to boot on EFI based, legacy free platforms.

10 Acknowledgements

Special thanks to Mark Doran, Andrew Fish, Harry Hsiung, Mike Kinney, and Greg McGrath of the Intel EFI team for their contributions to this paper and always helpful advice. Thanks to Steve Carbonari, Mark Gross, Tony Luck, Asit Mallick, Mohan Kumar, Sunil Saxena, and Rohit Seth for reviewing this paper.

Special credit is due to Mark Gross for his efforts spent working on the design and implementation of the prototype EFI enabled kernel.

References

[1] EFI Specification, Version 1.1, Intel Corporation, 2003. http://developer.intel.com/technology/efi

[2] PCI Local Bus Specification, Revision 2.3, PCI Special Interest Group, Hillsboro, OR. http://www.pcisig.org/specifications

[3] Werner Almesberger, Booting Linux: The History and the Future, Proceedings of the Ottawa Linux Symposium 2000, July 2000.

[4] David Mosberger and Stephane Eranian, IA-64 Linux Kernel: Design and Implementation, Prentice Hall, NJ, 2002.

[5] ACPI Specification, Version 2.0b. http://www.acpi.info/spec.htm

[6] David Mosberger and Stephane Eranian, elilo-3.3a Documentation, 2000. ftp://ftp.hpl.hp.com/pub/linux-ia64/elilo-3.3a.tar.gz

[7] Werner Almesberger, LILO Technical Overview. ftp://metalab.unc.edu/pub/Linux/system/boot/lilo

[8] Eric Boleyn, et al., GNU GRUB. http://www.gnu.org/software/grub/grub.html

[9] Wired for Management Baseline, Intel Corporation, 1998. http://developer.intel.com/ial/WfM/wfmspecs.htm


Implementing the SMIL Specification

Malcolm Tredinnick
CommSecure

[email protected]

Abstract

Synchronized Multimedia Integration Language (SMIL) is a W3C recommendation for encoding multimedia presentations. It provides presentational control over not just the spatial layout of the document, but also the relationship between elements over time. At the present time there does not appear to be a high quality Open Source implementation of the SMIL 2.0 specification available. This paper describes one attempt at an implementation. Some ideas about where future software development could take this implementation to fulfill the requirements of other projects are also mentioned.

1 A Potted History Of SMIL

1.1 SMIL 1.0 — June 1998

In mid-1998, the W3C promoted the SMIL 1.0 specification [4] to Recommendation status. The goals of this recommendation were (from its abstract):

1. to describe the temporal behavior of a presentation,

2. describe the layout of the presentation on a screen, and

3. associate hyperlinks with media objects.

Here, media objects are either continuous media, such as video or audio presentations, or discrete media, such as text blocks and images.

The SMIL 1.0 specification was not overly complex. It was not short (about 30 pages when printed out), but that was because it introduced a number of elements with specific semantics that needed to be explained clearly. Implementations appeared, but SMIL did not set the world on fire. This was possibly because although it fulfilled the design requirements mentioned above, that was all it did. A presentation needed to be separated out into individual media elements in practice, and leveraging existing content was difficult1.

1.2 SMIL 2.0 — August 2001

Roughly three years later the SMIL 2.0 specification reached Recommendation status [5]. Superficially, this specification looked vastly different from the SMIL 1.0 version. It was also much larger (the 1.0 document was around 150 KB in size; 2.0 was 3.3 MB). However, upon closer inspection, it became apparent that the changes fell broadly into only a few categories.

1. Additional markup for transitions between media components, extra layout possibilities, content control (so that presentations on PDAs and large monitors could be driven from the same source), and extra events for controlling the presentation.

2. Improved accessibility features (for example, offering a choice between closed captioning and audio descriptions).

3. A modular design so that SMIL modules could be reused in conjunction with other XML modules. This also provides a way to allow small-footprint implementations to implement the commonly used portions of the language and omit the more complex parts whilst still being able to process documents intended for more featureful implementations.

1This is just the present author's observation. The real reason may be that SMIL 1.0 was ahead of its time; a solution in search of a problem.

SMIL 2.0 provided backwards compatibility with SMIL 1.0 documents and precisely laid out how to process older documents with respect to the newer features. Content written for the older specification did not become obsolete. In view of this, an implementation of "SMIL" need only focus on the 2.0 specification, since 1.0 compatibility comes for free2.

Today, SMIL is still not setting the world on fire. It is slowly gaining visibility, though, and is becoming incorporated into a number of auxiliary specifications, which is forcing it into people's consciousness. We consider a few of these uses later in this paper.

2 Building A SMIL Library

Around June of 2002, the present author stumbled across SMIL whilst looking for something entirely different and started to read the specification, as one does in those circumstances.

2Well, it comes after a lot of work, in practice. But you still only need to implement a single specification.

In November, a rough design for a library to parse SMIL documents and somehow present them was sketched out. Over the following weekends and the odd evening or two, code started to come together to the point where at the present time3 a complete implementation of the specification is within sight. The implementation has been done without reference to third-party products, since the whole idea was that it was meant to be a fun project and a test of how difficult it might be to implement from the specification alone.

Existing SMIL implementations all appear to be designed along the lines of building a complete presentation application for SMIL documents. The library being considered in this paper, libsmil, has different goals.

The libsmil library parses a SMIL document and extracts the data into a format that can easily be used by a client application to make the presentation. It is then up to the client application to start and stop the presentation of the various media objects in windows as directed by the library (see Figure 1).

Figure 1: Basic libsmil architecture

In essence, libsmil manages a collection of data structures containing information about the presentation for the client. From time to time, the library will poke the client and tell it to start or stop presenting one of the media objects in a particular location. Similarly, the client will call the library whenever the user (or other external stimulus) triggers an event that may influence the progress of the presentation. This encompasses actions like clicking on a "stop" button or following a hyperlink to some other location in the document. The library itself is agnostic about the means used by the client to present the information. This should provide the opportunity to reuse this library in a variety of different situations, since it has very few dependencies of its own and places few restrictions on the client's behavior and operation.

3As this paper is being written in April 2003.

3 An Interaction With A Presentation Client

In order to illustrate the features provided by libsmil, we consider a typical series of calls between the library and a client application. For our purposes here, it suffices to understand that a standalone SMIL document consists of a head and a body portion, much as XHTML does. The head of the document contains data about the presentation and the body contains the presentation itself. The explanation in this section necessarily skips over some of the minor steps, but we will endeavor to touch on all the highlights.

3.1 Pre-Processing

After initializing libsmil, the client reads in the document and creates a DOM representation using the GNOME XML library libxml2. This responsibility is given to the client since an internal representation of the data to be presented may already exist or may be created on the fly. This would be typical, for example, in the scenario where the client is using SMIL as part of a larger language, such as XHTML+SMIL (see [8]). For more information on this type of usage, see section 5.
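
For a standalone document the client-side setup might therefore be as simple as the following sketch. xmlReadFile() is the standard libxml2 parsing entry point; smil_preprocess() is a hypothetical name for libsmil's pre-presentation entry point, which is not named in the text.

#include <libxml/parser.h>

/* Sketch: the client builds the DOM with libxml2 and hands it to
 * libsmil for the pre-presentation pass.  smil_preprocess() is a
 * hypothetical name for that entry point. */
int client_load_document(const char *filename)
{
    xmlDocPtr doc = xmlReadFile(filename, NULL, 0);

    if (doc == NULL)
        return -1;              /* document is not well-formed XML */

    /* libsmil parses the head, records regions, custom tests and
     * media types, and builds the timing graph for the body. */
    return smil_preprocess(doc);
}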

The DOM is then passed to libsmil for initial, pre-presentation processing. Initially, this involves parsing all of the contents of the head element.

The library extracts from the document the regions that are going to be used in the presentation and what their dimensions are. A SMIL document can have a number of top-level regions (essentially, separate windows for a GUI client) and each top-level region can contain any number of named regions within it. These internal regions can be used to display videos, captions, text overlays on images or videos, or whatever else the content author desires. The regions may overlap and can have a Z-ordering applied to them so that one region always appears on top of another region. Each region can also have attributes specified to indicate how media within it is treated in terms of clipping, scrolling or scaling.

A SMIL document can specify any number of customized tests which are used in the body of the document. These are a sequence of named data elements which may contain arbitrary values. The values can be set by the user (through some interface on the client) or via a URI. Some of the tests are marked as intended for user setting, whereas others are nominated as "hidden," meaning that although an advanced client may permit them to be set, they will typically take their value either from a default setting or by retrieval from some remote location. So libsmil extracts all the custom tests, sets up their default values and types (visible or hidden), and puts them into a data structure for the client to present to the user upon request.

In the final phase of the pre-presentation scan, the library makes a pass over the body element of the document and tries to extract two more pieces of data:

1. a list of all the media types that are required to make the presentation, and

2. a graph of all the timing and synchronization dependencies between the elements in the body of the document.

The first item here is straightforward: every media object is marked with one of a limited number of tags and includes a compulsory attribute indicating the type of the media. Building up a list of required media types and linking the types to the URIs for the contents is just a matter of parsing these elements. This list is given back to the client, who can use it to pre-load the appropriate presentation libraries or prepare itself (and the user) for the fact that some items will not be presentable due to the lack of an appropriate output method.

The timing and synchronization data is what makes implementing the SMIL specification interesting and challenging. It is non-trivial! Rather than interrupt the flow of this section, we will just take it as given that some magic happens and a timing graph is constructed which libsmil will use during the presentation to control events. In section 4 we will come back to this problem, since it lies at the heart of the benefits SMIL can provide to external applications. The timing graph that is constructed here is only of use to the library and will remain hidden from the client. That is a fair division of knowledge: most of the complexity of a presentation lies in getting the timing issues correct, and that is ultimately what the client is relying on the library for.

After collecting all this data, control is returned to the client. The client can now query the library to extract a list of top-level regions that it will need to provide, or to find out which media formats are going to be required. The client then initializes any extra libraries it requires to make the presentation and registers a single callback with libsmil.

3.2 Running The Presentation

After all the data collection in the previous section, running a presentation is a reasonably simple process, with a couple of exceptions mentioned below.

The client program calls smil_run_presentation() to start the action4. If the user does nothing, the majority of the presentation will consist of libsmil calling the function that the client had registered at the end of the last section. This function contains instructions to start or stop the presentation of a particular media element in a particular region. For continuous media, a "start" instruction may also contain an offset at which to begin playing the fragment, a repeat count and possibly a time multiplier. This last option is an advanced feature in SMIL that permits authors to alter the natural duration of a media fragment by, in effect, accelerating or decelerating time as seen by that item only. Not all players will be able to support this feature and presentation authors need to allow for that possibility.

From time to time, the user may trigger an event at the user interface. This could be something like clicking on a hyperlink, hitting a hot-key, or closing a window on the desktop. The client will pass this event, along with any associated contextual information such as the name of the link or region that was clicked, back to the library. These events will be used by the library to control the future of the presentation, but from the client's point of view they simply result in further calls to the registered callback for starting and stopping presentations.

4This must be done in a thread or separate process, since at the same time as libsmil is creating events, the client must be responding to user interaction and presenting the media to the user.
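
Putting the pieces of this section together, the client side of the exchange might be structured roughly as follows. Apart from smil_run_presentation(), every libsmil identifier below (the registration call, the event-forwarding call, the instruction values and the callback signature) is a hypothetical placeholder, as are the player_* helpers.

/* Sketch of the client side of the interaction.  Except for
 * smil_run_presentation(), all libsmil names here are
 * hypothetical placeholders, as are the player_* helpers. */
enum smil_instruction { SMIL_START, SMIL_STOP };

extern void player_start(const char *uri, const char *region,
                         double offset, int repeat);
extern void player_stop(const char *uri, const char *region);

/* Called by libsmil whenever a media element must be started or
 * stopped in a given region. */
static void presentation_callback(enum smil_instruction what,
                                  const char *media_uri,
                                  const char *region,
                                  double offset, int repeat)
{
    if (what == SMIL_START)
        player_start(media_uri, region, offset, repeat);
    else
        player_stop(media_uri, region);
}

static void run_presentation(void)
{
    smil_register_callback(presentation_callback);

    /* Must run in its own thread or process: the client keeps
     * handling user interaction and rendering media while
     * libsmil generates events. */
    smil_run_presentation();
}

/* UI layer: forward user events (hyperlink clicks, the stop
 * button, a window being closed, ...) back to the library. */
static void on_link_activated(const char *link_name)
{
    smil_post_event("activateEvent", link_name);
}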

The only slightly complex feature when running a presentation is how to present features from the Animation modules. These SMIL modules provide a means to change the value of attributes on elements over time. The values can change linearly and non-linearly over time. They can jump to discrete values at different moments. The value-against-time graph can even be specified using Bezier splines.

The Animation modules are most commonly used when SMIL is incorporated into another language profile. So the attributes being changed might be things like the display style property on an XHTML element, or the length of a line in SVG. Displaying the host language elements correctly is clearly the job of the client. However, managing the complex value-against-time relationships is something that the SMIL library should be doing, since it possesses complete knowledge of the required algorithms.

Currently, when an animation element is begun, libsmil calls the client as normal with a start instruction and supplies the appropriate initial values for the attribute(s) in question. Subsequently, whenever the client is ready to redraw the element being animated, it calls smil_update_animation(. . .) with the given element/attribute pair identifier and retrieves the new value for the attribute that is being animated. This is a rare case of the client pulling the information it needs, rather than waiting to be notified of an update. Due to the small number of users of libsmil at the time of writing, it is unclear if this method of running an animation is the best. As more experience is gathered, this interface may change.
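
In pull style, the client's redraw path might look like the following sketch; the argument list of smil_update_animation() is left open in the text, so the parameters shown here and the set_attribute() helper are assumptions.

/* Sketch of the client pulling an animated value at redraw time.
 * The parameters of smil_update_animation() and the
 * set_attribute() helper are illustrative assumptions. */
extern const char *smil_update_animation(void *element_id,
                                         const char *attribute);
extern void set_attribute(void *element_id, const char *attribute,
                          const char *value);

static void redraw_animated_element(void *element_id, const char *attribute)
{
    /* Ask libsmil for the attribute's current animated value... */
    const char *value = smil_update_animation(element_id, attribute);

    /* ...and apply it in the host-language renderer (XHTML, SVG, ...). */
    set_attribute(element_id, attribute, value);
}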

The other exception to the normal run algorithm outlined above is precipitated by the Transition Effects module. This module provides a number of transitions that can be used when moving between media objects within the same region. They include various fades and screen wipes. A SMIL implementation (a player) supporting the Transition Effects module is required to support at least four transitions. However, the specification outlines over 100 transitions, with a fallback algorithm for when a transition is specified that the player cannot handle.

In the present implementation, a client is expected to know the required behavior for each of the transitions it supports. For example, the library might call the client with an instruction to begin the horizontal windshieldWipe with a duration of three seconds. How the client implements this wipe is left up to the client. It is possible (but not compulsory) for the client to tell libsmil which wipes it can handle so that the library can implement the fallback algorithm on behalf of the client. This way of using the library is consistent with keeping as many SMIL-related decisions in the library's domain as possible.

A client can take this to extremes and tell the library that it can do no transitions at all (that is, it does not support the Transition Effects module), in which case libsmil will simply optimize away the calls to begin and end transitions. This would be appropriate, for example, in a client that is designed to present audio and braille information for blind users; implementing wipes makes no sense in that situation.

This implementation is not perfect; it is one case where the client is required to have knowledge of the SMIL specification in some detail. The alternative, however, is for the library to pass back a bitmap of how a region should look as the wipe progresses. The problem with this is that it would involve putting knowledge about presentational techniques into a library that is otherwise devoid of such information. It would also remove some possibilities for the client to optimize or improve the wipe according to circumstances. For example, the same wipe performed on a region that is 300 by 200 pixels may look different on a PDA than on a screen capable of much higher resolutions and with more CPU power available.

In practice (limited as it may be), this implementation appears to work. The minimal four transitions required by the module are trivial to implement in a client. Some multimedia libraries, such as the gstreamer library from the GNOME project, also provide ways of doing standard wipes (the SMIL specification uses a number of wipes already specified in standards for the television and motion picture industries). Therefore, requiring clients to implement their own versions of each wipe may not be particularly onerous. Once again, continued use of libsmil should provide better indications on this front.

4 The Timing And Synchronization Module

As mentioned in section 3.1, the heart of libsmil is the timing and synchronization code. This is the longest and possibly the most complex portion of the specification. The libsmil implementation has been rewritten three times so far and although it is approaching completeness, testing continues in an effort to try and gain some confidence that the behavior is correct in all circumstances. A fourth (or further) rewrite is not out of the question as this code is required to be both easy (or even just possible) to maintain and fairly fast, since it is where most of the library's work is done while a presentation is running.

Without going into too many details, SMIL contains three container elements for holding content that is temporally related. The seq ("sequential") element contains items that are presented one after the other in the order they are given in the document. The par ("parallel") element contains items that are to be presented in parallel, subject to time constraints on their begin and end times and lengths. Finally, the excl ("exclusive") element wraps content of which only one item can be playing at any given time, although the order is not important. One might use excl to present a number of video clips from amongst which the user can select arbitrarily, with each new clip replacing the one that is currently playing.

Within each of these containers, each element can have a begin time (absolutely specified or relative to the start of the container), an end time, a duration and a repeat count5. Further to this, the containing element (the seq, par, or excl element) can also have begin, end, duration, and repeat counts attached to it. After all, this container may reside inside another temporal container and so on. In fact, this last possibility is universally true. All elements are assigned a behavior that is equivalent to one of the containers. By default, all elements, including and, in particular, body, act as seq elements. So everything in the document is played in order from beginning to end with well-defined semantics as it concerns scheduling.

By and large, scheduling the beginnings and ends of elements inside a seq or excl element is fairly straightforward. At least, these cases certainly fall out easily after the logic for handling the contents of a par is implemented.

5There are some restrictions on these values depending on the type of containing container, but we will treat them as all roughly the same here; no real confusion should result from doing so.

<par>
  <video id="vid"
         begin="-5s"
         dur="10s"
         src="movie.mpg" />
  <audio begin="vid.begin+2s"
         dur="8s"
         src="sound.au" />
</par>

Figure 2: A sample par container holding video and audio elements.

The difficulty for a par lies in the fact that elements may have multiple begin times (and multiple end times, but we shall omit a detailed discussion of those here). These times may also be relative to the begin or end times of other elements, even those within the same par container. Further, an element's starting time may be before the starting time of its containing element. The contained element will not begin before its parent, so the net effect is that when it does begin, it will appear to have already been playing for some time.

An example may make this clearer (see Figure 2). In this example, when the containing par element begins, the video will begin playing five seconds into its length. Thus it will appear as though it started from the beginning five seconds before the par element started. The video will then play for five seconds, since its total duration is ten seconds and it effectively started at minus five seconds. The audio element in this container begins two seconds after the video begins. This is effectively at minus three seconds, from the point of view of the par container. Thus the audio will really start playing three seconds in from its beginning and will run for a further five seconds, ending at the same time as the video element. This is a very typical example of how audio and video sequences can be synchronized despite having possibly come from different sources.

<par>
  <img id="foo"
       begin="0; bar.begin+2s"
       dur="3s" .../>
  <img id="bar"
       begin="foo.begin+2s"
       dur="3s" .../>
</par>

Figure 3: A ping-pong effect.

In this example, we saw a case of an elementthat has a definite starting time (thevideo el-ement) coupled with one that has an indefinitestarting time (theaudio element whose starttime depends upon thevideo element). Itis those elements with indefinite starting (andending) times that can make life difficult forthe implementation. In some cases, such as theabove example, the effective start time of theelement can be easily determined, even as earlyas the pre-presentation pass, since its only de-pendency is on something with a definite start-ing time. However, the start time could bedependent upon a something such as a buttonclick event being sent, which is unresolvableuntil presentation time.

One significant test of any synchronization im-plementation is something like the code frag-ment in Figure 3.

The effect here is that the image labeledfoo isdisplayed at time zero and lasts for three sec-onds, imagebar is displayed initially two sec-onds into the element and lasts for three sec-onds. This triggersfoo to be displayed againat four seconds into the element’s period (thestart time ofbar plus two more seconds) andso on, back and forth between the two image.The duration of three seconds here is relativelymeaningless: it could be anything longer than



Implementing the code to process this fragment requires a little planning. It appears that all of the indefinite time periods (the bar.begin+2s parts) can be resolved, since everything can be traced back to a dependency on the instance of foo that starts at time zero. However, there is no definite end to this element (although there may be one hidden in the omitted portion of the above example). It would obviously be a mistake to try to resolve all of the image start times out to infinity. Instead, libsmil detects that there is a loop in the dependency chain and stops resolving at that point. It then becomes a matter of resolving portions of the infinite dependency chain as required when the container element is being presented.
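To make the loop detection concrete, the sketch below shows one way a resolver could notice such a cycle in the begin-time dependency chain. The time_node structure and the function are hypothetical illustrations, not actual libsmil types, and real SMIL elements can have several begin times, which the sketch ignores.

/*
 * Hypothetical sketch of cycle detection in a begin-time dependency
 * chain.  "time_node" is an invented structure, not a libsmil type:
 * each node points at the node its begin time depends on.
 */
#include <stdbool.h>
#include <stddef.h>

struct time_node {
    struct time_node *depends_on;   /* e.g. "bar.begin+2s" points at bar */
    bool              visiting;     /* marker used during this walk      */
};

/* Returns true if resolving 'n' would loop back on itself. */
static bool begin_chain_has_cycle(struct time_node *n)
{
    bool cycle = false;
    struct time_node *cur;

    for (cur = n; cur != NULL; cur = cur->depends_on) {
        if (cur->visiting) {        /* already on the current path */
            cycle = true;
            break;
        }
        cur->visiting = true;
    }

    /* Clear the markers so later resolution passes start clean. */
    for (cur = n; cur != NULL && cur->visiting; cur = cur->depends_on)
        cur->visiting = false;

    return cycle;
}

Once such a cycle is found, the resolver can simply stop there and, as described above, unroll further instances of the chain lazily while the container is being presented.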

The examples that we have considered in this section are indicative of the problems to be solved by the timing and synchronization code. A glance at [5] shows that there are many more cases to consider, but the logic is fairly well explained in the specification. The problem is that there is just a lot of it, and the interactions between cases are complex.

In theory, getting the timing information correct is just a problem in directed graph theory. In practice, it is a maze of twisty passages, all alike, and somewhat difficult to navigate correctly.

5 Using libsmil To Extend Other Languages

In section 1.2 it was explained that one of the changes between versions 1.0 and 2.0 of the SMIL specification was that modularity was introduced. This was done along the same lines as the XHTML modular design and for the same reasons—it enables the language to be extended, or portions of it to be lifted and dropped into another language profile in order to extend the latter. It is natural, therefore, to try to design libsmil in such a fashion that it can assist with presenting these extension languages.

On the whole, this has not been too difficult. The languages that one would choose to extend with SMIL are things like XHTML, SVG and MathML—languages which normally are static presentations once rendered. SMIL adds the ability to change attribute values over time, particularly via the Timing and Synchronization modules and the Animation modules.

Using libsmil to render a document in, say, SVG+SMIL is very similar to rendering a pure SMIL document. The library does a pre-presentation pass over the document to build up information about the nodes it will influence and to create a time graph, just as in the standalone case. Once this pass is finished, the client renders the document using the initial values for all attributes. It then calls smil_run_presentation() and waits for the registered callback to be triggered with the usual instructions about starting or stopping some action. In this case, these actions will typically be things like changing the value of an attribute, rather than playing a media item.
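As a rough illustration, a client embedding the library for an SVG+SMIL document might be structured as in the sketch below. Only smil_run_presentation() is named above; the callback type, smil_parse_document(), and every signature shown here are assumptions made for the example (the code compiles against these assumed prototypes but naturally needs the real library to link), so the actual libsmil API will differ in detail.

/*
 * Hypothetical sketch of a client driving libsmil for an SVG+SMIL
 * document.  The callback type, smil_parse_document(), and all
 * signatures are invented; only smil_run_presentation() is named in
 * the text, and its real prototype may differ.
 */
#include <stdio.h>

typedef enum { SMIL_ACTION_START, SMIL_ACTION_STOP } smil_action_t;

typedef void (*smil_action_cb)(smil_action_t action, const char *node_id,
                               const char *attr, const char *value,
                               void *user_data);

/* Assumed prototypes, standing in for the real libsmil header. */
void *smil_parse_document(const char *uri, smil_action_cb cb, void *user_data);
void  smil_run_presentation(void *doc);

/* Invoked by the library whenever a timed attribute change fires. */
static void on_smil_action(smil_action_t action, const char *node_id,
                           const char *attr, const char *value, void *user_data)
{
    (void)user_data;
    if (action == SMIL_ACTION_START)
        printf("set %s on node %s to %s\n", attr, node_id, value);
    else
        printf("restore %s on node %s\n", attr, node_id);
    /* A real client would update its parsed object model and redraw. */
}

int main(void)
{
    /* Pre-presentation pass: libsmil builds its time graph here. */
    void *doc = smil_parse_document("drawing.svg", on_smil_action, NULL);
    if (doc == NULL)
        return 1;

    /* The client first renders the document with initial attribute
     * values, then lets the library schedule the timed changes.   */
    smil_run_presentation(doc);
    return 0;
}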

The main difference from the standalone case arises when the document being displayed is changed by some event outside the control of libsmil. In the standalone case, all document navigation is controlled by libsmil; in the extension case, the SMIL library has no knowledge of how navigation works in the external (hosting) language, so that is up to the client to manage. Therefore, the client may from time to time call smil_new_document() to load a completely different document or smil_jump_to_xpath() to move to a location within the current document.



For client applications that already have decent access to their documents' parsed object model, adding support for SMIL's temporal activity appears not to be too difficult.

6 Applications Of SMIL

In case the reader is still wondering about the practical benefits of SMIL, which have probably not been made clear in the previous sections, here are a small number of typical use cases.

Recorded presentations: It is possible to coordinate the automatic presentation of a conference speaker's slides with the audio recording of their talk. The slides will progress at the right moment. Extra navigation possibilities for both the audio and visual portions of the talk can be presented as well.

Digital Talking Books: SMIL is already part of the DAISY 2 [2] and ANSI/NISO Z39.86 [1] talking book standards—the latter also known as DAISY 3. DAISY 2 requires SMIL 1.0 support, while Z39.86 requires a minimal SMIL 2.0 implementation. These two standards provide visually impaired people and people with reading difficulties a means to access literature that would otherwise be closed to them.

Captioning for video formats: Many digital movie and video formats do not contain subtitles as sideband information. Sometimes subtitles are provided, but not for the required language. The ability to synchronize a video presentation with arbitrary textual captions will provide a benefit both to people with hearing difficulties and to those watching a presentation given in a foreign language.

Educational presentations: As authoring tools become available, pulling together disparate sources into a coherent presentation should become relatively straightforward. This will permit educators to begin to build up a library of coordinated presentations using information that currently might be scattered all over the Web. It was not mentioned earlier in this paper, but the media objects displayed by SMIL can be retrieved from remote URLs as well as local files. Also, SMIL provides a mechanism for pre-fetching any content that may take time to download, so as not to hold up later portions of the presentation.

Kiosk and conference display front-ends: SMIL provides a simple way to create a menu-based presentation. It can also revert to a standard looping presentation once the requested video or audio has completed. This makes it ideal for writing control documents for video kiosks or product displays at conventions.

7 Future Work

Development on libsmil is focused on creating a complete implementation of the specification. Simultaneously, some demonstration applications and a small presentation program are being written to show off the library's features.

Following on from that work, a number of obvious “next steps” present themselves. The following list is in no particular order, but all are achievable tasklets.

1. Implement the Timed Text specification that is currently being developed by the W3C [7]. This will allow for scrolling captions and easier synchronization of captions with audio and video.



2. Implement a digital talking book player. Currently, no Open Source implementation of the DTB standards is available. With proprietary software for presenting these books already available, it is important to have an implementation with available source code around to prevent inadvertent commercialization of the standard.

3. Write plugins for various browsers. Initially, plugins that act like an embedded PDF reader and display only SMIL documents would be the goal. Then integration with the main rendering engine for displaying XHTML+SMIL and SVG+SMIL documents (which is a much harder job).

4. Implement any missing pieces of the SMIL Animation Recommendation and the SMIL DOM interface. These two documents from the W3C provide extensions to the initial SMIL 2.0 specification. Extending libsmil to cover these features should not be too much of a stretch.

8 Playing With libsmil

The libsmil implementation discussed in this paper can be downloaded from the GNOME CVS repository (see [3] for instructions if you are unfamiliar with accessing that repository). The code is in the smil module, which contains the library as well as a few small applications and extensive documentation for library hackers and client developers alike.

Once the library has stabilized a little more, tarball releases will be made and the download site posted in a few popular locations.

References

[1] ANSI/NISO Z39.86-2002 Digital Talking Book standard, http://www.niso.org/standards/resources/Z39-86-2002.html

[2] DAISY 2 Digital Talking Book standard, http://www.daisy.org

[3] The GNOME CVS repository, http://developer.gnome.org/tools/cvs.html

[4] SMIL 1.0 specification, http://www.w3.org/TR/REC-smil/

[5] SMIL 2.0 specification, http://www.w3.org/TR/smil20

[6] The Synchronized Multimedia group at the W3C, http://www.w3.org/AudioVideo/

[7] The Timed Text group at the W3C, http://www.w3.org/AudioVideo/TT/

[8] XHTML+SMIL—a W3C Note, http://www.w3.org/AudioVideo/TT/


Benchmarks that Model Enterprise Workloads
Using macrobenchmarks to measure and improve Linux scalability for real-world applications

Vaijayanthimala Anand, Hubertus Franke, Hanna Linder, Shailabh Nagar,
Partha Narayanan, Rajan Ravindran, Theodore Ts'o

IBM Corp.
{manand,frankeh,hannal,nagar,partha,rajancr,tytso}@us.ibm.com

Abstract

In this paper we demonstrate the use of macrobenchmarks in Linux® kernel development. We describe two macrobenchmarks, the SPECjAppServer2002™ benchmark application and IBM®'s Trade, which are based on the Java™ platform and model enterprise applications typically found in large data centers. This paper shows how these macrobenchmarks were used to analyse potential improvements in the load balancing and yield behaviour of the 2.5 kernel's O(1) CPU scheduler. We also demonstrate how the macrobenchmarks helped debug the 2.5 kernels and compare their performance improvements over the 2.4 series.

1 Introduction

1.1 Microbenchmarks vs. Macrobenchmarks

Performance is a key driver for Linux kernel development. Several patches have been developed explicitly to improve Linux performance on various architectures. Most patches which seek to add a new kernel feature are expected to show that they minimize, if not eliminate, any negative performance impact on the system.

Over the years, various benchmarks have become popular in the kernel development community to assess the performance of patches. Most of these microbenchmarks measure specific aspects of system performance, such as tiobench for filesystem performance [Tiobench] and pipetest [Pipetest] for event delivery. Microbenchmarks have two advantages. First, they are typically both easy to set up and run and are free, making them accessible to all developers. This is particularly important for the widely dispersed open source kernel community. Second, microbenchmarks can be pivotal in determining the impact of a patch on a specific kernel subsystem.

The specificity of a microbenchmark limits its suitability for predicting overall system impact. Hence, developers often use microbenchmark suites such as lmbench [lmbench] and Contest [Contest]. By running a collection of microbenchmarks, each stressing a different aspect of the kernel, a more accurate picture of the overall system impact can be obtained.

Microbenchmarks (singly or in suites) suffer from two major disadvantages. First, they do not adequately capture the dynamic interactions between different kernel control paths which may be impacted by the same patch. Even if these control paths are tested individually, their interactions will not be apparent.


Even if the microbenchmarks could be made to run together, the interactions being tested would be ad-hoc. Second, microbenchmark suites are less representative of real world workloads. As such, while they can be used to gain a better understanding of the impact of a patch on a single subsystem, they are not ideal.

Macrobenchmarks such as Trade [Trade] and SPECjAppServer2002 [SJAS] help fill this void. They exercise different parts of the kernel during runtime in ways that are representative of real world workloads that run on Linux. Such benchmarks are designed to compare hardware and software differences based on performance and cost-performance criteria. However, they can also be used to guide software development because they permit an orderly isolation and elimination of system-wide performance bottlenecks. Macrobenchmarks also allow non-kernel bottlenecks to be identified, further encouraging an evolutionary approach to kernel development.

Macrobenchmarks have their disadvantages as well. They are often expensive to purchase and are not open source. They are not easy to set up and often require multiple machines with above average physical resource requirements, especially memory and disks. They may also need proprietary middleware, such as databases and Web Application Servers (hereafter referred to as AS), if freely available open-source alternatives are not performant enough or do not yet have the right feature set to allow the benchmark to be run.

One notable effort to provide free macrobenchmarks is being done by the Open Source Development Lab (OSDL). The OSDL's Database Test (DBT) benchmark suite [DBT] development effort is a welcome step in reducing the need to purchase specialized middleware in order to run macrobenchmarks. The Scalable Test Platform, also from OSDL, helps make enterprise class hardware available to all developers, further easing the hurdles in using macrobenchmarks.

1.2 J2EE-based Macrobenchmarks

The Java 2 Platform, Enterprise Edition (J2EE) [J2EE] framework is a mechanism for creating distributed, Java-based enterprise class applications for various business domains such as manufacturing, supply-chain management, and on-line financial applications. Compared to traditional transaction processing benchmarks such as TPC-H, TPC-C and TPC-W, the J2EE framework has not received much attention in Linux benchmarking efforts. For this paper, two J2EE based macrobenchmarks, Trade and SPECjAppServer2002, are used to investigate Linux kernel performance.

J2EE applications consist of multiple layers. Performance analysis of such applications is involved and demanding, as it depends on many factors. A typical J2EE stack is illustrated in Figure 1.

Figure 1: The J2EE stack: Business App, Application Server, Java Virtual Machine, Operating System, Hardware/Network.

The component which implements the actual application depends on the AS services. The AS in turn takes advantage of the underlying Java Virtual Machine (JVM) implementation. The Java applications call methods from the Java API libraries that provide access to the system resources through appropriate system calls.

The AS performance depends on many factors: caching support, transaction execution efficiency, JVM implementation, Enterprise Java Beans component pooling mechanisms, efficiency of persistent storage mechanisms, Java Database Connectivity, optimized driver support, etc. More information on J2EE best practices can be found in the literature, such as Oracle9i and Java Performance [Oracle9i, EnterpriseJava, JavaPerf]. These studies focus on J2EE and its associated components rather than the operating system. By contrast, the focus in this paper is on the operating system; specifically, the Linux kernel. Two complex enterprise workloads are used to identify kernel performance issues and suggest possible kernel improvements.



1.3 Description of Trade

Trade [Trade] is a freely available benchmark developed by IBM. It is designed to measure AS performance. Trade is an end-to-end benchmark that models an online financial application; specifically, an electronic stock brokerage providing web-based online securities trading.

Two versions of the Trade benchmark, Trade 2.7 and Trade 3.0, are used in this study. Trade 2.7 is a collection of Java classes, Java servlets, Java Server Pages (JSP), and Enterprise Java Beans integrated into a single application.

While Trade 2.7 is written based on J2EE 1.2, Trade 3.0 is the third generation of this benchmark, making use of many features of J2EE 1.3 [J2EE] including local interfaces, message-driven beans, Container-Managed Relationship (CMR), etc. It also incorporates Web Services as one of its major enhancements. Many Application Servers in the industry implement these features.

This benchmark is used for performance research on a wide range of software components including the Linux operating system, AS, Java and more.

The Trade benchmark can be run either in a two tier or in a three tier configuration. In the two tier model, the client driver (which simulates clients of the online brokerage application) runs on one system, while the AS and the backend database run on another. The AS executes J2EE applications which consist of two parts: the server side presentation logic and the server side business logic. In a three tier model, the AS and the backend database run on separate systems, interconnected by a high speed network. We were more interested in the performance and scaling of the AS, so we chose to do our testing using a three tier configuration. Figure 2 shows such a configuration.

Figure 2: Three tier configuration for Trade 3.0 and SPECjAppServer2002: the client driver (Host 1) sends HTTP requests to the application server (Host 2), which runs the Java workload (Trade, SPECjAppServer) and acts as a database client, talking to the database server (Host 3) over CLI/SQLnet.

The client driver simulates requests of an online stock brokerage application, issuing a predefined mix of login, register, buy, sell, and quote requests for online securities. These requests come in as HTTP requests to the AS. Trade 3.0 has been configured to use the Enterprise Java Beans (EJB) mode, meaning that all accesses to the back-end pass through the EJB container of the AS, as opposed to the use of direct Java database connectivity (JDBC). All the orders are executed in a synchronous mode by the session and entity beans rather than being queued for asynchronous execution. The communication between the servlets and EJBs is done using the Remote Method Invocation (RMI) protocol.


The backend database stores 5000 users and 1000 securities. Database records are inserted, then modified as the benchmark progresses. To maintain reproducibility of the benchmark results, the database is initialized once and backed up. The database is restored before each test run.

The Trade application generates a large number of threads, of the order of 160, during its operation. The metric of interest in this benchmark is the number of web pages that are served by the AS.

1.4 Description of the SPECjAppServer2002 Benchmark

SPECjAppServer2002 [SJAS] (hereafter referred to as SJAS) is a benchmark for measuring the performance and scalability of J2EE (Java 2 Enterprise Edition) application servers and containers, by emulating the heavyweight manufacturing, supply chain management, and order/inventory system representative of a Fortune 500 company. It is a derivative of the ECperf 1.1 benchmark [ECperf]. SJAS supports multiple configurations such as single, dual, multiple, and distributed nodes. We chose dual mode (a 3-tier configuration) for our setup: (i) a client driver emulator, (ii) an AS tier and (iii) a database backend tier. This paper always refers to the 2002 version of SPECjAppServer.

SJAS models four logical business entities (domains): customer, manufacturing, supplier and service provider, and corporate. In the customer domain, large and small orders are distinguished in that they trigger different transactions (e.g., credit checks, order changes). The manufacturing domain processes the different orders and schedules parts with suppliers. The supplier domain decides which supplier to use and handles the transaction (e.g., order size, due date) with the supplier. The corporate domain handles the list of all customers, parts, suppliers, and credit information. SJAS can be implemented either in a centralized or a distributed mode. In this paper we chose the centralized mode, which allows us to put all four business entities on a single AS.

The throughput of SJAS is driven by the number of order entries in the customer domain and the manufacturing domain and is measured in TOPS, the average number of successful total operations per second completed during the measurement interval. TOPS is linearly related to the injection rate (IR). The IR refers to the rate at which business transaction requests from the order entry application in the customer domain are injected into the AS. The goal is to drive the injection rate as high as possible. An injection rate is sustainable if at least 90% of each type of business transaction completes within a required response time.

Though a full SJAS benchmark run requires more with respect to reporting [SJAS], we are using the sustainable injection rate as a means to evaluate scalability and changes to the kernel.

Note: SPECjAppServer2002 is a trademark of the Standard Performance Evaluation Corp. (SPEC). The SPECjAppServer2002 results or findings in this publication have not been reviewed or approved by SPEC, therefore no comparison nor performance inference can be made against any published SPEC result. The official web site for SPECjAppServer2002 is located at http://www.spec.org/osg/jAppServer2002.

1.5 Hardware Configuration

The Trade and SJAS macrobenchmarks are complex and require a fair amount of tuning to get useful results. Combined with the multiplicity of issues being investigated, it was difficult to ensure that all results presented in this paper came from the same hardware setup.


Four different environments were used to collect the experimental data shown in later sections. These environments will be denoted hereafter as Configurations A, B, C, and D.

Configuration A consists of a 4-way 700 MHz Pentium(tm) III, 1MB L2 cache, and 4GB RAM for the AS and a 4-way 700 MHz Pentium III, 1MB L2 cache and 4GB RAM for the backend. A 2-way Pentium III 1GHz system was used to drive the workload.

Configuration B consists of a 4-way Pentium III 900 MHz, 2MB L2 cache, 2.5GB RAM for the AS and a 4-way Pentium III 500 MHz, 512KB L2 cache, 3.2GB RAM for the backend. The client was a 2-way Pentium III 850 MHz, 256KB L2 cache, 2GB RAM system.

Configuration C differed from Configuration B only in doubling the number of processors in the AS tier. Thus, it has an 8-way Pentium III 900 MHz, 2MB L2 cache, 2.5GB RAM for the AS, and a 4-way Pentium III 500 MHz, 512KB L2 cache, 3.2GB RAM for the backend. The client remained a 2-way Pentium III 850 MHz, 256KB L2 cache, 2GB RAM system.

Configuration D includes an 8-way Pentium III 900 MHz, 2MB L2 cache, 24GB RAM for the AS, and an 8-way Pentium III 700 MHz, 1MB L2 cache, 8GB RAM for the backend.

2 Kernel Bug Detection Using Macrobenchmarks

One benefit of complex macrobenchmarks is their ability to find bugs in the kernel that otherwise might not be found until the kernel is run on a large real-world system. During initial experiments with the SJAS benchmark, one such bug was found, fixed, and included in the 2.5.63 kernel.

The sole symptom was a complete system hang of the middle tier, with no oops or diagnostic of any kind produced. The hang could be reproduced by stopping and restarting the application server between five and ten times. The problem was traced using the NMI (non-maskable interrupt) watchdog timer and taking stack traces of all CPUs in the system.

The problem turned out to be threads deadlocking in the kernel. On any multiprocessor system, one task (say A) acquired a spinlock with interrupts disabled. Thereafter A performed an operation requiring all other processors to flush their Translation Lookaside Buffers (TLBs). To flush remote TLBs, task A would send an inter-processor interrupt (IPI) and go into a busy wait for an acknowledgement. However, if the tasks on the other CPUs were busy waiting on the same spinlock and also had their interrupts disabled, they would never receive the IPI, thus leading to a deadlock.
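Schematically, the interaction looked like the pseudocode timeline below. The lock and helper names are invented for illustration; this is not the actual 2.5 code path.

/*
 * Pseudocode illustration of the deadlock (names are invented).
 *
 * Task A on CPU 0:                      Tasks on CPUs 1..n:
 *
 *   spin_lock_irqsave(&lock, flags);      spin_lock_irqsave(&lock, flags);
 *   // needs remote TLBs flushed          //  -> spin with interrupts
 *   send_tlb_flush_ipi();                 //     disabled, waiting for A
 *   wait_for_ipi_acks();   // spins       //     to release the lock
 *   ...                                   //  -> never service A's IPI
 *   spin_unlock_irqrestore(&lock, flags);
 *
 * Task A waits forever for acknowledgements that can never arrive.
 */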

This issue was resolved by modifying the code to ensure that the spinlock was not held with interrupts disabled. The fix was included in the 2.5.63 kernel. Although the problem was easy to fix once the cause was determined, it took the right set of dynamic interactions, provided in this case by SJAS, to trigger the bug.

3 Comparing 2.4 and 2.5 Kernels

The project was initiated by using Trade 2.7 to test 2.4-based distribution kernels as well as then-current stock 2.5 kernels. Presented in Table 1 are the results of running Trade 2.7 in a three-tiered mode using Configuration A.

The results obtained were unexpected. It was found that the 2.4-based distribution kernel (2.4-distro) performed better than the 2.5.59 stock kernel. To recheck the results, we ran the SJAS benchmark on a stock 2.4.20 kernel (2.4.20-stock) and compared results with the 2.5.59 kernel using Configurations C and D.


Kernel Version   #CPUs   TOPS           CPU Utilization (%)
                                        user   system   idle
2.4.20-stock     4       Base1          68     14       17
2.5.59           4       Base1-3.8%     60     10       28
2.5.66           4       Base1+11.6%    70     12       16
2.4.20-stock     8       Base2          47     10       41
2.5.59           8       Base2+0%       36     6        56
2.5.66           8       Base2+24.0%    49     9        41

Table 2: Performance and middle tier CPU utilization of SJAS on 2.5.59 and 2.5.66 kernels, using 2.4.20 as a baseline for 4-way (Base1) and 8-way (Base2) servers

Kernel        TOPS           %CPU Usage on AS
2.4-distro    Base1          87
2.5.59        Base1-14.4%    66
2.5.59+D7     Base1+2%       86

Table 1: Trade 2.7 results

The results, shown in Table 2, confirm that the 2.4.20-stock kernel exhibits better performance than 2.5.59, with the latter's TOPS decreasing by 3.8% on Configuration C and remaining unchanged on Configuration D.

At a later date, we also compared the performance of a 2.5.66 kernel and found that it performed significantly better than 2.4.20-stock, with an increase in TOPS of 11.6% and 24.0% on Configurations C and D respectively. Table 2 shows that system time remained approximately the same for these two kernels, though overall utilization was higher for 2.5.66. Isolating the performance changes between 2.5.59 and 2.5.66 is part of our future work. We felt our first task was to determine why the 2.5.59 kernel performed worse than the 2.4.20 and 2.4.20-distro kernels, despite several scalability enhancements in 2.5.59.

Since distribution kernels have patches added on top of a 2.4 stock kernel, the profile data was analyzed in order to understand the observed differences.

runnable    context     CPU Utilization (%)
processes   switches    user   sys   idle
8           15689       74     18    8
12          18844       76     18    6
10          15778       71     21    8
11          16114       74     20    6
11          17629       74     17    8

Table 3: Output from vmstat for the AS on a 2.4-x distro kernel using a 4-way server and Trade 2.7. The number of runnable processes is 2-3 times the number of processors.

Comparing the vmstat outputs for a 2.4-x distro kernel (Table 3) to a 2.5.59 kernel (Table 4), we clearly see that the latter has fewer runnable processes in general and often has fewer runnable tasks than processors. Consequently, 2.5.59 shows higher idle times. The data initiated further investigation of the CPU scheduler behaviour.

In the next step, readprofile data was collected at a 60 second granularity during the steady run of Trade 2.7 on the same configuration as above. Comparing the data for the 2.4-distro kernel (Table 5) and 2.5.59 (Table 6), we see that schedule() is the costliest kernel function in both kernels.

The calls to schedule() drew our attention because they were still high on both lists even though 2.4-x uses the old scheduler and 2.5.59 uses the O(1) scheduler.


runnable   context     CPU Utilization (%)
tasks      switches    user   sys   idle
3          12195       41     10    49
5          12079       41     9     50
7          17508       49     10    42
2          12087       41     9     50
3          11898       44     9     47

Table 4: Output from vmstat for the AS on a 2.5.59 kernel running Trade 2.7 on a 4-way server. The number of runnable processes often dips below the number of processors and is low compared to the 2.4-x data shown earlier.

Ticks    Kernel function          Normalized Ticks
23969    Total                    0.02
7071     default_idle             110.48
2388     schedule                 1.52
822      csum_partial_copy...     3.31
799      send_sig_info            4.54
744      save_i387                1.29
511      tcp_v4_rcv               0.31

Table 5: Readprofile data for the AS on a 2.4-distro kernel running Trade 2.7 on a 4-way server. Normalized ticks gives the number of ticks divided by the size of the function. schedule() figures are high, though idle times are low compared to 2.5.59.

Ticks    Kernel function          Normalized Ticks
60332    Total                    0.05
54048    default_idle             844.50
397      schedule                 0.41
365      csum_partial_copy...     1.47
191      tcp_sendmsg              0.04
181      __kfree_skb              0.60
177      load_balance             0.19

Table 6: Readprofile data for the AS on 2.5.59 running Trade 2.7 on a 4-way server. Normalized ticks gives the number of ticks divided by the size of the function. schedule() is costly despite the usage of the O(1) scheduler; also, idle time is higher than in the 2.4-distro kernel.

To examine our hypothesis that the O(1) scheduler was causing the high idle times, we tested a 2.4.20 kernel with and without the O(1) scheduler using the same configuration as above. The results, not shown in this paper, were similar to the data shown earlier and confirmed the hypothesis. The 2.4.20 stock kernel produced 20% better throughput than the 2.4.20+O(1) scheduler. Further, 2.4.20+O(1) had fewer tasks in the run queue than the number of CPUs in the system and 40% idle time, similar to the results found in the 2.5.59 kernel.

Using snapshots of runqueue lengths on all CPUs at each timer tick, it was found that CPUs were going idle while there were runnable tasks on other runqueues. The imbalance in runqueue lengths across various CPUs while using O(1) led us to a careful examination of the load balancing logic of the O(1) scheduler. The analysis is discussed in the next section.


4 Load Balancing

Before discussing the results of various experiments, we revisit the load balancing differences between the old 2.4 scheduler and O(1). The 2.4 scheduler uses a single runqueue for all CPUs, which leads to high lock contention and long lock hold times as the number of tasks and CPUs increases. O(1) replaces the single runqueue with per-CPU runqueues. While choosing the next task to run on a CPU, it does not look at remote runqueues, maintaining the O(1) behaviour and preserving cache affinity. Consequently, it needs to explicitly balance the load on each runqueue by calling a load_balance() function. Workloads which are sensitive to load imbalance, such as Trade and SJAS, are affected by the effectiveness of load_balance(). In 2.5.59, load_balance() is called periodically on each CPU, with the frequency of invocation determined by the idleness of the CPU.
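The idle-dependent invocation frequency can be sketched roughly as follows. The constants, the struct layout, and the stub helper are illustrative only; the real 2.5.59 scheduler differs in detail.

/*
 * Rough sketch of the periodic load balancing in the 2.5 O(1)
 * scheduler.  Constants and helpers are illustrative, not the
 * actual 2.5.59 code.
 */
#include <stdbool.h>

struct runqueue {
    unsigned long last_balance;   /* tick of the last balance attempt */
    /* ... per-CPU priority arrays omitted ...                        */
};

static void load_balance(struct runqueue *this_rq, bool idle)
{
    /* Stub: the real function pulls tasks from the busiest remote
     * runqueue; see the find_busiest_queue() discussion below.     */
    (void)this_rq;
    (void)idle;
}

#define IDLE_REBALANCE_TICK   1     /* ticks between balances when idle */
#define BUSY_REBALANCE_TICK   200   /* ticks between balances when busy */

/* Called from the timer interrupt on every CPU, once per tick. */
static void rebalance_tick(struct runqueue *this_rq, bool idle,
                           unsigned long now_ticks)
{
    unsigned long interval = idle ? IDLE_REBALANCE_TICK : BUSY_REBALANCE_TICK;

    if (now_ticks - this_rq->last_balance >= interval) {
        this_rq->last_balance = now_ticks;
        load_balance(this_rq, idle);
    }
}

The practical consequence is that an idle CPU looks for work frequently, while a busy CPU only occasionally checks whether it should pull tasks from elsewhere.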

To improve the load balancing behaviour of the O(1) scheduler, we tried a series of patches, from Ingo Molnar's D7 patch [D7-PATCH] to Andrea Arcangeli's try_to_wake_up patches (included within his O(1) patch [AA-O1-PATCH]) to a find_busiest_queue patch created in-house [FBQ].

4.1 D7 patch

Ingo Molnar's D7 patch unconditionally migrates a task from a remote runqueue to the current runqueue if the current CPU is about to go idle. Table 1 shows that this patch helps 2.5.59 perform 2% better than 2.4-distro for Trade 2.7, more than making up for the 14.4% performance loss seen by stock 2.5.59. For an SJAS workload, the same patch helps 2.5.59 draw on a par with the 2.4.20 stock kernel, overcoming the 3.8% degradation seen by 2.5.59 versus 2.4.20 (Table 7). The 10% degradation of 2.4.20+O(1) compared to 2.4.20 in the same table reconfirms the earlier hypothesis that the O(1) scheduler is at least partially responsible for the performance loss of 2.5.59 compared to 2.4.20.

Kernel level     CPU Usage    % TOPS improved
2.4.20-stock     82%          baseline
2.4.20+O(1)      66%          -10.6%
2.5.59-stock     70%          -3.8%
2.5.59+D7        64%          no change

Table 7: SPECjAppServer2002 - v1.14, 4-way results on 2.5.59

Kernel level             CPU Usage    % TOPS improved
2.5.66-stock             82%          baseline
2.5.66+trytowakeup1      83%          +4.3%
2.5.66+trytowakeup2      89%          +0%
2.5.66+busiestqueue      82%          -4.3%

Table 8: SPECjAppServer2002 - v1.14, 4-way results on 2.5.66


4.2 Load Balancing on Task Wakeup

The O(1) scheduler used in Andrea Arcangeli's 2.4 kernel tree contains two changes to do load balancing on task wakeup events, in addition to the periodic invocations of load_balance() in the stock kernel's O(1). We implemented these changes as two separate patches for the 2.5.66 kernel.

Kernel level             CPU Usage    % TOPS improved
2.5.66-stock             56%          baseline
2.5.66+trytowakeup1      60%          +4.4%
2.5.66+trytowakeup2      72%          +3.0%
2.5.66+busiestqueue      65%          +5.2%

Table 9: SPECjAppServer2002 - v1.14, 8-way results on 2.5.66

Page 442: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 442

The first patch, henceforth called trytowakeup1, modifies the try_to_wake_up() function to invoke load balancing whenever a task is being woken up. Using the task wakeup event as a load balancing trigger was also motivated by the high count of calls to tcp_data_wait(), which causes task wakeups, in profiling data similar to that shown in Table 6 for 2.5.59. The trytowakeup1 patch improved SJAS throughput by 4.7% on Configuration C compared to the 2.5.66 stock kernel, as shown by Table 8. Configuration D showed a similar 4.3% improvement, as seen in Table 9.

The second patch, henceforth called trytowakeup2, tries to explicitly address the problem of CPUs going idle by trying to place the task being woken up onto an idle CPU if possible. This is in contrast to trytowakeup1, which is only concerned with pulling tasks to the runqueue of the waker. While trytowakeup2 increases SJAS performance by 4.5% in Configuration D (Table 9), it has no effect in Configuration C (Table 8). The behaviour can be explained by the relative lack of idle CPUs in Configuration C (4-way AS) compared to Configuration D.

The final patch, called busiestqueue [FBQ], aimed at improving the aggressiveness of the existing load balance function itself rather than changing the frequency or location of its invocation. In the normal O(1) scheduler, the find_busiest_queue() function is used by load_balance() to find the remote runqueue with the maximum number of runnable tasks from which tasks can be pulled to the current runqueue. The load_balance() code checks whether tasks on the remote runqueue are suitable for migration but, if none are found suitable, it does not look for another runqueue and try again. The busiestqueue patch remedies this behaviour by modifying find_busiest_queue() and its invocation by load_balance() to ensure that all remote runqueues are examined for tasks to migrate. The results from using the patch are mixed. Configuration C shows a performance degradation of 5.1% (Table 8) whereas Configuration D shows a 2.4% improvement (Table 9). Evidently, the patch is too aggressive and the extra cycles spent trying to find another remote runqueue prove too costly. We are in the process of fine-tuning the patch by limiting the number of iterations in the search for a busy queue.
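The idea behind busiestqueue can be sketched as a retry loop around the queue selection. The sketch below is a schematic rendering of the behaviour just described, not the actual patch; the types and helpers are simplified stand-ins.

/*
 * Schematic sketch of the busiestqueue idea: if the busiest remote
 * runqueue has no task suitable for migration, go on to the next
 * busiest instead of giving up.  Not the actual patch.
 */
#include <stdbool.h>

#define MAX_CPUS 8

struct runqueue {
    int nr_running;                  /* tasks currently on this queue */
};

static struct runqueue runqueues[MAX_CPUS];

/* Stub: the real check also looks at cache hotness and CPU affinity. */
static bool pull_suitable_task(struct runqueue *to, struct runqueue *from)
{
    (void)to;
    return from->nr_running > 1;
}

static void load_balance_sketch(struct runqueue *this_rq)
{
    bool tried[MAX_CPUS] = { false };

    for (;;) {
        int busiest = -1;

        /* Find the busiest remote runqueue not examined yet. */
        for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
            if (&runqueues[cpu] == this_rq || tried[cpu])
                continue;
            if (busiest < 0 ||
                runqueues[cpu].nr_running > runqueues[busiest].nr_running)
                busiest = cpu;
        }
        if (busiest < 0)
            return;                   /* every remote queue was examined */

        tried[busiest] = true;
        if (pull_suitable_task(this_rq, &runqueues[busiest]))
            return;                   /* migrated something; we are done */
        /* Otherwise loop and look at the next busiest queue. */
    }
}

As noted above, the cost of this extra searching is why the in-house version is being tuned to bound the number of retries rather than always scanning every remote queue.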

The trytowakeup1 and busiestqueue patches increased performance by around 5% on the 8-way Configuration D when applied individually and in combination (data not shown). This suggests that one or the other approach is sufficient to achieve better load balancing, and leads to the question of which one should be used. The answer will lie in the effect of the patches on other workloads and is part of our future work.

5 Yielding to Other Tasks

The system call sys_sched_yield() causes the calling task to yield execution to another task that is ready to run on the current processor. Multi-threaded applications often use sys_sched_yield() to improve interactive response or to improve the performance of the system by letting the scheduler use the processor resources more effectively. This is particularly true if the applications use traditional userspace locks (not based on futexes).
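A typical pattern is a traditional user-space lock that spins briefly and yields when contended, along the lines of the generic example below. This is an illustration of the pattern, not code taken from Trade or the application server, and it uses GCC's atomic builtins and the POSIX sched_yield() wrapper for convenience.

/*
 * Example of the kind of user-space lock that makes an application
 * sensitive to the kernel's sched_yield() policy.  Generic
 * illustration only.
 */
#include <sched.h>

typedef struct {
    volatile int locked;
} yield_lock_t;

static void yield_lock(yield_lock_t *l)
{
    /* __sync_lock_test_and_set is the GCC atomic test-and-set builtin. */
    while (__sync_lock_test_and_set(&l->locked, 1)) {
        /* Lock is busy: give the holder a chance to run and release it.
         * How soon this task runs again depends entirely on the
         * kernel's yield policy (PA, PB, or PC below).                  */
        sched_yield();
    }
}

static void yield_unlock(yield_lock_t *l)
{
    __sync_lock_release(&l->locked);
}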

However, the benefits realized from the use of sys_sched_yield() are heavily dependent on the implementation of the system call. The CPU scheduler selects the next task to run and determines how long the yielding task will remain on the runqueue before getting a chance to run again. The following implementations of sys_sched_yield() are feasible:



PA: The yielding process is queued right after the next task on the same priority queue. Effectively, it yields only to the next task at the same priority level.

PB: The yielding process is queued at the tail of its priority queue, making it yield to all runnable tasks at the same priority level.

PC: The yielding process is moved to the priority queue on the expired list, effectively making it yield to all runnable tasks in the system (as the expired list becomes the active list only after all runnable tasks have exhausted their timeslices).

The yielding task rarely knows how long it needs to yield before it can attempt to acquire a shared resource again, as the availability of the shared resource depends on external events and progress made by competing tasks. For instance, an interactive application might see reduced response time if policy PA were used. But a task polling for a shared resource such as a userspace spinlock might benefit from PB or PC, which allow the task holding the resource to run and potentially release it for use by the yielding task. As the CPU scheduler is unaware of the task's rationale for using sys_sched_yield(), it cannot decide the best yielding interval either. Hence different Linux distributions have tried all three policies.
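In scheduler terms, the three policies differ only in where the yielding task is requeued. The sketch below is schematic; the task and queue types and the list helpers are simplified stand-ins rather than the 2.5 runqueue code, and only the policy names come from the text.

/*
 * Schematic sketch of the three yield policies.  The helpers below
 * stand in for operations on the O(1) scheduler's active and expired
 * priority arrays; they are not kernel code.
 */
struct task  { int prio; };
struct queue { int unused; };

static struct queue *active_queue(int prio)  { static struct queue q; (void)prio; return &q; }
static struct queue *expired_queue(int prio) { static struct queue q; (void)prio; return &q; }
static void insert_after_head(struct queue *q, struct task *t) { (void)q; (void)t; }
static void insert_at_tail(struct queue *q, struct task *t)    { (void)q; (void)t; }

enum yield_policy { POLICY_PA, POLICY_PB, POLICY_PC };

static void sched_yield_sketch(struct task *t, enum yield_policy policy)
{
    switch (policy) {
    case POLICY_PA:
        /* Requeue just behind the next task at the same priority:
         * the caller yields to exactly one other task.              */
        insert_after_head(active_queue(t->prio), t);
        break;
    case POLICY_PB:
        /* Requeue at the tail of its priority queue: the caller
         * yields to every runnable task at the same priority.       */
        insert_at_tail(active_queue(t->prio), t);
        break;
    case POLICY_PC:
        /* Move to the expired array: the caller yields to every
         * runnable task in the system, since the expired array only
         * becomes active once all timeslices are used up.           */
        insert_at_tail(expired_queue(t->prio), t);
        break;
    }
}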

To understand the impact of these policies on macrobenchmarks such as Trade and SPECjAppServer2002, we collected profile data to see the number of sys_sched_yield() calls issued. Table 10 shows that sys_sched_yield() accounts for almost one third of all calls to schedule() when Trade 2.7 is run on Configuration A.

Table 11 shows the data collected by instrumenting sys_sched_yield() for a 1 minute run of Trade 2.7 on 2.4.20 using Configuration A.

Ticks     Kernel Function
6826403   Total
2523245   sys_sched_yield+11d
2236660   cpu_idle+3e
1312369   schedule_timeout+9d
327747    schedule_timeout+184

Table 10: Functions calling schedule() for a 2 minute run of Trade 2.7 on Configuration A

In the table, higher, lower and same refer to the number of sys_sched_yield() calls for which there was at least one task at a higher, lower or the same priority level as the yielding task. The row labelled only counts the number of sys_sched_yield() calls in which the yielding task was the only one on its runqueue. We see that most tasks in the system are on the same priority queue as the yielding task. Hence, the policy adopted by sys_sched_yield() is likely to have a significant impact on performance.

The 2.5.65 stock kernel uses the PC policy. We implemented the other two policies, PA and PB, and compared their performance with PC using Trade 3 in Configuration D. PB and PC turned out to have the same results for the benchmark, which follows from Table 11: as there are very few tasks on lower priority levels when sys_sched_yield() is called, PB and PC are effectively the same policy. Hence only PA and PC results are shown in Table 14. We see that the pages per second (pg/s) drops by 32.6% if PA is used instead of the default PC policy. CPU usage (usg) and efficiency (effncy) also see a corresponding drop. Similar results are seen for SJAS (not shown). Using PA instead of PC decreases TOPS by 10% on a 4-way.

The reasons become clear from the vmstat outputs for PA and PC shown in Tables 12 and 13 respectively. The number of context switches increases almost fourfold (from approximately 6700 to 27000) when PA is used.


Relative Priority   CPU 0    CPU 1    CPU 2    CPU 3    CPU 4    CPU 5    CPU 6    CPU 7
Same                145103   157055   163064   156379   162783   161733   167366   177876
Only                117653   112196   112387   105653   101420   96053    108830   92293
Higher              26       34       28       29       31       25       33       36
Lower               929      937      1000     1073     1036     1016     1156     1132
Total               263711   270222   276479   263134   265270   258827   277385   271337

Table 11: sys_sched_yield call count and the distribution of tasks on priority queues relative to the yielding task, using Trade 2.7 on 2.4.20 in Configuration A. The data indicates that most tasks in the system were on the same priority queue as the yielding task.

As PA causes the yielding process to get scheduled much sooner, the shared resource it waits on is generally not available, thus leading to frequent context switches as it yields again and again. For such an application, the default policy of letting all other runnable tasks run once is a good choice.

The kernel development community has been discussing alternative policies for sys_sched_yield() in order to improve response time for interactive applications. The results shown here indicate that such changes might adversely affect macrobenchmarks like Trade. However, this is only true until application servers start using the new fast user-level mutex (futex) feature provided by the 2.5 kernels.

6 Conclusions and Future Work

In this paper, we have examined two macrobenchmarks, Trade and SPECjAppServer2002. Both benchmarks model complex workloads utilizing the J2EE framework, which are popular in many enterprise data centers. We have shown a case study of a kernel bug that was triggered by these benchmarks and which would have been hard to find otherwise.

procs   system          cpu
r       in      cs      us   sy   id
14      6067    27204   63   10   27
9       5868    29230   60   9    31
12      5337    24765   61   8    30
10      6021    27753   61   9    30
5       5947    27496   64   10   25

Table 12: vmstat output collected while using policy PA, showing high context switches and high idle times. r, in, and cs refer to the number of runnable tasks, interrupts, and context switches respectively, while us, sy, and id refer to the percentage of time spent by CPUs in user mode, system mode, and idling respectively.

procs   system          cpu
r       in      cs      us   sy   id
18      7788    6903    85   14   0
26      7168    6686    84   11   6
24      8010    6798    87   12   1
23      8083    6727    87   13   0
22      7934    6212    87   13   1

Table 13: vmstat output collected for Trade 3 running on Configuration D, while policy PC is used to implement sys_sched_yield(). The other labels are explained in the caption for Table 12. Context switches and idle time are significantly lower for PC compared to PA.


Kernel   Policy   Pg/s        Usg    Effncy
2.5.65   PC       Baseline    95%    100%
2.5.65   PA       -32.6%      75%    85%

Table 14: Comparison of sys_sched_yield() implementations using Trade 3 on Configuration D.

The macrobenchmarks were also used to reveal deficiencies in the load balancing logic of the 2.5 kernel's O(1) CPU scheduler. Various patches were used to increase the aggressiveness of load balancing and reduce the probability of CPUs going idle despite the presence of runnable tasks in the system. Based on our observations, we suggest the following four load balancing policies might be of help for workloads sensitive to load imbalance, such as Trade and SJAS:

• Load balance during initial placement of tasks by choosing the idle processor

• Load balance during wakeup by choosing the idle processor

• Load balance the queues aggressively (similar to the patches described above) when a processor goes idle

• Consider providing aggressive load balancing through a configuration option

More patches will be produced to implement the above categories of improvement, and the investigation will continue to find a fair load balancer to improve these workloads for SMP (Symmetric Multi-Processor) and NUMA (Non-Uniform Memory Access) systems. Any load balancing patches proposed will need to be tested using different workloads to make sure that they do not degrade performance by unnecessary balancing.

The final part of this paper examined different implementations of the sys_sched_yield() call and concluded that the existing 2.5.65 implementation performed well for macrobenchmarks such as Trade and SJAS.

There is still much work to be done in exploring how the kernel can more efficiently support J2EE-based workloads. As we have seen, these workloads tend to be very sensitive to scheduler issues, and changes which benefit one workload may actually cause harm to other workloads.

Further tuning of the application and improvements in the Linux kernel have improved the CPU utilization of these benchmarks. Hence, while initial attempts to use spinlock metering to find lock contention were not fruitful, we anticipate that future work in improving the benchmark scores of these workloads will include finding and fixing lock contention problems.

We have used, and are continuing to use, macrobenchmarks as a method for finding potential areas for improvement in the Linux 2.5 kernel, especially as it relates to the Linux scheduler. We hope we have demonstrated to the reader that more complex benchmarks are a useful tool for the kernel developer interested in improving the performance and scalability of the Linux kernel.

7 Acknowledgments

The authors would like to thank the following people for their help: Jianwen Alex Feng, Bill Hartner, Wilfred Jamison, Sandra Johnson, and Rick Lindsley.

8 Legal Statement

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM is a registered trademark of International Business Machines Corp. in the United States and/or other countries.



Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Linux is a registered trademark of Linus Torvalds.

Pentium is a trademark of Intel Corporation in the United States, other countries, or both.

SPECjAppServer2002 is a trademark of the Standard Performance Evaluation Corp. (SPEC). The SPECjAppServer2002 results or findings in this publication have not been reviewed or approved by SPEC, therefore no comparison nor performance inference can be made against any published SPEC result. The official web site for SPECjAppServer2002 is located at http://www.spec.org/osg/jAppServer2002.

Other company, product, and service names may be trademarks or service marks of others.

References

[AA-O1-PATCH] Andrea Arcangeli, http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20rcaa1

[FBQ] Rick Lindsley, http://www.ibm.com/linux/ltc/patches/?patch_id=865

[Contest] Con Kolivas, http://members.optusnet.com.au/ckolivas/contest/

[D7-PATCH] Ingo Molnar, http://people.redhat.com/mingo/O(1)-scheduler/sched-2.5.59-D7

[DBT] Open Source Development Lab, http://sourceforge.net/projects/osdldbt

[J2EE] Java 2 Platform, Enterprise Edition, http://java.sun.com/j2ee/overview.html

[ECperf] ECperf, http://java.sun.com/j2ee/ecperf/

[EnterpriseJava] Kingsum Chow, Ricardo Morin, and Kumar Shiv, Enterprise Java Performance: Best Practices, Intel Technology Journal, Vol. 07 (Feb 2003), p. 32–48

[JavaPerf] Java Performance Tuning, http://www.javaperformancetuning.com

[lmbench] Carl Staelin and Larry McVoy, http://sourceforge.net/projects/lmbench

[Oracle9i] Oracle9i Application Server, Architecting for J2EE Performance, http://otn.oracle.com/products/ias/pdf/ArchitectingForJ2EEPerformance.pdf

[Pipetest] Ben LaHaise, http://www.kvack.org/~blah/ols2002.tar.gz

[PMQS] Hubertus Franke, Shailabh Nagar, Mike Kravetz, Rajan Ravindran, PMQS: Scalable Linux Scheduling for High End Servers, ALS'01, Annual Linux Symposium, Oakland, 11/2001

[SJAS] Standard Performance Evaluation Corp., http://www.specbench.org/jAppServer2002/

[Tiobench] Mika Kuoppala, http://sourceforge.net/projects/tiobench

[Trade] Application Server Benchmark, http://www-3.ibm.com/software/webservers/appserv/benchmark3.html


Large Free Software Projects and Bugzilla
Lessons from GNOME Project QA

Luis Villa
Ximian, Inc.

[email protected], http://tieguy.org/

Abstract

The GNOME project transitioned, during the GNOME 2.0 development and release cycle, from a fairly typical, mostly anarchic free software development model to a more disciplined, release driven development model. The educational portion of this paper will focus on two key components that made the GNOME QA model successful: developer/QA interaction and QA process. Falling into the first category, it will discuss why GNOME developers bought in to the process, how Bugzilla was made easier for them and for GNOME as a whole, and why they still believe in the process despite having been under Bugzilla's lash for more than a year. Falling into the second, some nuts and bolts: how the bugmasters and bug volunteers fit into the development process, how we coordinate, and how we triage and organize. Finally, the paper will discuss how these lessons might apply to other large projects such as the kernel and XFree86.

1 Introduction

During the GNOME 2.0 development and release cycle in 2002, the GNOME project grew from a fairly typical, fairly anarchic free software development model to a more disciplined, release driven development model. A key component of this transition was the move towards organized use of Bugzilla as the central repository for quality assurance and patch tracking. This was not a process without problems: hackers resisted, QA volunteers did too little or too much, we learned things we needed to know too late, or over-engineered the process too early. Despite the problems, though, GNOME learned a great deal, and as a result GNOME's latest releases have been more stable and reliable than ever before (even if far from perfect yet :)

The purpose of this paper isn't to teach someone to do QA, or to impress upon the reader the need for good QA—other tomes have been written on each of those subjects. Instead, it will focus on QA in a Free Software context—how it works in GNOME, what needed to be done to make it work both for hackers and for QA volunteers, and what lessons can be learned from that experience. To explain these things, I'll focus on three main sections. The first will be a very brief history of GNOME's transition to a more Bugzilla-centric development style, in order to provide some background for the rest of the paper. The second part will focus on the lessons learned from this transition. If a reader needs to learn how to manage QA and Bugzilla for a large Free Software project (either as bugmaster or as a developer), this section should serve as a fairly concise guide to GNOME's free software QA best practices—explaining how developers and QA should work together, what processes make for good free software QA, and how a free software project can build a QA community that works.


Finally, in the third section, the paper will attempt to discuss how GNOME's lessons might be applied to the usage of Bugzilla in other projects.

2 Background

Before going further, I'll offer a brief bit of background on the GNOME project and how it came to be a project where QA was an integrated part of the Free Software development process.

2.1 The GNOME Code Base

GNOME is not the kernel or X, but it is on roughly the same order of magnitude in terms of code and complexity. And it is continuing to grow as we integrate the functionality expected by modern desktop users. Desktop is, of course, a very vague term, but at the current time GNOME includes a file manager, a menu system, utilities, games, some media tools, and other things you'd expect from the modern proprietary OSes. Of course, there is also a development infrastructure as well, including accessibility for the disabled, internationalization, advanced multimedia, documentation, and support for many standards.

To provide all of this, GNOME is a whole lot of code. A very basic build is more than 60 modules, each with its own build and dependencies. Wheeler's sloccount on a ‘basic’ build (no web browser and no multimedia infrastructure) shows 8,000 .c or .h files and roughly 2.0 million lines of code [Wheeler]. When counting word processing, web browsing, spreadsheets, media handling, and e-mail and calendaring (all of which are provided by fairly robust, complex GTK/GNOME apps), the total grows to roughly 4.2 million lines of code.

2.1.1 The GNOME User Base

To a free software hacker, of course, users are not just users—they are potential volunteers. The ‘modern desktop users’ GNOME developers like and/or hate to talk about are actually quite numerous, which means a large base of people who generate bug reports (especially stack traces) and who can also possibly be persuaded to do QA work. For context, Ximian GNOME 1.4 had a million and a half installations. By late in the 2.0 cycle, daily rpm builds of CVS HEAD drew thousands of downloads a day, and each tarball pre-release of GNOME 2.0 was downloaded and built by tens of thousands of testers. So, even during the relatively unstable runup to 2.0, thousands of users were pounding GNOME's pre-releases on a daily basis, and many of those became willing helpers in the QA process: over 1,500 people submitted bugs during the 2.0 release cycle, and several thousand more crashes were submitted anonymously. This type of QA is difficult for anything but the largest proprietary software companies to match, and has been invaluable to GNOME.

2.1.2 QA in GNOME 2.0

As can be imagined, this much code, with this many users, trying to exercise a lot of functionality, while developers seek to make things more functional, usable and accessible, generates a lot of crash data and bug reports. Between January 2002 and the release of GNOME 2.0 on June 26th, 2002, slightly over 10,000 bugs were tagged as 2.0 bugs and an additional 13,000 non-2.0 bugs were filed. The QA team triaged or closed over 17,000 of those. Over 5,000 more were eventually marked as fixed, including over 1,000 deemed high priority by the QA team. For a project with a fairly small active core development team, these were huge numbers. It’s only because of volunteer help in identifying and triaging bugs that dealing with this at all was possible. During the 2.0 cycle, regular ‘bug days’—12 hour long meetings of new volunteers—and a mailing list helped coordinate and recruit volunteers.

3 So what did we learn?

None of this was a pretty or clean process; lots of mistakes were made and not quite as many lessons learned. But we did learn a number of general rules for volunteer driven, high-volume bug handling. The gist of these lessons can be summarized simply—QA volunteers must work with their community to find, identify, and track the most important bugs. But the details are more complex and will, I hope, make this paper worth reading.

3.1 QA and Hackers

While ESR may have his critics, he was undoubtedly right in observing that we are all in this to scratch our itches, whatever those itches may be. Free software QA is a slightly odd bird in this light—QA volunteers are in it to scratch the itch of higher quality software, but they can’t do it themselves. That means paying a lot more attention to the needs of others than may be typical for free software. Following are some of the GNOME team’s lessons learned.

3.1.1 Rule 1: free software QA must support the needs and desires of developers in order to succeed

This seems fairly obvious, but it is also fairly easy to forget or ignore. Free software begins and ends with developers who are having fun. There may be others involved for reasons other than ‘fun’, but if QA’s sole purpose is to whine about what QA thinks are the important flaws, volunteers will leave, and leave quickly. QA must think first and foremost not about their own goals, but about the goals of developers.

To put it another way: developers think they can do their thing without QA (which, in free software, they mostly can) and QA absolutely cannot do its job (which is getting bugs fixed, not just finding bugs) without developers. If QA forgets this in proprietary software, developers have to suck it up. If QA forgets this in free software, developers will ignore them, or worse, walk away from the project.

This is not to say that QA must be silent servants of hackers, never giving feedback or their own input. QA volunteers can be and should be trusted individuals whose advice is valued. But that will happen more quickly and more easily if the goal of supporting and aiding hackers is always first and foremost.

3.1.2 Rule 2: QA needs guidance from maintainers

In order for QA volunteers to serve the needs of maintainers and developers in general, maintainers and developers must clearly communicate their priorities. This falls out pretty cleanly from rule 0: if a project doesn’t know where the project is going, or what the project’s developers want, then it is going to be very hard for QA to help reach those goals. This also means that when a project is conflicted, QA teams may not be of as much utility as a project expects.

‘What a project wants’, from a QA perspective, is usually pretty obvious in GNOME—stability, stability, stability. If a program can be crashed, or a button doesn’t work, everyone typically agrees that this should raise a red flag.


But past that, things often get murkier—some maintainers may, for example, care deeply about code quality, while others may be deeply involved with fixing usability issues. And how does one weigh a difficult-to-reproduce crasher against, say, a build problem on Solaris? That’s not typically an easy or fun call to make; it’s nearly an impossible one to make correctly unless a QA volunteer has guidance from the developer.

In proprietary QA, these answers are usually pretty easy—compare against a design doc and go. In free software QA, where design docs are often lacking in details if they exist at all, developers must do the best they can to communicate to QA what exactly the priorities are so that when the QA team finds a problem they can classify it correctly.

3.1.3 Rule 3: QA must persuade hackers they are useful and intelligent

When I came on board the Evolution team, the universal response was ‘oh, someone is going to mark duplicates for us, that’s nice.’ When I left, I was very flattered to know that the team was worried about a lot more than duplicates. The difference between coming and going was not just that I was effective, but that the very first thing I did was work very hard to learn the lay of the land in Evolution. Instead of reading one bug and deciding ‘this is bad’, I read nearly two thousand bugs before doing anything more than rudimentary marking of duplicates in the bug database. This is an extreme example, of course, but it is the direction Free Software QA volunteers should lean if they can.

In contrast, some first-time GNOME volunteers have dropped in, read one or two bugs, and decided ‘oh, this bug is hugely important’, and tried to mark it as such. Worse, some will try to guess at the source of problems in code they’ve never looked at, or (this is inevitably the most irritating to developers) they’ll declare that something ‘must be easy to fix’. Invariably, this leads to irritation from developers who have seen a lot more issues and have, unsurprisingly, a much better grasp of their own code. The best way to avoid this is to work hard to always make the right call, especially when first working on a project. There aren’t the obvious checks of functionality and code review that typically establish trust between hackers—so very sound and conservative judgement has to substitute at first.

3.1.4 Rule 4: Bugzilla cannot be the end-all and be-all of communication between hackers and QA

Bugzilla is a wonderful tool that allows for great communication and incredibly flexible ways to sort, parse, and otherwise mangle bugs. But it doesn’t speak to mailing lists, and it can’t selectively poke hackers about important issues. QA volunteers must actively seek out other, non-Bugzilla forms of communication—mailing lists and IRC, primarily, but also web pages and other forums. Use these channels to draw attention to QA and to QA processes—most important new or outstanding bugs, important recent fixes, new features or reports in Bugzilla, or even a simple ‘this many bugs were opened and this many closed last week.’ By doing this, a QA team can establish an identity as a ‘regular’ part of the development process even amongst developers who aren’t familiar or comfortable with Bugzilla.

This was a lesson learned the hard way in GNOME. During the 2.0 cycle, the bug squad assembled and emailed regular Friday status reports to GNOME’s main development list. It was well received by hackers, but like many things in free software, it wasn’t completely appreciated until (during the 2.2 cycle) it was gone, done away with by lack of attention on the part of the bug squad and the mistaken belief that it wasn’t very useful to developers. Developers let us know, and as a result we’ll try to bring it back.

3.2 Triage

Triage is a word that has been thrown around in this paper a fair amount—before going further it may be useful to define it. In medicine, battlefield triage is the process of separating the very badly wounded from those who are lightly wounded and those who are so wounded that they will die regardless of treatment. In a free software context, it’s the process of separating and identifying the bugs that are most severe and/or most useful to developers out from the inevitable mountain of bugs that will come in for any popular project. Specifically, in GNOME, we triage by setting ‘priority’, ‘severity’, and ‘milestone’ fields in Bugzilla. Like communication, GNOME has learned a fair amount about this in the past year.

3.2.1 Rule 5: bugs need to be triaged, not just tracked

When I came into GNOME and Evolution, both projects knew that having a Bugzilla was a good thing. So, they dutifully entered bugs in their bug tracking systems—they tracked bugs. But neither project had useful definitions of severity and priority—they couldn’t or didn’t triage their bugs. So what they had, when they needed to know what came next, was a large list of bugs in basically random order. Not surprisingly, that wasn’t very helpful and so bugs ended up getting entered into Bugzilla and never read again. Developers ended up maintaining lists outside of Bugzilla to help them figure out what to do next—a silly duplication, and an inefficiency in projects that can’t really afford much inefficiency.

If this kind of thing is happening, it indicates that Bugzilla is not being used properly. The solution is to carefully define priorities, severities, and milestones, and use them religiously, by looking at every bug and making at least an attempt to judge how bad it is and when it should be fixed by. When it comes down to release time, having consistently marked bugs with these priorities means that it will be much easier to say ‘these things must be fixed, these we fix if we have time, these we pretend just don’t exist.’ And that will leave you with much better software.

3.3 Rule 6: triage must be tied to release goals

This is a whole lot like Rules 0 and 1, so I’ll be brief. It’s worth repeating, though—triage is basically the art of determining what is important, and if QA and hackers frequently disagree on what is important, QA will get ignored. This greatly reduces the space for personal freedom in QA—several volunteers have come into GNOME, picked up on a pet theme and marked those bugs up, and I’ve spent a great deal of time apologizing for them. Bugzilla cannot be allowed to become a soapbox for anything except the goals maintainers have already agreed to. If there is dissent on those goals, take it to the lists—Bugzilla is not a good forum for setting or arguing goals.

(In proprietary software, this is easy—‘project goals’ are in a tightly defined project spec that must be followed. Bug volunteers, especially those coming from a proprietary background, must remember that this just isn’t possible in Free Software.)


3.3.1 Rule 7: triage new bugs aggressively, or Bugzilla will quickly terrify maintainers

The initial temptation of almost all the QA volunteers I’ve dealt with is to assume that a bug they’ve just read is extremely important, and should be prioritized to reflect that. In some ways, this is true—we do see a lot of very ugly bugs that, in an ideal world given infinite resources and time, would be high priority. But we live in a world of volunteers and spare time, so marking bugs as more important than others should be done only carefully and conservatively.

Most free software hackers work on their projects in blocks of very short periods of time. That means that if Bugzilla is their TODO list, the smaller and the more sorted it is, the more beautiful. In practice, we’ve found that it means that once maintainers trust their QA, they tend to only look at high priority bugs. This can be scary—it puts a lot of power in the hands of QA, and messing up by deciding that a bug is not important enough for a maintainer to look at is seemingly very bad. This is utterly true in proprietary QA—if a QA guy screws up and punts something that he shouldn’t, there may not be much of a system of checks and balances to catch the error. Free Software QA saves us from such a fate—punt a bug or mark it low priority, and if it is important, ten other people will file it or add comments. The massive volume of bugs we get is a constant check.

For example, in GNOME, we regularly see crashes that a maintainer or QA volunteer (or often the original reporter) decides are completely impossible to reproduce. We knock them down in importance or close them in that case. Often, they actually are impossible to reproduce—a build problem, a transient issue that got fixed the next day, or other such. But in some cases, after everyone has thrown up their hands, you’ll continue to see reports of the crash. The ‘mistake’ of triaging or punting the original crash can then be revisited—thanks to the volume of bugs we receive, we’ve gotten ample confirmation that maybe it wasn’t such a bogon after all.

This isn’t perfect, of course—in Evolution, for example, we get relatively few bugs on first-time installation, so a single punt on an installation issue may obscure much deeper and more important issues that won’t be filed again for some time. But, unfortunately, it’s something that frequently must be done—the alternative is often for maintainers to query Bugzilla and face massive lists that are quickly overwhelming. QA can and should serve as a buffer for that if necessary.

3.3.2 Rule 7: closing old bugs, even completely unread, is unpleasant but OK

GNOME’s QA was publicly flamed several months ago by someone (we’ll call them ‘james w. z.’) for mass-closing old GNOME bugs without substantially reviewing them. This was an unfortunate thing that we hate to do, but it was justified. In the typical free software cycle, a project starts off too unstable and with too few users to get many bug reports. After the project builds and grows, you still have all the old bugs from the early period, and an increasing number of users and bug reporters, many of whom are filing bugs you can’t possibly have time to fix or even sometimes look at before your next rewrite and release.

Faced with an escalating number of bugs, a volunteer-driven project that can’t easily bring in more resources has two options: mass close old bugs with an ‘if you still see this in our latest release, please reopen the bug’, or let the DB grow so large that it is unusable for hackers and QA volunteers alike. From these two, the choice is obvious if unpleasant. Furthermore, as ‘james’ reminded us, this isn’t something that is easy for bug filers to understand. But when doing it, remind yourself: if it is still a bug, someone will file it again.

3.3.3 Rule 8: triage rules can’t be just in one person’s head

As already mentioned, the first step I took when moving in to GNOME was to revise and rewrite GNOME’s definitions for priorities. Previously, they’d been fairly broad and unspecific. The new priorities gave specific examples, and tried to group problems into specific classes as well. This was an important first step for sane triage across Bugzilla. But it was not enough—nearly all judgment calls by volunteers ended up coming back to me for validation, since the definitions did not include a lot of my experience and judgement—just examples and definitions. So, I’ve started (and the QA volunteers have rewritten and completed) a GNOME triage guide [http://developer.gnome.org/projects/bugsquad/triage/]. This document attempts to put a lot of collective wisdom down onto paper, and makes it easier for new volunteers to come in and get started, and for old volunteers and developers to understand more precisely what should be going on.

This will hold true for any project without strong guidelines, I believe—either a large group of volunteers will inconsistently apply their own judgments (confusing developers) or the project will become overly dependent on one person, which will eventually again lead to inconsistency as the mass of bugs becomes too much for that one person to handle. Again, this was a lesson learned by GNOME only after 2.0—during the 2.0 cycle, much of the triage wisdom stayed in my head, and when I had less time (during the 2.2 cycle) the process grew a bit creaky, because triage often blocked on my availability to answer judgment calls.

3.4 Some Miscellaneous (But Important) Observations on Free Software QA

There are a few other important lessons GNOME has learned that aren’t rules, per se, but which everyone trying to do Free Software QA should always keep in mind.

3.4.1 Observation 1: volunteers and hackers are expensive, and bugs are cheap

You could also think of this as ‘volunteers are scarce, bug filers are like locusts.’ This has a number of implications for Free Software QA—many of the rules I’ve previously cited are almost the direct results of this observation. Many others I haven’t cited also fall out of it. If you keep this simple observation (almost more a law than a rule) in mind, you’ll find the others with time.

3.4.2 Observation 2: triage is an imperfect art

Despite the immediately previous suggestions about how to make triage consistent, it must be understood that triage is an imperfect art, where a certain amount of inconsistency is inevitable.

As already mentioned, the best way to triage is to read a lot of bugs first, to gain an appreciation for what types of bugs a project is seeing and how severe they are. But even after having read 20,000 or so bugs in the past year, over four projects, drawing the line even between seemingly simple things like ‘is this an enhancement or a bug’ is a frequent borderline judgement call for me.


Everyone involved in the QA process—bug reporters, bug fixers, and bug triagers (both casual and regular) must learn to accept this and work with it. The important lesson here is that volunteers should not be held to an impossible standard—both volunteers and developers must understand that differences of opinion will happen and aren’t the end of the world. There will be thousands more bugs to work with if one gets screwed up. :)

3.4.3 Observation 3: QA is winning when people are interested in the process, not just the results

So how can one know when QA is starting to win? At what point can a QA volunteer sit back and say ‘the hard part is done, now all I have to do is read bugs?’ I’d suggest that one important metric is noting the point when the standard response by developers to bug reports is ‘put it in Bugzilla.’ GNOME moved very slowly in this direction, but that’s now pretty much the standard response on mailing lists when a bug is reported to a list—‘take it to Bugzilla.’ There are other parts of the process as well—bug days, noting bug numbers in CVS commits or code comments, and an overall commitment by developers to working with QA volunteers.

4 And these rules apply to other large projects how?

4.1 XFree

A few months back, XFree had a discussion on their mailing list about use of Bugzilla to track XFree86 bugs. The response was...underwhelming. Why? The main fears were pretty straightforward: ‘will we get lots of useless bug email?’, ‘will people try to control what we do?’, and of course ‘what benefit do we get?’

The answers to these questions aren’t always obvious to a project just embarking on doing serious Free Software QA for the first time. Being more public can definitely open maintainers up to more mail. Obviously, this can happen—as I discussed, it’s even possible that less buggy software will get more bug reports. That’s not truly a requirement, though—even if Bugzilla is used only to triage and track bugs that come in through other forums (say, open only to developers, and used to track issues reported to a mailing list) it can still be of great use to a project, assuming that the other rules I laid out about supporting developer goals and defining the triage process well are followed. GNOME actually allows anonymous bug submission, the opposite end of the spectrum, and while this is far from perfect, it has helped us make huge leaps and bounds in terms of stability by encouraging stack-trace submission.

Concerns about ‘control’ were equally unfounded—even borderline paranoiac. XFree, like all other Free Software projects, is controlled by hackers and hackers alone. If a hacker decides that QA volunteers aren’t to be trusted, or simply disagrees with triage decisions, they can ignore them and move on. The burden is on those running the QA process to prove that their bugs are valid and useful. I’ve given some suggestions on how this can happen already.

Finally, the most obvious and at the same time most difficult question—“what benefit do we get?” I got into Linux because I’d heard about it, heard it didn’t crash, and one night Outlook crashed 10 times while I was trying to write a single email to a professor. So for me, more stable software is an obvious benefit of working in QA. That is, admittedly, not for everyone. Answering this question really requires some introspection on the part of hackers and maintainers—if you want to make software that is good for your users (virtually no matter how you define good), then your project wants a QA process and wants Bugzilla. If you are in free software purely because you want to write cool hacks, or because in free software, no one can tell you what to do, Bugzilla may not be for your project. But that’s a question only you can answer. Frankly, on reading the XFree lists, it was not altogether clear that many of the XFree hackers were particularly concerned about their users. If that is the case (and that is most definitely their prerogative as authors of the code) then perhaps Bugzilla is not for them. No matter how hard they try, it would be hard for QA volunteers to support the pursuit of power or cool hacks.

4.2 Kernel

Reading kernel traffic, I was very pleased to see that Andrew Morton had proposed not just a list of bugs, but actually defined what he felt should be the standard for “when should we go to 2.6.0?” I’m a long time k-t reader, and this was the first time I’d seen something of the sort. Defining and agreeing on that is part of my Rule 0—QA has to support development, and developers have to tell QA what they want. The list had even been split into rudimentary “can’t ship without fixing” and “it would be nice” lists—a big first step towards solid triage.

It was sort of disappointing, though, to read through the details—the vast majority of issues had no bugs associated with them, and squirreled away down at the bottom was “and there are several hundred open Bugzilla bugs.” This was the kind of opportunity kernel bug people should have seized on (and possibly have, since this paper was written, of course). Bugzilla is perfectly designed to track these kinds of issues and their progress. Some intrepid volunteer could easily have volunteered to enter every issue into Bugzilla, assign it a high priority, and assign it to the owner of the issue. From there, patches could have been attached and tracked, punting an issue from one list to another would have been as simple as changing a single field, outsiders could easily discover what bugs had and hadn’t been fixed already, and a host of other things. Instead of ongoing IRC status meetings where many things were reported fixed or irrelevant, a simple query could have reported a list before the meeting that could be updated by all participants in parallel if need be. (And of course, no more diffs to show what had changed—again, a simple, dynamic query shows what has changed over any period of time.)

Similarly, the “several hundred open Bugzilla bugs” were a great invitation for someone to work with Bugzilla, trawl them (it only takes a weekend, at worst, to read a couple hundred bugs, once you’ve got the knack) and start making preliminary suggestions to maintainers about important bugs that were in Bugzilla but not on the list. Remember rule two—persuade the hackers you are useful. Filling in the blanks on information they knew must be there but didn’t have time to find themselves is a great step towards that, and reading the (currently small) list of open bugs to get perspective would have been a great start for those looking to help out and do effective triage.

Pre-release is the best time for QA and Bugzilla—priorities are typically clear cut, hackers are most pressed for time and so most appreciative of the help, and hackers are the most motivated to work on bug-fixing instead of pie-in-the-sky features. Hopefully, someone involved in the kernel community will find this general advice useful and can take advantage of this relatively rare time in the kernel cycle.

5 Conclusion

Free Software QA is a slightly different beast, playing with different sets of data inputs and different sets of motivations than a typical QA process. As a result, making QA central to the release process is not easy for any Free Software project, and it’s even harder to stay with it once it is successful, since success breeds difficulty. But it can be done if communication, motivation and technique are all brought in line with each other. GNOME did, and benefited immensely from it. It is my hope that other large projects will be able to learn from our lead.

6 Acknowledgments

Thanks to Ximian and Sun, for allowing me to work so extensively with the GNOME community.

Thanks to the Bugsquad and all the volunteers who preceded it, first for doing so much work for their own communities, and second for keeping me sane while I worked on Evolution and GNOME. And thirdly for suggesting the title of my talk.

Thanks to all those who preceded me, at Mozilla and GNOME, for giving me something to work with—tools, skills, and data.

Thanks to ed on gimpnet, for helping me fight through the structure of this paper.

7 Availability

This paper and slides from the associated presentation will be available from

http://tieguy.org/talks/

References

[Wheeler] David Wheeler’s SLOCCount, http://www.dwheeler.com/sloccount/


Performance Testing the Linux Kernel
The re-aim workload

Cliff White
Open Source Development Labs

[email protected], http://www.osdl.org/archive/cliffw

Abstract

Good performance testing requires good tests and good procedures. This paper discusses experiences creating and using an automated test environment.

The paper also describes work done at Open Source Development Labs (OSDL™) in re-writing and modernizing the AIM7 and AIM9 benchmarks. The intent is to make the benchmarks relevant for modern hardware by making them flexible and extensible.

This paper talks about how to create a testing environment, how to automate it, and how to select and evaluate potential tests. It describes the differences between low-level (micro) workloads and application-modeling (macro) workloads, using OSDL Scalable Test Platform tests as examples, and the difference between tests that focus on specific areas and tests that exercise broad areas.

1 Introduction

Performance testing in kernel context is necessary to show that a projected improvement is in fact an improvement. Performance tests are used to measure large-scale application performance and small-scale system routine and system call performance.

There are two areas not specifically addressed by performance testing. One area is compliance, which the Linux Standard Base and Linux Test Project test suites both address. The other area is reliability—demonstrating the ability to sustain proper operation over long time spans.

A goal of the OSDL’s Scalable Test Platform is to measure performance, over and over again. To do this, we run publicly available workloads, and we create a few of our own. This paper describes work being done to revise an old workload suite, the AIM tests.

2 Creating a Proper Test Environment

A good test must be repeatable. It is very important that multiple runs of a test on the same hardware with the same kernel produce the same results. OSDL’s STP creates this repeatable environment by re-loading the test machines with a new OS before every run. Thus, every test starts from an identical state. For non-STP testing, the system is set up for repeated runs by using a Makefile. Whenever a new test is created, the first thing created is a setup/tear-down Makefile. In the Makefile, careful track is kept of everything added to the system for the test.

When running the test, care should be taken to understand and control the test environment. There are a few areas to consider:


• Networking – This should be obvious, but any test of networking should be run on a private network, where no other traffic impacts the test measurement. This is especially important when a test is controlled or monitored via a network.

• Other shared resources – Might include shared storage arrays, or other devices. Again, it is best to use dedicated hardware or stop other users before testing.

• State of the system prior to test startup – This is especially important for repeating test results. Rebooting the system prior to every test run is one way to assure a known state. However, many tests are heavily influenced by cache effects and this must be considered. When testing database workloads, it is common to warm the database cache prior to taking any performance measurement.

• Repeat testing for repeatability – A single test result might be useful. A repeatable test result is much more usable. Statistically, three runs are about the minimum for good data; five or more runs are better.

2.1 Experiences from the STP

Here is some advice, culled from experiences adding tests to the OSDL’s Scalable Test Platform.

• When scripting for automation, error recovery is everything. Error reporting is more important. Error discovery is most important. You will find that making things happen in a script is easy—knowing when things have not happened and doing the right thing thereafter is hard.

• Be very paranoid. Review and sanity check test results frequently. Hardware failures can be very sneaky; repeating known tests can be a good way to spot flaky hardware.

• When running a large or even a medium number of systems, administration tools are very important, especially tools that allow you to look at health over time.

• When testing kernels, sometimes the most interesting tests are the ones that do not run at all.

• Likewise, be aware of timeout conditions—the tests that never complete can also be interesting. You should have timeout conditions for each phase of an automated process.

• Build the tools to parse and present results when you build the test. If possible, build the tool so you can compare multiple runs.

• Likewise, instrument the test when you build the test. Add readprofile or oprofile if possible. However, be aware of the impact of your instrumentation; touching /proc too frequently can impact your results.

• Results presentation is very important. Design the report so that the most important data is the easiest to see.

• Establish a baseline run you can use for comparison purposes.

• Compare frequently to that baseline. Test results in isolation are less interesting than comparisons to known conditions.

• Establish your hardware baselines in as much detail as you can. In a perfect world, what is the maximum rate your disk subsystem can deliver? Knowing these rates can help you determine when a test is using real hardware and when a test is running from disk cache.

2.2 Macro and Micro Workloads

STP uses two very different types of tests when testing kernel performance. These tests are divided into macro and micro workloads. Micro workloads are tests that exercise a very small piece of the system, such as a single system call. These tests focus on the low-level performance details.

A macro workload is a simulation of a real-world task. Macro workloads are sometimes created from real customer workloads, or may be designed to a specification, such as the Transaction Processing Performance Council’s TPC-N specification. These large-scale workloads might include OLTP applications, decision support systems or reservation and inventory systems. (OSDL’s macro workloads include the Database test suite—the subject of another paper at this conference.)

It is important that we do not confuse the results of macro and micro workloads, or attempt to extrapolate too much real-world behavior from micro measurements. Micro benchmarks are usually developer-focused and not very useful for understanding customer needs.

When looking at macro benchmarks, avoid confusing simulation with reality, and extrapolating results beyond the specific configuration and problem tested. Many macro benchmarks are grounded in real customer needs and situations, but some are designed more for marketing price/performance comparisons.

2.2.1 The AIM Suite

The AIM suite was created by AIM Technology in the 1980s. The AIM company sponsored a yearly ‘Hot Iron’ [DEC] award for hardware manufacturers, with prizes awarded in various price/performance categories. To quote from a press release [HP]:

Since 1981, AIM Technology has provided vendors and end users comprehensive, unbiased performance testing to help users determine the best fit between their application needs and available systems and configurations. ... AIM is an independent organization, as opposed to a vendor consortium, which allows AIM to bring an expert eye to performance measurement, not restrained by objectives of consortium members.

The company no longer exists and the awards are no longer being given out. SCO acquired rights to the AIM technology in 2000, and placed suites VII and IX of the test under the GPL. From the SCO web page [SCO]:

The AIM benchmark technology has proved useful for more than a decade in measuring performance of hardware and versions of the Unix operating system. The benchmarks were licensed by nearly all of the vendors of Unix system hardware. More than 70 companies used these benchmarks to compare and tune products. In addition, because of the stressful multidimensional nature of the AIM workload many OS and hardware vendors have used the benchmarks as part of their quality assurance process.


The AIM suite combines features of both macro and micro tests. The suite consists of a list of sub-tests, also known as “jobs.” Each job exercises a specific area of system functionality, such as file I/O, shared memory, process creation, and compute-intensive math tasks. Each job consists of a C function which is linked to the driver code. A job may loop repeatedly internal to the C function. (For example, each addition test does 1.5 million internal loops.) Lists of jobs are grouped into a “workload,” contained in a workload file.
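To make that structure concrete, here is a minimal, hypothetical sketch of the idea: a couple of jobs written as C functions with internal loops, dispatched through function pointers, each carrying a weight. The names, arguments, and weights are illustrative only and are not taken from the actual AIM or Re-aim source.

#include <stdio.h>

/* Each job is a C function; it loops internally and returns the number of
 * internal iterations it performed. */
static long add_long_job(void)
{
        volatile long sum = 0;
        long i;
        for (i = 0; i < 1500000; i++)   /* e.g. 1.5 million internal loops */
                sum += i;
        return 1500000;
}

static long add_double_job(void)
{
        volatile double sum = 0.0;
        long i;
        for (i = 0; i < 1500000; i++)
                sum += (double)i;
        return 1500000;
}

/* The driver calls each job through a function pointer; a per-job weight
 * (the "hit" count) controls how often it appears in a task list. */
struct job {
        const char *name;
        long (*func)(void);
        int weight;
};

static struct job job_table[] = {
        { "add_long",   add_long_job,   1 },
        { "add_double", add_double_job, 1 },
};

int main(void)
{
        for (unsigned i = 0; i < sizeof(job_table) / sizeof(job_table[0]); i++)
                printf("%s: %ld iterations\n",
                       job_table[i].name, job_table[i].func());
        return 0;
}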

The test runs in two modes. In the single-user mode (AIM suite IX), each job in the workload file is executed serially by the test driver. An alarm is set, and the job is executed repeatedly until the alarm expires. The alarm time is referred to as the “test period.” Performance is calculated by multiplying the number of job executions by the internal job loop count, and dividing by the test period. Results reported are iterations per test period and operations per second for each job.
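A rough sketch of that single-user timing loop follows, with a dummy job standing in for the real sub-tests; the real driver’s bookkeeping is more involved, and the names here are assumptions rather than the actual code.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t period_expired;

static void on_alarm(int sig)
{
        (void)sig;
        period_expired = 1;
}

static long dummy_job(void)
{
        volatile long sum = 0;
        for (long i = 0; i < 1500000; i++)   /* stand-in internal loop */
                sum += i;
        return sum;
}

/* Run one job repeatedly for 'period' seconds and report iterations and
 * operations per second (iterations * internal loop count / period). */
void run_single_user(long (*job)(void), const char *name,
                     unsigned period, long internal_loops)
{
        long iterations = 0;

        period_expired = 0;
        signal(SIGALRM, on_alarm);
        alarm(period);                 /* the "test period" */

        while (!period_expired) {
                job();
                iterations++;
        }

        printf("%s: %ld iterations, %.2f ops/sec\n", name, iterations,
               (double)(iterations * internal_loops) / period);
}

int main(void)
{
        run_single_user(dummy_job, "add_long", 10, 1500000);
        return 0;
}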

In the multiuser mode, the driver forks a number of subprocesses, giving each an identical list of jobs. The length of the job list is variable; the default is 100 jobs per child. The jobs are identical to the single user case. Each job is given a weight (the ‘hit’ count). This weight is used to calculate the fraction of the total work performed by each subtest; the total work is the sum of all job weights.

A typical test execution consists of a series of passes wherein the number of subprocesses is increased on each pass. Each subprocess runs a randomly-ordered set of jobs until its list is exhausted. The driver waits for all the child processes to exit, and records the time between child start and child exit. From this data two numbers are calculated—the jobs per minute (JPM) and jobs per child process per minute (JPM_C) [SGI].
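One plausible reading of that arithmetic is sketched below: total jobs completed divided by the parent (wall-clock) time, scaled to minutes, with the per-child figure dividing by the number of children. Plugging in the 10-child, 75.23-second run reported later in Table 3 reproduces the 797.55 JPM shown there, so this reading appears consistent with the data, though the exact code is not shown here.

#include <stdio.h>

int main(void)
{
        int    num_children   = 10;     /* forked subprocesses          */
        int    jobs_per_child = 100;    /* default task-list length     */
        double parent_seconds = 75.23;  /* wall-clock time for the run  */

        double total_jobs = (double)num_children * jobs_per_child;
        double jpm   = total_jobs * 60.0 / parent_seconds; /* jobs per minute */
        double jpm_c = jpm / num_children;                 /* per child       */

        printf("JPM = %.2f, JPM_C = %.2f\n", jpm, jpm_c);
        return 0;
}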

As the system load increases the jobs per minute increases until it reaches a peak. If the number of child processes continues to increase, the work per child per minute begins to decrease. Depending on the command line options, the test run terminates when child work decreases below a threshold or the number of child processes reaches the maximum desired. Results reported are parent time, total child time, jobs per minute, and jobs per minute per child.

The AIM suite provides a set of building blocks (the sub-tests) that can be combined to create a simulation of a real-world workload. The old test has several examples of these workloads, including simulations of databases, file servers, and compute servers. The workload can be adjusted by altering test weight or changing the test mix.

3 Re-aim – AIM rework

3.1 The driver

The AIM code has remained untouched since 1991. I re-wrote the driver portion of the code so that I could understand it, maintain it and enhance the list of sub-tests. After studying the old code for a time, I chose to write a new driver, preserving as much of the functionality of the old driver as was useful. No doubt a different coder could have continued to maintain the existing structure; I chose not to.

The old driver used global data structures and static defines to control the size of the test list, the number of test arguments, and other details. The static definitions were replaced with dynamically linked lists for flexibility. The AIM7 and AIM9 tests use almost the same list of tests, so a common driver was desirable.

For convenience, the GNU autoconf tools were used for the build and install system. The following parts of the old AIM framework are essentially unchanged:

• The method of statically linking test modules to the driver engine code, and calling those modules through a function pointer.

• The method for calculating workload task distribution and weighting.

• The method for calculating disk file size and distribution.

• The majority of the test module code (not changed at this time).

• The method of calculating the results is unchanged; however, the timing method and the location of timestamps relative to the driver sleep() have changed.

• The adaptive timer remains the default.

The current driver has the following options, shown in Table 1, which may help explain usage.

Option               Description
-d(x), --debug(x)    Turns on debugging output, 1 is default
-v, --verbose        Produces more output
-s(x), --startusers  Number of users at start
-e(x), --endusers    Number of users at end
-i(x), --increment   Number to increment by
-f(s), --file(s)     Workfile name (default ‘workfile’)
-l(s), --list(s)     Config file name (default ‘reaim.config’)
-c, --crossover      Run to crossover (JPM/user less than 10.0)
-q, --quick          Run to quick crossover (JPM/user less than 60.0)
-x                   Runs until max JPM detected
-j(x), --jobs(x)     Number of jobs in tasklist (minimum is workfile size)
-m, --multiuser      AIM7 style, default
-t, --timeroff       No adaptive timer
-o, --oneuser        Runs AIM9 style, single thread
-p, --period         Length for single thread
-r, --repeat         Iterate entire test
-h, --help           This message

Table 1: Re-aim Options

Most of the parameters can be specified in a configuration file; options in this file are ignored if the command line option is present. Disk directories and disk file sizes must be specified in a configuration file.

Several things were noted while re-doing the driver.

• The old multi-user test ran until the jobs/child/minute was less than 1.0. This is quite a load on modern systems, resulting in runs greater than eight hours to attain convergence. This length of run is generally not useful for such a performance test, so the default crossover is jobs/child/minute less than 10.0, with a second switch to set this to a quick test value of 60.0. On a 2-CPU 800MHz Pentium III system, the quick test converges at 15–50 users, depending on workload. The default crossover point is 50–200 users, depending on the workload.

• The jobs per child is now adjustable, with a default value of 100. This can be used to cause a set number of child processes to do more or less work without changing the workload.

• In a perfect world, all children (doing equal work) should receive equal favor from the scheduler. In reality, as the number of children exceeds the number of CPUs, unfairness occurs and the child exit is serialized. In addition, the child exit timing is collected serially by the parent using wait(). The maximum and minimum child exit times are recorded to reflect this. This variance also appears in the standard deviation calculated by keeping a running total across all child exits.

• Timestamps are collected with the times() function. The parent time figure is effectively wall clock time for the test. This function also allows us to extract the system and user time as seen by the child process. This information is reported as a running total. The child time thus exceeds the parent time in the report.

• Filesize and poolsize (see below) are now set in the configuration file. If either is specified in the workfile, that setting overrides the configuration file, maintaining old behavior.

• A method for detecting the maximum jobs per minute was added. When the -x option is used, the jobs per minute rate is tested by taking the standard deviation across the last five test iterations. If the standard deviation is less than 1.0 percent of the average, the test exits. In addition, if the JPM rate drops more than 1.0 below the average, the test exits. Maximum jobs per minute are always reported. (A sketch of this check follows the list.)
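The following is a small, illustrative sketch of that -x termination check as described in the last bullet; the variable names and exact arithmetic are assumptions, not the driver’s actual code.

#include <math.h>

#define WINDOW 5

/* Return 1 if the run should stop: either the last WINDOW JPM samples are
 * stable (stddev < 1.0% of their average) or the newest sample has dropped
 * more than 1.0 below that average. */
int max_jpm_reached(const double jpm[], int count)
{
        double sum = 0.0, sumsq = 0.0, avg, var, stddev;
        int i;

        if (count < WINDOW)
                return 0;

        for (i = count - WINDOW; i < count; i++) {
                sum += jpm[i];
                sumsq += jpm[i] * jpm[i];
        }
        avg = sum / WINDOW;
        var = sumsq / WINDOW - avg * avg;
        stddev = sqrt(var > 0.0 ? var : 0.0);   /* guard against rounding */

        if (stddev < 0.01 * avg)        /* stable: peak reached */
                return 1;
        if (jpm[count - 1] < avg - 1.0) /* falling off the peak */
                return 1;
        return 0;
}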

3.2 Math tests

Time changes everything. Years ago, when computing was frequently referred to as “number-crunching,” math performance was an exciting topic. Today, in the kernel context, when run single threaded, these math tests tell us very little. Fluctuations in the single-user (AIM9) integer math test times are undoubtedly due to non-math causes, and do not typically reflect a change in the kernel. The multiuser case is a bit different—when we examine it we see that all these tests run entirely in user space. If we think of each subtest as a part of a larger workload, these user space functions are quite useful.

Table 2 shows typical parent times and child system and user times when running these tests on a 2-CPU system (Linux-2.4.18).

Test_Name    Parent   Child Sys   Child Usr
add_short     5.96      0.00        1.18
add_double   15.71      0.00        3.13
add_float    10.58      0.00        2.09
add_int      16.90      0.00        3.37
add_long     16.93      0.00        3.38
mul_short     0.50      0.00        0.09
mul_long      0.42      0.00        0.07
mul_int       0.40      0.00        0.07
mul_float    17.43      0.00        3.48
mul_double   17.55      0.00        3.48
div_double   15.40      0.00        3.07
div_float    15.83      0.00        3.07
div_int      18.94      0.00        3.76
div_long     18.83      0.00        3.76
div_short    18.94      0.00        3.76

Table 2: Re-aim Math Tests


Number Forked   Parent Time   Jobs per Minute
Without math tests
      10           75.23           797.55
With math tests
      10           58.07          1064.23

Table 3: Database Load Comparisons

Number Forked   Parent Time   Jobs per Minute
Equal Weight
      10           39.17          1531.78
      20           66.41          1806.96
Disk:math - 4:1
      10           57.38          1045.66
      20           91.33          1313.92
Disk:math - 1:4
      10           26.53          2261.59
      20           49.07          2445.49

Table 4: Effects of Test Weight

Adding these user space workloads to the multiuser test produces these results, shown in Table 3:

This appears a bit counter-intuitive—we have a longer test list, but it runs faster! Remember that the number of tests per child is constant (100 in this case). Adding the short user-space math tests to the workload actually decreases the amount of work per child. Here are some further examples of how changing a simple mix can change the run time. We’ll start with four tests, equally weighted, then we will set the disk test weight to four times the weight for the math tests, then do the reverse. Results are shown in Table 4.

There are fifteen of these math tests, all are tight loops. No changes in these tests are planned.

Num Forked   Parent Time   Child SysTime   Child UTime
    10          23.70           7.64           4.10
    20          26.29          15.30           8.27
    30          29.02          23.02          12.30
    40          31.55          31.48          16.38
    50          35.91          39.54          20.33

Table 5: High System Time Load

3.3 Other Tests

The math tests are notable for consuming mostly user time. There is another list of tests that consume mostly system time. These tests include the various memory tests (brk, shared memory) and the various system call tests (create/close, link, fork, exec). Combining these tests into a single workload does consume more system time, as seen in Table 5:

The current list of system-call focused tests is a bit short. Repeated runs of various workloads have not yielded memory consumption at reasonable user levels.

Another current question involves the shell_rtn tests, which currently use the shortest possible shell script. In addition, the three functions calling the shell are identical. The reason for this duplication is unknown.

The intent is to examine other open sources of test routines for incorporation into this run framework.

3.4 Disk Tests

The disk tests in the old AIM test consist of three groups: basic block I/O tests, the same tests with an added sync, and the sync I/O tests. Each test determines file size from a global variable, disk_iteration_count. There are two configuration variables that control this, FILESIZE and POOLSIZE (specified in kilobytes or megabytes). If POOLSIZE is zero, each child will write or read a total of FILESIZE bytes. If POOLSIZE is non-zero, child file size is equal to FILESIZE + (POOLSIZE/number_of_children). Thus when POOLSIZE is non-zero, I/O per child will be reduced on each increase in child count.

For example, specifying a FILESIZE of 10K and a POOLSIZE of 100K will result in a single child creating a 110K byte file on each disk device listed. Two children will create a 60K file, etc. 24 children will create a 14K file, consuming 328KB per disk device.
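That per-child sizing rule is simple enough to sketch directly; the function name below is illustrative, but the arithmetic follows the description above and reproduces the 110K, 60K, and 14K figures from the worked example (sizes in KB, integer division).

#include <stdio.h>

/* Per-child file size: FILESIZE if POOLSIZE is zero, otherwise
 * FILESIZE + POOLSIZE / number_of_children. */
long child_file_size(long filesize, long poolsize, int num_children)
{
        if (poolsize == 0)
                return filesize;
        return filesize + poolsize / num_children;
}

int main(void)
{
        int counts[] = { 1, 2, 24 };
        for (int i = 0; i < 3; i++)
                printf("%2d children -> %ldK per child\n",
                       counts[i], child_file_size(10, 100, counts[i]));
        return 0;
}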

The old AIM tests follow this sequence:

• creat() file

• write file

• close() file descriptor

• open() file descriptor

• do test

This results in the disk test running entirely from cache. I added a second set of disk tests using this method (a C sketch follows the list):

• creat() file

• write file

• close() file descriptor

• sync()

• open() file descriptor

• do test
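A minimal sketch of that second setup sequence in C appears below; the function name and error handling are illustrative, and the actual read/write test body is omitted.

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* creat, write, close, sync, then reopen the file for the test proper. */
int setup_test_file_with_sync(const char *path, const char *buf, size_t len)
{
        int fd = creat(path, 0644);            /* creat() file            */
        if (fd < 0)
                return -1;
        if (write(fd, buf, len) < 0) {         /* write file              */
                close(fd);
                return -1;
        }
        close(fd);                             /* close() file descriptor */
        sync();                                /* force dirty data to disk */
        return open(path, O_RDWR);             /* open(), then do the test */
}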

This simple change noticeably impacted performance:

Random Disk Writes
disk_rw without sync()   21922 (1K) per second
disk_rw with sync()      1218.78 (1K) per second

The sync() number is more indicative of real-world hardware performance, but the cache-only version of the tests may be of greater interest to kernel developers.

The third category of disk tests performs the same operations, but descriptors are opened with the O_SYNC flag. (The read-only test is not performed, of course.) This test is of lesser interest, due to the relative slowness of O_SYNC.

The current disk tests do all IO at 1K block sizes. Future improvements to the disk test suite include:

• Tests that use O_DIRECT and raw IO.

• Tests that use a common file created during the test setup or prior to the test run, requiring noticeable non-cached IO.

• Tests that produce measurable read activity, period. This is a weakness of the cache-intensive design of the current tests. Many test runs show little or no real read IO—files are created, read and destroyed too quickly.

• Tests that attempt to consume a noticeable percentage of the cache.

• Temporary file creation is currently serialized; multiple devices should work in parallel.

The final test is disk_src, which does a series of directory searches. This test is of interest due to its use of the dcache. Future enhancements include creating a script which will allow other trees to be searched by disk_src, in place of the current fakeh.tar.


Run Time     Change
2 seconds    2.39%
4 seconds    2.17%
8 seconds    2.02%
15 seconds   1.52%
30 seconds   1.46%
45 seconds   1.62%

Table 6: Single user variation—3 runs each

3.5 Comparison of AIM9 Duration

This comparison attempts to show the useful duration for the single user (AIM9) test run. A proper duration should produce stable results from run to run. To test this, a single user test was run three times using a list of fifty-four tests. Average change between tests was compared across the three runs, as shown in Table 6. (Note: each test must complete one full loop.) While the run-to-run performance does stabilize slightly when the test duration exceeds fifteen seconds, run-to-run stability does not improve noticeably beyond that point. This has been reflected in the choice of default settings for the single user run duration (10 seconds).

4 Run results

4.1 List of the workloads

Appendix A has a list of the various workloads with run times on several sample configurations.

4.2 Comparisons – 2.5

Table 7 is a quick comparison of a 2.5 patch set, which is a subset of one of Martin Bligh’s trees. We can see by this quick comparison that the patch does improve performance. The test was run on a small 2-CPU system, with 1GB of physical memory and IDE disks.

Forks   JPM-mjb     JPM       delta
 10     1167.50    1074.15     8.3%
 20     1240.24    1219.13     1.7%
 22     1252.72    1219.14     2.7%
100     1247.36    1203.68     3.6%

Table 7: Comparison of 2.5.68 and 2.5.68-mjb0.5

5 Conclusions

I have described the work that has been done to change from AIM to Re-aim. I intend to spend a great deal more time adding to the list of test cases and otherwise improving the usefulness of the tests.

6 Availability

The Re-aim code is available from SourceForge:

http://sourceforge.net/projects/re-aim-7

Or via BitKeeper:

bk://bk.osdl.org/aimrework

7 Trademarks

Linux is a trademark of Linus Torvalds.

OSDL is a trademark of Open Source Development Labs, Inc. All other marks and brands are the property of their respective owners.


8 Acknowledgements

Thanks to Ruth Forester and John Hawkes for advice, and OSDL for support.

References

[SCO] Web page announcing AIM suite release, http://www.caldera.com/developers/community/contrib/aim.html, 2000.

[HP] Press Release with AIM description, http://www.compaq.com.hk/press/release/99press/990623.html, 1995.

[DEC] Press Release with AIM description, http://wint.decsy.ru/alphaservers/digital/v0000022.htm, 1995.

[SGI] Ruth Forester, et al. “Filesystem Performance and Scalability in Linux 2.4.17,” Proceedings of the 2002 USENIX Annual Technical Conference, Berkeley, CA, 2002. http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf

A Re-aim Results

A.1 Example runs

This appendix shows how various workloads perform on some sample systems. Workloads were run until max sustainable jobs were reached. The results shown below are the maximum users obtained by each workload. Several iterations are shown in some cases to demonstrate typical run termination—the adaptive timer was used for these runs. See the source package for a listing of each workfile. Some of the workloads have arbitrary names reflecting run duration. This is not intended as a hardware comparison.

We notice that for several of the workloads, scaling is roughly linear across the three configurations. For other workloads, most noticeably the fserver and Dbase, performance on the quad system jumps markedly. However, the adaptive timer skews the increment such that comparisons may not be relied upon—any true comparison should be made without the adaptive timer. (The adaptive timer was used in this case to reduce total run time.) The additional disks on the Quad system appear to impact run times. The other systems under test use disks which are shared by the system (/tmp or /usr/tmp). The quad system has 5 spindles of disk devoted to the tests. The actual test report includes data on standard deviation and confidence levels. These columns have been removed, due to text formatting requirements.

The systems:

1. Single CPU
   PIII - 600MHz
   384MB RAM
   single IDE disk
   Linux-2.5.68 - stock
   FILESIZE 10k
   POOLSIZE 100k

2. Dual CPU
   PIII - 868MHz
   1GB RAM
   Dual IDE disk
   Linux-2.5.68 - stock
   FILESIZE 10k
   POOLSIZE 1m

3. Quad CPU
   PIII - 700MHz
   4GB RAM
   5 SCSI disks
   Linux-2.4.20 - stock
   FILESIZE 10k
   POOLSIZE 1m

A.2 The Workloads

workfile.all_utime (Table 8). All these tests run entirely in user space.

workfile.alltests (Table 9). The full test list.

workfile.compute (Table 10). From the old test. Simulation of a compute-intensive server. 31.7% of this workload are tests from the all_utime list.

workfile.dbase (Table 11). Simulation of a database load. 21.8% of this workload are tests from the all_utime list.

workfile.disk (Table 12). The disk tests with no other work. All tests in this list are weighted equally. Notice the difference between this workload and the fserver workload, which includes other subtests.

workfile.fivesec (Table 13). A completely artificial grouping of tests, based on their run duration when tested on a UP system.

workfile.fserver (Table 14). Simulation of a file server. 21.8% of this mix is 100% user time tests, which matches the dbase workfile.

workfile.long (Table 15). A completely artificial grouping of tests, based on their run duration when tested on a UP system.

workfile.shared (Table 16). Simulation of a multi-user shared server, assumed to be supporting telnet clients. 39.7% of the work mix are 100% user time tests.

workfile.shortTable 17 A completely artificial

Max Jobs per minuteSingle - 1044.37 (1 user)Dual - 2938.27 (7 users)Quad - 4896.00 (12 users)Num Parent Child Child Jobs perForked Time SysTime UTime MinuteSingle5 29.32 0.00 29.32 1043.66Dual14 29.27 0.01 56.72 2927.23Quad20 25.04 0.03 100.04 4888.18

Table 8: All User Time Workload

Max Jobs per minuteSingle - 1839.22 (118 users)Dual - 4233.31 (345 users)Quad - 7207.78 (281 users)

Num Parent Child Child Jobs perForked Time SysTime UTime MinuteSingle222 674.92 79.09 583.36 1835.42Dual545 727.78 288.98 973.43 4223.53Quad281 217.54 219.01 611.41 7207.78343 317.68 494.41 750.89 6024.74

Table 9: All Tests Workload



Max Jobs per minute
  Single - 803.29 (5 users)
  Dual   - 1429.25 (7 users)
  Quad   - 4708.68 (753 users)

         Num     Parent   Child    Child    Jobs per
         Forked  Time     SysTime  UTime    Minute
Single   6       45.57    1.90     43.63    797.89
Dual     10      43.35    3.73     78.56    1397.92
Quad     753     969.10   246.37   3620.94  4708.68
         1007    1300.27  348.33   4838.78  4693.19

Table 10: Compute Workload

Max Jobs per minute
  Single - 806.63
  Dual   - 1186.52
  Quad   - 1124.23

         Num     Parent   Child    Child    Jobs per
         Forked  Time     SysTime  UTime    Minute
Single   10      73.64    3.92     68.51    806.63
Dual     53      265.33   38.30    457.17   1186.52
Quad     383     2626.73  7089.40  2706.01  866.10

Table 11: Dbase Workload

Max Jobs per minute
  Single - 1059.20
  Dual   - 2753.83
  Quad   - 9723.69

         Num     Parent   Child    Child    Jobs per
         Forked  Time     SysTime  UTime    Minute
Single   52      309.29   19.06    11.38    1059.20
         65      396.09   24.01    14.32    1033.86
Dual     259     592.52   272.62   45.07    2753.83
         349     816.90   370.08   61.38    2691.52
         386     927.06   423.03   67.48    2623.13
         464     1131.55  518.55   81.22    2583.36
Quad     352     374.70   766.36   53.39    5918.33
         510     330.43   899.31   76.65    9723.69
         807     867.55   3087.03  119.11   5860.30

Table 12: Disk Workload

Max Jobs per minute
  Single - 2014.13
  Dual   - 3872.86
  Quad   - 10995.93

         Num     Parent   Child    Child    Jobs per
         Forked  Time     SysTime  UTime    Minute
Single   24      70.78    16.60    12.21    2014.13
         32      101.06   22.16    16.34    1880.86
         35      110.06   24.29    17.73    1888.97
         41      130.73   28.48    20.89    1862.92
Dual     136     208.59   158.24   51.78    3872.86
         180     287.05   210.50   68.99    3724.79
         198     319.77   231.98   75.75    3678.02
Quad     432     265.98   698.35   170.78   9647.64
         550     297.11   875.30   214.95   10995.93
         799     1023.44  3577.29  314.54   4637.36

Table 13: FiveSec Workload

Max Jobs per minute
  Single - 1617.78
  Dual   - 4267.88
  Quad   - 149.06

         Num     Parent   Child    Child    Jobs per
         Forked  Time     SysTime  UTime    Minute
Single   17      63.68    9.74     37.24    1617.78
         19      72.01    11.17    41.57    1598.94
         23      89.66    13.41    50.25    1554.54
Dual     328     465.73   241.80   483.01   4267.88
         367     525.04   272.13   540.58   4235.91
         449     639.48   339.93   661.62   4254.93
         531     756.40   402.28   782.39   4254.18
Quad     141     5732.32  4832.17  283.59   149.06
         145     5968.47  4803.54  290.69   147.22
         146     6112.96  5288.60  292.91   144.74

Table 14: Fserver Workload


Max Jobs per minute
  Single - 952.59
  Dual   - 2945.31
  Quad   - 5007.89

         Num     Parent   Child    Child    Jobs per
         Forked  Time     SysTime  UTime    Minute
Single   12      81.63    3.42     69.14    952.59
Dual     187     411.42   57.24    730.43   2945.31
Quad     48      62.11    12.72    231.40   5007.89

Table 15: Long Workload

Max Jobs per minute
  Single - 1177.14
  Dual   - 2232.94
  Quad   - 2153.06

         Num     Parent   Child    Child    Jobs per
         Forked  Time     SysTime  UTime    Minute
Single   12      59.33    5.09     51.69    1177.14
         16      80.37    6.84     68.84    1158.64
Dual     28      72.98    23.81    95.95    2232.94
         34      103.61   26.71    116.53   1909.85
Quad     132     520.91   386.16   436.88   1474.80
         182     624.65   585.43   628.17   1695.73
         291     786.61   1135.61  1002.55  2153.06
         400     1409.01  3649.24  1380.71  1652.22

Table 16: Shared Workload

Max Jobs per minute
  Single - 45333.33
  Dual   - 166909.09
  Quad   - 222545.45

         Num     Parent   Child    Child    Jobs per
         Forked  Time     SysTime  UTime    Minute
Single   6       0.82     0.38     0.42     44780.49
Dual     9       0.33     0.27     0.39     166909.09
Quad     4       0.11     0.24     0.23     222545.45
         8       0.26     0.47     0.45     188307.69

Table 17: Short Workload


Stressing Linux with Real-world Workloads

Mark Wong
Open Source Development Labs

[email protected]

Abstract

Open Source Development Labs (OSDL™) have developed three freely available real-world database workloads to characterize the performance of the Linux kernel. These customizable workloads are modeled after the Transaction Processing Performance Council's industry standard W, C, and H benchmarks, and are intended for kernel developers with or without database management experience. The Scalable Test Platform can be used by those who do not feel they have the resources to run a large scale database workload to collect system statistics and kernel profile data. This paper describes the kind of real world activities simulated by each workload and how they stress the Linux kernel.

1 Introduction

OSDL1 is dedicated to providing developers with resources to build enterprise enhancements into the Linux® kernel and its Open Source solution stacks. This paper focuses on real-world database workloads, based on industry standard benchmarks, used to stress Linux.

OSDL currently provides three workloads, derived from specifications published by the Transaction Processing Performance Council2

1 http://www.osdl.org/
2 http://www.tpc.org/

(TPC). TPC is a non-profit corporation created to define database benchmarks with which performance data can be disseminated to the industry.

TPC benchmarks are intended to be used as a competitive tool. All published TPC benchmark results must comply with strict publication rules and auditing to ensure fair comparisons between competitors. Furthermore, it is required that all the hardware and software used in a benchmark must be commercially available, with a full disclosure report of the pricing of all the products used as well as support and maintenance costs.

As a basis for our three workloads, we used the TPC Benchmark∗ W (TPC-W∗), TPC Benchmark C (TPC-C∗), and TPC Benchmark H (TPC-H∗). Each benchmark is briefly described here. TPC-W is a Web commerce benchmark, simulating the activities of Web browsers accessing an on-line book reseller for browsing, searching or ordering. TPC-C is an on-line transaction processing (OLTP) benchmark, simulating the activities of a supplier managing warehouse orders and stock. TPC-H is an ad hoc decision support benchmark, simulating an application performing complex business analyses that supports the making of sound business decisions for a wholesale supplier.

It is impractical for most Linux kernel developers to adhere to TPC rules, simply for cost alone. The executive summaries of the published results on the TPC Results Listing Web page3 show system costs in the millions of dollars.



To illustrate, the highest performing TPC-W result at the 10,000 item scale factor4 [TPCW10K] uses forty-eight dual-processor Web servers, twelve dual-processor and two single-processor Web caches, and one system with eight processors and approximately 150 hard drives for the database management system. This does not include the systems emulating Web browsers that drive the benchmark.

The highest performing TPC-C result [TPCCRESULT] uses thirty-two eight-processor systems with 108 hard drives each, four four-processor systems with fourteen hard drives each for the database management system, and sixty-four dual-processor clients. This does not include the systems emulating the terminals required to drive the benchmark.

The highest performing TPC-H result at the 10,000 GB scale factor [TPCH10000GB] uses sixty-four dual-processor systems with four gigabytes of memory each and a total of 896 hard drives.

2 Database Test Suite

The Database Test Suite5 is a collection of simplified derivatives of the TPC benchmarks that simulate real-world workloads in smaller scale environments than that of a full blown TPC benchmark. These workloads are not very well suited to compare databases or systems because the variations allowed in running the workload would result in an apples-to-oranges comparison. However, these workloads can be used to compare the performance between different Linux kernels on the same system.

3 Current results can be viewed on the Web at: http://www.tpc.org/information/results.asp

4 The scale factor determines the initial size of the database.

5 http://www.osdl.org/projects/performance/


The amount of database administration knowledge and resources needed to run one of these workloads may still be intimidating to some, but the OSDL Scalable Test Platform6 (STP) offsets these concerns. How the STP can be used is discussed towards the end of this paper.

These tests were initially developed on Linux with SAP DB, but each test kit is designed to be usable with any database, such as MySQL7 or PostgreSQL8, with some porting work. Members of the PostgreSQL community are currently contributing to the Database Test Suite so that it may be used with PostgreSQL and also run on FreeBSD9.

Each test provides scripts to collect system statistics using sar, iostat, and vmstat. Database statistics for SAP DB are also collected using the tools that are provided with the database. The data presented for each workload in this paper are primarily profile data to show what parts of the Linux kernel are exercised. Keep in mind that the profile data presented here characterize the workload for a specific set of parameters and system configurations. For example, as we review the profile data we will see that Database Test 2 appears to be stressing a SCSI disk controller driver, yet the workload can be customized so that the working set of data fits completely into memory, putting the focus of the workload on the system memory and processors as opposed to the storage subsystem.

6 STP can be accessed through the Web at http://www.osdl.org/stp/

7 http://www.mysql.com/
8 http://www.postgresql.org/
9 http://www.freebsd.org/


2.1 Database Test 1 (DBT-1)

Database Test 1 (DBT-1) is derived from the TPC-W Specification [TPCW], which typically consists of an array of Web servers, hosting the on-line reseller's store front, that interfaces with a database management system, shown in Figure 1. An array of systems is also required to support the benchmark by simulating Web browsers accessing the Web site.

Figure 1: TPC-W Component Diagram

DBT-1, on the other hand, focuses on the activities of the database. There are no Web servers used, and a simple database caching application can be used to simulate some of the effects that a Web cache would have on the database. Since no Web servers are used, emulated Web browsers are not fully implemented. Instead, a driver is implemented to simulate users requesting the same interactions against the database that a Web browser would make against a Web server.

Figure 2 is a component diagram of the programs used in DBT-1. The Driver is a multi-threaded program that simulates users accessing a store front. The Application Server is another multi-threaded application that manages a pool of database connections and handles interaction requests from the Driver. The Database Cache is yet another multi-threaded program that extracts data from the database before the test starts running. If the caching component is not used, the Application Server queries the database directly. Each component of DBT-1 can run on separate or shared systems, in other words, in a three-tier or one-tier environment.


Figure 2: DBT-1 Component Diagram

The emulated users behave similarly to the TPC-W emulated browsers by executing interactions that search for products, make orders, display orders, and perform administrative tasks. There are a total of fourteen interactions that can occur, as shown in Table 1 with their frequency of execution. Each interaction, except Customer Registration, causes read-only or read-write I/O activity to occur. The overall effect is that 80% of the interactions executed cause read-only I/O activity while the remaining 20% also cause writing to occur. Admin Request, Best Sellers, Home, New Products, Order Inquiry, Order Display, Product Detail, Search Request, and Search Results are the read-only interactions. Admin Confirm, Buy Request, Buy Confirm, and Shopping Cart are the read-write interactions. Customer Registration does not interact with the database. It is maintained in the workload to keep the interaction mix close to the TPC-W specification.

An emulated user maintains state between interactions to simulate a browser session. A session lasts an average of fifteen minutes, which can be tester defined, over a negative exponential distribution. The state maintained includes the emulated user's identification and a shopping cart identifier. Each emulated user also randomly picks a think time from a negative exponential distribution, with a specified average, to determine how long to sleep between interactions.
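As an illustration of that sampling step, the following fragment shows one conventional way to draw a delay from a negative exponential distribution with a given mean. This is a minimal sketch, not code from the DBT-1 kit, and the function name is hypothetical.

#include <math.h>
#include <stdlib.h>

/* Draw a delay (in seconds) from a negative exponential distribution
 * with the given mean, using inverse transform sampling. */
static double
neg_exp_delay (double mean)
{
    /* Map rand() into (0,1) so that log() is never given zero. */
    double u = (rand () + 1.0) / ((double) RAND_MAX + 2.0);
    return -mean * log (u);
}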


Interaction               Executed
Admin Confirm              0.09 %
Admin Request              0.10 %
Best Sellers              11.00 %
Buy Confirm                0.69 %
Buy Request                0.75 %
Customer Registration      0.82 %
Home                      29.00 %
New Products              11.00 %
Order Display              0.25 %
Order Inquiry              0.30 %
Product Detail            21.00 %
Search Request            12.00 %
Search Results            11.00 %
Shopping Cart              2.00 %

Table 1: DBT-1 Database Interactions Mix


This workload generally exhibits high processor and memory activity mixed with networking and low to medium I/O activity that is mostly read-only. The parameters that can be controlled to alter the characteristics of the workload are the scale factor of the database, the number of connections the Application Server opens to the database, the number of emulated users created by the driver, and the think time between interaction requests.

The following results for DBT-1 are collected from a four-processor Pentium III Xeon∗ system with 1 MB of L2 cache and 4 GB of memory in the OSDL STP environment. The system is configured in a one-tier environment with SAP DB, using a total of eleven raw disk devices, running against Linux 2.5.67. A database with 10,000 items and 600 users is created, where 400 users are emulated using an average think time of 1.6 seconds.

Table 2 displays the top 20 functions called, sorted by the frequency of clock ticks. The profile data is collected using readprofile where the processors in the system become 100% utilized, as shown in Figure 3. Other than the scheduler, we can see that TCP network functions are called most frequently in this workload.

10 http://khack.osdl.org/stp/271067/


Function                          Ticks
default_idle                     944946
schedule                          15835
__wake_up                          7376
tcp_v4_rcv                         6871
__copy_to_user_ll                  6456
tcp_sendmsg                        6386
mod_timer                          4666
__copy_from_user_ll                4482
tcp_recvmsg                        4297
__copy_user_intel                  3602
tcp_transmit_skb                   3369
ip_queue_xmit                      3230
dev_queue_xmit                     3105
ip_output                          2936
__copy_user_zeroing_intel          2806
tcp_data_wait                      2711
tcp_rcv_established                2675
generic_file_aio_write_nolock      2614
fget                               2536
do_gettimeofday                    2420
...                                 ...
total                           1124848

Table 2: DBT-1 Profile Ticks

Table 3 displays the top 20 functions with the highest normalized load, which is calculated by dividing the number of ticks a function has recorded by the length of the address space the function occupies in memory. By comparing Table 2 and Table 3, we can see that __copy_to_user_ll and __copy_from_user_ll appear in both tables. This implies that the user address space is accessed frequently in this workload.
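As a rough sanity check on this definition (assuming the function length is reported in bytes), the figures in Tables 2 and 3 are consistent: default_idle records 944946 ticks and a normalized load of 14764.78, and 944946 / 14764.78 ≈ 64, i.e. a function roughly 64 bytes long.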


Function                        Load
default_idle              14764.7812
__wake_up                   153.6667
__copy_to_user_ll            57.6429
system_call                  52.5682
get_offset_tsc               45.9688
syscall_exit                 40.1818
__copy_from_user_ll          40.0179
fget                         31.7000
restore_fpu                  30.6875
fput                         26.4062
ipc_lock                     24.6250
__copy_user_intel            22.5125
sock_wfree                   19.6094
__copy_user_zeroing_intel    17.5375
local_bh_enable              16.7396
mod_timer                    16.2014
schedule                     15.4639
do_gettimeofday              15.1250
sockfd_lookup                14.8393
device_not_available         13.4146

Table 3: DBT-1 Normalized Profile Load

2.2 Database Test 2 (DBT-2)

Database Test 2 (DBT-2) is derived from the TPC-C Specification [TPCC], which typically consists of a database server and a transaction manager used to access the database server. There is also an array of systems required to support the benchmark by simulating terminals accessing the database. This benchmark can be run in one of two configurations, as shown in Figure 4 and Figure 5. The only difference between these two configurations is that the emulated terminals access the database through a transaction manager, labeled as the Client in Figure 4, while the emulated terminals access the database directly in Figure 5.

DBT-2 can also be configured to run in one of two ways. The first way is shown in Figure 6, where the Driver, a multi-threaded program that creates a thread for every terminal emulated, accesses the database through the Client program, a transaction manager that is a multi-threaded program managing a pool of database connections. The second way, shown in Figure 7, combines the functionality of the Client program into the Driver program so that the driver can connect directly to the database. In either case, the workload can be run in a single- or multi-tier environment.

[Plot: percent processor utilization (user, system, idle, wait) versus elapsed time in seconds.]

Figure 3: DBT-1 Processor Utilization

Figure 4: TPC-C Component Diagram 1


DBT-2 consists of five transactions that create orders, display order information, pay for orders, deliver orders, and examine stock levels. Table 4 lists each transaction and the frequency with which each is executed by the emulated terminals. The Delivery, New-Order, and Payment transactions are read-write transactions, while the Order-Status and Stock-Level transactions are read-only transactions.

This workload can be customized so that the working set of data is contained completely in memory. If the working set of data is cached completely in memory, the system puts a heavy load on processors and memory usage. In situations where the working set of data is not completely cached, random I/O activity increases, while in both cases, sequential writes to the database logging device occur throughout the test.


Figure 5: TPC-C Component Diagram 2

Figure 6: DBT-2 Component Diagram 1


There are several parameters that can be customized to alter the characteristics of the workload. There are constant keying and thinking times between interactions that can be tester defined. The keying time simulates the time taken to enter information into a terminal and the thinking time simulates the time taken for a tester to determine the next transaction to execute.

By default, every terminal that is emulated is assigned a district and a warehouse to work out of. This can be changed so that a terminal randomly picks a warehouse and district in a specified range for every transaction. The effect this has on the workload is that a single emulated terminal is likely to access a greater amount of data in the database over the course of a test.

Figure 7: DBT-2 Component Diagram 2

Transaction     Executed
Delivery          4.0 %
New-Order        45.0 %
Order-Status      4.0 %
Payment          43.0 %
Stock-Level       4.0 %

Table 4: DBT-2 Database Transaction Mix

Given the same amount of memory, this would create a workload less likely to be cached in memory and more likely to incur an increased amount of I/O activity. It would also create more lock contention in the database. By reducing the number of emulated terminals and by limiting the range of data an emulated terminal accesses in the database, the workload can be cached in memory, effectively creating a workload that only performs synchronous writes on the database logging device.

The following results for DBT-2 are collected from a four-processor Pentium III Xeon system with 1 MB of L2 cache and 4 GB of memory in the OSDL STP environment. The system is configured in a one-tier environment with SAP DB using twelve raw disk devices to run against Linux 2.5.67. Sixteen terminals are emulated to randomly select a district across six distinct warehouses with a keying and thinking time of zero seconds.

Table 5 displays the top 20 functions called, sorted by the frequency of clock ticks, and Table 6 displays the top 20 functions with the highest normalized load. If we compare these tables as we did with the results from DBT-1, we see that bounce_copy_vec, __blk_queue_bounce, scsi_request_fn, scsi_end_request, and page_address appear in both tables. The default_idle function also appears at the top of both tables. This implies that the processors on the system are not fully utilized and that the system may be stressing the storage subsystem. Figure 8 confirms that the processors are approximately 5% to 10% idle and are approximately 40% to 50% busy waiting for I/O.

11 http://khack.osdl.org/stp/271071/



Function                      Ticks
default_idle                5695427
bounce_copy_vec               84136
schedule                      55663
__blk_queue_bounce            28391
scsi_request_fn               23052
do_softirq                    21760
__make_request                20124
try_to_wake_up                10511
scsi_end_request              10161
system_call                    9734
dio_bio_end_io                 9241
scsi_queue_next_request        9066
ipc_lock                       6856
sys_semtimedop                 5858
do_anonymous_page              5693
kmem_cache_free                5587
free_hot_cold_page             4768
page_address                   4752
buffered_rmqueue               4648
try_atomic_semop               4406
...                             ...
total                       6211231

Table 5: DBT-2 Profile Ticks

Function                    Load
default_idle            88991.0469
bounce_copy_vec          1051.7000
system_call               221.2273
syscall_exit              105.5455
do_softirq                104.6154
ipc_lock                   85.7000
dio_bio_end_io             82.5089
kmem_cache_free            69.8375
scsi_end_request           63.5063
__wake_up                  55.9375
schedule                   54.3584
get_offset_tsc             54.1562
__blk_queue_bounce         47.9578
bio_put                    46.3542
scsi_request_fn            42.3750
fget                       37.5125
restore_fpu                36.7812
page_address               33.0000
generic_unplug_device      29.7143
device_not_available       29.6098

Table 6: DBT-2 Normalized Profile Load

2.3 Database Test 3 (DBT-3)

Database Test 3 (DBT-3) is derived from the TPC-H Specification [TPCH], which typically consists of a single database server that is queried by an application in a host-based or client/server configuration, as shown in Figure 9 and Figure 10.

There are twenty-two queries that provide business analyses for pricing and promotions, supply and demand management, profit and revenue management, customer satisfaction, market share, and shipping management. In addition to the twenty-two queries, there are two refresh functions that load new sales information into the database.

This workload consists of loading a database, running a series of queries against the database, and loading new sales information into the database. There are three distinct tests in which these actions occur: the Load Test, Power Test, and Throughput Test. The Load Test creates the database tables and loads data into them. The Power Test executes each of the twenty-two queries and two refresh functions sequentially. The Throughput Test executes a specified number of processes that each execute the twenty-two queries in parallel, and an equal number of processes that execute only the refresh functions.


[Plot: percent processor utilization (user, system, idle, wait) versus elapsed time in minutes.]

Figure 8: DBT-2 Processor Utilization

Figure 9: TPC-H Host-Based Component Diagram


There are several ways that this workload can be customized. The scale factor of the database can be selected so that at some point during a test, the working set of data becomes cached in memory. The number of streams for the throughput test may have to be adjusted according to the available system resources. The TPC-H Specification requires a minimum number of streams to be used depending on the scale factor of the database, and ideally the number of streams should be selected so that the highest throughput metric can be achieved. However, for DBT-3, the number of streams can be chosen according to how the Linux kernel is to be stressed by the workload. In any case, if the working set of data is not cached, large sequential I/O activity occurs in the Power and Throughput Tests. Also, each of the twenty-two queries can be modified to meet different needs. For example, a query can be modified to answer another type of business question.

Figure 10: TPC-H Client/Server Component Diagram


The following results for DBT-3 are collected from a four-processor Xeon∗ system with 256 KB of L2 cache and 4 GB of memory in the OSDL STP environment. The system is configured in a host-based environment running Linux 2.5.67 with hyper-threading enabled and a patch that allows the DAC960 driver to have direct memory access into high memory. A 1 GB database was created and only statistics from a Throughput Test with eight streams are reported here.

Table 7 displays the top 20 functions called, sorted by the frequency of clock ticks. Similar to DBT-2, the prominence of default_idle along with __make_request and DAC960_LP_InterruptHandler suggests that the system is also busy waiting for I/O. Table 8 displays the top 20 functions with the highest normalized load and again, similar to DBT-2, seeing DAC960_LP_InterruptHandler as one of the more prominent functions on this list also supports the theory that the kernel is spending a significant amount of time attempting to process I/O requests. Figure 11 shows that the processors are waiting for I/O 40% to 80% of the time throughout the middle of the Throughput Test.

12 http://khack.osdl.org/stp/271071/



Function                        Ticks
poll_idle                    24160697
__make_request                  21161
schedule                        16354
generic_unplug_device           14701
DAC960_LP_InterruptHandler      12160
system_call                      7361
kmap_atomic                      3783
fget                             3148
get_user_pages                   3099
do_direct_IO                     2968
dio_await_one                    2955
bio_alloc                        2850
device_not_available             2767
direct_io_worker                 2551
blockdev_direct_IO               2226
find_vma                         2020
__generic_file_aio_read          1924
follow_page                      1891
__copy_to_user_ll                1817
...                               ...
total                        24322403

Table 7: DBT-3 Profile Ticks

3 PLM and STP

Using the Database Test Suite can be an intimidating task for those inexperienced with administering database management systems, or large systems may not be readily available for testing. Rather than implementing one of the Database Test Suite workloads on their own system, Linux kernel developers can test their kernel patches by using the Patch Lifecycle Manager13 (PLM) and the STP. In order to use PLM or STP, you must sign up as an associate of the OSDL, free of charge, through the Web at http://www.osdl.org/.

Function                        Load
poll_idle                 383503.127
system_call                 167.2955
generic_unplug_device       140.0095
DAC960_LP_InterruptHandler   74.1463
device_not_available         67.4878
fget                         44.9714
kmap_atomic                  34.3909
fput                         31.1154
find_vma                     24.3373
unlock_page                  20.9059
restore_fpu                  20.4857
get_offset_tsc               19.6667
syscall_call                 19.4545
io_schedule                  18.8542
__make_request               18.7431
dio_await_one                18.5849
current_kernel_time          18.5303
math_state_restore           17.7846
mempool_alloc_slab           15.9524
kmem_cache_alloc             15.8158

Table 8: DBT-3 Normalized Profile Load


PLM can be used to store patches for the Linux kernel that can be used by STP for testing. Currently, PLM automatically copies Linus Torvalds's tree as well as Andrew Morton's, Martin Bligh's, Alan Cox's, and the ia64 patch sets. PLM also executes filters against each patch entered into the system, to verify the patch applies to a kernel or another patch, and verifies that the kernel can still be compiled with that patch.

13 PLM can be accessed through the Web at http://www.osdl.org/cgi-bin/plm/


[Plot: percent processor utilization (user, system, idle, wait) versus elapsed time in seconds.]

Figure 11: DBT-3 Throughput Test Processor Utilization


STP currently implements all three of the workloads in the Database Test Suite14. DBT-1 can be run on systems with 2, 4 or 8 processors; DBT-2 and DBT-3 can be run on systems with 4 processors. The 4- and 8-processor systems also have arrays of external hard drives attached. Each test in STP generates a Web page of results with links to raw data and charts, as well as profile data if desired. E-mail notification is also sent to the test requester when a test has completed.

4 Comparing Results

While the OSDL Database Test Suite is derived from TPC benchmarks, results from the Database Test Suite are in no way comparable to results published from TPC benchmarks. Such comparisons should be reported to the TPC ([email protected]) and to the OSDL ([email protected]).

14 DBT-3 is currently being developed for STP and should be available by the time this paper is published.

References

[TPCC] TPC Benchmark C Standard Specification Revision 5.0, February 26, 2001.

[TPCH] TPC Benchmark H Standard Specification Revision 1.5.0, 2002.

[TPCH10000GB] NCR 5350 Using Teradata V2R5.0 Executive Summary, Teradata, a division of NCR, March 12, 2003.

[TPCCRESULT] ProLiant DL760-900-256P Client/Server Executive Summary, Compaq Computer Corporation, September 19, 2001.

[TPCW] TPC Benchmark W Specification Version 1.6, August 14, 2001.

[TPCW10K] Netfinity 5600 with Netfinity 6000R using Microsoft SQL Server 2000 Executive Summary, International Business Machines, Inc., July 1, 2000.

Trademarks

OSDL is a trademark of Open Source Development Labs, Inc.

Linux is a registered trademark of Linus Torvalds.

∗ All other marks and brands are the property of their respective owners.


Xr: Cross-device Rendering for Vector Graphics

Carl Worth
USC, Information Sciences Institute

[email protected]

Keith Packard
Cambridge Research Laboratory, HP Labs, HP

[email protected]

Abstract

Xr provides a vector-based rendering API with output support for the X Window System and local image buffers. PostScript and PDF file output is planned. Xr is designed to produce identical output on all output media while taking advantage of display hardware acceleration through the X Render Extension.

Xr provides a stateful user-level API with support for the PDF 1.4 imaging model. Xr provides operations including stroking and filling Bézier cubic splines, transforming and compositing translucent images, and antialiased text rendering. The PostScript drawing model has been adapted for use within C applications. Extensions needed to support much of the PDF 1.4 imaging operations have been included. This integration of the familiar PostScript operational model within the native application language environment provides a simple and powerful new tool for graphics application development.

1 Introduction

The design of the Xr library is motivated by the desire to provide a high-quality rendering interface for all areas of application presentation, from labels and shading on buttons to the central image manipulation in a drawing or painting program. Xr targets displays, printers and local image buffers with a uniform rendering model so that applications can use the same API to present information regardless of the media.


The Xr library provides a device-independent API, and can currently drive X Window System[10] applications as well as manipulate images in the application address space. It can take advantage of the X Render Extension[7] where available but does not require it. The intent is to add support for Xr to produce PostScript[1] and PDF 1.4[5] output.

Moving from the primitive original graphics system available in the X Window System to a complete device-independent rendering environment should serve to drive future application development in exciting directions.

1.1 Vector Graphics

On modern display hardware, an application's desire to present information using abstract geometric objects must be translated to physical pixels at some point in the process. The later this transition occurs in the rendering process, the fewer pixelization artifacts will appear as a result of additional transformation operations on pixel-based data.



Existing application artwork is often generated in pixel format because the rendering operations available to the application at runtime are a mere shadow of those provided in a typical image manipulation program. Providing sufficient rendering functionality within the application environment allows artwork to be provided in vector form, which presents high quality results at a wide range of sizes.

Figure 1: Raster and vector images at original size (artwork courtesy of Larry Ewing and Simon Budig)

Figures 1-3 illustrate the benefits of vector artwork. The penguin on the left of Figure 1 is the familiar image as originally drawn by Larry Ewing[3]. The penguin on the right is an Xr rendering of vector-based artwork by Simon Budig[2] intended to match Ewing's artwork as closely as possible. At the original scale of the raster artwork, the two images are quite comparable.

Figure 2: Raster image scaled 400%

However, when the images are scaled up, the differences between raster and vector artwork become apparent. Figure 2 shows a portion of the original raster image scaled by a factor of 4 with the GIMP [6]. Artifacts from the scaling are apparent, primarily in the jaggies around the contour of the image. The GIMP did apply an interpolating filter to reduce these artifacts, but this comes at the cost of blurring the image. Compare this to Figure 3 where Xr has been used to draw the vector artwork at 4 times the original scale. Since the vector artwork is resolution independent, the artifacts of jaggies and blurring are not present in this image.

1.2 Vector Rendering Model

The two-dimensional graphics world is fortunate to have one dominant rendering model. With the introduction of desktop publishing and the PostScript printer, application developers converged on that model. Recent extensions to that model have been incorporated in PDF 1.4, but the basic architecture remains the same. PostScript provides a simple painter's model; each rendering operation places new paint on top of the contents of the surface. PDF 1.4 extends this model to include Porter/Duff image compositing [9] and other image manipulation operations which serve to bring the basic PostScript rendering model in line with modern application demands.

Figure 3: Vector image scaled 400%


PostScript and PDF draw geometric shapes by constructing arbitrary paths of lines and cubic Bézier splines. The coordinates used for the construction can be transformed with an affine matrix. This provides a powerful compositing technique, as the transformation may be set before a complex object is drawn to position and scale it appropriately. Text is treated as pre-built path sections, which couples it tightly and cleanly with the rest of the model.

1.3 Xr Programming Interface

While the goal of the Xr library is to provide a PDF 1.4 imaging model, PDF doesn't provide any programming language interface. Xr borrows its imperative immediate mode model from PostScript operators. However, instead of proposing a complete new programming language to encapsulate these operators, Xr uses C functions for the operations and expects the developer to use C instead of PostScript to implement the application part of the rendering system. This dramatically reduces the number of operations needed by the library as only those directly involved in graphics need be provided. The large number of PostScript operators that support a complete language are more than adequately replaced by the C programming language.

PostScript encapsulates rendering state in a global opaque object and provides simple operators to change various aspects of that state, from color to line width and dash patterns. Because global objects can cause various problems in C library interfaces, the graphics state in Xr is held in a structure that is passed to each of the library functions.

The translation from PostScript operators to the Xr interface is straightforward. For example, the lineto operator translates to the XrLineTo function. The coordinates of the line endpoint needed by the operator are preceded by the graphics state object in the Xr interface.

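As a quick illustration of that mapping (the literal coordinates below are only an assumed usage):

    % PostScript: line from the current point to (100, 200)
    100 200 lineto

    /* Xr: the same operation, with the graphics state passed explicitly */
    XrLineTo (xrs, 100, 200);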

2 API and Examples

This section provides a tour of the application programming interface (API) provided by Xr. Major features of the API are demonstrated in illustrations, and the source code for each illustration is provided in Appendix A.

2.1 Xr Initialization

#include <Xr.h>

#define WIDTH 600
#define HEIGHT 600
#define STRIDE (WIDTH * 4)

char image[STRIDE*HEIGHT];

int
main (void)
{
    XrState *xrs;

    xrs = XrCreate ();

    XrSetTargetImage (xrs, image,
                      XrFormatARGB32,
                      WIDTH, HEIGHT, STRIDE);

    /* draw things using xrs ... */

    XrDestroy (xrs);

    /* do something useful with image
       (eg. write to a file) */

    return 0;
}

Figure 4: Minimal program using Xr

Figure 4 shows a minimal program using Xr. This program does not actually do useful work; it never draws anything, but it demonstrates the initialization and cleanup procedures required for using Xr.

After including the Xr header file, the first Xr function a program must call is XrCreate. This function returns a pointer to an XrState object, which is used by Xr to store its data. The XrState pointer is passed as the first argument to almost all other Xr functions.



Before any drawing functions may be called, Xr must be provided with a target surface to receive the resulting graphics. The backend of Xr has support for multiple types of graphics targets. Currently, Xr has support for rendering to in-memory images as well as to any X Window System “drawable”, (eg. a window or a pixmap).

The program calls XrSetTargetImage to direct graphics to an array of bytes arranged as 4-byte ARGB pixels. A similar call, XrSetTargetDrawable, is available to direct graphics to an X drawable.

When the program is done using Xr, it signifies this by calling XrDestroy. During XrDestroy, all data is released from the XrState object. It is then invalid for the program to use the value of the XrState pointer until a new object is created by calling XrCreate. The results of any graphics operations are still available on the target surface, and the program can access that surface as appropriate, (eg. write the image to a file, display the graphics on the screen, etc.).

2.2 Transformations

All coordinates passed from user code to Xr are in a coordinate system known as “user space”. These coordinates are then transformed to “device space”, which corresponds to the device grid of the target surface. This transformation is controlled by the current transformation matrix (CTM) within Xr.

The initial CTM is established such that one user unit maps to an integer number of device pixels as close as possible to 3780 user units per meter (approximately 96 DPI) of physical device. This approach attempts to balance the competing desires of having a predictable real-world interpretation for user units and having the ability to draw elements on exact device pixel boundaries. Ideally, device pixels would be so small that the user could ignore pixel boundaries, but with current display pixel sizes of about 100 DPI, the pixel boundaries are still significant.

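(As a quick check of that figure: 3780 user units per meter × 0.0254 meters per inch ≈ 96.0 user units per inch, hence the approximately 96 DPI default.)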

The CTM can be modified by the user to position, scale, or rotate subsequent objects to be drawn. These operations are performed by the functions XrTranslate, XrScale, and XrRotate. Additionally, XrConcatMatrix will compose a given matrix into the current CTM and XrSetMatrix will directly set the CTM to a given matrix. The XrDefaultMatrix function can be used to restore the CTM to its original state.

Figure 5: Hering illusion (originally discovered by Ewald Hering in 1861)[11]. The radial lines were positioned with XrTranslate and XrRotate

In Figure 5, each of the radial lines was drawn using identical path coordinates. The different angles were achieved by calling XrRotate before drawing each line. The source code for this image is in Figure 11.
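A minimal sketch of that idea follows (this is not the Figure 11 source; the angle argument to XrRotate is assumed here to be in radians, and the coordinates are arbitrary):

void
draw_radial_lines (XrState *xrs, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        XrSave (xrs);
        /* Rotate the CTM, then stroke the same vertical segment. */
        XrRotate (xrs, i * 3.14159265 / n);
        XrMoveTo (xrs, 0, -100);
        XrLineTo (xrs, 0, 100);
        XrStroke (xrs);
        XrRestore (xrs);
    }
}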

2.3 Save/Restore of Graphics State

Programs using a structured approach to drawing will modify graphics state parameters in a hierarchical fashion. For example, while traversing a tree of objects to be drawn, a program may modify the CTM, current color, line width, etc. at each level of the hierarchy.



Xr supports this hierarchical approach to graphics by maintaining a stack of graphics state objects within the XrState object. The XrSave function pushes a copy of the current graphics state onto the top of the stack. Modifications to the graphics state are made only to the object on the top of the stack. The XrRestore function pops a graphics state object off of the stack, restoring all graphics parameters to their state before the last XrSave operation.

This model has proven effective within structured C programs. Most drawing functions can be written with the following style, wrapping the body of the function with calls to XrSave and XrRestore:

void
draw_something (XrState *xrs)
{
    XrSave (xrs);
    /* draw something here */
    XrRestore (xrs);
}

This approach has the benefit that modifications to the graphics state within the function will not be visible outside the function, leading to more readily reusable code. Sometimes a single function will contain multiple sections of code framed by XrSave/XrRestore calls. Some find it more readable to include a new indented block between the XrSave/XrRestore calls in this case. Figure 12 contains an example of this style.

2.4 Path Construction

One of the primary elements of the Xr graphics state is the current path. A path consists of one or more independent subpaths, each of which is an ordered set of straight or curved segments. Any non-empty path has a “current point”, the final coordinate in the final segment of the current subpath. Path construction functions may read and update the current point.


Xr provides several functions for constructing paths. XrNewPath installs an empty path, discarding any previously defined path. The first path construction function called after XrNewPath should be XrMoveTo, which simply moves the current point to the point specified. It is also valid to call XrMoveTo when the current path is non-empty in order to begin a new subpath.

XrLineTo adds a straight line segment to the current path, from the current point to the point specified. XrCurveTo adds a cubic Bézier spline with a control polygon defined by the current point as well as the three points specified.

XrClosePath closes the current subpath. This operation involves adding a straight line segment from the current point to the initial point of the current subpath, (ie. the point specified by the most recent call to XrMoveTo). Calling XrClosePath is not equivalent to adding the corresponding line segment with XrLineTo. The distinction is that a closed subpath will have a join at the junction of the final coincident point while an unclosed path will have caps on either end of the path, (even if the two ends happen to be coincident). See Section 2.5 for more discussion of caps and joins.

It is often convenient to specify path coordinates as relative offsets from the current point rather than as absolute coordinates. To allow this, Xr provides XrRelMoveTo, XrRelLineTo, and XrRelCurveTo. Figure 6 shows a rendering of a path constructed with one call to XrMoveTo and four calls to XrRelLineTo in a loop. The source code for this figure can be seen in Figure 13.
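A minimal sketch of relative path construction (this is not the Figure 13 source; the coordinates and loop count are arbitrary):

void
draw_zig_zag (XrState *xrs)
{
    int i;

    /* A zig-zag built entirely from offsets relative to the current point. */
    XrMoveTo (xrs, 10, 100);
    for (i = 0; i < 5; i++) {
        XrRelLineTo (xrs, 20, -40);
        XrRelLineTo (xrs, 20, 40);
    }
    XrStroke (xrs);
}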


Figure 6: Nested box illusion (after a figure by Al Seckel[11]). Constructed with XrMoveTo and XrRelLineTo

As rectangular paths are commonly used, Xr provides a convenience function for adding a rectangular subpath to the current path. A call to XrRectangle(xrs, x, y, width, height) is equivalent to the following sequence of calls:

XrMoveTo (xrs, x, y);
XrRelLineTo (xrs, width, 0);
XrRelLineTo (xrs, 0, height);
XrRelLineTo (xrs, -width, 0);
XrClosePath (xrs);

After a path is constructed, it can be drawn in one of two ways: stroking its outline (XrStroke) or filling its interior (XrFill).

2.5 Path Stroking

XrStroke draws the outline formed by stroking the path with a pen that in user space is circular with a radius of the current line width, (as set by XrSetLineWidth). The specification of the XrStroke operator is based on the convolution of polygonal tracings as set forth by Guibas, Ramshaw and Stolfi [4]. Convolution lends itself to efficient implementation as the outline of the stroke can be computed within an arbitrarily small error bound by simply using piece-wise linear approximations of the path and the pen.


As subsequent segments within a subpath are drawn, they are connected according to one of three different join styles, (bevel, miter, or round), as set by XrSetLineJoin. Closed subpaths are also joined at the closure point. Unclosed subpaths have one of three different cap styles, (butt, square, or round), applied at either end of the path. The cap style is set with the XrSetLineCap function.

Figure 7 demonstrates the three possible cap and join styles. The source code for this figure (Figure 12) demonstrates the use of XrSetLineJoin and XrSetLineCap as well as XrTranslate, XrSave, and XrRestore.

Figure 7: Demonstration of cap and join styles

2.6 Path Filling

XrFill fills the area on the “inside” of the current path. Xr can apply either the winding rule or the even-odd rule to determine the meaning of “inside”. This behavior is controlled by calling XrSetFillRule with a value of either XrFillRuleWinding or XrFillRuleEvenOdd.

Figure 8 demonstrates the effect of the fill rule given a star-shaped path. With the winding rule the entire star is filled in, while with the even-odd rule the center of the star is considered outside the path and is not filled. Figure 15 contains the source code for this example.
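Another small illustration of the same distinction (this is not the Figure 15 source; the coordinates are arbitrary): two nested rectangles traced in the same direction fill differently under the two rules.

void
fill_nested_squares (XrState *xrs)
{
    /* Outer and inner squares as two subpaths wound the same way.
     * With XrFillRuleWinding the whole outer square is painted;
     * with XrFillRuleEvenOdd the inner square is left unfilled. */
    XrRectangle (xrs, 0, 0, 200, 200);
    XrRectangle (xrs, 50, 50, 100, 100);
    XrSetFillRule (xrs, XrFillRuleEvenOdd);
    XrFill (xrs);
}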



Figure 8: Demonstration of the effect of the fill rule

2.7 Controlling Accuracy

The graphics rendering of Xr is carefully implemented to allow all rendering approximations to be performed within a user-specified error tolerance, (within the limits of machine arithmetic of course). The XrSetTolerance function allows the user to specify a maximum error in units of device pixels.

The tolerance value has a strong impact on the quality of rendered splines. Empirical testing with modern displays reveals that errors larger than 0.1 device pixels are observable. The default tolerance value in Xr is therefore 0.1 device pixels.

The user can increase the tolerance value to trade off rendering accuracy for performance. Figure 9 displays the same curved path rendered several times with increasing tolerance values. Figure 14 contains the source code for this figure.

2.8 Paint

The example renderings shown so far have all used opaque “paint” as the source color for all drawing operations. The color of this paint can be controlled with the XrSetRGBColor function.

Figure 9: Splines drawn with tolerance values of .1, .5, 1, 5, and 10

Xr supports more interesting possibilities for the paint used in graphics operations. First, the source color need not be opaque; the XrSetAlpha function establishes an opacity level for the source paint. The alpha value ranges from 0 (transparent) to 1 (opaque).

When Xr graphics operations combine translucent surfaces, there are a number of different ways in which the source and destination colors can be combined. Xr provides support for all of the Porter/Duff compositing operators as well as the extended operators defined in the X Render Extension. The desired operator is selected by calling XrSetOperator before compositing. The default operator value is XrOperatorOver, corresponding to the Porter/Duff OVER operator.
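A small sketch combining these calls (the exact argument order and value range of XrSetRGBColor are not spelled out in this paper, so the values here are assumptions; XrRectangle and XrFill are described in Sections 2.4 and 2.6):

void
fill_translucent_square (XrState *xrs)
{
    XrSetRGBColor (xrs, 0.0, 0.0, 1.0);  /* blue source paint (assumed 0-1 range) */
    XrSetAlpha (xrs, 0.5);               /* 50% opaque */
    XrSetOperator (xrs, XrOperatorOver); /* default Porter/Duff OVER */
    XrRectangle (xrs, 10, 10, 100, 100);
    XrFill (xrs);
}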

Finally, the XrSetPattern function allows any XrSurface to be installed as a static or repeating pattern to be used as the “paint” for subsequent graphics operations. The pattern surface may have been initialized from an external image source or may have been the result of previous Xr graphics operations.

Figure 10 was created by first drawing small, vertical black and white rectangles onto a 3x2 surface. This surface was then scaled, filtered, and used as the pattern for 3 XrFill operations. This demonstrates an efficient means of generating linear gradients within Xr.


Figure 10: Outline affects perception of depth from shading, (after an illustration by Isao Watanabe[14]). This example uses XrFill with XrSetPattern

2.9 Images

In addition to the vector path support, Xr also supports bitmapped images as primitive objects. Images are transformed, (and optionally filtered), by the CTM in the same manner as all other primitives. In order to display an image, an XrSurface object must first be created for the image, then the surface can be displayed with the XrShowSurface function. XrShowSurface places an image of the given width and height at the origin in user space, so XrTranslate can be used to position the surface.

In addition to the CTM, each surface also has its own matrix providing a transformation from user space to image space. This matrix can be used to transform a surface independently from the CTM.

The XrShowSurface function has another important use besides allowing the display of external images. When using the Porter/Duff compositing operators, it is often desirable to combine several graphics primitives on an intermediate surface before compositing the result onto the target surface. This functionality is similar to the notion of transparency groups in PDF 1.4 and can be achieved with the following idiom:

XrSave (xrs);
XrSetTargetSurface (xrs, surface);
/* draw to intermediate surface with
   any Xr functions */
XrRestore (xrs);
XrShowSurface (xrs, surface);

In this example an intermediate surface is installed as the target surface, and then graphics are drawn on the intermediate surface. When XrRestore is called, the original target surface is restored and the resulting graphics from the intermediate surface are composited onto the original target.

This technique can be applied recursively with any number of levels of intermediate surfaces, each receiving the results of its “child” surfaces before being composited onto its “parent” surface.

Alternatively, images can be constructed from data external to the Xr environment, acquired from image files, external devices or even the window system. Because the image formats used within Xr are exposed to applications, this kind of manipulation is easy and efficient.

3 Implementation

As currently implemented, Xr has good support for all functions described here. The major aspects of the PostScript imaging model that have not been discussed are text/font support, clipping, and color management. Xr does include some level of experimental support for text and clipping already, but these areas need further development.

The Xr system is implemented as 3 major library components: libXr, libXc, and libIc. LibXr provides the user-level API described in detail already.

LibXc is the backend of the Xr system. It provides a uniform, abstract interface to several different low-level graphics systems. Currently, libXc provides support for drawing to the X Window System or to in-memory images. The semantics of the libXc interface are consistent with the X Render Extension so it is used directly whenever available.



LibIc is an image compositing library that is used by libXc when drawing to in-memory images. LibIc can also be used to provide support for a low-level system whose semantics do not match the libXc interface. In this case, libIc is used to draw everything to an in-memory image and then the resulting image is provided to the low-level system. This is the approach libXc uses to draw to an X server that does not support the X Render Extension.

The libIc code is based on the original code for the software fallback in the reference implementation of the X Render Extension. It would be useful to convert any X server using that implementation to instead use libIc.

These three libraries are implemented in approximately 7000 lines of C code.

4 Related Work

Of the many existing graphics systems, several relate directly to this new work.

4.1 PostScript and Display PostScript

As described in the introduction, Xr adopts (and extends) the PostScript rendering model. However, PostScript is not just a rendering model as it includes a complete programming language. Display PostScript embeds a PostScript interpreter inside the window system. Drawing is done by generating PostScript programs and delivering them to the window system.

One obvious benefit of using PostScript everywhere is that printing and display can easily be done with the same rendering code, as long as the printer supports PostScript. A disadvantage is that images are never generated within the application address space, making it more difficult to use where PostScript is not available.


Using the full PostScript language as an intermediate representation means that a significant fraction of the overall application development will be done in this primitive language. In addition, the PostScript portion is executed asynchronously with respect to the remaining code, further complicating development. Integrating the powerful PostScript rendering model into the regular application development language provides a coherent and efficient infrastructure.

4.2 Portable Document Format

PDF provides a very powerful rendering model, but no application interface. Generating PDF directly from an application would require some kind of PDF API along with a PDF interpreter. The goal for Xr is to be able to generate PDF output files while providing a clean application interface.

A secondary goal is to allow PDF interpreters to be implemented on top of Xr. As Xr is missing some of the less important PDF operations, those will need to be emulated within the interpreter. An important feature within Xr is that such emulation be reasonably efficient.

4.3 OpenGL

OpenGL[12] provides an API with much the same flavor as Xr; immediate mode functions with an underlying stateful library. OpenGL doesn't provide the PostScript rendering model, and doesn't purport to support printing or the local generation of images.

As Xr provides an abstract interface atop many graphics architectures, it should be possible to layer Xr on OpenGL.


graphics architectures, it should be possible tolayer Xr on OpenGL.

5 Future Work

The Xr library is in active development. Everything described in this paper is currently working, but much work remains to make the library generally useful for application development.

5.1 Text Support

Much of the current design effort has been focused on the high-level drawing model and some low-level rendering implementation for geometric primitives. This design effort was simplified by the adoption of the PostScript model. PostScript offers a few useful suggestions about handling text, but applications require significantly more information about fonts and layout. The current plan is to require applications to use the FreeType [13] library for font access and the Fontconfig [8] library for font selection and matching. That should leave Xr needing only relatively primitive support for positioning glyphs and will push issues of layout back on the application.
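Neither library is part of Xr, so as a hedged illustration of that division of labor, the following standalone sketch uses Fontconfig to match a family name to a font file and FreeType to open that file, leaving only glyph positioning to the rendering library; the family name "serif" and the missing error handling are placeholders, and nothing here is Xr API.

#include <stdio.h>
#include <fontconfig/fontconfig.h>
#include <ft2build.h>
#include FT_FREETYPE_H

int main (void)
{
    FcPattern *pat, *match;
    FcResult result;
    FcChar8 *file;
    FT_Library lib;
    FT_Face face;

    /* Font selection and matching: Fontconfig */
    FcInit ();
    pat = FcPatternCreate ();
    FcPatternAddString (pat, FC_FAMILY, (const FcChar8 *) "serif");
    FcConfigSubstitute (NULL, pat, FcMatchPattern);
    FcDefaultSubstitute (pat);
    match = FcFontMatch (NULL, pat, &result);
    FcPatternGetString (match, FC_FILE, 0, &file);

    /* Font access: FreeType opens the matched font file */
    FT_Init_FreeType (&lib);
    FT_New_Face (lib, (const char *) file, 0, &face);
    printf ("matched font file: %s (%ld glyphs)\n",
            (const char *) file, face->num_glyphs);

    return 0;
}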

5.2 Printing Backend

Xr is currently able to target the X Window System (with or without the X Render Extension) as well as local images. Still missing is the ability to generate PostScript or PDF output files. Getting this working is important not only so that applications can print, but also because there may be unintended limitations in both the implementation and specification caused by the essential similarity between the two existing backends.

One of the goals of Xr is to have identical output across all output devices. This will require that Xr embed glyph images along with the document output to ensure font matching across all PostScript or PDF interpreters. Embedding TrueType and Type1 fonts in the output file should help solve this problem.

5.3 Color Management

Xr currently supports only the RGB color space. This simplifies many aspects of the library interface and implementation. While it might become necessary to add support for more sophisticated color management, such development will certainly await a compelling need. One simple thing to do in the meantime would be to reinterpret the device-dependent RGB values currently provided as sRGB instead. Using ICC color profiles would permit reasonable color matching across devices while not adding significant burden to the API or implementation.

6 Availability

Xr is free software released under an MIT license from http://xr.xwin.org.

7 Disclaimer

Portions of this effort sponsored by the Defense Advanced Research Projects Agency (DARPA) under agreement number F30602-99-1-0529. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements of the Defense Advanced Research Projects Agency (DARPA) or the U.S. Government.

References

[1] Adobe Systems Incorporated. PostScript Language Reference Manual. Addison Wesley, 1985.

[2] Simon Budig. The linux-pinguin again. http://www.home.unix-ag.org/simon/penguin.

[3] Larry Ewing. Linux 2.0 penguins. http://www.isc.tamu.edu/~lewing/linux.

[4] Leo Guibas, Lyle Ramshaw, and Jorge Stolfi. A kinetic framework for computational geometry. In Proceedings of the IEEE 1983 24th Annual Symposium on the Foundations of Computer Science, pages 100–111. IEEE Computer Society Press, 1983.

[5] Adobe Systems Incorporated, editor. PDF Reference: Version 1.4. Addison-Wesley, 3rd edition, 2001.

[6] Peter Mattis, Spencer Kimball, and the GIMP developers. The GIMP: The GNU image manipulation program. http://www.gimp.org.

[7] Keith Packard. Design and Implementation of the X Rendering Extension. In FREENIX Track, 2001 Usenix Annual Technical Conference, Boston, MA, June 2001. USENIX.

[8] Keith Packard. Font Configuration and Customization for Open Source Systems. In 2002 Gnome User's and Developers European Conference, Seville, Spain, April 2002. Gnome.

[9] Thomas Porter and Tom Duff. Compositing Digital Images. Computer Graphics, 18(3):253–259, July 1984.

[10] Robert W. Scheifler and James Gettys. X Window System. Digital Press, third edition, 1992.

[11] Al Seckel. The Great Book of Optical Illusions. Firefly Books Ltd., 2002.

[12] Mark Segal, Kurt Akeley, and Jon Leach (ed). The OpenGL Graphics System: A Specification. SGI, 1999.

[13] David Turner and The FreeType Development Team. The design of FreeType 2, 2000. http://www.freetype.org/freetype2/docs/design/.

[14] Isao Watanabe. 3-d shape and outline. http://www.let.kumamoto-u.ac.jp/watanabe/Watanabe-E/Illus-E/3D-E/index.html.

A Example Source Code

This appendix contains the source code that was used to draw each figure in Section 2. Each example contains a top-level "draw" function that accepts an XrState pointer, a width, and a height. The examples here can be made into complete programs by adding the code from the example program of Figure 4 and inserting a call to the appropriate "draw" function.
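Figure 4 itself is not reproduced in this appendix, so as a rough, hedged sketch of what such a wrapper might look like, the program below opens an X window and calls one of the "draw" functions on each Expose event. The header name Xr.h and the XrCreate/XrSetTargetDrawable/XrDestroy entry points are assumptions based on the paper's description, not quoted code.

/* Hedged sketch of a driver for the appendix "draw" functions; the Xr
 * entry points named below are assumptions, not the paper's Figure 4. */
#include <X11/Xlib.h>
#include <Xr.h>

#define WIDTH  400
#define HEIGHT 400

/* defined in Figure 11 below */
void draw_hering (XrState *xrs, int width, int height);

int
main (void)
{
    Display *dpy = XOpenDisplay (NULL);
    Window win = XCreateSimpleWindow (dpy, DefaultRootWindow (dpy),
                                      0, 0, WIDTH, HEIGHT, 0, 0,
                                      WhitePixel (dpy, DefaultScreen (dpy)));
    XEvent ev;
    XrState *xrs;

    XSelectInput (dpy, win, ExposureMask);
    XMapWindow (dpy, win);

    for (;;) {
        XNextEvent (dpy, &ev);
        if (ev.type == Expose) {
            xrs = XrCreate ();                    /* assumed entry point */
            XrSetTargetDrawable (xrs, dpy, win);  /* assumed entry point */
            draw_hering (xrs, WIDTH, HEIGHT);     /* or any other draw_* */
            XrDestroy (xrs);                      /* assumed entry point */
        }
    }
    return 0;
}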


void
draw_hering (XrState *xrs,
             int width, int height)
{
#define LINES     32.0
#define MAX_THETA (.80 * M_PI_2)
#define THETA     (2 * MAX_THETA / (LINES-1))

    int i;

    XrSetRGBColor (xrs, 0, 0, 0);
    XrSetLineWidth (xrs, 2.0);

    XrSave (xrs);
    {
        XrTranslate (xrs, width / 2, height / 2);
        XrRotate (xrs, MAX_THETA);

        for (i=0; i < LINES; i++) {
            XrMoveTo (xrs, -2 * width, 0);
            XrLineTo (xrs, 2 * width, 0);
            XrStroke (xrs);
            XrRotate (xrs, - THETA);
        }
    }
    XrRestore (xrs);

    XrSetLineWidth (xrs, 6);
    XrSetRGBColor (xrs, 1, 0, 0);

    XrMoveTo (xrs, width / 4, 0);
    XrRelLineTo (xrs, 0, height);
    XrStroke (xrs);

    XrMoveTo (xrs, 3 * width / 4, 0);
    XrRelLineTo (xrs, 0, height);
    XrStroke (xrs);
}

Figure 11: Source for Hering illusion of Figure 5

void
draw_caps_joins (XrState *xrs,
                 int width, int height)
{
    static double dashes[2] = {10, 20};
    int line_width = height / 12 & (~1);

    XrSetLineWidth (xrs, line_width);
    XrSetRGBColor (xrs, 0, 0, 0);

    XrTranslate (xrs, line_width, line_width);
    width -= 2 * line_width;

    XrSetLineJoin (xrs, XrLineJoinBevel);
    XrSetLineCap (xrs, XrLineCapButt);
    stroke_v_twice (xrs, width, height);

    XrTranslate (xrs, 0, height/4 - line_width);
    XrSetLineJoin (xrs, XrLineJoinMiter);
    XrSetLineCap (xrs, XrLineCapSquare);
    stroke_v_twice (xrs, width, height);

    XrTranslate (xrs, 0, height/4 - line_width);
    XrSetLineJoin (xrs, XrLineJoinRound);
    XrSetLineCap (xrs, XrLineCapRound);
    stroke_v_twice (xrs, width, height);
}

void
stroke_v_twice (XrState *xrs,
                int width, int height)
{
    XrMoveTo (xrs, 0, 0);
    XrRelLineTo (xrs, width/2, height/2);
    XrRelLineTo (xrs, width/2, -height/2);

    XrSave (xrs);
    XrStroke (xrs);
    XrRestore (xrs);

    XrSave (xrs);
    {
        XrSetLineWidth (xrs, 2.0);
        XrSetLineCap (xrs, XrLineCapButt);
        XrSetRGBColor (xrs, 1, 1, 1);
        XrStroke (xrs);
    }
    XrRestore (xrs);

    XrNewPath (xrs);
}

Figure 12: Source for cap and join demonstration of Figure 7


void
draw_spiral (XrState *xrs,
             int width, int height)
{
    int wd = .02 * width;
    int hd = .02 * height;
    int i;

    width -= 2;
    height -= 2;

    XrMoveTo (xrs, width - 1, -hd - 1);
    for (i=0; i < 9; i++) {
        XrRelLineTo (xrs, 0, height - hd*(2*i-1));
        XrRelLineTo (xrs, -(width - wd*(2*i)), 0);
        XrRelLineTo (xrs, 0, -(height - hd*(2*i)));
        XrRelLineTo (xrs, width - wd*(2*i+1), 0);
    }

    XrSetRGBColor (xrs, 0, 0, 1);
    XrStroke (xrs);
}

Figure 13: Source for nested box illusion of Figure 6

void
draw_splines (XrState *xrs,
              int width, int height)
{
    int i;
    double tolerance[5] = {.1, .5, 1, 5, 10};
    double line_width = .08 * width;
    double gap = width / 6;

    XrSetRGBColor (xrs, 0, 0, 0);
    XrSetLineWidth (xrs, line_width);

    XrTranslate (xrs, gap, 0);
    for (i=0; i < 5; i++) {
        XrSetTolerance (xrs, tolerance[i]);
        draw_spline (xrs, height);
        XrTranslate (xrs, gap, 0);
    }
}

void
draw_spline (XrState *xrs, double height)
{
    XrMoveTo (xrs, 0, .1 * height);
    height = .8 * height;
    XrRelCurveTo (xrs,
                  -height/2, height/2,
                  height/2, height/2,
                  0, height);
    XrStroke (xrs);
}

Figure 14: Source for splines drawn with varying tolerance as in Figure 9

void
draw_stars (XrState *xrs,
            int width, int height)
{
    XrSetRGBColor (xrs, 0, 0, 0);

    XrSave (xrs);
    {
        XrTranslate (xrs, 5, height/2.6);
        XrScale (xrs, height, height);
        star_path (xrs);
        XrSetFillRule (xrs, XrFillRuleWinding);
        XrFill (xrs);
    }
    XrRestore (xrs);

    XrSave (xrs);
    {
        XrTranslate (xrs,
                     width - height - 5, height/2.6);
        XrScale (xrs, height, height);
        star_path (xrs);
        XrSetFillRule (xrs, XrFillRuleEvenOdd);
        XrFill (xrs);
    }
    XrRestore (xrs);
}

void
star_path (XrState *xrs)
{
    int i;
    double theta = 4 * M_PI / 5.0;

    XrMoveTo (xrs, 0, 0);
    for (i=0; i < 4; i++) {
        XrRelLineTo (xrs, 1.0, 0);
        XrRotate (xrs, theta);
    }
    XrClosePath (xrs);
}

Figure 15: Source for stars to demonstrate fill rule as in Figure 8


void
draw_gradients (XrState *xrs,
                int img_width, int img_height)
{
    XrSurface *gradient;
    double width, height, pad;

    width = img_width / 4.0;
    pad = (img_width - (3 * width)) / 2.0;
    height = img_height;

    gradient = make_gradient (xrs, width, height);

    XrSetPattern (xrs, gradient);
    draw_flat (xrs, width, height);
    XrTranslate (xrs, width + pad, 0);
    XrSetPattern (xrs, gradient);
    draw_tent (xrs, width, height);
    XrTranslate (xrs, width + pad, 0);
    XrSetPattern (xrs, gradient);
    draw_cylinder (xrs, width, height);

    XrRestore (xrs);

    XrSurfaceDestroy (gradient);
}

XrSurface *
make_gradient (XrState *xrs,
               double width, double height)
{
    XrSurface *g;
    XrMatrix *matrix;

    XrSave (xrs);

    g = XrSurfaceCreateNextTo (XrGetTargetSurface (xrs),
                               XrFormatARGB32, 3, 2);
    XrSetTargetSurface (xrs, g);

    XrSetRGBColor (xrs, 0, 0, 0);
    XrRectangle (xrs, 0, 0, 1, 2);
    XrFill (xrs);

    XrSetRGBColor (xrs, 1, 1, 1);
    XrRectangle (xrs, 1, 0, 1, 2);
    XrFill (xrs);

    XrSetRGBColor (xrs, 0, 0, 0);
    XrRectangle (xrs, 2, 0, 1, 2);
    XrFill (xrs);

    XrRestore (xrs);

    matrix = XrMatrixCreate ();
    XrMatrixScale (matrix,
                   2.0/width, 1.0/height);
    XrSurfaceSetMatrix (g, matrix);
    XrSurfaceSetFilter (g, XrFilterBilinear);
    XrMatrixDestroy (matrix);

    return g;
}

void
draw_flat (XrState *xrs, double w, double h)
{
    double hw = w / 2.0;

    XrRectangle (xrs, 0, hw, w, h - hw);
    XrFill (xrs);
}

void
draw_tent (XrState *xrs, double w, double h)
{
    double hw = w / 2.0;

    XrMoveTo (xrs, 0, hw);
    XrRelLineTo (xrs, hw, -hw);
    XrRelLineTo (xrs, hw, hw);
    XrRelLineTo (xrs, 0, h - hw);
    XrRelLineTo (xrs, -hw, -hw);
    XrRelLineTo (xrs, -hw, hw);
    XrClosePath (xrs);
    XrFill (xrs);
}

void
draw_cylinder (XrState *xrs, double w, double h)
{
    double hw = w / 2.0;

    XrMoveTo (xrs, 0, hw);
    XrRelCurveTo (xrs, 0, -hw,
                  w, -hw, w, 0);
    XrRelLineTo (xrs, 0, h - hw);
    XrRelCurveTo (xrs, 0, -hw,
                  -w, -hw, -w, 0);
    XrClosePath (xrs);
    XrFill (xrs);
}

Figure 16: Source for 3 gradient-filled shapes of Figure 10


relayfs: An Efficient Unified Approach for Transmitting Data from Kernel to User Space

Tom Zanussi, Karim Yaghmour, Robert Wisniewski, Richard Moore, Michel Dagenais

Abstract

Linux has several mechanisms for relaying information about the system and applications to the user. Some examples include printk and other syslog events, evlog, ltt, oprofile, etc. Each subsystem has its own method for relaying information from the kernel to user space. Some of these mechanisms have difficulties, e.g. logging of printk messages is unreliable. In addition to selected difficulties, the replication of code and maintenance is undesirable. In this paper we describe a high-speed data relay filesystem that satisfies the buffering requirements of the above subsystems while providing a unified, efficient, and reliable relay mechanism. relayfs allows subsystems to log data efficiently and safely using lockless technology that is designed to scale well on multiprocessor systems. relayfs includes the flexibility to be expanded should other subsystems need additional services, but has a simple design intended to meet the needs of currently available subsystems. In this paper we discuss the architecture, implementation, and usage of relayfs. relayfs uses channels that allow data to be directed to a suitable buffer or buffers for the subsystems that allocated the channel. We describe the kernel API and file naming conventions, address init-time issues, and discuss performance trade-offs available using relayfs.

Finally we demonstrate how existing subsystems use relayfs to log their data.

1 Introduction

Sharing data between the kernel and user-space applications requires buffering. For relatively small transfers, such as passing variables or small data arrays, the normal system call API is sufficient; the overhead required for safely transferring data across the kernel boundary is acceptable. For larger data transfers, the subsystem generating the data is responsible for providing a buffering and transfer mechanism to deliver the data to user space. With increasing system speed and growing demand for more information regarding the kernel's operation, many of the conventional buffering mechanisms have reached their limits. Further, the buffering and transfer code for each of the subsystems is replicated and needs to be independently maintained.

To address these challenges, we designed and implemented relayfs. relayfs is a unified, reliable, efficient, and simple mechanism for transferring large amounts of data between the kernel and user space. The algorithms used for relayfs have been designed to handle both high frequency and large data applications such as kernel tracing. To satisfy these demanding clients, relayfs is a high-speed, low-impact data relay filesystem.


[Figure: kernel subsystems such as printk and xlog writing to relayfs channels, which appear in user space as /mnt/printk/channel1 and /mnt/xlog/channel1, /mnt/xlog/channel2, /mnt/xlog/channel3]

Figure 1: relayfs architecture

Data logged by a subsystem using relayfs appears under the directory where the filesystem is mounted. For example, if relayfs is mounted under /mnt, printk data may appear in /mnt/printk/data. Although this paper focuses on the use of relayfs for kernel subsystems, user-space clients can also be served by relayfs.

relayfs provides the abstraction of a channel to the subsystem using its services. Channels and files have a 1-to-1 mapping. A given subsystem may log data to multiple channels. For example, a tracing subsystem may log the majority of its data to one channel and create a control channel for less frequent but informative events such as new process creation. printk could, for example, use different channels to log information from devices versus the memory system. Alternatively, a kernel developer could create a separate channel for just the events in newly written (and currently being debugged) code, so as to view only those.

relayfs has a simple design and mechanisms intended to meet the demands of current subsystems while affording flexibility in meeting new requirements of future subsystems. relayfs provides the option of using locking or lockless data logging. The lockless option is very efficient but introduces potential difficulties as discussed later. relayfs provides the choice of using per-processor buffers or a single buffer shared across the machine. It also provides a choice between block or packet delivery of events. The trade-offs of these options are discussed later as well.

In addition to these options, relayfs reduces code replication among subsystems by providing mechanisms common to the subsystems. These include handling overflow issues, providing efficient timestamping on events, providing efficient delivery to user space, and handling dynamic allocation and de-allocation of memory needed to provide the buffering. Tests conducted using relayfs have shown it can handle significant amounts of data with very low overhead.

The rest of the paper is structured as follows. Section 2 describes the channel architecture and the lockless algorithm. Section 3 describes how the code is structured and where to find the implementation files. Section 4 describes the interface to relayfs, the different options available, and how it handles overflows. Section 5 describes the impact and performance of relayfs. Section 6 describes how to modify some example subsystems to use relayfs. Section 7 concludes.

2 Architecture

Figure 1 presents the relayfs architecture. Kernel subsystems create and use different relayfs channels to log their data, while user-space applications see those same channels as files located in the mounted relayfs filesystem.

2.1 Channels

The main building block of relayfs is a channel. To relay data to user-space applications, kernel subsystems allocate channels to transfer their data. Multiple channels can be used to implement any data multiplexing desired, including using one channel per CPU to implement per-CPU buffering.

The buffering implemented by relayfs is transparent to the client subsystem. Writing to the channel requires that the following be specified: channel ID, pointer to the data, and size of data. relayfs does not parse the data being relayed; it just transfers bytes. Because some relayfs clients may implement a particular data protocol (for example, special markings on buffer ends), relayfs provides callback functions for its clients when significant changes in the channel buffers occur. The callbacks are optional.

From user space, channels are accessed as files. relayfs implements the standard operations required for normal file manipulation. For example, to gain access to a relayfs file, applications can perform a read() operation on the file specifying a number of bytes. Alternatively the application can mmap() in the file and reference the contents of the file via a memory pointer.

2.2 Lockless Event Logging

An important feature of relayfs is its ability to write data to buffers without requiring locks. Previous lockless logging schemes [2] used fixed-length events with valid bits. There are several advantages to using variable-length events (assuming the random access problem is solved, see Section 4.2). The algorithm integrated into relayfs allows variable-length events to be logged without locking.

Each process attempts to reserve enough space in the buffer immediately after the current index for the event it wants to log. Once the process makes a successful reservation, it may proceed to log its data. To reserve space, each process attempts to atomically increment the current index using a compare and store. The process that successfully increments the index (as determined by the return value of the compare and store operation) has the right to proceed to log data into the buffer; failing processes retry. Figure 2 shows on the left, in step 1, two processes, A and B, attempting to log events of different lengths after the current index from the initial configuration in step 0. Each process attempts to increment the current index by the size of the event being logged. The process that succeeds, in this case B, will log the event immediately following the old current index (see step 2). This will be followed by process A's data, assuming no other competing processes attempt to log more data (see step 3). Because it is important to guarantee monotonically increasing timestamps, processes must re-determine the timestamp during each attempt to atomically increment the index.


Figure 2: Illustration of Lockless Event Logging

eventReserve(length, *indexPtr, *timestampPtr)
    integer: oldIndex, newIndex
    EvtCtl: *evtCtlPtr

    update evtCtlPtr
    do
        oldIndex = evtCtlPtr->index
        newIndex = oldIndex + length
        if (newIndex >= buffer end)
            eventReserveSlow(length, indexPtr, timestampPtr)
            // generates filler event, sets timestamp, moves to new buffer
            return
        *timestampPtr = getTimestamp()
    while (!CompareAndStore(&(evtCtlPtr->index), oldIndex, newIndex))
    *indexPtr = oldIndex & INDEXMASK   // confine index to buffer bounds

eventLog(majorID, minorID, data)
    integer: index, timestamp, length

    length = length of data
    eventReserve(length, &index, &timestamp)
    evtArray[index] = logEvtHeader(timestamp, length, majorID, minorID)
    evtArray[index+1 ... index+length] = data
    eventCommit(index, length)   // optional, see explanation in text

Figure 3: Pseudo code for lockless event logging

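As a concrete C rendering of the fast path above (a sketch, not the relayfs source), GCC's __sync_val_compare_and_swap stands in for the kernel compare-and-store, and the INDEXMASK value, the slow-path helper, and the timestamp helper are illustrative assumptions:

#include <stdint.h>

#define INDEXMASK 0x1fffff   /* illustrative: 3-bit buf id + 18-bit offset */

extern int event_reserve_slow(uint32_t length, uint32_t *index,
                              uint64_t *timestamp);
extern uint64_t get_timestamp(void);

struct evt_ctl {
    volatile uint32_t index;   /* current write index */
    uint32_t buf_end;          /* end of the memory backing the buffers */
};

static int event_reserve(struct evt_ctl *ctl, uint32_t length,
                         uint32_t *index, uint64_t *timestamp)
{
    uint32_t old_index, new_index;

    do {
        old_index = ctl->index;
        new_index = old_index + length;
        if (new_index >= ctl->buf_end)
            /* slow path: log a filler event and move to the next buffer */
            return event_reserve_slow(length, index, timestamp);
        *timestamp = get_timestamp();
        /* retry if another writer advanced the index underneath us */
    } while (__sync_val_compare_and_swap(&ctl->index, old_index, new_index)
             != old_index);

    *index = old_index & INDEXMASK;   /* confine index to buffer bounds */
    return 0;
}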

The memory used for logging is logically divided into buffers. Once a buffer is full, the logging facility proceeds to the subsequent buffer and the previous buffer is available to be written out. The pseudo code appears in Figure 3 and complete C code can be obtained by downloading relayfs from the relayfs web site [1].

Despite having good performance, complications can arise from using the lockless algorithm. A process's execution may be interrupted after it has reserved space to log an event, but before it actually performs the log. The interruption can occur because the process is preempted, blocks for a long time, or is killed. Depending on where (in the sequence of code in Figure 3) the process is interrupted, different problems occur. If the process has had a chance to write the event header, but not the data, then only the data will be unrecoverable. If, however, the process has not yet logged the event header then it is possible the rest of the buffer will be uninterpretable. Only by locking, making the kernel perform the log, and disabling interrupts can this problem be prevented (in practice there are low-level kernel events that would still exhibit the problem). There are methods that avoid these difficulties.

If the reason the process's execution was interrupted was due to preemption, then it is likely the process will run again soon and finish filling in the event before another entity notices, thus posing no real problem. If the reason for the process's interruption was because the process was killed, then the data will never finish being logged. The last line of pseudo-code detects this situation. The eventCommit function updates a per-buffer count of the amount of data that has been logged to that buffer. The count is zeroed during the start new buffer code. When the code responsible for writing the data (to a network stream, file, etc.) writes this buffer, it can compare the amount of data logged to this buffer with the buffer's size and report an anomaly if they do not match. If the reason the process was interrupted was because of a long blocking operation, it is possible that both the current buffer will not have enough data logged, and that the same buffer, when reused in the future, will have too much (because the long-blocked process was unblocked and logged data into a recycled buffer). Again the per-buffer counts can detect this situation.

Besides a per-buffer count there are other possible ways to detect or minimize the occurrence of corrupted data. For example, it is possible to use a flag in a per-process data structure to indicate to the kernel that a process should not be killed while the flag is set. Other possibilities include zero-filling a buffer before use, or keeping a side array of valid bits for the header data. In practice the probability of corrupting a buffer, and the ease with which tools can handle the situation, reduces the issue's importance except perhaps in setups where recreating the situation generating the logged data is very difficult. In those cases the locking version of the event logging may be the best option.

3 Implementation

The relayfs code is structured as follows: the public API and common relay code are contained in fs/relayfs/relay.c, with the scheme-specific code in fs/relayfs/relay_lockless.c and fs/relayfs/relay_locking.c. The file fs/relayfs/inode.c implements the VFS layer on top of the relay channel code.

relayfs can be compiled either directly into the kernel or as a kernel module.

4 Interface and Use

relayfs is used both as a temporary repository for logged data and as a filesystem from which user-space clients may retrieve logged data. The first of these is supported by a set of kernel-space APIs. For the second, relayfs is mounted as a filesystem:

mount -t relayfs relayfs /mnt/relay

Kernel subsystems (also referred to as kernel clients) create and write to channels via the kernel API described below. The contents of these channels are available to user-space programs via a standard file abstraction that can be read using mmap() or read(). relayfs provides automatic support for locking or lockless logging, overflow handling, data delivery, and timestamping.

4.1 Interface

This section describes the basic usage of the kernel and user APIs. Complete details can be found in Documentation/filesystems/relayfs.txt. To initiate data logging, a kernel client creates a channel relative to the mountpoint of the relayfs filesystem via relay_open():

int channel_id = relay_open("file", ...);

This would cause the creation of a relayfs file named /mnt/relay/file, assuming relayfs was mounted at /mnt/relay.

relayfs does not impose any namespace conventions; clients may choose names as they wish. We recommend the adoption of the convention whereby a client specifies a top-level directory name that is closely associated with the corresponding subsystem. For example, the Linux Trace Toolkit would manage the namespace under /mnt/relay/trace, printk would use /mnt/relay/printk, driver debugging channels might use /mnt/relay/debug/drivers/mydriver, etc.

A kernel client can then log a variable-length data item to the channel via relay_write(), given the channel id.

relay_write(channel_id, data, count, ...);

In user space, a program can open the relayfs file and wait in a read() loop for data.

fd = open("/mnt/relay/file", ...);

while (1) {
    n = read(fd, buf, sizeof(buf));
    if (n <= 0) {
        close(fd);
        break;
    }
}

Alternatively, the file can be mmap()'ed and directly accessed via a pointer to the mapped buffer when data is ready.

fd = open("/mnt/relay/file", ...);
char *map = mmap(..., fd, ...);

void on_ready(count) {
    write(diskfile, map, count);
}

There are five callbacks that can optionally be registered by the kernel client when a channel is opened. These are used to notify the client when significant events occur (buffer start, buffer end, event delivery, buffers full, buffer resize) and are described below.

4.2 Channel and data management schemes

The channel data for a given channel is internally managed via one of two schemes defined at channel creation. They are lockless or locking. The lockless scheme is described in Section 2. The locking scheme is a simple two-buffer ping-pong scheme. One of the buffers is the current write buffer, into which events are written, and the other is the current read buffer, from which events are read (by, for instance, a user daemon). When the current write buffer is filled, it becomes the current read buffer and the current read buffer becomes the new write buffer. The key feature of the locking scheme is that the channel is locked (the exact semantics are described below) while space is allocated for the event and for the duration of the write.

The reason two schemes exist is that while it would be ideal to have all channels managed by the lockless scheme, the availability of the lockless scheme depends on the availability of a cmpxchg instruction or a generic equivalent, which does not exist on all Linux platforms. Thus, the locking scheme is available as a fallback scheme for those platforms that cannot support the lockless scheme.

relayfs channels are implemented as circular buffers divided into a number of sub-buffers. If a scheme has not been explicitly specified, the channel creation code will choose lockless. The number and size of these sub-buffers are specified at channel creation (or with certain restrictions can be dynamically sized afterward). For efficiency reasons, in the lockless code, both of these are a power of 2. Having each buffer size be a power of 2 allows a cheap logical AND comparison to determine if the end of a buffer has been reached. Having the number of buffers be a power of 2 allows a cheaper operation to be performed to ensure the index remains within the memory allocated for the buffers. Both of these operations occur on the fast path of every event log and thus are critical. The index value could be interpreted as a straight index into a single circular buffer, but for other reasons multiple buffers are preferable. For example, breaking up the buffers into multiple sub-buffers rather than a single large buffer allows flexibility in buffer processing, allows more timely delivery of events to user space, and minimizes the impact of potential data corruption.
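As an illustration of that fast-path arithmetic, the following sketch uses the 8 sub-buffers of 256K that Figure 4 later uses as its example; the macro names are ours, not the relayfs source.

#define SUB_BUF_SIZE   (256 * 1024)                 /* power of 2 */
#define N_SUB_BUFS     8                            /* power of 2 */
#define BUF_ALLOC_SIZE (SUB_BUF_SIZE * N_SUB_BUFS)  /* 2M total   */

/* Offset within the current sub-buffer: one logical AND. */
#define SUB_BUF_OFFSET(idx)  ((idx) & (SUB_BUF_SIZE - 1))

/* End-of-sub-buffer test: the offset wraps to 0 exactly at a boundary. */
#define AT_SUB_BUF_END(idx)  (SUB_BUF_OFFSET(idx) == 0)

/* Keep the index inside the memory allocated for all sub-buffers. */
#define BUF_INDEX(idx)       ((idx) & (BUF_ALLOC_SIZE - 1))

/* Which sub-buffer the index falls in (the "buf id" of Figure 4). */
#define SUB_BUF_ID(idx)      (BUF_INDEX(idx) / SUB_BUF_SIZE)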

One aspect of using the lockless scheme is the use of variable-length events. There are trade-offs between using fixed-length or variable-length events. Fixed-size events allow for simpler logging and reading out as the consumer of events always knows the starting point of an event. This allows validity bits to be used, and allows invalid events to be skipped. Fixed-length events allow easy random access to the data stream, aiding reading and displaying large files. The disadvantages of fixed-length events are that they waste space, they take longer to write (to disk or network) because extra data needs to be written for short events, and they make it complicated to log data that is larger than the fixed size. The lockless scheme obtains the benefits of each by ensuring that events never cross medium-scale alignment boundaries. We insert filler events as necessary to align the event stream. Data analysis tools can skip to any of the alignment points in a large event buffer and can begin interpreting events from that point.

[Figure: a 2M channel (alloc_size 2M) divided into 8 sub-buffers of 256K; the index splits into a 3-bit buf id and an 18-bit offset; rchan_info shows n_bufs 8, buf_size 256K, bufs_produced 38, bufs_consumed 35, a buffer_complete[] bitmap, and reserved/uncommitted space in the current sub-buffer]

Figure 4: Buffer layout.

This technique provides the advantages of variable-length events and still allows fast access to all parts of a large file. A filler event is a header with a length equal to the remainder of the current buffer; no data need be logged. For the clients we have studied this alignment wastes little space. Other applications that have few large events and whose events frequently end on buffer boundaries will exhibit similar behavior. This technique does not provide completely random access, but is a close enough approximation that it allows post-processing tools to make it appear to the end user that the stream is completely random access.

Figure 4 shows a 2M buffer divided into 8 sub-buffers of 256K (in this case, 256K would be the alignment boundary described above). This splits the index logically into a 3-bit portion containing the buffer id and 18 bits containing the current offset within the buffer. The rchan_info struct, retrieved via relay_info(), contains a snapshot of the current state of the channel. The data in Figure 4 shows that 35 sub-buffers have been consumed and 38 have been produced. The buffer_complete array shows the completeness state of all of the sub-buffers not including the current one. As Figure 4 shows, sub-buffer 3 is not yet complete as there is still a pending write. This is also the reason buffers_consumed has not caught up with buffers_produced. The user-space client suspends processing the sub-buffers when it encounters the incomplete buffer until such time as the buffer becomes complete, or is "lapped", in which case the buffer contents would be lost.
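As a rough sketch of the kind of snapshot relay_info() returns, the structure below names only the fields mentioned in the text and figure; the actual layout in the relayfs source may differ.

struct rchan_info_sketch {
    unsigned int n_bufs;              /* sub-buffers in the channel (8)      */
    unsigned int buf_size;            /* bytes per sub-buffer (256K)         */
    unsigned int alloc_size;          /* total buffer memory (2M)            */
    unsigned int bufs_produced;       /* filled so far (38 in Figure 4)      */
    unsigned int bufs_consumed;       /* read by the client (35 in Figure 4) */
    unsigned char buffer_complete[8]; /* per-sub-buffer completeness flags   */
    unsigned int events_lost;         /* dropped while suspended (Sec. 4.5)  */
};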

4.3 Buffer start/end callbacks

Two of the five channel callbacks registered by the kernel client exist to allow the client to be notified when sub-buffer boundaries are crossed, i.e. when buffer switches occur. These callbacks are invoked in the slow path of event logging, which is executed when an event write would overflow the current sub-buffer. The buffer_end() callback is invoked to allow the kernel subsystem the opportunity to perform end-of-buffer processing on the just-filled buffer. To allow space for data even when a buffer is exactly filled, there needs to be space reserved at the end of the buffer into which the client can write an unused count. This is the purpose of the end_reserve parameter to relay_open. It specifies the number of bytes at the end of each sub-buffer that should be left alone by the logging algorithm and left available for the client to write data. Similarly, the buffer_start() callback is invoked to give the client an opportunity to write header data at the beginning of a sub-buffer. The start_reserve parameter to relay_open() allows the client to specify a value for this purpose. One of the parameters of the buffer_start() callback is the buffer id, which has a value of zero for the very first buffer in the channel. Clients can check for this value and optionally write channel header data in the position reserved for it by the channel_start_reserve parameter of relay_open().

4.4 Channel attributes

The characteristics of a given channel are derived from a set of channel attributes specified when the channel is opened. These are:

• scheme: lockless or locking. Indicates which of the logging algorithms to use. If 'any' is specified, relayfs attempts to use the lockless scheme; if unavailable it reverts to the locking scheme.

• SMP usage: global or SMP. Applies only if the locking scheme is being used. Global indicates that the channel is being shared across multiple processors. In this case lock acquisition involves a spinlock_irqsave/restore. The SMP option indicates per-processor channels and thus lock acquisition can use the cheaper local_irqsave/restore.

• delivery mode: bulk or packet. If bulk delivery mode is specified, the kernel client is notified via the delivery callback when complete buffers become available. If packet delivery mode is specified, the delivery callback is invoked after each write. Bulk delivery is suited for clients logging large volumes of data. They would be noticeably affected by callbacks after each event, but still may be interested in the beginnings and ends of buffers. Less demanding clients may require notification for each event logged. Clients specify callbacks to, for example, signal a user-level program indicating data is ready. This is a typical mode of operation for a bulk data client and is somewhat less typical for a packet client as its user-space program is often polling for the next piece of data. Either type of client may be interested in filtering the data and/or dispatching events to other channels or subsystems. Typically, packet clients would be more interested in data filtering due to the more manageable volume of events.

• timestamping: TSC or gettimeofday() deltas. This attribute allows the client to choose the granularity and cost of timestamping. Timestamping is optional and timestamps are not written to a channel unless explicitly requested. Events are timestamped by relay_write() using either the efficient TSC (or equivalent cheap clock on other architectures), or a slower but globally consistent gettimeofday() time delta method. gettimeofday() is also a fallback in the cases where the TSC, or an equivalent clock counter, is unavailable on a given architecture. In brief, part of the task of writing an event involves obtaining the current TSC, or the current gettimeofday() value. If the gettimeofday() option is chosen each timestamp is the difference between the current time and the time when the buffer was started. The value logged is either the TSC value or this difference and is written at the offset within the event slot specified by the td_offset parameter of relay_write() (if the parameter value is negative, the timestamp is not written). The start and end buffer gettimeofday() and TSC (if applicable) values are available as parameters to the buffer_start() and buffer_end() kernel client callbacks. If a client is interested in timestamping, it can write these values into the reserved space for later inter-buffer correlation.

4.5 Channel overflow handling

As with any buffering scheme, data may be written into a channel faster than the channel's clients can read it out. relayfs channel clients have three options for dealing with an overflow:

• do nothing, writers overwrite old data (flight recorder mode)

• suspend writing into the channel, causing loss of new events

• resize the channel, making more space for writers

The first two options are controlled by the return value of the buffers_full() callback. This callback provides the client with the option of what action to take if the consumer has not kept pace with the logging. A value of 0 directs relayfs to continue logging events, overwriting the oldest data. A value of 1 directs relayfs to discard subsequent events until the overflow situation has been resolved. In this case, an events_lost count is kept and is available via relay_info(). Once the consumer (usually the user-space daemon reading from the channel) has caught up, the relayfs client can call relay_resume() to allow the channel to continue logging events. To implement this, the client needs to keep the channel informed of how many buffers it has read. It increments the count of buffers consumed by calling relay_buffers_consumed(n_buffers). This value is compared with the channel's count of buffers produced (tracked on the buffer-switch slow path) to determine whether a buffers-full condition exists. If the difference is greater than or equal to the number of sub-buffers in the channel, the buffers are considered full and the callback is invoked.
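As a hedged sketch of that bookkeeping (the structure and helper names are ours, not the relayfs source; only the produced/consumed comparison and the 0/1 callback semantics follow the text):

struct chan_counts {
    unsigned int n_bufs;          /* sub-buffers in the channel        */
    unsigned int bufs_produced;   /* advanced on the buffer-switch path */
    unsigned int bufs_consumed;   /* advanced via relay_buffers_consumed */
};

/* Called on the buffer-switch slow path; returns the client's decision:
 * 0 = keep logging and overwrite the oldest data, 1 = drop new events. */
static int check_buffers_full(struct chan_counts *c,
                              int (*buffers_full_cb)(void))
{
    if (c->bufs_produced - c->bufs_consumed >= c->n_bufs)
        return buffers_full_cb();
    return 0;
}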

The third option, resizing the channel, is available to clients that have specified non-zero values for the resize_min and resize_max parameters to relay_open() when the channel was created. If (during the buffer-switch slow path) relayfs detects that the channel is almost full (if 3/4 of the sub-buffers remain unread, by default), the needs_resize() callback is invoked with parameter values containing a suggested new sub-buffer size and/or sub-buffer count, which can be used to expand the buffer space. The client can use these values (or ignore them and supply its own) to allocate a new buffer for the channel via the API function relay_resize_channel(). This function can block, so it should not be called with spinlocks held. If called from user context, it directly allocates the new buffer, which is available upon return. If called from within interrupt context, the allocation is put onto a work queue, and the client is notified upon completion via another call to the needs_resize() callback. Once the new buffer is allocated, the client can call relay_replace_buffer() to replace the channel's buffer. This function can be called from any context. Clients call it when they can guarantee the replacement does not interfere with other channel activity, such as outstanding writes. Reducing the buffer size follows a similar path. When relayfs detects that the "almost-full" condition has not existed for a period of time (1 minute by default), the needs_resize() callback is invoked with the new suggested (smaller) sub-buffer values. Buffer reduction is handled by the client in a similar manner to buffer expansion. Clients can choose to ignore the details of buffer resizing. To do so, they specify a non-zero value for the autoresize parameter to relay_open(), causing buffer re-allocation requests to be placed onto a work queue. The resizing strategy reflects empirical observations that channel traffic tends to be bursty in nature, with sudden activity creating immediate short-term need for increased buffer capacity, which after a short period is no longer needed.

Config.   avg time per run (in seconds)   difference
1         756.1
2         766.7                           +1.40%
3         771.3                           +2.01%

Figure 5: LTT on relayfs test results.

5 Testing

To test the efficiency of relayfs, we ran LTT, a very demanding client of relayfs, while performing 10 kernel compiles under each of the following conditions:

1. not tracing anything (baseline)

2. tracing everything, daemon not writing to disk

3. tracing everything, daemon writing to disk

Testing was performed on a 4-way 700MHz Pentium III system. The tests generated large amounts of data for relayfs to process. Approximately 200 million events comprising about 2 gigabytes were generated during each 10-compile run. As can be seen from the results in Figure 5, the overhead of relayfs was at most 1.4 percent (a portion of this overhead is in fact due to tracing code), demonstrating the efficiency of relayfs.

6 Example client subsystems

In this section we examine some practical uses of relayfs. In particular, we discuss how relayfs can be used as an engine for printk and LTT, how it can be used for driver debugging, and how it could become a replacement for rvmalloc and rvfree.


6.1 printk

We have developed a version of printk that replaces the static printk buffer with a dynamically resizeable relayfs channel. This solves the lost printk problem by providing reliable printk logging services. A dynamically resizeable channel prevents lost printk messages in normal usage, but it can also prevent the loss of init-time printk messages. At init-time (before free_initmem()), the contents of the static printk buffer are copied into the relayfs channel created for printk. The static printk buffer used at init-time is marked as __initdata and is subsequently discarded. A benefit of this scheme is that the printk kernel buffer can be made relatively large. In fact, it can be large enough so that it does not overflow even when copious boot messages are printed on a large system. The relayfs version of printk modifies the syslog(3) system call to read from the printk channel instead of from the static kernel buffer. Since /proc/kmsg also uses syslog(3) to retrieve data from the kernel buffer, user-space programs that read from the printk buffer do not need to be modified to use the relayfs version of printk. The relayfs version of printk adds commands to syslog(3) allowing user-mode clients to manually resize the printk channel; these of course must be coded for. See the relayfs web page [1] for code and status.

6.2 Linux Trace Toolkit

LTT creates a separate bulk-delivery channel for each processor. The LTT kernel client replicates the inner workings of relay_write by using the relayfs low-level API described briefly in Documentation/filesystems/relayfs.txt. More detail can be found by examining trace() in kernel/trace.c and the relay_write() implementation. It uses the low-level API to obtain maximum performance from relayfs, because system tracing is one of the most demanding clients. We would expect most subsystems to function acceptably using the described APIs, but the low-level API is available for those requiring maximum performance. The low-level API allows a client to write directly into the reserved slot in the channel, rather than passing the address of a buffer to relay_write for it to copy. By doing so, LTT avoids the overhead of forcing trace() to collate its fields into a separate buffer before passing it to relay_write, and avoids putting the event data on the stack (in reality this would not be possible because some events can be 8K in length). This is more convenient than having trace() manage a staging buffer between potentially multiple writers.

rchan = rchan_get(channel_handle);
if (rchan == NULL)
        return -ENODEV;

/* this is a nop for lockless */
relay_lock_channel(rchan, flags);

reserved = relay_reserve(rchan,
                         data_size, &time_stamp,
                         &time_delta, &reserve_code,
                         &interrupting);
if (reserved == NULL)
        goto check_buffer_switch_signal;

bytes_written += relay_write_direct(reserved,
                                    &event_id, sizeof(event_id));
bytes_written += relay_write_direct(reserved,
                                    &data, sizeof(data));

relay_commit(rchan, reserved,
             bytes_written, reserve_code,
             interrupting);

relay_unlock_channel(rchan, flags);
rchan_put(rchan);

In the above code, the first few tasks are bookkeeping tasks. These involve getting the channel structure backing the channel handle, and locking the channel (if applicable; for the lockless scheme, this is effectively a no-op). We then reserve a slot, and using the special relay_write_direct() low-level method (essentially a memcpy), write fields directly into the channel buffer. When we are done writing the fields, we indicate to the channel that the data is ready by 'committing' the slot, and then finish up with a couple more bookkeeping tasks.

See the relayfs web page [1] for code and status.

6.3 Driver debugging

It is straightforward to create and use a relayfs channel for driver tracing/debugging. Open a channel with the desired attributes:

channel = relay_open("channel", ...);

and write to it from within your driver:

relay_write(channel, ...)

Then write a simple user-space program that loops around read():

fd = open("/mnt/relay/channel", ...);

for (;;)
    read(fd, ...);

6.4 rvmalloc/rvfree replacement

At last count there were 9 drivers with separate implementations of rvmalloc/rvfree. There have been discussions in the past of exporting a single 'blessed' instance of rvmalloc/rvfree, but so far that has not happened. Since relayfs channels are based on rvmalloc/rvfree, relayfs provides the equivalent of a public rvmalloc/rvfree by providing clients a means to create rvmalloc'ed buffers using degenerate values to relay_open(), e.g. a large frame buffer could be allocated via relay_open with most parameters 0/NULL:

int channel_id =
    relay_open("framebuffer",
               bufsize, /* size of sub-buffer */
               1,       /* one sub-buffer only */
               0,       /* no flags in this case */
               NULL,    /* no callbacks */
               0,       /* no reserve */
               0,       /* no reserve */
               0,       /* no reserve */
               0,       /* no resize_min */
               0,       /* no resize_max */
               0,       /* don't autoresize */
               NULL);   /* no special file ops */

The above opens a relay channel with a filename of /mnt/relay/framebuffer, assuming relayfs is mounted at /mnt/relay.

Information about the channel can be retrieved using relay_info().

relay_info(channel_id, &rchan_info);

The data contained in rchan_info includes the virtual address of the buffer allocated for the channel via rvmalloc. This can be directly used to write into the buffer memory from the kernel side.

The file created for the channel can then subsequently be mmap()'ed into user space:

int framebuf_file =
    open("/mnt/relay/framebuffer", ...);
char *framebuf = mmap(NULL, bufsize,
                      PROT_READ|PROT_WRITE, MAP_PRIVATE,
                      framebuf_file, 0);

Note that if a channel is mmap()'ed, it cannot be resized, but a client wishing to resize the channel can unmap the file, resize it, then remap it.

Finally, closing the channel frees the buffer via rvfree():

relay_close(channel_id);

7 Conclusion

We have presented relayfs, an efficient and unified mechanism for transferring data from kernel to user space. relayfs addresses the need to have a reliable mechanism that can be used across various kernel subsystems, without having to replicate and separately maintain the functionality for each subsystem. The lockless and per-processor techniques allow efficient logging on multiprocessor systems, and the channels provide flexibility to the clients. Test results show that relayfs performs well even under demanding workloads. We presented several example kernel subsystems that use relayfs including printk, LTT, and driver debugging. As relayfs becomes more widely accepted, other kernel subsystems will be able to use it, reducing code replication and improving reliability of the kernel subsystems.

References

[1] relayfs home page, http://www.opersys.com/relayfs.

[2] Robert W. Wisniewski and Luis F. Stevens. A model and tools for supporting parallel real-time applications in unix environments. In Proceedings of The 12th IEEE Real-Time Technology and Applications Symposium, pages 126–133, Chicago, Illinois, May 15–17, 1995.


Linux IPv6 Networking: Past, Present, and Future

Hideaki Yoshifuji, The University of Tokyo

Kazunori Miyazawa, Yokogawa Electric Corporation

Yuji Sekiya, The University of Tokyo

Hiroshi Esaki, The University of Tokyo

Jun Murai, Keio University

Abstract

In order to deploy a high-quality IPv6 protocol stack, we, the USAGI Project [13], have analyzed and addressed issues in the Linux IPv6 implementation. In this paper and in our talk at OLS2003, we describe our analysis of the Linux IPv6 protocol stack, improvements and implementation of the IPv6 and IPsec protocol stacks, and patches which are integrated into the mainline kernel. We will explain the impact of our API improvements on network applications.

We want to discuss missing pieces and directions for future development.

As a demonstration we would like to provide IPv6 network connectivity to the OLS2003 meeting venue.

1 Introduction

Establishment of IPv6, the next-generation Internet protocol succeeding IPv4, started at the beginning of the 1990's. The focus of IPv6 is on providing a solution to protocol scalability, the greatest problem IPv4 faces as the Internet grows larger. In detail, IPv6 differs from IPv4 in the following ways.

• 128-bit address space.

• Forbidding of packet fragmentation in intermediate routers.

• Flexible feature extension using extension headers.

• Supporting security features by default.

• Supporting Plug & Play features by default.

Currently, IPv6 is in the final phase of standardization. Fundamental specifications are almost fixed, commercial products with IPv6 support are being deployed in the market, and international leased lines for IPv6 are available as well. IPv6 has expanded the existing Internet by providing solutions to protocol scalability and is beginning to grow as a standard for connecting everything, not just existing computers.

2 The Dawn of Linux IPv6 Stack

The Linux IPv6 implementation was originally developed by Pedro Roque and integrated into the mainline kernel at the end of 1996, in the early 2.1 days, and this was one of the earliest implementations of an IPv6 stack in the world.


In 1998, Linux IPv6 Users JP, which is a group of Linux IPv6 users in Japan, examined the status of the IPv6 implementation in Linux and recognized several grave issues.1

• lack of scope in the socket API; for example, Linux does not have the sin6_scope_id member in sockaddr_in6{}.

• So many bugs (Table 1), found by the TAHI [11] IPv6 Conformance Test Suite, especially in Neighbor Discovery and Stateless Address Auto-configuration.

• "default routes" are ignored on routers

• many missing features such as IPsec and Mobile IP.

These problems arose because the stack had not been well maintained or developed since 2.1, as it was not widely used by Linux hackers. Thus, there had been few new features. The implementation had not followed the specification even as the spec changed, and conformance to the specification became very low.

In the 2.3 days, the Linux IPv6 Users JP developed sin6_scope_id support. Their code was integrated into the mainline kernel. There were, however, very few changes in the 2.3 days other than this.

Considering the above circumstances, the USAGI Project was launched in October 2000. The USAGI Project aims to provide an improved IPv6 stack on Linux; it seemed that an (almost) full-time task force committed to Linux IPv6 development was required. There is a similar organization called KAME [6], which provides an IPv6 stack on BSD operating systems such as FreeBSD, NetBSD, OpenBSD, and BSD/OS. However, the KAME Project does not target its development on Linux. It is important to provide a high-quality IPv6 stack on Linux, which is one of the most popular free open-source operating systems in the world and widely used in embedded systems, for IPv6 to propagate.

1We will discuss them later.

Table 1: Summary of TAHI Conformance Test (linux-2.2.15, %)

Test Series          Pass  Warn  Fail
Spec.                  94     6     0
ICMPv6                100     0     0
Neighbor Discovery     34     0    66
Autoconf                4     0    96
PMTU                   50     0    50
IPv6/IPv4 Tunnel      100     0     0
Robustness            100     0     0

3 USAGI Challenges in Linux IPv6 Stack

Since the USAGI Project started, we have continued analyzing the issues we faced. In this section, we describe the issues we found and our challenges in solving them.

3.1 ND and Addrconf

Neighbor Discovery (ND [8]) and Stateless Address Auto-configuration (Addrconf [12]) are among the core features of IPv6. They play a very important role in keeping communication stable. However, the results of the conformance tests of the Linux IPv6 stack were bad.

We've tried to fix the problems in the following ways.

• Reinforcing checking illegal ND Mes-sages

• Improving control times for ND state tran-sition and address validation.

Page 509: Proceedings of the Linux Symposium · Proceedings of the Linux Symposium July 23th–26th, 2003 Ottawa, Ontario Canada. Contents ... making the Linux kernel run in user space has

Linux Symposium 509

������������

������������

������������

������������

����� ���������

��� ��!#"������������

������������

������������

������������

����� ���������

��� ��!#"

����� ����� �

������������� �

������������� �

������������� �

������������� �

��� �"!$#&%

��� �"!$#('

��� �"!$#*)

��� �"!$#(+

����� ����� �

������������� �

������������� �

������������� �

������������� �

��� �"!$#&%

��� �"!$#('

��� �"!$#*)

��� �"!$#(+

Figure 1: NDP Table: Linux vs USAGI

• Fixing ND state transition

3.1.1 Improving Timers for ND

The state of a neighbor is changed by events such as an incoming Neighbor Advertisement message or timer expiration. It is required to manage timers accurately.

However, the existing Linux IPv6 protocol stack checks reachability of neighbor nodes with a single kernel timer (Figure 1, left). Consequently, reachability was checked at constant intervals, regardless of the status of each node.

Therefore, the USAGI Project improved this so that each NDP entry is checked by its own kernel timer, as shown in Figure 1 (right). Resource management for a neighbor, including mutual exclusion, is thus simplified; timers can be enabled and disabled separately for each NDP entry, and checks on NDP entries that do not need them are avoided. Moreover, it is possible to exchange messages appropriate to the status of each NDP entry, as defined in the NDP specifications.
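The difference can be sketched roughly as follows (an illustrative sketch only; the structure and function names, such as ndp_entry and ndp_periodic_check, are our own and not the actual kernel identifiers):

#include <linux/timer.h>
#include <linux/jiffies.h>

/* Old model: one global timer rescans every entry at a fixed interval. */
static struct timer_list ndp_gc_timer;

static void ndp_periodic_check(unsigned long data)
{
        /* walk *all* NDP entries here, whatever state they are in,
           then unconditionally re-arm the single timer */
        mod_timer(&ndp_gc_timer, jiffies + HZ);
}

/* USAGI model: every entry carries its own timer. */
struct ndp_entry {
        unsigned long     state;        /* REACHABLE, STALE, PROBE, ... */
        struct timer_list timer;        /* fires only for this entry    */
};

static void ndp_entry_timeout(unsigned long data)
{
        struct ndp_entry *e = (struct ndp_entry *)data;

        /* handle exactly this entry's state transition, and re-arm
           e->timer only if this entry still needs a future check */
}

With a per-entry timer, an entry in a stable state can stay quiet until its own reachability actually expires, instead of being revisited on every pass of a global scan.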

Figure 2: Dynamic Address Validation Timer

3.1.2 Improving Timers for Address Validation

As with ND, the state of an address is changed by events such as an incoming Router Advertisement or a timer expiration. Timers must be managed accurately, especially for Privacy Extensions [7].

However, the existing Linux IPv6 protocol stack performed validity checks with a single kernel timer running at a long, constant interval.

The USAGI Project introduced a new dynamic timer. When the timer expires, the timeout function visits each address for validation and determines the next timeout (Figure 2), subject to minimum and maximum intervals between timeouts. Thus, timer accuracy is improved. Since several addresses usually request their next timeout at (almost) the same time, introducing a minimum interval between operations aggregates them and suppresses the load of timer events.
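A rough sketch of this scheme (illustrative only; the list, field, and constant names below are assumptions, not the real addrconf code):

#include <linux/timer.h>
#include <linux/list.h>
#include <linux/jiffies.h>

#define ADDR_CHK_MIN_GAP  (HZ / 2)      /* aggregate near-simultaneous events */
#define ADDR_CHK_MAX_GAP  (600 * HZ)    /* never sleep longer than this       */

struct addr_entry {
        struct list_head list;
        unsigned long    expires;       /* when this address next needs work */
};

static LIST_HEAD(addr_list);
static struct timer_list addr_chk_timer;

static void addr_chk_timeout(unsigned long data)
{
        unsigned long now = jiffies;
        unsigned long next = now + ADDR_CHK_MAX_GAP;
        struct addr_entry *a;

        list_for_each_entry(a, &addr_list, list) {
                if (time_after_eq(now, a->expires)) {
                        /* validate / deprecate / remove this address */
                } else if (time_before(a->expires, next)) {
                        next = a->expires;      /* earliest future event wins */
                }
        }

        /* Re-arm no sooner than the minimum gap, so a burst of addresses
           expiring together is handled in a single timer run. */
        if (time_before(next, now + ADDR_CHK_MIN_GAP))
                next = now + ADDR_CHK_MIN_GAP;
        mod_timer(&addr_chk_timer, next);
}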

Table 2 shows the results of these improvements and other minor fixes. Neighbor Discovery and Autoconf are significantly improved.

3.2 Routing Restructuring

3.2.1 Default Route Support on Routers

In a routing table using a radix tree [9], the top of the tree is the node which holds the information regarding the “default route.” However, as shown in Figure 3, the Linux IPv6


Table 2: Summary of TAHI Conformance Test (usagi24-s20020401, %)

Test Series          Pass  Warn  Fail
Spec.                 100     0     0
ICMPv6                100     0     0
Neighbor Discovery     79     5    15
Autoconf               98     2     0
PMTU                   50     0    50
IPv6/IPv4 Tunnel      100     0     0
Robustness            100     0     0

Figure 3: Linux IPv6 Routing Table Structure

protocol stack has a radix tree with a fixed node on top, which points to ipv6_null_entry. Therefore, when a default route is added, its information is attached next to the rt6_info{} structure which contains ipv6_null_entry. This causes the default route never to be referenced.

In the USAGI implementation, we replace ipv6_null_entry with the new entry when adding a routing entry at the top-level root of the tree (Figure 4). When the last route is deleted from the top-level root of the tree, we re-insert ipv6_null_entry. Thus, we can insert and remove “default route” entries properly to/from the routing table.
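The idea can be illustrated with deliberately simplified structures (the real fib6_node and rt6_info types are laid out differently; the names below are ours):

struct rt_entry  { struct rt_entry *next; /* ... route data ... */ };
struct tree_root { struct rt_entry *leaf; };

extern struct rt_entry null_entry;       /* stands in for ipv6_null_entry */

/* Adding a default route: the placeholder must be *replaced*,
   not chained behind it, or the placeholder keeps matching first
   (which is essentially the old bug). */
static void add_default(struct tree_root *root, struct rt_entry *rt)
{
        if (root->leaf == &null_entry)
                rt->next = NULL;
        else
                rt->next = root->leaf;
        root->leaf = rt;
}

/* Deleting a default route: when the last one goes away,
   put the placeholder back. */
static void del_default(struct tree_root *root, struct rt_entry *rt)
{
        struct rt_entry **pp;

        for (pp = &root->leaf; *pp; pp = &(*pp)->next) {
                if (*pp == rt) {
                        *pp = rt->next;
                        break;
                }
        }
        if (root->leaf == NULL)
                root->leaf = &null_entry;
}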

3.3 Improvements on Router Selection

Figure 4: USAGI IPv6 Routing Table Structure

Figure 5: Default Routers in Linux

We pick one default router from the default router list and round-robin the

default router list when the selected router becomes unreachable. The default router is pointed to by rt6_dflt_pointer, which is guarded by rt6_dflt_lock, and default routers are stored on the top-level root node of the routing tree (Figure 5). In this implementation, there were several issues:

• rt6_dflt_pointer is reset whenever the routing table is modified; this happens very often, so routers are not selected equally.

• We did not take the metrics into account; we could not force the use of routes with smaller metrics (which were probably added manually).

“Default Router Preferences, More-Specific Routes, and Load Sharing” [1] improves the ability of hosts to pick an appropriate router, especially when the host is multi-homed and the routers are on different links, and mandates load sharing between routers with the same “preference.”

To implement this specification, we store the preference (2 bits) of each route in the flags of the


Figure 6: New Method for Route Round-robin

routing information instead of reflecting it in the metrics; otherwise, we would have to fix up the routing table whenever a Router Advertisement is received.

We also made new generic round-robin code for routes with the same metrics (Figure 6). We use this for all routes, and rt6_dflt_pointer and rt6_dflt_lock are eliminated. Now we are free from the above issues.
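A sketch of such generic round-robin selection (names are illustrative; the real route-selection logic differs in detail):

#include <linux/jiffies.h>

struct route {
        struct route *next;             /* routes kept sorted by metric */
        int           metric;
        unsigned long last_used;
};

/* Pick the least recently used route among those sharing the first
   (best) metric, so routers of equal preference are used in turn. */
static struct route *route_round_robin(struct route *head)
{
        struct route *rt, *best = head;

        if (!head)
                return NULL;
        for (rt = head->next; rt && rt->metric == head->metric; rt = rt->next) {
                if (time_before(rt->last_used, best->last_used))
                        best = rt;
        }
        best->last_used = jiffies;
        return best;
}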

4 Linux IPv6 in 2.6

In this section, we describe the key changes in the IPv6 networking code between 2.4 and 2.6².

Then we try to describe how IPsec works.

4.1 Key Changes in 2.6.x

We have been developing IPv6 actively since the end of the 2.3.x era. However, only a few selected changes were integrated into the mainline tree. One reason was that we were obscure and novices at kernel development.

After about two years of kernel development experience, we started integrating our work into the mainline kernel from the fall of 2002, more aggressively than before.

We have been fixing several bugs, such as:

• Verifying ND options properly

² Some changes will also appear in 2.4.21 (or in later 2.4.x releases).

• Refining the IPv6 address validation timer

• Fixing source address of MLD messages

• Avoiding garbage sin6_scope_id for MSG_ERRQUEUE messages

In addition to these bug fixes, we have integrated the following new features:

IPsec for IPv6: This is based on IPsec for IPv4, developed by David S. Miller et al. See Section 4.2.

Default route support on routers: See Section 3.2.1.

IPV6_V6ONLY support: See Section 5.2.4.

ICMPv6 rate limit support: Added a rate-limit sysctl for ICMPv6, as already exists for ICMP.

Privacy Extensions [7]: Assign randomized interface identifiers to improve privacy.

AF-independent XFRM infrastructure: Split the XFRM subsystem into an AF-independent portion and AF-specific portions. See Section 4.2.3.

Per-interface statistics infrastructure: A new infrastructure to provide per-interface statistics information.

4.2 IPsec

IP security (IPsec) provides security functionality for the IP layer. An implementation of IPv4 IPsec by FreeS/WAN has been available for years; however, the code was never merged into the mainline kernel. In 2000, the IABG Project provided an IPv6 support patch for FreeS/WAN. It, however, remained a patch and was also unlikely to be merged into the mainline kernel.


We redesigned the architecture for a multi-protocol (both IPv4 and IPv6), extensible IPsec. In our design, IPv4 and IPv6 share the Security Policy Database (SPD) and the Security Association Database (SAD). CryptoAPI and its variants are used for the cryptographic, digest, and compression/decompression algorithms.

4.2.1 Stackable Destination and XFRM

A new framework for processing IP packets has been introduced into linux-2.5.x. It is called “stackable destination” and XFRM.

A stackable destination is like a linked list of dst{} structures, which is made temporarily and cached. We are able to insert another dst{} in front of the original dst{} and make a stack of dst{} structures. Each dst{} normally has a pointer to an xfrm_state{}, whose output function provides some functionality, i.e. a transformation, for the packet.

XFRM stands for transformer. xfrm_policy{} and xfrm_state{} represent an IPsec policy and an IPsec SA, respectively. An xfrm_state{} is associated with an xfrm_policy{} by an xfrm_tmpl{}. The SPD consists of xfrm_policy{} entries, and the SAD consists of xfrm_state{} entries.
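These relationships can be pictured with simplified stand-in structures (the real kernel definitions contain many more fields; the _s-suffixed names are ours):

struct sk_buff;                          /* packet buffer */

struct xfrm_state_s {                    /* one IPsec SA; the SAD is a set of these */
        int proto, mode;                 /* e.g. ESP/AH, transport/tunnel */
        int (*output)(struct sk_buff *skb);   /* e.g. esp6_output() */
};

struct xfrm_tmpl_s {                     /* "an SA with these properties is needed" */
        int proto, mode;
};

struct xfrm_policy_s {                   /* one IPsec policy; the SPD is a set of these */
        struct xfrm_tmpl_s tmpl[4];
        int nr_tmpls;
};

struct dst_sketch {                      /* stackable destination */
        struct dst_sketch   *child;      /* next dst{} in the stack */
        struct xfrm_state_s *xfrm;       /* SA applied at this step */
        int (*output)(struct sk_buff *skb);
};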

4.2.2 Packet Processing

The output process of IPsec makes full use of this architecture. The principal functions, in order, are xfrm_lookup(), xfrm_tmpl_resolve(), xfrm_bundle_create(), and dst_output(). xfrm_lookup() looks up the xfrm_policy{} in the SPD after routing resolution; at that moment the dst{} parameter in the stack points to the original dst{} structure. xfrm_tmpl_resolve() is called from xfrm_lookup() to resolve

the xfrm_tmpl{} entries in the xfrm_policy{}, which represent how the packet is to be processed, and to find the xfrm_state{} matching each xfrm_tmpl{}. This process is equivalent to looking up the IPsec SA or IPsec SA bundle matching an IPsec policy. xfrm_bundle_create() creates the stackable destination and the IPsec SA bundle if multiple SAs are needed. These functions are called at routing resolution time. dst_output() is called after the packet has been built. Each output routine specified by the function pointer in a dst{} is called along the chain of dst{} entries; this pointer points to, e.g., esp6_output(). The output function is able to use the xfrm_state{} via the dst{} pointer in the sk_buff{}.
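Condensed, the output path described above looks roughly like the following (pseudocode: error handling is omitted, the *_sketch helpers are placeholders of our own, and the real function signatures differ):

struct sk_buff;
struct dst_entry;

extern struct dst_entry *route_lookup_sketch(struct sk_buff *skb);
extern int xfrm_lookup_sketch(struct dst_entry **dst, struct sk_buff *skb);
extern int dst_output_sketch(struct dst_entry *dst, struct sk_buff *skb);

int ipsec6_output_sketch(struct sk_buff *skb)
{
        /* 1. routing resolution: dst points at the original route */
        struct dst_entry *dst = route_lookup_sketch(skb);

        /* 2. xfrm_lookup(): find the matching xfrm_policy{} in the SPD,
              resolve its xfrm_tmpl{} list to xfrm_state{} SAs via
              xfrm_tmpl_resolve(), and build the stacked dst chain with
              xfrm_bundle_create() when a bundle of SAs is needed.      */
        if (xfrm_lookup_sketch(&dst, skb) < 0)
                return -1;

        /* 3. dst_output(): walk the dst{} chain; each element's output
              function (e.g. esp6_output()) transforms the packet using
              the xfrm_state{} attached to that dst{}.                  */
        return dst_output_sketch(dst, skb);
}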

Figure 7: IPsec output process

The input process for IPsec is simpler than output. The AH and ESP processing routines are registered in inet6_protos[] at initialization. The kernel parses a packet and calls these routines when the protocol is AH or ESP. To unify extension header processing, all header types and handlers are registered in inet6_protos[], like upper layer protocols. IPsec packet processing looks up the xfrm_state{} and processes the packet with it. When this succeeds, the pointer to the used xfrm_state{} is kept in the sec_path{} in the sk_buff{} which contains the packet. After IPsec processing, the kernel calls xfrm_policy_check() at the entrance to upper layer processing. In xfrm_policy_check() the kernel matches the xfrm_tmpl{} entries in the xfrm_policy{} against the xfrm_state{} entries kept in the sec_path{} (Figure 8).


Figure 8: IPsec input process

4.2.3 AF-Independent XFRM Infrastructure

Since the core functionality of the XFRM engine is common among address families, an AF-independent XFRM infrastructure has been introduced.

Address-family-specific XFRM functions are registered via address family information tables, e.g. xfrm_policy_afinfo{} and xfrm_state_afinfo{}. Common variables are also passed via these tables.

4.2.4 Key and Policy Management Interface

PF_KEY and netlink(7) interfaces are provided as IPsec interfaces. PF_KEY provides an interface to maintain the SAD and SPD. The PF_KEY protocol is defined in RFC2367, but it is not sufficient for maintaining IPsec, so implementations of PF_KEY are usually extended, and the extensions differ between implementations. The Linux 2.5.x PF_KEY is compatible with KAME's.

4.2.5 Test Results

On April 24th, 2003, Tom Lendacky reported to the netdev mailing list that the test results of Linux 2.5 IPsec were excellent (Table 3).

We have tried to fix the bugs in IPv6 IPsec fragmentation, and they should be fixed by now.

Table 3: Summary of TAHI Conformance Test (linux-2.5.58, %)

Test Series   Pass  Warn  Fail
ipsec           95     2     3
ipsec4          98     2     0
ipsec4-udp      96     4     0

5 Modern Programming Style for Network Applications

Applications are expected to support both IPv4 and IPv6. In this section, we try to describe a modern programming style for network applications.

5.1 Socket API and Protocol Independency

The Socket API is the programming framework for communication, including networking via the Internet. It was designed to be protocol independent. Communication is abstracted by the socket descriptor, and endpoint information, which is protocol dependent, is passed via opaque pointers to the generic socket address structure sockaddr.

IPv6 networking is also supported in this framework. A new address family, AF_INET6, and an IPv6 socket address structure, sockaddr_in6, are defined.


5.2 Protocol Independent Programming

The framework of the Socket API between the kernel and user space is basically protocol independent. However, since the socket address structure and the name space depend on the protocol (or address family), looking up a name/address and setting up the protocol-specific socket address structure were protocol dependent. This prevented applications from being protocol independent.

In RFC2133 [3], two new name-lookup functions are defined: getaddrinfo() and getnameinfo(). They abstract the translation between name/service representations and socket address structures.

5.2.1 getaddrinfo(3)

getaddrinfo(3) [2] is the protocol-independent function for forward lookup (name to address). This function looks up the “node” and “service” under the conditions specified by the “hints.” It returns a dynamically allocated linked list of addrinfo{}. Each addrinfo{} includes the information needed for socket(2) and connect(2) (or bind(2), if the AI_PASSIVE flag is specified in the hints).

Thus, an application is no longer required to know the details of the socket address structure. A client application walks through the list, trying to create a socket and connect to the remote host until one of the attempts succeeds. Likewise, a server application walks through the list, trying to create a socket and bind to a local address.

5.2.2 getnameinfo(3)

getnameinfo(3) [2] is the protocol-independent function for reverse lookup (address

to name). This function takes a socket address structure and looks up the node name and service name under the conditions specified by the flags. By using this function, an application is not required to know the details of each socket address structure in order to extract addresses and/or port numbers.

For example, to extract the numeric service (port) number from a socket address structure, use getnameinfo(3) with NI_NUMERICSERV and convert the resulting string with, e.g., strtoul(3).
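A minimal illustration of that technique (the helper name port_of is ours, not part of any API):

#include <stdlib.h>
#include <sys/socket.h>
#include <netdb.h>

/* Extract the numeric port from any sockaddr without looking inside it. */
static unsigned long port_of(const struct sockaddr *sa, socklen_t salen)
{
        char serv[32];

        if (getnameinfo(sa, salen, NULL, 0, serv, sizeof(serv),
                        NI_NUMERICSERV) != 0)
                return 0;               /* lookup failed */
        return strtoul(serv, NULL, 10);
}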

Sample programs using getaddrinfo(3) and getnameinfo(3) are provided in the Appendix.

5.2.3 getifaddrs(3)

Another issue in supporting IPv6 in an application is how to learn the addresses of the node on which it is running. SIOCGIFADDR is used for IPv4; however, ifreq{} is not large enough to store an IPv6 socket address structure. It might be possible to introduce a new ioctl(2) to manage larger addresses; however, it is nasty to obtain information via a fixed-length buffer. Thus, getifaddrs(3) was invented.³ This function gathers the network address information on the node, including netmasks etc. MAC, IPv4, and IPv6 addresses are supported for now. An application walks through the linked list returned by this function, looking for the appropriate information using the family, flags, etc.

A sample program using getifaddrs(3) is provided in the Appendix.

³ BSDI’s invention; this is not standardized yet.


5.2.4 IPV6_V6ONLY Socket Option

IPv6 sockets may be used for both IPv4 and IPv6 communication. The IPv4-mapped IPv6 address is defined [5] to enable this for IPv6 applications: the IPv4 address of an IPv4 node is represented as an IPv4-mapped IPv6 address in such applications.

Linux supports this feature, and the TCP (or UDP) port space is completely shared between IPv4 and IPv6⁴.

However, some applications may want to restrict their use of an IPv6 socket to IPv6 communication only. For these applications, the IPV6_V6ONLY socket option is defined in RFC3493 [2].

In 2.6, the IPV6_V6ONLY socket option is supported. In this implementation, the “IPv4-mapped” feature is enabled by default, as before and as the specification says. If the IPV6_V6ONLY socket option is set on an IPv6 socket, the socket will not touch the IPv4 address space at all.

As mentioned before, the specification says this “IPv4-mapped” feature is enabled by default. However, there are OSes, such as NetBSD, which do not enable that feature by default, or which do not support that feature at all. These OSes are not RFC compliant but, unfortunately, that is the reality. So an application needs to take care of this situation. For example:

• Try to set up both IPv6 and IPv4 sockets.

• Set the IPV6_V6ONLY socket option on the IPv6 socket before performing bind(2).

• Failure to set the IPV6_V6ONLY socket option is not fatal.

⁴ In some OSes, such as FreeBSD 4.x, an IPv4 socket can override an IPv6 socket on the same port. We believe that this behavior is vulnerable to “binding closer” type attacks.

• Do not treat it as fatal unless all socket creation attempts result in errors.

The sample server provided in the Appendix is written in this manner.

6 Future Plans

Finally, we list our future plans, especially for this year:

• stabilizing IPv6 IPsec

• introducing IP Mobility to mainline

• completing the port of the ND fixes to (pre-)linux-2.6 and submitting patches to mainline

• introducing a generic IPv4/IPv6 tunnel interface

• examining and stabilizing IPv6 Netfilter

• completing Prefix Delegation

• improving API support [2, 10]

• implementing IPv6 Multicast Routing

As of this writing, we are working hard on stabilizing the IPv6 and IPv6 IPsec stacks in (pre-)linux-2.6.

We are also discussing with the maintainers and the HUT [4] people how Mobile IP should be implemented.

XFRM is a flexible and promising framework in networking. We are able to, and going to, implement Mobile IP, generic tunnels, etc. on top of it.

We have been waiting for new API documents. It has taken a long time for them to be published; however, new versions are (about to become) available. We will follow them.


Multicast routing is probably the biggest missing piece; we will try to implement this, too.

References

[1] R. Draves and R. Hinden. Default Router Preferences, More-Specific Routes, and Load Sharing. Work in Progress, June 2002.

[2] R. Gilligan, S. Thomson, J. Bound, J. McCann, and W. Stevens. Basic Socket Interface Extensions for IPv6. RFC3493, March 2003.

[3] R. Gilligan, S. Thomson, J. Bound, and W. Stevens. Basic Socket Interface Extensions for IPv6. RFC2133, April 1997.

[4] GO Project. MIPL Mobile IPv6 for Linux. http://www.mipl.mediapoli.com/.

[5] R. Hinden and S. Deering. IP Version 6 Addressing Architecture. RFC2373, July 1998.

[6] KAME Project. KAME Project Web Page. http://www.kame.net.

[7] T. Narten and R. Draves. Privacy Extensions for Stateless Address Autoconfiguration in IPv6. RFC3041, January 2001.

[8] T. Narten, E. Nordmark, and W. Simpson. Neighbor Discovery for IP Version 6 (IPv6). RFC2461, December 1998.

[9] Keith Sklower. A Tree-Based Packet Routing Table for Berkeley UNIX. In USENIX Winter, pages 93–104, 1991.

[10] W. Stevens, M. Thomas, E. Nordmark, and T. Jinmei. Advanced Sockets API for IPv6. Work in Progress, March 2003.

[11] TAHI Project. Test and Verification for IPv6. http://www.tahi.org.

[12] S. Thomson and T. Narten. IPv6 Stateless Address Autoconfiguration. RFC2462, December 1998.

[13] USAGI Project. USAGI Project Web Page. http://www.linux-ipv6.org.


7 Appendix: Sample Applications Written in the Modern Manner

7.1 Client

/*
 * Sample Modern Client
 *
 * Usage:
 * % ./modern-client host.example.com daytime
 *
 * $Id: modern-client.c,v 1.1 2003/05/13 20:06:58 yoshfuji Exp $
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <string.h>

int main(int argc, char **argv) {
        char *host, *port;
        struct addrinfo hints, *ai0, *ai;
        int s;
        int gai;

        /* check arguments */
        if (argc != 2 && argc != 3) {
                fprintf(stderr, "Usage: %s [host] portnum\n", argv[0]);
                exit(1);
        }

        if (argc == 3) {
                host = argv[1];
                port = argv[2];
        } else {
                host = NULL;    /* loopback address */
                port = argv[1];
        }

        /* look up the name */
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_protocol = IPPROTO_TCP;
        hints.ai_flags = 0;

        gai = getaddrinfo(host, port, &hints, &ai0);
        if (gai) {
                fprintf(stderr,
                        "getaddrinfo(): %s port %s: %s\n",
                        host, port, gai_strerror(gai));
                exit(1);
        }

        /* loop connecting to the remote entity */
        s = -1;
        for (ai = ai0; ai; ai = ai->ai_next) {
                /* create a socket */
                s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
                if (s == -1)
                        continue;

                /* connect */
                if (connect(s, ai->ai_addr, ai->ai_addrlen) == 0)
                        break;

                close(s);
                s = -1;
        }

        /* free the address information */
        freeaddrinfo(ai0);

        /* check if we have failed */
        if (s == -1) {
                fprintf(stderr, "Cannot connect to %s port %s\n",
                        host != NULL ? host : "(null)",
                        port);
                exit(1);
        }

        /* process loop */
        while (1) {
                ssize_t cc;
                char buf[1024];

                /* read from the remote host */
                cc = read(s, buf, sizeof(buf));
                if (cc == -1) {
                        perror("read");
                        close(s);
                        exit(1);
                } else if (cc == 0) {
                        break;
                }

                /* output the response */
                if (write(STDOUT_FILENO, buf, cc) == -1) {
                        perror("write");
                        close(s);
                        exit(1);
                }
        }

        close(s);

        exit(0);
}

7.2 Server

/*
 * Sample Modern Server
 *
 * Usage:
 * % ./modern-server :: 12345
 *
 * $Id: modern-server.tex,v 1.1 2003/05/15 03:54:13 yoshfuji Exp $
 */

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <string.h>
#include <stdlib.h>
#include <sys/select.h>

#ifndef MAX_SOCKNUM
# define MAX_SOCKNUM FD_SETSIZE
#endif

static const char *message = "Hello, world!\n";

int main(int argc, char **argv) {
        char *host, *port;
        struct addrinfo hints, *ai0, *ai;
        int gai;
        int socknum = 0, *socklist = NULL;
        int maxfd = -1;
        fd_set fds_init, fds;
        int i;

        /* check arguments */
        if (argc != 2 && argc != 3) {
                fprintf(stderr, "Usage: %s [host] portnum\n", argv[0]);
                exit(1);
        }

        if (argc == 3) {
                host = argv[1];
                port = argv[2];
        } else {
                host = NULL;    /* unspecified address */
                port = argv[1];
        }

        /* resolve the address */
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_protocol = IPPROTO_TCP;
        hints.ai_flags = AI_PASSIVE;

        gai = getaddrinfo(host, port, &hints, &ai0);
        if (gai) {
                fprintf(stderr,
                        "getaddrinfo(): %s port %s: %s\n",
                        host != NULL ? host : "(null)",
                        port,
                        gai_strerror(gai));
                exit(1);
        }

        /* initialize the fd_set for select(2) */
        FD_ZERO(&fds_init);

        /* loop creating listening sockets */
        for (ai = ai0; ai; ai = ai->ai_next) {
                int s;
                int *newlist;
#ifdef IPV6_V6ONLY
                int on = 1;
#endif

                /* create a socket */
                s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
                if (s == -1)
                        continue;

#ifdef IPV6_V6ONLY
                if (ai->ai_family == AF_INET6 &&
                    setsockopt(s,
                               IPPROTO_IPV6, IPV6_V6ONLY,
                               &on, sizeof(on)) == -1) {
                        perror("setsockopt(IPV6_V6ONLY)");
                        /*
                         * Some systems do not support this option;
                         * this error should not be fatal.
                         */
                }
#endif

                /* bind and listen */
                if (bind(s, ai->ai_addr, ai->ai_addrlen) == -1) {
                        close(s);
                        continue;
                }

                if (listen(s, 5) == -1) {
                        close(s);
                        continue;
                }

                if (s >= FD_SETSIZE || socknum >= MAX_SOCKNUM) {
                        close(s);
                        fprintf(stderr, "too many file/socket descriptors\n");
                        break;
                }

                /* re-allocate the list of sockets */
                newlist = realloc(socklist, sizeof(int) * (socknum + 1));
                if (newlist == NULL) {
                        perror("realloc");
                        close(s);
                        break;  /* XXX: terminate immediately? */
                }

                socklist = newlist;
                socklist[socknum++] = s;

                /* set fd_set */
                FD_SET(s, &fds_init);

                if (maxfd < s)
                        maxfd = s;
        }

        /* free the address information */
        freeaddrinfo(ai0);

        /* check if we have failed */
        if (socknum == 0) {
                fprintf(stderr,
                        "Cannot allocate any listen sockets on %s port %s\n",
                        host != NULL ? host : "(null)",
                        port);
                exit(1);
        }

        while (1) {
                fds = fds_init;

                if (select(maxfd + 1, &fds, NULL, NULL, NULL) == -1) {
                        perror("select");
                        continue;
                }

                for (i = 0; i < socknum; i++) {
                        int sock = socklist[i];

                        /* look up the listener.
                         * XXX: this is not fair between listeners
                         */
                        if (FD_ISSET(sock, &fds)) {
                                int newfd;
                                struct sockaddr_storage ss;
                                socklen_t sslen;
                                ssize_t cc;
                                char hostbuf[NI_MAXHOST];
                                int gni;

                                sslen = sizeof(ss);
                                newfd = accept(sock, (struct sockaddr *)&ss, &sslen);
                                if (newfd == -1) {
                                        perror("accept");
                                        continue;
                                }

                                gni = getnameinfo((struct sockaddr *)&ss, sslen,
                                                  hostbuf, sizeof(hostbuf),
                                                  NULL, 0,
                                                  NI_NUMERICHOST);
                                if (gni)
                                        strcpy(hostbuf, "???");  /* FIXME! */

                                printf("accept from %s\n", hostbuf);

                                cc = write(newfd, message, strlen(message));
                                if (cc == -1) {
                                        perror("write");
                                } else if (cc != (ssize_t)strlen(message)) {
                                        fprintf(stderr,
                                                "write returned %ld "
                                                "while %lu is expected.\n",
                                                (long)cc,
                                                (unsigned long)strlen(message));
                                }

                                close(newfd);
                        }
                }
        }

        /* we should not reach here */
        for (i = 0; i < socknum; i++)
                close(socklist[i]);
        free(socklist);

        exit(0);
}

7.3 getifaddrs(3)

#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <ifaddrs.h>

int main() {
        struct ifaddrs *ifa0, *ifa;
        int ret;

        ret = getifaddrs(&ifa0);
        if (ret) {
                perror("getifaddrs()");
                exit(1);
        }

        for (ifa = ifa0; ifa; ifa = ifa->ifa_next) {
                if (!ifa->ifa_addr)
                        continue;
                switch (ifa->ifa_addr->sa_family) {
                case AF_INET:
                        /* ifa->ifa_addr points to a sockaddr_in{} */
                        /* ... */
                        break;
                case AF_INET6:
                        /* ifa->ifa_addr points to a sockaddr_in6{} */
                        /* ... */
                        break;
#if defined(AF_PACKET)
                case AF_PACKET:
                        /* ifa->ifa_addr points to a sockaddr_ll{} */
                        /* ... */
                        break;
#endif
#if defined(AF_LINK)
                case AF_LINK:
                        /* ifa->ifa_addr points to a sockaddr_dl{} */
                        /* ... */
                        break;
#endif
                default:
                        /* not supported */
                        ;
                }
        }
        freeifaddrs(ifa0);
        exit(0);
}


Fault Injection Test Harness
A tool for validating driver robustness

Louis Zhuang
Intel Corp.
[email protected],
[email protected]

Stanley Wang
Intel Corp.
[email protected]

Kevin Gao
Intel Corp.
[email protected]

Abstract

FITH (Fault Injection Test Harness) is a tool for validating driver robustness. Without changing existing code, it can intercept arbitrary MMIO/PIO accesses and the IRQ handler in a driver.

First, I list the requirements and design for fault injection. Next, we discuss a couple of new, generally useful mechanisms implemented in FITH:

1. KMMIO: the ability to dynamically hook into arbitrary MMIO operations.

2. KIRQ: the ability to hook into an arbitrary IRQ handler.

Then I demonstrate how FITH can help developers trace and identify tricky issues in their drivers. A performance benchmark is also provided to show our efforts in minimizing the impact on system performance. Finally, I elaborate on current and future efforts and conclude.

1 Introduction

High-availability (HA) systems must respond gracefully to fault conditions and remain operational during unexpected software and hardware failures. Each layer of the software stack of an HA system must be fault tolerant, producing acceptable output or results when encountering system, software, or hardware faults, including faults that theoretically should not occur. An empirical study [2] shows that 60–70% of kernel-space defects can be attributed to device driver software. Some fault conditions (such as hardware failures, system resource shortages, and so forth) seldom happen and are difficult to simulate and reproduce without special assistant hardware, such as an In-Circuit Emulator. In these situations, it is difficult to predict what would happen should such a fault occur at some time in the future. Consequently, device drivers that are highly available or hardened are designed to minimize the impact of failures on a system's overall functionality.

Developing hardened drivers requires employing fault-avoidance software development techniques early in the development phase. To


eliminate faults during development and confirm a driver's level of hardening, a developer can test a device driver by injecting or simulating fault events or conditions. The focus of this paper is on the injection or simulation of hardware faults; injection of software faults will be considered in a future version.

FITH simulates hardware-induced software errors without modifying the original driver. It offers flexible customization of hardware fault simulation and provides a command-line tool that facilitates test development. FITH can also log the route of an injected fault, thereby enabling driver developers to diagnose the tested driver.

2 Requirements

This section describes some requirements for FITH that we derived as part of the development.

The only behavioral requirement is that FITH should not impact the functionality of the tested driver. The tested driver should work as if there were no FITH at all.

There are various functional requirements that need to be considered. Most center around the interception of resource accesses: FITH needs to be able to intercept accesses to MMIO, IO, IRQ, and PCI configuration space. The other major requirement concerns handling after interception: FITH needs to support complex and customized post-handling, such as tracing hardware status, emulating fake hardware registers, injecting error data, and logging.

With respect to performance, there are basically two overriding goals:

• minimize the impact on system performance when the tested driver does not enable FITH.

• minimize the number of instructions in critical kernel paths, such as the exception and interrupt code.

3 Architecture

FITH consists of four components: interceptors, faultsets, configuration tools, and a fault injection manager. The configuration tools provide the command-line utilities needed to customize a faultset for a driver. Three interceptors will be implemented to catch IO, MMIO, and IRQ accesses. When a driver tries to access a hardware resource, the IO interceptor captures this access and asks the fault injection manager whether there is a corresponding item in the faultset. If there is, the fault injection manager determines how to inject the appropriate fault according to the associated properties defined in the faultset. The fault injection manager then returns this information to the interceptor, so the interceptor injects the actual fault into the hardware access.

IRQ fault injection is somewhat different from the other types of fault injection. Hardware triggers the IRQ, and a kernel IRQ interrupt handler delivers this event to the IRQ interceptor. The IRQ interceptor then checks the faultset to determine whether a specific fault is available to inject into the event. Figure 1 illustrates how the interceptor interacts with the other subsystems.

The interceptor sits between the hardware and the device driver and modifies information based on certain conditions in the driver's namespace. An interceptor for one driver does not affect the interceptors for others. In effect, the hardware that the driver observes is the hardware that our interceptor wraps.


Figure 1: Architecture of FITH

4 KMMIO—Interceptor of MMIO access

One of the requirements of FITH is the ability to hook into a specific memory-mapped IO region before the user of the region gets access. A fault injection test case may need to just note when a given region is being read or written, take some action before the caller returns from the read or write operation, or change the value that is being read or written.

4.1 Approaches for capturing MMIO accesses

There are several hardware/software approaches for capturing MMIO accesses.

• Overriding MMIO functions.

Memory-mapped IO accesses can be captured by overriding the MMIO accessor functions, such as readb() and writeb(). The major advantage is that this method is platform independent, because all Linux platforms support these MMIO functions. The disadvantages are:

1. Any driver that accesses MMIO without using the standard MMIO

functions cannot be intercepted.

2. A special FITH header file needs to be added to the driver code, and the driver needs to be recompiled.

3. There are some differences between the “released driver” and the “driver with FITH.” The driver that is validated and verified is the driver with FITH rather than the released driver.

• Setting Watch Points.

The IA-32 architecture provides extensive debugging facilities for debugging code and monitoring code execution. These facilities can also be used to intercept memory accesses. The major advantages are:

1. The driver does not need to be recompiled.

2. A good patch to support it already exists [3].

On the other hand, there are several disadvantages:

1. In the IA-32 architecture, this is a trap-type exception, which means that the processor generates the exception after the IO instruction has been executed, so this method cannot inject faults into write operations.

2. Only four watch points can be used. This may not be enough in a complex environment.

• Trapping MMIO access by using Page-Fault Exceptions.

Like normal memory, MMIO is handled by the page-protection mechanism. Therefore, MMIO access can be intercepted by capturing page faults. The method clears the PE (PRESENT) bit of the PTE (Page Table Entry) of the MMIO address so that


the processor triggers a page-fault exception when the MMIO is accessed. The major advantages are:

1. In the IA-32 architecture, this is a fault-type exception, which means that the processor generates the exception before the MMIO access is executed, so this method can inject faults into write operations.

2. The driver does not need to be recompiled.

There is one disadvantage to this method: because the unit of intercepted MMIO is the size of a page (4 KB on IA-32), an adjacent MMIO access may unnecessarily trigger an exception, so system performance might be impacted.

Based on the FITH requirements and our analysis above, we implemented the page-fault method to capture MMIO accesses.

4.2 Implementation

We followed the same usage style that kprobes provides, with a register_kmmio_probe() function for adding a probe and an unregister_kmmio_probe() function for removing it. register_kmmio_probe() adds the probe to an internal list and marks the page the probe is on as not present. After register_kmmio_probe(), any access to that page will trigger a page-fault exception and fall into the KMMIO core.
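A hypothetical usage sketch follows; the exact layout of the probe structure is not shown in this paper, so the field and type declarations below are assumptions:

struct pt_regs;

struct kmmio_probe {                    /* assumed layout, for illustration */
        unsigned long addr;             /* ioremap()ed MMIO address to watch */
        unsigned long len;
        void (*pre_handler)(struct kmmio_probe *p, struct pt_regs *regs,
                            unsigned long addr);
        void (*post_handler)(struct kmmio_probe *p, struct pt_regs *regs,
                             unsigned long addr);
};

static void my_pre(struct kmmio_probe *p, struct pt_regs *regs,
                   unsigned long addr)
{
        /* runs from the page-fault exception, before the MMIO access */
}

static void my_post(struct kmmio_probe *p, struct pt_regs *regs,
                    unsigned long addr)
{
        /* runs from the single-step exception, after the MMIO access */
}

static struct kmmio_probe probe = {
        .pre_handler  = my_pre,
        .post_handler = my_post,
};

/* probe.addr = (unsigned long)mmio_base;
   probe.len  = PAGE_SIZE;
   register_kmmio_probe(&probe);
   ...
   unregister_kmmio_probe(&probe); */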

To get control in the page-fault exception, we need some tweaks to fault.c; KMMIO adds an additional path there. When the page fault falls into KMMIO, KMMIO looks up the fault address in the probe hash list. If the fault address belongs to one of the probes, the pre_handler of the probe is called.

diff -Nru a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c
--- a/arch/i386/mm/fault.c   Thu May 15 15:52:08 2003
+++ b/arch/i386/mm/fault.c   Thu May 15 15:52:08 2003
@@ -80,6 +81,9 @@
 	/* get the address */
 	__asm__("movl %%cr2,%0":"=r" (address));
 
+	if (is_kmmio_active() && kmmio_handler(regs, address))
+		return;
+
 	/* It's safe to allow irq's after cr2 has been saved */
 	if (regs->eflags & X86_EFLAGS_IF)
 		local_irq_enable();

Figure 2: Patch against fault.c

Then KMMIO tries to recover normal execution: it marks the page as present again. But KMMIO needs to do more than this. Because the probe should be re-enabled after the current instruction, which triggered the page-fault exception, KMMIO enables single-stepping before exiting the page-fault exception. Similar to the change to fault.c, KMMIO patches traps.c to get control when the single-step exception is triggered.

After the instruction which triggered the page-fault exception has executed, a single-step exception is triggered and falls into KMMIO again. If there is a probe on the fault address, KMMIO calls the post_handler of the probe. Then KMMIO marks the page as not present again to re-enable the probe on that page.


diff -Nru a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c
--- a/arch/i386/kernel/traps.c   Thu May 15 15:52:08 2003
+++ b/arch/i386/kernel/traps.c   Thu May 15 15:52:08 2003
@@ -524,6 +525,9 @@
 	__asm__ __volatile__("movl %%db6,%0" : "=r" (condition));
 
+	if (post_kmmio_handler(condition, regs))
+		return;
+
 	/* It's safe to allow irq's after DR6 has been saved */
 	if (regs->eflags & X86_EFLAGS_IF)
 		local_irq_enable();

Figure 3: Patch against traps.c

5 KIRQ—Interceptor of IRQ handler

Placing hooks into the IRQ handler of a device is a straightforward task. KIRQ stores the device's IRQ handler in struct kirq and modifies the IRQ chain in the kernel to replace the device's IRQ handler with KIRQ's handler. When the device's interrupt falls into KIRQ, it calls the handler of the hook. Based on the return value of the hook's handler, KIRQ calls the original handler of the device in turn.
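An illustrative-only sketch of that interposition (struct kirq's real layout and the lookup helper are assumptions on our part):

struct pt_regs;

struct kirq {                           /* assumed layout, for illustration */
        int   irq;
        void *dev_id;
        void (*orig_handler)(int irq, void *dev_id, struct pt_regs *regs);
        int  (*hook)(int irq, void *dev_id);    /* nonzero: pass through */
};

extern struct kirq *kirq_lookup(int irq, void *dev_id);  /* hypothetical */

/* Installed in place of the device's handler in the kernel IRQ chain. */
static void kirq_handler(int irq, void *dev_id, struct pt_regs *regs)
{
        struct kirq *k = kirq_lookup(irq, dev_id);

        if (!k || k->hook(irq, dev_id))
                k->orig_handler(irq, dev_id, regs);     /* run the real handler */
        /* otherwise the interrupt is swallowed, e.g. to simulate a lost IRQ */
}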

6 Example

The serial device driver in the Linux kernel was used in our fault injection trials. Four steps were involved in developing the fault injection tests:

1. Identify the resources that need to be fault injected.

2. Prepare the faultset data source.

3. Set up the test environment.

4. Run the workload and analyze the results.

6.1 Preparing the Faultset Data Source

FITH supports faultset scripts and action code segments, which can be used to customize the injected faults. For example, a transmission error fault would modify data when the driver reads the hardware status from a register. The faultset description looks like the following:

<?xml version="1.0" encoding="UTF-8" ?>
<fsml xmlns="http://fault-injection.sourceforge.net/FSML/">
  <trigger id="2"
           type="r"
           len="1"
           addr="0x3FD"
           bitmask="0"
           min="0"
           max="0"
           skip="0"
           protection_mask="0"
           hz="0">
    <action code-segment="cs_001" />
  </trigger>
</fsml>

Figure 4: Faultset description example

The corresponding code segment looks like the following:

6.2 Setting Up the Test Environment

There are three steps:

1. load FITH kernel modules.


#include <fith/state_machine.h>

unsigned long pointer = 0;

int inject_faults(struct context *cur)
{
        unsigned long line_addr;

        /* translate the bus address into a linear address */
        line_addr = fith_bus2line(pointer);

        /* inject errors in the data by going through
         * the device-specific structure data
         * ...
         */
        return 0;
}

/* cs_001 is called by trigger "001" in the FSML
 * script when IO port 0x3FD
 * (Command Register) is written. */
int cs_001(struct state_machine *sm, struct context *cur)
{
        if (cur->data == 'a') {         /* check if it is the 'a' character */
                inject_faults(cur);
        }
        return 0;
}

Figure 5: Code segment example

2. set up the faultsets with the Fault Injection Command-Line (ficl) configuration tool.

3. load the serial driver.

6.3 Running the Workload and Analyzing the Results

To validate the serial driver, a well-chosen workload was run to stress the driver when it accessed the device. FITH injected faults between the driver and the device and logged these

operations. The fault-induced results and injected faults were later analyzed.

7 System Performance Impact

In this section we assess the performance impact of the current implementation of FITH.

7.1 LMbench

LMbench is a general OS benchmark designed to measure all aspects of an OS from the application's point of view. This is useful for generating a set of apples-to-apples system comparisons between a pure Linux kernel and a FITH-enabled Linux kernel.

All experiments were performed on a dual Pentium III 933 MHz system with 512 KB L2 cache and 512 MB RAM. The “52-pure” data was obtained by running a vanilla 2.5.52 Linux kernel. The “cs@1901” data was obtained by running a patched 2.5.52 Linux kernel (which contained the KMMIO, KIRQ, etc. patches). There were no active probes in the “cs@1901” experiment.

Because FITH patches the Linux kernel in the page-fault exception path, the potential impact should be in the memory management subsystem. The current FITH implementation, however, minimizes the impact when there are no active probes. The differences between the two experiments are so small that they are buried in testing noise.

8 Acknowledgements

We specially thank the following persons for their kind help and feedback:

David Edwards (for the initial prototype and design), Frank Wang and Elton Yang (for project planning and support), Fleming Feng (for feedback


LMBENCH 2.0 SUMMARY

Basic system parameters
Host      OS            Description        Mhz
52-pure   Linux 2.5.52  i686-pc-linux-gnu  932
cs@1901   Linux 2.5.52  i686-pc-linux-gnu  932

Processor, Processes - times in microseconds - smaller is better
Host      OS            Mhz  null  null        open  selct  sig   sig   fork  exec   sh
                             call  I/O   stat  clos  TCP    inst  hndl  proc  proc   proc
52-pure   Linux 2.5.52  932  0.39  0.68  22.9  24.4  27.2   1.09  4.47  210.  996.   5050
cs@1901   Linux 2.5.52  932  0.37  0.68  22.7  24.0  31.5   1.06  4.51  221.  1052.  5202

*Local* Communication latencies in microseconds - smaller is better
Host      OS            2p/0K  Pipe  AF    UDP   RPC/  TCP    RPC/  TCP
                        ctxsw        UNIX        UDP          TCP   conn
52-pure   Linux 2.5.52  7.109  33.4  41.7  61.6  81.1  110.2  110.
cs@1901   Linux 2.5.52  7.137  15.2  41.8  61.1  81.1  109.8  134.

File & VM system latencies in microseconds - smaller is better
Host      OS            0K File        10K File       Mmap     Prot   Page
                        Create Delete  Create Delete  Latency  Fault  Fault
52-pure   Linux 2.5.52  92.5   46.0    225.6  75.3    2053.0   0.885  2.00000
cs@1901   Linux 2.5.52  93.4   46.5    229.2  77.1    2063.0   0.765  2.00000

*Local* Communication bandwidths in MB/s - bigger is better
Host      OS            Pipe  AF    TCP  File    Mmap    Bcopy   Bcopy   Mem   Mem
                              UNIX       reread  reread  (libc)  (hand)  read  write
52-pure   Linux 2.5.52  40.7
cs@1901   Linux 2.5.52  41.0

Figure 6: LM Benchmark

and requirements), Rusty Lynch (for the sysfs interface in FITH).

References

[1] David A. Edwards, “An Approach to Injecting Faults into Hardened Software,” Proceedings of the Ottawa Linux Symposium, http://wwww.linuxsymposium.org/2002

[2] Andy Chou, Junfeng Yang, Benjamin Chelf, et al., “An Empirical Study of Operating System Errors,” 2001, http://www.stanford.edu/~engler/metrics-sosp-01.ps

[3] Vamsi Krishna, Rusty Russell, et al., “Kernel Probes for Linux,” http://www-124.ibm.com/linux/projects/kprobes/