

Software Tools for Technology Transfer

Software Monitoring with Controllable Overhead

Xiaowan Huang, Justin Seyster, Sean Callanan, Ketan Dixit, Radu Grosu, Scott A. Smolka, Scott D. Stoller, Erez Zadok

Stony Brook University

Appears in the Proceedings of Software Tools for Technology Transfer

Abstract. We introduce the technique of Software Monitoring with Controllable Overhead (SMCO), which is based on a novel combination of supervisory control theory of discrete event systems and PID-control theory of discrete time systems. SMCO controls monitoring overhead by temporarily disabling monitoring of selected events for as short a time as possible under the constraint of a user-supplied target overhead o_t. This strategy is optimal in the sense that it allows SMCO to monitor as many events as possible within the confines of o_t. SMCO is a general monitoring technique that can be applied to any system interface or API.

We have applied SMCO to a variety of monitoring problems, including two highlighted in this paper: integer range analysis, which determines upper and lower bounds on integer variable values; and Non-Accessed Period (NAP) detection, which detects stale or underutilized memory allocations. We benchmarked SMCO extensively, using both CPU- and I/O-intensive workloads, which often exhibited highly bursty behavior. We demonstrate that SMCO successfully controls overhead across a wide range of target-overhead levels; its accuracy monotonically increases with the target overhead; and it can be configured to distribute monitoring overhead fairly across multiple instrumentation points.

Key words: Software instrumentation, supervisory control

1 Introduction

Ensuring the correctness and guaranteeing the performance of complex software systems, ranging from operating systems and Web servers to embedded control software, presents unique problems for developers. Errors occur in rarely called functions, and inefficiencies hurt performance over the long term. Moreover, it is difficult to replicate all of the environments in which the software may be executed, so many such problems arise only after the software is deployed. Consequently, testing and debugging tools, which tend to operate strictly within the developer's environment, are unable to detect and diagnose all undesirable behaviors that the system may eventually exhibit. Model checkers and other verification tools can test a wider range of behaviors, but they require difficult-to-develop models of execution environments and are limited in the complexity of programs they can check.

This situation has led to research in techniques to monitor deployed and under-development software systems. Such techniques include the DTrace [6] and DProbes [14] dynamic tracing facilities for Solaris and Linux, respectively. These frameworks allow users to build debugging and profiling tools that insert probes into a production system to obtain information about the system's state at certain execution points.

DTrace-like techniques have two limitations. First, they are always on. Hence, frequently occurring events can cause significant monitoring overhead. This raises the fundamental question: Is it possible to control the overhead due to software monitoring while achieving high accuracy in the monitoring results? Second, these tools are code-oriented: they interpose on execution of specified instructions in the code; events such as accesses to specific memory regions are difficult to track with these tools, because of their interrupt-driven nature. Tools such as Valgrind allow one to instrument memory accesses, but benchmarks have shown a 4-fold increase in runtimes even without any instrumentation [17]. This raises a related question: Is it possible to control the overhead due to monitoring accesses to specific memory regions?

To answer the first question, we introduce the new technique of Software Monitoring with Controllable Overhead (SMCO). To answer the second question, we instrument the application program so that SMCO can exploit the virtual memory hardware to selectively monitor memory accesses.

As the name suggests, SMCO is formally grounded in control theory, in particular, a novel combination of supervisory control of discrete event systems [15,1] and proportional-integral-derivative (PID) control of discrete time

[Figure 1. Structure of a typical SMCO application: a feedback controller receives the target overhead o_t and the observed overhead o, and sends enable/disable monitoring commands to the plant, which comprises the monitor and the instrumented program.]

systems [18]. Overhead control is realized by temporarily disabling interrupts generated by monitored events, thus avoiding the overhead from processing these interrupts. Moreover, such interrupts are disabled for as short a time as possible so that the number of events monitored, under the constraint of a user-supplied target overhead o_t, is maximized.

Our main motivation for using control theory is to avoid an ad hoc approach to overhead control, in favor of a much more rigorous one based on the extensive available literature [9]. We also hope to show that significant benefits can be gained by bringing control theory to bear on software systems and, in particular, runtime monitoring of software systems.

The structure of a typical SMCO application is illustrated in Figure 1. The main idea is that given the target overhead o_t, an SMCO controller periodically sends enable/disable monitoring commands to an instrumented program and its associated monitor in such a way that the monitoring overhead never exceeds o_t. Note that the SMCO controller is a feedback controller, as the current observed overhead level is continually fed back to it. This allows the controller to carefully monitor the observed overhead and, in turn, disable monitoring when the overhead is close to o_t and, conversely, enable monitoring when the likelihood of the observed overhead exceeding o_t is small. Also note that the instrumented program sends events of interest to the monitor as they occur, e.g., memory accesses and assignments to variables.

More formally, SMCO can be viewed as the problem of designing an optimal controller for a class of nonlinear systems that can be modeled as the parallel composition of a set of extended timed automata (see Section 2). Furthermore, we are interested in proportional-integral-derivative (PID) controllers. We consider two fundamental types of PID controllers: (1) a single, integral-like global controller for all monitored objects in the application software; and (2) a cascade controller, consisting of an integral-like primary controller which is composed with a number of proportional-like secondary controllers, one for each monitored object in the application. The main function of the primary controller is to control the set point of the secondary controllers.

We use the suffix "-like" because our PID controllers are event driven, as in the setting of discrete-event supervisory control, and not time-driven, as in the traditional PID control of continuous or discrete time systems. The primary controller and the global controller are integral-like because the control action is based on the sum of recent errors (i.e., differences between target and observed values). The secondary controllers are proportional-like because the controller's output is directly proportional to the error signal.

We use both types of controllers because of the following trade-offs between them. The global controller features relatively simple control logic and hence is very efficient. It may, however, undersample infrequent events. The cascade controller, on the other hand, is designed to provide fair monitoring coverage of all events, regardless of their frequency.

SMCO is a general runtime-monitoring technique that can be applied to any system interface or API. To substantiate this claim, we have applied SMCO to a number of monitoring problems, including two highlighted in this paper: integer range analysis, which determines upper and lower bounds on the values of integer variables; and Non-Accessed Period (NAP) detection, which detects stale or underutilized memory allocations. Integer range analysis is code-oriented, because it instruments instructions that update integer variables, whereas NAP detection is memory-oriented, because it intercepts accesses to specific memory regions.

The source-code instrumentation used in our integer range analysis is facilitated by a technique we recently developed called compiler-assisted instrumentation (CAI). CAI is based on a plug-in architecture for GCC [5]. Using CAI, instrumentation plug-ins can be separately compiled as shared objects, which are then dynamically loaded into GCC. Plug-ins have read/write access to various GCC internals, including abstract syntax trees (ASTs), control flow graphs (CFGs), and static single-assignment (SSA) representations.

The instrumentation in our NAP detector makes novel use of the virtual-memory hardware (MMU) by using the mprotect system call to guard each memory area suspected of being underutilized. When a protected area is accessed, the MMU generates a segmentation fault, informing the monitor that the area is being used. If a protected area remains unaccessed for a period of time longer than a user-specified threshold (the NAP length), then the area is considered stale. SMCO controls the total overhead of NAP detection by enabling and disabling the monitoring of each memory area appropriately.

To demonstrate SMCO's ability to control overhead while retaining accuracy in the monitoring results, we performed a variety of benchmarking experiments involving real applications. Our experimental evaluation considers virtually all aspects of SMCO's design space: cascade vs. global controller, code-oriented vs. memory-oriented instrumentation, CPU-intensive vs. I/O-intensive workloads (some of which were highly bursty). Our results demonstrate that SMCO successfully controls overhead across a wide range of target-overhead levels; its accuracy monotonically increases with the target overhead; and it can be configured, using the cascade controller, to fairly distribute monitoring overhead across

[Figure 2. Plant (P) and Controller (Q) architecture: the controller Q receives the reference input x and produces the control input v; the plant P produces the output y.]

multiple instrumentation points. Additionally, SMCO's base overhead, the overhead observed at o_t = 0, is a mere 1–4%.

The rest of the paper is organized as follows. Section 2 explains SMCO's control-theoretic foundations. Section 3 describes our architectural framework and the applications we developed. Section 4 presents our benchmarking results. Section 5 discusses related work. We conclude in Section 6 and discuss future work.

2 Control-Theoretic Monitoring

The controller design problem is the problem of devising a controller Q that regulates the input v to a process P (henceforth referred to as the plant) in such a way that P's output y adheres to a reference input x with good dynamic response and small error; see the architecture shown in Figure 2.

Runtime monitoring with controllable overhead can beneficially be stated as a controller design problem: the controller is a feedback controller that observes the monitoring overhead, the plant comprises the runtime monitor and the application software, and the reference input x to the controller is given by the user-specified target overhead o_t. This structure is depicted in Figure 1. To ensure that the plant is controllable, one typically instruments the application and the monitor so that they emit events of interest to the controller. The controller catches these events, and controls the plant by enabling or disabling monitoring and event signaling. Hence, the plant can be regarded as a discrete event process.

In runtime monitoring, overhead is the measure of how much longer a program takes to execute because of monitoring. If an unmodified and unmonitored program executes in time R and executes in total time R + M with monitoring, we say that the monitoring has overhead M/R.

Instead of controlling overhead directly, it is more convenient to write the SMCO control laws in terms of monitoring percentage: the percentage of execution time spent monitoring events, which is equal to M/(R + M). Monitoring percentage m is related to the traditional definition of overhead o by the equation m = o/(1 + o). The user-specified target monitoring percentage (UTMP) m_t is derived from o_t in a similar manner; i.e., m_t = o_t/(1 + o_t).
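For concreteness, the conversion can be captured in two one-line helpers (a sketch of ours, not part of SMCO's code). For example, a target overhead o_t = 0.10 (10%) corresponds to a UTMP m_t = 0.10/1.10 ≈ 0.091.

    /* Convert between overhead o = M/R and monitoring percentage
     * m = M/(R + M).  Illustrative helpers only. */
    double overhead_to_pct(double o) { return o / (1.0 + o); }
    double pct_to_overhead(double m) { return m / (1.0 - m); }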

The classical theory of digital control [9] assumes that the plant and the controller are linear systems. This assumption allows one to semi-automatically design the controller by applying a rich set of design and optimization techniques, such as the Z-transform, fast Fourier transform, root-locus analysis, frequency response analysis, proportional-integrative-derivative (PID) control, and state-space optimal design. For nonlinear systems, however, these techniques are not directly applicable, and various linearization and adaptation techniques must be applied as pre- and post-processing, respectively.

The problem we are considering is nonlinear because of the enabling and disabling of interrupts. Intuitively, the interrupt signal is multiplied by a control signal which is 1 when interrupts are enabled and 0 otherwise. Although linearization is one possible approach for this kind of nonlinear system, automata theory suggests a better approach: recasting the controller design (synthesis) problem as one of supervisory controller design [15,1].

The main idea of supervisory control we exploit to enable and disable interrupts is the synchronization inherent in the parallel composition of state machines. In this setting, the plant P is a state machine, the desired outcome (tracking the reference input) is a language L, and the controller design problem is that of designing a controller Q, which is also a state machine, such that the language L(Q‖P) of the composition of Q and P is included in L. This problem is decidable for finite state machines [15,1].

Monitoring percentage depends on the timing (frequency) of events and the monitor's per-event processing time. The specification language L therefore consists of timed words a_1, t_1, ..., a_l, t_l where each a_i is an (access) event that occurs at time t_i. Consequently, the state machines used to model P and Q must also include a notion of time. Previous work has shown that supervisory control is decidable for timed automata [2,19] and for timed transition models [16].

Modeling overhead control requires, however, the use of more expressive extended timed automata (see Section 2.2), and for such automata decidability is lost. The lack of decidability means that a controller cannot be automatically synthesized. This, however, does not diminish the usefulness of control theory. On the contrary, this theory becomes an indispensable guide in the design of a controller that satisfies a set of constraints. In particular, we use control theory to develop a novel combination of supervisory and PID control. As in classical PID control, the error from a given setpoint (and the integral and derivative of the error) is employed to control the plant. In contrast to classical PID control, the computation of the error and its associated control happens in our framework on an event basis, instead of a fixed, time-step basis.

To develop this approach, we must reconcile the seemingly incompatible worlds of event- and time-based systems. In the time-based world of discrete time-invariant systems, the input and the output signals are assumed to be known and available at every multiple of a fixed sampling interval ∆t. Proportional control (P) continually sets the current control input v(n) as proportional to the current error e(n) according to the equation v(n) = k·e(n), where n stands for n∆t and e(n) = y(n) − x(n) (recall that v, x, and y are depicted in Figure 2). Integrative control (I) sums the previous and current errors and sets the control input to v(n) = k·Σ_{i=0}^{n} e(i).
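As an illustration only (the function names and gain k are ours), the two laws are direct update rules over the sampled error sequence:

    /* Discrete-time P and I control laws at step n, where
     * e[i] = y(i) - x(i) is the error history and k is the gain. */
    double p_control(double k, double e_n) {
        return k * e_n;                  /* v(n) = k * e(n) */
    }
    double i_control(double k, const double *e, int n) {
        double sum = 0.0;
        for (int i = 0; i <= n; i++)     /* v(n) = k * (e(0) + ... + e(n)) */
            sum += e[i];
        return k * sum;
    }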

In contrast, in the event-based world, time information is usually abstracted away, and the relation to the time-based world, where controller design is typically done, is lost. However, in our setting the automata are timed; that is, they contain clocks, ticking at a fixed clock interval ∆t. Thus, events


can be assumed to occur at multiples of ∆t, too. Of course, communication is event based, but all the necessary information to compute the proper control value v(t) is available whenever an event is thrown at a given time t by the plant.

We present two controller designs with different trade-offs and correspondingly different architectures. Our global controller is a single controller responsible for all objects of interest in the monitored software; for example, these objects may be functions or memory allocations, depending on the type of monitoring being performed. The global controller features relatively simple control logic and hence is very efficient: its calculations add little to the observed overhead. It does not, however, attempt to be fair in terms of monitoring infrequently occurring events. Our cascade controller, in contrast, is designed with fairness in mind, as the composition of a primary controller and a set of secondary controllers, one for each monitored plant.

Both of the controller architectures temporarily disable interrupts to control overhead. One must therefore consider the impact of events missed during periods of non-monitoring on the monitoring results. The two applications of SMCO we consider are integer range analysis and the detection of under-utilized memory. For under-utilized memory detection, when an event is thrown, we are certain that the corresponding object is not stale. We can therefore ignore interrupts for a definite interval of time without compromising soundness, while at the same time lowering the monitoring percentage.

Similarly, for integer range analysis, two updates to an integer variable that are close to each other in time (e.g., consecutive increments to a loop variable) are often near each other in value as well. Hence, processing the interrupt for the first update and ignoring the second is often sufficient to accurately determine the variable's range, while also lowering the monitoring percentage. For example, in the benchmarking experiments described in Section 4, we achieve high accuracy (typically 90% or better) in our integer range analysis with a target overhead of just 10%.

2.1 Target Specification

The target specification for a single controlled plant is given as a timed language L, containing timed words of the form a_1, t_1, ..., a_l, t_l, where a_i is an event and t_i is the time at which a_i has occurred. Each plant has a local target monitoring percentage m_lt, which is effectively that plant's portion of the UTMP m_t. Specifically, L contains timed words a_1, t_1, ..., a_l, t_l that satisfy the following conditions:

1. The average monitoring percentage m = (l·p_a)/(t_l − t_1) is such that m ≤ m_lt, where p_a is the average time taken by the monitor and controller to process an event.

2. If the strict inequality m < m_lt holds, then the monitoring-percentage undershoot is due to time intervals with low activity during which all events are monitored.
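As an illustrative instance of condition 1 (the numbers are ours, not from the paper): if the average per-event processing time is p_a = 1 ms and m_lt = 0.1, then over a window of t_l − t_1 = 10 s, a word in L may contain at most l = m_lt·(t_l − t_1)/p_a = 1000 monitored events.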

The first condition bounds only the mean monitoring percentage m within a timed word w ∈ L. Hence, various policies for handling monitoring percentage, and thus enabling

[Figure 3. Automaton for the hardware plant P of one monitored object, with locations running, monitor access, and stopped; clock k_1 bounds the total monitoring time MT, and clock k_2 bounds per-access processing within [p_m, p_M].]

and disabling interrupts, are allowed. The second condition is a best-effort condition which guarantees that if the target monitoring percentage is not reached, this is only because the plant does not throw enough interrupts. As our benchmarking results of Section 4 demonstrate, we designed the SMCO global and cascade controllers (described in Section 2.3) to satisfy these conditions.

When considering the target specification language L and the associated mean monitoring percentage m, it is important to distinguish plants in which all interrupts can be disabled (as in Figure 3) from the others (as in Figure 4). Hardware-based execution platforms (e.g., CPU and MMU) and virtual machines such as the JVM belong to the former category. (The JVM supports disabling of software-based interrupts through just-in-time compilation.)

Software plants written in C, however, typically belong to the latter category, because code inserted during instrumentation is not removed at run-time. In particular, as discussed in Section 2.2.2, when function calls are instrumented, the instrumented program always throws function-call interrupts a_fc. Consequently, for such plants, in addition to m, there is also an unavoidable base monitoring percentage m_b = k·p_fc, where k is the number of function calls.

2.2 The Plant Models

This section specifies the behavior of the above plant types in terms of extended timed automata (introduced below). For illustration purposes, each hardware plant is controlled by a secondary controller, and the unique software plant is controlled by the global controller.

2.2.1 The Hardware Plant

Timed automata (TA) [2] are finite-state automata extended with a set of clocks, whose values are positive reals. Clock predicates on transitions are used to model timing behavior, while clock predicates appearing within locations (states) are used to enforce progress properties. Clocks may be reset by transitions. Extended TA are TA with local variables and a more expressive clock predicate/assignment language.

The hardware plant P is modeled by the extended TA in Figure 3. Its alphabet consists of input and output events. The clock predicates labeling its locations and transitions are of the form k ∼ c, where k is a clock, c is a natural number or


variable, and ∼ is one of <, ≤, =, ≥, and >. For example, the predicate k_1 ≤ MT labeling P's state running is a clock constraint, where k_1 is a clock and MT is the maximum-monitoring-time parameter discussed below.

Transition labels are of the form [guard]In/Cmd, where guard is a predicate over P's variables; In is a sequence of input events of the form v?e denoting the receipt of value e (written as a pattern) on channel v; and Cmd is a sequence of output and assignment events. An output event is of the form y!a, denoting the sending of value a on channel y; an assignment event is simply an assignment of a value to a local variable of the automaton. All fields in a transition label are optional. The use of ? and ! to denote input and output events is standard, and first appeared in Hoare's paper on CSP [12].

A transition is enabled when its guard is true and the specified input events (if any) have arrived. A transition is not forced to be taken unless letting time flow would violate the condition (invariant) labeling the current location. For example, the transition out of the state monitor access in Figure 3 is enabled as soon as k_2 ≥ p_m, but not forced until k_2 ≥ p_M. The choice is nondeterministic, and allows us to succinctly capture any transition in the interval [p_m, p_M]. This is a classic way of avoiding overspecification.

P has an input channel v where it may receive enable and disable commands, denoted en and di, respectively, and an output channel y_f where it may send begin-of-access and end-of-access messages, denoted ac and ac̄, respectively. Upon receipt of di, interrupt bit i is set to 0, which prevents the plant from sending further messages. Upon receipt of en, i is set to 1, which allows the plant to send an access message ac at arbitrary moments in time. Once an access message is sent, P resets the clock variable k_2 and transitions to a new state. At any time in the interval [p_m, p_M], P can leave this state and send an end-of-access message y_f!ac̄ to the controller.

P terminates when the maximum monitoring time MT, a parameter of the model, is reached, i.e., when clock k_1 reaches value MT. Initially, i = 1 and k_1 = 0.

A running program can have multiple hardware plants, with each plant a source of monitored events. For example, a program running under our NAP detection tool for finding under-utilized memory has one hardware plant for each monitored memory region. The NAP detector's controller can individually enable or disable interrupts for each hardware plant.

2.2.2 The Software Plant

In a software plant P, the application program is instrumented to handle, together with the monitor, the interrupt logic readily available to hardware plants (see Figure 4).

A software plant represents a single function that can run with interrupts enabled or disabled. In practice, the function toggles interrupts by choosing between two copies of the function body each time it is called: one copy that is instrumented to send event interrupts and one that is left unmodified.

Whenever a function call happens at the top level state of P, the instrumented program resets the clock variable k_1, sends the message fc on y_f to the controller, and waits for its

[Figure 4. Automaton for the software plant P of all monitored objects, with locations top level, wait for controller, execute function, and monitor access; the controller's response on v_f (ei or di) selects the instrumented or unmodified function body.]

response. If the response on v_f is di, indicating that interrupts are disabled, then the unmonitored version of the function body is called. This is captured in P by returning to the top level state at any time in the interval [p_m^fc, p_M^fc]. This interval represents the time required to implement the call logic.

If the response on v_f is ei, indicating that interrupts are enabled, then the monitored version of the function body is called. This is captured in P by transitioning to the state execute function within the same interval [p_m^fc, p_M^fc].

Within the monitored function body, the monitor may send on y_f a begin-of-access event ac to the controller whenever a variable is accessed, and transition to the state monitor access. The time spent by monitoring this access is expressed with a transition back to execute function that happens at any time in the interval [p_m, p_M]. This transition sends an end-of-access message ac̄ on y_f to the controller.

P terminates processing function f when the maximum monitoring time F, a parameter of the model, is reached; that is, when clock k_1 ≥ F.

2.3 The Controllers

2.3.1 The Global Controller

Integrative control uses the plant's previous behavior to control feedback. It has the advantage of good overall statistical performance for plants with consistent behavior, and it is relatively immune to hysteresis, in which periodicity in the output of the plant produces periodic, out-of-phase responses in the controller. Conversely, proportional control is highly responsive to changes in the plant's behavior, which makes it appropriate for long-running plants that exhibit change in behavior over time.

We have implemented an integral-like global controller for plants with consistent behavior. Architecturally, the global controller is in a feedback loop with a single plant representing all objects of interest to the runtime monitor. The architecture of the global controller is thus exactly that of Figure 1,

[Figure 5. Automaton for the global controller, with locations top level, function call, and variable access; on each fc event it compares the observed monitoring percentage p/(τ + p) against m_t and replies with ei or di.]

which is identical to the classical plant-controller architecture of Figure 2, except that in Figure 1, the plant is decomposed into the runtime monitor and the software it is monitoring.

In presenting the extended TA for the global controller, we assume it is in a feedback loop with a software-oriented plant whose behavior is given by the extended TA of Figure 4. This is done without loss of generality, as the global controller's state machine is simpler in the case of a hardware-oriented plant. The global controller thus assumes the plant emits events of two types: function-call events and access events, where the former corresponds to the plant having entered a C function, and the latter corresponds to updates to integer variables, in the case of integer range analysis.

The global controller's automaton is given in Figure 5 and consists of three locations: top level, the function-call processing location, and the variable-access processing location. Besides the UTMP m_t, the automaton for the global controller makes use of the following variables: clock variable k, a running total τ of the program's execution time, and a running total p of the instrumented program's observed processing time. Variable τ keeps the time the controller spent in total (over repeated visits) in its top-level location, whereas variable p keeps the time the controller spent in total in its function-call and access-event processing locations. Hence, at every moment in time, the observed overhead is o = p/τ and the observed monitoring percentage is m = p/(τ + p).

In the top-level location, the controller can receive the UTMP on channel x. The controller transitions from the top-level to the function-call processing location whenever a function-call event occurs. In particular, when function f is called, the plant emits an fc signal to the controller along y_f (regardless of whether access event interrupts are enabled for f), transitioning the controller to the function-call processing location along one of two edges. If the observed monitoring percentage for the entire program execution is above the UTMP m_t, the edge taken sends the di signal along v_f to disable monitoring of interrupts for that function call. Otherwise, the edge taken enables these interrupts. Thus, the global controller decides to enable/disable monitoring on a per-function-call basis. Moreover, since the enable/disable decision depends on the sign of the cumulative error e = m − m_t, the controller is integrative.

The time taken in the function-call processing location, which the controller determines by reading clock k's value upon receipt of an fc signal from the plant, is considered

[Figure 6. Overall cascade control architecture: a primary controller PQ sets the reference inputs x_1, ..., x_n of secondary controllers Q_1, ..., Q_n, each of which controls its own plant P_i through input v_i and output y_i.]

monitoring time; the transition back to the initial state thus adds this time to the total monitoring time p.

The controller transitions from the top-level to the variable-access processing location whenever a function f sends the controller an access event ac and interrupts are enabled for f. Upon receipt of an ac̄ event signaling the completion of event processing in the plant, the controller measures the time it spent in its variable-access location, and adds this quantity to p. To keep track of the plant's total execution time τ, each of the global controller's transitions exiting the initial location updates τ with the time spent in the top-level location.
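The enable/disable decision can be pictured with a small sketch (names and structure are ours; the actual controller is the automaton of Figure 5):

    /* Sketch of the global controller's per-function-call decision.
     * tau: total time spent at top level; p: total monitoring time;
     * m_t: user-specified target monitoring percentage (UTMP). */
    struct gctl { double tau, p, m_t; };

    int on_fc_event(struct gctl *c, double time_at_top_level) {
        c->tau += time_at_top_level;   /* leaving the top-level location */
        double total = c->tau + c->p;
        double m = total > 0.0 ? c->p / total : 0.0;  /* observed pct */
        return m <= c->m_t;            /* nonzero: reply ei; zero: reply di */
    }

    void on_event_done(struct gctl *c, double processing_time) {
        c->p += processing_time;       /* counts as monitoring time */
    }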

Note that all of the global controller's transitions are event-triggered, as opposed to time-triggered, as it interacts asynchronously with the plant. This aspect of the controller model reflects the discrete-event-based nature of our PID controllers.

2.3.2 The Cascade Controller

As per the discussion of monitoring-percentage undershoot in Section 2.1, some plants (functions or objects in a C program) might not generate interrupts at a high rate, and therefore might not make use of the target monitoring percentage available to them. In such situations it is desirable to redistribute such unused UTMP to more active plants, which are more likely to make use of this monitoring percentage. Moreover, this redistribution of the unused UTMP should be performed fairly, so that less-active plants are not ignored.

This is the rationale for the SMCO cascade controller (see Figure 6), which consists of a set of secondary controllers Q_i, each of which directly controls a single plant P_i, and a primary controller PQ that controls the reference inputs x_i to the secondary controllers. Thus, in the case of cascade control, each monitored plant has its own secondary controller that enables and disables its interrupts. The primary controller adjusts the

[Figure 7. Automaton for the secondary controller Q, with locations top, access, and wait; after processing an access event it disables interrupts for a computed delay d and then re-enables them.]

[Figure 8. Timeline for the secondary controller: in cycle i, monitoring is on for time τ_i, the event takes time p_i to process, and monitoring then stays off for delay d_i, which the controller sets from τ_i and p_i.]

local target monitoring percentage (LTMP) m_lt for the secondary controllers.

The secondary controllers. Each monitored plant P has a secondary controller Q, the state machine for which is given in Figure 7. Within each iteration of its main control loop, Q disables interrupts by sending message di along v upon receiving an access event ac along y, and subsequently enables interrupts by sending en along v. Consider the i-th execution of Q's control loop, and let τ_i be the time monitoring is on within this cycle, i.e., the time between events v!en and y?ac. Let p_i be the time required to process event y?ac, and let d_i be the delay time until monitoring is restarted, i.e., until event v!en is sent again. See Figure 8 for a graphical depiction of these intervals. Then the overhead in the i-th cycle is o_i = p_i/(τ_i + d_i) and, accordingly, the monitoring percentage of the i-th cycle is m_i = p_i/(p_i + τ_i + d_i).

To ensure that m_i = m_lt whenever the plant is throwing access events at a high rate, Q computes d_i as the least positive integer greater than or equal to p_i/m_lt − (τ_i + p_i). Choosing d_i this way lets the controller extend the total time spent in the i-th cycle so that its m_i is exactly the target m_lt.

To see how the secondary controller is like a proportional controller, regard p_i as a constant (p_i does not vary much in practice), so that p_i/m_lt, the desired value for the cycle time, is also a constant. The equation for d_i then becomes the difference between the desired cycle time (which we take to be the controller's reference value) and the actual cycle time measured when event i is finished processing. The value d_i is then equal to the proportional error for the i-th cycle, making the secondary controller behave like a proportional controller with proportional constant 1.

If plant P throws events at a low rate, then all events are monitored and d_i = 0. When processing of ac is finished, which is assumed to occur within the interval [p_m, p_M], Q sends the processing time k to the primary controller along channel u.
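The delay computation is simple enough to sketch directly (a minimal sketch with our own naming, mirroring the formula above):

    /* Delay d_i for cycle i of a secondary controller.
     * p_i: event processing time; tau_i: time monitoring was on;
     * m_lt: local target monitoring percentage. */
    #include <math.h>

    double cycle_delay(double p_i, double tau_i, double m_lt) {
        double d = ceil(p_i / m_lt - (tau_i + p_i));
        return d > 0.0 ? d : 0.0;  /* low event rate: monitor everything */
    }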

The primary controller. Secondary controller Q achieves its LTMP m_lt only if plant P throws events at a sufficiently high rate. Otherwise, its mean monitoring percentage m is less than m_lt. When monitoring a large number of plants P_i simultaneously, it is possible to take advantage of this under-utilization of m_lt by increasing the LTMP of those controllers Q_i associated with plants that throw interrupts at a high rate. In fact, we can adjust the m_lt of all secondary controllers Q_i by the same amount, as the controllers Q_j of plants P_j with low interrupt rates will not take advantage of this increase. Furthermore, we do this every T seconds, a period of time we call the adjustment interval. The periodic adjustment of the LTMP is the task of the primary controller PQ.

Its extended TA is given in Figure 9. After first inputting the UTMP m_t on x, PQ computes the initial LTMP to be m_t/n, thereby partitioning the global target monitoring percentage evenly among the n secondary controllers. It assigns this initial LTMP to the local variable m_lt and outputs it to the secondary controllers. It also assigns m_t to local variable m_gt, the global target monitoring percentage (GTMP). PQ also maintains an array p of total processing time, initially zero, such that p[i] is the processing time used by secondary controller Q_i within the last adjustment interval of T seconds. Array entry p[i] is updated whenever Q_i sends the processing time p_j of the most recent event a_j; i.e., p[i] is the sum of the p_j that Q_i generates during the current adjustment interval.

When the time bound of T seconds is reached, PQ computes the error e = m_gt − Σ_{i=1}^{n} p[i]/T as the difference between the GTMP and the observed monitoring percentage during the current adjustment interval. PQ also updates a cumulative error e_c, which is initially 0, such that e_c = e_c + e, making it the sum of the error over all adjustment intervals. To correct for the cumulative error, PQ computes an offset K_I·e_c that it uses to adjust m_lt down to compensate for over-utilization, and up to compensate for under-utilization. The new LTMP is set to m_lt = m_gt/n + K_I·e_c and sent to all secondary controllers, after which array p and clock k are reset.
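A compact sketch of this periodic adjustment (our naming; the real logic is the automaton of Figure 9):

    /* LTMP adjustment, run once per adjustment interval of T seconds.
     * m_gt: global target monitoring percentage; K_I: integrative gain;
     * e_c: cumulative error (persists across intervals); p[0..n-1]:
     * per-secondary processing time used during this interval. */
    double adjust_ltmp(double m_gt, double K_I, double *e_c,
                       double p[], int n, double T) {
        double observed = 0.0;
        for (int i = 0; i < n; i++) {
            observed += p[i] / T;       /* observed monitoring percentage */
            p[i] = 0.0;                 /* reset for the next interval */
        }
        *e_c += m_gt - observed;        /* accumulate the error */
        return m_gt / n + K_I * (*e_c); /* new LTMP for every secondary */
    }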

[Figure 9. Automaton for the primary controller: on input x?m_t it broadcasts the initial LTMP m_t/n; every T seconds it updates e_c by m_gt − Σ_i p[i]/T, broadcasts m_lt = m_gt/n + K_I·e_c to all n secondary controllers, and resets p and k.]

[Figure 10. SMCO architecture for range-solver: instrumented functions send events to the Range Checker; the controller sends activations to enable or disable each function's instrumented body.]

Because the adjustment PQ makes to the LTMP m_lt over a given adjustment interval is a function of a cumulative error term e_c, primary controller PQ behaves as an integrative controller. In contrast, each secondary controller Q_i alone maintains no state beyond p_i and τ_i. The secondary controllers are therefore a form of proportional controller, responding directly as the plant output changes. The controller parameter K_I in PQ's adjustment term K_I·e_c is known in control theory as the integrative gain. It is essentially a weight factor that determines to what extent the cumulative error e_c affects the local monitoring percentage m_lt. The larger the K_I value, the larger the changes PQ will make to m_lt during the current adjustment interval to correct for the observed overhead.

The target specification language L_P is defined in a fashion similar to the one for the secondary controllers, except that the events of the plant P are replaced by the events of the parallel composition P_1 ‖ P_2 ‖ ... ‖ P_n of all plants.

3 Design and Implementation

This section presents two applications that we have implemented for SMCO. Section 3.1 describes our integer range-analysis tool. Section 3.2 introduces our novel memory under-utilization monitor. Section 3.3 summarizes the development effort for these monitoring tools and their controllers.

3.1 Integer Range Analysis

Integer range analysis [7] determines the range (minimum and maximum value) of each integer variable in the monitored execution. These ranges are useful for finding program errors. For example, analyzing ranges on array subscripts may reveal bounds violations.

Figure 10 is an overview of range-solver, our integer range-analysis tool. range-solver uses compiler-assisted instrumentation (CAI), an instrumentation framework based on a plug-in architecture for GCC [5]. Our range-solver plug-in adds range-update operations after assignments to global, function-level static, and stack-scoped integer variables.

    if (controller("func")) goto L2; else goto L1;

    L1:  /* original, uninstrumented body */
        while (i < len) {
            total += values[i];
            i++;
        }
        return total;

    L2:  /* instrumented copy */
        while (i < len) {
            total += values[i];
            update_range("func:total", total);
            i++;
            update_range("func:i", i);
        }
        return total;

Fig. 11. range-solver adds a distributor with a call to the SMCO controller. The distributor passes control to either the original, uninstrumented function body (L1) or the instrumented copy (L2).

The Range Checker module (shown in Figure 10) consumes these updates and computes ranges for all tracked variables. Range updates are enabled or disabled on a per-function basis. In Figure 10, monitoring is enabled for functions f and g; this is reflected by the instrumented versions of their function bodies, labeled f′ and g′, appearing in the foreground.

To allow efficient enabling and disabling of monitoring, the plug-in creates a copy of the body of every function to be instrumented, and adds instrumentation only to the copy. A distributor block at the beginning of the function calls the SMCO controller to determine whether monitoring for the function is currently enabled. If so, the distributor jumps to the instrumented version of the function body; otherwise, control passes to the original, unmodified version. Figure 11 shows a function modified by the range-solver plug-in to have a distributor block and a duplicate instrumented function body. Functions without integer updates are not duplicated and always run with monitoring off.

Because monitoring is enabled or disabled at the function level, the instrumentation notifies the controller of function-call events. As shown in Figure 10, the controller responds by activating or deactivating monitoring for instrumented functions. With the global controller, there is a single "on-off" switch that affects all functions: when monitoring is off, the uninstrumented versions of all function bodies are executed. The cascade controller maintains a secondary controller for each instrumented function and can switch monitoring on and off for individual functions.

Timekeeping. As our controller logic relies on measurements of monitoring time, range-solver queries the system time whenever it makes a control decision. The RDTSC instruction is the fastest and most precise timekeeping mechanism on the x86 platform. It returns the CPU's timestamp counter (TSC), which stores a processor cycle timestamp (with sub-nanosecond resolution), without an expensive system call.


[Figure 12. Accesses (a–e) to an allocation over time, and the resulting NAPs: gaps between accesses that exceed the NAP threshold count as NAPs. NAPs can vary in length, and multiple NAPs can be reported for the same allocation.]

[Figure 13. SMCO architecture for the NAP detector: the MMU/allocator reports faults, allocs, and frees for memory regions; the NAP detector records allocations in a splay tree, and the controller turns per-region protections on and off.]

However, we found that even RDTSC can be too slow for our purposes. On our testbed, we measured the RDTSC instruction to take 45 cycles on average, more than twenty times longer than an arithmetic instruction. With time measurements necessary on every function call for our range-solver, this was too expensive. Our first range-solver implementation called RDTSC inline for every time measurement, resulting in a 23% overhead even with all monitoring turned off.

To reduce the high cost of timekeeping, we modified range-solver to spawn a separate "clock thread" to handle its timekeeping. The clock thread periodically calls RDTSC and stores the result in a memory location that range-solver uses as its clock. range-solver can read this clock with a simple memory access. This is not as precise as calling RDTSC directly, but it is much more efficient.
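A minimal sketch of this idea (our code, assuming x86 and GCC inline assembly; the refresh period is an assumed parameter trading precision for overhead):

    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Shared coarse clock, read by instrumentation with a plain load. */
    volatile uint64_t coarse_clock;

    static uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static void *clock_thread(void *arg) {
        (void)arg;
        for (;;) {
            coarse_clock = rdtsc();  /* refresh the shared clock */
            usleep(50);              /* assumed refresh period */
        }
        return NULL;
    }

    void start_clock_thread(void) {
        pthread_t t;
        pthread_create(&t, NULL, clock_thread, NULL);
    }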

3.2 Memory Under-Utilization

Our memory under-utilization detector is designed to identify allocated memory areas ("allocations" for short) that are not accessed during time periods whose duration equals or exceeds a user-specified threshold. We refer to such a time period as a Non-Accessed Period, or NAP, and to the user-specified threshold as the NAP threshold. Figure 12 depicts accesses to an allocation and the resulting NAPs. For example, the time from access b to access c exceeds the NAP threshold, so it is a NAP. Note that we are not detecting allocations that are never touched (i.e., leaks), but rather allocations that are not touched for a sufficiently long period of time to raise concerns about memory-usage efficiency.

Sampling by enabling and disabling monitoring at the function level would be ineffective for finding NAPs. An allocated region could appear to be in a NAP if some functions that access it are not monitored, resulting in false positives. Ideally, we want to enable and disable monitoring per allocation, so we never unintentionally miss accesses to monitored allocations. To achieve per-allocation control over monitoring, we introduce a novel memory-access interposition mechanism that takes advantage of memory-protection hardware. Figure 13 shows the architecture of our NAP detector. To enable monitoring for an allocation, the controller calls mprotect to turn on read and write protections for the memory page containing the allocation.

In Figure 13, memory regions m and o are protected: any access to them will cause a page fault. The NAP detector intercepts the page fault first to record the access; information about each allocation is stored in a splay tree. Then, the controller interprets the fault as an event and turns monitoring off for the faulting allocation by removing its read and write access protections. Because the NAP detector restores read and write access after a fault, the faulting instruction can execute normally once the page fault handler returns. It is not necessary to emulate the effects of a faulting instruction.
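The interposition mechanism can be sketched with standard POSIX calls (our code; the real detector also updates its splay tree and notifies the controller, and lengths are assumed to be page multiples):

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096
    #define PAGE_OF(a) ((void *)((uintptr_t)(a) & ~(uintptr_t)(PAGE_SIZE - 1)))

    /* Enable monitoring: any access to the allocation's pages now faults. */
    void enable_monitoring(void *alloc, size_t len) {
        mprotect(PAGE_OF(alloc), len, PROT_NONE);
    }

    /* Fault handler: record the access, then unprotect the page so the
     * faulting instruction reruns normally. */
    static void fault_handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        /* record access to si->si_addr in the splay tree (omitted) */
        mprotect(PAGE_OF(si->si_addr), PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    void install_fault_handler(void) {
        struct sigaction sa = { 0 };
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }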

Within the cascade controller, each allocation has a secondary controller that computes a delay d, after which time monitoring should be re-enabled for the allocation. After processing an event, the controller checks for allocations that are due for re-enabling. If no events occur for a period of time, a background thread performs this check instead. The background thread is also responsible for periodically checking for memory regions that have been protected for longer than the NAP threshold and reporting them as NAPs.

We did not integrate the NAP detector with the global controller because its means of overhead control does not allow for this kind of per-allocation control. When a monitored event e occurs and the accumulated overhead exceeds o_t, a global controller should establish a period of global zero overhead by disabling all monitoring. Globally disabling monitoring has the same problem as disabling monitoring at the function level: allocations that are unlucky enough to be accessed only during these zero-overhead periods will be erroneously classified as NAPs.

We also implemented a shared library that replaces the standard memory-management functions, including malloc and free, with versions that store information about the creation and deletion of allocations in splay trees.

To control access protections for individual allocations, each such allocation must reside on a separate hardware page, because mprotect has page-level granularity. To reduce the memory overhead, the user can set a size cutoff: allocations smaller than the cutoff pass through to the standard memory allocation functions and are never monitored (under-utilized small allocations are of less interest than under-utilized large allocations anyway). In our experiments, we chose double the page size (equal to 8KB on our test machine) as the cutoff, limiting the maximum memory overhead to 50%: with 4KB pages, the worst case is an allocation just over 8KB that must occupy three pages, wasting at most one page for every two pages of data. Though this cutoff would be high for many applications, in our experiments, 75% of all allocations were monitored.
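A minimal sketch of the cutoff logic in the malloc replacement (our code; the real library also registers allocations in its splay tree):

    #include <stdlib.h>

    #define PAGE 4096
    #define CUTOFF (2 * PAGE)   /* double the page size, as in our experiments */

    void *nap_malloc(size_t size) {
        if (size < CUTOFF)
            return malloc(size);   /* passes through; never monitored */
        /* page-aligned, page-granular allocation so it can be protected */
        void *p = aligned_alloc(PAGE, (size + PAGE - 1) & ~(size_t)(PAGE - 1));
        /* register p in the splay tree here (omitted) */
        return p;
    }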

3.3 Implementation Effort

To implement and test the range-solver tool described in Section 3.1, we developed two libraries, totaling 821 lines of code, that manage the logic for the cascade and global controllers and perform the integer range analysis. We also


developed a 1,708-line GCC plug-in that transforms functions to report integer updates to our integer range-analysis library.

The NAP detector described in Section 3.2 consists of a 2,235-line library that wraps the standard memory-allocation functions, implements the cascade controller logic, and transparently handles page faults in the instrumented program.

4 Evaluation

This section describes a series of benchmarks that together show that SMCO fulfills its goals: it closely adheres to the specified target overhead, allowing the user to specify a precise trade-off between overhead and monitoring effectiveness. In addition, our cascade controller apportions the target overhead to all sources of events, ensuring that each source gets its fair share of monitoring time.

Our results highlight the difficulty inherent in achieving these goals. The test workloads vary in behavior considerably over the course of an execution, making it impractical to predict sources of overhead. Even under these conditions, SMCO is able to control observed overhead fairly well.

To evaluate SMCO's versatility, we tested it on two workloads, one CPU-intensive and one I/O-intensive, and with our two different runtime monitors. Section 4.1 discusses our experimental testbed. Section 4.2 describes the workloads and profiles them in order to examine the challenges involved in controlling monitoring overhead. In Section 4.3, we benchmark SMCO's ability to control the overhead of our integer range analysis monitor using both of our control strategies. Section 4.4 benchmarks SMCO overhead control with our NAP detector. Section 4.5 explains how we optimized certain controller parameters.

4.1 Testbed

Since controlling overhead is most important for long-running server applications, we chose a server-class machine for our testbed. Our benchmarks ran on a Dell PowerEdge 1950 with two quad-core 2.5GHz Intel Xeon processors, each with a 12MB L2 cache, and 32GB of memory. It was equipped with a pair of Seagate Savvio 15K RPM SAS 73GB disks in a mirrored RAID. We configured the server with 64-bit CentOS Linux 5.3, using a CentOS-patched 2.6.18 Linux kernel.

For our observed overhead benchmark figures, we averaged the results of ten runs and computed 95% confidence intervals using Student's t-distribution. Error bars represent the width of a measurement's confidence interval.

We used the /proc/sys/vm/drop_caches facility provided by the Linux kernel to drop page, inode, and dentry caches before each run of our I/O-intensive workload to ensure cold caches and to prevent consecutive runs from influencing each other.

4.2 Workloads

We tested our SMCO approach on two applications: the CPU-intensive bzip2 and an I/O-intensive grep workload. The bzip2 benchmark is a data compression workload from the SPEC CPU2006 benchmark suite, which is designed to maximize CPU utilization [11]. This benchmark uses the bzip2 utility to compress and then decompress a 53MB file consisting of text, JPEG image data, and random data.

Our range-solver monitor, described in Section 3.1, found 80 functions in bzip2, of which 61 contained integer assignments, and 445 integer variables, 242 of which were modified during execution. Integer-update events were spread very unevenly among these variables. The least-updated variables were assigned only one or two times during a run, while the most-updated variable was assigned 2.5 billion times.

Figure 14 shows the frequency of accesses to two different variables, the most updated variable and the 99th most updated variable, over time. The data was obtained by instrumenting bzip2 to monitor a single specified variable with unbounded overhead. The monitoring runs for these two variables took 76.4 seconds and 55.6 seconds, respectively. The two histograms show different extremes: the most updated variable is constantly active, while accesses to the 99th most updated variable are concentrated in short periods of high activity. Both variables, however, experience heavy bursts of activity that make it difficult to predict monitoring overhead.

Our I/O-intensive workload uses GNU grep 2.5, the popular Linux regular expression search utility. In our benchmarks, grep searches the entire GCC 4.5 source tree (about 543MB in size) for an uncommon pattern. When we tested the workload with the Unix time utility, it reported that these runs typically used only 10-20% CPU time. Most of each run was spent waiting for read requests, making this an I/O-heavy workload. Because the grep workload repeats the same short tasks, we found that its variable accesses were distributed more uniformly than in bzip2. Our range-solver reported 489 variables, with 128 actually updated in each run, and 149 functions, 87 of which contained integer assignments. The most updated variable was assigned 370 million times.

4.3 Range Solver

We benchmarked the range-solver monitor discussed in Section 3.1 on both workloads using both of the controllers in Section 2. Sections 4.3.1 and 4.3.2 present our results for the global controller and cascade controller, respectively. Section 4.3.3 compares the results from the two controllers. Section 4.3.4 discusses range-solver's memory overhead.

4.3.1 Global Controller

Figure 15 shows how the global controller performs on our workloads for a range of target overheads (on the x-axis), with the observed overhead on the y-axis and the total number of events monitored on the y2-axis for each target-overhead setting.


Fig. 14. Event distribution histograms for the most updated variable (a) and the 99th most updated variable (b) in bzip2. Execution time (x-axis) is split into 0.4-second buckets; the y-axis shows the number of events in each bucket.

Fig. 15. Global controller with range-solver: observed overhead (y-axis) and events monitored (y2-axis) for a range of target overhead settings (x-axis) on the (a) bzip2 and (b) grep workloads.


Fig. 16. Cascade controller with range-solver: observed overhead (y-axis) and events monitored (y2-axis) for a range of target overhead settings (x-axis) on the (a) bzip2 and (b) grep workloads.

With target overhead set to 0%, both workloads ran with an actual overhead of 4%, which is the controller's base overhead. The base overhead is due to the controller logic and the added complexity from unused instrumentation.

The dotted line in each plot shows the ideal result: observed overhead equals target overhead up to an ideal maximum. We computed the ideal maximum to be the observed overhead from monitoring all events in the program with all control turned off. Any observed overhead above the ideal maximum is the result of overhead incurred by the controller.

At target overheads of 10% and higher, Figure 15(a) shows that the global controller tracked the specified target overhead all the way up to 140% in the bzip2 workload. The grep workload (Figure 15(b)) showed a general upward trend for increasing target overheads, but never exceeded 9% observed overhead. In fact, at 9% overhead, range-solver is already at nearly full coverage, with 99.7% of all program events being monitored. The grep workload's low CPU usage imposes a hard limit on range-solver's ability to use overhead. The controller has no way to exceed this limit. Confidence intervals for the grep workload were generally wider than for bzip2, because I/O operations are noisier than CPU operations, making run times less consistent.

4.3.2 Cascade Controller

Figure 16 shows results from experiments that are the same as those for Figure 15, except using the cascade controller instead of the global controller.


The results were similar. On the bzip2 workload, the controller tracked the target overhead well from 10% to 100%. With targets higher than 100%, the observed overhead continued to increase, but the controller was not able to push overhead high enough to reach the target, because observed overhead was already so close to the 120% maximum. On the grep workload, we saw the same upward trend and eventual saturation at 9% observed overhead, monitoring 99.5% of events.

4.3.3 Controller Comparison

The global and cascade controllers differ in the distribution of overhead across different event sources. To compare them, we developed an accuracy metric for the results of a bounded-overhead range-solver run. We measured the accuracy of a bounded-overhead run of the range-solver against a reference run with full coverage of all variables (allowing unbounded overhead). The reference run determined the actual range for every variable.

In a bounded-overhead run, the accuracy of the range computed for a single variable is the ratio of the computed range size to the range's actual size (which is known from the reference run). Missed updates in a bounded-overhead run can cause range-solver to report smaller ranges, so this ratio is always in the interval [0, 1]. For a set of variables, the accuracy is the average accuracy over all variables in the set.
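This metric can be stated formally. Writing [min_m(v), max_m(v)] for the bounds a monitored run computes for variable v, and [min_r(v), max_r(v)] for the reference run's bounds (our notation):

\[
\mathrm{acc}(v) \;=\; \frac{\max_m(v) - \min_m(v)}{\max_r(v) - \min_r(v)},
\qquad
\mathrm{acc}(S) \;=\; \frac{1}{|S|} \sum_{v \in S} \mathrm{acc}(v),
\]

where S is a set of variables. Note that a variable for which at most one update was monitored has a computed range of size zero, and hence zero accuracy, which matches how the results below are reported.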

Figure 17 shows a breakdown of range-solver's accuracy by how frequently variables are updated. We grouped variables into sets with geometrically increasing bounds: the first set contains variables with 1-10 updates, the second variables with 10-100 updates, and so on. Figure 17(a) shows the accuracy for each of these sets, and Figure 17(b) shows the cumulative accuracy, with each set containing the variables from the previous set.
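Equivalently, under this grouping a variable with u > 0 updates falls into bin number

\[
\mathrm{bin}(v) \;=\; \big\lfloor \log_{10} u(v) \big\rfloor
\]

(our formalization; boundary counts such as exactly 10 updates could equally be assigned to either neighboring set).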

We used a 10% target overhead for these examples, because we believe that low target overheads represent the most likely use cases. However, we found similar results for all other target overhead values that we tested.

The cascade controller's notion of fairness results in better coverage, and thus better accuracy, for rarely updated variables. In this example, the cascade controller had better accuracy than the global controller for variables with fewer than 100 updates. As the global controller does not seek to distribute overhead fairly to these variables, it monitored a smaller percentage of their updates. Most dramatically, Figure 17(a) shows that the global controller had zero accuracy for all variables in the 10-100 updates range, meaning it did not monitor more than one event for any variable in that set. The three variables in the workload with 10-100 updates were used while there was heavy activity, causing their updates to be lost during the periods when the global controller had to disable monitoring to reduce overhead.

However, with the same overhead, the global controller was able to monitor many more events than the cascade controller, because it did not spend time executing the cascade controller's more expensive secondary controller logic.

Table 1. range-solver memory usage, including executable size (Exe Size), virtual memory usage (VSZ), and physical memory usage (RSS).

                     Exe Size   VSZ      RSS
bzip2 (unmodified)   68.6KB     213KB    207KB
bzip2 (global)       262KB      227KB    203KB
bzip2 (cascade)      262KB      225KB    201KB
grep (unmodified)    89.2KB     61.4MB   1260KB
grep (global)        314KB      77.1MB   1460KB
grep (cascade)       314KB      78.2MB   1470KB

These extra events gave the global controller much better coverage for frequently updated variables. Specifically, it had better accuracy for variables with more than 10^6 updates.

Between these two extremes, i.e., for variables with 100-10^6 updates, both approaches had similar accuracy. The cumulative accuracy in Figure 17(b) shows that overall, considering all variables in the program, the two controllers achieved similar accuracy. The difference is primarily in where the accuracy was distributed.

4.3.4 Memory Overhead

Although range-solver does not use SMCO to control its memory overhead, we measured the memory use of our controllers for both workloads. Table 1 shows our memory overhead results. Here, Exe Size is the size of the compiled binary after stripping debugging symbols (as is common in production environments). This size includes the cost of the SMCO library, which contains the compiled controller and monitor code. VSZ is the total amount of memory mapped by the process, and RSS (Resident Set Size) is the total amount of that virtual memory stored in RAM. We obtained the peak VSZ and RSS for each run using the Unix ps utility.

Both binaries increased in size by 3-4 times. Most of this increase is the result of function duplication, which at least doubles the size of each instrumented function. Duplicated functions also contain a distributor block and instrumentation code. The 17KB SMCO library adds a negligible amount to the instrumented binary's size. As few binaries are more than several megabytes in size, we believe that even a 4x increase in executable size is acceptable for most environments, especially given the increasing amounts of RAM in popular 64-bit systems.
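The duplication scheme can be pictured with a small sketch (the names are ours, not the implementation's): each instrumented function gets a monitored and an unmonitored copy, and the distributor block dispatches between them based on a flag that the controller toggles.

    /* Sketch of function duplication with a distributor block;
     * illustrative only (SMCO generates the copies at compile time).
     * Flipping one flag toggles monitoring, so the unmonitored copy
     * runs with no per-event checks at all. */
    extern volatile int foo_monitoring_enabled;   /* set by controller */
    extern void probe_integer_update(long value); /* hypothetical probe */

    static int foo_unmonitored(int x)
    {
        int y = x * x;          /* original body, no instrumentation */
        return y;
    }

    static int foo_monitored(int x)
    {
        int y = x * x;
        probe_integer_update(y); /* instrumented copy reports events */
        return y;
    }

    int foo(int x)               /* distributor block */
    {
        return foo_monitoring_enabled ? foo_monitored(x)
                                      : foo_unmonitored(x);
    }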

The worst-case increase in virtual memory use was only 27.4%, for the grep workload with the cascade controller. The additional virtual memory is allocated statically to store integer variable ranges and per-function overhead measurements (when the cascade controller is used). This extra memory scales linearly with the number of integer variables and functions in the monitored program, not with runtime memory usage. The bzip2 workload uses more memory than grep, so in that case we measured a much lower virtual memory overhead of 6.6%.


Fig. 17. Comparison of range-solver accuracy for both controllers (global at 100Hz, cascade with K_I = 0.1) on the bzip2 workload: (a) non-cumulative, (b) cumulative. Variables are grouped by total number of updates (log scale).

4.4 NAP Detector

Figure 18 shows the results of our NAP detector, described in Section 3.2, on bzip2 using the cascade controller. The NAP detector's instrumentation incurs no overhead while ignoring events, so it has a very low base overhead: only 1.1%. The NAP detector also tracks the user-specified target overhead well from 10-140%. The results also show that the NAP detector takes advantage of any extra overhead that the user allows it. As target overhead increased, the number of monitored events scaled smoothly from only 300,000 to over 4 million.

Because the global controller is not designed to use per-allocation sampling (as explained in Section 3.2), we used only cascade control for our NAP detector experiments.

Table 2. NAP detector memory usage, including executable size (Exe Size), virtual memory usage (VSZ), and physical memory usage (RSS).

                     Exe Size   VSZ      RSS
bzip2 (unmodified)   68.6KB     213KB    207KB
bzip2 (cascade)      89.4KB     225KB    209KB

Table 2 shows memory overhead for the NAP detector. Exe Size, VSZ, and RSS are the same as in Section 4.3.4. Although it is possible for the NAP detector to increase memory usage by as much as 50% (see Section 3.2), the bzip2 benchmark experienced only a 5.3% increase in virtual memory usage (VSZ).


Fig. 18. Cascade controller with NAP detector: observed overhead (y-axis) and number of monitored memory access events (y2-axis) for a range of target overhead settings (x-axis) on the bzip2 workload.

All of the allocations monitored in the benchmark were 10-800 times larger than the page size, so forcing these allocations onto separate pages resulted in very little wasted space. The 20.8KB increase in Exe Size is from the statically linked SMCO library.
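Although Section 3.2 gives the full design, the page-level mechanism can be sketched as follows; the helper names are ours:

    /* Sketch of page-granularity access detection (illustrative; the
     * NAP detector's real implementation is described in Section 3.2).
     * Each tracked allocation is rounded up to whole pages so that
     * revoking page permissions lets the MMU report the next access. */
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void *nap_alloc(size_t size)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t len = (size + page - 1) & ~(page - 1); /* round up */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    void nap_watch(void *p, size_t len)
    {
        /* Revoke access; the next read or write raises SIGSEGV, whose
         * handler (not shown) records the access and restores access
         * with mprotect(p, len, PROT_READ | PROT_WRITE). */
        mprotect(p, len, PROT_NONE);
    }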

4.5 Controller Optimization

This section describes how we chose values for several of our control parameters in order to get the best performance from our controllers. Section 4.5.1 discusses our choice of clock precision for time measurements. Section 4.5.2 explains how we chose the integrative gain and adjustment interval for the primary controller. Section 4.5.3 discusses optimizing the adjustment interval for an alternate primary controller.

4.5.1 Clock Frequency

The control logic for our range-solver implementation uses a clock thread, as described in Section 3.1, which trades off precision for efficiency. This thread can keep more precise time by updating its clock more frequently, but more frequent updates result in higher timekeeping overhead. Recall that the RDTSC instruction is relatively expensive, taking 45 cycles on our testbed.
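A minimal sketch of such a clock thread follows; it assumes POSIX threads and the x86 __rdtsc() intrinsic, and the names are illustrative rather than the implementation's:

    /* Illustrative clock-thread sketch: one thread reads the TSC at a
     * fixed frequency and publishes it, so instrumentation pays a
     * cheap memory load per time check instead of a 45-cycle RDTSC.
     * Started once with pthread_create(). */
    #include <pthread.h>
    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>              /* __rdtsc() */

    volatile uint64_t smco_now;         /* read by instrumentation */

    static void *clock_thread(void *arg)
    {
        long hz = *(long *)arg;         /* e.g., 100 for a 100Hz clock */
        struct timespec period = { 0, 1000000000L / hz };
        for (;;) {
            smco_now = __rdtsc();       /* one RDTSC per tick */
            nanosleep(&period, NULL);
        }
        return NULL;
    }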

We performed experiments to determine how much precision was necessary to control overhead accurately. Figure 19 shows the range-solver benchmark in Figure 15 repeated for four different clock frequencies. The clock frequency is how often the clock thread wakes up to read the TSC.

At only 10Hz, the controller's time measurements were not accurate enough to keep the overhead below the target. With infrequent updates, most monitoring operations occurred without an intervening clock tick and therefore appeared to take 0 seconds. The controller ended up with an underestimate of the actual monitoring overhead and thus overshot its goal.

At 100Hz, however, controller performance was good, and the clock thread's impact on the system was still negligible, incurring the 45-cycle RDTSC cost only one hundred times for every 2.5 billion processor cycles on our test system. More frequent updates did not perform any better and wasted resources, so we chose 100Hz for our clock update frequency.

In choosing the clock frequency, we also wanted to ensure that we preserved SMCO's effectiveness. Figure 20 shows the accuracy of the range-solver at 10% target overhead using the four clock frequencies we tested. We used the same accuracy metric as in Section 4.3.3 and plotted the results as in Figure 17. The range-solver accuracy is similar for the 100Hz, 1000Hz, and 2500Hz clocks. These three values resulted in similar observed overheads in Figure 19. It therefore makes sense that they achieve similar accuracy, since SMCO is designed to support trade-offs between effectiveness and overhead. The 10Hz run has the best accuracy, but this result is misleading, because it attains that accuracy at the cost of higher overhead than the user-requested 10%. Testing these clock frequencies at higher target overheads showed similar behavior. Note that the 100Hz curves in Figure 20 are the same as the global controller curves from the controller comparison in Figure 17.

4.5.2 Integrative Gain

The primary controller component of the cascade controller discussed in Section 2 has two parameters: the integrative gain K_I and the adjustment interval T. The integrative gain is a weight factor that determines how much the cumulative error e_c (the deviation of the observed monitoring percentage from the target monitoring percentage) changes the local target monitoring percentage m_lt (the maximum percentage of execution time that each monitored plant is allowed to use for monitoring).


Fig. 19. Observed overhead for global controller clock frequencies: four clock frequencies (10Hz, 100Hz, 1000Hz, 2500Hz) on the (a) bzip2 and (b) grep workloads with range-solver instrumentation.


Fig. 20. Accuracy of global controller clock frequencies: four clock frequencies with range-solver instrumentation, (a) non-cumulative and (b) cumulative. Variables are grouped by total number of updates (log scale).

When K_I is high, the primary controller makes larger changes to m_lt to correct for observed deviations.

The adjustment interval is the period of time between primary controller updates. With a low T value, the controller adjusts m_lt more frequently.
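Given these definitions, a plausible form of the primary controller's update, applied once per adjustment interval T, is the standard integral law (our notation; Section 2 gives the precise control law):

\[
m_{lt} \;\leftarrow\; m_{lt} \,+\, K_I \, e_c ,
\]

so that a persistent shortfall in observed monitoring (positive e_c) gradually raises each plant's local budget, and overshoot lowers it.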

There are established procedures for choosing good control parameters in most applications of control theory, but our system is tolerant enough that good values for K_I and T can be determined experimentally.

We tuned the controller by running range-solver on an extended bzip2 workload with a range of values for K_I and T, and then recording how the primary controller's output variable, m_lt, stabilized over time. Figure 21 shows our results for four K_I values and three T values, with target overhead set to 20% for all runs.


Fig. 21. Local target monitoring percentage (m_lt) over time during the bzip2 workload for range-solver with cascade control, with target overhead set to 20%, for four values of K_I (0.1, 0.5, 1.0, 2.0) and three values of T: (a) T = 0.4s, (b) T = 2s, (c) T = 10s.


Fig. 22. Observed overhead for primary controller K_I values: range-solver with T = 400ms and four K_I values (0.1, 0.5, 1.0, 2.0).

These results revealed trends in the effects of adjusting the controller parameters.

In general, we found that higher values of K_I increased controller responsiveness, allowing m_lt to compensate more quickly for under-utilization of monitoring time, but at the cost of driving higher-amplitude oscillations in m_lt. This effect is most evident with T = 0.4s (see Figure 21(a)). All the values we tested from K_I = 0.1 to K_I = 2.0 successfully met our overhead goals, but values greater than 0.1 oscillated wildly enough that the controller sometimes had to turn monitoring off completely (by setting m_lt to almost 0) to compensate for previous spikes in m_lt. With K_I = 0.1, however, m_lt oscillated stably for the duration of the run after a 50-second warm-up period.

When we changed T to 2s (see Figure 21(b)), we observed that 0.1 was no longer the optimal value for K_I. But with K_I = 0.5, we were able to obtain performance as good as the optimal performance with T = 0.4s: the controller met its target overhead goal and had the same 50-second warm-up time. We do see the consequences of choosing too small a K_I, however: with K_I = 0.1, the controller was not able to finish warming up before the benchmark finished, and the system failed to achieve its monitoring goal.

The same problem occurred when T = 10s (see Figure 21(c)): the controller updated so infrequently that it completed its warm-up only for the highest K_I value we tried. Even with the highest K_I, the controller still undershot its monitoring percentage goal.

Because of its stability, we chose K_I = 0.1 (with T = 0.4s) for all of our cascade-controlled range-solver experiments with low target overhead. Figure 22 shows how the controller tracked target overhead for all the K_I values we tried. Although K_I = 0.1 worked well for low overheads, we again observed warm-up periods that were too long when target overhead was very high.

The warm-up period took longer with very high overhead because of the larger gap between target overhead and the initial observed overhead. To deal with this discrepancy, we actually use two K_I values: K_I = 0.1 for normal cases, and a special high-overhead gain K_I^H for cases with target overhead greater than 70%. We chose K_I^H = 0.5 using the same experimental procedure we used for K_I.

4.5.3 Adjustment Interval

Before settling on the integrative control approach that we use for our primary controller, we attempted an ad hoc control approach that yielded good results for the range-solver but was not stable enough to control the NAP detector. We also found the ad hoc primary controller more difficult to adjust than the integrative controller. Though we no longer use the ad hoc approach, the discussion of how we tuned it illustrates some of the properties of our system.

Rather than computing error as a difference, the ad hoc controller computes the error e_f fractionally, as the Global Target Monitoring Percentage (GTMP, as in Section 2.3.2) divided by the observed global monitoring percentage for the entire program run so far. The value of e_f is greater than one when the program is under-utilizing its monitoring time and less than one when the program is using too much monitoring time. After computing the fractional error, the controller computes a new value for m_lt by multiplying it by e_f.
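In symbols, writing \hat{m} for the observed global monitoring percentage accumulated so far:

\[
e_f \;=\; \frac{\mathrm{GTMP}}{\hat{m}}, \qquad m_{lt} \;\leftarrow\; m_{lt} \cdot e_f ,
\]

so under-utilization (e_f > 1) scales the local target up and over-utilization scales it down.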

The ad hoc primary controller has only one parameter to adjust. Like the integrative controller, it has an interval time T between updates to m_lt.

Figure 23 shows the range-solver benchmark in Figure 16 repeated for four values of T. For the bzip2 workload (Figure 23(a)), the best results were obtained with a T interval of 10 seconds. Smaller T values led to an unstable primary controller: the observed overhead varied from the target, and results became more random, as shown by the wider confidence intervals for these measurements.


Fig. 23. Observed overhead for an ad hoc cascade controller's T values: four values of T (400ms, 2s, 10s, 20s) on the (a) bzip2 and (b) grep range-solver workloads.


This result contradicts the intuition that a smaller T should stabilize the primary controller by allowing it to react faster. In fact, the larger T gives the controller more time to correctly observe the trend in utilization among monitoring sources. For example, with a short T, the first adjustment interval of bzip2's execution used only a small fraction of all the variables in the program. The controller quickly adjusted m_lt very high to compensate for the monitoring time that the unaccessed variables were failing to use. In following intervals, the controller needed to adjust m_lt down sharply to offset the overhead from many monitoring sources becoming active at once. The controller spent the rest of its run oscillating to correct its early error, never reaching equilibrium during the entire run.

Figure 24 shows how the observed monitoring percentage for each adjustment interval fluctuates during an extended range-solver run of the bzip2 workload as a result of the primary controller's m_lt adjustments. With T = 0.4 seconds, the observed monitoring percentage spikes early on, so observed overhead is actually very high at the beginning of the program. Near 140 seconds, the controller overreacts to a change in program activity, causing another sharp spike in observed monitoring percentage. As execution continues, the observed monitoring percentage sawtooths violently, and there are repeated bursts of time when the observed percentage is much higher than the target percentage (meaning observed overhead is much higher than the user's target overhead).


Fig. 24. Observed monitoring percentage over a bzip2 range-solver execution: the percentage of each adjustment interval spent monitoring, for T = 400ms and T = 10s. The target monitoring percentage is shown as a dotted horizontal line.


With T = 10 seconds, the observed monitoring percentage still fluctuates, but the extremes do not vary as far from the target. As execution continues, the oscillations dampen, and the system reaches stability.

In our bzip2 workload, the first few primary controller intervals were the most critical: bad values at the beginning of execution were difficult to correct later on. The more reasonable 10-second T made its first adjustment after a larger sample of program activity, so it did not overcompensate. Overall, we expect that a primary controller with a longer T is less likely to be misled by short bursts or lulls in activity.

There is a practical limit to the length of T, however. In Figure 23(a), a controller with T = 20 seconds overshot its target overhead. Because the benchmark runs for only about one minute, the slower primary controller was not able to adjust m_lt often enough to converge on a stable value before the benchmark ended.

As in Section 4.5.1, we also tested how the choice of T affects range-solver's accuracy, using the same accuracy metric as in Section 4.3.3. Figure 25 shows the accuracy for the bzip2 workload using four different values of T with a 10% target overhead. The accuracy results confirm our choice of 10 seconds. Only the 20-second value of T yields better accuracy, but it does so because it consumes more overhead than the primary controller with T = 10 seconds. Tests with higher target overheads gave similar results.

Evaluation Summary. Our results show that in all the cases we tested, SMCO was able to track the user-specified target overhead for a wide range of target overhead values. Even when there were too few events during an execution to meet the target overhead goal, as was the case for the grep workload with target overheads greater than 10%, SMCO's controller was able to achieve the maximum possible observed overhead by monitoring nearly all events. We also showed that the overhead trade-off is a useful one: higher overheads allowed for more effective monitoring. These results are for challenging workloads with unpredictable bursts in activity.

Although our results relied on choices for several parameters, we found that it was practical to find good values for all of these parameters empirically. As future work, we plan to explore procedures for automating the selection of optimal controller parameters, which can vary with different types of monitoring.

5 Related Work

Chilimbi and Hauswirth [10] propose an overhead-control mechanism for memory under-utilization detection that approximates memory-access sampling by monitoring a program's basic blocks for memory accesses. Low overhead is achieved by reducing the sampling rate of blocks that generate a high rate of memory accesses. The aim of their sampling policy is, however, to reduce overhead over time; once monitoring is reduced for a particular piece of code, it is never increased again, regardless of that code's future memory-access behavior. As a NAP (Non-Accessed Period) is most commonly associated with a memory allocation and not a basic block, there is no clear association between the sampling rate for blocks and confidence that some memory region is under-utilized. In contrast, SMCO's NAP detector uses virtual-memory hardware to directly monitor areas of allocated memory, and overhead control is realized by enabling and disabling the monitoring of memory areas appropriately.


Fig. 25. Accuracy of cascade controller T values: four values of T (400ms, 2s, 10s, 20s) on the bzip2 workload with range-solver instrumentation, (a) non-cumulative and (b) cumulative. Variables are grouped by total number of updates (log scale).

Artemis [8] reduces CPU overhead due to runtime monitoring by enabling monitoring for only certain function executions. By default, Artemis monitors a function execution only if the function is called in a context that it has not seen before. In theory, a function's context consists of the values of all memory locations it accesses. Storing and comparing such contexts would be too expensive, so Artemis actually uses weaker definitions of "context" and "context matching," which may cause it to miss some interesting behaviors [8]. Artemis' context-awareness is orthogonal to SMCO's feedback-control approach. SMCO could be augmented to prefer new contexts, as Artemis does, in order to monitor more contexts within an overhead bound. The main difference between Artemis and SMCO is in the goals they set for overhead management. Artemis uses context awareness to reduce, but not bound, the overhead. The overhead from monitoring a function in every context varies during program execution and can be arbitrarily high, especially early in the execution, when most contexts have not yet been seen. The user does not know in advance whether the overhead will be acceptable. In contrast, SMCO is designed to always keep overhead within a user-specified bound. As a result of this different goal, SMCO uses very different techniques than Artemis. In summary, overhead-reduction techniques like those in Artemis are useful but, unlike SMCO, they do not directly address the problem of unpredictable and unacceptably high overheads.


Liblit et al.'s statistical debugging technique [13] seeks to reduce per-process monitoring overhead by partitioning a monitoring task across many processes running the same program. This approach is attractive when applicable but has several limitations: it is less effective when an application is being tested and debugged by a small number of developers; privacy concerns may preclude sharing of monitoring results from deployed applications; and for some monitoring tasks, it may be difficult to partition the task evenly or correlate monitoring data from different processes. Furthermore, in contrast to SMCO, statistical debugging does not provide the ability to control overhead based on a user-specified target level.

We first introduced the concept of a user-specified target overhead, along with the idea of using feedback control to enforce that target, in an NSF-sponsored workshop [4]. The controller we implemented in that workshop is the basis for our cascade controller.

The Quality Virtual Machine (QVM) [3] also supports runtime monitoring with a user-specified target overhead. Its overhead-management technique is similar to our cascade controller in that it splits control into global and local overhead targets and prioritizes infrequently executed probes to increase coverage. The QVM overhead manager assigns a sampling frequency to each overhead source, and it adjusts the sampling frequency to control overhead. Unlike SMCO, which uses an integral controller to adjust local target monitoring percentages, these adjustments are made by an ad hoc controller that lacks theoretical foundations. The section on the overhead manager in QVM does not describe (or justify) the computations used to adjust sampling frequency.

In its benchmarks, QVM did not track overhead goals as accurately as SMCO. For example, configured for a 20% target overhead, two of QVM's benchmarks, eclipse and fop, showed an actual overhead of about 26% [3]. Though we cannot test Java programs, our bzip2 workload is a comparable CPU-bound benchmark. SMCO's worst case for bzip2 was a 22.7% overhead with NAP detector instrumentation, shown in Figure 18. This suggests that controllers like SMCO's, which are based on established control theory principles, can provide more accurate control. This is not surprising, since meeting an overhead goal is inherently a feedback control problem.

SMCO achieves accurate overhead control even when monitored events happen very frequently. Our range-solver attempts to track every integer assignment in a program, handling many more events than QVM's monitors, which track calls to specified methods. We measured the rate of total events per second in benchmarks from both systems: the rate of integer assignment events in our bzip2 workload and the rate of potentially monitored method calls in the DaCapo benchmarks that QVM uses. The bzip2 event rate was more than fifty times the DaCapo rate. Our technique of toggling monitoring at the function level, using function duplication, makes it possible to cope with high event rates by reducing the number of controller decisions. Even with this technique, we found that direct calls to RDTSC by the controller, an approach that is practical for the low event rates in QVM's benchmarks, are too expensive at high event rates like those in range-solver. We overcame this problem by introducing a separate clock thread, as described in Section 3.1.

QVM and SMCO make sampling decisions at different granularities, reflecting their orientation towards monitoring different kinds of properties. QVM makes monitoring decisions at the granularity of objects. In theory, for each object, QVM monitors all relevant operations (method invocations) or none of them. In practice, this is problematic, because the target overhead may be exceeded if QVM decides to track an object that turns out to have a high event rate. QVM deals with this by using emergency shutdown to abort monitoring of such objects. In contrast, SMCO is designed for monitors that do not need to observe every operation on an object to produce useful results. For example, even if monitoring is temporarily disabled for a memory allocation, our NAP detector can resume monitoring the allocation and identify subsequent NAPs.

QVM and SMCO are designed for very different execution environments. QVM operates in a modified Java Virtual Machine (JVM). This makes implementation of efficient monitoring considerably easier, because the JVM sits conveniently between the Java program and the hardware. SMCO, on the other hand, monitors C programs, for which there is no easy intermediate layer in which to implement interception and monitoring. Monitoring of C programs is complicated by the fact that C is weakly typed, pointers and data can be manipulated interchangeably, and all memory accesses are effectively global: any piece of code in C can potentially access any memory address via any pointer. Low-level instrumentation techniques allow SMCO to control overhead in spite of these complications: function duplication reduces overhead from instrumentation that is toggled off, and our novel use of memory-management hardware allows efficient tracking of accesses to heap objects.

6 Conclusions and Future Work

We have presented Software Monitoring with Controllable Overhead (SMCO), an approach to overhead control for the runtime monitoring of software. SMCO is optimal in the sense that it monitors as many events as possible without exceeding the target overhead level. This is distinct from other approaches to software monitoring that promise low or adaptive overhead, but where overhead, in fact, varies per application and under changing usage conditions. The key to SMCO's performance is the use of underlying control strategies for a nonlinear control problem represented in terms of the composition of timed automata.

Using SMCO as a foundation, we developed two sophisticated monitoring tools: an integer range analyzer, which uses code-oriented instrumentation, and a NAP detector, which uses memory-oriented instrumentation. Both the per-function checks in the integer range analyzer and the per-memory-area checks in the NAP detector are activated and deactivated by the same generic controller, which achieves a user-specified target overhead with either of these systems running.


Our extensive benchmarking results demonstrate that it is possible to perform runtime monitoring of large software systems with fixed target overhead guarantees. As such, SMCO is promising both for developers, who desire maximal monitoring coverage, and for system administrators, who need a way to effectively manage the impact of runtime monitoring on system performance. Moreover, SMCO is fully responsive to both increases and decreases in system load, even for highly bursty workloads, and for both CPU- and I/O-intensive applications. Administrators therefore need not worry about unusual effects in instrumented software caused by load spikes.

Future work. There are many potential uses for SMCO, including such varied techniques as lockset profiling and checking, runtime type checking, feedback-directed algorithm selection, and intrusion detection. SMCO could also be used to manage disk or network time instead of CPU time. For example, a background file-system consistency checker could use SMCO to ensure that it gets only a specific fraction of disk time, and a background downloader could use SMCO to ensure that it consumes only a fixed proportion of network time.

Though our cascade controller adheres to its target overhead goals very well, it depends on somewhat careful adjustment of the K_I control parameter. Changes to the monitor can alter the system enough to require retuning K_I. It would be useful to automate the process of optimizing K_I for low and high overheads, so that developing new monitors does not require tedious experimentation. An automated procedure could model the response time and level of oscillation as a function of K_I in order to find a good trade-off.

We also believe that SMCO as a technique can be improved. One improvement would be to handle dependencies between monitored events in cases where correctly monitoring an event requires information from previous events. For example, if a runtime type checker fails to monitor the event that initializes an object's type, it would be useless to monitor type-checking events for that object; the controller should therefore be able to ignore such events.

7 Acknowledgments

The authors would like to thank the anonymous reviewers for their invaluable comments and suggestions. They would also like to thank Michael Gorbovitski for insightful discussions and help with the benchmarking results. Research supported in part by AFOSR Grant FA9550-09-1-0481, NSF Grants CNS-0509230, CCF-0926190, and CNS-0831298, and ONR Grant N00014-07-1-0928.

References

1. A. Aziz, F. Balarin, R. K. Brayton, M. D. Dibenedetto, A. Saldanha, and A. L. Sangiovanni-Vincentelli. Supervisory control of finite state machines. In P. Wolper, editor, 7th International Conference on Computer Aided Verification, volume 939, pages 279-292, Liege, Belgium, 1995. Springer-Verlag.

2. R. Alur and D. L. Dill. A theory of timed automata. Theoretical Computer Science, 126(2):183-235, 1994.

3. M. Arnold, M. Vechev, and E. Yahav. QVM: An efficient runtime for detecting defects in deployed systems. In Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Nashville, TN, October 2008. ACM.

4. S. Callanan, D. J. Dean, M. Gorbovitski, R. Grosu, J. Seyster, S. A. Smolka, S. D. Stoller, and E. Zadok. Software monitoring with bounded overhead. In Proceedings of the 2008 NSF Next Generation Software Workshop, in conjunction with the 2008 International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL, April 2008.

5. S. Callanan, D. J. Dean, and E. Zadok. Extending GCC with modular GIMPLE optimizations. In Proceedings of the 2007 GCC Developers' Summit, Ottawa, Canada, July 2007.

6. B. Cantrill, M. W. Shapiro, and A. H. Leventhal. Dynamic instrumentation of production systems. In Proceedings of the Annual USENIX Technical Conference, pages 15-28, 2004.

7. L. Fei and S. P. Midkiff. Artemis: Practical runtime monitoring of applications for errors. Technical Report TR-ECE-05-02, Electrical and Computer Engineering, Purdue University, 2005. docs.lib.purdue.edu/ecetr/4/.

8. L. Fei and S. P. Midkiff. Artemis: Practical runtime monitoring of applications for execution anomalies. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '06), Ottawa, Canada, June 2006.

9. G. F. Franklin, J. D. Powell, and M. Workman. Digital Control of Dynamic Systems, Third Edition. Addison Wesley Longman, Inc., 1998.

10. M. Hauswirth and T. M. Chilimbi. Low-overhead memory leak detection using adaptive statistical profiling. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2004), pages 156-164, October 2004.

11. J. L. Henning. SPEC CPU2006 benchmark descriptions. Computer Architecture News, 34(4):1-17, September 2006.

12. C. A. R. Hoare. Communicating sequential processes. Communications of the ACM, 21:666-677, August 1978.

13. B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. In Proceedings of the 2003 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '03), San Diego, CA, June 2003.

14. R. Moore. A universal dynamic trace for Linux and other operating systems. In Proceedings of the 2001 USENIX Annual Technical Conference, June 2001.

15. P. J. Ramadge and W. M. Wonham. Supervisory control of a class of discrete event systems. SIAM Journal on Control and Optimization, 25(1):206-230, 1987.

16. P. J. Ramadge and W. M. Wonham. Supervisory control of timed discrete-event systems. IEEE Transactions on Automatic Control, 38(2):329-342, 1994.

17. J. Seward, N. Nethercote, and J. Fitzhardinge. Valgrind. http://valgrind.kde.org, August 2004.

18. Q.-G. Wang, Z. Ye, W.-J. Cai, and C.-C. Hang. PID Control for Multivariable Processes. Lecture Notes in Control and Information Sciences, Springer, March 2008.

19. H. Wong-Toi and G. Hoffmann. The control of dense real-time discrete event systems. In Proceedings of the 30th Conference on Decision and Control, pages 1527-1528, Brighton, UK, 1991.