
Aequitas: Coordinated Energy Management Across Parallel Applications

Haris Ribic, SUNY Binghamton

Binghamton NY 13902, [email protected]

Yu David Liu, SUNY Binghamton

Binghamton NY 13902, [email protected]

ABSTRACT

A growing number of energy optimization solutions operate at the application runtime level. Despite delivering promising results, these application-scoped optimizations are fundamentally greedy: they assume exclusive access to power management and often perform poorly when multiple power-managing applications co-exist, or when different threads of the same application share power management hardware. In this paper, we introduce Aequitas, a first step toward addressing this critical yet largely overlooked problem. The insight behind Aequitas is that co-existing applications may view power-managing hardware as a shared resource and coordinate power management decisions. As a concrete instance of this philosophy, we evaluated our ideas on top of a state-of-the-art energy-efficient work-stealing runtime. Experiments show that without Aequitas, multiple co-existing power-managing application runtimes suffer up to 32% performance loss and negate all power savings. With Aequitas, the beneficial energy-performance tradeoff reported in the single-application setting (12.9% energy savings and 2.5% performance loss) can be retained, but in a much more challenging setting where multiple power-managing runtimes co-exist on parallel architectures and multiple CPU cores share the same power domain.

CCS Concepts
• Software and its engineering → Power management
• Computing methodologies → Parallel programming languages

Keywords
Energy Management; Parallelism; Work Stealing; DVFS

1. INTRODUCTION


ICS ’16, May 29-June 02, 2016, Istanbul, Turkey

© 2016 ACM. ISBN 978-1-4503-4361-9/16/05 ... $15.00

DOI: http://dx.doi.org/10.1145/2925426.2926260

Figure 1: Power Management Scenarios. ((a) Co-existing applications; (b) co-managed power domains. Legend: Application, Application Thread, Power Domain, CPU Core.)

As cloud servers, scientific computing clusters, and even mobile multi-cores increasingly dominate today's computing world, energy efficiency of parallel systems is becoming critical to the future of computing. Energy efficiency has become a first-class design goal with direct consequences on operational cost, reliability, usability, maintainability, and environmental sustainability. Two philosophies largely guide existing solutions: those that view applications as a black box and those that view them as a white box. Black-box solutions make power/energy management decisions based on hardware performance counters and operating system states during application execution (e.g., [19]). In contrast, white-box solutions make optimization decisions based on inspection of the application runtime, such as control-flow and data-flow dependencies between application threads [20, 44, 16, 41, 3, 2, 36, 8, 21, 48, 47, 42, 45]. At a time when pure black-box solutions have become well established and the options for their further optimization dwindle, white-box solutions are gaining attention, especially for their use of application-specific knowledge for optimization and their suitability for fine-grained energy management.

Promising as they are, white-box optimizations face a fundamental hurdle to wide adoption: they are greedy in nature when viewed from the whole-system perspective. These optimizations are often designed as if the application of concern were the only one executing on the underlying system and could make exclusive power management decisions. However, this is often too idealistic for real-world deployments. Consider a white-box solution that uses application runtime knowledge to guide Dynamic Voltage and Frequency Scaling (DVFS) [15, 7] – a strategy commonly adopted by existing energy-aware programming languages [36, 3, 8] and compilers [20, 44, 16] – and is deployed in the two scenarios presented in Figure 1:


• Co-Existing Applications: Multiple applications often share the same underlying system. Assume applications A and B co-exist as in Figure 1(a). Consider a white-box approach guiding A and its thread XA to set the frequency to 2GHz, whereas it is guiding B and its thread XB to set the frequency to 1GHz. The two are deployed on the same CPU core. Which frequency should be set?

• Co-Managed Power Domains: CPU cores often share the same power management hardware, as in Figure 1(b). Consider a white-box approach guiding A and its thread XA to set the frequency to 1GHz and YA to set the frequency to 2GHz. The two are deployed on two cores within the same power domain. Which frequency should be set?

This paper presents Aequitas, an effective strategy to coordinate energy management across multiple multi-threaded applications over parallel architectures. The key insight is that the challenges faced by Co-Existing Applications and Co-Managed Power Domains are a symptom of contention over shared power-management hardware – either by different applications or by different threads within the same application. The core of Aequitas is a coordinated round-robin algorithm that allows different applications and different threads within each application to access the underlying hardware power management within their share of time.

To evaluate our system, we selected Hermes [34] as a baseline. Hermes is an energy-efficient multi-threaded work-stealing language runtime with a sophisticated algorithm to make semantics-aware DVFS decisions. Originally, Hermes reported 12.9% energy savings and 2.5% performance loss in a single-application optimization setting where each thread is deployed to a distinct power domain. However, in the presence of Co-Existing Applications and Co-Managed Power Domains, all reported energy management benefits of Hermes are erased, with a significant performance loss of 32% in some cases (see Section 2). In contrast, Aequitas is capable of retaining nearly all of the benefits reported by Hermes, but under the much more challenging scenarios where multiple applications co-exist and their threads are deployed on cores that share power domains, i.e., the scenarios presented in Figure 1.

During our design space exploration, we also considered alternative contention resolution policies for shared power management, such as a first-come-first-serve style or a policy that averages the CPU frequencies requested by contending parties. Comparatively, Aequitas achieves the best energy-performance tradeoff. Our results are stable across a variety of (homogeneous and heterogeneous) combinations of applications and different configurations of the underlying systems. Furthermore, the Aequitas design is independent of the specifics of the Hermes algorithm.

To the best of our knowledge, Aequitas is the first system of its kind to gracefully extend application-runtime-level energy optimization to a more holistic setting. Concretely, our work makes the following contributions:

• A practical system to coordinate multiple applications guided by greedy energy optimizations on shared power domains, with a key insight that transforms an energy management problem into a contention management problem.

• A novel design with time-sliced power management, where each application and each thread within an application receives a share of time to regulate power. The time-slicing strategy for power management is further decoupled from that of OS-level scheduling.

• A thorough design space exploration covering alternative contention resolution policies beyond time slicing, and an "under the hood" study to analyze the reasons behind Aequitas' effectiveness.

Figure 2: An illustration of worker-to-module mapping in Hermes. (Legend: Application, Power Domain, CPU Core, Thread, Association.)

2. BACKGROUND

This section provides an overview of the baseline Hermes system, together with a review of our targeted parallel architectures.

Hermes Overview.

Hermes [34] is an energy-efficient work-stealing runtime built on top of Intel Cilk Plus, which traces its roots to MIT Cilk [5, 6, 14]. The design of Aequitas is independent of both the algorithm details of classic work stealing and those of Hermes, but a review provides a concrete picture of the challenging context that Aequitas aims to address.

Work stealing is a classic fine-grained task management strategy over parallel architectures, known for its strength in load balancing and scalability. Under work-stealing runtimes, a program consists of tasks and a pre-assigned number of threads (called workers) competing to process these tasks. In practice, the number of workers is often equal to the number of available CPU cores. Each worker maintains a queue-like data structure called a double-ended queue, or deque, containing tasks to be processed. When a worker finishes processing a task, it picks up the next one from its deque and continues execution. When its deque is empty, a worker steals a task from the deque of another worker. In this case, the stealing worker is called a thief, whereas the worker whose task is stolen is called a victim. The selection of victims follows load-balancing principles and may vary across implementations of work stealing.
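For concreteness, the worker loop just described can be summarized in the following minimal C sketch. It is purely illustrative: the Task and Deque types and the pop_bottom, steal_top, pick_victim, run, and work_remains helpers are hypothetical placeholders, not Intel Cilk Plus internals.

    /* Illustrative work-stealing worker loop; all types and helpers
     * below are hypothetical placeholders, not Cilk Plus internals. */
    typedef struct Task Task;
    typedef struct Deque Deque;

    extern Task  *pop_bottom(Deque *d);      /* owner pops from its own deque      */
    extern Task  *steal_top(Deque *victim);  /* thief steals from a victim's deque */
    extern Deque *pick_victim(void);         /* victim selection (load balancing)  */
    extern void   run(Task *t);              /* execute a task                     */
    extern int    work_remains(void);        /* program-wide termination check     */

    void worker_loop(Deque *my_deque)
    {
        while (work_remains()) {
            Task *t = pop_bottom(my_deque);  /* prefer local work                  */
            if (t == NULL)                   /* empty deque: become a thief        */
                t = steal_top(pick_victim());
            if (t != NULL)
                run(t);                      /* may push newly spawned tasks       */
        }
    }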

Hermes is a white-box runtime-level energy optimization and takes advantage of the inherent features of work-stealing runtimes to guide DVFS. A worker (thief) who steals a task from another worker (victim) may form a data dependency relationship with it. Slowing down the thief may not significantly affect the overall performance of the application, but may reduce energy consumption. Therefore, threads in a work-stealing environment – thieves and victims – have varying impacts on the overall program running time.


Figure 3: Hermes Stress Test. (Y-axis: slowdown w.r.t. Intel Cilk Plus, 0%-35%; X-axis: KNN, 2 KNN, 4 KNN, Ray, 2 Ray, 4 Ray, Hull, 2 Hull, 4 Hull.)

Hermes coordinates the DVFS frequencies associated with each thread: in the scenario above, a workpath-sensitive algorithm determines the state of each worker in a thief-victim relationship and sets DVFS frequencies accordingly. A second, workload-sensitive algorithm samples the deque size and selects an appropriate DVFS frequency based on it. Hermes unifies the two algorithms into one, and their combination provides an energy optimization that is aware of the states and semantics of the work-stealing program runtime.
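To give a flavor of what such a unified decision could look like, the following C sketch combines a worker's thief/victim role with its sampled deque size to choose a frequency level. The thresholds and the mapping are invented for illustration only (the frequency table simply lists System B's levels from Table 2); the actual Hermes policy is defined in [34] and is not reproduced here.

    /* Illustrative combination of a workpath-sensitive signal (thief vs.
     * victim) and a workload-sensitive signal (deque size). Thresholds
     * and level choices are assumptions, NOT the Hermes parameters. */
    enum role { THIEF, VICTIM };

    static const int freq_khz[] = { 1400000, 2100000, 2700000, 3300000, 3600000 };

    int choose_frequency(enum role r, int deque_size)
    {
        int level = (r == THIEF) ? 1 : 3;  /* thieves are assumed to run slower    */
        if (deque_size > 16)               /* a long deque is assumed to call for  */
            level += 1;                    /* a higher level                       */
        else if (deque_size == 0)
            level -= 1;
        if (level < 0) level = 0;
        if (level > 4) level = 4;
        return freq_khz[level];            /* desired frequency for a DVFS request */
    }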

In the original Hermes, each worker has exclusive access to power management, and as shown in Figure 2, one worker is mapped to each power domain module. Consequently, Hermes uses only 50% of the available CPU cores. The reported data showed an average of 11-12% energy savings with an average of 3-4% performance loss. The latest version of their system reports 12-13% energy savings with an average of 2-3% performance loss through some optimizations. This optimized version of Hermes is the baseline of our experiments.

We performed a Hermes "stress test" by extending it to support Co-Existing Applications and Co-Managed Power Domains, as presented in Figure 3. Labels on the X-axis describe the representative benchmarks (see Section 4) and the number of benchmark instances executed in parallel. For example, the label KNN indicates that one instance of benchmark KNN is executed, whereas 4 KNN indicates that four KNN instances are executed in parallel. The Y-axis is the normalized time of Hermes-based execution w.r.t. Intel Cilk Plus, deployed on a system with an 8-core CPU where every two cores share the same power domain (see "System B" in Section 4). Each application instance consists of 8 threads, each mapped to a CPU core.

As shown in Figure 3, a single application with Co-Managed Power Domains yields a 5-8% slowdown w.r.t. Intel Cilk Plus. When Co-Existing Applications are allowed, performance further degrades to a 32% slowdown for the case where 4 instances of KNN co-execute. Performance degradation at this level is unacceptable for most applications.

Nonetheless, we believe Hermes is an ideal client for Aequitas for a number of reasons. First, work stealing is a promising approach among runtime systems for constructing parallel applications; it has attracted significant interest in recent years, including a growing number of commercial implementations [18, 17, 26, 25, 9, 30, 24, 29].

Figure 4: An Illustration of Aequitas with the Round-Robin Time-Slicing Algorithm for Power Management. (Panels (A)-(D) show applications A and B, each with four threads on four cores sharing two power domains; legend: Application, Power Domain, CPU Core, Thread, Association, Dominator.)

Second, the key feature of work stealing – its natural support for dynamic load balancing – happens to be a classic energy management strategy as well [32]. In that sense, work-stealing systems are "pro-green" from the onset. Third, Hermes improves the energy efficiency of work-stealing runtimes through fine-grained DVFS-based energy management. Retaining the energy efficiency of Hermes sets a "high bar" for the evaluation of Aequitas.

Parallel Architectures with Co-Managed Power Domains.

Modern CPUs generally support basic power management operations, such as DVFS and power state adjustment. CPU cores whose frequencies can only be adjusted together belong to the same power domain (or frequency domain). Ideally, every CPU core would be equipped with its own power management module, but in the real world, most commercial CPUs do not have a per-core-per-power-domain design due to cost and other design constraints.

The hardware reality often poses challenges in designing fine-grained algorithms to regulate power management. For instance, Hermes could only take advantage of half of the cores available in the AMD Piledriver/Bulldozer architectures, because every two cores share the same power domain. With Aequitas, variations in the topology of power domains no longer hamper fine-grained white-box energy management.

3. AEQUITAS DESIGN

In this section, we describe the core algorithm adopted by Aequitas, together with alternative design ideas.

High-Level Design.

We transformed the dual challenges described in Section 1 into a contention management problem – that is, how can we manage and resolve contended access to power management, provide equal-opportunity access to power management, and avoid excessive DVFS that leads to performance degradation? Aequitas answers these questions with a number of interconnected ideas:


1. Power management modules are accessed through coordinated time slicing, where multiple applications take turns dominating the power management, to address the Co-Existing Applications problem.

2. Threads deployed within an application on the samepower domain also employ coordinated time slicingand take turns dominating the power management toaddress the Co-Managed Power Domains problem.

3. Domination only implies exclusive access to all power management "knobs", not to the CPU itself. Only dominating threads can adjust CPU frequencies in a DVFS-based power management strategy, but every thread – including the non-dominating ones – still executes according to OS scheduling decisions and allocations.

4. The size of the dominating time slice is independent of the time-sharing OS scheduler. Aequitas fundamentally decouples the overhead of power management – such as CPU frequency switch cost – from that of OS-level context switching.

The fourth Aequitas design idea stands in contrast with a naive approach that piggybacks the context switch of power management – such as preserving and recovering CPU frequencies for each application – on that of the OS. The unfortunate consequence of such an approach is that every time a thread switches in, an action on the power management module must be performed. Power management, such as DVFS, is known to have a switching cost in the tens to hundreds of microseconds. The naive approach would not only incur excessive overhead, but also hamper the independence between OS scheduler design and energy management design. As a result, Aequitas enjoys coordinated and equalitarian power management, where each application and each thread within an application receives a share of time to regulate power.

Equalitarian Coordination with Two-Tiered Domination.

Aequitas is endowed with a "resource" contention mediator to coordinate DVFS-based power management among multiple competing energy-efficient applications and the threads within those applications. Aequitas resolves power management contention using a round-robin time-slicing approach. Unlike OS schedulers, the goal of time slicing in Aequitas is not to schedule CPU use but to resolve resource contention and regulate power management. We call this flavor of contention resolution the domination of power management, or domination for short. Domination is supported at two levels:

• Co-existing applications take time-sliced turns to manage power. When a time slice is held by an application, we informally call the application the app dominator, and the time slice it holds the app domination slice.

• Threads (of the same application) deployed on the same power domain take turns to manage power of the co-managed power domain. When a time slice is held by a thread, we informally call the thread the pd dominator, and the time slice it holds the pd domination slice.

Two design aspects are critical to understanding Aequitas and its advantages:

• Double Domination of Power Management: Power management "knobs" can only be accessed – i.e., DVFS can only be issued – from a dominating thread, and only if the host application of that thread is the app dominator (a minimal sketch of this check appears after this list).

• No Domination ≠ No Progress: A non-dominating multi-threaded application can still execute, but cannot make power adjustments.
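A minimal C rendering of this double check, mirroring the Scale Up/Scale Down procedures of Figure 5, is shown below; the Worker and Global fields follow Figure 5, while set_cpu_frequency and current_frequency are hypothetical platform hooks.

    /* A worker's DVFS request is honored only under double domination:
     * its application must be the app dominator and its scaling group
     * must hold the pd domination slice; redundant switches are skipped. */
    struct Global { int id, appDmntr, pdDmntr; };
    struct Worker { int scaler, cpu, desired_freq; struct Global *G; };

    extern int  current_frequency(int cpu);              /* hypothetical hooks */
    extern void set_cpu_frequency(int cpu, int freq);

    void request_dvfs(struct Worker *w)
    {
        if (w->G->id != w->G->appDmntr)      /* application does not dominate  */
            return;                          /* the request becomes a no-op    */
        if (w->scaler != w->G->pdDmntr)      /* worker group does not dominate */
            return;
        if (w->desired_freq != current_frequency(w->cpu))
            set_cpu_frequency(w->cpu, w->desired_freq);
    }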

In Aequitas, each dominating time slice is assigned an equal portion and in a round-robin order; however, the app domination slices and the pd domination slices are of different magnitudes. Figure 4 is a visual illustration of Aequitas with 2 applications, each with 4 threads/workers deployed on 4 CPU cores pairwise sharing 2 power domains. Darkened color in the application rectangles illustrates application-level domination. While applications execute, Aequitas switches between workers within an application based on the pd domination slice. Figure 4(A) shows that half of the workers (rectangles with darkened color) within Application A dominate.

The other half of the workers still execute, but their DVFS requests are no-ops. In Figure 4(B), the dominating worker group is switched. In Figures 4(C) and (D), Application B becomes the app dominator and a similar switching process ensues among its workers.

In Aequitas, the pd domination slice is smaller than the app domination slice. If the opposite were true, half of an application's workers might dominate during the entire app domination slice. In the worst case, whenever an application becomes dominant again, the same half of the workers would dominate again, leading to inequality at the worker level.

Two advantages come with the time-sliced round-robin design. First, it promotes equalitarian power management. Through app domination, each application receives an "equal share" of time during which it can set frequency. Within an application, through pd domination, each thread deployed to the same power domain also receives an "equal share" of time to set frequency. It should be straightforward to extend this design to support a "proportionally fair" system where the time slice sizes are proportional to application/thread priorities or system resource needs, as sketched below. Second, the round-robin design significantly reduces the number of contending DVFS actions and their ensuing overhead.
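As a small illustration of the "proportionally fair" extension suggested above, domination slices could be sized in proportion to per-application weights. The sketch below only illustrates that extension; Aequitas as evaluated uses equal slices.

    /* Size each application's domination slice in proportion to its
     * weight within a fixed round; equal weights reproduce the equal
     * slices used by Aequitas in this paper. */
    void proportional_slices(const double *weight, double *slice_ms,
                             int napps, double round_ms)
    {
        double total = 0.0;
        for (int i = 0; i < napps; i++)
            total += weight[i];
        for (int i = 0; i < napps; i++)
            slice_ms[i] = (total > 0.0) ? round_ms * weight[i] / total
                                        : round_ms / napps;
    }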

Phase-based Domination Abdication.

Applications have long been known to display phased behaviors [38, 39, 40, 37]. This is especially true for parallel applications, where parallel phases and serial phases often alternate. One design consideration of Aequitas is how the round-robin domination should behave in the presence of prolonged inactivity of a dominating worker. If the dominating worker in a dominating application happens to be inactive, it would be a waste of opportunity for it to stay in the dominator role and prevent other workers from performing DVFS. Motivated by this insight, our algorithm is phase-aware: if the dominating worker is in an inactive phase, it abdicates its domination, and Aequitas passes the domination on to the module co-worker of the same app.

Algorithm Details.


Algorithm 0.1 Main Scheduler
    w : Worker
    procedure scheduler(w)
      SET_AFFINITY(w.cpu)
      while !work_done do
        SCHEDULE(w)
      end while
      END_AFFINITY(w.cpu)
      if w.self == 0 then
        w.G.liveApp[w.G.id] ← 0
        for each w do
          w.G.coreGlobal[w.cpu].remove(w.self)
          w.G.coreGlobal[w.cpu] = null
        end for
      end if
    end procedure

Algorithm 0.2 Schedule Work
    w : Worker
    e : Register worker as core user
    procedure schedule_work(w)
      loop
        t ← pop(w)
        if t == null then
          v ← select()
          t ← steal(v)
          if t == null then
            yield(w)
            w.G.coreGlobal[w.cpu].remove(w.self)
            return
          else
            e ← w.G.coreGlobal[w.cpu].register(w.self)
            if e == null then
              w.G.coreGlobal[w.cpu].add(w.self)
            end if
          end if
        end if
        work(w, t)
      end loop
    end procedure

Algorithm 0.3 Scale Up
    w : Worker
    procedure up(w)
      if w.G.id == w.G.appDmntr then
        if w.scaler == w.G.pdDmntr then
          SCALE_UP(w.cpu, w.freq ≠ cpu.freq)
        end if
      end if
    end procedure

Algorithm 0.4 Scale Down
    w : Worker
    procedure down(w)
      if w.G.id == w.G.appDmntr then
        if w.scaler == w.G.pdDmntr then
          SCALE_DOWN(w.cpu, w.freq ≠ cpu.freq)
        end if
      end if
    end procedure

Worker Structure
    structure Worker W
      self      // application worker id
      scaler    // scale order
      cpu       // cpu affinity
      childpid  // app scheduler
      G         // global state
    end structure

Algorithm 0.6 Make Worker
    w : Worker
    procedure make_worker(w, g)
      INIT_APP(w)
      if w.G.id == 0 & w.self == 0 then
        w.childpid ← fork()
        procedure fork
          SCHEDULE_APPS()
        end procedure
      end if
    end procedure

Algorithm 0.7 Initialization
    w : Worker
    procedure init_app(w)
      BIND(w.G.appDmntr)
      BIND(w.G.pdDmntr)
      BIND(w.G.liveApp)
      BIND(w.G.coreGlobal)
      w.cpu ← cpu[w.self]
      w.G.coreGlobal[w.cpu].add(w.self)
      w.scaler ← 0
      if w.self % 2 == 0 then
        w.scaler ← 1
      end if
      if w.self == 0 then
        w.G.liveApp[w.G.id] ← 1
      end if
    end procedure

Algorithm 0.8 Schedule Apps
    w : Worker
    procedure schedule_apps
      while G.liveApp.size ≠ 0 do
        t += 0.25
        for each cpu do
          m ← get_co_worker_module[cpu]
          if G.coreGlobal[cpu] == null then
            if G.coreGlobal[m] == null then
              MIN(cpu, m)
            end if
          end if
        end for
        if G.pdDmntr == 0 then
          G.pdDmntr ← 1
        else
          G.pdDmntr ← 0
        end if
        while !G.liveApp[k] & t == 1 do
          k ← (k+1) % MAX
          t ← 0
        end while
        G.appDmntr ← k
        k ← (k+1) % MAX
        if G.liveApp.size == 0 then
          exit(0)
        else
          sleep(25ms)
        end if
      end while
    end procedure

Global Structure
    structure Global G
      id          // app id
      pdDmntr     // dominating worker
      appDmntr    // dominating app
      liveApp     // app registry
      coreGlobal  // core use
    end structure

Figure 5: Aequitas Algorithm Specification.

Figure 5 presents the Aequitas algorithm, and we now informally summarize several design details. During the bootstrapping sequence of the first runtime, its first worker creates globally shared memory areas called liveApp, appDmntr, pdDmntr, and coreGlobal. liveApp holds the identifications of all live applications, appDmntr holds the unique identification of the dominating application, pdDmntr holds the identification of the worker within the dominating application that currently dominates power management, and coreGlobal keeps track of all threads that use a core.

As additional application runtimes bootstrap, they register themselves in liveApp, indicating that they are alive. Their workers attach themselves to the shared memory areas appDmntr, pdDmntr, and coreGlobal. During the same bootstrapping process, the first worker also forks a child procedure that performs round-robin maintenance of domination, i.e., periodically updating the appDmntr and pdDmntr data between sleeps and according to the specified app/pd time slice sizes. The child procedure also looks through coreGlobal to find unused power domains and scales them down to the minimum frequency. The runtime has been modified so that a DVFS request is only honored when appDmntr and pdDmntr confirm that the current worker owns both the application domination and worker domination slices. Lastly, when an application finishes its workload, its runtime deregisters itself from liveApp so that it will not be scheduled for domination. Once all applications finish, the child process exits.
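The paper does not specify the exact IPC mechanism behind these shared areas. One plausible realization, sketched below under the assumption of POSIX shared memory, is for the first runtime to create a named segment holding liveApp, appDmntr, pdDmntr, and coreGlobal, with later runtimes simply attaching to it; the segment name, array bounds, and layout are illustrative assumptions.

    /* Hypothetical sketch: globally shared coordination state via POSIX
     * shared memory. Name, bounds, and layout are assumptions; link with
     * -lrt on older glibc versions. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAX_APPS  64
    #define MAX_CORES 64

    struct shared_state {
        int liveApp[MAX_APPS];      /* registry of live applications         */
        int appDmntr;               /* id of the dominating application      */
        int pdDmntr;                /* which worker group dominates the pd   */
        int coreGlobal[MAX_CORES];  /* per-core usage (simplified to counts) */
    };

    struct shared_state *attach_shared_state(void)
    {
        int fd = shm_open("/aequitas_state", O_CREAT | O_RDWR, 0666);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, sizeof(struct shared_state)) != 0) {
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, sizeof(struct shared_state),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return (p == MAP_FAILED) ? NULL : (struct shared_state *)p;
    }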

To support domination abdication, we rely on a simple internal status variable in Intel Cilk Plus – one that tracks steal counts – to identify whether a worker is in an inactive phase. Furthermore, when workers enter an inactive phase, they also deregister themselves from coreGlobal, indicating they are not using the core for work. If a power domain module is not in use, it is scaled down to the minimum frequency until some worker becomes active and uses it again.

Alternative Contention Resolution.

When two threads from Co-Existing Applications or Co-Managed Power Domains desire different CPU frequencies, alternative possibilities exist to resolve the conflict. During the design exploration that led to the round-robin time-slicing approach adopted by Aequitas, we also considered such alternatives. For example, consider two contending workers. Other possible contention resolution policies are (for all policies, the frequency is only changed if it differs from the current frequency):

• First Comer Dominates (FCD): the CPU frequency is set by the worker that is first scheduled to execute a task in the shared power domain, and the worker scheduled later does not perform DVFS until the first worker finishes;

• Last Comer Dominates (LCD): opposite of FCD;

• Higher Frequency Dominates (HFD): CPU frequencyis set to the higher level desired by the two workersscheduled to the power domain;

• Lower Frequency Dominates (LFD): opposite of HFD;

• Average Frequency Dominates (AFD): CPU frequencyis set to average of all requests.

We have implemented these alternative policies, whose benchmarking results will be discussed in Section 5.5. None is as effective as the round-robin time-slicing approach adopted by Aequitas. More importantly, these policies are application-blind, which sacrifices equalitarian coordination in energy management. Nonetheless, we believe these results complement Aequitas as a part of the complete design space exploration.
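For concreteness, the frequency-choosing step of the order-independent policies above (HFD, LFD, and AFD) can be sketched as follows for two contending requests; FCD and LCD depend on scheduling order rather than on the requested values and are therefore omitted.

    /* Resolve two contending frequency requests under the value-based
     * policies. The AFD average is illustrative: a real implementation
     * would snap the result to a supported frequency level. */
    enum policy { HFD, LFD, AFD };

    int resolve(enum policy p, int freq_a, int freq_b)
    {
        switch (p) {
        case HFD: return freq_a > freq_b ? freq_a : freq_b;  /* higher wins */
        case LFD: return freq_a < freq_b ? freq_a : freq_b;  /* lower wins  */
        case AFD: return (freq_a + freq_b) / 2;              /* average     */
        }
        return freq_a;
    }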

4. IMPLEMENTATION AND EXPERIMENTAL SETUP

Aequitas was implemented on top of Intel Cilk Plus (build 2856). In this section, we present our experimental setup(s).

We selected benchmarks from the Problem-Based Benchmark Suite (PBBS) [4]. PBBS is widely used for benchmarking parallel applications and contains a subset of benchmarks that support work stealing. Figure 6 presents a 100ms snapshot overlay of worker 1 from each benchmark, which shows that, comparatively, individual benchmarks want to make different energy management decisions. The details of our selected benchmarks are given in Table 1.


Figure 6: Power Management Decisions of Benchmarks Used in Experiments. (Snapshot of power management decisions on System B; Y-axis: frequency in GHz, 1.4-3.8; one series per benchmark: Sort, Hull, Compare, Ray, KNN, each for worker 1.)

To validate the effectiveness of our approach, we constructed our experiments on the two platforms shown in Table 2. On both systems the default Linux Completely Fair Scheduler (CFS) is used. CFS is known to have no fixed time slices for scheduling, even though the ballpark is in the magnitude of tens of milliseconds. For both the Piledriver and Bulldozer architectures, every two cores share a frequency domain.

Energy consumption is measured through current meters over the power supply lines to the CPU module, including the cores and the uncore (such as on-chip caches and interconnects). For CPU-intensive benchmarks – to which work-stealing applications/benchmarks belong – these hardware components are the ones whose energy consumption is most likely to vary. Data is converted through an NI DAQ and collected by NI LabVIEW SignalExpress [10] at 100 samples per second. The supply voltage is stable at 12V, and we compute energy consumption as the sum of current samples multiplied by 12 × 0.01.
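In code, the stated computation amounts to the following small helper (a sketch; the sample array and its length are placeholders for the collected data).

    /* Energy in joules from current samples taken at 100 Hz on a stable
     * 12 V supply: E = sum_k I_k * 12 V * 0.01 s, as described above. */
    double energy_joules(const double *current_amps, int nsamples)
    {
        double e = 0.0;
        for (int k = 0; k < nsamples; k++)
            e += current_amps[k] * 12.0 * 0.01;
        return e;
    }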

We executed each benchmark using the Aequitas runtime. Since Aequitas manages DVFS at the application runtime level, the Linux governor is set to userspace. All energy and performance results reported in Section 5 are normalized to a control, defined as executing the same benchmark (in the same configuration) with the unmodified Intel Cilk Plus (build 2856) runtime under the default Linux ondemand governor. For each benchmark, we ran 20 trials (disregarding the first 2 trials) and calculated the average of the trials.
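Switching a core to the userspace governor and issuing a target frequency through the standard Linux cpufreq sysfs interface might look as follows; this is an assumption made for illustration, as the paper does not state which DVFS interface the runtime uses.

    /* Hypothetical sketch using the Linux cpufreq sysfs files: select the
     * userspace governor for a core, then write a frequency (in kHz) to
     * scaling_setspeed. Requires sufficient privileges. */
    #include <stdio.h>

    static int write_sysfs(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int ok = (fputs(value, f) >= 0);
        fclose(f);
        return ok ? 0 : -1;
    }

    int set_userspace_freq(int cpu, int freq_khz)
    {
        char path[128], value[32];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
        if (write_sysfs(path, "userspace") != 0)
            return -1;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
        snprintf(value, sizeof value, "%d", freq_khz);
        return write_sysfs(path, value);
    }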

Hermes Comparison.

Our experimental evaluation also involved some quantitative comparisons against Hermes. The Hermes runtime does not define behavior when the number of workers exceeds the number of power domains. Aequitas is capable of handling Co-Managed Power Domains, and our experiments use all CPU cores, even if some share the same power domain. To construct a fair comparison, we updated the Hermes source code to support the same number of workers as CPU cores, e.g., increasing the number of workers from 4 to 8 on an 8-core machine where every two cores share a power domain. Initially, we did not explicitly bind the additional 4 workers to any core. The experimental results were poor: the co-execution of two instances of Ray incurred a 167% slowdown w.r.t. Intel Cilk Plus.

K-Nearest Neighbors (KNN): Uses pattern recognition methods to classify objects based on closest training examples in the feature space.
Ray-Triangle Intersection (Ray): For each ray, calculates the first triangle it intersects, given a set of triangles contained inside a 3D bounding box and a set of rays that penetrate the box.
Integer Sort (Sort): Sorts fixed-length integer keys into ascending order, with the ability to carry along fixed-length auxiliary data.
Comparison Sort (Compare): Similar to Sort but uses sample sort.
Convex Hull (Hull): Computational geometry benchmark.

Table 1: Benchmark Descriptions.

System A
• 2×16-core AMD Opteron 6378 (Piledriver)
• Debian 7.0 Linux (Kernel 3.2.49)
• 64GB DDR3 1600 RAM
• Supports 5 distinct frequencies: 1.4GHz, 1.6GHz, 1.9GHz, 2.2GHz, 2.4GHz

System B
• 8-core AMD FX-8150 (Bulldozer)
• Debian 7.0 Linux (Kernel 3.2.49)
• 16GB DDR3 1600 RAM
• Supports 5 distinct frequencies: 1.4GHz, 2.1GHz, 2.7GHz, 3.3GHz, 3.6GHz

Table 2: System Configurations.

We speculate this is due to a loss of cache locality, combined with unpredictable contention over DVFS. We decided that a fairer comparison would be to bind workers to cores in the same fashion as Aequitas. This is the setting we used for all experiments reported in the paper.

5. EVALUATION

This section presents detailed energy and time results of Aequitas on Systems A and B w.r.t. Intel Cilk Plus. To gain confidence in the applicability of Aequitas, we conducted design space explorations under different application, algorithm, and platform settings. In Section 5.5, we also report results for alternative contention resolution policies.

5.1 Overall Results

Figure 7 shows the effectiveness of Aequitas w.r.t. Intel Cilk Plus in the presence of Co-Existing Applications and Co-Managed Power Domains. In Figure 7, black bars are normalized energy savings and gray bars are normalized time losses, both w.r.t. Intel Cilk Plus. A label such as 4Ray means that four Ray instances are executed in parallel. Each of them has the same number of threads/workers as the number of CPU cores. We show the results when 2, 3, and 4 applications (from the same benchmark) execute in parallel. Recall that both Systems A and B are built on architectures where every pair of CPU cores shares a power domain.

Overall, on System A, Aequitas achieves an average energy savings of 14% with performance loss averaging 2.5%. On System B, Aequitas achieves an average energy savings of 12-13%, also with performance loss averaging 2.5%.


Figure 7: Aequitas with Multiple Homogeneous Applications and Co-Managed Power Domains. (Normalized energy savings and time loss, -5% to 15%, on Systems A and B.)

Figure 8: Aequitas with Multiple Heterogeneous Applications and Co-Managed Power Domains. (Normalized energy savings and time loss, -5% to 15%, on Systems A and B.)

Figure 9: Power Management Time Slice Choices. (Normalized energy savings and time loss on System A for 4KNN, 4Ray, and 4Hull under app/pd slice pairs 0.5_125, 0.5_25, 1_25, 1_50, 2_25, 2_50, 4_25, 4_50.)

In single-benchmark testing, the results show that Aequitas outperforms Hermes w.r.t. Intel Cilk Plus on System A and slightly outperforms Hermes on System B – recall that the optimized Hermes demonstrated 12-13% energy savings with an average of 2-3% performance loss.


Aequitas operates in a much more adversarial setting where multiple applications co-exist and multiple threads co-manage the same power domain. The improved energy efficiency is due to Aequitas' ability to scale down unused CPUs to a minimum frequency, avoid unnecessary DVFS calls, and use all available CPU cores. With the support of Co-Managed Power Domains, we can now deploy two workers to the cores that belong to the same power domain. By hardware design, the two cores also share more levels of caches, leading to more favorable cache locality and a smaller performance loss of Aequitas compared to Hermes w.r.t. Intel Cilk Plus.

Furthermore, unlike Hermes, where threads/workers canalways execute power management decisions as they wish, athread under Aequitas cannot manage power if it is notdominating. Experimental results show that the idea oftime-sliced power management can be nearly as effective asexclusive power management.

Lastly, observe that Aequitas is stable across platforms.The patterns in relative savings/loss on System A and B arenearly identical, and their relative standings are specific tothe benchmarks themselves. The stability across System Aand B is manifested in nearly all of our experiments.

5.2 Homogeneous vs. Heterogeneous Applications

We also experimented in a setting where the applications executing in parallel are different benchmarks. Figure 8 presents the combinations of all benchmarks, with legends/configurations identical to Figure 7. Energy data are reported based on the data collected between the common starting time of all applications and the completion time of the last application. On System A, Aequitas achieves an average energy savings of 13% with a slight performance loss averaging 2.8%. On System B, Aequitas achieves an average energy savings of 12%, also with a slight performance loss averaging 2.5%. Unlike homogeneous settings where multiple applications complete roughly at the same time, different applications in the heterogeneous setting may complete at very different times. The most important observation is that the benefits of Aequitas in the homogeneous setting are retained for the heterogeneous combinations of applications.

5.3 Workload Level

We also studied the impact of various workload levels on Aequitas w.r.t. Intel Cilk Plus, with the results shown in Figure 10 (including the case where the number of applications is 2x that of physical cores). There is a small but noticeable decline in the benefits produced by Aequitas with an increase in the number of applications executing in parallel, and the degree of decline is benchmark-specific.

The decline does not come as a surprise. Consider 2KNN and 16KNN, for instance. It is 8x more likely that an application in the latter execution – than one in the former – wishes to adjust the CPU frequency to a particular level but cannot, because it is not a dominator. Figure 10 shows that despite the exponential decline in the odds of executing the "best wish", the energy/performance benefits do not exponentially precipitate. From that angle, Figure 10 is reasonably good news. It also says 16KNN and 16Ray may remain viable configurations for some energy-conscious users.

The more relevant question is whether executing 16 CPU-intensive applications concurrently represents a typical use scenario in the real world. To the best of our knowledge, such cases are most common where a portion or all of the applications are "background" in nature, I/O-bound, or interactive (e.g., [22]). Since work-stealing runtimes are naturally CPU-intensive – it has long been known that frequent I/O or synchronization greatly hampers the stealing process – we speculate that a scenario such as 16KNN would be a rare event.

Figure 10: Energy/Performance in the Presence of Varying Workload Levels. (Normalized energy savings and time loss on System B, -8% to 16%, with increased workload.)

5.4 Time Slice Choices

For the results reported so far, the app/pd domination slices are fixed at 1 second / 0.25 seconds, respectively. We also studied the impact of varying time-slice sizes. Figure 9 shows the results of concurrently executing 4 instances each of the KNN, Ray, and Hull benchmarks at various time slice choices w.r.t. Intel Cilk Plus. The two numbers at the bottom of each bar are the time slice choices. For instance, 0.5_125 means the app domination slice is 0.5 seconds and the pd domination slice is 0.125 seconds. The rest of the legends/configurations are identical to Figure 7.

A common trend is that the results for performance loss exhibit an "inverse U" shape: both very small and very large time slices incur more performance loss than slices in the middle. For all 3 benchmarks, performance quickly deteriorates when the app domination slice is reduced to 0.5 seconds; here, the overhead of DVFS becomes a factor. On the other hand, when time slices become too coarse, the effectiveness of Aequitas – its ability to allow each application to make the decisions it wishes – is discounted. As a result, as we move from the center to the right in each figure, performance also deteriorates for each benchmark.

Readers might wonder why a time slice such as 0.25 seconds for power domain domination does not incur excessive DVFS overhead (as the naive solution may). The key observation here is that DVFS only occurs in the presence of Double Domination (see Section 3), and the frequency is only set if it differs from the current CPU frequency, i.e., even if requested, there is no need to honor a DVFS request that asks for the same frequency at which a core already operates.

5.5 Alternative Contention Resolutions

Figure 11 shows the normalized time loss of the 5 alternative contention resolution policies described in Section 3, w.r.t. Intel Cilk. We can quickly eliminate LCD, LFD, and AVG from viable solutions.


Figure 11: Performance Loss of Alternative Contention Resolution Policies. (Normalized time loss on System B for HFD, LFD, FCD, LCD, AVG, and RR, each with KNN, Ray, and Hull.)

Figure 12: Energy Savings of Alternative Contention Resolution Policies and Energy/Time for FCD vs. Round-Robin Time-Slicing with 4 co-existing applications. (Normalized energy/time on System B for RRTS, FCD, and HFD with KNN, Ray, and Hull, and for RRTS and FCD with 4KNN, 4Ray, and 4Hull.)

Even for a single application (with co-managed power domains), these three policies already yield a performance loss from 7% to 37%. The poor performance of LCD and AVG results from frequent CPU scaling: these strategies blindly switch CPU frequencies for almost every work-stealing task scheduled. It comes as no surprise that LFD is sub-optimal, as it always favors executing at the lowest frequency.

On the other hand, FCD and HFD remain competitive. Indeed, FCD shows an average of 2.5% performance loss, and HFD only 0.6%, both w.r.t. Intel Cilk. Their energy savings, however, are less competitive when compared to the round-robin time-slicing strategy adopted by Aequitas. Even though HFD achieves the best performance w.r.t. Intel Cilk, it also saves the least energy.

Among the alternative policies studied, FCD is the most competitive when compared to the round-robin time-slicing strategy. Figure 12 presents the results of a scenario where 4 work-stealing applications co-exist. In the presence of multiple applications, FCD does not scale as well. The root problem with FCD is its application blindness. Philosophically, FCD is blind to intra-application coordination – a hallmark of white-box approaches such as Aequitas – and may inadvertently make counter-intuitive decisions. The experimental result shows the cost of application blindness and experimentally explains the benefits of Aequitas's two-tiered domination design.

5.6 Operating System Scheduler

As mentioned before, the OSs on both Systems A and B use CFS. Our Linux installations also have the O(1) scheduler available. Rerunning our experiments with the O(1) scheduler showed a margin of ±0.5% on average across benchmarks for power and energy w.r.t. Intel Cilk Plus.

6. UNDERSTANDING AEQUITAS BEHAVIORS

In this section, we conduct an "under-the-hood" comparative analysis of Aequitas and Hermes. It is our belief that the reasons why time-sliced power management produces good results are just as important as – if not more important than – the results themselves.

6.1 Frequency Scaling Traces

We first present a time series of frequency scaling for a representative benchmark execution, in this case a heterogeneous co-execution of Ray and Compare. The entire trace is shown in Figure 13 as a "bird's-eye view."

Figures 13(a)(b) present Aequitas behavior, whereas Figures 13(c)(d) present Hermes behavior. In both experiments, each application consists of 8 workers, one worker per core, and 2 cores share a power domain. Figures 13(a)(c) show the behaviors of worker 1 of Ray and Compare, respectively, whereas Figures 13(b)(d) show the behaviors of worker 2 of Ray and Compare, respectively. We omit workers 3-8 to save space. The unit on the X-axis is the execution time in 1ms intervals (1ms provides the best overview for explanation) and the unit on the Y-axis is the frequency in GHz.

The most striking difference when comparing Aequitas to Hermes is the significantly higher volume of frequency scaling by Hermes. Workers 1 and 2 of Compare spent nearly all of the first 5 seconds completing the first phase of their work in Hermes, which was completed in 1 second in Aequitas. We speculate that the root cause of this behavior is the contention raised by the two workers of Compare in Hermes while operating on the same power domain. Although each DVFS operation incurs only tens of microseconds of overhead, it quickly adds up. This is especially true when the two workers keep overriding each other's frequency – a case of power domain "thrashing". In Aequitas, workers take turns to dominate and thus avoid a significant portion of DVFS operations and eliminate "thrashing".

In Hermes, workers not only compete in the case of Co-Managed Power Domains but may also run into contention with Co-Existing Applications. In the example presented, this can be observed at around the 6000th millisecond in Figure 13, where both applications are heavily competing to manage power.

The respective frequency scaling behaviors of Hermes and Aequitas also explain their different execution times. As shown in Figure 13, the Hermes-based co-execution takes around 147 seconds for both to complete, whereas its Aequitas-based counterpart takes only 132 seconds. Energy consumption has a linear relationship with execution time.

Readers might wonder why the Hermes co-execution is not 5x longer than its Aequitas counterpart, considering that the first phase of Compare took nearly 5x longer under Hermes. This results from the fact that co-execution has extended phases of inactivity, which somewhat absorb the effects of the slowdown, and the fact that we extended Hermes to only set the frequency if it differs from the current CPU frequency, i.e., there is no need to honor a DVFS request that asks for the same frequency at which a core already operates.


Figure 13: An overview of frequency scaling time series for a 1 Ray and 1 Compare mixed execution on System B. (Panels (a)(b): Aequitas, workers 1 and 2 of Ray and Compare; panels (c)(d): Hermes, workers 1 and 2; X-axis: execution time in ms, 0-25000; Y-axis: frequency in GHz, 1.4-3.8.)

Figure 14: The Number of Frequency Scalings for Co-Executions on System B. (Single, 2x, and 4x executions of Hull, KNN, and Ray, plus heterogeneous pairs; Y-axis: number of scalings, 0-18K; series: Aequitas, Hermes.)

6.2 Scaling Counts

As the previous subsection showed, the difference in the number of frequency scalings between Hermes and Aequitas plays an important role in understanding the effectiveness of Aequitas. This section presents a closer look at this matter in the presence of different benchmarking co-executions, with the results shown in Figure 14. The more obvious observation is that as the number of co-executing benchmarks increases, the number of frequency scalings also increases, but the effects for Hermes and Aequitas are different. In Aequitas, the increase is mild, as the rotating domination can mask off many scaling attempts.

The effect for Hermes, however, is striking: the 4 Ray co-execution scales 8x more than a single Ray execution. In both Hermes and Aequitas, a conditional check exists so that DVFS is not performed if the newly requested frequency is the same as the current CPU operating frequency.

Figure 15: Normalized Overscaling/Underscaling Percent of Time on System B. (Ray and Compare under Aequitas and Hermes; series: Over, Under; Y-axis: 0%-90%.)

As more benchmarks are involved, power domain "thrashing" becomes more prevalent in a Hermes co-execution; therefore, the likelihood of this check succeeding is significantly reduced. Furthermore, observe that heterogeneous co-executions fare better than homogeneous ones. As benchmarks demonstrate phased behaviors, homogeneous co-executions are likely to lead to more contention because the involved applications are more likely to exhibit similar phased behaviors.

6.3 Underscaled/Overscaled Time

We also investigated the effect of time slicing on mitigating frequency scaling. Under the domination design of Aequitas, non-dominating workers may suffer by operating at a non-desirable CPU frequency. However, this happens excessively in Hermes: when a worker eagerly scales the frequency to a level it desires in Hermes, other workers occupying the same power domain may operate at a non-desirable CPU frequency.

Figure 15 shows the normalized percentage of time the Ray and Compare benchmarks – including all of their respective workers – spend at a lower frequency than desired, denoted as Under, and at a higher frequency than desired, denoted as Over.

Aequitas reduces both. Note that the data here are normalized, so the shortened execution time alone is not the reason why Aequitas fares better. In other words, not only does Aequitas reduce the number of frequency settings and shorten the execution time, but it also helps reduce the percentage of time an application operates at an undesirable frequency. Energy consumption and frequency setting are strongly correlated. Figure 15 offers further evidence behind the energy savings retained by Aequitas.

7. RELATED WORK

Energy-aware coordination frameworks exist for improving the adaptability and quality of service (QoS) of energy-aware applications. A classic example is Odyssey [13]. Another early dynamic software management framework [12] considered the coordination of multiple applications, in a scenario where applications are both QoS-adaptable and energy-aware. OS efforts such as Ecosystem [46] and Cinder [35] take a bottom-up approach (from the view of the compute stack) to cooperative energy management: energy and power are treated as first-class resources, and their provisioning is managed at the OS level. Overall, the goal of Aequitas is orthogonal to this line of research, in that existing work focuses on the coordination of energy consumption, whereas Aequitas coordinates energy/power management. In other words, Aequitas is a top-down approach that adapts application/runtime-level energy management to the existing OS and hardware.

Many systems have been developed to improve the efficiency of work-stealing algorithms. Recent examples include reducing sequential overhead [24], dynamic barrier overhead [23], deque management [1], and architectural support [31]. A more comprehensive review of earlier work was conducted in Hermes [34]. Among them, BWS [11] is the only work considering a time-shared multi-core environment; BWS focuses on improving system throughput and fairness, not energy efficiency.

The high-level goal of Aequitas is to help white-box energy optimization approaches adapt to time-shared operating systems and co-managed power domains. There is a large body of white-box approaches that the idea behind Aequitas may potentially impact. First, it is known [43, 33] that different threads/tasks may form data or timing dependencies and hence can be scheduled intelligently for improved energy efficiency with DVFS (e.g., [47, 42, 45, 28]). The earlier systems along this line pre-dated the multi-threaded application era, and hence were developed for a (non-time-sliced) multi-tasking OS. In principle, however, the core algorithms should be applicable to multi-threaded applications as well, especially with the help of an Aequitas-like system. Second, compiler-based energy optimization [20, 44, 16, 41, 3, 27] analyzes patterns of data/control flow and memory access to enable judicious energy management. Third, a growing number of energy-aware programming language designs [2, 36, 8, 21, 48] harvest programmer knowledge to make application-specific energy management decisions. One common characteristic shared by this group of related work is that energy optimization/management is performed on a per-application basis. Aequitas may serve as a concrete example to guide these systems in adapting to the reality of the underlying OS and hardware.

8. CONCLUSION

This paper describes Aequitas, a practical system to coordinate energy management of multiple power-managing application runtimes. Through time-sliced power management with two-tiered domination, Aequitas achieves equalitarian coordination and efficiency in energy optimization among co-existing applications deployed on co-managed power domains. The minimalistic approach yields significant energy efficiency in a DVFS-enabled, cooperative multi-application environment with little overhead.

Acknowledgments.

We thank the anonymous reviewers for their useful comments. This project is sponsored by US NSF CCF-1526205 and CAREER Award CCF-1054515.

9. REFERENCES

[1] Acar, U. A., Chargueraud, A., and Rainey, M. Scheduling parallel programs by work stealing with private deques. In PPoPP'13 (2013), pp. 219–228.

[2] Baek, W., and Chilimbi, T. M. Green: a framework for supporting energy-conscious programming using controlled approximation. In PLDI'10 (2010), pp. 198–209.

[3] Bartenstein, T., and Liu, Y. D. Green streams for data-intensive software. In ICSE'13 (2013).

[4] Blelloch, G., Fineman, J., Gibbons, P., Kyrola, A., Shun, J., Tangwonsan, K., and Simhadri, H. V. Problem Based Benchmark Suite, 2012.

[5] Blumofe, R. D. Executing Multithreaded Programs Efficiently. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 1995.

[6] Blumofe, R. D., and Leiserson, C. E. Scheduling multithreaded computations by work stealing. J. ACM 46, 5 (1999), 720–748.

[7] Burd, T. D., and Brodersen, R. W. Design issues for dynamic voltage scaling. In ISLPED'00 (2000), pp. 9–14.

[8] Cohen, M., Zhu, H. S., Emgin, S. E., and Liu, Y. D. Energy types. In OOPSLA'12 (2012).

[9] Cong, G., Kodali, S., Krishnamoorthy, S., Lea, D., Saraswat, V., and Wen, T. Solving large, irregular graph problems using adaptive work-stealing. In ICPP'08 (2008), pp. 536–545.

[10] National Instruments Corporation. NI LabVIEW SignalExpress, January 2014.

[11] Ding, X., Wang, K., Gibbons, P. B., and Zhang, X. BWS: balanced work stealing for time-sharing multicores. In EuroSys'12 (2012), pp. 365–378.

[12] Fei, Y., Zhong, L., and Jha, N. An energy-aware framework for coordinated dynamic software management in mobile computers. In MASCOTS'04 (2004), pp. 306–317.

[13] Flinn, J., and Satyanarayanan, M. Energy-aware adaptation for mobile applications. In SOSP'99 (1999), pp. 48–63.

[14] Frigo, M., Leiserson, C. E., and Randall, K. H. The implementation of the Cilk-5 multithreaded language. In PLDI'98 (1998), pp. 212–223.

[15] Horowitz, M., Indermaur, T., and Gonzalez, R. Low-power digital design. In Low Power Electronics, 1994. Digest of Technical Papers, IEEE Symposium (1994), pp. 8–11.

[16] Hsu, C.-H., and Kremer, U. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction. In PLDI'03 (2003), pp. 38–48.

[17] Intel. Intel Cilk Plus, 2015.

[18] Intel. Intel Threading Building Blocks (Intel TBB), 2015.

[19] Isci, C., and Martonosi, M. Runtime power monitoring in high-end processors: Methodology and empirical data. In MICRO'03 (2003), p. 93.

[20] Kandemir, M., Vijaykrishnan, N., Irwin, M. J., and Ye, W. Influence of compiler optimizations on system power. In DAC'00 (2000), pp. 304–307.

[21] Kansal, A., Saponas, S., Brush, A. B., McKinley, K. S., Mytkowicz, T., and Ziola, R. The latency, accuracy, and battery (LAB) abstraction: Programmer productivity and energy efficiency for continuous mobile context sensing. In OOPSLA'13 (2013), pp. 661–676.

[22] Koufaty, D., Reddy, D., and Hahn, S. Bias scheduling in heterogeneous multi-core architectures. In EuroSys'10 (2010), pp. 125–138.

[23] Kumar, V., Blackburn, S. M., and Grove, D. Friendly barriers: Efficient work-stealing with return barriers. In VEE'14 (2014), pp. 165–176.

[24] Kumar, V., Frampton, D., Blackburn, S. M., Grove, D., and Tardieu, O. Work-stealing without the baggage. In OOPSLA'12 (2012), pp. 297–314.

[25] Lea, D. A Java fork/join framework. In Proceedings of the ACM 2000 Conference on Java Grande (2000), JAVA'00, pp. 36–43.

[26] Leijen, D., Schulte, W., and Burckhardt, S. The design of a task parallel library. In OOPSLA'09 (2009), pp. 227–242.

[27] Liu, K., Pinto, G., and Liu, Y. D. Data-oriented characterization of application-level energy optimization. In FASE'15.

[28] Liu, Y. D. Energy-efficient synchronization through program patterns. In Proceedings of GREENS'12 (2012).

[29] Marlow, S., Peyton Jones, S., and Singh, S. Runtime support for multicore Haskell. In ICFP'09 (2009), pp. 65–78.

[30] Michael, M. M., Vechev, M. T., and Saraswat, V. A. Idempotent work stealing. In PPoPP'09 (2009), pp. 45–54.

[31] Morrison, A., and Afek, Y. Fence-free work stealing on bounded TSO processors. In ASPLOS'14 (2014), pp. 413–426.

[32] Pinheiro, E., Bianchini, R., Carrera, E. V., and Heath, T. Load balancing and unbalancing for power and performance in cluster-based systems. In Workshop on Compilers and Operating Systems for Low Power (2001), vol. 180, Barcelona, Spain, pp. 182–195.

[33] Pinto, G., Castor, F., and Liu, Y. D. Understanding energy behaviors of thread management constructs. In OOPSLA'14 (October 2014).

[34] Ribic, H., and Liu, Y. D. Energy-efficient work-stealing language runtimes. In ASPLOS'14 (2014), pp. 513–528.

[35] Roy, A., Rumble, S. M., Stutsman, R., Levis, P., Mazieres, D., and Zeldovich, N. Energy management in mobile devices with the Cinder operating system. In EuroSys'11 (2011), pp. 139–152.

[36] Sampson, A., Dietl, W., Fortuna, E., Gnanapragasam, D., Ceze, L., and Grossman, D. EnerJ: Approximate data types for safe and general low-power computation. In PLDI'11 (2011).

[37] Shen, X., Zhong, Y., and Ding, C. Locality phase prediction. In ASPLOS'04 (2004), pp. 165–176.

[38] Sherwood, T., Perelman, E., and Calder, B. Basic block distribution analysis to find periodic behavior and simulation points in applications. In PACT'01 (2001), pp. 3–14.

[39] Sherwood, T., Perelman, E., Hamerly, G., and Calder, B. Automatically characterizing large scale program behavior. In ASPLOS'02 (2002), pp. 45–57.

[40] Sherwood, T., Sair, S., and Calder, B. Phase tracking and prediction. In ISCA'03 (2003), pp. 336–349.

[41] Heath, T., Pinheiro, E., Hom, J., Kremer, U., and Bianchini, R. Code transformations for energy-efficient device management. IEEE Transactions on Computers 53 (2004).

[42] Varatkar, G., and Marculescu, R. Communication-aware task scheduling and voltage selection for total systems energy minimization. In ICCAD'03 (2003), pp. 510–517.

[43] Weiser, M., Welch, B., Demers, A., and Shenker, S. Scheduling for reduced CPU energy. In OSDI'94 (1994).

[44] Xie, F., Martonosi, M., and Malik, S. Compile-time dynamic voltage scaling settings: opportunities and limits. In PLDI'03 (2003), pp. 49–62.

[45] Yuan, W., and Nahrstedt, K. Energy-efficient soft real-time CPU scheduling for mobile multimedia systems. In SOSP'03 (2003), pp. 149–163.

[46] Zeng, H., Ellis, C. S., Lebeck, A. R., and Vahdat, A. Ecosystem: managing energy as a first class operating system resource. In ASPLOS'02 (2002), pp. 123–132.

[47] Zhang, Y., Hu, X., and Chen, D. Task scheduling and voltage selection for energy minimization. In Proceedings of the 39th Design Automation Conference (2002), pp. 183–188.

[48] Zhu, H. S., Lin, C., and Liu, Y. D. A programming model for sustainable software. In ICSE'15 (2015).

