
Handbook of Research on P2P and Grid Systems for Service-Oriented Computing: Models, Methodologies, and Applications

Nick Antonopoulos, University of Surrey, UK

George Exarchakos, University of Surrey, UK

Maozhen Li, Brunel University, UK

Antonio Liotta, University of Essex, UK

Hershey • New York: Information Science Reference

Volume I


Director of Editorial Content: Kristin Klinger
Director of Book Publications: Julia Mosemann
Development Editor: Christine Bufton
Publishing Assistant: Kurt Smith
Typesetter: Carole Coulson
Quality Control: Jamie Snavely
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Hershey, PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com/reference

Copyright © 2010 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Handbook of research on P2P and grid systems for service-oriented computing : models, methodologies and applications / Nick Antonopoulos ... [et al.].
p. cm.
Includes bibliographical references and index.
Summary: "This book addresses the need for peer-to-peer computing and grid paradigms in delivering efficient service-oriented computing"--Provided by publisher.
ISBN 978-1-61520-686-5 (hardcover) -- ISBN 978-1-61520-687-2 (ebook)
1. Peer-to-peer architecture (Computer networks)--Handbooks, manuals, etc. 2. Computational grids (Computer systems)--Handbooks, manuals, etc. 3. Web services--Handbooks, manuals, etc. 4. Service oriented architecture--Handbooks, manuals, etc. I. Antonopoulos, Nick.
TK5105.525.H36 2009
004.6'52--dc22
2009046560

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.


Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

Chapter 22
Improving Energy-Efficiency of Computational Grids via Scheduling

Ziliang Zong, South Dakota School of Mines and Technology, USA
Xiaojun Ruan, Auburn University, USA
Adam Manzanares, Auburn University, USA
Kiranmai Bellam, Auburn University, USA
Xiao Qin, Auburn University, USA

ABSTRACT

High performance Grid platforms and parallel computing technologies are experiencing their golden age because of the convergence of four critical momentums: high performance microprocessors, high-speed networks, free middleware tools, and rapidly increasing needs for computing capability. We are witnessing the rapid development of computational Grid technologies. Dozens of exciting Grid infrastructures and projects like Grid-tech, Grid Portals, Grid Fora, and Commercial Grid Initiatives are being built all over the world. However, the fast-growing power consumption of data centers has caused serious concerns for building more large-scale supercomputers, clusters, and Grids. Therefore, designing energy-efficient computational Grids to make them economically attractive and environmentally friendly for parallel applications becomes highly desirable. Unfortunately, most previous studies in Grid computing primarily focus on the improvement of performance, security, and reliability, while completely ignoring the energy conservation issue. To address this problem, we propose a general architecture for building energy-efficient computational Grids and discuss the potential possibilities for incorporating power-aware techniques into different layers of the proposed Grid architecture. In this chapter, we first provide the necessary background on computational Grids, Grid computing, and parallel scheduling. Next, we illustrate the general Grid architecture and explain the functionality of its different layers. After that, we discuss the design and implementation details of applying the energy-efficient job-scheduling technique called Communication Energy Conservation Scheduling (or CECS for short) to computational Grids. Finally, we present extensive simulation results to demonstrate the improvement in the energy efficiency of computational Grids.

DOI: 10.4018/978-1-61520-686-5.ch022


1. INTRODUCTION

We are now in an era of information explosion. Massive amounts of data are generated in the blink of an eye. In order to process these massive data sets, large-scale high-performance computing platforms, like clusters and computational Grids, have been widely deployed. This accommodates the rapid growth of cluster and Grid applications in both the academic and commercial arenas. A large fraction of applications running on these high-performance computing platforms are computing-intensive and storage-intensive, since these applications deal with a large amount of data transferred either between memory and storage systems or among hundreds of computing nodes via interconnection networks. There is no doubt that high performance computing has significantly changed our lives today. We will benefit even more once the computational power provided by Grids becomes standard and pervasive, similar to electricity and water services.

However, these large computational power increases come at a cost. Increasing evidence has shown that this powerful computing capability comes at the cost of huge energy consumption. For example, Energy User News stated that the power requirements of today's data centers range from 75 W/ft² to 150-200 W/ft² and will increase to 200-300 W/ft² in the near future (Moore, 2002). The new data center capacity projected for 2005 in the U.S. would require approximately 40 TWh ($4B at $100 per MWh) per year to run 24x7 unless it becomes more efficient (Kappiah et al., 2005). The supercomputing center in Seattle is forecast to increase the city's power demands by 25% (Bryce, 2000). The Environmental Protection Agency reported that the total energy consumption of servers and data centers in the United States was 61.4 billion kWh in 2006, more than double the energy used for the same purpose in 2000 (Environment Protection Agency, 2006). Even worse, the EPA has predicted that the power usage of servers and data centers will double again within five years if historical trends continue (Environment Protection Agency, 2006). Moreover, huge energy consumption simultaneously causes serious environmental pollution. Based on data from the EPA, generating 1 kWh of electricity in the United States results in an average of 1.55 pounds (lb) of carbon dioxide (CO2) emissions; at that rate, the 61.4 billion kWh consumed in 2006 corresponds to roughly 95 billion lb of CO2. Therefore, it is highly desirable to design economically attractive and environmentally friendly supercomputing platforms like computational Grids.

Most previous research on Grid computing has focused primarily on the improvement of performance, security, and reliability; the energy conservation issue in the context of computational Grids has received little attention. This study aims at reducing power consumption in computational Grids with minimal influence on performance. More specifically, in this chapter we address the issue of conserving energy for computational Grids through power-aware scheduling. We propose an energy-efficient scheduling algorithm, which can balance energy and performance by reducing the interconnection power cost of communication-intensive applications running on Grids. The major contribution of this research is to design and implement an energy-efficient parallel job-scheduling module (hereinafter referred to as CECS) to be integrated into a general computational Grid architecture. The proposed CECS module consists of two phases. The first phase aims to minimize communication overheads among parallel tasks by duplicating communication-intensive tasks. In the second phase, CECS judiciously allocates grouped tasks to the most energy-efficient computing nodes to optimize the energy conservation of the Grid system.

The rest of the chapter is organized as follows. In Section 2, we present a brief description of related work and provide the background information required for this research. Section 3 depicts a general Grid system architecture. In Section 4, we illustrate the Grid scheduling framework and explain the function of each module and their relationships. Sections 5 and 6 address the details of the task analyzer and the energy-efficient CECS scheduling algorithm. Experimental results and performance evaluations are presented in Section 7. Finally, Section 8 discusses future research trends and concludes the chapter.

2. BACKGROUND

2.1 Grid Systems

A computational Grid is a type of parallel and distributed system that enables the sharing, selection, and aggregation of resources distributed across multiple administrative domains based on the resources' availability, capacity, performance, and cost, as well as users' quality-of-service requirements. Strictly speaking, a large-scale distributed system that satisfies the following three conditions can be called a Grid: 1) computing resources are not administered centrally; 2) open standards are used; and 3) non-trivial quality of service is achieved.

A Grid system consists of a set P = {p1, p2, ..., pm} of heterogeneous computing nodes (hereinafter referred to as nodes) connected by a high-speed interconnect. A heterogeneous Grid can be represented by a graph, where computing nodes are vertices; there exists a weighted edge if the corresponding pair of nodes can communicate with each other. An n×m binary allocation matrix X is used to reflect a mapping of n tasks to m heterogeneous nodes: element xij in X is "1" if task ti is assigned to node pj and "0" otherwise. Grid systems are a complicated computing environment for four main reasons. First, different nodes have different preferences with respect to tasks, meaning that a node offering task ti a shorter execution time does not necessarily run faster for another task tj; thus, different nodes in a heterogeneous Grid favor different kinds of tasks. Second, the execution times of tasks on different nodes may vary because the nodes may have different clock speeds and processing capabilities. Third, the transmission rates of network interconnections depend on the underlying network types. Lastly, the energy consumption rates of the nodes and interconnections are not necessarily identical.
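As a concrete illustration of this model, the following Python sketch (ours, not from the chapter; all names are hypothetical) encodes a small heterogeneous Grid with an allocation matrix X and checks that each task is mapped to exactly one compatible node.

import math

# A small heterogeneous Grid model: m nodes, n tasks.
# c[i][j] is the execution time of task i on node j
# (math.inf marks a node that cannot run the task).
n, m = 3, 2
c = [
    [6.7, 3.9],       # task 0
    [4.0, 5.5],       # task 1
    [math.inf, 2.0],  # task 2 cannot run on node 0
]

# Binary allocation matrix X: x[i][j] == 1 iff task i runs on node j.
x = [
    [0, 1],
    [1, 0],
    [0, 1],
]

# Each task must be assigned to exactly one node that can execute it.
for i in range(n):
    assert sum(x[i]) == 1, f"task {i} must run on exactly one node"
    j = x[i].index(1)
    assert c[i][j] != math.inf, f"task {i} cannot run on node {j}"

# Execution time of each task under this allocation.
exec_time = [sum(x[i][j] * c[i][j] for j in range(m) if c[i][j] != math.inf)
             for i in range(n)]
print(exec_time)  # [3.9, 4.0, 2.0]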

To simplify the system model without loss of generality, we assume that all nodes are fully connected with dedicated and reliable network interconnections. Each node communicates with other nodes through message passing; the communication time between two tasks assigned to the same node is negligible. In addition, we assume computation and communication can take place simultaneously in our system model. This assumption is reasonable because each computing node in a modern cluster has a communication coprocessor that can free the node's processor from communication tasks. Since we primarily focus on energy consumption, each node in the system has an energy consumption rate measured in Joules per unit time. Furthermore, each network link is characterized by its energy consumption rate, which heavily depends on the link's transmission rate, modeled by the weight wij of the edge between nodes pi and pj.

2.2 Related Work in Energy-Efficiency of Computational Grids

In the past ten years, the issue of conserving energy in Grid computing platforms did not attract enough attention because the performance, reliability, and security issues were assigned a higher priority (Srinivasan & Jha, 1999). Recently, people have realized that power consumption may become the limiting factor for the future growth of high performance computing (HPC) platforms like Grids, since the energy demands of HPC have grown rapidly with the increasing number of supercomputing centers. How to build and operate energy-efficient data centers are challenging questions we have to address today. Tackling these energy-efficiency problems requires extensive research on a wide range of issues, from creating new computer architectures to innovative building designs.

From the perspective of HPC architecture design, there exist two fundamental approaches to conserving energy in large-scale computational Grids. The first is to build computational Grids using low-frequency, low-power processors with modest performance; performance can then be enhanced more efficiently through parallelism than through higher-power, higher-frequency processors. Green Destiny at the Los Alamos National Laboratory uses this approach to significantly reduce energy consumption per unit as compared to the Accelerated Strategic Computing Initiative (ASCI) Q machine (Warren et al., 2002). Another successful commercial example of this idea is the IBM Blue Gene®/L supercomputer (Gara et al., n.d.). A second approach is to build computational Grids using power-hungry, high-performance processors coupled with smart power management mechanisms. Grids designed using the second approach are usually called power-scalable Grids.

In power-scalable Grids, the power level is scaled down when Grids are not fully utilized and scaled up when Grids are busy. Dynamic Voltage and Frequency Scaling, or DVFS (Kappiah et al., 2005; Ge et al., 2005), is one of the most effective techniques for reducing energy consumption in power-scalable Grids. For example, Intel developed the SpeedStep technology (Intel, 2004), and AMD developed the PowerNow! and Cool'n'Quiet technologies ("Cool'n'Quiet Technology Installation Guide," n.d.).
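The leverage behind DVFS comes from the standard CMOS dynamic power model, stated here as textbook background (it is not given in the chapter):

$$P_{dynamic} = C_{eff} \cdot V^{2} \cdot f$$

where $C_{eff}$ is the effective switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. Because the attainable frequency scales roughly linearly with voltage, lowering both together reduces power roughly cubically while stretching execution time only linearly, which is why DVFS pays off for compute-bound phases.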

Processors, networks, disks, and cooling systems are the four major power consumers in a Grid computing system. A handful of previous studies have investigated energy-aware processor and memory design techniques to reduce the energy consumption of CPU and memory resources (Benini & Micheli, 1998; Rabaey & Pedram, 1998). Very recently, a large number of papers have reported progress in utilizing DVFS technology to reduce energy dissipation in clusters and high-performance computing platforms (Lorch & Smith, 2001; Martin, 2001; Miyoshi et al., 2002; Kappiah et al., 2006; Annavaram et al., 2005; Bianchini & Rajamony, 2004; Lim, 2006; Springer et al., 2006; Hsu & Feng, 2005a; Hotta et al., 2006; Hsu & Feng, 2005b). Results show that these schemes can achieve high energy efficiency for processors, indicating that DVFS is capable of saving a significant amount of energy for computation-intensive applications. However, the benefits of DVFS may diminish for communication-intensive applications, because the energy consumed by interconnects is likely to dominate the total power consumption in Grid computers.

Increasing evidence has shown that, in addition to processors, high-speed interconnections consume significant amounts of energy in clusters and Grids. For example, it has been observed that the interconnect consumes 33 percent of the total energy in an Avici switch (Dally et al., 1998), whereas routers and links consume 37 percent of the total power budget in a Mellanox server blade (Mellanox Technologies Inc., 2004). This situation is getting worse with the emergence of next-generation high-speed interconnections like Gigabit Ethernet, InfiniBand, Myrinet, and QsNetII. For instance, measurements have shown that 1 Gbps Ethernet consumes about 4 W more energy than 100 Mbps Ethernet, and 10 Gbps Ethernet may consume 10 W to 20 W more energy on average (Gunaratne, 2005). Saving energy in interconnects becomes even more critical for communication-intensive parallel applications, in which a huge amount of data must be transferred among precedence-constrained parallel tasks. To the best of our knowledge, no existing algorithm can efficiently address the power consumption of both interconnects and processors. One feasible approach to reducing the power consumed by interconnects is to duplicate tasks. Compared with existing energy-efficient techniques like DVFS, duplication-based scheduling strategies have unique advantages. First, task replicas can avoid communication overheads among tasks, thereby improving performance. Second, the large energy consumption of interconnects can be reduced for communication-intensive applications.

Novel techniques proposed to conserve energy in storage systems include dynamic power management schemes (Douglis et al., 1994), power-aware cache management strategies (Zhu et al., 2004), power-aware prefetching schemes (Sun & Kandemir, 2006), software-directed power management techniques (Son et al., 2005), and multi-speed settings (Gurumurthi et al., 2003). In 2002, Colarelli and Grunwald developed a massive array of idle disks, or MAID (Colarelli & Grunwald, 2002), which attempts to save energy by reducing the number of active disks and increasing the utilization of each active disk. The popular data concentration technique, or PDC, was proposed by Pinheiro and Bianchini in 2004 (Pinheiro & Bianchini, 2004). The basic idea of PDC is to migrate data across disks according to frequency of access, or popularity. The goal is to lay data out in such a way that popular and unpopular data are stored on different disks. This layout keeps disks storing unpopular data mostly idle, so that more idle disks can be switched to standby mode.

Research on energy-efficient cooling systems also plays an important role in building energy-efficient computational Grids. A diverse set of experts is working together to tackle problems at various levels, including studies of air temperature at different locations in a machine room, understanding opportunities for more efficient cooling, using computational fluid dynamics to model cooling flows, and designing entirely new architectures with the goal of reducing power consumption from the very outset.

Since performance evaluation tools like SPECFP ("Standard Performance Evaluation Corporation CFP2000," 2000), the NAS Parallel Benchmarks ("NAS Parallel Benchmarks," 2007), and the Top500 list ("Top500 Supercomputer List," 2008) have received considerable attention over the past decade, researchers are suggesting similarly robust assessments of the power efficiency of data centers (Bailey, 2006). The purpose of power-efficiency metrics is to foster an industry-wide focus on keeping power consumption under control. Appropriate power-efficiency metrics can be used to compare and rate systems in a manner similar to how LINPACK is used to measure achievable compiled performance on supercomputers. Such efforts are already underway for commercial data centers (Green 500 list, 2009).

2.3 Task Allocation Techniques in Grids

Task allocation strategies, which can be divided into task partitioning and scheduling strategies, play an important role in achieving high performance for parallel applications on clusters and Grids. The goal of a partitioning algorithm is to partition a parallel application into a set of precedence-constrained tasks represented in the form of a directed acyclic graph (DAG), whereas a scheduling algorithm is deployed to schedule the DAG onto a set of homogeneous or heterogeneous computational nodes. The scheduling strategies deployed in a Grid have a large impact on overall system performance.

Resource allocation techniques can generally be classified into two types: static and dynamic schemes. Static allocation schemes (Pinheiro & Bianchini, 2004; Sih & Lee, 1993; Sih & Lee, 1993b; Page et al., 1993) rely on prior knowledge of applications, including execution times, communication overheads, and the like. Static resource allocation schemes aim to produce scheduling solutions that optimize objectives at compile time; static scheduling, however, is extremely expensive in computational power (the underlying problem is NP-complete) for numerous complicated applications. In contrast, dynamic allocation strategies (Bowen et al., 1992; Yajnik et al., 1994; Efe, 1982), which are less expensive in computational power, provide sub-optimal resource management results.

Priority-based scheduling, group-based scheduling, and task-duplication-based scheduling (Bryce, 2000) are the three major policies for scheduling parallel tasks in Grid systems. Priority-based scheduling assigns priorities to tasks and then maps the tasks to computing nodes based on those priorities. Group-based scheduling tries to group intercommunicating tasks together and allocate them to a single computing node, thereby eliminating communication overheads. Duplication-based scheduling makes use of computing nodes' idle times to replicate predecessor tasks (Zong, 2006). Many researchers have demonstrated that task-duplication scheduling is applicable to communication-intensive workload conditions (Bryce, 2000; Colarelli & Grunwald, 2002; Ranaweera & Agrawal, 2000; Zong, 2006). The energy-aware scheduling strategy proposed here for Grids is, in fact, a combination of group-based scheduling and duplication-based scheduling.

3. GRID SYSTEM ARCHITECTURE

Grids are complicated distributed systems which consist of four major layers: the application, middleware, resource, and network layers (see Figure 1).

The network layer assures the connectivity of all resources in the Grid. The resource layer contains the actual resources, such as processors, caches, memories, hard drives, and even satellites or other instruments, which can be connected to the network layer. The middleware layer serves as the "heart" of the Grid system because it contains most of the intelligent modules, such as the resource broker, security access, task analyzer, task scheduler, communication service, information service, and reliability control, which make the various computing resources work together smoothly. The highest layer of the Grid architecture is the application layer, which includes various user applications from science, engineering, business, and finance, along with the portals and development toolkits supporting these applications.

All seven modules (possibly more in specialized Grids) in the middleware layer collaborate with each other to assure the performance of the Grid system. In particular, the resource broker is the first gate to the application layer. Once users install the corresponding Grid-supported software, they can connect to the Grid system by submitting their requests to the resource broker. The security module checks whether the users have the authority required to link to the Grid system. Once approved, the task analyzer splits a big job into a number of small tasks, and the dependencies of these tasks are also decided. Next, the task scheduler allocates tasks to distributed computing resources based on different scheduling strategies. In a large-scale Grid, the efficiency of the task scheduler largely decides the performance of the entire system; indeed, one of the main contributions of this research is to design and implement an efficient energy-aware scheduler for Grid systems. In addition, the communication service module is responsible for supporting services like remote function calls. The information service module keeps all the detailed execution information about the tasks running on different computing resources. Last but not least, the reliability control module guarantees the reliability and availability of the Grid system. For example, the reliability control module could reject a decision made by the task scheduler if it judges that the target computing resource is not reliable based on its previous record.

Figure 1. Grid system architecture

4. GRID SCHEDULING FRAMEWORK

Grids are complex multivariate environments, made up of numerous Grid entities that need to be managed. To make coherent and coordinated use of ubiquitous and heterogeneous resources, Grid scheduling mechanisms are critical components that must be developed to achieve high performance and energy efficiency. Basically, Grid scheduling needs to manage two major things effectively: resources and jobs. How can available resources be found within the shortest period? How can jobs be allocated to these available resources in a reasonable way? These questions are answered by the Grid scheduling module. Figure 2 illustrates the framework of Grid scheduling. In this framework, the global level scheduler (or Grid level scheduler) coordinates multiple local schedulers or helps to select the best resources for a job among different possible resource offers. Typically, the global level scheduler itself has no direct control over resources; therefore, it has to communicate with and appropriately guide several local level schedulers to complete the tasks submitted by users. Those local level schedulers either control resources directly or have some kind of access to their local resources. The global level scheduler is also responsible for collaborating with other important supportive middleware entities like the information service, communication service, and reliability control modules.

Figure 2. A framework for Grid scheduling

Once the Grid scheduling module has collected all the information about currently available resources, it judiciously chooses target resources based on its scheduling policy and allocates the subtasks produced by the task analyzer to these chosen resources for parallel execution. Figure 3 shows the job scheduling flow in a Grid system. During execution, the results collector periodically checks the sub-results, which may return in arbitrary order, and transfers them to the Grid level scheduler. The Grid scheduler passes the latest information to all tasks, ensuring that tasks with dependencies can be executed immediately once they obtain the needed sub-results.

Figure 3. Job scheduling in Grid

In order to deal with resource heterogeneity and support our energy-efficient scheduling policy, the Grid level task scheduler has the following important attributes:

• Withdraw allocations: The target resources could be taken back by the local administrator and used to execute tasks with higher priority. In this case, the scheduler must be able to withdraw its allocations and reallocate the corresponding tasks to other available resources.

• Allocation and migration: Any task can be suspended at one computing site and migrated, along with its corresponding data, to another site. Migrations can boost the reliability and efficiency of the Grid, because entire applications do not need to be re-executed if failures occur in the Grid for unexpected or uncontrollable reasons.

• Exclusive allocations: Different computing resources might have a particular preference for, or be exclusive to, different types of tasks. That means a processor that offers task ti a shorter execution time does not necessarily run faster for another task tj. Even worse, some processors are exclusive to specific types of tasks. The scheduler has to avoid conflicts between tasks and resources.

• Tentative allocations: To make a smart final decision, the scheduler needs to calculate and compare the allocation cost by tentatively allocating tasks to different available resources. The scheduler should be able to easily revoke a tentative allocation.

• Dependent task allocations: Task dependency is another important factor the scheduler needs to consider. In our framework, the task analyzer provides detailed task information to the scheduler, and the scheduler always tries to allocate tasks with high dependency to the same resources in order to reduce communication overhead.

5. TASK ANALYZER

The task analyzer is the component that analyzes task characteristics and task dependencies, and estimates the possible execution time of each task based on the task type or user-provided information. In our framework, parallel tasks with dependencies are represented by directed acyclic graphs (DAGs). A parallel application running in a Grid is specified as a pair (T, E), where T = {t1, t2, ..., tn} represents a set of parallel tasks and E is a set of weighted, directed edges representing the communication among tasks; e.g., (ti, tj) ∈ E is a message transmitted from task ti to tj. The precedence constraints of the parallel tasks are represented by the edges in E. The communication time spent delivering a message (ti, tj) ∈ E from task ti on node pu to tj on node pv is determined by sij/buv, where sij is the message's data size and buv is the transmission rate of the link connecting nodes pu and pv. The execution times of a task ti running on a set of heterogeneous computing nodes are modeled by a vector $c_i = (c_i^1, c_i^2, \ldots, c_i^m)$, where $c_i^j$ represents the execution time of ti on the jth computing node. If task ti cannot be executed on node pj, the corresponding execution time $c_i^j$ in the vector $c_i$ is marked as ∞. We define a task as an entry task if it does not have any predecessor tasks; similarly, a task is called an exit task if no task follows it. The task analyzer takes the user request (which usually contains the necessary task description information) as input and generates DAGs as output.
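A minimal sketch of this (T, E) representation in Python (our illustration; the task names, sizes, and rates are hypothetical): tasks carry per-node execution-time vectors, edges carry message sizes, and entry/exit tasks are derived from the edge set.

import math

# Tasks T = {t1..t4} with execution-time vectors over m = 2 nodes;
# math.inf marks a node that cannot execute the task.
exec_times = {
    "t1": [6.7, 3.9],
    "t2": [4.0, 5.5],
    "t3": [math.inf, 2.0],
    "t4": [3.0, 3.2],
}

# Directed edges E with message sizes s_ij.
edges = {("t1", "t2"): 10.0, ("t1", "t3"): 6.0,
         ("t2", "t4"): 8.0, ("t3", "t4"): 4.0}

# Transmission rate b_uv of the link between each pair of nodes.
b = {(0, 1): 2.0, (1, 0): 2.0}

def comm_time(src_node, dst_node, size):
    """Communication time s_ij / b_uv; zero if tasks are co-located."""
    if src_node == dst_node:
        return 0.0
    return size / b[(src_node, dst_node)]

# Entry tasks have no predecessors; exit tasks have no successors.
preds = {t: [u for (u, v) in edges if v == t] for t in exec_times}
succs = {t: [v for (u, v) in edges if u == t] for t in exec_times}
entry_tasks = [t for t in exec_times if not preds[t]]
exit_tasks = [t for t in exec_times if not succs[t]]
print(entry_tasks, exit_tasks)  # ['t1'] ['t4']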

Figure 4 illustrates the task description of a parallel application represented by a task graph, a mapping matrix, and a cluster with three heterogeneous computing nodes. The task graph contains ten tasks; the computing node graph (or processor graph) has three heterogeneous computing nodes. $EN_{active}^{i}$ is the energy consumption rate of the ith computing node in the active mode, and $EN_{idle}^{i}$ is the energy consumption rate of the ith computing node in the idle (or standby) mode. Similarly, $EL_{active}^{ij}$ and $EL_{idle}^{ij}$ are the energy consumption rates of the link between the ith and jth nodes when data is being transferred and when no data is being transmitted, respectively. For example, the energy consumption rates of the network link between nodes p1 and p2 are $EL_{active}^{12} = 30$ and $EL_{idle}^{12} = 10$ when the link is in the busy and idle modes, respectively. The energy consumption rates of node p1 are $EN_{active}^{1} = 25$ and $EN_{idle}^{1} = 8$ when it is active and idle, respectively. We assume that the transmission rate between two computing nodes is the same in both directions. The execution time vector of each task on the three processors is illustrated in Figure 4(d). For example, the execution times of task t1 on nodes p1, p2, and p3 are 6.7, 3.9, and 2.0 time units, respectively. Note that task t9 cannot be allocated to p1 because the corresponding item in the mapping matrix is marked as ∞.

Figure 4. Task analyzer generates DAG

6. ENERGY-EFFICIENT SCHEDULING

In this section, we present the details of the scheduling policy used in our Grid scheduling framework. Here we suppose that the task analyzer has provided accurate information about task dependencies. There are three major phases in our CECS scheduling. The first phase, called grouping, groups tasks with the highest dependency together. Phase two, called task duplication, aims to duplicate as many tasks as possible, provided that energy can be significantly reduced without noticeably affecting performance. In phase three, the scheduler tentatively allocates the grouped tasks to different available resources while calculating the energy cost. Eventually, the scheduler completes the resource allocations by dispatching tasks, coupled with their data, to the target computing nodes. Since phases two and three depend heavily on the calculation of energy consumption, we start the presentation of the algorithm by introducing an energy consumption model in the context of Grids.

6.1 Energy Consumption Model

In our energy consumption model, we consider the energy cost of both computing nodes and interconnections.

Let $en_i$ be the energy consumption caused by task $t_i$ running on a computing node whose active energy consumption rate is $PN_{active}^{j}$. The energy dissipation of task $t_i$ can be expressed as Eq. (1):

$$en_i = \sum_{j=1}^{m} \left( x_{ij} \cdot PN_{active}^{j} \cdot c_i^j \right) \qquad (1)$$

where $x_{ij}$ is an element of the allocation matrix X, and

$$x_{ij} = \begin{cases} 1 & \text{if task } t_i \text{ is allocated to node } p_j \\ 0 & \text{otherwise.} \end{cases}$$

Given a parallel application with a task set T and allocation matrix X, we can calculate the energy consumed by all the tasks in T using Eq. (2).

$$EN_{active} = \sum_{t_i \in T} en_i = \sum_{i=1}^{n} en_i = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( x_{ij} \cdot PN_{active}^{j} \cdot c_i^j \right) \qquad (2)$$

For a Grid where the energy consumption rates of all the computing nodes are identical, i.e., $PN_{active}^{1} = PN_{active}^{2} = \cdots = PN_{active}^{m}$, Eq. (2) can be simplified as follows:

$$EN_{active} = PN_{active} \cdot \sum_{i=1}^{n} \sum_{j=1}^{m} \left( x_{ij} \cdot c_i^j \right) \qquad (3)$$
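To make Eqs. (1)-(3) concrete, here is a small Python sketch (ours; the numbers are illustrative, loosely echoing the Figure 4 example rather than reproducing it):

# Active energy of Eqs. (1)-(2): en_i = sum_j x_ij * PN_active[j] * c[i][j].
PN_active = [25.0, 20.0, 18.0]  # active power rate of each node (J per time unit)

c = [                            # c[i][j]: execution time of task i on node j
    [6.7, 3.9, 2.0],
    [5.0, 4.2, 6.1],
]
x = [                            # allocation matrix: task 0 -> node 2, task 1 -> node 1
    [0, 0, 1],
    [0, 1, 0],
]

def task_energy(i):
    """Eq. (1): energy dissipated by task i on its assigned node."""
    return sum(x[i][j] * PN_active[j] * c[i][j] for j in range(len(PN_active)))

# Eq. (2): total active energy of all tasks.
EN_active = sum(task_energy(i) for i in range(len(c)))
print(task_energy(0), EN_active)  # 36.0 and 36.0 + 84.0 = 120.0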

Page 13: Handbook of Research on P2P and Grid Systems for Service ...cs.txstate.edu/~zz11/publications/bookchapter.pdf · High performance Grid platforms and parallel computing technologies


Recall that $PN_{idle}^{j}$ is the energy consumption rate of the jth computing node when it is inactive, and $f_i$ is the completion time of task $t_i$. The energy consumed by an inactive node is the product of the idle energy consumption rate $PN_{idle}^{j}$ and the idle period. Thus, we can use Eq. (4) to obtain the energy consumed by the jth computing node in a Grid while the node sits idle:

$$EN_{idle}^{j} = PN_{idle}^{j} \cdot \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \left( x_{ij} \cdot c_i^j \right) \right) \qquad (4)$$

where $\max_{i=1}^{n}(f_i)$ on the right-hand side of Eq. (4) is the schedule length (also known as the makespan), and $\max_{i=1}^{n}(f_i) - \sum_{i=1}^{n} ( x_{ij} \cdot c_i^j )$ in Eq. (4) is the total idle time of the jth node. The total energy consumption of all the idle nodes in the Grid can be derived from Eq. (4) as:

$$EN_{idle} = \sum_{j=1}^{m} EN_{idle}^{j} = \sum_{j=1}^{m} \left( PN_{idle}^{j} \cdot \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \left( x_{ij} \cdot c_i^j \right) \right) \right) \qquad (5)$$

For a Grid where $PN_{idle}^{1} = PN_{idle}^{2} = \cdots = PN_{idle}^{m}$, Eq. (5) can be simplified as:

$$EN_{idle} = PN_{idle} \cdot \sum_{j=1}^{m} \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \left( x_{ij} \cdot c_i^j \right) \right) \qquad (6)$$

Consequently, the total energy consumption of the parallel application running on the Grid can be derived from Eqs. (2) and (5) as

$$EN = EN_{active} + EN_{idle} = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( x_{ij} \cdot PN_{active}^{j} \cdot c_i^j \right) + \sum_{j=1}^{m} \left( PN_{idle}^{j} \cdot \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \left( x_{ij} \cdot c_i^j \right) \right) \right) \qquad (7)$$

Since a homogeneous cluster can be envisioned as a special case of a heterogeneous Grid, the total energy consumption of the computing nodes of a homogeneous cluster can be derived from Eqs. (3) and (6):

$$EN = EN_{active} + EN_{idle} = PN_{active} \cdot \sum_{i=1}^{n} \sum_{j=1}^{m} \left( x_{ij} \cdot c_i^j \right) + PN_{idle} \cdot \sum_{j=1}^{m} \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \left( x_{ij} \cdot c_i^j \right) \right) \qquad (8)$$
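The node-side energy of Eq. (7) is straightforward to compute once the completion times $f_i$ are known; a small Python sketch follows (ours, with hypothetical inputs):

# Eq. (7): EN = active energy per task + idle energy per node.
def grid_node_energy(x, c, f, PN_active, PN_idle):
    """x: n-by-m allocation matrix, c: n-by-m execution times,
    f: completion time of each task, PN_*: per-node power rates."""
    n, m = len(x), len(PN_active)
    makespan = max(f)  # max_i(f_i): schedule length
    active = sum(x[i][j] * PN_active[j] * c[i][j]
                 for i in range(n) for j in range(m))
    idle = sum(PN_idle[j] *
               (makespan - sum(x[i][j] * c[i][j] for i in range(n)))
               for j in range(m))
    return active + idle

# Two tasks on two nodes; task 0 -> node 0, task 1 -> node 1.
x = [[1, 0], [0, 1]]
c = [[4.0, 6.0], [5.0, 3.0]]
f = [4.0, 3.0]                   # completion times under this allocation
print(grid_node_energy(x, c, f, PN_active=[25.0, 20.0], PN_idle=[8.0, 6.0]))
# 160.0 active + 6.0 idle = 166.0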

We denote $el_{ij}$ as the energy consumed by the transmission of message $(t_i, t_j) \in E$ from task $t_i$ on node $p_u$ to $t_j$ on node $p_v$. We can compute the energy consumption of the message as the product of its communication time $s_{ij}/b_{uv}$ and the power $PL_{active}^{uv}$ of the active link between $p_u$ and $p_v$. Thus, we have:

$$el_{ij} = \sum_{u=1}^{m} \sum_{v=1}^{m} \left( PL_{active}^{uv} \cdot x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right) \qquad (9)$$

The energy consumed by a network link between $p_u$ and $p_v$ is the cumulative energy consumption caused by all messages transmitted over that link. Therefore, the link's energy consumption is expressed by Eq. (10) below, where $L_{uv}$ is the set of messages delivered via the link between $p_u$ and $p_v$:

$$L_{uv} = \left\{ e_{ij} \in E \mid x_{iu} = 1 \wedge x_{jv} = 1 \right\}$$

$$EL_{active}^{uv} = \sum_{e_{ij} \in L_{uv}} el_{ij} = \sum_{e_{ij} \in L_{uv}} \left( PL_{active}^{uv} \cdot \frac{s_{ij}}{b_{uv}} \right) = \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \left( PL_{active}^{uv} \cdot x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right) \qquad (10)$$


The energy consumption of the whole interconnection network is derived from Eq. (10) as the summation of all the links' energy consumption. Thus, we have:

$$EL_{active} = \sum_{u=1}^{m} \sum_{v=1, v \ne u}^{m} EL_{active}^{uv} = \sum_{u=1}^{m} \sum_{v=1, v \ne u}^{m} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \left( PL_{active}^{uv} \cdot x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right) \qquad (11)$$

We can express energy consumed by a link when it is inactive as a product of the idle energy consumption rate and the idle period of the link. Thus, we have

$$EL_{idle}^{uv} = PL_{idle}^{uv} \cdot \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \left( x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right) \right) \qquad (12)$$

where $PL_{idle}^{uv}$ is the power of the link when it is inactive, and $\max_{i=1}^{n}(f_i) - \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \left( x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right)$ is the total idle time of the link during the course of the parallel application's execution. We can express the energy incurred by the network interconnections during idle periods as:

$$EL_{idle} = \sum_{u=1}^{m} \sum_{v=1, v \ne u}^{m} EL_{idle}^{uv} = \sum_{u=1}^{m} \sum_{v=1, v \ne u}^{m} PL_{idle}^{uv} \cdot \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \left( x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right) \right) \qquad (13)$$

Total energy consumption exhibited by the interconnection is derived from Eqs. (11) and (13) as below

$$EL = EL_{active} + EL_{idle} = \sum_{u=1}^{m} \sum_{v=1, v \ne u}^{m} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \left( PL_{active}^{uv} \cdot x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right) + \sum_{u=1}^{m} \sum_{v=1, v \ne u}^{m} PL_{idle}^{uv} \cdot \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \left( x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right) \right) \qquad (14)$$
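Analogously to the node energy, Eqs. (9)-(14) translate into a short loop over node pairs; the following Python sketch (ours, hypothetical inputs) computes EL for a fully connected set of links:

# Eq. (14): interconnect energy = active transfer energy + idle link energy.
def grid_link_energy(x, s, b, f, PL_active, PL_idle):
    """x: n-by-m allocation matrix; s[i][j]: message size of edge (t_i, t_j),
    or 0 if absent; b[u][v]: link transmission rate; f: task completion times;
    PL_*[u][v]: per-link power rates."""
    n, m = len(x), len(b)
    makespan = max(f)
    total = 0.0
    for u in range(m):
        for v in range(m):
            if u == v:
                continue
            # Total busy time of link (u, v) over all messages it carries.
            busy = sum(x[i][u] * x[j][v] * s[i][j] / b[u][v]
                       for i in range(n) for j in range(n) if i != j)
            total += PL_active[u][v] * busy             # Eq. (11) term
            total += PL_idle[u][v] * (makespan - busy)  # Eq. (13) term
    return total

x = [[1, 0], [0, 1]]                 # task 0 on node 0, task 1 on node 1
s = [[0.0, 8.0], [0.0, 0.0]]         # one message t0 -> t1 of size 8
b = [[0.0, 2.0], [2.0, 0.0]]         # link rate 2 in both directions
f = [4.0, 10.0]
PL_active = [[0.0, 30.0], [30.0, 0.0]]
PL_idle = [[0.0, 10.0], [10.0, 0.0]]
print(grid_link_energy(x, s, b, f, PL_active, PL_idle))  # 280.0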

Now, we can compute energy dissipation experienced by a parallel application on a Grid using Eqs. (7) and (14). Hence, we can express the total energy consumption of the Grid executing the application as follows

$$E = EN + EL = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( x_{ij} \cdot PN_{active}^{j} \cdot c_i^j \right) + \sum_{j=1}^{m} \left( PN_{idle}^{j} \cdot \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \left( x_{ij} \cdot c_i^j \right) \right) \right) + \sum_{u=1}^{m} \sum_{v=1, v \ne u}^{m} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \left( PL_{active}^{uv} \cdot x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right) + \sum_{u=1}^{m} \sum_{v=1, v \ne u}^{m} PL_{idle}^{uv} \cdot \left( \max_{i=1}^{n}\left(f_i\right) - \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \left( x_{iu} \cdot x_{jv} \cdot \frac{s_{ij}}{b_{uv}} \right) \right) \qquad (15)$$
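Putting the two halves together, Eq. (15) is simply the node term of Eq. (7) plus the link term of Eq. (14); assuming the two sketches above (our hypothetical helpers grid_node_energy and grid_link_energy) are in scope, the total is:

# Eq. (15): total energy = node energy (Eq. 7) + interconnect energy (Eq. 14).
def grid_total_energy(x, c, s, b, f, PN_active, PN_idle, PL_active, PL_idle):
    return (grid_node_energy(x, c, f, PN_active, PN_idle) +
            grid_link_energy(x, s, b, f, PL_active, PL_idle))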

6.2 Grouping Phase

The responsibility of the grouping phase is to associate the most relevant tasks (i.e., tasks on the same critical paths) into groups. Given a parallel application modeled as a task graph or DAG, the grouping phase yields a group-based graph of the DAG. Since all tasks in a group are allocated to the same computing node, where there are no waiting times among the tasks within the group, we can significantly reduce the communication overheads of highly dependent tasks with intensive communications. Additionally, a task-duplication strategy is applied in the process of grouping to further improve system performance by replicating tasks onto multiple computing nodes if schedule lengths can be shortened. More specifically, the grouping phase can be divided into two sub-steps: original task sequence generation and parameter calculation.


6.2.1 Original Task Sequence Generating

The dependence constraints of a set of parallel tasks have to be guaranteed by executing predecessor tasks before their corresponding successor tasks. To achieve this goal, the first step in the grouping phase is to construct an original ordered task sequence using the concept of levels, where the level of each task is defined as the length, in computation time, of the longest path from that task to the exit task. In other words, the level of a task indicates how far the task is from the completion of the exit task. A similar approach proposed by Srinivasan and Jha can be found in (Srinivasan & Jha, 1999). We define the level L(ti) of task ti as below:

$$L(t_i) = \begin{cases} c_i, & \text{if } succ(t_i) = \varnothing \\ \max\limits_{t_k \in succ(t_i)} \left( L(t_k) \right) + c_i, & \text{otherwise} \end{cases} \qquad (16)$$

The level of an exit task is equal to its execution time. The levels of the other tasks can be obtained in a bottom-up fashion by specifying the level of the exit task as its execution time and then recursively applying the second term on the right-hand side of Eq. (16). Initially, all ungrouped tasks are marked NOT_GROUPED and the list of groups is initialized to an empty set. Next, all the tasks are sorted in increasing order of their levels and then considered for grouping in that order.
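A compact way to compute Eq. (16) bottom-up is memoized recursion over the successor sets; the sketch below (ours; it uses a single scalar cost per task, whereas the chapter's heterogeneous model would substitute a representative execution time) also produces the sorted task sequence:

from functools import lru_cache

# Scalar computation cost of each task (a representative execution time).
cost = {"t1": 5.0, "t2": 4.0, "t3": 1.0, "t4": 2.0}
succs = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"], "t4": []}

@lru_cache(maxsize=None)
def level(t):
    """Eq. (16): longest computation-time path from t to the exit task."""
    if not succs[t]:                       # exit task
        return cost[t]
    return max(level(k) for k in succs[t]) + cost[t]

# Original task sequence: tasks sorted in increasing order of level.
sequence = sorted(cost, key=level)
print([(t, level(t)) for t in sequence])
# [('t4', 2.0), ('t3', 3.0), ('t2', 6.0), ('t1', 11.0)]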

6.2.2 Parameter Calculations

The second step in the grouping phase is to calculate several important parameters, which will be used in the third step to generate duplication-based task sequences. The important notation and parameters are summarized in Table 1. It should be noted that similar notation was used by Darbha and Agrawal in (Colarelli & Grunwald, 2002).

The earliest start time of the entry task is 0 (see the first term on the right-hand side of Eq. (17)). The earliest start times of all the other tasks can be computed in a top-down manner by recursively applying the second term on the right-hand side of Eq. (17):

$$EST(t_i) = \begin{cases} 0, & \text{if } pred(t_i) = \varnothing \\ \min\limits_{e_{ji} \in E} \left( \max\left( ECT(t_j),\ \max\limits_{e_{ki} \in E,\, t_k \ne t_j} \left( ECT(t_k) + \frac{s_{ki}}{b} \right) \right) \right), & \text{otherwise} \end{cases} \qquad (17)$$

The earliest completion time of task ti is expressed as the summation of its earliest start time and execution time. Thus, we have:

$$ECT(t_i) = EST(t_i) + \sum_{j=1}^{m} x_{ij} \cdot c_i^j \qquad (18)$$
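The mutual recursion of Eqs. (17) and (18) can be sketched as follows (ours; we again use a scalar cost per task and a single link rate b, and each predecessor j is tried in turn as the co-located favorite whose message cost vanishes):

from functools import lru_cache

cost = {"t1": 5.0, "t2": 4.0, "t3": 1.0, "t4": 2.0}
preds = {"t1": [], "t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
s = {("t1", "t2"): 10.0, ("t1", "t3"): 6.0,   # message sizes s_ji
     ("t2", "t4"): 8.0, ("t3", "t4"): 4.0}
b = 2.0                                        # uniform link rate

@lru_cache(maxsize=None)
def est(t):
    """Eq. (17): t_j is the co-located predecessor candidate; every other
    predecessor t_k must deliver its message, costing s_ki / b."""
    if not preds[t]:
        return 0.0
    return min(
        max([ect(j)] +
            [ect(k) + s[(k, t)] / b for k in preds[t] if k != j])
        for j in preds[t])

def ect(t):
    """Eq. (18): earliest completion time with a scalar task cost."""
    return est(t) + cost[t]

print({t: (est(t), ect(t)) for t in cost})
# {'t1': (0.0, 5.0), 't2': (5.0, 9.0), 't3': (5.0, 6.0), 't4': (9.0, 11.0)}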

Allocating task ti and its favorite predecessor FP(ti) on the same computing node can lead to a short schedule length. As such, the favorite predecessor FP(ti) is defined as below

$$FP(t_i) = t_j, \ \text{where } \forall\, e_{ji} \in E,\ e_{ki} \in E,\ k \ne j:\ ECT(t_j) + \frac{s_{ji}}{b} \ \ge\ ECT(t_k) + \frac{s_{ki}}{b} \qquad (19)$$

As shown by the first term on the right-hand side of Eq. (20), the latest allowable completion time of the exit task equals its earliest completion time. The latest allowable completion times of all the other tasks are calculated in a top-down manner by recursively applying the second term on the right-hand side of Eq. (20).

Table 1. Important notation and parameters

Notation Definition

EST(ti) Earliest start time of task ti

ECT(ti) Earliest completion time of task ti

FP(ti) Favorite predecessor of task ti

LACT(ti) Latest allowable completion time of task ti

LAST(ti) Latest allowable start time of task ti


$$LACT(t_i) = \begin{cases} ECT(t_i), & \text{if } succ(t_i) = \varnothing \\ \min\left( \min\limits_{e_{ij} \in E,\, t_i \ne FP(t_j)} \left( LAST(t_j) - \frac{s_{ij}}{b} \right),\ \min\limits_{e_{ij} \in E,\, t_i = FP(t_j)} \left( LAST(t_j) \right) \right), & \text{otherwise} \end{cases} \qquad (20)$$

The latest allowable start time of task ti is derived from its latest allowable completion time and execution time. Hence, the LAST(ti) can be written as

$$LAST(t_i) = LACT(t_i) - \sum_{j=1}^{m} x_{ij} \cdot c_i^j \qquad (21)$$

6.3 Duplication Task Sequence Generating

Once the original task sequence and the important parameters are available, we are ready to apply the duplication strategy to further reduce the energy consumption caused by frequently transferring data among computing nodes. Figure 5 illustrates the flow chart for forming the final task group graph. Initially, no task is marked as "grouped" and the task list is initialized to be empty. Next, the scheduler considers the first task and inserts it into a newly formed group called G1. In the following iterations, the scheduler walks through the tasks along the favorite-predecessor chain, attempting to add all tasks on the critical path to the same group. Once a task is added to a group, it is immediately marked as "grouped". If the task being processed is the entry task, the current iteration ends, and a new iteration starts in the next loop by choosing the first ungrouped task from the original task sequence generated in step 1. While walking through multiple critical paths, we may find that some tasks have already been added to a group. At this point, the duplication strategy decides whether or not to duplicate such a task into multiple groups by comparing the value of LAST(t) - LACT(t') with the communication time cc(t, t'). A task is duplicated if doing so reduces the schedule length, and is not duplicated otherwise. At the end of the process, the task graph has been divided into groups. Finally, creating edges among all the groups that communicate with each other generates the group graph; the scheduler then assigns a weight to each edge to represent the corresponding communication cost.

Figure 5. Implementation flow chart for task duplication
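The duplication test in the flow chart reduces to a one-line comparison. In the sketch below (ours; the inequality direction is our reading of "comparing LAST(t) - LACT(t') and cc(t, t')"), a predecessor t' of t is replicated into t's group when its message costs more time than the slack available between the two tasks:

def should_duplicate(last_t, lact_t_prime, comm_time):
    """Duplicate predecessor t' into the group of t when the communication
    time cc(t, t') exceeds the slack LAST(t) - LACT(t'), i.e., when keeping
    the message on the wire would stretch the schedule."""
    slack = last_t - lact_t_prime
    return comm_time > slack

# Example: slack of 3 time units, message costs 5 -> duplicating t' pays off.
print(should_duplicate(last_t=12.0, lact_t_prime=9.0, comm_time=5.0))  # True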


6.4 Energy-Efficient Group Allocation Phase

After the grouping and duplication stages above, the DAG has been partitioned into a number of groups, which are allocated to heterogeneous computing nodes in an energy-efficient way by the next step. The main objective of this phase is to generate an allocation list with minimized energy dissipation. Recall that there might be exclusion relations between some tasks and nodes; we therefore have to verify whether a node and a group are exclusive to each other. In other words, we have to ensure that all tasks in the group are exclusion-compatible with the node they are to be allocated on. If any task is exclusive to the current node, our algorithm performs the same verification on another computing node until an exclusion-compatible node is identified. Once a group and a computing node pass the compatibility verification, the group is temporarily allocated to the node. Next, the scheduler calculates the energy consumption caused by the group running on that node; this estimation can be carried out using the energy consumption model described in Section 6.1. The value of the energy consumption is saved in an energy cost array. The scheduler then applies the same procedure to the next compatible node type, repeating until all candidate compatible nodes for the group have been considered. Finally, the scheduler updates the allocation list with the compatible node that leads to the minimal energy dissipation. After the group allocation phase is complete, the scheduler provides a final allocation solution with the minimized energy consumption of the Grid. The following pseudocode shows one way of implementing the energy-efficient group allocation phase.

Pseudo code of group allocation to minimize energy consumption

Energy_Efficient_Allocation () {
    set allocation list to empty;
    for each cluster c in the cluster graph G {
        n = Energy_Efficient_Calculation (c, N);
        mark c as finally allocated to n, update allocation list;
    }
    return allocation list;
}

Energy_Efficient_Calculation (c, N) {
    i = 1;
    n = the first node in N;
    while (n is not the last node in N) {
        Legal_n = Exclusion_Verify (c, n);
        add Legal_n to the Legal_Node_List;
        n = the next node following Legal_n in N;
    }
    for each node n in Legal_Node_List {
        if (n has not been allocated any cluster) {
            mark c as temporarily allocated to n;
            // Energy_Consumption() calculates the energy cost
            // assuming c is allocated to n
            temp_energy_cost[i] = Energy_Consumption (c, n);
            i++;
        }
    }
    return the node with the minimum value in array temp_energy_cost[];
}

Exclusion_Verify (c, n) {
    for each task t in cluster c {
        if (t is exclusive with n) {
            // skip to the next node in N and verify it instead
            n = the next node following n in N;
            return Exclusion_Verify (c, n);
        }
    }
    return n;
}

7. PERFORMANCE EVALUATION

Now we are ready to evaluate the effectiveness of the proposed energy-efficient scheduling framework. To show the strength of our scheduling scheme, we conducted extensive experiments using parallel application models including Gaussian Elimination and Fast Fourier Transform (FFT) applications. In this section, we compare CECS with two existing scheduling algorithms, the Non-Duplication Scheduling algorithm (NDS) and the Task Duplication Scheduling algorithm (TDS) (Ranaweera & Agrawal, 2000). In addition, we compare our algorithm with a baseline algorithm: Energy-Efficient Non-Duplication Scheduling (EENDS). The NDS, TDS, and EENDS algorithms are briefly described below.

1. NDS: This is a non-duplication-based algorithm (also known as the static priority-based Modified Critical Path (MCP) algorithm) with a time complexity of $O(n^2(\log n + m))$, where n and m are the numbers of tasks and nodes, respectively. NDS, which does not duplicate any task, makes scheduling decisions using the critical-path method.

2. TDS: The TDS algorithm allocates all tasks on a critical path to one computing node. If tasks have already been dispatched to other nodes, TDS only duplicates the tasks that can potentially shorten the schedule length. TDS aims to generate a schedule of a parallel application with the shortest schedule length.

3. EENDS: To the best of our knowledge, EENDS is a baseline algorithm that cannot be found in the literature. In order to comprehensively understand the impact of the grouping phase, we combine the second phase of our algorithm with NDS grouping to form the new EENDS scheduling algorithm.

The basic yet important method we used in our experiments is called OTOP (Once Tuning One Parameter): in each experimental study we vary only one parameter while keeping the other parameters unchanged. By tuning one parameter at a time, we can clearly observe its impact and easily analyze the system's sensitivity to that specific parameter. The important system parameters tuned in our experimental studies include the Communication-to-Computation Ratio (CCR for short), network heterogeneity, and computing heterogeneity.

Note that the CCR value of a parallel application on a Grid is defined as the ratio between the average communication cost of the |E| messages and the average computation cost of the n parallel tasks of the application on the given Grid with m heterogeneous computing nodes. Formally, the CCR of an application (T, E) is expressed by Eq. (22) below.

$$CCR(T, E) = \frac{\dfrac{1}{|E|} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \dfrac{1}{m(m-1)} \sum_{u=1}^{m} \sum_{v=1}^{m} \dfrac{s_{ij}}{b_{uv}} \right)}{\dfrac{1}{n} \sum_{i=1}^{n} \left( \dfrac{1}{m} \sum_{j=1}^{m} c_i^j \right)} = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{u=1}^{m} \sum_{v=1}^{m} \dfrac{s_{ij}}{b_{uv}}}{|E| \,(m-1) \sum_{i=1}^{n} \sum_{j=1}^{m} c_i^j} \qquad (22)$$
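Under this reading of Eq. (22) (our reconstruction of the garbled original), CCR is the mean per-message communication time over all node pairs divided by the mean per-task execution time; a direct Python sketch (ours, hypothetical inputs):

# CCR of Eq. (22): average communication cost / average computation cost.
def ccr(s, b, c):
    """s: dict edge -> message size; b[u][v]: link rates (u != v);
    c[i][j]: execution time of task i on node j."""
    m = len(b)
    n = len(c)
    # Average communication time of a message over all m(m-1) node pairs.
    avg_comm = sum(size / b[u][v]
                   for size in s.values()
                   for u in range(m) for v in range(m) if u != v)
    avg_comm /= len(s) * m * (m - 1)
    # Average execution time of a task over all m nodes.
    avg_comp = sum(c[i][j] for i in range(n) for j in range(m)) / (n * m)
    return avg_comm / avg_comp

b = [[0.0, 2.0], [2.0, 0.0]]
s = {("t1", "t2"): 10.0, ("t1", "t3"): 6.0}
c = [[6.7, 3.9], [4.0, 5.5], [2.0, 2.4]]
print(round(ccr(s, b, c), 3))  # ~0.98; high values -> communication-intensive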

Generally speaking, communication-intensive applications have high CCRs, whereas CCRs of computation-intensive applications are low.

Table 2 summarizes the configuration parameters of the simulated Grid used in our experiments. In the right-hand column of Table 2, parameters in the first group are fixed, whereas parameters in the second group are varied or randomly generated using uniform distributions. In order to capture the heterogeneity of computing nodes, we test four heterogeneous Grid computing environments, in which the numbers of the four types of computing nodes differ. The last row of Table 2 represents the network heterogeneity, with several different energy consumption rates. Figure 6 shows the energy consumption parameters of the different types of processors used in the computing nodes; all of this data comes from the latest test report of Xbit Labs (http://www.xbitlabs.com). Figure 7 depicts the task graphs of the Gaussian Elimination and FFT parallel applications.

Table 2. Characteristics of experimental system parameters

Parameters | Value (fixed) - (varied)
Trees to be examined | Gaussian Elimination, Fast Fourier Transform
Execution time of Gaussian Elimination | {5, 4, 1, 1, 1, 1, 10, 2, 3, 3, 3, 7, 8, 6, 6, 20, 30, 30} - (random)
Execution time of Fast Fourier Transform | {15, 10, 10, 8, 8, 1, 1, 20, 20, 40, 40, 5, 5, 3, 3} - (random)
Computing node types | AMD Athlon 64 X2 4600+ with 85W TDP (Type 1); AMD Athlon 64 X2 4600+ with 65W TDP (Type 2); AMD Athlon 64 X2 3800+ with 35W TDP (Type 3); Intel Core 2 Duo E6300 (Type 4)
CCR set | Between 0.1 and 10
Computing node heterogeneity | Environment 1: 4/4/4/4 of Types 1/2/3/4; Environment 2: 6/2/2/6; Environment 3: 5/3/3/5; Environment 4: 7/1/1/7
Network energy consumption rate | 20W, 33.6W, 60W

Figure 6. Parameters used in simulation (data from the test report of Xbit Labs)

7.1 CCR Sensitivity

Figure 8 shows the impact that varying the CCR has on the energy dissipation of the Gaussian Elimination and FFT applications, respectively. Four observations are evident from this group of experiments. First, the energy consumption of Gaussian Elimination under all four algorithms is very sensitive to CCR. Second, NDS outperforms TDS with respect to energy conservation when the CCR values are small, whereas TDS is superior to NDS when the CCR becomes large (e.g., when CCR is greater than or equal to 4).

Figure 6. Parameters used in simulation (data from test reports of Xbit Labs)

Table 2. Characteristics of experimental system parameters

Parameter: Value, (fixed) / (varied)

Task graphs examined: Gaussian Elimination, Fast Fourier Transform (fixed)

Execution times of Gaussian Elimination tasks: {5, 4, 1, 1, 1, 1, 10, 2, 3, 3, 3, 7, 8, 6, 6, 20, 30, 30} (fixed) / randomly generated (varied)

Execution times of Fast Fourier Transform tasks: {15, 10, 10, 8, 8, 1, 1, 20, 20, 40, 40, 5, 5, 3, 3} (fixed) / randomly generated (varied)

Computing node types (fixed):
Type 1: AMD Athlon 64 X2 4600+ (85 W TDP);
Type 2: AMD Athlon 64 X2 4600+ (65 W TDP);
Type 3: AMD Athlon 64 X2 3800+ (35 W TDP);
Type 4: Intel Core 2 Duo E6300

CCR set: between 0.1 and 10 (varied)

Computing node heterogeneity (varied):
Environment 1 (E1): 4 of Type 1, 4 of Type 2, 4 of Type 3, 4 of Type 4;
Environment 2 (E2): 6 of Type 1, 2 of Type 2, 2 of Type 3, 6 of Type 4;
Environment 3 (E3): 5 of Type 1, 3 of Type 2, 3 of Type 3, 5 of Type 4;
Environment 4 (E4): 7 of Type 1, 1 of Type 2, 1 of Type 3, 7 of Type 4

Network energy consumption rate: 20 W, 33.6 W, 60 W (varied)
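For readers who want to reproduce a setup along the lines of Table 2, the fragment below encodes the four heterogeneity environments as plain dictionaries; this encoding, and the TDP value assumed for the Type 4 processor (which Table 2 does not list), are ours rather than part of the original simulator.

```python
# Hypothetical encoding of the four Table 2 environments: each entry maps
# a node type to the number of nodes of that type in the simulated Grid.
environments = {
    "E1": {"type1": 4, "type2": 4, "type3": 4, "type4": 4},
    "E2": {"type1": 6, "type2": 2, "type3": 2, "type4": 6},
    "E3": {"type1": 5, "type2": 3, "type3": 3, "type4": 5},
    "E4": {"type1": 7, "type2": 1, "type3": 1, "type4": 7},
}

# Per-type TDPs (watts) from Table 2; the Type 4 (Core 2 Duo E6300) value
# is not given there, so 65 W is our assumption.
tdp_w = {"type1": 85, "type2": 65, "type3": 35, "type4": 65}

# Every environment contains 16 nodes, so results are comparable across E1-E4.
assert all(sum(counts.values()) == 16 for counts in environments.values())

# A crude upper bound on compute power draw per environment (watts).
peak_draw = {env: sum(n * tdp_w[t] for t, n in counts.items())
             for env, counts in environments.items()}
print(peak_draw)
```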


Third, CECS works well for both Gaussian Elimination and FFT, and it delivers overall better performance than the other three algorithms. Last, the energy savings exhibited by CECS become more pronounced as CCR increases, which indicates that CECS is more appropriate for communication-intensive applications than for computation-intensive applications.

7.2 Computing Nodes Heterogeneity

Figure 9 illustrates the impact of varying the heterogeneity of the computing nodes. First, we observe that CECS conserves more energy in E1 and E3 than in E2, from which we conclude that less energy is consumed when more energy-efficient computing nodes are available. Second, the computing node heterogeneity has a significant impact on the energy efficiency of CECS: for example, when CCR equals 0.1 in the three different environments, CECS reduces energy consumption (compared with TDS) by 36.4%, 47.1%, and 45.6%, respectively. These experimental results indicate that CECS can conserve even more energy on heterogeneous platforms that contain energy-hungry computing nodes. Third, Figure 9 shows that CECS can significantly conserve energy because TDS consumes a large amount of energy when CCR is small, while NDS consumes more energy when CCR is large due to the large energy dissipation in the network interconnections.

Figure 7. Parallel task models used in simulation

Figure 8. CCR sensitivity for the Gaussian Elimination and FFT applications

Figure 9. Energy consumption for Gaussian Elimination with different resources


8. CONCLUSION

In this chapter, we proposed a general architecture for building power-efficient computational Grids and discussed the possibility of incorporating energy-efficient techniques into each layer of this architecture. We believe that the discussion at the architecture level is necessary and valuable, because it helps us understand the importance of energy efficiency for high-performance computing platforms, provides a big picture of this research area, and offers meaningful guidance for follow-up researchers. Next, we explained the Grid system architecture and the function of each module in the middleware layer. In addition, we presented the CECS algorithm and illustrated how to improve the energy efficiency of computational Grids via power-aware scheduling. More specifically, we discussed in detail the Grid scheduling framework and the new features of our scheduler. Rather than just presenting the concepts, we addressed the implementation details of the CECS scheduling policy used in our energy-efficient Grid scheduler. The CECS policy aims to allocate and schedule the tasks of parallel applications running on a distributed Grid system so as to conserve energy without adversely affecting performance. CECS is designed and implemented on the basis of group-based and duplication-based algorithms, which minimize the communication overheads of parallel tasks with dependency constraints. It consists of three scheduling phases, which aim to make the best tradeoffs between energy savings and performance. The experimental results show that, compared with three existing algorithms, CECS can significantly reduce the energy dissipation of parallel applications running on heterogeneous Grid systems with only a marginal degradation in performance.

ACKNOWLEDGMENT

This work was made possible in part by NSF awards CCF-0845257 (CAREER), CNS-0757778 (CSR), CCF-0742187 (CPA), CNS-0831502 (CyberTrust), OCI-0753305 (CI-TEAM), DUE-0837341 (CCLI), and DUE-0830831 (SFS), an Intel gift (number 2005-04-070), an Auburn University startup grant, and a South Dakota School of Mines and Technology startup grant.

REFERENCES

Annavaram, M., Grochowski, E., & Shen, J. (2005, June). Mitigating Amdahl’s law through EPI throttling. In Proc. 32nd Annual International Symposium on Computer Architecture (ISCA’05) (pp. 298-309).

Bailey, D. (2006). Power efficiency metrics for the Top500. Top500 BoF, Supercomputing 2006.


Benini, L., & Micheli, G. D. (1998). Dynamic Power Management: Design Techniques and CAD Tools. Norwell, Massachusetts: Kluwer Academic Publishers.

Bianchini, R., & Rajamony, R. (2004). Power and energy management for server systems. Computer, 37, 68–74. doi:10.1109/MC.2004.217

Bowen, N. S., Nikolaou, C. N., & Ghafoor, A. (1992, March). On the Assignment Problem of Arbitrary Process Systems to Heterogeneous Distributed Computer Systems. IEEE Transactions on Computers, 41(3).

Bryce, R. (2000, December). Power struggle. Interactive Week. Retrieved on November 9, 2008 from http://www.zdnet.com.au/news/business/soa/Power-struggle/0,139023166,120107749,00.htm.

Colarelli, D., & Grunwald, D. (2002, November). Massive Arrays of Idle Disks for Storage Archives. In Proc. of the 15th High Performance Networking and Computing Conference.

Cool'n'Quiet Technology Installation Guide for AMD Athlon 64 Processor Based Systems. (n.d.). Retrieved on February 22, 2008 from http://www.amd.com/us-en/assets/content_type/DownloadableAssets/Cool_N_Quiet_Installation_Guide3.pdf

Cuenca, J., Gimenez, D., & Martinez, J.-P. (2006). Heuristics for work distribution of a homogeneous parallel dynamic programming scheme on heterogeneous systems. In IEEE Heterogeneous Computing Workshop.

Dally, W., Carvey, P., & Dennison, L. (1998, August). The Avici Terabit Switch/Router. In Proc. Hot Interconnects (Vol. 6, pp. 41-50).

Douglis, F., Krishnan, P., & Marsh, B. (1994). Thwarting the Power-Hungry Disk. In Proc. Winter USENIX Conf. (pp. 292-306).

Efe, K. (1982, June). Heuristic Models of Task Assignment Scheduling in Distributed Systems. IEEE Transactions on Computers, 50–60.

Elnozahy, E. N. M., Kistler, M., & Rajamony, R. (2002, February). Energy-Efficient Server Clusters. International Workshop Power-Aware Computer Systems.

Environmental Protection Agency. (2006). Power Usage of Data Centers. Last retrieved on November 9, 2008 from http://www.energystar.gov/ia/partners/prod_development/downloads/EPA_Datacenter_Report_Congress_Final1.pdf

Gara, A., Blumrich, M. A., Chen, D., Chiu, G. L.-T., Coteus, P., Giampapa, M. E., et al. (n.d.). Overview of the Blue Gene/L system architecture. IBM Journal of Research and Development. Retrieved on November 9, 2008 from http://www.research.ibm.com/journal/rd49-23.html

Ge, R., Feng, X. Z., & Cameron, K. W. (2005, November). Performance-constrained Distributed DVS Scheduling for Scientific Applications on Power-aware Clusters. In Proceedings of the ACM/IEEE SC 2005 Conference (Supercomputing 05) (p. 34).

Green 500 list. (2009). Retrieved from http://www.green500.org

Gunaratne, C., Christensen, K., & Nordman, B. (2005). Managing energy consumption costs in desktop PCs and LAN switches with proxying, split TCP connections, and scaling of link speed. International Journal of Network Management, 15(5), 297–310. doi:10.1002/nem.565

Gurumurthi, S., Sivasubramaniam, A., Kandemir, M., & Franke, H. (2003, June). DRPM: Dynamic Speed Control for Power Management in Server Class Disks. In Proc. Int'l Symp. on Computer Architecture (pp. 169-179).


Hotta, Y., Sato, M., Kimura, H., Matsuoka, S., Boku, T., & Takahashi, D. (2006). Profile-based optimization of power performance by using dynamic voltage scaling on a PC cluster. In Proc. 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS'06).

Hsu, C.-H., & Feng, W.-C. (2005a). A power-aware run-time system for high-performance computing. Proc. ACM/IEEE Supercomputing’05 (SC’05).

Hsu, C.-H., & Feng, W.-C. (2005b). A feasibility analysis of power awareness in commodity-based high-performance clusters. In Proc. 2005 IEEE International Conference on Cluster Computing.

Intel. (2004). Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor. Intel white paper. Retrieved on February 22, 2008 from ftp://download.intel.com/design/network/papers/30117401.pdf

Kappiah, N., Lowenthal, D. K., & Freeh, V. W. (2005). Just In Time Dynamic Voltage Scaling: Exploiting Inter-Node Slack to Save Energy in MPI Programs. In Proceedings of the 2005 ACM/IEEE SC|05 Conference (SC'05).

Kim, S. J., & Browne, J. C. (1988, August). A General Approach to Mapping of Parallel Computations upon Multiprocessor Architectures. In International Conference on Parallel Processing, University Park, Pennsylvania (Vol. 3, pp. 1-8).

Lim, M. Y., Freeh, V. W., & Lowenthal, D. K. (2006). Adaptive, transparent frequency and voltage scaling of communication phases in MPI programs. In Proc. ACM/IEEE Supercomputing'06 (SC'06).

Lorch, J., & Smith, A. (1998, June). Software Strategies for Portable Computer Energy Management. IEEE Personal Communications, 5, 60-73.

Lorch, J. R., & Smith, A. J. (2001). Improving dynamic voltage scaling algorithms with PACE. SIGMETRICS/Performance, 50-61.

Martin, T. L. (2001). Balancing Batteries, Power, and Performance: System Issues in CPU Speed-Setting for Mobile Computing. PhD thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania.

Mellanox Technologies Inc. (2004). Mellanox Performance, Price, Power, Volume Metric (PPPV). Retrieved on November 9, 2008 from http://www.mellanox.co/products/shared/PPPV.pdf

Miyoshi, A., Lefurgy, C., Hensbergen, E. C., Rajamony, R., & Rajkumar, R. (2002). Critical power slope: understanding the runtime effects of frequency scaling. In Proceedings of the 16th International Conference on Supercomputing (pp. 35-44).

Moore, B. (2002, August). Taking the data centre power and cooling challenge. Energy User News.

NAS Parallel Benchmarks. (2007). Retrieved from http://www.nas.nasa.gov/Resources/Software/npb.html

Page, I., Jacob, T., & Chen, E. (1993, February). Fast Algorithms for Distributed Resource Allocation. IEEE Transactions on Parallel and Distributed Systems, 4(2), 188-197. doi:10.1109/71.207594

Pinheiro, E., & Bianchini, R. (2004, June). Energy Conservation Techniques for Disk Array-Based Servers. In Proc. of the 18th International Conference on Supercomputing.


Qin, X., & Jiang, H. (2005, August). A Dynamic and Reliability-driven Scheduling Algorithm for Parallel Real-time Jobs on Heterogeneous Clusters. Journal of Parallel and Distributed Computing, 65(8), 885-900. doi:10.1016/j.jpdc.2005.02.003

Qin, X., & Jiang, H. (2006, June). A Novel Fault-tolerant Scheduling Algorithm for Precedence Constrained Tasks in Real-Time Heterogeneous Systems. Parallel Computing, 32(5-6), 331–356. doi:10.1016/j.parco.2006.06.006

Rabaey, J., & Pedram, M. (Eds.). (1998). Low Power Design Methodologies. Norwell, Massachusetts: Kluwer Academic Publishers.

Ranaweera, S., & Agrawal, D. P. (2000, May). A Task Duplication Based Scheduling Algorithm for Heterogeneous Systems. Parallel and Distributed Processing Symposium, 445-450.

Shivle, S., Siegel, H. J., Maciejewski, A. A., Banka, T., Chindam, K., Dussinger, S., et al. (2006). Mapping subtasks with multiple versions on an ad-hoc Grid. In IEEE Heterogeneous Computing Workshop.

Sih, G. C., & Lee, E. A. (1993a, June). Declustering: A New Multiprocessor Scheduling Technique. IEEE Transactions on Parallel and Distributed Systems, 4(6), 625-637.

Sih, G. C., & Lee, E. A. (1993b, February). A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures. IEEE Transactions on Parallel and Distributed Systems, 4(2), 175-186.

Son, S. W., & Kandemir, M. (2006, May). Energy-aware data prefetching for multi-speed disks. Proc. ACM International Conference on Computing Frontiers, Ischia, Italy.

Son, S. W., Kandemir, M., & Choudhary, A. (2005, April). Software-directed disk power management for scientific applications. In Proc. Int'l Symp. Parallel and Distributed Processing.

Springer, R., Lowenthal, D. K., Rountree, B., & Freeh, V. W. (2006). Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster. In Proc. 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'06) (pp. 230-238).

Srinivasan, S., & Jha, N. K. (1999, March). Safety and Reliability Driven Task Allocation in Distributed Systems. IEEE Transactions on Parallel and Distributed Systems, 10(3), 238-251.

Standard Performance Evaluation Corporation CFP2000. (2000). Retrieved from http://www.spec.org/cpu/CFP2000/

Top500 Supercomputer List. (2008). Retrieved from http://www.top500.org

Warren, M., Weigle, E., & Feng, W. (2002, November). High-density computing: a 240-node Beowulf in one cubic meter. In Proc. ACM/IEEE Supercomputing'02 (SC'02).

Woodside, C. M., & Monforton, G. G. (1993, February). Fast Allocation of Processes in Distributed and Parallel Systems. IEEE Transactions on Parallel and Distributed Systems, 4(2), 164-174. doi:10.1109/71.207592

Yajnik, S., Srinivasan, S., & Jha, N. K. (1994, October). TBFT: A Task-Based Fault Tolerance Scheme for Distributed Systems. In International Conference on Parallel and Distributed Computer Systems.


Yao, F., Demers, A., & Shenker, S. (1995). A Scheduling Model for Reduced CPU Energy. In IEEE Annual Symposium on Foundations of Computer Science (pp. 374-382).

Zhu, Q., David, F. M., Devaraj, C. F., Li, Z., Zhou, Y., & Cao, P. (2004). Reducing Energy Consumption of Disk Storage Using Power-Aware Cache Management. In Proc. High-Performance Computer Architecture.

Zong, Z.-L., Manzanares, A., Stinar, B., & Qin, X. (2006, September). Energy-Efficient Duplication Strategies for Scheduling Precedence Constrained Parallel Tasks on Clusters. In International Conference on Cluster Computing.

KEY TERMS AND DEFINITIONS

Cluster: Also known as a "computational cluster"; a supercomputing platform with a homogeneous configuration (e.g., identical processors and networks); similar to "homogeneous cluster"; associated in the manuscript with "cluster computing" and "heterogeneous clusters".

Energy-Efficient Computing: Also known as "green computing" or "energy-aware computing"; similar to "power-efficient computing" and "power-aware computing"; associated in the manuscript with "energy-efficiency", "energy consumption", and "power consumption".

Grid: Also known as a "computational Grid"; a supercomputing platform with a heterogeneous configuration (e.g., different types of processors and networks); similar to "heterogeneous clusters"; associated in the manuscript with "grid computing", "computational grid", and "heterogeneous computing nodes".

Interconnection: Also known as the "network"; usually the high-speed network layer of a high-performance computing platform; associated in the manuscript with "Gigabit Ethernet", "Infiniband", "Myrinet", and "QsNetII".

Job: Also known as a "computer job"; the execution of a full program; similar to "task" and "computer task"; associated in the manuscript with "job partition" and "job allocation".

Parallel Computing: Also known as "parallel computation"; parallel program execution or parallel file/data processing; similar to "high-performance computing", "cluster computing", and "grid computing"; associated in the manuscript with "parallel scheduling", "parallel applications", and "parallel execution".

Scheduling: Also known as "task scheduling"; similar to "job scheduling"; associated in the manuscript with "task scheduling" and "job scheduling".

Task: Also known as a "computer task"; a partial execution of a full program (a job may consist of a large number of tasks); similar to "jobs" and "computer jobs"; associated in the manuscript with "task scheduling", "task migration", and "task allocation".