UNIVERSITY OF CALIFORNIA, SAN DIEGO
Scheduling Task Parallel Applications For Rapid Turnaround on Desktop Grids
A dissertation submitted in partial satisfaction of the
requirements for the degree Doctor of Philosophy
in Computer Science and Engineering
by
Derrick Kondo
Committee in charge:
Professor Henri Casanova, Co-Chairman
Professor Andrew A. Chien, Co-Chairman
Professor Phillip Bourne
Professor Larry Carter
Professor Rich Wolski
2005
Copyright
Derrick Kondo, 2005
All rights reserved.
The dissertation of Derrick Kondo is approved, and it is acceptable in quality and form for publication on microfilm:
Co-Chair
Co-Chair
University of California, San Diego
2005
TABLE OF CONTENTS
Signature Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Vita, Publications, and Fields of Study . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
I Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   A. Desktop Grids: Past and Present . . . . . . . . . . . . . . . . . . . . 2
   B. Prospects and Challenges . . . . . . . . . . . . . . . . . . . . . . . . 5
   C. Goal, Motivation, and Approach . . . . . . . . . . . . . . . . . . . . . 8
   D. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
II Desktop Grid System Design and Implementation: State of the Art . . . . . 12
   A. Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
   B. System Anatomy and Physiology . . . . . . . . . . . . . . . . . . . . . 14
      1. Client Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
      2. Application and Resource Management Level . . . . . . . . . . . . . 16
      3. Worker Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
         a. Worker Daemon . . . . . . . . . . . . . . . . . . . . . . . . . . 17
         b. Worker Sandbox . . . . . . . . . . . . . . . . . . . . . . . . . 19
      4. Design Trade-offs of Centralization . . . . . . . . . . . . . . . . 20
         a. Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
         b. Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . 22
III Resource Characterization . . . . . . . . . . . . . . . . . . . . . . . . 25
   A. The Ideal Resource Trace . . . . . . . . . . . . . . . . . . . . . . . 25
   B. Related Work on Resource Measurements and Modelling . . . . . . . . . . 27
      1. Host Availability . . . . . . . . . . . . . . . . . . . . . . . . . 27
      2. Host Load and CPU Utilization . . . . . . . . . . . . . . . . . . . 28
      3. Process Lifetimes . . . . . . . . . . . . . . . . . . . . . . . . . 29
   C. Trace Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
   D. Trace Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
      1. SDSC Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
      2. DEUG and LRI Traces . . . . . . . . . . . . . . . . . . . . . . . . 38
      3. UCB Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
   E. Characterization of Exec Availability . . . . . . . . . . . . . . . . . 40
      1. Number of Hosts Available Over Time . . . . . . . . . . . . . . . . 40
      2. Temporal Structure of Availability . . . . . . . . . . . . . . . . . 44
      3. Temporal Structure of Unavailability . . . . . . . . . . . . . . . . 47
      4. Task Failure Rates . . . . . . . . . . . . . . . . . . . . . . . . . 49
      5. Correlation of Availability Between Hosts . . . . . . . . . . . . . 50
      6. Correlation of Availability with Host Clock Rates . . . . . . . . . 54
   F. Characterization of CPU Availability . . . . . . . . . . . . . . . . . 57
      1. Aggregate CPU Availability . . . . . . . . . . . . . . . . . . . . . 57
      2. Per Host CPU Availability . . . . . . . . . . . . . . . . . . . . . 60
   G. An Example of Applying Characterization Results: Cluster Equivalence . 65
      1. System Performance Model . . . . . . . . . . . . . . . . . . . . . . 65
      2. Cluster Equivalence . . . . . . . . . . . . . . . . . . . . . . . . 67
   H. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
IV Resource Management: Methods, Models, and Metrics . . . . . . . . . . . . 72
   A. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
   B. Models and Instantiations . . . . . . . . . . . . . . . . . . . . . . . 75
      1. Platform model and instantiation . . . . . . . . . . . . . . . . . . 76
      2. Application model and instantiation . . . . . . . . . . . . . . . . 80
   C. Proposed Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 81
   D. Measuring and Analyzing Performance . . . . . . . . . . . . . . . . . . 82
      1. Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . 82
      2. Method of Performance Analysis . . . . . . . . . . . . . . . . . . . 83
   E. Computing the Optimal Makespan . . . . . . . . . . . . . . . . . . . . 87
      1. Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 88
      2. Single Availability Interval On A Single Host . . . . . . . . . . . 89
         a. Scheduling Algorithm . . . . . . . . . . . . . . . . . . . . . . 89
         b. Proof of Optimality . . . . . . . . . . . . . . . . . . . . . . . 90
      3. Multiple Availability Intervals On A Single Host . . . . . . . . . . 94
      4. Multiple Availability Intervals On Multiple Hosts . . . . . . . . . 95
      5. Optimal Makespan with Checkpointing Enabled . . . . . . . . . . . . 97
V Resource Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
   A. Resource Prioritization . . . . . . . . . . . . . . . . . . . . . . . . 99
      1. Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
      2. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . 101
   B. Resource Exclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 106
      1. Excluding Resources By Clock Rate . . . . . . . . . . . . . . . . . 106
      2. Using Makespan Predictions . . . . . . . . . . . . . . . . . . . . . 108
         a. Evaluation on Different Desktop Grids . . . . . . . . . . . . . . 110
   C. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
   D. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
VI Task Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
   A. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
   B. Measuring and Analyzing Performance . . . . . . . . . . . . . . . . . . 122
      1. Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . 122
      2. Method of Performance Analysis . . . . . . . . . . . . . . . . . . . 122
   C. Proactive Replication Heuristics . . . . . . . . . . . . . . . . . . . 122
      1. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . 123
   D. Reactive Replication Heuristics . . . . . . . . . . . . . . . . . . . . 126
      1. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . 127
   E. Hybrid Replication Heuristics . . . . . . . . . . . . . . . . . . . . . 129
      1. Feasibility of Predicting Probability of Task Completion . . . . . . 131
      2. Probabilistic Model of Task Completion . . . . . . . . . . . . . . . 131
      3. REP-PROB Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . 137
      4. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . 138
      5. Evaluating the benefits of REP-PROB . . . . . . . . . . . . . . . . 142
   F. Estimating application performance . . . . . . . . . . . . . . . . . . 145
   G. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
      1. Task replication . . . . . . . . . . . . . . . . . . . . . . . . . . 148
      2. Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
   H. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
VII Scheduler Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
   A. Overview of the XtremWeb Scheduling System . . . . . . . . . . . . . . 154
   B. EXCL-PRED-TO Heuristic Design and Implementation . . . . . . . . . . . 155
      1. Task Priority Queue . . . . . . . . . . . . . . . . . . . . . . . . 155
      2. Makespan Predictor . . . . . . . . . . . . . . . . . . . . . . . . . 156
VIII Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
   A. Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . 157
   B. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
A Defining the IQR Factor . . . . . . . . . . . . . . . . . . . . . . . . . . 161
   A. IQR Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B Additional Resource Selection and Exclusion Results and Discussion . . . . . 164
C Additional Task Replication Results and Discussion . . . . . . . . . . . . 167
   A. Proactive Replication . . . . . . . . . . . . . . . . . . . . . . . . . 167
   B. Reactive Replication . . . . . . . . . . . . . . . . . . . . . . . . . 169
   C. Hybrid Replication . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
VITA
1999    B.S. in Computer Science, Stanford University

2002    M.S. in Computer Science and Engineering, University of California, San Diego
PUBLICATIONS
D. Kondo, A. A. Chien, and H. Casanova. Resource Management for Rapid Application Turnaround on Enterprise Desktop Grids. In Proceedings of the ACM Conference on High Performance Computing and Networking (SC2005), November 2005, Pittsburgh, Pennsylvania.
D. Kondo and H. Casanova. Computing the Optimal Makespan for Jobs with Identical and Independent Tasks Scheduled on Volatile Hosts. Technical Report CS2004-0796, Dept. of Computer Science and Engineering, University of California at San Diego, July 2004.
D. Kondo, M. Taufer, C. Brooks, H. Casanova, and A. A. Chien. Characterizing and Evaluating Desktop Grids: An Empirical Study. In Proceedings of the International Parallel and Distributed Processing Symposium 2004, May 2004.
D. Kondo, H. Casanova, E. Wing, and F. Berman. Models and Scheduling Mechanisms for Global Computing Applications. In Proceedings of the International Parallel and Distributed Processing Symposium 2002, April 2002, Fort Lauderdale, Florida.
S. Joseph, M. Whirl, D. Kondo, H. Noller, and R. Altman. Calculation of the Relative Geometry of tRNAs in the Ribosome from Directed Hydroxyl-Radical Probing Data. RNA 6:220-232, 2000.
FIELDS OF STUDY
Major Field: Computer Science
   Studies in Parallel and Distributed Computing
   Professor Henri Casanova

Major Field: Computer Science
   Studies in Computational Biology
   Professor Russ B. Altman
LIST OF FIGURES
II.1  A Common Anatomy of Desktop Grid Systems . . . . . . . . . . . . . . . 16
II.2  CPU Availability During Task Execution . . . . . . . . . . . . . . . . 19
III.1  Distribution of “small” gaps (<2 min.) . . . . . . . . . . . . . . . . 34
III.2  Host clock rate distribution in each platform . . . . . . . . . . . . 38
III.3  Number of hosts available for a given week for each platform . . . . . 42
III.3  (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
III.4  Cumulative distribution of the length of availability intervals in terms of time for business hours and non-business hours . . . . . . . . . . 44
III.5  Cumulative distribution of the length of availability intervals normalized to total duration of availability in terms of time for business hours and non-business hours for the UCB platform . . . . . . . . . . . . 46
III.6  Cumulative distribution of the length of availability intervals in terms of operations for business hours and non-business hours . . . . . . 47
III.7  Unavailability intervals in terms of hours . . . . . . . . . . . . . . 48
III.8  Task failure rates during business hours . . . . . . . . . . . . . . . 49
III.9  Correlation of availability . . . . . . . . . . . . . . . . . . . . . 53
III.10 Percentage of time when CPU availability is above a given threshold, over all hosts, for business hours and non-business hours . . . . . . . . 58
III.11 CPU availability per host in SDSC platform . . . . . . . . . . . . . . 61
III.12 CPU availability per host in DEUG platform . . . . . . . . . . . . . . 62
III.13 CPU availability per host in LRI platform . . . . . . . . . . . . . . 63
III.14 CPU availability per host in UCB platform . . . . . . . . . . . . . . 64
III.15 Model of application work rate for the entire SDSC desktop grid, in number of operations per second versus task size, in number of minutes of dedicated CPU time on a 1.5GHz host . . . . . . . . . . . . . . . . . 67
III.16 Cluster equivalence of a desktop grid CPU as a function of the application task size. Two lines are shown, one for the resources on weekdays and one for weekends . . . . . . . . . . . . . . . . . . . . . . . . . . 68
III.17 Cumulative percentage of total platform computational power for SDSC hosts sorted by decreasing effectively delivered computational power and for hosts sorted by clock rates . . . . . . . . . . . . . . . . . . . . . 69
IV.1  Cumulative task completion vs. time . . . . . . . . . . . . . . . . . . 75
IV.2  Scheduling Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
IV.3  Cumulative clock rate distributions from real and simulated platforms . 79
IV.4  Laggers for an application with 400 tasks . . . . . . . . . . . . . . . 85
IV.5  INTG: helper function for the scheduling algorithm . . . . . . . . . . 90
IV.6  Scheduling algorithm over a single availability interval . . . . . . . 91
IV.7  An example of task execution for OPTINTV (higher) and OPTDELAY (lower) at the beginning of the job. Both jobs arrive at the same time. In the case of OPTINTV, the first task is scheduled immediately and an overhead of h is incurred. In the case of OPTDELAY, the scheduler waits for a period of w1 before scheduling the task . . . . . . . . . . . . . . 92
IV.8  An example of task execution for OPTINTV (higher) and OPTDELAY (lower) in the middle of the job . . . . . . . . . . . . . . . . . . . . . 93
IV.9  Scheduling algorithm over multiple availability intervals . . . . . . . 95
IV.10 Scheduling algorithm over multiple availability intervals over multiple hosts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
V.1  Subintervals denoted by the double arrows for each availability interval. The length of each subinterval is shown, and the subinterval lengths differ by 10 seconds . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
V.2  Performance of resource prioritization heuristics on the SDSC grid . . . 103
V.3  Complementary CDF of Prediction Error When Using Expected Operations or Time Per Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
V.4  Number of tasks to be scheduled (left y-axis) and hosts available (right y-axis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
V.5  Performance of heuristics using thresholds on SDSC grid . . . . . . . . 107
V.6  Heuristic performance on the SDSC grid . . . . . . . . . . . . . . . . . 110
V.7  Cause of Laggers (IQR factor of 1) on SDSC Grid. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED . . . . . . . . . . . . . . . . . . . . . . 112
V.8  Length of task completion quartiles on SDSC Grid. 0 → OPTIMAL. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED . . . . . . . . . . . . . 113
V.9  Heuristic performance on the GIMPS grid . . . . . . . . . . . . . . . . 114
V.10 Heuristic performance on the LRI-WISC grid . . . . . . . . . . . . . . . 115
VI.1  Performance Of Heuristics Combined With Replication On SDSC Grid . . . 124
VI.2  Waste Of Heuristics Using Proactive Replication On SDSC Grid . . . . . 125
VI.3  Performance of reactive replication heuristics on SDSC grid . . . . . . 127
VI.4  Waste of reactive replication heuristics on SDSC grid . . . . . . . . . 128
VI.5  Probability of task completion per day for several task lengths . . . . 132
VI.6  CDF of prediction errors of the probability of task completion from one day to the next for 5, 15, 35 minute tasks on a dedicated 1.5GHz host . . 133
VI.7  Finite automata for task execution . . . . . . . . . . . . . . . . . . 134
VI.8  Timeline of task completion . . . . . . . . . . . . . . . . . . . . . . 134
VI.9  Performance of REP-PROB on SDSC grid . . . . . . . . . . . . . . . . . 139
VI.10 Waste of REP-PROB on SDSC grid . . . . . . . . . . . . . . . . . . . . 139
VI.11 Cause of Laggers (IQR factor of 1) on SDSC Grid. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. 5 → EXCL-PRED-TO. 6 → REP-PROB . . . . . . 143
VI.12 CDF of task failure rates per host . . . . . . . . . . . . . . . . . . 144
VI.13 Performance difference between EXCL-PRED-TO and transformed UCB-LRI platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
VI.14 Performance of checkpointing heuristics on SDSC grid . . . . . . . . . 150
VI.15 Length of task completion quartiles on SDSC Grid. 0 → OPTIMAL. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. 5 → EXCL-PRED-TO. 6 → REP-PROB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.1  Cause of Laggers (IQR factor of .5) on SDSC Grid. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. 5 → EXCL-PRED-TO. 6 → REP-PROB . . . . . . 162
A.2  Cause of Laggers (IQR factor of 1.5) on SDSC Grid. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. 5 → EXCL-PRED-TO. 6 → REP-PROB . . . . . . 163
B.1  Performance of resource selection heuristics on the DEUG grid . . . . . 164
B.2  Performance of resource selection heuristics on the LRI grid . . . . . . 165
B.3  Performance of resource selection heuristics on the UCB grid . . . . . . 165
C.1  Performance of proactive replication heuristics on DEUG grid . . . . . . 167
C.2  Performance of proactive replication heuristics on LRI grid . . . . . . 168
C.3  Performance of proactive replication heuristics on UCB grid . . . . . . 168
C.4  Waste of proactive replication heuristics with EXCL-PRED-DUP-TIME and EXCL-DUP-TIME-SPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
C.5  Waste of proactive replication heuristics on DEUG grid . . . . . . . . . 170
C.6  Waste of proactive replication heuristics on LRI grid . . . . . . . . . 170
C.7  Waste of proactive replication heuristics on UCB grid . . . . . . . . . 171
C.8  Performance of proactive replication heuristics when varying replication level on SDSC grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
C.9  Performance of reactive replication heuristics on DEUG grid . . . . . . 172
C.10 Performance of reactive replication heuristics on LRI grid . . . . . . . 172
C.11 Performance of reactive replication heuristics on UCB grid . . . . . . . 173
C.12 Waste of reactive replication heuristics on DEUG grid . . . . . . . . . 173
C.13 Waste of reactive replication heuristics on LRI grid . . . . . . . . . . 174
C.14 Waste of reactive replication heuristics on UCB grid . . . . . . . . . . 174
C.15 Performance of hybrid replication heuristic on DEUG grid . . . . . . . . 175
C.16 Performance of hybrid replication heuristic on LRI grid . . . . . . . . 175
C.17 Performance of hybrid replication heuristic on UCB grid . . . . . . . . 176
C.18 Waste of hybrid replication heuristic on DEUG grid . . . . . . . . . . . 176
C.19 Waste of hybrid replication heuristic on LRI grid . . . . . . . . . . . 177
C.20 Waste of hybrid replication heuristic on UCB grid . . . . . . . . . . . 177
LIST OF TABLES
I.1 Characteristics of desktop grid applications [63] . . . . . . . . . . . . . . . 7
III.1  Characteristics of desktop grid applications. (Deriv. denotes “derivable”) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
III.2  Correlation of host clock rate and other machine characteristics during business hours for the SDSC trace . . . . . . . . . . . . . . . . . . 55
III.3  Correlation of host clock rate and failure rate during business hours. Task size is in terms of minutes on a dedicated 1.5GHz host . . . . . . . 56
IV.1 Qualitative platform descriptions. . . . . . . . . . . . . . . . . . . . . . . . 79
VI.1  Mean performance difference relative to EXCL-PRED-DUP when increasing the number of replicas per task . . . . . . . . . . . . . . . . . . . . . 124
VI.2  Mean performance difference and waste difference between EXCL-PRED-DUP and EXCL-PRED-TO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
VI.3  Mean performance and waste difference between EXCL-PRED-TO and REP-PROB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
VI.4  Makespan statistics of EXCL-PRED-TO for the SDSC platform. Lower confidence intervals are w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds . . . . . . . . . . . . . . . . . . 147
VI.5 Summary of replication heuristics. . . . . . . . . . . . . . . . . . . . . . . 152
C.1  Makespan statistics for the DEUG platform. Lower confidence intervals are w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
C.2  Makespan statistics for the LRI platform. Lower confidence intervals are w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
C.3  Makespan statistics for the UCB platform. Lower confidence intervals are w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
ABSTRACT OF THE DISSERTATION
Scheduling Task Parallel Applications For Rapid Turnaround on Desktop Grids
by
Derrick Kondo
Doctor of Philosophy in Computer Science and Engineering
University of California, San Diego, 2005
Professors Henri Casanova and Andrew Chien, Co-Chairs
Since the early 1990’s, the largest distributed computing systems in the world
have been desktop grids, which use the idle cycles of mainly desktop PC’s to support large
scale computation. Despite the enormous computing power offered by such systems, the
range of supportable applications is largely limited to task parallel, compute-bound, and
high-throughput applications. This limitation is mainly because of the heterogeneity and
volatility of the underlying resources, which are shared with the desktop users. Our work
focuses on broadening the applications supportable by desktop grids, and in particular,
we focus on the development of scheduling heuristics to enable rapid turnaround for
short-lived applications.
To that end, the contributions of this dissertation are as follows. First, we mea-
sure and characterize four real enterprise desktop grid systems; such characterization is
essential for accurate modelling and simulation. Second, using the characterization, we
design scheduling heuristics that enable rapid application turnaround. These heuristics
are based on three scheduling techniques, namely resource prioritization, resource exclu-
sion, and task replication. We find that our best heuristic uses relatively static resource
information for prioritization and exclusion, and reactive task replication to achieve per-
formance within a factor of 1.7 of optimal. Third, we implement our best heuristic in a
real desktop grid system to demonstrate its feasibility.
Chapter I
Introduction
Since the late 1990’s, the largest distributed computing systems in the world
have been desktop grids, which aggregate the idle CPU cycles of mostly desktop PC’s to
support large scale computations. The main motivation for using desktop grids is that
these platforms offer high computational power at low cost. That is, one can reuse an
existing infrastructure of resources (e.g., systems staff, machine hardware) to support
large computational demands. Numerous studies have shown that desktops often have
CPU availability of 80% or more [62, 8], and as desktop PC’s are getting less expensive
and more prevalent [11, 6, 71], the savings in infrastructure costs when using the idle
cycles of desktop PC’s can be as high as a factor of five or ten [22].
Virtually all desktop grid applications that run in wide-area environments are
task parallel, compute-bound, and high-throughput. An application that is task parallel
consists of tasks that are independent of one another. A compute-bound application has
a high computation to communication ratio. A high-throughput application has many
more tasks than the number of hosts.
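The three properties above can be made concrete with a small numeric sketch. All of the figures below (operation counts, bandwidth, task and host counts) are invented for illustration; only the 1.5GHz reference host comes from the text.

```python
# Hypothetical illustration of "compute-bound" and "high-throughput";
# all numbers are invented for this example.

CLOCK_OPS_PER_SEC = 1.5e9    # a dedicated 1.5GHz host, as used in the text
LINK_BYTES_PER_SEC = 1e6     # assumed ~8 Mbps effective bandwidth

def compute_comm_ratio(ops_per_task, bytes_per_task):
    # Seconds of computation per second of communication for one task.
    compute_sec = ops_per_task / CLOCK_OPS_PER_SEC
    comm_sec = bytes_per_task / LINK_BYTES_PER_SEC
    return compute_sec / comm_sec

# A task needing 15 minutes of dedicated CPU time but only 1 MB of I/O:
ratio = compute_comm_ratio(ops_per_task=15 * 60 * CLOCK_OPS_PER_SEC,
                           bytes_per_task=1e6)
# 900 s of compute vs. 1 s of transfer -> ratio of 900: compute-bound.

n_tasks, n_hosts = 10_000, 200
high_throughput = n_tasks > n_hosts   # many more tasks than hosts
```

A ratio this lopsided is why such applications tolerate slow, wide-area links: communication cost is negligible next to computation.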
These desktop grids applications span a wide range of scientific domains in-
cluding computational biology [49, 87, 1, 38], climate modelling [2], physics [4, 5], as-
tronomy [86], and cryptography [3]. Desktop grids enable these applications to utilize
TeraFlops of computing power provided by hundreds of thousands of hosts at relatively
little cost, which has allowed these applications to explore enormous parameter spaces
and/or run simulations at high levels of detail that would otherwise be impossible. For
instance, the Folding@home [49] and Prediction@home [87] projects have resulted in
numerous published discoveries [95, 83, 82, 87] that have furthered our understanding
of protein folding and structure prediction.
The set of hosts scattered over the Internet that participate in these desktop
grid projects is incredibly diverse in terms of usage patterns, hardware configurations,
and network connectivity. For example, many home machines are used as little as 23
hours per month (which averages to 47 minutes per day) [60], while machines in
enterprise environments are often powered on for the entire day. The configurations of
hosts participating in the SETI@home desktop grid project span over 170 operating systems
(including version variants) and 160 CPU types (including family variants) [78]. The
network connectivity of hosts ranges from dial-up and cable/DSL to 10/100/1000 Mbps
Ethernet. Given this diversity of resources, developing a software infrastructure for
harnessing idle cycles has been a challenging endeavor.
I.A Desktop Grids: Past and Present
Soon after computers were first networked together, the notion of using the idle
cycles of desktop PC’s arose. In this section, we describe how desktop grid systems have
evolved since the early 1980’s when the first desktop grid systems were implemented and
deployed. We outline the design of each system and discuss its strengths and weaknesses.
The Xerox Worm [47] was one of the earliest desktop grid systems; it had basic
resource management and security schemes, as well as mechanisms that limited its
resource obtrusiveness. The worm spread itself among the ∼100 machines at Xerox
PARC by sequentially scanning through a list of resource addresses. For each address,
the worm segment would send a probe to a corresponding host, whose response would
indicate its availability. If the host was idle, the worm segment would replicate itself
and begin execution on the new host. During execution, the worm segment avoided disk
accesses entirely to limit its obtrusiveness. The applications that were run included a
telephone alarm clock, image display, and Ethernet network testing. The worm could
grow uncontrollably, and the worm’s only control mechanism was a special kill packet
that could be broadcasted to kill the worm entirely. In principle, the Xerox Worm with
little modification could work in the modern-day Internet, and a number of malicious
worms such as the Blaster worm have used similar spreading mechanisms. The main
challenge would be the controllability and manageability of such a worm, and decentralized
approaches to desktop grid computing are the focus of ongoing research [27, 64, 53].
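The Worm's scan-probe-replicate cycle can be sketched as a toy simulation. All names here (the host table, `probe`, `spread`, `kill_packet`) are hypothetical; the real worm probed machines over the PARC Ethernet rather than consulting an in-memory table.

```python
# Toy simulation of the Xerox Worm's scan-probe-replicate cycle.
# Everything here is a hypothetical stand-in for the real mechanism.

# Simulated address list: host address -> whether the host is idle.
hosts = {"pup-01": True, "pup-02": False, "pup-03": True}

def probe(address):
    # The real worm sent a probe packet and inspected the response;
    # here we simply consult the simulated table.
    return hosts[address]

def spread(addresses):
    # Sequentially scan the address list; a worm segment replicates
    # itself onto every host that reports itself idle.
    segments = []
    for addr in addresses:
        if probe(addr):
            segments.append(addr)  # segment begins execution here
    return segments

def kill_packet(segments):
    # The worm's only control mechanism: a broadcast kill packet
    # terminates every segment at once.
    segments.clear()

colony = spread(list(hosts))   # segments "running" on the idle hosts
kill_packet(colony)            # the broadcast kill empties the colony
```

The sketch also makes the control problem visible: short of the all-or-nothing kill packet, nothing bounds the colony's growth.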
In the mid 1990’s, Java applet-based systems such as Javelin [24] and Bayanihan [75]
allowed Java applications running in a secure sandboxed environment to harvest
the idle cycles of computers distributed across the Internet. Users would use their browsers
to download and execute tasks in the form of a portable Java applet. In addition to
portability, Java applets were executed in a sandbox by most browsers, which reduced the
risk of harmful code being downloaded and executed on the host machine. Despite the
many security mechanisms in place, there were no mechanisms that limited the obtru-
siveness of the running applet. The applet would be able to consume CPU and memory
resources without restriction, and limiting obtrusiveness may have been impossible, as it
often requires inspection of system performance counters, which applets are not allowed
to do. Moreover, the security mechanisms for Java applets could be too restrictive
for desktop grid applications; for example, applets could not read or write to the local file
system. These academic systems were tested only with relatively few hosts (about 30)
and so the robustness and scalability of such systems were never proven. Furthermore,
these systems lacked tools for manageability, which are essential for large systems; for
example, there was no way to ensure that the Java Runtime Environment supported by
the browser was up to date.
At the same time, in the mid 1990’s, the authors in [11] argued that networks
of shared workstations built from commodity components could have similar or even
better performance than Massively Parallel Processor machines (MPP’s) at a fraction
of the cost. The enormous volume at which commodity components were produced re-
duced the production costs dramatically, as the Gordon Bell rule stated that “doubling
volume reduces unit cost to 90%” [11]. Moreover, the engineering lag time of developing
specialized applications, operating systems and hardware for MPP’s made a network of
workstations (NOW’s) an attractive alternative. Indeed, as of November, 2004, commod-
ity clusters make up almost 60% of the list of top 500 supercomputers [89]. A plethora
of research has gone into supporting high performance computing on NOW’s, and to
this end, research in this area has ranged from cooperative file caching to implementing
RAID over a set of workstations [11]. In terms of harnessing the idle computing power of
desktops for compute-intensive tasks, the Condor distributed batch system is one of the
most relevant.
Since its inception in 1991, Condor has been one of the most extensively used
desktop grid systems in enterprise settings. Within the U.S., over 600 Condor pools
exist containing a total of 38,000 hosts [31], with each pool often containing hundreds
of hosts. (The Condor pool in the computer science department at the University of
Wisconsin has over 1000 machines.) The system supports remote checkpointing, process
migration, network data encryption, and recovery from faults of any component of the
system. Numerous operating systems are supported, including Windows (with limited
functionality) and UNIX variants, and installation does not require superuser privileges,
although not all features are supported in this case.
In Condor, a user submits his/her application through a submission daemon
(schedd). An execution daemon startd runs on each resource and is responsible for
managing the task execution, such as checkpointing or terminating the task if there is
other user activity. The Condor scheduler (matchmaker) determines which resources are
suitable for the application and vice versa, using the requirements of the application
(e.g., only machines with clock rates > 1 GHz) specified through the schedd and using
the requirements of the resources (e.g., the task can only run at night) specified through
the startd. Then, the schedd and startd contact each other to bind a task to a specific
resource.
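The two-way matching described above can be sketched as follows. This is a minimal illustration of the idea, not Condor's actual ClassAd syntax; the attribute names and the nighttime check are hypothetical.

```python
# Sketch of two-way matchmaking in the spirit of Condor's matchmaker.
# Attribute names and predicates are illustrative, not real ClassAds.

def matches(task_req, task_attrs, host_req, host_attrs):
    """A task/host pair matches only when each side's requirement
    predicate is satisfied by the other side's attributes."""
    return task_req(host_attrs) and host_req(task_attrs)

# Application requirement: only machines with clock rates > 1 GHz.
task_req = lambda host: host["clock_ghz"] > 1.0
task_attrs = {"owner": "alice", "hour_submitted": 22}

# Resource requirement: the task can only run at night (hypothetical check).
host_req = lambda task: task["hour_submitted"] >= 20
host_attrs = {"clock_ghz": 2.8}

result = matches(task_req, task_attrs, host_req, host_attrs)
print(result)  # True: both requirement predicates are satisfied
```

Once the matchmaker finds such a pair, the schedd and startd contact each other directly to bind the task, as described above.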
Because Condor was designed primarily for local area environments, running
in wide-area environments where hosts can be in private networks behind firewalls is
problematic. In particular, Condor often uses UDP instead of TCP for communication
which makes deployment across firewalls or through congested networks difficult. Al-
though there have been nascent efforts to address this problem [81], these new methods
have not yet been added to the stable release (as of September, 2004).
In the late 1990’s, the growth of the Internet exploded and many distributed
computing projects sought to exploit the TeraFlops of potential computing power offered
by tens of millions of hosts [38, 23, 49, 44, 86]. The largest and most well-known project
is SETI@home [86], which runs an embarrassingly parallel astronomy application that
currently utilizes about 20 TeraFlops from about 500,000 active desktops. In SETI@home,
a worker daemon runs on each participating host, which requests tasks from a central-
ized server. The worker daemon ensures that the task runs only when the host is idle.
Software on a centralized server consists of an HTTP server so that hosts behind fire-
walls can download/update tasks, and a database server, which stores the locations of
inputs, outputs, and various statistics about participants. The first implementation of
SETI@home was application-specific and had few tools for managing the system.
One impact of the SETI@home project was proving that people in significant
numbers were willing to donate the idle cycles of their desktops for large-scale computing
projects. A remarkable social phenomenon was that when SETI@home listed on the web
the top contributors to the project in terms of completed tasks, teams of enthusiastic
desktop users soon formed and sought to gain higher status in the rankings.
The success of SETI@home spurred numerous other projects, and also academic
and industrial endeavors for developing multi-application desktop grid software. In early
2000, academic desktop grid infrastructures such as XtremWeb [37] and BOINC [39] were
implemented. Also, several commercial companies, such as Entropia [28] and United De-
vices [90], were founded, and these companies developed industrial-grade desktop grid
software that was professionally tested and supported for the purpose of deploying task
parallel applications. Many of these systems have tools for large-scale system manage-
ment, and support user authentication and data encryption.
I.B Prospects and Challenges
As commodity compute, storage, and network technologies improve,
become less costly and more pervasive, desktop grids are increasingly attractive for
running large-scale applications. As of April, 2005, for about $400, one can purchase a
Dell Dimension 3000, which has a 2.8GHz Pentium Processor with a 533 MHz front side
bus, 512MB SDRAM at 400MHz, an 80GB Ultra ATA/100 Hard Drive (7200 RPM), and a
100Mbps network interface. For $215, one can buy a Dell PowerConnect 2716 16-Port
Gigabit Ethernet Switch. With $150,000, a company or university could purchase about
367 of these desktops and about 12 switches, excluding costs of installation, maintenance,
space, and power. Given that CPU availability in shared desktop environments is often
80% or more [62] and average free disk space [18] is about 50%, the resulting platform
would have an aggregate computing power close to 1 TeraFlop and about 15 TeraBytes
of disk space. Thus, desktop grids can have a high return on investment.
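The arithmetic behind these figures can be checked with a short back-of-the-envelope calculation. The prices, counts, and availability fractions come from the text; the per-host sustained compute rate in GFlops is an assumption chosen only to illustrate how the "close to 1 TeraFlop" aggregate arises.

```python
# Back-of-the-envelope check of the platform described above.
desktop_cost, switch_cost = 400, 215
n_desktops, n_switches = 367, 12
total_cost = n_desktops * desktop_cost + n_switches * switch_cost

cpu_availability = 0.80        # shared-desktop CPU availability [62]
free_disk_fraction = 0.50      # average free disk space [18]
disk_per_host_gb = 80
sustained_gflops = 3.5         # assumed per-host sustained rate (not from the text)

aggregate_tflops = n_desktops * cpu_availability * sustained_gflops / 1000
aggregate_disk_tb = n_desktops * disk_per_host_gb * free_disk_fraction / 1000

print(f"cost ${total_cost}, ~{aggregate_tflops:.1f} TFlop, ~{aggregate_disk_tb:.0f} TB")
```

The total comes to $149,380, within the $150,000 budget, with roughly 1 TeraFlop of compute and about 15 TeraBytes of disk, matching the figures above.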
We believe that desktops in enterprise settings are especially useful as they
contribute a significant fraction of the cumulative computing power of Internet desktop
grids. (Many desktop grid researchers [29, 69] have reported that the ratio of useful hosts
to non-useful hosts participating in desktop grid projects can be as low as 1 to 10.) This
is because the enterprise desktops have relatively high availability and usually constant
network connectivity. As such, many desktop grid companies such as Entropia [28],
and United Devices [90] have separate products that target enterprise environments
exclusively.
Desktop grids restricted to enterprise environments are attractive for several
other reasons. First, the hosts are often under the same administrative domain, or under
a limited number of domains, so the software configurations of hosts (e.g., operating
system and version, and software libraries) are similar. This can simplify deployment of
the software infrastructure and applications; developers do not need to make the
software or application portable for every combination of operating system, operating
system version, and programming library. Second, security in terms of ensuring that
the application executable and data have not been tampered with is less of an issue.
Presumably, the desktop users within the company or university are not malicious and
are not attempting to thwart the computation. This certainly does not preclude accidental harm
to the application, but it does reduce the risk of such an occurrence.
Although enterprise desktop grids are attractive, there exists a wide dispar-
ity between the structural complexity of applications runnable on MPP’s and current
desktop grid applications. Most Internet desktop grid applications are task parallel and
compute bound. Table I.1, obtained from [63], shows a list of typical applications run on
enterprise desktop grids. The third column in the table is the server bandwidth required
to support 1000 workers, and the fourth column is the maximum number of workers
assuming they can use only 20 Mbits/sec. Virtually all desktop grid applications deployed
over the Internet resemble the docking application shown in the table. That is, the
applications are compute-bound and task parallel, with task sizes on the order of kilobytes
or megabytes and run times on the order of minutes or hours. The higher capacity of
networks in enterprise environments allows applications with higher communication-to-
computation ratios to run on desktop grids, as shown in the lower rows of the table.
Moreover, the majority of applications run on desktop grids are high-throughput, i.e.,
these applications have many more tasks than hosts available.

  Application              Task run time   Task data size   Server bandwidth   Max. workers using
                                                            (1000 workers)     20 Mbits/sec
  Docking                  20 min.         1 MByte          6.67 Mbits/sec     2,998
  small data, med run      10 min.         1 MByte          13.3 Mbits/sec     1,503
  BLAST                    5 min.          10 MByte         264 Mbits/sec      75
  large data, large run    20 min.         20 MByte         132 Mbits/sec      150

Table I.1: Characteristics of desktop grid applications [63]
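The bandwidth figures in Table I.1 can be recomputed directly from the task run times and data sizes. The sketch below assumes decimal megabytes; its results agree with the table's figures to within about 1% rounding.

```python
# Recomputing Table I.1's server bandwidth and maximum worker counts
# from task run time and data size (decimal MBytes assumed).
apps = {  # name: (run time in minutes, task data size in MBytes)
    "Docking": (20, 1),
    "small data, med run": (10, 1),
    "BLAST": (5, 10),
    "large data, large run": (20, 20),
}
results = {}
for name, (run_min, size_mbyte) in apps.items():
    per_worker = size_mbyte * 8 / (run_min * 60)   # Mbits/sec per worker
    server_bw = per_worker * 1000                   # bandwidth for 1000 workers
    max_workers = 20 / per_worker                   # workers a 20 Mbits/sec server supports
    results[name] = (server_bw, max_workers)
    print(f"{name}: {server_bw:.2f} Mbits/sec, {max_workers:.0f} workers")
```

Note how the ratio of data size to run time, not either quantity alone, determines both the server bandwidth and the maximum worker count.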
The reason most applications deployed on desktop grids are task parallel, com-
pute bound and high-throughput is that the hosts are volatile and heterogeneous. The
hosts are volatile in the sense that CPU availability for the desktop grid application may
fluctuate dramatically over time because the host is shared with the user/owner of the
machine. The host is shared in such a way that the user/owner's activities (such as key-
board/mouse activity and other processes) get higher priority than the desktop grid task,
so the host cannot be reserved for any block of time. Moreover, hosts often have a wide range
of clock rates, which makes application deployment even more complicated.
I.C Goal, Motivation, and Approach
The goal of this thesis is to broaden the range of applications that can utilize
desktop grids. In particular, we focus on designing scheduling heuristics to enable rapid
application turnaround on enterprise desktop grids. By rapid application turnaround,
we mean turnaround on the order of minutes or hours (versus days or months, which is
typical of high-throughput applications run on desktop grids).
Our own experience in discrete-event simulation suggests that users often desire
turnaround within time windows that are hours or minutes in length, for example, having
results before the lunch hour, or by the next morning. This is true especially in industry
where results are required by short-term deadlines. Others have also indicated the need
for fast turnaround with respect to biological docking simulations [66]. Often these
simulations (especially simulations that explore a range of parameters) can be organized
into hundreds or thousands of independent tasks where each task consists of data input
sizes on the order of kilobytes, and each task takes on the order of minutes or hours to
run. Also, most applications from MPP workloads are less than a day in length, indicating
that short jobs are not uncommon [58].
Applications consisting of independent tasks with soft real-time requirements
are also commonly found in the area of interactive scientific visualization [57, 80, 40].
An example of such an application that requires rapid turnaround is on-line parallel to-
mography [80]. Tomography is the construction of 3-D models from 2-D projections, and
it is common in electron microscopy to use tomography to create 3-D images of biological
specimens. When an electron microscopist takes 2-D images of a specimen, the 3-D model
would ideally be refreshed after a series of projections, incorporating the additional in-
formation obtained from the new projections. After a refresh, the microscopist could
view the new 3-D model and then redirect his/her attention to a different area in the
specimen or correct the configuration of the microscope. Interactively viewing the model
after a set of projections allows the microscopist to converge on a correct model quickly,
and this, in turn, reduces the chance of damage to the sample from excessive exposure
to the electron beam [77].
In [80], the authors determine that on-line tomography is amenable to grid
computing environments (which include networks of workstations), and they develop
scheduling heuristics for supporting the soft deadline of the application. In particular,
the tomography application is embarrassingly parallel, as each 2-D projection can be de-
composed into independent slices that must be distributed to a set of resources for processing.
Each slice is on the order of kilobytes or megabytes in size, and there are typically hun-
dreds or thousands of slices per projection, depending on the size of each projection.
Ideally, the processing of a single projection can be completed while the user is acquiring
the next image from the microscope, which typically takes several minutes [46]. As such,
on-line parallel tomography could potentially be executed on desktop grids if there were
effective heuristics for meeting the application’s relatively stringent time demands.
In this thesis, we develop heuristics to allow applications requiring rapid turnaround
to utilize desktop grids effectively, focusing particularly on enterprise environments. Our
approach is to first develop a characterization of the volatility and heterogeneity in real
enterprise desktop grids. We then use this characterization to influence the design of
scheduling heuristics. Our heuristics are based on three scheduling techniques, namely
resource prioritization, resource exclusion, and task replication. Often, there is a large
difference in the effective compute rates of hosts in a desktop grid, so resource
prioritization causes tasks to be assigned to the best hosts first. Moreover, the
worst hosts can significantly impede application execution, and excluding such hosts
may remove the bottleneck. We examine various criteria by which to exclude some hosts
and never use them to run application tasks. Finally, replicating a task on multiple
hosts can be used to reduce the chance that a task fails and slows application execution.
This method has the drawback of wasting CPU cycles, which could be a problem if the
desktop grid is to be used by more than one application. We investigate several issues
pertaining to replication, including which task to replicate and which host to replicate
to.
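The interplay of the three techniques can be sketched as a simple scheduler. This is an illustrative toy, not the heuristics evaluated later in the thesis: the host attributes, the exclusion threshold, and the replication factor are all hypothetical.

```python
# Illustrative sketch of resource exclusion, resource prioritization,
# and task replication applied to a small host pool.

def schedule(tasks, hosts, exclude_below_ghz=1.0, replicas=2):
    # Resource exclusion: never assign tasks to the worst hosts.
    usable = [h for h in hosts if h["clock_ghz"] >= exclude_below_ghz]
    # Resource prioritization: assign tasks to the fastest hosts first.
    usable.sort(key=lambda h: h["clock_ghz"], reverse=True)
    assignments = []
    for task in tasks:
        # Task replication: run each task on several hosts so that one
        # failure does not slow the application, at the cost of wasted cycles.
        chosen, usable = usable[:replicas], usable[replicas:]
        assignments.extend((task, h["name"]) for h in chosen)
    return assignments

hosts = [{"name": "slow", "clock_ghz": 0.5},
         {"name": "mid", "clock_ghz": 1.5},
         {"name": "fast", "clock_ghz": 2.8}]
print(schedule(["t1"], hosts))  # [('t1', 'fast'), ('t1', 'mid')]
```

The "slow" host never receives work (exclusion), the fastest host is used first (prioritization), and each task lands on two hosts (replication).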
I.D Contributions
The crux of this dissertation can be summarized in the following thesis state-
ment:
"Scheduling heuristics based on resource prioritization, exclusion,
and reactive task replication techniques that use relatively static information
about resources can result in tremendous performance gains for task parallel,
compute-bound applications needing rapid turnaround."
To that end, the contributions of the thesis are as follows:
1. An accurate measurement and characterization of desktop grids.
We use a simple but novel method for measuring availability of resources in four
desktop grid platforms. This method records the availability that would be ex-
perienced by a real application. We then characterize the temporal structure of
CPU availability for each platform and individual resources, identifying important
similarities and differences. Our measurement and characterization can be useful
for creating generative, predictive, and explanatory models, driving desktop grid
simulations, and shaping the design of scheduling heuristics.
2. Effective resource management heuristics for rapid application turnaround.
Using the desktop grid characterization, we design heuristics based on three schedul-
ing techniques, namely resource prioritization, resource exclusion, and task repli-
cation.
We evaluate these heuristics through trace-driven simulations of four representa-
tive desktop grid configurations. We find that ranking desktop resources according
to their clock rates, without taking into account their availability history, is sur-
prisingly effective in practice. Our main result is that a heuristic that uses the
appropriate combination of resource prioritization, resource exclusion, and task
replication achieves performance often within a factor of 1.7 of optimal.
3. A scheduler prototype for scheduling applications requiring rapid turnaround.
We implement a scheduler prototype for such applications. Our implementation
demonstrates the feasibility of our heuristics in real settings.
The thesis is structured as follows. First, in Chapter II we will describe
the state of the art of desktop grid systems. Then, in Chapter III, we will describe our
measurement and characterization of a desktop grid system. We will detail the method
by which we made measurements and how this method differs from other studies. Then
in Chapter IV we will outline the design and evaluation of our scheduling heuristics for
rapid application turnaround by describing our simulation models, general scheduling
techniques, and performance metrics. In Chapter V we describe scheduling heuristics
that use prioritization and exclusion effectively for resource selection, and quantify their
performance according to the optimal schedule achievable by an omniscient scheduler.
Even with the best resource selection techniques, task failures can continue to impede
application execution and so in Chapter VI, we investigate methods for masking task
failures by means of task replication. We will examine issues such as when to replicate
and which host to replicate on. We then implement our best heuristic to demonstrate
its feasibility and describe the implementation in Chapter VII. Finally, in Chapter VIII,
we will summarize the conclusions and impact of the thesis.
Chapter II
Desktop Grid System Design and
Implementation: State of the Art
II.A Background
A desktop grid system consists of a large set of network-connected computa-
tional and storage resources that are harvested when unused for the purpose of large-scale
computations. The computational resources are usually shared with the users or owners
of the machines, who often demand priority over desktop grid applications. As a result,
the resources are unreserved in that the availability of any set of machines cannot be
guaranteed for any period of time. Moreover, the resources are often volatile due to
user activity, machine hardware failures, and network failures, for example, and these
factors in turn prevent tasks from running to completion. In addition to being volatile,
the resources are usually heterogeneous in terms of clock rate, memory and disk size and
speed, network connectivity, and other characteristics.
Terminology related to the components of desktop grids is defined as follows.
We use the term client to refer to the user that has an application for submission. To
utilize a desktop grid, a client submits an application, which consists of a set of tasks,
to the server. The scheduler on the server then assigns tasks to each available worker,
which is a daemon that manages task execution and runs on each host. We use the terms
host and resource synonymously.
The ideal desktop grid system would have the following characteristics:
1. Scalability: The throughput of the system should increase proportionally with
the number of resources.
2. Fault tolerance: The system must be tolerant of both server failure (for example,
data server crashes) and worker failure (for example, the user shutting off his/her
machine). (Traditionally, the term failure refers to a defect of hardware or software.
We use the term failure broadly to include all causes of task failure, including not
only failure of the host’s hardware or worker software, but also keyboard/mouse
activity that causes the worker to kill a running task.)
3. Security: The machine including its data, hardware, and processes must be pro-
tected from a misbehaving desktop grid application. Conversely, the application’s
executable, input, and output data, which may be proprietary, must be protected
from user inspection and corruption.
4. Manageability: Increasingly, human resources are more costly than computing
resources. Systems should provide tools for installing and updating workers eas-
ily, and also tools for managing applications and resources, and monitoring their
progress.
5. Unobtrusiveness: Since the desktop grid application shares the system with the
user, the user processes must have priority over the client’s. When the worker
detects user activity, the task should be suspended temporarily until the activity
subsides, or the task should be killed and restarted later when the host becomes
available again.
6. Usability: Integration of an application within a desktop grid system should be
as transparent as possible; in many cases, the complexity of the (legacy) program
or the fact that the source code is proprietary and is not available makes it difficult
to modify the code to use a desktop grid system.
II.B System Anatomy and Physiology
Currently, there exist a number of academic and industrial desktop grid systems
that harvest the idle cycles of desktop PC’s in Internet environments and/or enterprise
environments. We describe how these systems achieve (or fail to achieve) the design
goals described in the previous section. These systems share many features of architec-
tural design and organization, and we give an overview of the anatomy and physiology
of current systems, identifying commonalities and important differences at the client,
application and resource management, and worker levels (see Figure II.1, which reflects
logical organization of the various components of a desktop grid system). (Note that the
physical organization may be different than what is shown in Figure II.1. For example,
components of the client level often reside on the same host as the worker.)
At the Client Level, a user submits an application to a desktop grid, using tools
for controlling the application’s execution and monitoring its status. At the Application
and Resource Management Level, the application is then scheduled on workers, and
information about applications and workers is stored. At the Worker Level, the worker
ensures the application’s task executes transparently with respect to other user processes
on the hosts.
As an overview, we first give a procedural outline for the submission and exe-
cution of a desktop grid application, noting in parentheses where each action fits with
respect to Figure II.1:
1. The user that has an application to submit authenticates him/herself to the desktop
grid server. (Client Level & Application and Resource Management Level)
2. As an optional first step, the application input data (e.g., database of protein
sequences) is partitioned into work units, and then organized into batches of tasks.
(Client Level)
3. Task batches generated from either the client manager or the application itself are
sent to the application manager. Once the application is submitted, the client
manager can be used to control and monitor the application. (Client Level &
Application and Resource Management Level)
4. The application manager assigns the application to a scheduler that oversees its
completion. (Application and Resource Management Level)
5. The scheduler assigns work to the workers according to the application/worker
constraints and scheduling heuristic. (Application and Resource Management Level
& Worker Level)
6. When available, the worker computes its task and returns a result to the scheduler,
which relays it to the application manager after the application has been completed.
(Worker Level & Application and Resource Management Level)
7. The application manager tallies the results and returns them to the application or
client manager, which does post-processing as necessary. (Client Level)
We detail next the various components at each level shown in Figure II.1 that
are involved with the above procedure of application submission, management, and exe-
cution. When relevant, we inject in the discussion details about four particular systems
(namely Entropia [28], United Devices [90], XtremWeb [37] and BOINC [39]) used cur-
rently by large projects that incorporate hundreds to thousands of resources. Entropia
and United Devices are commercial companies that offer desktop grid software that is
professionally developed, tested, and supported. Both companies have separate products
tailored for either enterprise or Internet environments. In our discussion below, our ref-
erences to the Entropia or United Devices frameworks refer to the software designed for
Internet environments. XtremWeb and BOINC are open source Internet desktop grid
frameworks. The XtremWeb system is an academic project developed at the University
of Paris-Sud, and has been used on hundreds of machines for over 10 projects. The
BOINC system has been deployed over hundreds of thousands of hosts and is currently
used to support the SETI@home project as well as five other large projects.
II.B.1 Client Level
In order for a user to submit his/her application to the desktop grid system,
the user must register the application binary with the application manager by sending
the executable and specifying the access permissions. Then, the application’s input data
(stored in a database or as flat files, for example) must be partitioned and formatted
into tasks. Several systems such as XtremWeb [37], Entropia [63], United Devices [90],
and Nimrod-G [7] provide tools packaged as part of the client manager for creating tasks
from a range or set of parameters.

[Figure II.1: A Common Anatomy of Desktop Grid Systems. The figure shows the Client
Level (application, client manager), the Application and Resource Management Level
(application manager, scheduler, database), and the Worker Level (worker daemon,
sandbox, worker application).]
The client manager most often provides a command-line interface through which
the user can submit the tasks to the application manager. Another option offered by the
Entropia and United Devices systems is to use the application manager’s API to submit
tasks programmatically. After the application is submitted, the client manager can be
used to monitor the progress of the application and control its execution. Many systems
such as Entropia and XtremWeb provide the functionality of the client manager through
a web browser as well.
II.B.2 Application and Resource Management Level
When an application binary is submitted, the application manager creates a
corresponding entry in the application table of a relational database to record the path
to the corresponding binary, permissions for accessing information about the application,
and any constraints on the resources to which the application’s tasks can be scheduled
(e.g., minimum CPU speed, memory size). When a set of tasks is submitted, the ap-
plication manager creates an entry in the task table of the database to record which
application each task corresponds to and the paths to the corresponding input files on
the server.
Moreover, the application manager is responsible for supplying tasks of the
application to the scheduler, which oversees resource selection and binding. When the
scheduler receives a request from the worker, it makes a scheduling decision based on
information (such as CPU speed, memory size, disk space, network speed) about the
worker stored in the worker table in the database and the resource constraints of the
application. Then the scheduler packages an application binary with data inputs and
sends the inputs back to the worker in response. Most schedulers [39, 37] in current
systems assign tasks to resources in First-Come-First-Served (FCFS) order, and thus are
tailored towards high-throughput jobs. (The Entropia system uses a multi-level priority
queue for task assignment [63].)
Schedulers are passive in the sense that they cannot “push” tasks to workers;
instead the scheduler must wait for a worker to make a connection to the server before
being able to assign a task to it. This is due to the fact that hosts found on the
Internet (including enterprises) are often protected by firewalls that block all incoming
connections but usually allow some kinds of outgoing connections. Thus,
any connection made between the worker and the server must be initiated by the worker.
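The pull-based, FCFS behavior described above can be sketched as follows. The task and worker fields are illustrative; real systems carry richer constraint sets, but the control flow, in which the server only reacts to worker-initiated requests, is the same.

```python
# Sketch of a passive, pull-based FCFS scheduler: the server never pushes
# work; it only responds when a worker initiates a connection.

class FCFSScheduler:
    def __init__(self, tasks):
        self.pending = list(tasks)  # tasks kept in order of submission

    def handle_worker_request(self, worker):
        """Invoked only when a worker connects; returns the earliest-
        submitted task whose constraints the worker satisfies, or None."""
        for i, task in enumerate(self.pending):
            if worker["clock_ghz"] >= task.get("min_clock_ghz", 0):
                return self.pending.pop(i)
        return None  # nothing suitable; the worker retries later

sched = FCFSScheduler([{"id": 1, "min_clock_ghz": 2.0}, {"id": 2}])
print(sched.handle_worker_request({"clock_ghz": 1.0}))  # {'id': 2}
```

A slow worker skips over tasks it cannot satisfy but otherwise receives the oldest pending task, which is what tailors such schedulers to high-throughput workloads.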
After the worker successfully completes the task, it returns the result to the
application manager, which then records the completion in the results database table
(e.g., storing the time of completion and which user completed the task) and credits the
user whose host completed the task.
II.B.3 Worker Level
II.B.3.a Worker Daemon
On each host, a worker daemon runs in the background to control communi-
cation with the server and task execution on the host, while monitoring the machine’s
activity. The worker has a particular recruitment policy used to determine when a task
can execute, and when the task must be suspended or terminated. The recruitment
policy consists of a CPU threshold, a suspension time, and a waiting time. The CPU
threshold is some percentage of total CPU use for determining when a machine is con-
sidered idle. For example, in Condor, a machine is considered idle if the current CPU
use is below the default CPU threshold of 25%. The suspension time refers to the
duration that a task is suspended when the host becomes non-idle. A typical value for
the suspension time is 10 minutes. If a host is still non-idle after the suspension time
expires, the task is terminated. When a busy host becomes available again, the worker
waits for a fixed period of time of quiescence before starting a task; this period of time
is called the waiting time. In Condor, the default waiting time is 15 minutes.
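The recruitment policy can be sketched as a small state machine, using the Condor defaults quoted above. This is a simplified model: the caller is assumed to sample CPU use periodically and to reset `seconds_in_state` on each transition (and, in the waiting state, whenever activity is observed).

```python
# Sketch of a worker recruitment policy with the Condor defaults from the text.
CPU_THRESHOLD = 0.25        # host is considered idle below 25% CPU use
SUSPENSION_TIME = 10 * 60   # seconds suspended before the task is killed
WAITING_TIME = 15 * 60      # seconds of quiescence before (re)starting a task

def next_state(state, cpu_use, seconds_in_state):
    busy = cpu_use >= CPU_THRESHOLD
    if state == "running":
        return "suspended" if busy else "running"
    if state == "suspended":
        if not busy:
            return "running"  # activity subsided before the suspension expired
        return "killed" if seconds_in_state >= SUSPENSION_TIME else "suspended"
    if state == "waiting":
        return "running" if not busy and seconds_in_state >= WAITING_TIME else "waiting"
    return state  # "killed" is terminal until the host is reclaimed

print(next_state("running", 0.60, 0))       # suspended: host became busy
print(next_state("suspended", 0.60, 700))   # killed: suspension time expired
print(next_state("waiting", 0.05, 1000))    # running: waiting time elapsed
```

The three thresholds trade unobtrusiveness against wasted work: a lower CPU threshold or longer waiting time makes the worker less intrusive but leaves more idle cycles unharvested.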
Figure II.2 shows an example of the effect of recruitment policy on CPU avail-
ability. The task initially uses all the CPU for itself. Then, after some user key-
board/mouse activity, the task gets suspended. (The various causes of task termination
to enforce unobtrusiveness include user-level activity such as mouse/keyboard activity,
other CPU processes, and disk accesses, or even machine failures, such as a reboot, shut-
down or crash.) After the activity subsides and the suspension time expires, the task
resumes execution and completes. The worker then uploads the result to the server and
downloads a new task; this time is indicated by the interval labelled “gap”. The task
begins execution then gets suspended and eventually killed, again due to user activity;
usually all of the task’s progress is lost as most systems do not have system-level sup-
port for checkpointing. Later, after the host becomes available for task execution and
the waiting time expires, the task restarts; shortly after it begins execution, the host
is loaded with other processes, but because the CPU utilization is below the threshold,
the task continues executing, receiving only a slice of CPU time.
In addition to controlling the execution of the desktop grid application, the
worker daemons in XtremWeb and Entropia periodically poll the server to indicate the
current state of the worker (for example, running a task, or waiting for the machine to
be idle) and whether the host and worker are up. If a task has been assigned to a worker
and the worker stops sending heartbeats to the server, the worker is assumed to have
failed, and the task is reassigned to another worker.
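This heartbeat-based failure detection can be sketched as follows. The timeout value and data structures are illustrative; the systems above choose their own polling intervals.

```python
# Sketch of heartbeat-based failure detection for reassigning tasks.
HEARTBEAT_TIMEOUT = 120.0   # seconds of silence before a worker is presumed failed

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}    # worker id -> time of last heartbeat
        self.assignment = {}   # worker id -> task currently assigned to it

    def heartbeat(self, worker_id, now):
        self.last_seen[worker_id] = now

    def reap_failed(self, now):
        """Return the tasks of timed-out workers so they can be reassigned."""
        orphaned = []
        for worker_id, task in list(self.assignment.items()):
            if now - self.last_seen.get(worker_id, 0.0) > HEARTBEAT_TIMEOUT:
                orphaned.append(task)
                del self.assignment[worker_id]
        return orphaned

mon = HeartbeatMonitor()
mon.assignment["w1"] = "task-42"
mon.heartbeat("w1", now=0.0)
print(mon.reap_failed(now=60.0))   # []: worker still within the timeout
print(mon.reap_failed(now=300.0))  # ['task-42']: reassign this task
```

Because the worker initiates all connections, this is the server's only way to distinguish a busy worker from a dead one; a too-short timeout causes spurious reassignment, while a too-long one delays recovery.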
Figure II.2: CPU Availability During Task Execution.
II.B.3.b Worker Sandbox
To ensure the protection of the underlying host when a task is executing, several
systems provide some form of a sandboxed environment [19, 21]. In particular, Entropia
provides a virtual machine as a sandbox that guards the machine from errant worker
processes. The virtual machine is a user-level program that simulates the Windows
kernel, and the worker application runs as a thread of this virtual machine. When the
application runs and makes a system call, the virtual machine catches the system call
(presumably using a call analogous to Ptrace in Linux) and executes it in the simulated
environment. The virtual machine is configured to map application virtual file accesses
to file accesses on the actual machine. For example, many applications make changes to
the Windows registry, which could be potentially obtrusive to the host. The Entropia
virtual machine has a shadow registry within its installation directory to which writes
are made, thereby preventing modifications to the actual registry.
There are several other benefits of the virtual machine. The virtual machine
enables fine grain control of network, memory, disk, and computing resources used by
the application in order to limit its obtrusiveness. Also, the Entropia virtual machine
simplifies application integration into the desktop grid system by allowing any (propri-
etary) Windows executable to be run by the worker without any changes to the (legacy)
source code or recompiling to link with special libraries.
Besides Entropia, the XtremWeb research group investigated the use of a
user-level sandbox [19], which intercepts any system calls of the application, and for each
intercepted system call, runs a security check to ensure the call is valid before allowing
its execution. Specifically, XtremWeb deploys this method by using Ptrace to allow
a parent process to retain control over its child when specific operations are executed
by system calls. When a ptraced child process makes a system call, its execution is
paused, and the parent process can inspect the parameters of the call before allowing
its execution. If the child’s system call fails its parent’s check, the parent can kill the
child process. The drawback of this sandboxing technique is the overhead of at
least two context switches per intercepted call between the child and parent processes, so
applications with significant I/O will perform poorly on such systems.
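The parent's per-call security check can be illustrated in isolation from the ptrace plumbing. The sketch below is hypothetical (the policy table and function names are ours, not XtremWeb's); in a real sandbox the call name and arguments would be read out of the stopped child's registers after a PTRACE_SYSCALL stop.

```python
# Hypothetical sketch of the parent's security check in a ptrace-style
# user-level sandbox: each intercepted system call is validated against
# a policy before the child is allowed to resume.

ALLOWED_SYSCALLS = {"read", "write", "open", "close", "brk", "exit"}
WRITABLE_PREFIX = "/tmp/sandbox/"   # only this subtree may be modified

def check_syscall(name, args):
    """Return True if the intercepted call may execute; False means the
    parent should kill the child process."""
    if name not in ALLOWED_SYSCALLS:
        return False
    if name == "open":
        path, mode = args
        # deny opening files for writing outside the sandbox directory
        if "w" in mode and not path.startswith(WRITABLE_PREFIX):
            return False
    return True
```

Here the arguments are passed directly for clarity; the check itself is where the security policy lives, and each invocation corresponds to one pause/inspect/resume cycle of the traced child.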
Alternatively, the XtremWeb group considered using a kernel-level sandbox
technique, where a kernel patch is installed that adds hooks at the beginning or end
of particular kernel functions. A superuser can insert a module that implements these
hooks to define a specific security policy. The advantage of this method is that no context
switches are necessary. However, because this method requires root privileges, XtremWeb
did not implement it.
While sandboxes protect the host machine from a misbehaving application,
workers have several security mechanisms to protect the application (including its data)
from the user. To deter inspection, the application executable and data can be encrypted
with multiple keys to make examination difficult [28, 90] and modifications detectable.
II.B.4 Design Trade-offs of Centralization
At one end of the spectrum, a desktop grid can be completely centralized where
the client manager and all the components in the Application and the Resource Manage-
ment Layer are located on a single machine. At the other end, each host in the desktop
grid would be completely autonomous with little or no knowledge of other hosts or ap-
plications in the system. While most desktop grid systems are centralized, we identify
the trade-offs of centralized versus decentralized design with respect to the system goals
outlined in Section II.A, focusing particularly on scalability, server fault tolerance, task
result verification, and worker software manageability.
II.B.4.a Scalability
We focus on two aspects, namely resource management and application data
management. Two important parts of resource management are monitoring of the re-
sources to determine dynamic information such as CPU or network activity, and resource
selection. Several systems, such as NWS [93], use a hierarchical approach, which is
amenable to incorporation with desktop grid systems, to allow for scalable monitoring of
resources. Regarding resource selection, there exist several centralized systems [48]. For
example, the system in [48] can execute expressive resource queries (including ranking,
clustering) over a large set of attributes of millions of resources on the order of sec-
onds on a modest machine. The particular implementation used a relational database to
store the hierarchical structure of a set of resources, and using an XML database could
improve performance even further. Also, the authors in [68] showed that decentralized
resource management (specifically, monitoring and selection) is not always advantageous
performance-wise; they found that strategically placing 4-node server clusters to
support resource discovery results in performance comparable to that of decentralized
approaches based on distributed hash tables (DHT’s). In terms of ease of implementa-
tion, our own experience suggests that resource selection is greatly simplified if there is
a global view of the resources in the system, which is lost in a fully decentralized system.
At the same time, there have been several efforts to decentralize resource
monitoring and discovery to achieve scalability and fault tolerance, such as in SWORD [67],
Xenoservers [84], and GUARD [64]. The general approach in SWORD [67] and Xenoservers [84]
is to use DHT’s for distributing the data about resources and related queries among a
set of hosts. For example, to store data about host CPU availability, one host may store
values between 0 and 20%, another host may store values between 20 and 40% and so
on. Queries in the form of <attribute, value> are mapped to unique keys, which are
then routed to the host containing the corresponding data. The advantage of such an
approach is that it can tolerate host failures as the DHT will automatically restructure
itself as needed. The approach use in GUARD [64] is to create a “gossiping” protocol
based on distance vectors where resource information propagates automatically to a node
through its neighbors. The protocol is designed to be scalable and to withstand host
failures.
While benefits of decentralized resource management relative to centralized
management are debatable, one of the most limiting aspects of a centralized design is
application data management, in particular storage and distribution. In [63], the authors
show that an application with medium input sizes (10MB) and low execution times (5
minutes) requires significant bandwidth (264 Mbps) for a medium number of workers
(1000). Many applications can have much higher data input sizes, and distributing such
data inputs from a centralized server could be infeasible, and thus mandates decentralized
approaches, such as peer-to-peer (P2P) methods described in [85, 16, 72]. For example,
the Chord protocol provides a fast method by which to locate a datum stored on a set
of volatile hosts on a wide area. In particular, Chord is based on a distributed hash
table primitive that supports data lookups using only log(H) messages, where H is the
number of hosts in the system. The hosts in a DHT are organized using a logical overlay
that maps a unique id corresponding to a host to some position in the overlay. Each
node contains a routing table that indicates which of its neighbors are “closer” to the
datum. A datum has a unique identifier and is mapped to the “closest” node in the
logical overlay. Although there are many P2P methods for locating a datum on a set
of volatile hosts, linking the computation with the data (and addressing issues such as
locality) is still an open problem.
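The id-to-position mapping at the heart of such overlays can be sketched with consistent hashing. This is an illustrative sketch, not Chord's actual implementation: a real Chord node routes through finger tables to reach the responsible host in O(log H) hops, rather than scanning a sorted list of all hosts as done here.

```python
import hashlib

def ring_position(name, ring_bits=16):
    """Map a host name or datum key to a position on the logical ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** ring_bits)

def responsible_host(hosts, datum_key):
    """Return the host whose ring position is the first at or after the
    datum's position, wrapping around the ring (the "closest" node in
    Chord's sense)."""
    positions = sorted((ring_position(h), h) for h in hosts)
    k = ring_position(datum_key)
    for pos, host in positions:
        if pos >= k:
            return host
    return positions[0][1]  # wrap around the ring
```

Because host positions come from a hash, adding or removing a host only changes the assignment of keys in its immediate ring segment, which is what lets the overlay restructure itself after failures.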
II.B.4.b Fault Tolerance
Centralization can cause the server to become a single point of failure. To
avoid failure, we argue that replicating the server, i.e., components of the application
and resource management level, can reduce the probability of failure significantly. For
example, in 2001, the SETI@home server (including the web and data servers) became
nonfunctional 16 times [79]. (The causes of failure included hardware upgrades, updates
of the database or database software, RAID cards failing, electrical storms and repairs,
power outages, full disks, database failures, and rearranging of hardware.) Assuming
the server fails at a rate of 16 times per year, if the server was replicated on two servers
that had the same and independent failure rates, the probability that all servers fail at
once is less than 10^-8, or approximately once every 38,000 decades. The point is that
setting up a few extra servers or a server farm (versus a totally decentralized solution),
which several systems such as BOINC and XtremWeb support, could reduce the chance
of failure down to near-zero.
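A quick sanity check of this argument can be done with a back-of-the-envelope calculation. The mean outage duration used below (3 minutes) is our assumption; the dissertation does not state the duration behind its 10^-8 figure.

```python
# Back-of-the-envelope check of the replication argument: two independent
# replicas must be down simultaneously for the service to fail.

MINUTES_PER_YEAR = 365 * 24 * 60
failures_per_year = 16        # observed SETI@home server failure rate [79]
mean_outage_minutes = 3       # assumed, not stated in the source

# fraction of time a single server is unavailable
p_single = failures_per_year * mean_outage_minutes / MINUTES_PER_YEAR

# with an independent replica, both must be down at the same instant
p_both = p_single ** 2

print(f"single-server unavailability: {p_single:.2e}")
print(f"joint unavailability with 2 replicas: {p_both:.2e}")
```

Under this assumption the joint unavailability indeed falls below 10^-8, consistent with the claim that a small server farm drives the chance of total failure to near-zero.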
Even if a failure does occur, the effect is small since most Internet desktop grid
applications are high-throughput and an outage of a few hours is not significant as the
applications do not have stringent time requirements. Also, systems such as BOINC and
XtremWeb have mechanisms for graceful recovery. For example, when the data server
fails, all workers finish computing their tasks [10], and when the server comes back up,
it could become inundated by worker upload requests. To reduce the storm of requests,
BOINC and XtremWeb force exponential backoff of their workers when the server is
overloaded.
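The generic form of this backoff can be sketched as follows; the base delay, cap, and jitter are illustrative values of ours, not BOINC's or XtremWeb's actual parameters.

```python
import random

def backoff_delay(attempt, base=60.0, cap=86400.0):
    """Delay (seconds) before a worker's next upload attempt: the window
    doubles with each consecutive failure, saturates at `cap` (one day
    here), and a random draw within the window keeps workers from
    retrying in lockstep after a server outage."""
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)
```

The jitter is the important part: without it, every worker that observed the same outage would retry at the same instants, recreating the very request storm the backoff is meant to dissipate.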
Regarding task result verification, the result returned by a worker can often be
erroneous. In particular, the authors in [88] found that significant computation errors or
differences could be caused by hardware malfunctions, incorrect software modifications,
malicious attacks, or differences in floating-point hardware, libraries and compilers. The
error rates for two scientific applications (MFold and CHARMM) deployed over an Inter-
net desktop grid were 1.9% and 8.7%, respectively.
Task replication has been used as a means for detection and correction [88,
74, 87]. Multiple copies of a work unit are sent to different workers. When the results
are returned, they are compared and the result that appears most often or has been
computed by a credible worker is assumed to be correct. When a worker is found to
have computed a bad result, it is blacklisted to prevent the worker from affecting the
application further. Blacklisting a worker in a centralized system is trivial, but a fully
decentralized system could require a notification of each node hosting components of
the Application and Resource Management level in order to prevent the saboteur from
participating in the computation.
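The voting-and-blacklisting step can be sketched as below. This is a simplified majority vote; the worker-credibility weighting described in [88, 74, 87] is omitted, and all names are ours.

```python
from collections import Counter

def verify_results(results, blacklist):
    """results: list of (worker_id, value) pairs, one per replica of a
    work unit. Returns the strict-majority value, or None if there is
    no majority; workers that returned a deviating value are added to
    the blacklist so they receive no further work."""
    votes = Counter(v for w, v in results if w not in blacklist)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    if count <= len(results) // 2:   # no strict majority among replicas
        return None
    for worker, v in results:
        if v != value:
            blacklist.add(worker)
    return value
```

In a centralized system the blacklist is a single server-side set, which is why blacklisting is trivial there; a decentralized system would have to propagate this set to every node that hands out work.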
Regarding manageability, Entropia, XtremWeb, and United Devices all provide
command-line tools or web interfaces by which a single administrator can
manage applications, monitor their progress, or install workers and send updates. How
to effectively manage applications in a decentralized environment is still an open area
of research.
In summary, while decentralization certainly has many potential benefits, significant
challenges must be overcome before those benefits outweigh the costs of forgoing
centralization. Currently, there are at least two research efforts for developing a
decentralized desktop grid system, namely the Cluster Computing On the Fly system [53]
and the Organic Grid [27]. The authors in [53] propose the Cluster Computing On the Fly
system that uses distributed hash table techniques for locating available resources. The
authors in [27] describe a prototype of the Organic Grid, which is a fully decentralized
system based on mobile agents.
Chapter III
Resource Characterization
The measurement and characterization of desktop grids is useful for several
reasons. First, the data can be used for the performance evaluation of the entire system,
subsets, or individual hosts. For example, one could determine the aggregate compute
power of the entire desktop grid over time. Second, the data can be used to develop pre-
dictive, generative, or explanatory models [35]. For example, a predictive model could be
formed to predict the availability of a host given that it has been available for some pe-
riod of time. A generative model based on the data could be used to generate host clock
rate distribution or availability intervals for simulation of the platform. After showing
a precise fit between the data and the model, the model can often help explain certain
trends shown in the data. Third, the measurements themselves could be used to drive
simulation experiments. Fourth, the characterization should influence design decisions
for resource management heuristics. We discuss in this chapter our measurement tech-
nique for obtaining traces of several real desktop grids, and a statistical characterization
of each system as a whole and also individual hosts.
III.A The Ideal Resource Trace
The design and evaluation of our scheduling heuristics requires an accurate
characterization of a desktop grid system. An accurate characterization involves obtain-
ing detailed availability traces of the underlying resources. The term “availability” has
different meanings in different contexts and must be clearly defined for the problem at
hand [15]. In the characterization of these data sets, we distinguished among three types
of availability:
1. Host availability. This is a binary value that indicates whether a host is reachable,
which corresponds to the definition of availability in [18, 8, 15, 30, 76]. Causes of
host unavailability include power failure, or a machine shutoff, reboot, or crash.
2. Task execution availability. This is a binary value that indicates whether a task
can execute on the host or not, according to the worker’s recruitment policy. We
refer to task execution availability as exec availability in short. Causes of exec
unavailability include prolonged user keyboard/mouse activity, or a user compute-
bound process.
3. CPU availability. This is a percentage value that quantifies the fraction of the CPU
that can be exploited by a desktop grid application, which corresponds to the
definition in [12, 25, 80, 32, 92]. Factors that affect CPU availability include system
and user level compute-intensive processes.
Host unavailability implies exec unavailability, which implies CPU unavailabil-
ity. Clearly, if a host becomes unavailable (e.g., due to shutdown of the machine), then
no new task may begin execution and any executing task would fail. If there is a period
of task execution unavailability (e.g., due to keyboard/mouse activity), then the desktop
grid worker will stop the execution of any task, causing it to fail, and disallow a task to
begin execution; as a result of task execution unavailability, the task will observe zero
CPU availability.
However, CPU unavailability does not imply exec unavailability. For example,
a task could be suspended and therefore have zero CPU availability, but since the task
can resume and continue execution, the host is available in terms of task execution.
Similarly, exec unavailability does not imply host unavailability. For example, a task
could be terminated due to user mouse/keyboard activity, but the host itself could still
be up.
Given these definitions of availability, the ideal trace of availability would have
the following characteristics:
1. The trace would log CPU availability in terms of the CPU time a real application
would receive if it were executing on that host.
2. The trace would record exec availability, in particular when failures occur. From
this, one can find the temporal structure of availability intervals. We call the inter-
val of time in between two consecutive periods of exec unavailability an availability
interval.
3. The trace would determine the cause of the failures (e.g., mouse/user activity,
machine reboot or crash). This would enable statistical modeling (for the purpose
of prediction, for example) of each particular type of failure.
4. The trace would capture all system overheads. For example, some desktop grid
workers run within virtual machines [19, 21], and there may be overheads in terms
of start-up, system calls, and memory costs.
III.B Related Work on Resource Measurements and Modelling
Although a plethora of work has been done on the measurement and char-
acterization of host and CPU availability, there are two main deficiencies of this re-
lated research. First, the traces do not capture all causes of task failures (e.g., users’
mouse/keyboard activity), and inferring task failures and the temporal characteristics of
availability from such traces is difficult. Second, the traces may reflect idiosyncrasies of
the OS [92, 32] instead of showing the true CPU contention for a running task.
In this section, we highlight the shortfalls of the trace methods used in these
studies, and explain why many of the statistical models founded on this trace data
are inapplicable to desktop grids. Table III.1 summarizes methods of the representative
studies and the shortfalls.
III.B.1 Host Availability
Several traces have been obtained that log host availability throughout time.
In [20], the authors designed a sensor that periodically records each machine’s uptime
from the /proc file system and used this sensor to monitor 83 machines in a student
lab. In [56], the authors periodically made RPC calls to rpc.statd, which runs as part
of the Network File System (NFS), on 1170 hosts connected to the Internet (see the
row corresponding to the data set Long in Table III.1). A response to the call indicated
the host was up and a missing response indicated a failure. In [15], a prober runs on
the Overnet peer-to-peer file-sharing system looking up host ID’s. A machine with a
corresponding ID is available if it responds to the probe; about 2,400 machines were
monitored in this fashion. The authors in [30, 76] determine availability by periodically
probing IP addresses in a Gnutella system.
Using traces that record only host availability for the purpose of modeling
desktop grids is problematic because it is hard to relate uptimes to CPU cycles usable
by a desktop grid application. Several factors can affect an application’s running time
on a desktop grid, which include not only host availability but also CPU load and user
activity. Thus, traces that indicate only uptime are of dubious use for performance
modeling of desktop grids or for driving simulations.
III.B.2 Host Load and CPU Utilization
There are numerous data sets containing traces of host load or CPU utilization
on groups of workstations. Host load is usually measured by taking a moving average
of the number of processes in the ready queue maintained by the operating system’s
scheduler, whereas CPU utilization is often measured by the CPU time or clock cycles
per time interval received by each process. Since host load is correlated with CPU
utilization, we discuss both types of studies in this section.
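The host-load metric described above is typically an exponential moving average of run-queue samples, which can be sketched as follows (the smoothing constant is an arbitrary choice of ours, not the value used in [35]):

```python
def ema_load(samples, alpha=0.2):
    """Exponential moving average of run-queue lengths, the usual "host
    load" metric. `alpha` controls how quickly old samples are forgotten;
    returns the smoothed series, one value per raw sample."""
    avg = None
    series = []
    for s in samples:
        avg = s if avg is None else alpha * s + (1 - alpha) * avg
        series.append(avg)
    return series
```

The smoothing is what makes host load a lagging indicator: a sudden burst of processes only shows up gradually, which is one reason such traces cannot pinpoint the instant a desktop grid task would actually fail.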
The CPU availability traces described in [62, 92, 93] are obtained using the
UNIX tools ps and vmstat, which scan the /proc filesystem to monitor processes. In
particular, the authors in [62] used ps to measure CPU availability on about 13 VAXsta-
tionII workstations over a period of 3 months, and then later monitored 20 workstations
over a period of 4 months (see the row corresponding to the data set Condor in Ta-
ble III.1). Then they post-processed the data to determine each machine’s unavailability
intervals. A host was considered unavailable if its CPU utilization by user processes
went over 25%. They assumed a waiting time of 1 minute for the first 3 month period
of traces, and 5 minutes for the second 4-month trace period. Similarly, the authors in [35]
measured host load by periodically using the exponential moving average of the number
of processes in the ready queue recorded by the kernel (see the row corresponding to
the data set Dinda in Table III.1). The study was based on about 38 machines over a 1
week period in August 1997. In contrast to previous studies on UNIX systems, the work
in [18] measured CPU availability from 3908 Windows NT machines over 18 days by
periodically inspecting windows performance counters to determine fractions of cycles
used by processes other than the idle process.
None of the trace data sets mentioned above record the various events that
would cause application task failures (e.g., keyboard/mouse activity) nor are the data
sets immune to OS idiosyncrasies. For example, most UNIX process schedulers (the
Linux kernel, in particular) give long running processes low priority. So, if a long running
process were running on a CPU, a sensor would determine that the CPU is completely
unavailable. However, if a desktop grid task had been running on that CPU, the task
would have received a sizable chunk of CPU time. Furthermore, processes may have a
fixed low priority. Although the cause of task failures could be inferred from the data,
doing so is not trivial and may not be accurate.
III.B.3 Process Lifetimes
In [45], the authors conduct an empirical study on process lifetimes, and pro-
pose a function that fits the measured distribution of lifetimes. Using this model, they
determine which process should be migrated and when. Inferring the temporal struc-
ture of availability from this model of process lifetimes would be difficult because it is
not clear how to determine the starting point of each process in time in relationship
to one another. Moreover, the study did not monitor keyboard/mouse activity, which
significantly impacts availability intervals [73] in addition to CPU load.
Data Set    | OS           | Trace dates                           | # of hosts                                     | Method                  | User base                                                               | CPU threshold | Waiting time | Suspend time | <10 yrs old? | Host avail.? | Exec avail.? | True CPU avail.?
Long [56]   | UNIX         | 3 months, 1995                        | 1170 hosts                                     | RPC calls to rpc.statd  | mix over the Internet                                                   | N/A           | N/A          | N/A          | Y            | Y            | N            | N
Dinda [32]  | Digital UNIX | 1 week, 8/97                          | 38 hosts                                       | moving load average     | front-end, interactive, batch hosts, compute servers, desktops, cluster | N/A           | N/A          | N/A          | Y            | Y            | N            | N
Condor [62] | 4.2BSD Unix  | 7 months total, 9/86-1/87, 9/87-12/89 | 13 VAXstationII workstations, 20 workstations  | load average via ps     | faculty, system programmers, graduate students                          | 25%           | 5 min, 1 min | none         | N            | Y            | Y            | N
SDSC        | Windows      | 1 month total                         | ∼220 hosts                                     | real measurement tasks  | secretaries, conference rooms, administrators, research staff           | 20%           | 10 min       | 10 min       | Y            | Y            | Y            | Y
XtremWeb    | Linux        | 1 month, 1/05                         | ∼100 hosts                                     | real measurement tasks  | cluster, students                                                       | 10%           | 30 sec       | 0            | Y            | Y            | Y            | Y
UCB [12]    | Ultrix 4.2a  | 46 days, 2/94-3/94                    | 85 DEC 5000 workstations                       | user-level daemon       | EE/CS grad students                                                     | 5%            | 1 min        | none         | N            | Y            | Deriv.       | N

Table III.1: Characteristics of desktop grid trace data sets. (Deriv. denotes “derivable”)
III.C Trace Method
We gather traces by submitting measurement tasks to a desktop grid system
that are perceived and executed as real tasks. These tasks perform computation and
periodically write their computation rates to file. This method requires that no other
desktop grid application be running, and allows us to measure exactly the compute power
that a real, compute-bound application would be able to exploit. Our measurement
technique differs from previously used methods in that the measurement tasks consume
the CPU cycles as a real application would.
During each measurement period, we keep the desktop grid system fully loaded
with requests for our CPU-bound, fixed-time length tasks, most of which were around 10
minutes in length. The desktop grid worker running on each host ensured that these tasks
did not interfere with the desktop user and that the tasks were suspended/terminated as
necessary; the resource owners were unaware of our measurement activities. Each task of
fixed time length consists of an infinite loop that performs a mix of integer and floating
point operations. A dedicated 1.5GHz Pentium processor can perform 110.7 million such
operations per second. Every 10 seconds, a task evaluates how much work it has been
able to achieve in the last 10 seconds, and writes this measurement to a file. These files
are retrieved by the desktop grid system and are then assembled to construct a time
series of CPU availability in terms of the number of operations that were available to
the desktop grid application within every 10 second interval.
For the Windows version of the measurement task, we implement the timing aspect
of a task using the Windows multimedia timer. The timer is implemented by spawning
a high-priority thread that sets a kernel timer and then blocks. When the thread wakes
up, it executes our callback routine posting the number of operations completed during
the last 10 seconds in the Windows application’s message queue, and then sleeps. (Note
that the message queue is sufficiently large to preclude overfilling of the queue during
the 15 minutes of measurements per task.) The frequency at which the kernel interrupts
to check the timer expiration is set when initializing the timer. We tried various fre-
quencies from once every millisecond to once every second, and noticed little difference
in the total operations logged per time period. Thus, we used a timer resolution of 1 ms
assuming the overhead of using the timer is negligible. Regardless, the overhead of using
the timer should be constant across time intervals, and so the number of operations per
time interval would be equally affected.
We implement the computational aspect of a task by iteratively performing
integer and floating-point calculations within an integer array and a double array. Intra-loop
dependencies were added to prevent compiler optimization. The sizes of the arrays (60
and 224 bytes, respectively) were small enough to fit in cache, and so we excluded the
costs of memory accesses from our traces. The Linux version of the measurement task
was implemented in a similar manner.
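The structure of such a measurement task can be sketched as follows. This is an illustrative Python rendition, not the actual code: the real tasks were native Windows and Linux binaries, and the array sizes here (15 ints, 28 doubles) merely mirror the 60- and 224-byte cache-resident arrays described above.

```python
import time

def measurement_task(report_interval=10.0, duration=None, log=print):
    """Sketch of the measurement task: a compute loop over two small
    in-cache arrays with intra-loop dependencies, counting operations
    and logging the count achieved within each report interval."""
    ints = [1] * 15           # ~60 bytes of integers
    flts = [1.0] * 28         # 224 bytes of doubles
    ops = 0
    start = last = time.monotonic()
    while duration is None or time.monotonic() - start < duration:
        for i in range(len(ints)):
            # each update depends on the previous value, preventing
            # the work from being optimized away
            ints[i] = (ints[i] * 3 + i) % 1009
            flts[i] = flts[i] * 1.0001 + 0.5
            ops += 2
        now = time.monotonic()
        if now - last >= report_interval:
            log(ops)          # operations achieved since the last report
            ops = 0
            last = now
```

Collecting the logged counts from every host and concatenating them yields exactly the time series of per-interval CPU availability that the traces are built from.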
The main advantage of obtaining traces in this fashion is that the application
experiences host and CPU availability exactly as any real desktop grid application would.
This method is not susceptible to OS idiosyncrasies because the logging is done by
a CPU-bound task actually running on the host itself. Also, this approach captures
all the various causes of task failures, including but not limited to mouse/keyboard
activity, operating system failures, and hardware failures, and the resulting trace reflects
the temporal structure of availability intervals caused by these failures. Moreover, our
method takes into account overhead, limitations, and policies of accessing the resources
via the desktop grid infrastructure.
Every measurement method has weaknesses, and our method is certainly not
flawless. One weakness compared to the ideal trace data set is that we cannot identify the
specific causes of failures, and so we cannot distinguish failures caused by user activity
versus power failures, for example. This, in turn, could make stochastic failure prediction
models more difficult to derive, as one source of failure could skew the distribution of
another source. Nevertheless, all types of failures are still subsumed in our traces, in
contrast to other studies that often omit many types of desktop failures as described in
earlier sections.
Also, the tasks were executed by means of a desktop grid worker, which used a
particular recruitment policy. This means that the trace may be biased to the particular
worker settings used in the specific deployment. However, with knowledge of these
settings, it is straightforward to infer reliably at which points in the trace the bias
occurs and thus possible to remove such bias. After removing the bias, one could
post-process the traces according to any other CPU-based recruitment policy to determine the
corresponding CPU availability. This makes it possible to collect data using our trace
method from a desktop grid with one recruitment policy, and then simulate another
desktop grid with a different recruitment policy using the same set of traces with minor
adjustments.
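Such post-processing can be sketched as a simple filter over the trace; the function name and threshold below are hypothetical, chosen only to illustrate a CPU-based recruitment policy.

```python
def apply_recruitment_policy(avail, threshold=0.9):
    """Re-derive availability under a hypothetical CPU-based recruitment
    policy: the worker suspends the task whenever less than `threshold`
    of the CPU is available, so those intervals contribute no useful
    cycles. `avail` is the per-interval CPU availability as a fraction
    (one entry per 10-second measurement interval)."""
    return [a if a >= threshold else 0.0 for a in avail]
```

Simulating a grid with a stricter policy than the one used during collection is then just a matter of running every host's trace through such a filter before feeding it to the simulator.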
III.D Trace Data Sets
Using the previously described method, we collected data sets from two desktop
grids. One of these desktop grids consisted of desktop PC’s at the San Diego
Supercomputer Center (SDSC) and ran the commercial Entropia [28] desktop grid software.
We refer to the data collected from the SDSC environment as the SDSC trace. The other
desktop grid consisted of desktop PC’s at the University of Paris South, and ran the open
source XtremWeb [37] desktop grid software. The Xtremweb desktop grid incorporated
machines from two different environments. The first environment was a cluster used
by a computer science research group for running parallel applications and benchmarks,
and we refer to the data set collected from this cluster as the LRI trace. The second
environment consisted of desktop PC’s in classrooms used by first-year undergraduates,
and we refer to the data set as the DEUG trace. Finally, we obtained the traces described
in [12] which were measured using a different trace method and refer to this data set as
the UCB trace. (We describe this method in Section III.C.) The advantages of using
these data sets versus others are highlighted in Table III.1, and labelled as data sets
SDSC, XtremWeb, and UCB in the table.
The traces that we obtained from our measurements contain gaps. This is
expected as desktop resources become unavailable for a variety of reasons, such as the
rebooting or powering off of hosts, local processes using 100% of the CPU, the desktop
grid worker detecting mouse or keyboard activity, or the user actively pausing the worker.
However, we observe that a very large fraction (≥ 95%) of these gaps are shorter than
2 minutes. Figures III.1(a) and III.1(b) plot the distribution of these small gaps
for the Entropia desktop grid at SDSC and the Xtremweb desktop grid at the University
of Paris-Sud respectively. The average small gap length is 35.9 seconds on the Entropia
grid, and 19.5 seconds on the Xtremweb grid.
[Figure III.1 consists of two histograms plotting the number of gaps versus gap length (0–120 sec): (a) SDSC and (b) Xtremweb.]
Figure III.1: Distribution of “small” gaps (<2 min.).
After careful examination of our traces, we found that these short gaps occur
exclusively in between the termination of a task and the beginning of a new task on the
same host. We thus conclude that these small gaps do not correspond to actual exec
unavailability, but rather are due to the delay of the desktop grid system for starting a
new task. In the SDSC grid, the majority of the gaps are spread in the approximate range
of 5 to 60 seconds (see Figure III.1(a)). The sources of this overhead include various
system costs of receiving, scheduling, and sending a task as well as an actual built-in
limitation that prevents the system from sending tasks to resources too quickly. That
is, the Entropia server enforces a delay between the time it receives a request from the
worker to the time it sends a task to that worker. This is to limit the damaging effect of
the “black-hole” problem where a worker does not correctly execute tasks, and instead,
repeatedly and frequently sends requests for more tasks from the server. Without the
artificial task sending delay, the result of the “black-hole” problem is applications with
thousands of tasks that completed instantly but erroneously.
In the XtremWeb grid, the majority of the gaps are between 0 to 5 seconds and
40 to 45 seconds in length. When the Xtremweb worker is available to execute a task,
it sends a request to the server. If the server is busy or there is no task to execute, the
worker is told to make another request after a certain period of time (43 seconds) has
expired. This explains the bimodal distribution of gaps length in the XtremWeb system.
Therefore, these small availability gaps observed in both the Entropia and
XtremWeb grids would not be experienced by the tasks of a real application, but only in
between tasks. Consequently, we eliminated all gaps that were under 2 minutes in our
traces by performing linear interpolation.
Specifically, we interpolate gaps under 2 minutes in length using the following
method. Let prevnumops be the number of operations measured in the subinterval of
length prevduration that ends just before the gap begins. Let postnumops be the number
of operations measured in the subinterval of length postduration that begins just after
the gap ends. (prevduration and postduration are most often 10 seconds in length since
measurements are made usually every 10 seconds.) Let gapduration be the gap length,
and let gapnumops be the interpolated number of operations that are available during
the gap. We calculate gapnumops using a weighted average of the rate of operations
completed immediately before and after the gap so that the rate of operations during
the longer of the two subintervals carries more weight in the interpolation:
gapnumops = gapduration × [ (prevnumops / prevduration) × (prevduration / (prevduration + postduration))
                          + (postnumops / postduration) × (postduration / (prevduration + postduration)) ]
Usually, the subintervals immediately preceding and following the gap are 10 seconds in
length and so the interpolated rate is just the average of the rates before and after the
gap.
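The interpolation can be expressed directly in code (the function name is ours):

```python
def interpolate_gap(prevnumops, prevduration, postnumops, postduration, gapduration):
    """Number of operations attributed to a small (<2 min) gap: a weighted
    average of the operation rates just before and just after the gap,
    where the longer of the two subintervals carries proportionally more
    weight."""
    total = prevduration + postduration
    rate = (prevnumops / prevduration) * (prevduration / total) \
         + (postnumops / postduration) * (postduration / total)
    return gapduration * rate
```

Algebraically the weights cancel against the rate denominators, so the formula reduces to gapduration × (prevnumops + postnumops) / (prevduration + postduration); when both subintervals are 10 seconds long, this is simply the gap length times the average of the two rates.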
A small portion of the gaps larger than 2 minutes may be also attributed to the
server delay and this means that our post-processed traces may be slightly optimistic.
Note that although we use interpolation, we use the average small gap length in our
performance models, which we describe in Section III.G, to account for the server delay.
For a real application, the gaps may be larger due to transfer of input data files necessary
for task execution. Such transfer cost could be added to our average small gap length
and thus easily included in the performance model developed in Section III.G.
The weakness of interpolating the relatively small gaps is that this in effect
masks short failures less than 2 minutes in length. Failures due to fast machine reboots
for example could be overlooked using this interpolation method.
III.D.1 SDSC Trace
The first data set was collected using the Entropia DCGridTMdesktop grid
software system deployed at SDSC for a cumulative period of about 1 month across 275
hosts. We conducted measurements during four distinct time periods: from 8/18/03 until
8/22/03, from 9/3/03 until 9/17/03, from 9/23/03 until 9/26/03, and from 10/3/03 until
10/6/03, for a total of approximately 28 days of measurements. For our characterization
and simulation experiments, we use the longest continuous period of trace measurements,
which was the two-week period from 9/3/03 to 9/17/03.
Of the 275 hosts, 30 are used by secretaries, 20 are public hosts that are avail-
able in SDSC’s conference rooms, 12 are used by system administrators, and the remain-
ing are used by SDSC staff scientists and researchers. The hosts are all on the same
class C network, with most clients having a 10Mbit/sec connection and a few having
a 100Mbit/sec connection. All hosts are desktop resources that run different flavors
of WindowsTM. The Entropia server was running on a dual-processor XEON 500MHz
machine with 1GB of RAM.
To validate the measurements made by our monitoring application on the SDSC
grid, we isolated a small set of machines on which we accessed system counters to deter-
mine the CPU utilization of each process while our monitoring task was running on each
host controlled through the Entropia worker daemon. In particular, while continuously
sending our monitoring tasks to the Entropia system, we used the Windows Management
Instrumentation (WMI) to remotely access the system counters of seven machines at the
SDSC to determine the clock ticks devoted to each process running on the host. Only
a limited set of machines could be accessed for about 2 hours on September 12, 2003
as we needed superuser privileges to use WMI. Note that because this method of moni-
toring the machines went through the network to record each machine’s system counter
readings, measurements could be delayed due to network congestion. We compared the
WMI measurements with the task measurements, and found that the availability and
non-availability intervals recorded by the monitoring tasks corresponded to the times at
which the task appeared in the list of processes found through WMI. Moreover, the CPU
availability measured by the monitoring task closely matched the CPU utilization
measured by the WMI queries.
During our experiments, about 200 of the 275 hosts were effectively running
the Entropia worker (on the other hosts, the users presumably disabled the worker) and
we obtained measurements for these hosts. Their clock rates ranged from 179MHz up
to 3.0GHz, with an average of 1.19GHz. Figure III.2 shows the cumulative distribution
function (CDF) of clock rates. The curve is not continuous: for instance, no host has
a clock rate between 1GHz and 1.45GHz. The curve is also skewed: for instance, over
30% of the hosts have clock rates between 797MHz and 863MHz, which represents under
3.5% of the clock rate range.
An interesting feature of the Entropia system is its use of a Virtual Machine
(VM) to insulate application tasks from the resources. While this VM technology is
critical for security and protection issues, it also enables fine-grained control
of an executing task in terms of the resources it uses, such as limiting CPU, memory, and
disk usage, and restricting I/O, threads, processes, etc. The design principle is that an
application should use as much of the host’s resources as possible while not interfering
with local processes. One benefit is that this allows an application to use from 0% to
100% of the CPUs with all possible values in between. It is this CPU availability that
we measure. Note that our measurements could easily be post-processed to evaluate a
desktop grid system that only allows application tasks to run on hosts with, say, more
than 90% available CPU.
One weakness of our measurement method used for the SDSC data set is that
the resolution of the traces is limited to the length of the task. That is, in Entropia,
when a task is terminated, the task’s output is lost and as a result, the trace data is
lost at the same time. Consequently, the unavailability intervals observed in the data
set are pessimistic by at most the task length, and the statistical analysis may suffer
from some periodicity. However, we believe we can still use the data set for modelling
and simulation, and we cross-validate our findings using three other data sets where this
limitation in measurement method was removed.
Another weakness is that the interpolation of gaps may have hidden short
failures, such as reboots. However, the SDSC system administrators recorded only
seven reboots after applying Windows patches during the entire 1-month trace period.
(In particular, server reboot at 6PM on 8/18/03, reboot of all desktops at 6:30PM on
8/21, reboot of all desktops on 9/5/03 at 3AM, possible reboot of all desktops if user
was prompted on 9/7, server reboot at 6PM on 9/8, reboot of all machines at 11PM on
9/10/03, and reboot of all machines on 9/11/03 (not simultaneous).) Although this is
only a lower bound, we believe that the hosts were rebooted infrequently, given that there
was usually a single desktop allocated to each user, and in that sense the desktops were
dedicated systems. So the phenomenon described in [20] of undergraduates sitting
at a desktop and rebooting the machine to “clean” the system of remote users causing
high load does not occur.
Finally, because the Entropia system at SDSC was shared with other users, we
could only take measurements a few days at a time. By taking traces over consecutive
days rather than weeks or months, we do not track the long-term (e.g., monthly) churn
of machines and could lose the long-term temporal structure of availability.
Figure III.2: Host clock rate distribution in each platform (CDF; x-axis: clock rate
in MHz, y-axis: fraction of hosts; one curve each for SDSC, DEUG, LRI, and UCB).
III.D.2 DEUG and LRI Traces
The second data set was collected using the XtremWeb desktop grid software
continuously over about a 1 month period (1/5/05 - 1/30/05) on a total of about 100
hosts at the University of Paris-Sud. In particular, XtremWeb was deployed on a cluster
(denoted LRI) with a total of 40 hosts, and in a classroom (denoted DEUG) with another
40 hosts. The LRI cluster is used by the XtremWeb research group and other
researchers for performance evaluation and for running scientific applications. The DEUG
classroom hosts are used by first-year students. Typically, the classroom hosts are turned
off when not in use, and they are turned on on weekends only if a class is held then.
The XtremWeb worker was modified to keep the output of a running task if it
had failed, and to return the partial output to the server. This removes any bias due to
the failure of a fixed-sized task, and the period of availability logged by the task would
be identical to that observed by a real desktop grid application.
Compared to the clock distribution of hosts in the SDSC platform, the hosts in
the DEUG and LRI platforms have relatively homogeneous clock rates (see Figure III.2).
A large mode in the clock rate distribution for the DEUG platform occurs at 2.4GHz,
which is also the median; almost 70% of the hosts have clock rates of 2.4GHz. The clock
rates in the DEUG platform range from 1.6GHz to 2.8GHz. In the LRI platform, a mode
in the clock rate distribution occurs at about 2GHz, which is also the median; about
65% of the hosts have clock rates at that speed. The range of clock rates is 1.5GHz
to 2GHz.
III.D.3 UCB Trace
We also obtained an older data set first reported in [12], which used a different
measurement method. The traces were collected using a daemon that logged CPU and
keyboard/mouse activity every 2 seconds over a 46-day period (2/15/94 - 3/31/94) on 85
hosts. The hosts were used by graduate students in the EE/CS department at UC Berkeley.
We use the largest continuously measured period between 2/28/94 and 3/13/94. The
traces were post-processed to reflect the availability of the hosts for a desktop grid
application using the following desktop grid settings. A host was considered available
for task execution if the CPU average over the past minute was less than 5%, and there
had been no keyboard/mouse activity during that time. A recruitment period of 1 minute
was used, i.e., a busy host was considered available 1 minute after the activity subsided.
Task suspension was disabled; if a task had been running, it would immediately fail with
the first indication of user activity.
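The post-processing rules above can be sketched as follows. This is a simplified illustration (names are ours), assuming the trace is a list of (cpu_load, user_active) samples taken every 2 seconds:

```python
def ucb_available(samples, recruitment_s=60, cpu_threshold=0.05):
    """For each 2-second sample of (cpu_load, user_active), decide whether
    the host counts as available under the UCB post-processing rules:
    CPU average over the past minute below the threshold and no
    keyboard/mouse activity in that window. Because availability requires
    a full quiet minute, the 1-minute recruitment period falls out of the
    same sliding window."""
    window = recruitment_s // 2          # samples per minute at 2 s resolution
    out = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        cpu_ok = sum(s[0] for s in recent) / len(recent) < cpu_threshold
        idle_ok = not any(s[1] for s in recent)
        out.append(cpu_ok and idle_ok)
    return out
```

A host that was busy at sample i becomes available again only once a full recruitment window has elapsed, matching the rule that a busy host is considered available 1 minute after activity subsides.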
The clock rates of hosts in the UCB platform were all identical, but extremely
slow relative to the other platforms. To make the traces usable in our simulation
experiments, we transform the clock rate of each host to 1.5GHz (see Figure III.2), which
is a modest and reasonable value relative to the clock rates found in the other platforms,
and close to the clock rate of the hosts in the LRI platform.
The UCB data set is usable for desktop grid characterization because its
measurement method took into account the primary factors affecting availability,
namely keyboard/mouse activity and CPU load. As mentioned
previously, the method of determining CPU availability may not be as accurate as our
application-level method of submitting real tasks to the desktop grid system. However,
given that the desktop grid settings are relatively strict (e.g., a host is considered busy
if 5% of the CPU is used), we believe that the result of post-processing is most likely
accurate. The one weakness of this data set is that it is more than 10 years old, and
host usage patterns might have changed during that time. We use this data set to
show that in fact many characteristics of desktop grids have remained constant over the
years. Note that the UCB trace is the only previously existing data set that tracked user
mouse/keyboard activity, which is why it is usable for our desktop grid simulations.
III.E Characterization of Exec Availability
In this section, we characterize in detail exec availability and in our discussion,
the term availability will denote exec availability unless noted otherwise. We report and
discuss aggregate statistics over all hosts in each platform, and when relevant, we also
describe per host statistics.
III.E.1 Number of Hosts Available Over Time
We observed the total number of hosts available over time to determine at which
times during the week and during each day machines are the most volatile. This can
be useful for determining periods of interest when testing various scheduling heuristics.
Figures III.3(a), III.3(b), III.3(c), and III.3(d) show the number of available hosts for
a one week period for the SDSC, DEUG, LRI, and UCB traces respectively. The first
date shown in each figure corresponds to a Sunday, and the series of dates proceeds until
Saturday. Each date shown corresponds to 12:01AM on the particular day. The number
of hosts on the y-axis represents the number of hosts that had at least a single
availability interval during a one-hour range for the SDSC, DEUG, and LRI platforms, and
during a one-minute range for the UCB platform, since the unavailability intervals on this
platform tended to be much smaller than in the rest of the platforms.
With the exception of the LRI trace, we observe a diurnal cycle of volatility
beginning in general during weekday business hours. That is, during the business hours,
the variance in the number of machines over time is relatively high, and during
non-business hours, the number becomes relatively stable. In the case of the UCB and SDSC
traces, the number of machines usually decreases during business hours, whereas
in the DEUG trace, the number of machines can increase or decrease. This difference
in trends can be explained culturally. Most machines in enterprise environments in the
U.S. tend to be powered on throughout the day, and so any fluctuations in the number of
hosts are usually downward. In contrast, in Europe, machines are often powered
off when not in use during business hours (to save power or to reduce fire hazards, for
example), and as a result, the fluctuations can be upward.
Given that students and staff scientists form the majority of the user base at
SDSC and UCB, we believe the cause of volatility during business hours is primarily
keyboard/mouse activity by the user, or perhaps short compilations of
programming code rather than long computations, which can be run on clusters or
supercomputers at SDSC or UCB. This is supported by observations in other similar
CPU availability studies [73, 62]. In the DEUG platform, the volatility is most likely
due to machines being powered on and off, in addition to interactive desktop activity.
The number of hosts in the LRI trace (see Figure III.3(c)) does not follow any
diurnal cycle. This trend can be explained by the user base of the cluster, i.e., computer
science researchers that submit long running batch jobs to the cluster. The result is that
hosts tend to be unavailable in groups at a time, which is reflected by the large drop in
host number on 1/10/05. Moreover, there is little interactive use of the cluster, and so
Figure III.3: Number of hosts available for a given week for each platform:
(a) SDSC (07-Sep-2003 to 14-Sep-2003), (b) DEUG (09-Jan-2005 to 16-Jan-2005),
(c) LRI (09-Jan-2005 to 16-Jan-2005), (d) UCB (26-Feb-1994 to 05-Mar-1994).
Axes: time vs. total number of hosts available.
the number of hosts over time remains relatively constant. Also, the LRI cluster (with
the exception of a handful of nodes, possibly front-end nodes) is turned off on weekends,
which explains the drops on Saturday and Sunday.
We refer to the daily time period during which the set of hosts is most volatile as
business hours. After close examination of the number of hosts over time, we determine
that times delimiting business hours for the SDSC, DEUG, and UCB platforms are 9AM-
5PM, 6AM-6PM, and 10AM-5PM respectively. Regarding the LRI platform, we make
no distinction between non-business hours and business hours.
III.E.2 Temporal Structure of Availability
The successful completion of a task is directly related to the size of availabil-
ity intervals, i.e., intervals between two consecutive periods of unavailability. Here we
show the distributions of various types of availability intervals for each platform, which
characterize its volatility.
Figure III.4: Cumulative distribution of the length of availability intervals (in
hours) for business hours and non-business hours. (a) Business hours; mean interval
lengths: SDSC 2.04, DEUG 0.48, LRI 23.54, UCB 0.17 hours. (b) Non-business hours;
mean interval lengths: SDSC 10.23, DEUG 17.42, UCB 0.82 hours.
In Figure III.4, we show the length of the availability intervals in terms of
hours over all hosts in each platform. Figures III.4(a) and III.4(b) show the intervals
for business hours and non-business hours respectively, where the business hours for each
platform are as defined in Section III.E.1. In the case that a host is available
continuously during the entire business-hour or non-business-hour period, we truncate
the intervals at the beginning and end of the respective period. We do not plot
availability intervals for LRI during non-business hours, i.e., the weekend, because most
of the machines were turned off then.
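The truncation of intervals at period boundaries can be sketched as follows (an illustrative helper with names of our choosing, not code from the dissertation):

```python
def clip_intervals(intervals, period_start, period_end):
    """Truncate availability intervals at the boundaries of a business-hour
    (or non-business-hour) period, so that an interval spanning the whole
    period is counted as a single period-length interval. Intervals are
    (start, end) pairs on a common time axis; those entirely outside the
    period are dropped."""
    clipped = []
    for start, end in intervals:
        s, e = max(start, period_start), min(end, period_end)
        if s < e:
            clipped.append((s, e))
    return clipped
```

For example, with business hours 9–17, an interval (8, 18) is counted as (9, 17) and an interval (1, 2) is dropped.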
Comparing the interval lengths during business hours to those during non-business
hours, we observe, as expected, that the lengths tend to be much longer during non-business
hours (at least 5 times longer). For business hours, we observe that the UCB platform tends
to have the shortest availability intervals (µ ≈ 10 minutes), whereas DEUG and SDSC have
relatively longer intervals (µ ≈ half an hour and two hours, respectively). The LRI
platform by far exhibits the longest intervals (µ ≈ 23.5 hours).
The UCB platform has the shortest length most likely because the CPU thresh-
old of 5% is relatively low. The authors in [62] observed that system daemons can often
cause load up to 25%, and so this could potentially increase the frequency of availability
interruptions. In addition, the UCB platform was used interactively by students, and
often keyboard/mouse activity can cause momentary short bursts of 100% CPU activity.
Since the hosts in the DEUG and SDSC platforms were also used interactively, their
intervals are also relatively short. We surmise that the long intervals of the LRI
platform are a result of the cluster's workload, which often consists of periods of high
activity followed by periods of low activity. So, when the cluster is not in use, the
nodes tend to be available for longer periods of time.
In summary, the lower the CPU threshold, the shorter the availability intervals.
Availability intervals tend to be shorter in interactive environments, and intervals tend
to be longer during non-business hours than business hours.
In Figures III.4(a) and III.4(b), the CDF corresponding to the UCB trace during
business hours appears quite similar to the CDF during non-business hours, whereas the
number of hosts shown in Figure III.3(d) varies considerably during business hours versus
non-business hours. The reason for this discrepancy is that the CDF does not weight
the distribution according to the total sum of availability. For example, consider the
following two data sets A = {1, 1, 1, 1, 1, 10, 10, 10, 10, 10} and B = {1, 1, 1, 1,
1, 10 , 10, 10, 10, 1000} where each element is availability length using the same time
Figure III.5: Cumulative distribution of the length of availability intervals,
normalized to the total duration of availability, for business hours and non-business
hours on the UCB platform.
units across both sets. The CDFs of A and B within the x-range of [0, 10] will appear
quite similar, but in fact the platform from which B was derived is considerably more
stable given that it has an availability interval of 1000 time units. In Figure III.5, we
show the cumulative distribution where each interval length is weighted by its
contribution to the total sum of availability. We can then see that a larger portion of
the availability intervals during business hours are smaller in comparison to the
intervals during non-business hours.
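The weighting described above can be sketched in Python (an illustrative helper, names ours): each interval contributes its own length to the cumulative fraction, rather than counting once.

```python
def weighted_cdf(lengths):
    """CDF of interval lengths where each interval is weighted by its own
    length, i.e. by its contribution to the total availability. Returns
    (sorted_lengths, cumulative_fractions)."""
    xs = sorted(lengths)
    total = sum(xs)
    running, fracs = 0.0, []
    for x in xs:
        running += x
        fracs.append(running / total)
    return xs, fracs
```

Using data set B from the text, the unweighted CDF reaches 0.9 by length 10, but the weighted CDF reaches only 45/1045 ≈ 0.043 there, exposing how much of the total availability the single 1000-unit interval carries.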
While this data is interesting for applications that require hosts to be reachable
for a given period of time (e.g., content distribution) and could be used to confirm and
extend some of the work in [18, 15, 30, 76], it is less relevant to our problem of scheduling
compute-intensive tasks. Indeed, from the perspective of a compute-bound application,
a 1GHz host that is available for 2 hours with average 80% CPU availability is less
attractive than, say, a 2GHz host that is available for 1 hour with average 100% CPU
availability.
By contrast, Figure III.6 plots the cumulative distribution of the availability
intervals, both for business hours and non-business hours, but in terms of the number
of operations performed. So instead of showing availability interval durations, the x-
axis shows the number of operations that can be performed during the interval, which is
computed using our measured CPU availability. This directly quantifies the performance
that an application can extract from a host, factoring in the heterogeneity of hosts.
Other major trends in the data are as expected, with hosts and CPUs more available
during non-business hours than during business hours. This empirical data enables us to
quantify task failure rates and to develop a performance model (which we describe later
in Section III.G).
Figure III.6: Cumulative distribution of the length of availability intervals in
terms of operations for business hours and non-business hours. (a) Business hours;
mean interval lengths: SDSC 0.80, DEUG 0.26, LRI 10.35, UCB 0.07 trillion operations.
(b) Non-business hours; mean interval lengths: SDSC 4.25, DEUG 9.65, UCB 0.33
trillion operations.
III.E.3 Temporal Structure of Unavailability
When scheduling an application, it is also useful to know how long a host is
typically unavailable, i.e., unable to execute a task. Given two hosts with identical
availability interval lengths, one would prefer the host with smaller unavailability
intervals.
Using availability and unavailability interval data, one can predict whether a host has a
high chance of completing a task by a certain time, for example.
Figure III.7 shows the CDF of the length of unavailability intervals in terms
of hours during business hours and non-business hours for each platform. Note that
although a platform may exhibit a heavy-tailed distribution, it does not necessarily
mean that the platform is generally less available. (We describe CPU availability later
in Section III.F.1.)
We observe several distinct trends for each platform. First, for the SDSC
platform, we notice that the unavailability intervals are longer during business hours
than non-business hours. This can be explained by the fact that on weekends patches were
installed and backups were done, and several short unavailability intervals could result
if these were done as a batch. Second, for the DEUG platform, we found that unavailability
intervals tend to be much shorter during business hours (µ ≈ 20 min) than during
non-business hours (µ ≈ 32 hours). The explanation is that many of the machines are
turned off at night or on weekends, resulting in long periods of unavailability. The fact
that during non-business hours more than 50% of DEUG's unavailability intervals are less
than a half hour in length could be explained by a few machines still being used
interactively during non-business hours. Third, for the LRI platform, we notice that 60%
of unavailability intervals are less than 1 hour in length. This could be due to the fact
that most jobs submitted to clusters or MPPs tend to be quite short in length [58].
Lastly, for the UCB platform, the CDF of unavailability intervals appears similar across
business and non-business hours. We believe this is because the platform's user base
consists of students, and while fewer machines are in use during non-business hours, the
pattern in which the students use the machines is the same (i.e., using the
keyboard/mouse and running short processes), resulting in nearly identical distributions.
Figure III.7: Cumulative distribution of the length of unavailability intervals in
terms of hours. (a) Business hours; means: SDSC 1.26, DEUG 0.36, LRI 3.76, UCB 0.12
hours. (b) Non-business hours; means: SDSC 1.26, DEUG 31.95, UCB 0.10 hours.
III.E.4 Task Failure Rates
Figure III.8: Task failure rates during business hours. (a) Aggregate failure rate
vs. task size (minutes on a dedicated 1.5GHz machine), with least-squares correlation
coefficients ρ: SDSC 0.992, DEUG 0.988, LRI 0.999, UCB 0.982. (b) Cumulative
distribution of per-host failure rates for 35-minute tasks.
Based on our characterization of the temporal structure of resource availability,
it is possible to derive the expected task failure rate, that is, the probability that a
host will become unavailable before a task completes, from the distribution of the number
of operations performed in between periods of unavailability (from the data shown in
Figure III.6(a)), based on random incidence. To calculate the failure rate, we choose
hundreds of thousands of random points during periods of exec availability in the traces.
At each point, we determine whether a task of a given size (i.e., number of operations)
would run to completion given that the host was available for task execution. If so, we
count the trial as a success; otherwise, we count the trial as a failure. We do not count
tasks that are started during periods of exec unavailability, since we assume the worker
in this case would not be connected to the scheduler, and so scheduling a task at that
point would not be possible.
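This random-incidence estimate can be sketched as follows. The sketch assumes the availability intervals are already expressed as operation counts (as in Figure III.6); function and parameter names are ours:

```python
import random

def task_failure_rate(avail_intervals_ops, task_ops, trials=100_000, seed=0):
    """Estimate the probability that a task of `task_ops` operations fails,
    by dropping random start points into availability intervals (random
    incidence): a start point lands in an interval with probability
    proportional to that interval's size, and the task succeeds only if
    enough operations remain between the start point and the interval's
    end."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Pick an interval weighted by its size, then a uniform point in it.
        interval = rng.choices(avail_intervals_ops,
                               weights=avail_intervals_ops)[0]
        start = rng.uniform(0, interval)
        if interval - start < task_ops:
            failures += 1
    return failures / trials
```

For instance, if every interval holds 10 units of work, a 5-unit task fails about half the time, and a 20-unit task always fails.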
Figure III.8(a) shows the expected task failure rates computed for business
hours for each platform. Also, the least squares line for each platform is superimposed
with a dotted magenta line, and the correlation coefficient ρ is shown. For illustration
purposes, the x-axis shows task sizes not in number of operations, but in execution time
on a dedicated 1.5GHz host, from 5 minutes up to almost 7 hours. A maximum task
size of 350 minutes was chosen so that a significant number of task executions could be
simulated during business hours.
The expected task failure rate is strongly dependent on the task lengths. (The
weekends show similar linear trends, albeit the failure rates are lower.) The platforms
with failure rates from lowest to highest are LRI, SDSC, DEUG, and UCB, which agrees
with the ordering of each platform shown in Figure III.6(a). It appears that in all
platforms the task failure rate increases with task size and that the increase is almost
linear; the lowest correlation coefficient is 0.98, indicating that there exists a strong linear
relationship between task size and failure rate. By using the least squares fit, we can
define a closed-form model of the aggregate performance attainable by a high-throughput
application on the corresponding desktop grid (see Section III.G). Note, however, that
the task failure rate for larger tasks will eventually plateau as it approaches one.
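The least-squares fit used for this closed-form model can be computed directly. A minimal sketch (our own helper, not the authors' code) of fitting failure rate as a linear function of task size:

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares fit of ys (failure rates) against xs (task
    sizes); returns (slope, intercept) of the fitted line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx
```

Given measured (task size, failure rate) pairs, the returned slope and intercept define the near-linear model discussed above, which is then capped at a failure rate of one for very large tasks.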
While Figure III.8(a) shows the aggregate task failure rate of the system, Fig-
ure III.8(b) shows the cumulative distribution of the failure rate per host in each platform,
using a particular task size of 35 minutes on a dedicated 1.5GHz host. The heavier the
tail, the more volatile the hosts in the platform. Overall, the distributions appear quite
skewed. That is, a majority of the hosts are relatively stable. For example, with the
DEUG platform, about 75% of the hosts have failure rates of 20% or less. The UCB
platform is the least skewed, but even so, over 70% of the hosts have failure rates of 40%
or less. The fact that most hosts have relatively low failure rates can affect scheduling
tremendously, and we discuss this effect later in Chapter V.
Surprisingly, in Figure III.8(a), SDSC has lower task failure rates than UCB,
yet in Figure III.8(b), SDSC has a larger fraction of hosts with failure rates 20% or less
compared to UCB. The discrepancy can be explained by the fact that UCB still has a
larger fraction of hosts with failure rates of 40% or more than SDSC does; after
averaging, SDSC has lower failure rates.
III.E.5 Correlation of Availability Between Hosts
An assumption that permeates fault tolerance research in large scale systems
is that resource failure rates can be modelled with independent and identically
distributed (i.i.d.) probability distributions. Moreover, a number of analytical studies
of desktop grids assume
that exec availability is i.i.d. [51, 50, 43] to simplify probability calculations. We inves-
tigate the validity of such assumptions with respect to exec availability in desktop grid
systems.
First, we studied the independence of exec availability across hosts by calculat-
ing the correlation of exec availability between pairs of hosts. Specifically, we compared
the availability for each pair of hosts by adding 1 if both machines were available or both
machines were unavailable, and subtracting 1 if one host was available and the other
one was not. This method was used by the authors in [18] to study the correlation of
availability among thousands of hosts at Microsoft.
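The pairwise scoring just described can be sketched as follows (an illustrative version with names of our choosing), assuming both hosts' availability traces are sampled on a common time grid:

```python
def pairwise_match_score(avail_a, avail_b):
    """Correlation-like score between two hosts' availability traces:
    +1 for each instant where the hosts match (both available or both
    unavailable), -1 for each mismatch, normalized by the trace length
    so the result lies in [-1, 1]."""
    assert len(avail_a) == len(avail_b)
    score = sum(1 if a == b else -1 for a, b in zip(avail_a, avail_b))
    return score / len(avail_a)
```

Two hosts that are always in the same state score 1, hosts always in opposite states score -1, and an even split scores 0, matching the axes of Figure III.9.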
Figure III.9 shows the cumulative fraction of host pairs that are below a particular
correlation coefficient. The line labelled “trace” in the legend indicates the correlation
calculated according to the traces. The line labelled “min” indicates the minimum pos-
sible correlation given the percent of time each machine is available, and likewise for the
line labelled “max”.
In Figure III.9(b), corresponding to the DEUG platform, the point (0.4, 0.6)
means that for 60% of all possible host pairings, the percent of time that the two
hosts' availability “matched” was 40% or less than the time the hosts' availability
“mismatched”. The difference between points (0.4, 0.6) and (0.66, 0.6) means that for the
same fraction of host pairings, there is 26% more “matching” possible among the hosts.
The difference between points (-0.2, 0.1) and (-0.8, 0.1) indicates that for 10% of the
host pairings, there could have been at most 60% more “mismatching” availability. The
point (-0.4, 0.05) means that for 5% of the host pairings, the availability “mismatched”
40% or more of the time more often than it “matched”.
Overall, we can see that in all the platforms at least 60% of the host pairings
had positive correlation, which indicates that a host is often available or unavailable
when another host is available or unavailable respectively. However, Figures III.9(a)
and III.9(d) also show that this correlation is due to the fact that most hosts are usually
available (when combined with the fact that 80% of the time, hosts have CPU availability
of 80% or higher), which is reflected by how closely the trace line follows the random
line in the figure. That is, if two hosts are available (or unavailable) most of the time,
they are more likely to be available (or unavailable) at the same time, even randomly. As
a result, the correlation observed in the traces would in fact occur randomly because
the hosts have such high availability. This in turn gives strong evidence (although not
completely sufficient) that the exec availabilities of hosts in the SDSC and UCB platforms
are independent.
We believe the high likelihood of independence of exec availability in the SDSC
and UCB traces is primarily due to the user base of the hosts. As mentioned previously,
the user base of the SDSC consisted primarily of research scientists and administrators
who we believe used the Windows hosts primarily for word processing, surfing the Inter-
net, and other tasks that did not directly affect the availability of other hosts. Similarly,
for the UCB platform, we believe the students used the hosts primarily for short
durations, which is evident from the short availability intervals. So, because the primary factors
causing host unavailability, i.e., user processes and keyboard/mouse activity, are often
independent from one machine to another in desktop environments, we observe that exec
availability in desktop grids is often not significantly correlated.
On the other hand, Figures III.9(b) and III.9(c) show different trends where the
line corresponding to the trace differs significantly with respect to correlation (as much
as 20%) from the line corresponding to random correlation. The weak correlation of the
DEUG trace is due to the particular configuration of the classroom machines. These
machines had wake-on-LAN enabled Ethernet adapters, which allowed the machines to
be turned on remotely. The system administrators had configured these machines to
“wake” every hour if they had been turned off by a user. Since most machines are
turned off when not in use, many machines were awakened at the same time, resulting
in the weak correlation of availability. We believe that this wake-on-LAN configuration
is specific to the configuration of the machines in DEUG, and that in general, machine
availability is independent in desktop grids where keyboard/mouse activity is high.
Hosts in the LRI platform also show significant correlation relative to random.
This behavior is expected as batch jobs submitted to the cluster tend to consume a large
number of nodes simultaneously, and consequently, the nodes are unavailable for desktop
grid task execution at the same times.
The independence result is supported by the host availability study performed
at Microsoft reported in [18]. In this study, the authors monitored the uptimes of about
[Figure III.9 contains four panels, (a) SDSC, (b) DEUG, (c) LRI, and (d) UCB, each plotting
the fraction of host pairings (y-axis) against the correlation coefficient (x-axis, -1 to 1),
with curves labeled min, trace, random, and max.]

Figure III.9: Correlation of availability.
50,000 desktop machines over a one-week period, and found that the correlation of host
availability matched random correlation. Host unavailability implies exec unavailability,
and so our measurements subsume host unavailability in addition to the primary factors
that cause exec unavailability, i.e., CPU load and keyboard/mouse activity. We show
that exec availability between hosts is not significantly correlated when taking into account
the effects of CPU load and keyboard/mouse activity, in addition to host availability.
Another difference between our study and the Microsoft study is that the Mi-
crosoft study analyzed the correlation of the 50,000 machines as a whole and did not
separate the 50,000 desktop machines into departmental groups (such as hosts used by
administrators, software development groups, and management, for example). So it is
possible that some correlation within each group was hidden. In contrast, the groups
of hosts that we analyzed each had a relatively homogeneous user base, and our result is
stronger in the sense that we rule out the possibility of correlation among hosts with
heterogeneous user bases.
Other studies of host availability have in fact shown correlation of host failures [9]
due to power outages or network switch failures. While such failures can clearly affect
application execution, we do not believe these types of failures are the dominant cause
of task failures for desktop grid applications. Instead, the most common cause of task
failures is either high CPU load or keyboard/mouse activity [73], and our study, in
contrast to the others, directly takes into account how these factors affect exec availability.
So although host availability may be correlated, this correlation is significantly
weakened by the major causes of exec unavailability.
The evidence for host independence can simplify the reliability analysis of these
platforms. In particular, we use this result to simplify the calculation of the probability
of multiple host failures, which we describe in Chapter VI.
III.E.6 Correlation of Availability with Host Clock Rates
We hypothesized that a host’s clock rate would be a good indicator of host
performance. Intuitively, host speed could be correlated with a number of other machine
characteristics. For example, the faster a host’s clock rate is, the faster it should complete
a task, and the lower the failure rate should be. Or the faster a host’s clock rate is, the
X                    Y                                              ρ
---------------------------------------------------------------------
clock rate           mean availability interval length (time)  -0.0174
log(clock rate)      mean availability interval length (time)  -0.0311
clock rate           % of time unavailable                      0.1106
log(clock rate)      % of time unavailable                      0.1178
clock rate           mean availability interval length (ops)    0.5585
log(clock rate)      mean availability interval length (ops)    0.5196
clock rate           task failure rate (15 minute task)        -0.2489
log(clock rate)      task failure rate (15 minute task)        -0.2728
clock rate           P(complete 15 min task in 15 min)          0.859
log(clock rate)      P(complete 15 min task in 15 min)          0.792
Table III.2: Correlation of host clock rate and other machine characteristics during
business hours for the SDSC trace.
more often a user will be using that particular host, and the less available the host will
be. Surprisingly, host speed is not as correlated with these factors as we first believed.
Table III.2 shows the correlation of clock rate with various measures of host
availability for the SDSC trace. (Because the clock rates in the other platforms were
roughly uniform, we could not calculate the correlation.) Since clock rates often increase
exponentially throughout time, we also compute the correlation of the log of clock rates
with the other factors. We compute the correlation between clock rate and the mean
time per availability interval to capture the relationship between clock rate and the
temporal structure of availability. However, it could be the case that a host with very small
availability intervals is available most of the time. So, we also compute the correlation
between clock rate and the percent of time each host is unavailable. We find that there is
little correlation between the clock rate and mean length of CPU availability intervals
in terms of time (see rows 1 and 2 in Table III.2), or percent of the time the host
is unavailable (see rows 3 and 4). We explain this by the fact that many desktops are
used most of the time for intermittent and brief tasks, for example word processing, and
so even machines with relatively low clock rates can have high unavailability (for example
due to frequent mouse/keyboard activity), which results in availability similar to that of
faster hosts. Moreover, the majority of desktops are distributed across office rooms, so
desktop users do not always have the option of choosing a faster desktop to use.
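The correlations reported in Table III.2 can be reproduced with a straightforward computation. The sketch below uses hypothetical per-host data for illustration; only the structure of the calculation, pairing clock rate and log(clock rate) with each per-host metric, mirrors the table:

```python
import numpy as np

def clock_rate_correlations(clock_rates, metrics):
    """Correlate host clock rate (and its log) with per-host metrics,
    mirroring the row layout of Table III.2.

    clock_rates: array of per-host clock rates (MHz)
    metrics: dict mapping a metric name to a per-host array
    """
    rows = []
    for name, values in metrics.items():
        for label, x in (("clock rate", clock_rates),
                         ("log(clock rate)", np.log(clock_rates))):
            rho = np.corrcoef(x, values)[0, 1]  # Pearson correlation
            rows.append((label, name, rho))
    return rows

# Hypothetical per-host data for illustration only.
rng = np.random.default_rng(1)
clock = rng.uniform(300, 3000, size=200)
metrics = {
    "mean availability interval length (time)": rng.exponential(2.6, 200),
    "% of time unavailable": rng.uniform(0, 0.5, 200),
}
for label, name, rho in clock_rate_correlations(clock, metrics):
    print(f"{label:18s} {name:45s} {rho:+.4f}")
```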
Task size   Failure rate        ρ
---------------------------------
 5          0.063         -0.2053
10          0.097         -0.2482
15          0.132         -0.2489
20          0.155         -0.2624
25          0.177         -0.2739
30          0.197         -0.280
35          0.220         -0.2650
Table III.3: Correlation of host clock rate and failure rate during business hours. Task
size is in terms of minutes on a dedicated 1.5GHz host.
What matters more to an application than the time per interval is the opera-
tions per interval, how it affects the task failure rate, and whether it is related to the
clock rate of the host. We compute the correlation between clock rate and CPU avail-
ability in terms of the mean operations per interval and failure rate for a 15 minute
task. There is only weak positive correlation between clock rate and the mean number
of operations per interval (see rows 5 and 6), and weak negative correlation between the
clock rate and failure rate (see rows 7 and 8). Any possibility of strong correlation would
have been weakened by the randomness of user activity. Nevertheless, the factors are
not independent of clock rate because hosts with faster clock rates tend to have more
operations per availability interval, thus increasing the chance that a task will complete
during that interval.
Furthermore, in rows 9 and 10 of Table III.2, we see the relationship between
clock rate and rate of successful task completion within a certain amount of time. In
particular, rows 9 and 10 show the fairly strong positive correlation between clock rate
and the probability that a task completes in 15 minutes or less. The size of the task is
15 minutes when executed on a dedicated 1.5GHz host. (We also computed the correlation
for other task sizes and found similar correlation coefficients). Clearly, whether a task
completes in a certain amount of time is related to clock rate. However, this relationship
is slightly weakened due to the randomness of exec unavailability, as unavailability could
cause a task executing on a relatively fast host to fail. One implication of this correlation
shown in rows 5-10 is that a scheduling heuristic based on using host clock rates for task
assignment may be effective.
The correlation between host clock rate and task failure rate should increase
with task size (until all hosts have very high failure rates close to 1). Short tasks that
have a very low mean failure rate (near zero) over all hosts will naturally have low
correlation. As the task size increases, the failure rate will be more correlated with clock
rates since in general the faster hosts will be able to finish the tasks sooner. Table III.3
shows the correlation coefficient between clock rate and failure rate for different task
sizes. There is only weak negative correlation between host speed and task failure rate,
and it increases in general with the task size as expected. Again, we believe the weak
correlation is partly due to the randomness of keyboard and mouse activity on each
machine. One consequence of this result with respect to scheduling is that the larger the
task size, the more important it is to schedule the tasks on hosts with faster clock rates.
We also recomputed the correlation for only those hosts with failure rates
greater than 10%, 20% and 30% in an effort to remove those hosts that were exceptionally
available, but did not find significant changes in the correlation coefficients. Removing
certain hosts from the correlation calculation left relatively few hosts remaining, so it is
not clear if there was enough data to make a meaningful calculation.
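The dependence of failure rate on task size can be illustrated with a simplified estimator: a task fails if the availability interval it runs in delivers fewer operations than the task needs. This sketch uses hypothetical interval data, and is a simplification of the dissertation's method, which also accounts for where within an interval a task starts; it shows why larger tasks fail more often:

```python
import numpy as np

def task_failure_rate(intervals_ops, task_ops):
    """Estimate the failure rate for tasks of a given size.

    intervals_ops: operations delivered during each availability
        interval of a host (one entry per interval)
    task_ops: task size in operations

    A task started at the beginning of an interval fails if that
    interval delivers fewer operations than the task needs.
    """
    intervals_ops = np.asarray(intervals_ops)
    return float((intervals_ops < task_ops).mean())

# Hypothetical interval sizes, in operations.
intervals = [5e9, 1e9, 3e9, 0.5e9]
rates = [task_failure_rate(intervals, s * 1e8) for s in (1, 20, 60)]
# rates grows with task size, which is what drives the trend in Table III.3.
```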
III.F Characterization of CPU Availability
While the temporal structure of availability directly impacts task execution,
it is also useful to observe the CPU availability of the entire system for understanding
how the temporal structure of availability affects system performance as a whole, and
how the CPU availability per availability interval fluctuates, if at all. That is, aggregate
CPU availability puts availability and unavailability intervals into perspective, capturing
the effect of both types of intervals on the compute power of the system, as statistics of
either availability intervals or unavailability intervals do not necessarily reflect the
characteristics of the other.
III.F.1 Aggregate CPU Availability
An estimate of the computational power (i.e., number of cycles) that can be de-
livered to a desktop grid application is given by an aggregate measure of CPU availability.
[Figure III.10 contains two panels, (a) Business hours and (b) Non-business hours, each
plotting the percentage of time above threshold (y-axis) against the availability threshold
in percent (x-axis), with one curve per platform.]
Figure III.10: Percentage of time when CPU availability is above a given threshold, over
all hosts, for business hours and non-business hours.
For each data point in our measurements (over all hosts), we computed how often CPU
availability is above a given threshold for both business hours and non-business hours
on each platform. Figures III.10(a) and III.10(b) plot the frequency of CPU availability
being over a threshold for threshold values from 0% to 100%: the data point (x, y)
means that y% of the time CPU availability is over x%. For instance, the graphs show
that CPU availability for the SDSC platform is over 80% about 80% of the time during
business hours, and 95% of the time during non-business hours.
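The curves of Figure III.10 are empirical complementary distribution functions of the pooled availability samples. A minimal sketch follows, with made-up samples chosen to echo the SDSC observation that availability is 80% or higher about 80% of the time:

```python
import numpy as np

def time_above_threshold(cpu_avail, thresholds):
    """For each threshold x, the fraction of measurements with CPU
    availability strictly greater than x, pooled over all hosts.

    cpu_avail: flat array of availability samples in [0, 100]
    """
    cpu_avail = np.asarray(cpu_avail)
    return np.array([(cpu_avail > x).mean() for x in thresholds])

# Illustrative samples: 80 measurements at 90% availability,
# 15 at 0%, and 5 at 50%.
samples = np.concatenate([np.full(80, 90.0), np.full(15, 0.0), np.full(5, 50.0)])
thresholds = np.arange(0, 101, 10)
curve = time_above_threshold(samples, thresholds)
# curve[t] is the y-value plotted against thresholds[t] in Figure III.10;
# it is non-increasing by construction.
```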
In general, CPU availability tends to be higher during non-business hours than
business hours. For example, during business hours, the SDSC and DEUG platforms have
zero CPU availability about 15% of the time, whereas during non-business hours, the same
hosts in the same platforms almost always have CPU availability greater than
zero. During business hours, we observe that SDSC and DEUG have an initial drop
from a threshold of 0% to 1%. We believe the CPU unavailability indicated by this drop
is primarily due to exec unavailability caused by user activity/processes on the hosts in
the system. Other causes of CPU unavailability include the suspension of the desktop
grid task or brief bursts of CPU unavailability that are not long enough to push the load
on the machine above the CPU threshold. The exceptions are the UCB and LRI
platforms, which show no such drop. The UCB platform is level from 0 to 1% because
the worker's CPU threshold is relatively stringent (5%), resulting in common but very
brief unavailability intervals, which have little impact on the overall CPU availability of
the system. One reason that the LRI plot is virtually constant between 0 and
1% is that the cluster is lightly loaded, so most hosts' CPUs are available most of the
time.
After this initial drop between threshold of 0 and 1%, the curves remain almost
constant until the respective CPU threshold is reached. The levelness is an artifact of
the worker’s CPU threshold used to determine when to terminate a task. That is, the
only reason our compute-bound task (with little and constant I/O) would have a CPU
availability less than 100% is if there were other processes of other users running on the
system. When CPU usage of the host’s user(s) goes above the worker’s CPU threshold,
the task is terminated, resulting in the virtually constant line from the 1% threshold to
(100% - worker’s CPU threshold). (Note in the case of UCB, we assume that either the
host’s CPU is completely available or unavailable, which is valid given that the worker’s
CPU threshold was 5%.) The reason that the curves are not completely constant during
this range is possibly because of system processes that briefly use the CPU, but not long
enough to increase the moving average of host load to cause the desktop grid task to be
terminated.
Between the threshold of (100% - worker’s CPU threshold) and 100%, most of
the curves have a downward slope. Again, the reason this slope occurs is that system
processes (which often use 25% of the CPU [52]) are running simultaneously. On average,
for the SDSC platform, CPUs are completely unavailable 19% of the time during weekdays
and 3% of the time during weekends. We also note that both curves are relatively flat
for CPU availability between 1% and 80%, denoting that hosts rarely exhibit
availabilities in that range.
Other studies have obtained similar data about aggregate CPU availability in
desktop grids [28, 55]. While such characterizations make it possible to obtain coarse
estimates of the power of the desktop grid, it is difficult to relate them directly to what
a desktop grid application can hope to achieve. In particular, the understanding of host
availability patterns, that is, the statistical properties of the duration of time intervals
during which an application can use a host, and a characterization of how much power
that host delivers during these time intervals, are key to obtaining quantitative measures
of the utility of a platform to an application. We develop such a characterization in the
next chapter.
III.F.2 Per Host CPU Availability
While aggregate CPU availability statistics reflect the overall availability of the
system, it is possible that some hosts are less available than others. Here we show CPU
availability per host to reveal any potential imbalance.
Figures III.11, III.12, III.13, and III.14 show the CPU availability per host.
Each vertical bar corresponds to the CPU availability for a particular host, where the
hosts are sorted by clock rate along the x-axis. In each bar, there are sub-bars that
correspond to the percentage of time the host's CPU availability fell in a particular range.
In Figures III.11 and III.12, we observe heavy imbalance in terms of CPU unavailability.
Moreover, it appears that CPU availability does not always increase with host clock rate
as several of the slowest 50% of the hosts have CPU unavailability greater than or equal
to 40% of the time. In Figure III.13, which corresponds to the LRI platform, we see
that the system is relatively underutilized as few of the hosts in the LRI cluster have
CPU unavailability greater than 5%. In Figure III.14, which corresponds to the UCB
platform, we observe that the system is also underutilized as none of the hosts have CPU
unavailability greater than 5% of the time; we believe this is a result of the system’s low
CPU threshold. In summary, the CPU availability does not show strong correlation
with clock rate, which is reaffirmed by Table III.3, and the amount of unavailability is
strongly dependent on the user base and worker’s criteria for idleness. One implication
of this result is that simple probabilistic models of desktop grid systems (such as those
described in [43, 50]) that assume hosts have constant frequencies of unavailability and
constant failure rates are insufficient for modelling these complex systems.
[Stacked-bar figure: nodes sorted by clock rate (x-axis) versus the percentage of time
spent within each CPU availability range, in 10% bins from 0-10% to 90-100% (y-axis).]

Figure III.11: CPU availability per host in SDSC platform.
[Stacked-bar figure in the same format as Figure III.11.]

Figure III.12: CPU availability per host in DEUG platform.
[Stacked-bar figure in the same format as Figure III.11.]

Figure III.13: CPU availability per host in LRI platform.
[Stacked-bar figure in the same format as Figure III.11.]

Figure III.14: CPU availability per host in UCB platform.
III.G An Example of Applying Characterization Results:
Cluster Equivalence
The results of our measurement and characterization study have numerous uses
for desktop grid modelling and simulation. In this section, we give an example of how
one could use our characterization to derive a performance model of a desktop grid
system, quantifying the negative impact that heterogeneous and volatile resources have
on system throughput. We measure the impact and utility using a cluster equivalence
metric: given an N-node desktop grid, what is the equivalent M-node dedicated cluster?
We focus our analysis on the SDSC platform since the DEUG, LRI, and UCB platforms
each had less than 100 hosts, but our models and utility metrics are applicable to the
other platforms as well.
III.G.1 System Performance Model
In Section III.E.4, we found that task failure rate was strongly correlated with
task size, and so we could use a linear function of task size to model the task failure
rate. In this section, we use this task failure rate function in a performance model for the
desktop grid, and with this model, determine the grid’s cluster equivalence for a high-
throughput application. In particular, we propose a model for an application’s expected
work rate (that is, the number of useful operations performed per unit time), W(s), given
a uniform task size s (number of operations per task expressed as minutes on a dedicated
1.5GHz host) as follows.
From our measurements we can determine the average overhead, g, between
consecutive tasks scheduled on the same resource due to the desktop grid server (see
Section III.D). Using the method described in Section III.E.4, we can compute the task
failure rate, f(s), as a function of the task size. We can also estimate the average com-
pute rate in operations per second for a host in the desktop grid, r, by computing the
average delivered operations per second for each host using our availability traces, and
taking an average over all hosts. W(s) is the number of operations per second for an
application using N hosts in the desktop grid and is given by:
W(s) = N × r(1 − f(s)) / (1 + (r/s)g),     (III.1)

where r(1 − f(s)) is the effective compute rate accounting for failures and (r/s)g is the
relative overhead for scheduling each task.
We can then instantiate the above model with data obtained from any of the
desktop grid platforms to determine the corresponding work rate. As an example, we
compute W (s) for the SDSC grid with N = 220 hosts. The average overhead g for
scheduling a task was characterized to be 36.9 seconds on the SDSC platform (see Sec-
tion III.D). Our least squares fit to the weekday task failure rate (as calculated using the
method described in Section III.E.4) for the SDSC data was f(s) = (0.0031)s + 0.0142.
The average weekday work rate r (taking into account unavailability and host load) per
host was 31.6 million operations per second. Substituting into equation III.1, we obtain
a closed-form expression for W (s) for the SDSC platform:
W(s) = (6863257896 − 21582572·s) / (1 + 1946229374/(1136·s))     (III.2)
Figure III.15 plots this rate W (s) for a range of task sizes executed on the
SDSC grid on weekdays, and also plots the rate for weekends. On weekdays, for task
sizes below 13 minutes per-host, progress increases rapidly as the task size compensates
for the fixed overhead. However, as task size increases further, the per-host progress
decreases as the penalty of additional task failures wastes some of the CPU cycles. This
trend is also exhibited for weekends, but the longer availability intervals enable compute
rates to improve up to task sizes around 25 minutes in length. Thus, for both weekdays
and weekends, the trade-off between overhead and failures produces an optimal task size,
which is 13 and 25 minutes respectively. Note that these are the number of minutes that
a task execution would require on a dedicated 1.5GHz host, so the effective execution
times experienced on the SDSC Entropia grid range from approximately five times longer
to two times shorter.
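The trade-off described above can be checked numerically. The sketch below instantiates Equation III.1 with the reported SDSC weekday parameters; the constant converting a task size in minutes on a dedicated 1.5GHz host into operations is an assumed value, not one restated in the text, so the absolute work-rate values are illustrative while the shape of the curve and the interior optimum follow the model:

```python
import numpy as np

# SDSC weekday parameters from Section III.G.1.
N = 220                               # hosts
r = 31.6e6                            # average delivered ops/sec per host
g = 36.9                              # scheduling overhead per task (seconds)
f = lambda s: 0.0031 * s + 0.0142     # failure rate vs. task size s (minutes)

# Assumed conversion from "minutes on a dedicated 1.5GHz host" to
# operations; the exact benchmark constant is not restated in the text.
OPS_PER_MIN_1500MHZ = 2.0e9

def work_rate(s):
    """Equation III.1: aggregate useful ops/sec for task size s (minutes)."""
    s_ops = s * OPS_PER_MIN_1500MHZ   # task size in operations
    return N * r * (1.0 - f(s)) / (1.0 + r * g / s_ops)

sizes = np.arange(1, 51)
best = sizes[np.argmax(work_rate(sizes))]
# The per-task overhead (favoring large tasks) and the failure rate
# (favoring small tasks) trade off to produce an interior optimum; with
# these numbers it falls near the 13 minutes cited for weekdays.
```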
[Line plot: work rate in operations per second (y-axis) versus task size in minutes
(x-axis, 0 to 50), with one curve for weekdays and one for weekends.]

Figure III.15: Model of application work rate for entire SDSC desktop grid, in number
of operations per second versus task size, in number of minutes of dedicated CPU time
on a 1.5GHz host.
III.G.2 Cluster Equivalence
To characterize the impact of resource volatility in a desktop grid on usable
performance, we use a cluster equivalence utility metric, which was first introduced
in [8]. That is, for a given desktop environment (and corresponding temporal CPU
availability), what fraction of a dedicated cluster CPU is each desktop CPU worth to
an application? With this information, we can establish for a desktop grid the size of
a dedicated cluster to which its performance is equivalent. More precisely: “Given an
N -host desktop grid, how many nodes of a dedicated cluster, M , with comparable CPU
clock rates, are required such that the two platforms have equal utility?” We define
M/N as the cluster equivalence ratio of the desktop grid. Because the objective is to
quantify the performance impact of resource volatility, we normalize assuming that the
CPU clock rate of each node in the cluster is equal to the mean CPU clock rate in the
desktop grid.¹
It is clear from our desktop grid measurements that the cluster equivalence
ratio depends on the application’s structure and characteristics. Here we consider only
task parallel applications with various task sizes. The higher the task size, the lower
the cluster equivalence ratio, since the application becomes more subject to failures (see
Figure III.15).

¹Numerous industrial interactions by one of the committee members suggest that this is true in many
companies.
[Two-panel figure: (a) Whole SDSC desktop grid and (b) SDSC desktop grid as of 2001.
Each panel plots the cluster equivalence ratio (left y-axis) and the equivalent number of
cluster nodes (right y-axis) against subjob size in minutes (x-axis, 0 to 50), with curves
for weekdays and weekends.]

Figure III.16: Cluster equivalence of a desktop grid CPU as a function of the application
task size. Two lines are shown, one for the resources on weekdays and one for weekends.
We compute the cluster equivalence for a range of application task sizes, as
shown in Figure III.16(a). These curves are essentially scaled versions of those in
Figure III.15. The data points on this graph can be used to determine the effective cluster
CPUs that the SDSC desktop grid delivers. For example, for 56-million-operation tasks
(approximately 25 minutes on a 1.5 GHz CPU), the performance of the 220-node SDSC
Entropia desktop grid is equivalent to a 209-node cluster on weekends, and to a 160-node
cluster on weekdays.
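The cluster equivalence computation itself is simple once W(s) is known. In the sketch below, r_dedicated (the delivered rate of a dedicated cluster node at the grid's mean clock rate) and the example work rate are assumed values, chosen only to land near the 160-node weekday figure quoted above:

```python
# Cluster equivalence: how many dedicated nodes M match the grid's
# delivered work rate W(s)?
N = 220                       # hosts in the desktop grid
r_dedicated = 33.3e6          # assumed ops/sec for one dedicated node

def cluster_equivalence(W_s):
    """Return (M, M/N): the dedicated-cluster size M whose aggregate
    rate M * r_dedicated equals the grid's work rate W(s), and the
    cluster equivalence ratio M/N."""
    M = W_s / r_dedicated
    return M, M / N

# Example: an assumed weekday work rate for ~25-minute tasks.
M, ratio = cluster_equivalence(5.33e9)
# M comes out near the ~160 dedicated nodes cited for weekdays.
```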
For comparison, Figure III.16(b) shows the cluster equivalence metric computed
for a subset of the desktop grid that excludes the most recent machines (in this case, the
153 machines produced after the year 2001, which had clock rates higher than 1GHz).
The mean clock rate of this subset of hosts was approximately 700MHz. We observe
that the trends are similar to that seen in Figure III.16(a). In fact, the average relative
difference between the cluster equivalence ratios for the entire desktop grid and the
subset, over all task sizes, is approximately 10%.
The fact that our cluster equivalence metric is relatively consistent for different
[Line plot: cumulative percentage of total power (y-axis) versus the top percentage of
sorted hosts (x-axis), with one curve for hosts sorted by delivered power and one for
hosts sorted by clock rate.]

Figure III.17: Cumulative percentage of total platform computational power for SDSC
hosts sorted by decreasing effectively delivered computational power and for hosts sorted
by clock rates.
subsets of the desktop grid is explained by Figure III.17. This figure plots the cumulative
percentage of operations delivered by a subset of the entire platform, corresponding to
an increasing percentage of the sorted hosts. In other words, data point (x, y) on the
graph means that x% of the hosts (taking the most “useful” hosts first) deliver y% of
the compute operations of the entire platform. Hosts are sorted either by number
of delivered operations per second (as computed from our measurements) or by the
corresponding clock rate, as seen in the two curves in Figure III.17. We can see that the
two curves are strikingly similar. This indicates that the average availability patterns
of the hosts in our platform over our measurement period are uncorrelated with host
clock rates (as shown earlier in this chapter). This in turn explains why our cluster equivalence
metric is consistent for the whole platform and a subset containing only older machines.
Interestingly, we also find that the curves in Figure III.17, while not linear, are
only moderately skewed (as compared to the dotted line in the figure). For instance,
the 30% most useful hosts deliver 50% of the overall compute power. Similarly, the 30%
least useful hosts deliver approximately 14% of the overall compute power. Note that
this skew is not so high as to justify using only a small fraction of the resources.
III.H Summary
In this chapter, we described a simple trace method that measured CPU avail-
ability in a way that reflects the same CPU availability that could be experienced by a
real application. We used this method to gather data sets on three platforms with distinct
clock rate distributions and user types. In addition, we obtained a fourth data set gath-
ered earlier by the authors in [12]. Then, we derived several useful statistics regarding
exec availability and CPU availability for each system as a whole and individual hosts.
The findings of our characterization study can be summarized as follows. First,
we found that even in the most volatile platform with the strictest host recruitment policy,
availability intervals tend to be 10 minutes in length or greater, while the mean length
over all platforms with interactive users was about 2.6 hours. We showed in Section III.G
that using relatively short task lengths of 10 minutes to utilize such availability intervals
does not significantly harm the aggregate achievable performance of the application, even
with the server’s enforced task assignment overhead.
Second, we found that task failure rates on each system were correlated with
the task size, and so task failure rate can be approximated as a linear function of task
size. This in turn allowed us to construct a closed-form performance model of the system,
which we describe in Section III.G.
Third, we observed that on platforms with interactive users, exec availability
tends to be independent across hosts. However, this independence is affected significantly
by the configuration of the hosts; for example wake-on-LAN enabled Ethernet adapters
can cause correlated availability among hosts. Also, in platforms used to run batch jobs,
availability is significantly correlated.
Fourth, the availability interval lengths in terms of time are not correlated with
clock rates, nor is the percentage of time a host is unavailable. This means that hosts with
faster clock rates are not necessarily used more often. Nevertheless, interval lengths in
terms of operations and task failure rates are correlated with clock rates. This indicates
that selecting resources according to clock rates may be beneficial.
Finally, we studied the CPU availability of the resources. We found that be-
cause of the recruitment policies of each worker, the CPU availability of the hosts is
either above 80% or zero. Regarding the CPU availability per host, there is wide varia-
tion of availability from host to host, especially in the platforms with interactive users.
As a result, even in platforms with hosts of identical clock rates, there is significant
heterogeneity in terms of the performance of each host with respect to the application.
Chapter IV
Resource Management: Methods,
Models, and Metrics
IV.A Introduction
In the previous chapter, we measured and characterized four desktop grids, and
virtually all of the statistics that we computed influence our design and implementation
of scheduling heuristics, which we discuss in the remaining chapters. In this chapter,
we outline the scheduling techniques on which these heuristics are based. In addition,
we describe our platform and application models and instantiations, simulation method,
and performance metrics used to evaluate our scheduling heuristics.
We consider the problem of scheduling applications at the application and re-
source management level on an enterprise desktop grid, which consists of volatile hosts
within a LAN. LAN’s are often found within a corporation or university, and several
companies such as Entropia and United Devices have specifically targeted these LAN’s
as a platform for supporting desktop grid applications. Enterprise desktop grids are an
attractive platform for large scale computation because the hosts usually have better
connectivity with 100Mbps Ethernet for example and have relatively less volatility and
heterogeneity than desktop grids that span the entire Internet. Nevertheless, compared
to dedicated clusters, enterprise grids are volatile and heterogeneous platforms, and so
the main challenge is then to develop fault-tolerant, scalable, and efficient scheduling
heuristics. Although we evaluate our heuristics in enterprise environments (since we
have traces from only enterprise desktop grids), we design the heuristics so that they
would be applicable to Internet environments as well. This has a number of consequences
with respect to the platform and application models, which we describe in Section IV.B.
The most commonly used scheduling method in desktop grid systems [63,
37, 39] is First-Come-First-Serve (FCFS). Desktop grids, such as SETI@home, FOLD-
ING@home, and FIGHTAIDS@home, have typically been used to run high-throughput
jobs, where the performance metric is the aggregate work rate of the system over weeks
or months. As such, the highest aggregate work rate can be achieved by assigning tasks
to hosts in a FCFS manner as the start-up and the wind-down phases of application
executions are negligible compared to the steady-state phase. When the number of tasks
is far greater than the number of hosts, the scheduler should allocate tasks to as many
resources as possible, and so there is no issue of resource selection.
Although desktop grids are commonly used for high-throughput jobs, desktop
grids are also an attractive platform for supporting rapid application turnaround on
the order of minutes or hours. As discussed in Chapter I, numerous applications from
computational biology or graphics (such as interactive scientific visualization [34]) require
rapid turnaround. In addition, most applications from MPP workloads are less than a
day in length [58], and applications in a company’s workload often require relatively
rapid turnaround within a day’s time [29].
Supporting rapid turnaround for applications on volatile desktop grids is chal-
lenging for a number of reasons. First, the resources are heterogeneous in terms of clock
rates, memory and disks sizes, and network connectivity, for example. The distribution
of clock rates in many typical desktop grids often spans over an order of magnitude.
Assuming no system capability for task preemption, an application with equally-sized
tasks, where the number of tasks is roughly the number of hosts, could potentially suffer
from severe load imbalance if the last few tasks are allocated to slow hosts near the end
of application completion, delaying the application completion as most of the hosts sit
needlessly idle.
Second, the resources are shared with the desktop user or owner in a way that
any subset of machines cannot be reserved for a block of time. Without dedicated
access to a set of machines, unplanned interruptions in task execution make scheduling
more difficult than if the resources were completely dedicated. In particular, because the
desktop user activities are given priority over the desktop grid application, the desktop is
volatile as a result of fluctuating CPU and host availability, which can result in frequent
task failures. For example, a task that takes 35 minutes to run on a dedicated 1.5GHz
host has a 22% failure rate calculated by means of random incidence over a trace data
set collected from the desktop grid at SDSC (as shown in Section III.E.4). In a system
without checkpointing support, failures near the end of application execution can result
in poor application performance because a failed task must be restarted from scratch,
which in turn delays application completion.
These causes of volatility have significant effects on applications that require
rapid turnaround. In Figure IV.1, we show the cumulative number of tasks completed
over time observed using trace-driven simulations, given a server that schedules tasks to
hosts in a FCFS manner. These results are obtained for applications with 100, 200,
and 400 tasks run on a platform with about 190 hosts, where each task would execute for
15 minutes on a dedicated 1.5GHz processor, and the simulated platform is driven by
our SDSC trace (see Section IV.B for a detailed description of our simulation methodology).
In each of the three curves there is an initial hump as the system reaches steady
state, after which throughput increases roughly linearly. The cumulative throughput
then reaches a plateau, which accounts for an increasingly large fraction of application
makespan as the number of tasks decreases. For the application with 100 tasks, 90% of
the tasks are completed in about 39 minutes, but the application does not finish until
79 minutes have passed, which is almost identical to the makespan for a much larger
application with 400 tasks.
There are two main causes of this plateau. The first cause is task failures that
occur near the completion of the application. When a task fails, it must be started from
scratch, and when this occurs near the end of the application execution, it will delay
the application’s completion. The second cause is tasks assigned to slow hosts. Once a
task is assigned to a slow host, a FCFS scheduler without task preemption or replication
capabilities will be forced to wait until the slow host completes the result.
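The second cause can be illustrated with a minimal sketch (in Python, using hypothetical per-host task times and ignoring failures; the real experiments are trace-driven): each ready host pulls the next task, and without preemption or replication a single slow host that grabs one of the last tasks dominates the makespan.

```python
import heapq

def fcfs_makespan(minutes_per_task, num_tasks):
    """Makespan of a pull-based FCFS schedule with no preemption.

    `minutes_per_task[i]` is the (hypothetical) time host i needs per task.
    The earliest-available host always receives the next task.
    """
    # Priority queue of (time the host next becomes free, host index).
    ready = [(0.0, i) for i in range(len(minutes_per_task))]
    heapq.heapify(ready)
    finish = 0.0
    for _ in range(num_tasks):
        t, i = heapq.heappop(ready)
        done = t + minutes_per_task[i]
        finish = max(finish, done)
        heapq.heappush(ready, (done, i))
    return finish

# Three fast hosts and one ten-times-slower host, four tasks: the slow
# host's single task dominates the makespan.
assert fcfs_makespan([15, 15, 15, 150], 4) == 150
```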
As the number of tasks gets large when compared to the number of hosts in
the platform, the plateau becomes less significant, thus justifying the use of a FCFS
strategy. However, for applications with a relatively small number of tasks, resource
selection could improve the performance of short-lived applications significantly.
Figure IV.1: Cumulative task completion vs. time.
We design various resource management methods to address this scheduling
problem, which we describe in the next section.
IV.B Models and Instantiations
In order to design and evaluate scheduling heuristics, we create models of the
desktop grid systems and targeted applications that capture only the most relevant char-
acteristics. These models enable computationally tractable simulations of a number of
heuristics for a large range of applications and platforms. We then instantiate these
models with our desktop grid traces described in Chapter III. That is, we implement a
discrete event simulator based on the application and platform models, and then drive
the simulator using the traces described in Chapter III that were collected from four
real desktop grid platforms as well as other representative grid configurations. We use
simulation for studying resource selection on desktop grids as direct experimentation
does not allow controlled and thus repeatable experiments. In true desktop grid fashion,
Figure IV.2: Scheduling Model
our simulations are deployed using the XtremWeb [37] desktop grid system. We describe
these system and application models and their instantiations in this section, referenc-
ing the Client, Application and Resource Management, and Worker levels described in
Chapter II and shown in Figure II.1.
IV.B.1 Platform model and instantiation
At the application and resource management level, we assume a scheduler that
maintains a queue of tasks to be scheduled and a ready queue of available workers (see
Figure IV.2). As workers become available, they notify the server, and the scheduler on
the server places the workers’ corresponding task requests in the ready queue. During
resource selection, the scheduler examines the ready queue to determine the possible
choices for task assignment.
Because the hosts are volatile and heterogeneous, the size of the host ready
queue changes dramatically during application execution as workers are assigned tasks
(and thus removed from the ready queue), and as workers of different speeds and avail-
ability complete tasks and notify the server. The host ready queue is usually only a small
subset of all the workers, since workers only notify the server when they are available for
task execution.
At the Worker level, we assume that the worker running on each host periodi-
cally sends a heartbeat to the server that indicates the state of the task. We assume the
worker sends a heartbeat every minute to indicate whether the task is running or has
failed, as is done in the XtremWeb system.
We do not assume that the system provides remote checkpointing abilities.
All Internet desktop grid systems lack remote checkpointing. One reason is that a
significant number of machines (as high as 88% in 2000 [13]) are connected with dial-up
modems and so transferring large core dumps (often ≥ 512MB in size [65]) over a wide
area quickly is not feasible. However, the task itself could save its state persistently on
disk so that if the task is terminated, the task can revert to its previous state at the next
idle period, losing little of its past computation. Consequently, in Chapter VI, we also
consider the case where the task itself can checkpoint its state onto the local disk.
Also, we do not assume the server can cancel a task once it has been scheduled
on a worker. The reason for this is that resource access is limited, as firewalls are usually
configured to block all incoming connections (precluding incoming RPCs) and to allow
only outgoing connections (often on a restricted set of ports, such as port 80). As such, the
heuristics cannot preempt a task once it has been assigned, and workers must take the
initiative to request tasks from the server.
Our platform model deviates significantly from traditional grid scheduling mod-
els [14, 25, 26, 41]. The scheduling model used in most grid scheduling research is often
a collection of tasks ready to be assigned to a pool of resources. A grid scheduler can
devise a plan that determines when applications, including their data, will be placed at
specific resources to minimize application makespan. However, in desktop grid environ-
ments, devising a (static) plan for task assignment may be futile because tasks cannot
be pushed to workers. Moreover, the pool of resources from which to select from can
vary dynamically over time depending on which workers are available. So even if plan is
devised, the set of resources chosen may not be available at the time of assignment.
We use this platform model for each of the desktop grids mentioned in Chap-
ter III, namely the SDSC, DEUG, LRI and UCB platforms. We instantiated each plat-
form model with about 200 hosts per desktop grid with the availability defined by the
corresponding traces. SDSC was the only platform with about 200 hosts; the remaining
platforms had significantly fewer. To compensate, we aggregated host traces from
different days until there were approximately 200 host traces per platform. After
aggregation, there were at least seven full days of traces to be used to drive simulations.
As shown in a number of studies [8, 62] including our own, hosts during weekday
business hours often exhibit higher and more variable load than during off-peak hours
on weekday nights and weekends. As such, all simulations were performed using traces
captured during business hours which varied depending on the platform. (9AM-6PM
for SDSC, 6AM-6PM for DEUG, all day for LRI, and 10AM-5PM for UCB.) We be-
lieve our heuristics would perform relatively the same during off-peak hours when host
performance is more predictable, although the performance difference might be lessened.
Given the diversity of desktop configurations, we compare the performance of
our heuristics on two other configurations representative of Internet and multi-cluster
desktop grids. Because we do not have access to these types of desktop grids, we were
unable to gather traces for these types of platforms. Nevertheless, many desktop grid
projects [44, 54] publicly report the clock rates of participating hosts. So instead of
using real traces, we transform the clock rates of hosts in the SDSC grid to reflect
the distribution of clock rates found in a particular platform, and transform the CPU
availability trace corresponding to each host accordingly.
For example, Internet desktop grids that utilize machines both in the enterprise
and home settings usually have many more slow hosts than fast hosts, and so the host
speed distribution is heavily weighted toward low clock rates. We used host CPU statistics collected from the GIMPS
Internet-wide project [44] to determine the distribution of clock rates, which ranged from
25MHz to 3.4GHz. Other projects such as Folding@home and FightAids@home show
similar distributions [91].
Much work [42, 37, 70] in desktop grids has focused on using resources found
in multiple labs. Recently, [54] reports the use of XtremWeb [37] at a student lab in LRI
with nine 1.8GHz machines, and a Condor cluster in WISC with fifty 600MHz machines
and seventy-three 900MHz machines. We use the configuration specified in that paper
to model the multi-cluster scenario. We plot the cumulative clock rate distribution
functions for our additional two platform scenarios in Figure IV.3(b). For each of these
distributions, we ran simulations using the SDSC desktop grid traces but transforming
host clock speeds accordingly.
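The transformation can be sketched as a rank-preserving remapping of clock rates (a simplification of the procedure described above; the exact mapping used in the experiments may differ, and the rates below are illustrative):

```python
def transform_clock_rates(sdsc_rates, target_rates):
    """Give the i-th slowest SDSC host the i-th smallest rate drawn from
    the target distribution, so each host keeps its availability trace
    but delivers operations at the new rate."""
    assert len(sdsc_rates) == len(target_rates)
    ranked_targets = sorted(target_rates)
    # Host indices sorted from slowest to fastest original rate.
    order = sorted(range(len(sdsc_rates)), key=lambda i: sdsc_rates[i])
    new_rates = [0.0] * len(sdsc_rates)
    for rank, host in enumerate(order):
        new_rates[host] = ranked_targets[rank]
    return new_rates

# The relative ordering of hosts is preserved; only the distribution changes.
assert transform_clock_rates([1500, 300, 800], [25, 3400, 1000]) == [3400, 25, 1000]
```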
Our justification for these clock rate transformations is that the availability
Figure IV.3: Cumulative clock rate distributions from real platforms ((a): SDSC, DEUG, LRI, UCB) and simulated platforms ((b): LRI-WISC, GIMPS).
Platform    Range of Clock Rates    Volatility
SDSC        high                    med
DEUG        low                     med
LRI         low                     low
UCB         low                     high
GIMPS       very high               med
LRI-WISC    bimodal                 med
Table IV.1: Qualitative platform descriptions.
interval size is independent of host clock rate as found in Chapter III. So for a desktop
grid with a similar user base as SDSC (i.e., researchers, administrative assistants that
work from 9AM-5PM), we expect that hosts will have similar availability interval lengths,
regardless of CPU speed.
In summary, Table IV.1 shows the type of platforms on which we evaluate each
of our heuristics. As clock rates and host volatility are the primary sources of poor
application performance, we explore real and hypothetical platforms with a wide range
of clock rates and volatility levels that are representative of real desktop grid systems. In
particular, we examine the cases where a platform has medium volatility and a wide
range of clock rates, as well as those with a low range of clock rates but a wide range of volatility levels.
IV.B.2 Application model and instantiation
The majority of desktop grid applications are high-throughput and embarrass-
ingly parallel. That is, the applications have many more tasks relative to the number
of hosts, and have no dependencies nor communication among tasks. In an effort to
broaden the set of applications supportable on desktop grids, we also study the schedul-
ing of applications that have more stringent time demands, i.e., require turnaround on
the order of minutes or hours versus days or months, and have a number of tasks on
the order of the number of hosts. The experience of the Sprite developers [36], our own
characterization of availability in Chapter III, and numerous interactions with industrial
companies by one of the committee members [29] suggest that desktop grids within the
enterprise are often underutilized, and so the scenario in which the number of resources
is of an order of magnitude comparable to the number of tasks is not uncommon. As
such, we investigate techniques for scheduling applications that consist of T independent
and identically-sized tasks on N volatile hosts, where T is on the order of N. Applications
contain 100, 200, or 400 tasks, which correspond to roughly half, equal to, and
double the number of hosts, respectively, and also to the long, medium, and short plateaus
in the cumulative number of tasks completed during application execution, as seen in
Figure IV.1.
In addition, we vary the lengths of the tasks, which affects the failure rate of
the application’s tasks. We experiment with tasks that would exhibit 5, 15, and 35
minutes of execution time on a dedicated 1.5GHz host. Each of these task sizes has a
corresponding failure rate when scheduled on the set of resources during business hours.
As described in Chapter III, we determined the failure rate of a task given its size using
random incidence over the entire trace period. That is, in the collected traces, we chose
many thousands of random points to start the execution of a task and noted whether the
task would run to completion or would meet a host failure. Task failure rate increases
linearly with task size, from a minimum of 6.33% for a 5-minute task, to 13.2% for a
15-minute task, to a maximum of 22% for a 35-minute task. A maximum task size
of 35 minutes is chosen so that a significant number of applications can complete when
scheduled during the business hours of a single weekday.
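The random-incidence computation can be sketched as follows (a simplification that represents a host's trace as hypothetical (start, end) availability intervals in minutes and ignores fluctuating CPU availability within an interval):

```python
import random

def estimate_failure_rate(intervals, task_minutes, samples=10000, seed=0):
    """Estimate a task's failure rate by random incidence: pick many random
    start points within the availability intervals and count how often the
    interval ends before the task finishes."""
    rng = random.Random(seed)
    lengths = [end - start for start, end in intervals]
    failures = 0
    for _ in range(samples):
        # Choose an interval weighted by its length, then a point inside it,
        # so start points are uniform over all available time.
        start, end = rng.choices(intervals, weights=lengths, k=1)[0]
        t = rng.uniform(start, end)
        if end - t < task_minutes:  # host becomes unavailable mid-task
            failures += 1
    return failures / samples

# Longer tasks fail more often, consistent with the rates reported above.
trace = [(0, 60), (90, 120), (150, 400)]
assert estimate_failure_rate(trace, 5) < estimate_failure_rate(trace, 35)
```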
We assume the tasks are compute bound. Since we focus on applications with
small data input/output sizes on the order of kilobytes or megabytes, and on fast networks
in enterprise environments, we do not take network effects into account in the completion
time of the application.
IV.C Proposed Approaches
We consider four general approaches for resource management:
Resource Prioritization – One way to do resource selection is to sort hosts in the
ready queue according to some criteria (e.g., by clock rate, by the number of cycles
delivered in the past) and to assign tasks to the “good” hosts first. Such prioritization
has no effect when the number of tasks left to execute is greater than the number of hosts
in the ready queue. However, when there are fewer tasks to execute than ready hosts,
typically at the end of application execution, prioritization is a simple way of avoiding
the “bad” hosts.
Resource Exclusion Using a Fixed Threshold – A simple way to select resources
is to exclude some hosts and never use them to run application tasks. Filtering can
be based on a simple criterion, such as excluding hosts with clock rates below some threshold.
Often, the distribution of resource clock rates is so skewed [44, 91] that the slowest
hosts significantly impede application completion, and so excluding them can potentially
remove this bottleneck. However, using a fixed threshold can unintentionally exclude
relatively slow hosts that could have contributed to application completion, and we
address this deficiency with the next approach.
Resource Exclusion via Makespan Prediction – A more sophisticated resource
exclusion strategy consists of removing hosts that would not complete a task, if assigned
to them, before some expected application completion time. In other words, it may
be possible to obtain an estimate of when the application could reasonably complete,
and not use any host that would push the application execution beyond this estimate.
The advantage of this method compared to blindly excluding resources with a fixed
threshold is that it should not be as sensitive to the distribution of clock rates. That
is, relatively slow hosts that could contribute to application completion would not be
excluded unnecessarily, in contrast to the previous method that uses a fixed threshold.
Task Replication – Even with the best resource selection method, task failures near
the end of application execution are almost inevitable. To deal with such failures, we
examine the effect of replicating multiple instances of a particular task and assigning
them to different hosts; replicating a task may increase the chance that at least one task
instance will complete.
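For concreteness, the first two approaches can be combined in a short sketch (the host-record format and the clock-rate criterion are illustrative assumptions):

```python
def select_host(ready_hosts, min_clock_rate=None):
    """Pick a host from the ready queue: exclude hosts below a fixed
    clock-rate threshold (resource exclusion), then prefer the fastest
    remaining host (resource prioritization)."""
    candidates = [h for h in ready_hosts
                  if min_clock_rate is None or h[1] >= min_clock_rate]
    if not candidates:
        return None  # hold the task until an acceptable host is ready
    return max(candidates, key=lambda h: h[1])

ready = [("a", 300), ("b", 1500), ("c", 800)]  # (name, clock rate in MHz)
assert select_host(ready) == ("b", 1500)
assert select_host(ready, min_clock_rate=2000) is None
```

Note that, as discussed above, prioritization only matters when the ready queue holds more hosts than there are remaining tasks.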
We study these approaches in the order that they are listed above. The ap-
proaches themselves are listed in increasing order with respect to complexity and costs,
which will become evident in the following chapters. For each approach, we examine a
wide range of scheduling heuristics that vary in complexity and the quality of information
used about the resources (such as static or dynamic information). For each heuristic,
we identify the costs and benefits when increasing the heuristic’s complexity or quality
of information about the resources. Using the best heuristic found with a particular
approach, we then study how to augment this heuristic, addressing its weaknesses, using the
subsequent approach. This inductive method of heuristic design allows us to examine a
manageable number of heuristics.
In each stage of our heuristic design, we evaluate and compare our heuristics
in simulation, which we detail in the next section.
IV.D Measuring and Analyzing Performance
For each experiment (i.e., for a particular number of tasks, and a task size),
we simulated our competing resource management heuristics for applications starting at
different times during business hours. We ran each experiment for about 150 such starting
times and averaged the results to obtain statistically significant measurements. In this section, we
discuss how we evaluate the performance of each application scheduled with a particular
heuristic, and how we automatically analyzed the performance of each heuristic.
IV.D.1 Performance metrics
While application makespan is a good metric to compare results achieved with
different scheduling heuristics, we wish to compare it to the execution time that could
be achieved by an oracle that has full knowledge of future host availabilities. Our oracle
works as follows. First, it determines the soonest time that each host would complete a
task, by looking at the future availability traces and scheduling the task as soon as the
host is available. Then, it selects the host that completes the task the soonest, and it
repeats this process until all tasks have been completed. This greedy algorithm results
in an optimal schedule, and we compare the performance of our heuristics using the ratio
of the makespan for a particular heuristic to the optimal makespan that is achieved by
our oracle. The optimality of the greedy algorithm is easy to see intuitively. Nevertheless,
we prove this formally in the last section of this chapter, as much of our analysis
is based on the results of this algorithm. Note that our work in upcoming chapters
focuses on minimizing the overall elapsed execution time, or makespan, of a single parallel
application, rather than trying to optimize the performance of multiple, competing
applications. Nevertheless, the heuristics we develop to schedule a single application
provide key elements for designing effective multi-application scheduling strategies (e.g.,
for doing appropriate space-sharing among applications, for selecting which resources are
used for which application, for deciding the task duplication level for each application).
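The oracle's greedy procedure can be sketched as follows (a simplification assuming full CPU availability within each interval and no per-task overhead; the per-host interval format is an assumption of this sketch):

```python
def oracle_makespan(hosts, num_tasks, task_ops):
    """Optimal makespan via the greedy oracle: pack tasks back-to-back into
    each host's availability intervals (a task cut short by the end of an
    interval is lost and restarted in the next one), then take the time of
    the num_tasks-th earliest completion across all hosts.

    `hosts` is a list of per-host interval lists (start, end, ops_per_minute).
    """
    completions = []
    for intervals in hosts:
        host_done = 0
        for start, end, speed in intervals:
            task_minutes = task_ops / speed
            t = start
            # Fit as many whole tasks as possible into this interval; no
            # single host ever needs to contribute more than num_tasks.
            while t + task_minutes <= end and host_done < num_tasks:
                t += task_minutes
                completions.append(t)
                host_done += 1
    completions.sort()
    if len(completions) < num_tasks:
        return None  # trace too short to complete the application
    return completions[num_tasks - 1]

# Host 0 is available for 30 minutes; host 1 is interrupted at minute 10.
hosts = [[(0, 30, 1.0)], [(0, 10, 1.0), (12, 40, 1.0)]]
assert oracle_makespan(hosts, 3, 10) == 20
```

The makespan ratio used to evaluate a heuristic would then be its measured makespan divided by this oracle value.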
IV.D.2 Method of Performance Analysis
To determine the cause of a poorly performing heuristic, we visually inspect the
execution trace of a subset of all applications scheduled by the heuristic. That is, for
each host, we graph the time at which it starts each task. If the task completes, we graph
its completion time, and if it fails, we plot the time of failure. However, analyzing the
performance using visual inspection of each application’s execution trace is not possible
due to the high number (thousands) of applications executed in simulation. So to
supplement our visual analysis of application execution traces, we develop a simple
automated approach to determine the causes of poor performance.
As discussed in Section IV.A, delays in task completion near the end of applica-
tion execution can result in a plateau of task completion rate and thus, poor performance;
we refer to tasks completed extraordinarily late during application execution as laggers.
After visual inspection of numerous application execution traces, we hypothesize
that the causes of these laggers are hosts with relatively low clock rates, and task failures
that occur near the end of application execution. We confirm this hypothesis in the
following manner. First, we use an automated method to find laggers in application
executions that have been coordinated by a FCFS scheduler. Then we automatically
classify the cause of each lagger to be either slow host clock rate or task failure. We
find that a high percentage of the laggers (>70%) are caused by either low host clock
rates or task failures, thus giving strong evidence for our hypothesis. After confirming
the hypothesis, we use this automated method to determine the impact of slow host
clock rates and task failures on application execution when other scheduling heuristics
are used. We describe the automated method in detail below.
To determine the cause of poorly performing applications automatically, we mine
the simulation logs, determining the number of laggers and classifying each lagger by a
particular cause, i.e., low clock rate or task failure. First, we classify completed tasks
as laggers by determining the interquartile range (IQR) of task completion. The IQR is
defined as the range between the lower quartile (25th percentile) and upper quartile (75th
percentile) of task completion times, excluding task executions that fail to complete. We
use the method defined by Mendenhall and Sincich in [59] for finding sample quantiles,
where each quantile is an actual data point. (This method is less biased than other
methods that take the average of the data points.) Then, we multiply the IQR by a
certain factor (which we term IQR factor) and add the result to the upper quartile
to give a lagger threshold. If a task is completed after the threshold, it is classified
as a lagger. In particular, assuming an IQR factor F, if there exists a lagger after
the threshold, then the task completion rate over the last quartile of tasks is at most
1/(2F) of the rate over the interquartile range. So laggers signify a dramatic
decrease in task completion rate.
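The lagger-detection step can be sketched as follows (assuming one common reading of the Mendenhall-Sincich rule: the lower quartile at position (n+1)/4 rounded half up and the upper quartile at 3(n+1)/4 rounded half down, each taken as an actual data point):

```python
import math

def lagger_threshold(completion_times, iqr_factor=1.0):
    """Threshold = Q3 + iqr_factor * IQR, with quartiles taken as actual
    data points of the sorted completion times."""
    xs = sorted(completion_times)
    n = len(xs)
    q1_pos = math.floor((n + 1) / 4 + 0.5)       # round half up
    q3_pos = math.ceil(3 * (n + 1) / 4 - 0.5)    # round half down
    q1, q3 = xs[q1_pos - 1], xs[q3_pos - 1]      # positions are 1-indexed
    return q3 + iqr_factor * (q3 - q1)

def find_laggers(completion_times, iqr_factor=1.0):
    thr = lagger_threshold(completion_times, iqr_factor)
    return [t for t in completion_times if t > thr]

# Eleven tasks finish steadily and one finishes extraordinarily late.
assert find_laggers(list(range(1, 12)) + [100]) == [100]
```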
Figure IV.4 shows the cumulative throughput for an application with 400 tasks
as execution progresses. In the figure, the first and third quartiles of task completions
are labelled, showing the IQR. Using an IQR factor of 1, the figure also shows where the
lagger threshold is with respect to the third quartile. The tasks that finish execution
after the threshold are considered laggers.
Our rationale for using the IQR to define the lagger threshold is that the task
completion rate during the IQR is a close approximation to the optimal. This is because
application execution enters steady state during the IQR as each available host is assigned
Figure IV.4: Laggers for an application with 400 tasks.
a task. If T ≥ N, the task completion rate during the IQR is guaranteed to be
optimal; this follows trivially from our proof of the optimal scheduling algorithm discussed
in Section IV.E.
An alternative is to find the standard deviation of application makespans,
and then to define a lagger threshold according to some factor of the standard deviation.
However, by the very nature of the laggers, they tend to be relatively extreme outliers
in terms of task completion times, and the standard deviation could be too sensitive to
these outliers; an extreme lagger could cause the standard deviation to be quite high
and using a lagger threshold based on the standard deviation could then result in many
false negatives. Our approach of using quantiles is less affected by these extreme laggers.
Another alternative method is to classify the last X% of completed tasks in an application
as laggers. However, in the case of an optimal application execution, this method would
classify the last X% of tasks as laggers, and yield relatively high false positives.
One question related to our lagger analysis is how to choose a suitable IQR
factor F . Clearly, an extremely low IQR factor would yield several false positives, and
an extremely high IQR factor would miss most of the true laggers, limiting our analysis
to an insignificant number of laggers. To determine the set of possible IQR factors to
use, we conducted a simple sensitivity analysis of the number of laggers determined
by the IQR factor. The lowest possible IQR factor is about .5, since the steady state
and maximum task completion rate usually occurs during the IQR. We found that the
maximum IQR factor was about 1.5; for IQR factors greater than 1.5, there were
near-zero laggers for all the applications. Within the range of IQR factors of .5 and 1.5, we
found that the number of laggers decreases only gradually as the factor is increased. We
chose an intermediate IQR factor of 1 and found that using values of .5 and 1.5 does not
significantly change the distribution of laggers (see Appendix A).
After we find the set of laggers for each application, we determine the cause
of each lagger as follows. To determine if the cause is a task failure near the end of
application completion, we look at all tasks completed after the 75th percentile; if the
task fails after that point, we conclude that failure is at least one cause of the lagger.
To determine if the cause is a slow clock rate, we use the clock rate of the slowest host
used in the corresponding optimal application execution. That is, for an application
that begins execution at a particular time, we run the optimal execution (using an
omniscient scheduler) and find the slowest host used in that execution. The clock rate
of the slowest host is then used to classify whether a lagger was assigned to a slow host
or not. The advantage of using this method of classification is that it actually confirms
that a faster host could have been used by the scheduling heuristic by looking at the
application’s optimal execution. Another advantage is that this method determines a
clock rate threshold for each instance of application execution. So, if only relatively
fast hosts are used during the optimal application execution, the resulting clock rate
threshold will also tend to be higher.
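This classification step can be sketched as follows (the lagger record format — task id, clock rate of the host that ran it, and the times of any failed attempts — is an assumption of this sketch):

```python
def classify_laggers(laggers, q3_time, slowest_optimal_rate):
    """Attribute cause(s) to each lagger: a task failure after the 75th
    percentile of completion times, and/or assignment to a host slower
    than the slowest host used in the optimal execution."""
    causes = {}
    for task_id, host_rate, failure_times in laggers:
        c = set()
        if any(t > q3_time for t in failure_times):
            c.add("task failure")
        if host_rate < slowest_optimal_rate:
            c.add("slow host")
        causes[task_id] = c
    return causes

res = classify_laggers(
    [(1, 500, [80]), (2, 2000, []), (3, 400, [10])],
    q3_time=60, slowest_optimal_rate=1000)
assert res[1] == {"task failure", "slow host"}  # both causes apply
assert res[2] == set()                          # neither known cause
assert res[3] == {"slow host"}
```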
In our comparison of heuristics, we do not consider the number of laggers as
we found weak correlation (correlation coefficient is usually less than .22) between the
number of laggers and makespan for the FCFS scheduling method when a relatively low
IQR factor of .5 is used. (By using a low IQR factor of .5 we ensure that all laggers are
counted.) The weak correlation is caused by the application waiting for the completion of
a task scheduled on an extremely slow host when the rest of the tasks have already been
completed (as shown for the application with 100 tasks in Figure IV.1). In this case, the
number of laggers is relatively small, but the effect on application makespan of the sole
lagger is tremendous.
Another reason not to consider the number of laggers is the weak correlation
between the mean application makespan and mean number of laggers across the set of
scheduling heuristics; that is, a heuristic that results in a lower mean makespan could in
fact have more laggers than a different heuristic that results in a higher mean makespan.
The reason for this is that the IQR is a metric relative to the total makespan of the
application; as the mean makespan for a particular heuristic decreases, the IQR itself
decreases, which in turn lowers the threshold defined by the 75th quantile, the IQR, and
the IQR factor F . A lower threshold could raise the chance that a host will complete
a task after that threshold, since the clock rate distribution and CPU availability of
the hosts in the platform remains fixed. Thus, to supplement our lagger analysis, we
also show the absolute lengths of the time intervals delimited by the first,
second, third, and fourth quartiles of task completion times. While we could have used an
absolute IQR for all heuristics (for example, the IQR resulting from the FCFS scheduling
method), the IQR for FCFS can be much higher (as much as three times higher, although
in general the IQR’s tend to be similar) than the IQR’s for the other heuristics, and this
could result in several false negatives. So, instead we use define a relative IQR for each
heuristic.
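As an illustration, the relative per-heuristic threshold can be computed as follows (a minimal sketch in Python; the function name and the nearest-rank quantile estimate are our assumptions, not taken from the dissertation's simulator):

```python
def count_laggers(completion_times, F=0.5):
    """Count tasks completing after Q3 + F * IQR, where Q1 and Q3 are the
    25th and 75th quantiles of the task completion times."""
    xs = sorted(completion_times)
    n = len(xs)
    # Nearest-rank quantile estimate (an assumption; any standard
    # quantile method could be substituted here).
    q1 = xs[int(0.25 * (n - 1))]
    q3 = xs[int(0.75 * (n - 1))]
    threshold = q3 + F * (q3 - q1)
    return sum(1 for t in xs if t > threshold)
```

For example, with completion times [1, 2, 3, 4, 100] and F = .5, the threshold is 4 + .5·(4 − 2) = 5, so only the task finishing at time 100 counts as a lagger.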
IV.E Computing the Optimal Makespan
We prove that the greedy algorithm proposed in Section IV.D.1 results in the
optimal makespan for jobs with identical and independent tasks scheduled on volatile
hosts. The optimal algorithm consists of starting a task on each host as soon as possible
(during an availability interval), and assigning a new task to a host as soon as the previous
task on that host completes. If a task fails because the end of the availability interval is
reached before all necessary operations have been delivered to the task, then the task is
restarted from scratch at the beginning of the next availability interval. In this greedy
fashion, the availability intervals of all hosts can be “filled” with an infinite number
of tasks, with tasks packed together as tightly as possible. These tasks are sorted by
increasing completion time, and we pick the first T tasks. The assignment of these
T tasks to hosts corresponds to the optimal schedule. While this schedule is intuitively
optimal, one may wonder whether inserting delays before or in between tasks could
be beneficial in order to match the overhead periods of length h with periods during
which host CPUs exhibit low, or in fact 0%, availability. In the following sections we
give a formal description of the algorithm and formal proofs of its optimality, first for a
single availability interval on a single host, then for multiple availability intervals on a
single host, and finally for multiple availability intervals on multiple hosts, which is the
general case. While, a posteriori, the proof is straightforward, the algorithm’s optimality
is still worth proving formally since it is used heavily in the following chapters.
Our approach is as follows. After defining the problem formally in Section IV.E.1,
we first show optimality within a single availability interval on a single host in Sec-
tion IV.E.2. Then, in Section IV.E.3, we show optimality for multiple availability inter-
vals separated by failures on a single host. In Section IV.E.4, we show the optimality
for multiple availability intervals across multiple hosts. Finally, in Section IV.E.5, we
consider a variation of the problem that allows for task checkpointing.
IV.E.1 Problem Statement
Consider a job that consists of T tasks, and consider N hosts with variable
CPU availability described by traces as in the previous chapter, and with possibly dif-
ferent maximum amounts of operations delivered per time unit. We denote by f(t) the
instantaneous number of operations delivered at instant t. Although Figure II.2 depicts
an availability trace as continuous, it is in fact a discrete step function. We denote by
∫_{a}^{b} f(t) dt the number of operations that would be delivered by the host to the desk-
top grid application between time a and time b, provided that the interval [a, b] is fully
contained within an availability interval.
All the tasks are of identical computational cost and independent of one another.
We denote by S the task size in number of operations, and h is the overhead in seconds
for scheduling each task, which is incurred before computation can begin. (We have
observed this overhead in practice, as explained in Chapter III.) Finally, we denote by
fm(t) the instantaneous number of operations delivered at time t on some host m, which
is fully known given a trace (see the previous section). Recall that although in practice
function fm would not be known, in this chapter we focus on developing an optimal
schedule that could be achieved by an omniscient algorithm that has foreknowledge of
future host availability for all hosts. The scheduling problem is to assign the tasks to
the hosts so that the job’s makespan is minimized.
IV.E.2 Single Availability Interval On A Single Host
We first focus on optimally scheduling tasks within a single availability interval.
We formalize the algorithm and then prove its optimality.
IV.E.2.a Scheduling Algorithm
Let us first define a helper function, INTG, that takes five arguments as input: a
number of operations, numop; an overhead in seconds, overhead; a task start time, a; an
upper bound on the task finish time, b; and a CPU availability function, f, that corresponds
to a single availability interval. INTG returns the time at which a task of size numop,
started at time a on a host whose CPU availability is described by function f, incurring
overhead overhead before it can actually start computing, would complete if it would
complete before time b, or −1 otherwise. It is assumed that time a lies inside the single
availability interval defined by function f. In other words, function INTG returns, if it
exists, a time t < b such that ∫_{a+overhead}^{t} f(x) dx = numop, or −1 if such a t does not exist. We
show an implementation of INTG in pseudo-code in Figure IV.5. Note that the pseudo-
code shows a discrete implementation; the loop increments the value of the local variable t by
one, assuming the step size of the trace’s step function is 1 second. Intuitively, function
INTG will be used by our scheduling algorithm to see whether a task can actually “fit”
inside an availability interval when started somewhere in that interval.
The greedy algorithm OPTINTV given in Figure IV.6 computes the schedule
described informally in Section IV.D.1. It takes the following parameters:
T : the number of tasks to be scheduled,
h: the overhead for starting a task,
S: the task size in number of operations,
f : the CPU availability function
[a, b]: the absolute start and end times of the host availability interval we
consider,
Algorithm : INTG(numop, overhead, a, b, f)
sum← 0
for t← a + overhead to b
    sum← sum + f(t)
    if sum = numop
        return (t)
return (−1)
Figure IV.5: INTG: helper function for the scheduling algorithm.
A: an array that stores the time at which each task begins, to be filled in,
B: an array that stores the time at which each task completes, to be filled in,
and returns the number of tasks that could not be scheduled in the availability interval,
out of the T tasks. From the pseudo-code it is easy to see that OPTINTV schedules tasks
from the very beginning of the availability interval, and a task is scheduled immediately
after the previous task completes. The duration of each task execution is computed via
a call to the INTG helper function.
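Both routines can be rendered as runnable Python over a discrete 1-second trace (a sketch with conventions of our own: completion times are returned as a list rather than filled into arrays A and B, and the completion test is `total >= numop` rather than strict equality so that discrete per-second operation counts cannot step past the target):

```python
def intg(numop, overhead, a, b, f):
    """Earliest time t <= b at which a task of numop operations, started
    at time a with `overhead` seconds of setup, completes on a host whose
    per-second delivered operations are given by f(t); -1 if it cannot."""
    total = 0
    for t in range(a + overhead, b + 1):
        total += f(t)
        if total >= numop:
            return t
    return -1

def optintv(T, h, S, f, a, b):
    """Greedily pack up to T tasks of size S into the availability
    interval [a, b]; each task starts as soon as the previous completes.
    Returns (completion_times, number_of_unscheduled_tasks)."""
    done, t = [], a
    for i in range(T):
        c = intg(S, h, t, b, f)
        if c < 0:                 # the next task does not fit
            return done, T - i
        done.append(c)
        t = c                     # schedule the next task immediately
    return done, 0
```

For a fully idle host delivering one operation per second (`f = lambda t: 1`), three 5-operation tasks with a 1-second overhead packed into [0, 20] complete at times 5, 10, and 15.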
IV.E.2.b Proof of Optimality
Let [t1, t2, t3, ..., tT] denote the times at which each task begins execution in the
schedule computed by the OPTINTV algorithm. Note that t1 is just the beginning of the
availability interval. Let [e1, e2, e3, ..., eT] be the task execution times without counting
the overhead, such that ti = ti−1 + h + ei−1, for 2 ≤ i ≤ T.
Consider another schedule obtained by an algorithm, which we call OPTDE-
LAY, that does not start each task as early as possible. In other words, the algorithm
adds a time delay, wi ≥ 0, before starting task i, for 1 ≤ i ≤ T. Let [t′1, t′2, t′3, ..., t′T]
be the times at which tasks start execution and [e′1, e′2, e′3, ..., e′T] be the task execution
times without counting the overhead, in the OPTDELAY schedule. We prove that the
OPTDELAY schedule is never better than the OPTINTV schedule, which then guar-
antees that the OPTINTV schedule is optimal within an availability interval. Let the
Algorithm : OPTINTV(T, h, S, f, a, b, A, B)
t← a
A[1]← a
for i← 2 to T + 1
    t← INTG(S, h, t, b, f)
    if t ≥ 0
        // The time at which task i− 1 completes is the time at which task i
        // is scheduled
        B[i− 1]← t
        if i < T + 1
            A[i]← t
    else return (T − (i− 2))
end for
return (0)
Figure IV.6: Scheduling algorithm over a single availability interval.
proposition P (k) be that OPTINTV schedules k tasks optimally. We prove P (k) by
induction.
Base case – Let us assume that P (1) does not hold, meaning that t1 + h + e1 >
t′1 + w1 + h + e′1. This situation is depicted in Figure IV.7. For convenience, let c1 and
c′1 denote the completion times of the task under the OPTINTV and the OPTDELAY
schedules, respectively, so that our assumption is c1 > c′1.
Figure IV.7: An example of task execution for OPTINTV (top) and OPTDELAY
(bottom) at the beginning of the job. Both jobs arrive at the same time. In the case of
OPTINTV, the first task is scheduled immediately and an overhead of h is incurred.
In the case of OPTDELAY, the scheduler waits for a period of w1 before scheduling the
task.
For the OPTINTV schedule we can write that:

S = ∫_{t1+h}^{t1+h+e1} f(t) dt,

which just means that, during its execution, the task consumes exactly the number of
operations needed. We can write that:

∫_{t1+h}^{t1+h+e1} f(t) dt = ∫_{t1+h}^{t1+h+w1} f(t) dt + ∫_{t1+h+w1}^{c′1} f(t) dt + ∫_{c′1}^{c1} f(t) dt.
First, we note that the second integral in the right-hand side of the above equation is
equal to S, since it corresponds to the full computation of the task in the OPTDELAY
schedule. Second, we note that the third integral is strictly positive. Indeed, if it were
equal to zero, then the number of operations delivered by the host to the task during the
[c′1, c1] interval would be zero, meaning that no useful computation would be performed
on that interval in the OPTINTV schedule. Therefore, in the OPTINTV schedule, the
completion time c1 would in fact be less than or equal to c′1, which contradicts our
hypothesis. As a result, we have:
S = ∫_{t1+h}^{t1+h+e1} f(t) dt > S,
which is a contradiction. We conclude that P (1) holds.
Inductive case – Let us assume that P(j) holds, and let us prove P(j + 1). The
execution timelines for both the OPTINTV and OPTDELAY schedules are depicted in
Figure IV.8, with tj+1 less than or equal to t′j+1 due to P(j). As in the base case, cj+1 and
c′j+1 denote the completion times of task j + 1 under both schedules.
Figure IV.8: An example of task execution for OPTINTV (top) and OPTDELAY
(bottom) in the middle of the job.
Suppose cj+1 > c′j+1. For the OPTINTV schedule, we can write that:

S = ∫_{tj+1+h}^{cj+1} f(t) dt,

which just means that S operations are delivered to the application task during its
execution. We can split the above integral as follows:

∫_{tj+1+h}^{cj+1} f(t) dt = ∫_{tj+1+h}^{t′j+1+wj+1+h} f(t) dt + ∫_{t′j+1+wj+1+h}^{c′j+1} f(t) dt + ∫_{c′j+1}^{cj+1} f(t) dt.
The first integral is valid because wj+1 ≥ 0 and tj+1 ≤ t′j+1 (due to property P(j)).
Following the same argument as in the base case, the last integral in the right-hand side
of the above equation is strictly positive (otherwise cj+1 ≤ c′j+1). The second integral
is equal to S, since this is the number of operations delivered to task j + 1 during its
execution in the OPTDELAY schedule. We then obtain that

S = ∫_{tj+1+h}^{cj+1} f(t) dt > S,

which is a contradiction. Therefore cj+1 ≤ c′j+1, and property P(j + 1) holds, which
completes our proof by induction.
IV.E.3 Multiple Availability Intervals On A Single Host
In this section, we consider scheduling tasks during multiple intervals of avail-
ability, whose start and stop times are denoted by [ai, bi]. Without loss of generality, we
can ignore all availability intervals during which a single task cannot complete, i.e., those for
which ∫_{ai}^{bi} f(t) dt < S. We also consider an infinite number of availability intervals for
the host, or at least a number large enough to accommodate all T tasks. The scheduling
algorithm OPTMINTV, shown in Figure IV.9, takes the following parameters:

T : the number of tasks to be scheduled,
h: the overhead for starting a task,
S: the task size in number of operations,
f : the CPU availability function,
C: an array that stores the start times of all the tasks, to be filled in,
D: an array that stores the completion times of all the tasks, to be filled in.
Let P(k) be the property that OPTMINTV schedules k tasks optimally. We prove that
P(k) is true by induction for k ≥ 1.
Base Case – P(1) is true because OPTMINTV(1,...) is equivalent to running OPTINTV(1,...)
for the first availability interval, which we know leads to an optimal schedule
from Section IV.E.2.
P (2) – When there are two tasks to schedule, either both tasks can be scheduled in the
first availability interval, or if both tasks cannot finish in the first availability interval,
one task must be scheduled in the first interval and the other scheduled in the second.
In the former case, OPTMINTV(2,...) is equivalent to OPTINTV(2,...) for the first
interval, and the result is optimal as proved in the previous section. In the latter case, a
Algorithm : OPTMINTV(T, h, S, f, C, D)
numleft← T
i← 1
while numleft ≠ 0
numleft← OPTINTV(numleft, h, S, f, ai, bi, A, B)
C ← CONCAT(C,A)
D ← CONCAT(D, B)
i← i + 1
end while
Figure IV.9: Scheduling algorithm over multiple availability intervals.
task is scheduled at the beginning of the first and at the beginning of the second interval.
This results in the optimal schedule since the earliest the second task can execute (and
finish) is at the beginning of the second interval.
Inductive Case – Let us assume that P(j) is true. Then OPTMINTV(j,...) gives
the optimal schedule for the first j tasks. If OPTMINTV must schedule another task,
OPTMINTV will either place it in the same interval as the jth task, or, if that
is not possible, it will place it at the beginning of the following interval. The former case
results in an optimal schedule since OPTINTV will schedule the (j+1)th task optimally in
that interval, and the resulting makespan for all j + 1 tasks will also be optimal. The
latter case results in an optimal schedule since we know that OPTINTV(1,...) is optimal
on the following availability interval. Therefore, P(j + 1) is true.
It follows that P (k) is true for all k ≥ 1, and thus OPTMINTV computes an
optimal schedule.
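The chaining of OPTINTV across intervals can be rendered as a short Python sketch (our own illustration, not the dissertation's code; availability is modeled as a list of (a, b) intervals over a discrete 1-second trace, and completion times are returned as a list):

```python
def intg(numop, overhead, a, b, f):
    # Earliest t <= b at which numop operations finish; -1 if impossible.
    total = 0
    for t in range(a + overhead, b + 1):
        total += f(t)
        if total >= numop:
            return t
    return -1

def optmintv(T, h, S, f, intervals):
    """Greedily fill each availability interval (a, b) in turn with tasks
    of S operations and per-task overhead h; return the completion times
    of the T tasks, moving to the next interval when a task cannot fit."""
    done = []
    for a, b in intervals:
        t = a
        while len(done) < T:
            c = intg(S, h, t, b, f)
            if c < 0:
                break             # interval exhausted, try the next one
            done.append(c)
            t = c
        if len(done) == T:
            break
    return done
```

For example, with a 1 op/s host, 5-operation tasks, a 1-second overhead, and intervals [0, 12] and [20, 40], the first interval accommodates two tasks (completing at 5 and 10) and the remaining two complete at 25 and 30.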
IV.E.4 Multiple Availability Intervals On Multiple Hosts
The algorithm for scheduling tasks across multiple availability intervals over
multiple hosts, OPTIMAL, shown in Figure IV.10, takes the following parameters:

T : the number of tasks to be scheduled;
h: the overhead for starting a task;
S: the task size in number of operations;
v: the arrival time of the job;
E: an N × T matrix that stores the start times of the T tasks scheduled on each
of the N hosts; each row of E is computed by a call to OPTMINTV;
F : an N × T matrix that stores the completion times of the T tasks scheduled
on each of the N hosts; each row of F is computed by a call to OPTMINTV;

and returns the total makespan. We assume that the functions describing CPU avail-
abilities for each host, fm for m = 1, ..., N, are known. OPTIMAL uses a local variable,
I, which is a 1 × N array that stores the index of the last completed task for each host.
Finally, we define the argmin operator in the classical way for a series, say {xi}i=1,...,n,
by x_argmin(x) = min_{i=1,...,n} xi.
Algorithm : OPTIMAL(N, T, h, S, v, E, F )
// schedule T tasks on each host and determine each task’s completion time
for i← 1 to N
    OPTMINTV(T, h, S, fi, C, D)
    E[i]← C
    F [i]← D
    I[i]← 1
// select the T tasks that complete the soonest
j ← T
while j > 0
    i← argmin_{i∈1,...,N} (F [i][I[i]])
    I[i]← I[i] + 1
    j ← j − 1
return (F [i][I[i]− 1]− v)

Figure IV.10: Scheduling algorithm over multiple availability intervals over multiple
hosts.
Since OPTMINTV(k,...) schedules each task optimally, OPTIMAL(N,k,...) selects the
k tasks that complete the soonest, resulting in the optimal schedule.
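The selection step of OPTIMAL, which picks across all hosts the T tasks that complete soonest, amounts to merging the per-host sorted completion-time lists and taking the first T entries. A minimal Python sketch (our own illustration; function and parameter names are assumptions):

```python
import heapq

def optimal_makespan(completions_per_host, T, v):
    """completions_per_host[m]: sorted completion times of the tasks
    greedily packed on host m (one list per host); v: job arrival time.
    Returns the completion time of the T-th earliest task, minus v."""
    merged = heapq.merge(*completions_per_host)   # lazy k-way merge
    last = None
    for _, c in zip(range(T), merged):
        last = c
    return last - v
```

For instance, with per-host completion lists [5, 10, 25] and [7, 14, 21], the four earliest completions are 5, 7, 10, and 14, so a 4-task job arriving at time 0 has makespan 14.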
IV.E.5 Optimal Makespan with Checkpointing Enabled
In this section, we consider the scenario where a desktop grid system is able to
support local task checkpointing and restart. We assume that when a task encounters a
failure, it is always restarted on the machine where it began execution, i.e., we do not
consider process migration. Since the greedy algorithm that accounts for checkpointing
and its proof of optimality are similar to those described in the previous sections, we
give only a high-level description of the new optimal scheduling algorithm and proof
sketch of its optimality. For our discussion of checkpointing, we define the following new
parameters:
p: the overhead in terms of time for checkpointing a task
r: the overhead in terms of time for restarting a task from its checkpoint.
s: the number of operations to be completed before a checkpoint is performed.
We make the following changes to OPTINTV to account for checkpointing.
Because checkpointing is enabled, we can view intervals of failures as periods of 0%
CPU availability. So, a host’s trace can be treated as a single continuous availability
interval, on which we can use OPTINTV to schedule tasks. After a task is scheduled and
begins execution, a checkpointing overhead of p is incurred after every s operations
are completed. If during execution a task encounters a failure, whatever progress was made
since the last checkpoint is lost, and an overhead of r is incurred to restart the task from
the last checkpoint.
We can reduce the problem of scheduling tasks with checkpointing enabled to
the problem of scheduling tasks without checkpointing as follows. Consider a single task
k scheduled within an availability interval, i.e., the task’s execution does not encounter
any failures. When k is scheduled, it first incurs an overhead of h. Then, after every s
operations are completed, an overhead of p is incurred due to checkpointing. So, one can
treat the task k as ⌈S/s⌉ subtasks, ⌊S/s⌋ of which are of size s and ⌈S/s⌉ − ⌊S/s⌋ of which
are of size S − s·⌊S/s⌋. The first subtask is scheduled with an overhead of h; thereafter,
each subtask is “scheduled” with an overhead of p. If no failures are encountered during
task execution, then OPTINTV achieves the optimal schedule by an argument similar to
the one used in Section IV.E.2, and the same is true if we use OPTINTV to schedule
multiple tasks (treated as batches of subtasks) in the same availability interval.
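The subtask decomposition can be checked with a few lines of Python (an illustrative sketch; the function name is ours):

```python
import math

def checkpoint_subtasks(S, s):
    """Split a task of S operations into checkpoint-delimited subtasks:
    floor(S/s) subtasks of size s, plus one remainder subtask of size
    S - s*floor(S/s) when S is not a multiple of s."""
    full, rem = divmod(S, s)
    sizes = [s] * full + ([rem] if rem else [])
    # Sanity checks matching the text: ceil(S/s) subtasks covering S ops.
    assert len(sizes) == math.ceil(S / s) and sum(sizes) == S
    return sizes
```

For example, a task of S = 10 operations with checkpoints every s = 3 operations becomes four subtasks of sizes 3, 3, 3, and 1.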
If a failure is encountered during task execution, then whatever progress was made
since the last checkpoint is lost, an overhead of r is incurred immediately after the
failure, and the task is restarted from the last checkpoint. By the same argument used to
prove OPTMINTV optimal in Section IV.E.3, OPTINTV would have scheduled subtasks in the
previous availability interval optimally, and so starting a subtask at the beginning of the
next availability interval (and then incurring an overhead of r before execution) gives an
optimal schedule. Finally, we can replace OPTMINTV with OPTINTV in OPTIMAL
since failures are viewed as 0% CPU availability, and the resulting algorithm achieves
the optimal schedule over all hosts.
In conclusion, we have shown that a greedy algorithm that has full knowledge
of future host and CPU availabilities achieves the optimal makespan when scheduling a
job with identical and independent tasks on a volatile desktop grid. To the best of our
knowledge, previous work has not dealt with the case where CPU availability fluctuates
between 0 and 100% while at the same time taking into account host heterogeneity and
failures. Note also that although our algorithm achieves the optimal makespan, it does
not necessarily achieve optimal execution time, since delaying a task might allow it to
encounter periods of higher CPU availability. An interesting extension of this work is to
consider the multiple-job scenario, where minimizing execution time (versus makespan)
could be beneficial to system performance.
Chapter V
Resource Selection
We investigate various heuristics for resource selection, which involves deciding
which resources to use and which resources to exclude. Regarding the former issue, we
focus on resource prioritization techniques that use the “good” hosts first. Regarding the
latter issue, we study resource exclusion techniques that filter the “bad” hosts, which
might impede the application’s completion, out of the execution entirely. We evaluate these
heuristics first on the SDSC grid, which contains volatile hosts that exhibit a wide range
of clock rates. We also report the results of heuristics run on the other platforms when
applicable and interesting.
V.A Resource Prioritization
V.A.1 Heuristics
We examine three methods for resource prioritization using different levels of
information about the hosts, from virtually no information to comprehensive historical
statistics derived from our traces for each host, and we evaluate each method using
trace-driven simulation. For the PRI-CR method, hosts in the server’s ready queue
are prioritized by their clock rates. Similar to PRI-CR, PRI-CR-WAIT sorts hosts
by clock rates, but the scheduler waits for a fixed period of 10 minutes before assigning
tasks to hosts. The rationale is that collecting a pool of ready hosts before making task
assignments can improve host selection. The scheduler stops waiting if the ratio of ready
hosts to tasks is above some threshold so that resource selection is executed immediately
after a large pool of resources exists in the queue. A threshold ratio of 10 to 1 was used
in all our experiments. We experimented with other values for the fixed waiting period
and the above ratio, but obtained similar results.
In contrast to PRI-CR and PRI-CR-WAIT, which use static information
about the hosts, the method PRI-HISTORY uses dynamic information, i.e., the history
of a host’s past performance, to predict its future performance. Specifically, for each
host, the scheduler calculates the expected operations per availability interval (that is,
how many operations can be executed between two host failures) using the previous
weekday’s trace. For a particular availability interval, a task may begin execution any-
where in that interval, and a task has a higher probability of completing within a longer
interval than a shorter one. So longer intervals should be weighted more than shorter
ones when calculating the expected operations per interval. We take this into account by
considering all possible subinterval starting points in 10-second increments within
each availability interval. For each availability interval, this results in subintervals that
begin every 10 seconds in the availability interval and end at the interval’s stopping
point (see Figure V.1).
Figure V.1: Subintervals denoted by the double arrows for each availability interval. The
length of each subinterval is shown, and the subinterval lengths differ by 10 seconds.
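This subinterval weighting can be sketched as follows (our own illustration, assuming a host that is fully idle and delivers `rate` operations per second during its availability intervals):

```python
def expected_ops_per_interval(intervals, rate, step=10):
    """Average the operations deliverable over every subinterval that
    starts at `step`-second increments within an availability interval
    and ends at the interval's stopping point; longer intervals thus
    contribute more (and larger) subintervals to the average."""
    ops = []
    for a, b in intervals:
        t = a
        while t < b:
            ops.append((b - t) * rate)   # ops from subinterval [t, b]
            t += step
    return sum(ops) / len(ops)
```

For a single 20-second interval at 1 op/s, the subintervals deliver 20 and 10 operations, giving an expected 15 operations per interval.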
The expected operations per interval is then used to determine in which of
two priority queues a host is placed. If the expected number of operations per interval
is greater than or equal to the number of operations of an application task, then on
average the task should execute until completion, and so the host is placed in the higher
of the two priority queues. Otherwise, the host is put in the low priority queue, which
corresponds to the hosts on which the task is not expected to run until completion.
Within each queue, the hosts are prioritized according to the expected operations per
interval divided by the expected operations per second; as a result, hosts in each queue are
prioritized according to their speed. The higher priority queue lists hosts on which the
task is expected to complete, and faster hosts (in terms of operations per interval) have
higher priority. The lower priority queue lists hosts on which the task is not expected to
complete, and faster hosts have higher priority. When scheduling, PRI-HISTORY will
check the higher priority queue first and select the host with the highest priority, i.e., the
fastest expected speed. If the higher priority queue is empty, PRI-HISTORY will check
the lower priority queue and select the host with the highest priority, i.e., the fastest expected
speed.
V.A.2 Results and Discussion
For the SDSC platform, Figure V.2 shows the mean makespan of the three
heuristics (PRI-CR, PRI-HISTORY, PRI-CR-WAIT), and the mean makespan of the
FCFS heuristic, all normalized to the mean optimal execution time for applications with
100, 200, and 400 tasks of lengths 5, 15, and 35 minutes on a dedicated 1.5GHz host. The
bold dotted line in the figure represents the normalized mean makespan of the optimal
algorithm. Recall that these averages are obtained from about one hundred fifty distinct
experiments.
To explain the performance of the heuristics, we use both visual analysis of
particular application execution traces and the automated method described in
Section IV.D.2 to give additional, more concrete evidence for our conclusions. Our
lagger analysis shown in Figure V.7 focuses on FCFS and on PRI-CR, which we find to
be the best resource prioritization heuristic. We also show in that figure other heuristics,
which we discuss later in Section V.B.
Figure V.7 shows the classification of laggers as either caused by slow hosts or
task failures for each heuristic and application size. The height of each bar corresponds
to the mean number of laggers for an application with a particular number of tasks
and task size. For a particular bar, the height of each sub-bar represents the number
of laggers caused by slow hosts or task failures. We find that the poor performance
of FCFS is predominantly caused by slow hosts, and that the other heuristics achieve
better performance by eliminating these slow hosts, as we discuss below. The reduction
in laggers caused by slow hosts corresponds to a reduction in laggers caused by task
failures, as we showed that the task failure rate is correlated with host clock rate.
The effect of eliminating laggers on application makespan is shown in Fig-
ure V.8. The height of each bar corresponds to the mean makespan of applications
with a particular task size and number. For a particular bar, the sub-bars represent the
lengths of each quartile of task completion times. We observe that the lengths of the first,
second, and third quartiles of each heuristic are approximately equal to the optimal’s;
this is because tasks are completed at a steady, near-optimal rate. (It appears
that some quartiles are missing for the optimal algorithm, when in fact the quartiles are
too small to be visible.) The difference in makespans is primarily due to the length of
the fourth quartile, which in turn is caused by the reduction of laggers resulting from task
execution on slow hosts. We discuss below how each prioritization heuristic eliminates
laggers.
The general trend shown in Figure V.2 is that the larger the number of tasks in
the application the closer the achieved makespans are to the optimal, which is expected
since for larger number of tasks resource selection is not as critical to performance and
a greedy method approaches the optimal.
Focusing on those applications with 200 and 400 tasks, we notice that the prior-
itization heuristics perform no better than FCFS. The reason FCFS performs so well is that
hosts that appear most often and earliest in the queue tend to have high task com-
pletion rates, as clock rates are negatively correlated with task failure rates. The reason
PRI-CR and PRI-HISTORY perform similarly to FCFS is that clock rate and expected
number of operations per interval are weakly correlated with task completion rate (as
shown in Chapter III), and so the prioritized hosts in the ready queue are in a similar
order as those under FCFS. This is reflected by the similar number and proportion of
laggers caused by slow hosts or task failures, as shown in Figure V.7. PRI-CR-WAIT
Figure V.2: Performance of resource prioritization heuristics on the SDSC grid. Mean
makespans of FCFS, PRI-CR, PRI-HISTORY, and PRI-CR-WAIT are shown relative
to optimal, grouped by task length (5, 15, and 35 minutes on a dedicated 1.5GHz host)
and by the number of tasks per application (100, 200, and 400).
performs poorly for small 5-minute tasks and improves thereafter, but never surpasses
PRI-CR. The initial waiting period of 10 minutes is costly for the 100-task / 5-minute ap-
plication, which takes about 6 minutes to complete in the optimal case. As the task size
increases (along with application execution time), the penalty incurred by waiting for
host requests is lessened, but since most hosts are already in the request queue when the
application is first submitted, PRI-CR-WAIT performs almost identically to PRI-CR
and is no better. Figure V.4 provides additional insight into why PRI-CR-WAIT is
largely ineffectual. This figure shows the number of available hosts and the number of
tasks that are yet to be scheduled over time for a typical execution. Initially,
there are about 150 hosts available and 400 tasks for execution, and this immediately
drops to 0 hosts and about 250 tasks as each available host gets assigned a task. One
can see that it is usually the case that either there are far more tasks to schedule than
ready hosts or far more ready hosts than tasks to schedule. In the former scenario,
PRI-CR-WAIT performs exactly as PRI-CR. In the latter case, waiting does not give
the algorithm more choice in selecting resources.
We noticed different trends when the number of tasks in the application was
roughly half the number of hosts. The reason FCFS performs so poorly is that initially
there are 100 tasks to schedule on about 200 hosts, and because FCFS chooses 100 hosts
randomly, some slow hosts are chosen, which causes a reduction in application
performance. In contrast, PRI-CR excludes the slowest 50% of the resources from the
computation, as also shown in Figure V.7.
Surprisingly, PRI-HISTORY performs poorly compared to PRI-CR, which uses
static instead of dynamic information. We found that the availability interval size, both
in terms of time and in terms of operations, was not stationary across weekdays, and so
the expected operations per interval is a poor predictor of performance for certain hosts.
We determined the per-host prediction error from one day to the next as follows. For
each host, we calculated the mean number of operations per interval on a given weekday
during business hours. We then took the absolute value of the difference between a host’s
mean on one particular day and the next. In Figure V.3(a), we show the complementary
cumulative distribution function of the prediction error of the expected time per interval for
each host. That is, the figure plots the fraction of prediction errors greater than some
length of time. We can see that 80% of the prediction errors are 50 minutes in length or
more. The mean prediction error is 109 minutes and the median
error is 122 minutes. Given that many applications are less than an hour in length, the
high prediction error could be problematic.
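The per-host error computation can be sketched as follows (an illustration with our own names; `day1` and `day2` map each host to its mean operations, or time, per interval on consecutive weekdays):

```python
def prediction_errors(day1, day2):
    """Absolute day-over-day prediction error per host."""
    return {h: abs(day2[h] - day1[h]) for h in day1}

def ccdf(errors, x):
    """Fraction of prediction errors strictly greater than x
    (the complementary cumulative distribution function at x)."""
    vals = list(errors.values())
    return sum(1 for e in vals if e > x) / len(vals)
```

For instance, if a host's mean shifts from 100 to 160 between days, its error is 60; evaluating the CCDF at several values of x then yields curves like those in Figure V.3.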
Moreover, in Figure V.3(b), we show the complementary CDF of the prediction
error of the expected operations per interval for each host. That is, the figure plots the fraction
of prediction errors greater than some quantity of operations delivered per interval.
We find that 80% of the prediction errors are equivalent to 40 minutes or more on a
dedicated 1.5GHz host. The mean prediction error is 99 minutes
and the median error is 85 minutes. Again, the high prediction error is significant given
that many applications are less than an hour in length (and since PRI-HISTORY will
tend to use hosts with high expected operations per interval). Similarly, the authors
of [94, 34] found that using a host’s mean performance over long durations does
not reflect the dynamism of CPU availability, and thus is a poor predictor.
We also compared the prediction error of the compute rate per host estimated
using the expected operations and time length per interval. Since hosts are usually
completely idle as shown in Section III.F, the rate itself was predicted correctly. So we
attribute the poor performance of PRI-HISTORY to the poor operations per interval
predictions, which cause hosts to be put in the wrong priority queues.
Figure V.3: Complementary CDF of Prediction Error When Using Expected Operations
or Time Per Interval. (a) E[time per interval], in minutes: mean 109 min, std dev 54 min,
median 122 min. (b) E[ops per interval], in minutes on a dedicated 1.5GHz host: mean
99 min, std dev 75 min, median 85 min.
In summary, if the number of tasks is greater than or equal to the number
of hosts, there is little benefit of prioritization over FCFS since the fastest and most
available hosts will naturally request tasks the soonest. Also, waiting to collect a pool
of available hosts does not improve resource selection and only delays task assignment.
This is because during application execution there are usually either far more tasks than
hosts or far fewer tasks than hosts; in either case, waiting is not beneficial.
If the number of tasks is less than the number of hosts, PRI-CR works as well
as or better than PRI-HISTORY since the expected number of operations per interval
tends to be unpredictable for certain hosts. PRI-CR works well only because the slowest
hosts are excluded from the computation, and so, prioritization resulting in exclusion
can improve performance. On average PRI-CR is 1.65 times better than FCFS for
applications with 100 tasks on the SDSC platform.
In conclusion, we see that although PRI-CR outperforms FCFS consistently, re-
source prioritization still leads to performance that is far from the optimal (by more than
a factor of 4 for an application with one hundred 5-minute tasks). Looking at the task schedules
in detail, we noticed that using the slowest hosts significantly limited performance, and
we address this issue through heuristics described in the next section.
Figure V.4: Number of tasks to be scheduled (left y-axis) and hosts available (right
y-axis) over time, in minutes.
V.B Resource Exclusion
To prevent slower hosts from delaying application completion, we developed
several heuristics that exclude hosts from the computation using a variety of criteria.
All these heuristics use only host clock rates to obtain lower bounds on task completion
time (as we have seen that the expected operations or time per interval is not a good
predictor of future performance). All of the resource exclusion heuristics prioritize re-
sources according to their clock rates since we found in the previous section that PRI-CR
performed the best out of all the prioritization heuristics.
V.B.1 Excluding Resources By Clock Rate
Our first group of heuristics excludes hosts whose clock rates are lower than
the mean clock rate over all hosts (1.2GHz for the SDSC platform) minus some factor
of the standard deviation of clock rates (730MHz for the SDSC platform) for the entire
duration of the computation. The heuristics EXCL-S1.5, EXCL-S1, EXCL-S.5, and
EXCL-S.25 exclude those hosts according to a threshold that is 1.5, 1, .5, or .25 standard
deviations below the mean clock rate, respectively.
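A minimal sketch of this family of heuristics, assuming `hosts` maps host names to clock rates in MHz (the function name `excl_s` and the data layout are illustrative, not from the dissertation's scheduler):

```python
from statistics import mean, pstdev

def excl_s(hosts, factor):
    """EXCL-S<factor>: exclude hosts whose clock rate falls more than
    `factor` standard deviations below the platform's mean clock rate,
    for the entire duration of the computation."""
    rates = list(hosts.values())
    threshold = mean(rates) - factor * pstdev(rates)
    return {name: rate for name, rate in hosts.items() if rate >= threshold}

platform = {"a": 500, "b": 1000, "c": 1500, "d": 2000}  # MHz
survivors = excl_s(platform, 1)   # EXCL-S1 drops the 500MHz host
```

A larger `factor` keeps more hosts; the trade-off explored below is between excluding too many useful hosts and retaining too many slow ones.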
Figure V.5 shows the performance of the heuristics on the SDSC platform, and
we see that in all cases at least one of the exclusion heuristics improves performance
relative to PRI-CR. In most cases, the minimum makespan occurs at a threshold of .5
or 1; EXCL-S.5 effectively eliminates almost all of the laggers caused by slow hosts (see
Figure V.7). The makespan increases for higher or lower thresholds as too many useful
hosts or too few useless hosts are excluded from the computation. Usually, EXCL-S.25
excludes so many hosts that it not only removes the useless hosts but also excludes some
of the useful ones; the exception is the application with 100 tasks, which is equal to
roughly half the number of hosts. For this particular desktop grid platform, excluding
those hosts with clock rates .25 standard deviations below the mean will leave slightly more than half of the
hosts and thus filtering in this case does not hurt performance. EXCL-S1.5 excludes too
few hosts, and the remaining useless hosts hurt the application makespan.
Figure V.5: Performance of heuristics using thresholds on SDSC grid. Average makespan
relative to optimal for FCFS, PRI-CR, EXCL-S1.5, EXCL-S1, EXCL-S.5, and EXCL-S.25,
for task lengths of 5, 15, and 35 minutes (on a dedicated 1.5GHz host) and applications
with 100, 200, and 400 tasks.
In conclusion, resource exclusion can be significantly beneficial. In the above
experiments, it performs on average 1.49 times better than FCFS on the SDSC desktop
grid. For the SDSC platform, EXCL-S.5 has the particular threshold that yields the best
performance; on average, EXCL-S.5 performs 8%, 30%, and 6% better than PRI-CR for
applications with 100, 200, and 400 tasks respectively because many of the hosts with
slow clock rates are eliminated from the computation. However in other platforms with
different clock rate distributions, the fixed threshold may not be adequate. Figure V.10
shows the performance of EXCL-S.5 compared to FCFS on the multi-cluster LRI-WISC
platform. For applications with 400 tasks, we see the negative effect of using a fixed
threshold with EXCL-S.5 as the resulting performance is worse than FCFS. This is
because EXCL-S.5 excludes hosts with 900MHz clock rates, which contribute a significant
fraction of the platform’s overall compute power. So larger applications scheduled with
EXCL-S.5 exhibit worse performance than with FCFS. In the next section, we propose
strategies that use a makespan predictor to filter hosts in a way that is less sensitive
to the clock rate distribution, and compare it to EXCL-S.5 for different desktop grid
configurations.
V.B.2 Using Makespan Predictions
To avoid the pitfalls of using a fixed threshold, such as the clock rate .5 standard
deviations below the mean in the case of EXCL-S.5, we develop a heuristic
where the scheduler uses more sensitive criteria for eliminating hosts. Specifically, the
heuristic predicts the application’s makespan, and then excludes those resources that
cannot complete a task by the projected completion time. Our rationale is that the
definition of a “slow” host should vary with the application size (or number of tasks to
be completed during runtime), instead of the distribution of clock rates. That is, large
applications with many tasks relative to the number of hosts should use most of the
hosts as long as they do not delay application completion, whereas small applications
with fewer tasks than hosts should use only a small subset of hosts that can complete a
task by the application’s projected makespan.
To predict the makespan, we compute the average operations completed per
second for each host taking into account host load and availability using the traces and
then computing the average over all hosts (call this average r). If N is the number
of hosts in the desktop grid, we assume a platform with N hosts of speed r, and then
estimate the optimal execution time for the entire application with T tasks of size s in
operations via w_r = ⌈T/N⌉ · (s/r). The rationale behind this prediction method is that
the optimal schedule will never encounter task failures. So host unavailability and CPU
speed are the two main factors influencing application execution time, and these factors
are accounted for by r. In addition, we account for the granularity at which tasks can
be completed with ⌈T/N⌉.
To assess the quality of our predictor w_r, we compared the optimal execution
time with the predicted time for tasks 5, 15, and 35 minutes in size and applications
with 100, 200, and 400 tasks. The average error over 1,400 experiments is 7.0% with a
maximum of 10%. The satisfactory accuracy of the prediction can be explained by the
fact that the total computational power of the grid remains relatively constant, although
the individual resources may have availability intervals of unpredictable lengths. To show
this, we computed the number of operations delivered during weekday business hours in
5 minute increments, aggregated over all hosts. We found that the coefficient of variation
of the operations available per 5 minute interval was 13%. This relatively low variation
in aggregate computational power makes the accurate predictions of wr possible.
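Under the definitions above, the predictor w_r = ⌈T/N⌉ · (s/r) can be computed directly; `predict_makespan` is a hypothetical name, and `host_rates` is assumed to hold each host's trace-derived average delivered rate in operations per second:

```python
import math

def predict_makespan(T, s, host_rates):
    """w_r = ceil(T/N) * (s/r): estimated optimal makespan (seconds) for
    T tasks of s operations each on N hosts, where r is the mean
    effective compute rate (ops/sec, accounting for load and
    availability) over all hosts."""
    N = len(host_rates)
    r = sum(host_rates) / N
    return math.ceil(T / N) * (s / r)

# 100 tasks of 300 ops on 50 hosts averaging 1 op/sec:
# each host runs ceil(100/50) = 2 tasks of 300 s each -> 600 s
```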
The heuristic EXCL-PRED uses the makespan prediction, and also adaptively
changes the prediction as application execution progresses. In particular, the heuristic
starts off with a makespan computed with wr, and then after every N tasks are com-
pleted, it recomputes the projected makespan. We choose to recompute the prediction
after N tasks are completed for the following reasons. On one extreme, a static predic-
tion computed only once in the beginning is prone to errors due to resource availability
variations. At the other extreme, recomputing the prediction every second would not
be beneficial since it would create a moving target and slide the prediction back (until a
multiple of N tasks are completed).
If the application is near completion and the predicted completion time is too
early, then there is a risk that almost all hosts get excluded. So, if there are still
tasks remaining at time pred − .95 ∗ meanops, where pred is the predicted application
completion time and meanops is the mean clock rate over all hosts, the EXCL-PRED
heuristic reverts to PRI-CR at that time. This ensures that EXCL-PRED switches to
PRI-CR when it is clear that most hosts will not complete a task by the predicted
completion time. Note that if the heuristic waited until time pred (versus pred − .95 ∗
meanops) before switching to PRI-CR, it would result in poor resource utilization as seen
in some of our early simulations, since most hosts are available and excluded by time
pred. Therefore, waiting until time pred before making task assignments via PRI-CR
would cause most hosts to sit needlessly idle.
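The per-request decision of EXCL-PRED, including the revert rule, might be sketched as follows (all names are illustrative; `pred` is the current makespan prediction and `meanops` the quantity used in the .95 cutoff above):

```python
def excl_pred_decide(now, clock_rate, task_size, pred, meanops, reverted):
    """EXCL-PRED's decision for a host requesting a task at time `now`
    (seconds). Returns (assign, reverted). Once within .95 * meanops of
    the predicted completion time `pred`, revert to PRI-CR and assign a
    task to every requesting host. Otherwise assign only if the host's
    lower-bound finish time (from its clock rate alone, ignoring load
    and failures) meets the predicted makespan."""
    if reverted or now >= pred - 0.95 * meanops:
        return True, True                       # PRI-CR: no exclusion
    finish_lb = now + task_size / clock_rate    # best-case completion
    return finish_lb <= pred, False
```

A host that cannot finish even in the best case by `pred` is excluded; late in the run, the revert rule keeps available hosts from sitting needlessly idle.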
V.B.2.a Evaluation on Different Desktop Grids
We tested and evaluated our heuristics in simulation on all the desktop grid
platforms described in Section III.D. We focus our discussion here on the platforms on
which we found remarkable results, namely the SDSC, GIMPS, and LRI-WISC platforms,
and report the results of the other platforms in Appendix B. In particular, since all of
our heuristics use only clock rate information for resource selection or exclusion, the
heuristics usually produced similar results on platforms whose hosts have relatively
similar clock rates (e.g., the DEUG and LRI platforms), with the exception of the
UCB platform (see Appendix B).
Figure V.6: Heuristic performance on the SDSC grid. Average makespan relative to
optimal for FCFS, PRI-CR, EXCL-S.5, and EXCL-PRED, for task lengths of 5, 15, and
35 minutes (on a dedicated 1.5GHz host) and applications with 100, 200, and 400 tasks.
Figure V.6 shows that on the SDSC grid PRI-CR performs nearly as well as
EXCL-PRED or EXCL-S.5 for applications with 100 or 400 tasks, but performs more
than 23% worse than EXCL-S.5 for applications with 200 tasks. The performance of
PRI-CR depends greatly on the number of tasks in the application and whether this number
causes the slow hosts to be excluded from the computation. For the application with 100
tasks, the slow hosts get excluded and so PRI-CR does relatively well (see Figure V.7).
However, for the application with 200 tasks, PRI-CR assigns tasks to slow hosts, which
then impede application completion. For the application with 400 tasks, there are enough
tasks such that most hosts are kept busy with computation while tasks on the slow hosts complete.
In contrast to PRI-CR, the exclusion heuristics perform relatively well for all
application sizes. Figure V.6 shows that EXCL-PRED usually performs as well as EXCL-
S.5 on the machines at SDSC, but there is no clear advantage for using EXCL-PRED; for
the particular distribution of clock rates in the SDSC desktop grid, EXCL-S.5 appears
to have the particular threshold that yields the best performance. Of all the heuristics,
EXCL-S.5 eliminates the highest percentage of laggers caused by slow hosts; the reduc-
tion in the percent of laggers caused by slow hosts is as high as ∼60%. EXCL-PRED
has slightly more laggers caused by slow hosts than EXCL-S.5 as it is less aggressive in
filtering hosts than EXCL-S.5. Consequently, EXCL-PRED performs 13% more poorly
than EXCL-S.5 for the application with two-hundred 15-minute tasks. We have found
after close inspection of our traces and the laggers that this is because of a handful of
relatively slow hosts that finish execution past the projected makespan and/or task failures
on these slow hosts occurring near the end of the application. For the application with
400 tasks, the delay is hidden as there are enough tasks to keep other hosts busy until
the slow hosts can finish task execution. For the application with 100 tasks, the rela-
tively slow and unstable hosts get filtered out as there are fewer tasks than hosts and
the heuristic prioritizes resources by clock rate.
Using the same reasoning for the SDSC platform, we can explain why EXCL-
S.5 outperforms EXCL-PRED for the GIMPS desktop grid (see Figure V.9), which like
the SDSC grid has a left heavy distribution of resource clock rates. On the GIMPS
resources, applications scheduled with FCFS or PRI-CR often cannot finish during the
weekday business hours period, i.e., have application completion times greater than 8
hours, because of the use of extremely slow resources. So, slow hosts, especially in
Internet desktop grids with a left-heavy distribution of clock rates, are detrimental to
the performance of both FCFS and PRI-CR.
Although EXCL-S.5 performs the best for the SDSC and GIMPS desktop grids,
Figure V.7: Cause of Laggers (IQR factor of 1) on SDSC Grid. 1 → FCFS. 2 → PRI-CR.
3 → EXCL-S.5. 4 → EXCL-PRED. Panels show the number of laggers caused by a slow
host versus a failed task, for applications of 100, 200, and 400 tasks with 5, 15, and
35 minute task lengths.
Figure V.8: Length of task completion quartiles on SDSC Grid. 0 → OPTIMAL. 1 →
FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. Panels show the duration
(seconds) of the 1st through 4th quartiles, for applications of 100, 200, and 400 tasks
with 5, 15, and 35 minute task lengths.
Figure V.9: Heuristic performance on the GIMPS grid. Average makespan relative to
optimal for FCFS, PRI-CR, EXCL-S.5, and EXCL-PRED.
the threshold used by EXCL-S.5 is inadequate for different desktop grid platforms, and
the filtering criteria and adaptiveness of EXCL-PRED are advantageous in the other sce-
narios. In particular, EXCL-PRED either performs the same as or outperforms EXCL-
S.5 for the multi-cluster LRI-WISC platform. For the application with 400 tasks (see
Figure V.10), EXCL-PRED outperforms EXCL-S.5 in the case of the LRI-WISC by
17%. EXCL-S.5 in the LRI-WISC desktop grid excludes all 600MHz hosts, which con-
tribute significantly to the platform’s overall computing power. In general, the longer
the steady state phase of the application, the better EXCL-PRED performs with re-
spect to EXCL-S.5, since EXCL-S.5 excludes useful resources some of which are utilized
by EXCL-PRED. This explains why EXCL-PRED performs better than EXCL-S.5 for
applications with more tasks and larger task sizes. While PRI-CR does as well as EXCL-
PRED, clearly PRI-CR is not as effective on other platforms, especially those with a left
heavy distribution of clock rates.
In conclusion, using a makespan prediction can prevent unnecessary exclusion of
useful resources. However, this method is sometimes too conservative in the elimination
of hosts, especially for shorter applications.
Figure V.10: Heuristic performance on the LRI-WISC grid. Average makespan relative
to optimal for FCFS, PRI-CR, EXCL-S.5, and EXCL-PRED.
V.C Related Work
Since the emergence of grid platforms, resource selection on heterogeneous,
shared, and dynamic systems has been the focus of intense investigation. However, desk-
top grids compared with traditional grid systems incorporating mainly a set of clusters
and/or MPP’s are much more heterogeneous and volatile as reflected by the results in
Chapter III. Consequently, the platform models used in grid scheduling are inadequate
for desktop grids.
One example of this inadequacy is the typical model of resource availability.
As discussed in Chapter III, availability models based on host or CPU availability, such
as those described in [32, 56, 92] do not accurately reflect the availability of resources
as perceived by a desktop grid application. So, the scheduling heuristics designed and
evaluated with these models are inapplicable to desktop grid environments.
For example, the work in [33] describes a system for scheduling soft real-time
tasks using statistical predictors of host load. The system presents to a user confidence
intervals for the running time of a task. These confidence intervals are formed using time
series analysis of historical information about host load. However, the work assumes a
homogeneous environment and disregards task failures caused by user activity (as the
system does not specifically target desktop grid environments). As such, the effectiveness
of the system on desktop grids is questionable.
Another example is the work in [80], which studies the problem of scheduling
tasks on a computational grid for the purpose of online tomography. The application
consist of multiple independent tasks that must be scheduled in quasi-real-time on a
shared network of workstations and/or MPP’s. To this end, the authors formalize the
scheduling problem as a constrained optimization problem. The scheduling heuristics
then construct a plan given the constraints of the user (e.g., requirement of feedback
within a certain time) and the characteristics of the application (e.g., data input size).
Although the scheduling model considers the fact that hosts can be loaded, the model
does not consider task failures. Given the high task failure rates in desktop grid systems,
the same heuristics executed on desktop grids will likely suffer from poor performance.
The work described in [61] is the most relevant in terms of desktop grid schedul-
ing. The author investigates the problem of scheduling multiple independent compute-
bound applications that have soft-deadline constraints on the Condor desktop grid sys-
tem. Each “application” in this study consists of a single task. The issue addressed in the
paper is how to prioritize multiple applications having soft deadlines so that the highest
number of deadlines can be met. The author uses two approaches. One approach is to
schedule the application with the closest deadline first. Another approach is to deter-
mine whether the task will complete by the deadline using a history of host availability
from the previous day, and then to randomly choose a task that is predicted to complete
by the deadline. The author finds that a combined approach of scheduling the task
that is expected to complete with the closest deadline is the best method. Although the
platform model in that study considers shared and volatile hosts, the platform model
assumes that the hosts have identical clock rates and that the platform supports check-
pointing. So, the study did not determine the impact of relatively slow hosts or task failures
on execution for a set of tasks; likewise, the author did not study the effect of resource
prioritization (e.g., according to clock rates) or resource exclusion.
V.D Summary
In this chapter, we investigated the use of two resource selection techniques for
improving application makespan, namely resource prioritization and resource exclusion.
We found that resource prioritization could improve application performance, but that
the improvement varied greatly with the number of tasks per application; if the applica-
tion consisted of many tasks, then the tasks would inevitably be assigned to slow hosts,
which limited performance. When the number of tasks was about equal to or greater than
the number of hosts, there was little benefit of prioritization over FCFS. The most capable
hosts tended to request tasks the most often, and so FCFS performed almost as well as
any of the prioritization heuristics we studied. Moreover, waiting for a pool of host re-
quests to collect before performing resource selection only delayed application execution.
When the number of tasks was less than the number of hosts, prioritization resulting in
exclusion of poor hosts improved performance. PRI-CR on average performed 1.65 times
better than FCFS for applications with 100 tasks. We found that using static clock rate
information was more useful than using relatively dynamic information about the length
of availability intervals; the mean availability interval length is a poor predictor of host
performance.
We then studied heuristics to eliminate these slow hosts from the application
execution. Our exclusion heuristics used either a fixed threshold (with respect to the
platform’s mean clock rate) by which to filter hosts, or an adaptive threshold based
on the application’s predicted makespan. When using a fixed threshold, the exclusion
heuristics achieved high performance gains; EXCL-S.5, which was the best performing
fixed-threshold heuristic on the SDSC platform, performed 1.49 times better than FCFS
on the SDSC grid. However, exclusion using a fixed threshold can sometimes degrade
performance, depending on the distribution of host speeds. We then studied another
heuristic that excluded resources according to a predicted makespan. That is, periodically,
the heuristic EXCL-PRED made a makespan prediction, and excluded those
hosts that could not complete a task by the predicted makespan. For the SDSC and the
GIMPS platforms, EXCL-PRED proved to be too conservative in its exclusion of
resources and performed up to 1.14 times worse. However, on the multi-cluster LRI-WISC,
EXCL-PRED performed up to 1.19 times better, especially for longer applications that
can make more use of the slower hosts incorporated by EXCL-PRED but excluded by
EXCL-S.5. We will see in the next chapter how EXCL-PRED can be combined with
task replication so that it performs best (or close to best) on all platforms.
Chapter VI
Task Replication
VI.A Introduction
In the previous chapter, we explored a range of heuristics that determined on
which host to schedule a task. However, even if the best resource selection method is
used, performance degradation due to task failures is still possible. The relatively long
last quartile of task completion times of the best performing heuristic compared to the
last quartile of the optimal (which was as much as 20 minutes or 13.8 times shorter)
indicates there is much room for improvement. In this chapter, we augment the resource
selection and exclusion heuristics described previously to use task replication techniques
for dealing with failures.
We define task replication as the assignment of multiple task instances of a
particular task to a set of hosts; a task is the application’s unit of work and a task
instance is the corresponding application executable and data inputs to be assigned to
a host. We refer to the first task instance created as the original and the replicated
task instances as replicas. By assigning multiple task instances to hosts, the probability
of all tasks failing can be reduced. Also, replication can be a means of adapting to
dynamic host arrivals (as most desktop grid systems do not support process migration);
for example, in the case where a task has been assigned to a relatively slow host but a
fast host arrives shortly thereafter, a task can be replicated on the fast host (as opposed
to migrated) to accelerate task completion.
Task replication is a plausible technique for coping with task failures and delays
for at least two reasons. First, there is often an abundance of resources available com-
pared to the amount of work to be completed. At one point in the SETI@home project,
there were more participants than actual tasks to distribute and so the scheduler began
replicating tasks just to keep the participants busy [86]. In the Sprite project [36], the
authors noted that the use of idle hosts is limited by the lack of applications instead
of the lack of hosts. Finally, personal communication with one of the committee mem-
bers [29] suggests that desktop grids within enterprises are often underutilized. Because
there is little contention for resources among applications, replication is often a plausible
option.
Second, task replication is relatively easier to implement and deploy than check-
pointing or process migration because replication requires no modification of the appli-
cation nor the hosts’ operating system. With little modification, schedulers in most
desktop grid systems [37, 39, 87] can support task replication; only simple bookkeeping
details for each task instance need to be added (see Chapter VII). In contrast, imple-
mentation of system-level checkpointing and process migration often requires integration
with the kernel (and is often highly specific to the kernel version) [52, 36], which is not
always possible considering the wide range of operating systems (and versions) on hosts
found in enterprise and Internet desktop grids [78]. Moreover, remote checkpointing
often requires servers to store checkpoints, and process migration often involves moving
the entire state of the application across different hosts. Considering that hosts with
memory sizes of 512MB are common and the relatively low data transfer speeds capable
through the Internet, remote checkpointing or process migration across Internet desk-
top grids may not be practical or feasible, especially for applications that require rapid
application turnaround.
In order to replicate tasks effectively, we investigate the following issues:
1. Which task to replicate and which host to replicate to. If a task instance is already
running on a fast and stable host, replicating the task on a different host with a
lower clock rate or less availability clearly will not improve performance. We study
different methods of choosing which task to replicate and on which host to schedule
a replica.
2. How much to replicate. Clearly, task throughput tends to decrease inversely with
the amount of replication. The reason is simply because if r task instances of a
particular task are assigned then the effective amount of work increases by a factor
of r, and so throughput is reduced by a factor of 1/r. On one extreme, a task
could be replicated only once, and on another extreme a task could be replicated
on all available hosts. We determine the performance improvement and waste for
various levels of replication.
Regarding the issue of when to replicate during an application’s execution, all
of our heuristics only replicate when there are more hosts than tasks. Applications
that have a number of tasks larger than the number of hosts will often have a steady-
state phase, and replicating during this steady state phase will usually not improve
makespan, and only delay task completion. The fact that the length of time in the
first three quartiles of application execution is close to the optimal supports our claim
that replication is unnecessary during this phase (see Figure V.7). Thus, we only use
replication after the point at which the number of available hosts is greater than the
number of tasks remaining, scheduling replicas only when there is a surplus of hosts, and
in this way, reducing the chance that a replicated task will delay the execution of another
task. Replicating anytime sooner could cause a host to do redundant work when there are
more unscheduled tasks than hosts, and thereby cause a delay in application completion.
We examine the above replication issues with respect to three broad approaches
for task replication, namely proactive, reactive, and hybrid approaches. With proactive
replication, multiple instances of each task are created initially and assigned as hosts
become available. Proactive replication techniques are aggressive in the sense that repli-
cation is done before a delay in application completion time has occurred. In contrast,
with reactive replication, the heuristics replicate a task only when the task’s completion
has been delayed and its execution is delaying completion; in this sense, the heuristics
are reactive. Finally, we develop a heuristic that uses a hybrid approach for replicating
tasks that either have a high risk of delaying application completion or are currently
delaying completion; as such, the heuristic uses both proactive and reactive replication
techniques.
VI.B Measuring and Analyzing Performance
VI.B.1 Performance metrics
Similar to Chapter V, we continue to use makespan relative to optimal as the
performance metric. In addition, we use waste, which is the percent of tasks replicated
(including those that fail), to quantify the expense of wasting CPU cycles. A replication
heuristic that has high waste would be problematic if the entire desktop grid is loaded
and multiple applications are competing for resources. (Note that the reason we did not
consider the heuristics with replication in the previous chapter is that replication is not
always an option when there is high resource contention among multiple applications in
the system.)
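One plausible formalization of the waste metric, assuming it counts replica instances (including those that later fail) beyond the one original per task, as a percentage of the application's task count (names and layout are illustrative):

```python
def waste(num_tasks, instances_assigned):
    """Percent of tasks replicated: task instances dispatched beyond
    the one original per task, including instances that fail, as a
    percentage of the number of tasks in the application."""
    return 100.0 * (instances_assigned - num_tasks) / num_tasks

# 100 tasks, 150 instances dispatched in total -> 50% waste
```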
VI.B.2 Method of Performance Analysis
In general, we use the same techniques of lagger analysis used in Chapter V.
However, in our analysis of laggers, we take into account replication as follows. We
define a task instance as the executable and data of a particular task assigned to a host.
Replication involves assigning multiple instances of a task each to a different host. When
task replication is used, some task instances might complete before the lagger threshold
while others complete after the threshold. To address this scenario, we only classify task
instances of a task as laggers if the completion times of all task instances of that task fall
after the lagger threshold. In this way, if instances of a task have completed before the
lagger threshold, any instances of the task completed after the threshold are excluded
from lagger analysis. If a task instance is classified as a lagger, we consider all of the
instances of the corresponding task in the lagger analysis in order to assess why each
task instance was completed late.
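The classification rule can be sketched as follows, assuming `completions` maps each task to the completion times of all its instances (with `math.inf` standing in for instances that failed); the names are illustrative:

```python
import math

def lagger_tasks(completions, threshold):
    """A task counts as a lagger only if every one of its instances
    completed after the lagger threshold; if any instance beat the
    threshold, all of the task's instances are excluded from lagger
    analysis."""
    return {task for task, times in completions.items()
            if all(t > threshold for t in times)}

runs = {"t1": [10, 55], "t2": [60, 70], "t3": [math.inf]}
late = lagger_tasks(runs, 40)   # t1 is excluded: one instance finished at 10
```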
VI.C Proactive Replication Heuristics
We augment the heuristics PRI-CR, EXCL-S.5, and EXCL-PRED described in
Chapter V to use replication and refer to the new heuristics as PRI-CR-DUP, EXCL-
S.5-DUP, and EXCL-PRED-DUP respectively. When scheduling an application,
each of the heuristics will create two instances (one original and one replica) of each
task and place them into a priority queue. Replicas are scheduled from this queue only
when the number of hosts available is greater than the number of tasks to schedule. The
tasks are prioritized according to the clock rate of the host to which the original task instance was assigned, so task instances assigned to slower hosts are replicated first. The heuristics PRI-CR-DUP, EXCL-S.5-DUP, and EXCL-PRED-DUP differ
by the set of hosts considered for task assignment as described in Chapter V.
All the heuristics discussed above prioritize tasks according to the clock rate
of the host to which the original task instance was assigned. We study other criteria
for selecting which task to schedule. EXCL-PRED-DUP-TIME is similar to EXCL-PRED-DUP except that replicas of the original task instances assigned farthest in the past are scheduled first; such task instances most likely failed or are stuck
on slow hosts. EXCL-PRED-DUP-TIME-SPD prioritizes the tasks according to the
time the first task instance was assigned plus the shortest possible completion time of
the task, i.e., the task size divided by the host’s maximum compute rate. Since most
hosts are available most of the time, we expect that most hosts should complete tasks in
the shortest possible time, and the heuristic replicates those tasks that take longer and
whose execution has been delayed.
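The three replica orderings can be sketched as follows; the field names (`host_clock`, `assigned_at`, `size`) are illustrative simplifications, not the simulator's interface.

```python
def replica_queue(tasks, criterion="clock"):
    """Order tasks for replication under the three priority criteria.

    Each task records the clock rate of the host its original instance
    was assigned to, the time of that assignment, and its size in
    seconds on a dedicated host (hypothetical fields).
    """
    def key(t):
        if criterion == "clock":        # PRI-CR-DUP family: the task
            return t["host_clock"]      # on the slowest host first
        if criterion == "time":         # EXCL-PRED-DUP-TIME: the
            return t["assigned_at"]     # oldest assignment first
        # EXCL-PRED-DUP-TIME-SPD: assignment time plus the shortest
        # possible completion time (task size / host compute rate)
        return t["assigned_at"] + t["size"] / t["host_clock"]
    return [t["id"] for t in sorted(tasks, key=key)]
```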
The heuristics above only create one replica for each task. We also study the
effect on application performance of varying the number of times a task is replicated. We
vary the number of replicas created by EXCL-PRED-DUP to be 2, 4, and 8 for heuristics
EXCL-PRED-DUP-2, EXCL-PRED-DUP-4, and EXCL-PRED-DUP-8, respectively.
VI.C.1 Results and Discussion
The addition of replication to each heuristic invariably improves performance significantly, by 35% on average (see Figure VI.1), for the SDSC platform. Somewhat surprisingly, the performance of each of the replication heuristics is similar, regardless
of which set of hosts is excluded. We attribute this to the fact that when replication is
done near the end of the application there are far more hosts than tasks and of these
hosts, several are fast and stable. So replication is done on the same set of relatively fast
hosts for each heuristic (about 20% of the hosts have clock rates greater than 2GHz),
Figure VI.1: Performance Of Heuristics Combined With Replication On SDSC Grid.
                        Number of tasks
  Heuristic             100       200       400
  EXCL-PRED-DUP-2      +2.6%    +0.15%    +3.4%
  EXCL-PRED-DUP-4      +6.7%    -3.0%     +2.0%
  EXCL-PRED-DUP-8      +4.3%    -7.4%     -4.1%
Table VI.1: Mean performance difference relative to EXCL-PRED-DUP when increasing
the number of replicas per task.
and excluding slow resources at this point is ineffectual.
We find that after a task is replicated once, replicating more often does not
improve performance. Table VI.1 shows the mean performance difference between EXCL-PRED-DUP and EXCL-PRED-DUP-2, EXCL-PRED-DUP-4, and EXCL-PRED-DUP-8 for applications with 100, 200, and 400 tasks. The maximum mean improvement
relative to EXCL-PRED-DUP over all heuristics is 6.7%. The lack of performance im-
provement is partly due to the fact that replicating a task once dramatically decreases
the probability of failure since there are many fast hosts available near the end of ap-
plication execution. Thereafter, creating more replicas will not significantly reduce the
probability of failure. Moreover, if too many replicas are created, then a large fraction of the hosts will be doing redundant work, preventing useful work from being done and thus
degrading performance. This explains the performance degradation shown in Table VI.1
of the EXCL-PRED-DUP-4 and EXCL-PRED-DUP-8 heuristics.
Several of the trends described for the SDSC platform match those found on the DEUG, LRI, and UCB platforms, which we summarize here and show in Appendix C. The performance improvement resulting from replication on the DEUG
platform is less than the improvement found with the SDSC platform because the DEUG
host clock rates are relatively homogeneous compared to the SDSC host clock rates.
Little improvement is found on the LRI platform because the hosts are both stable and
have homogeneous clock rates. Replication on the UCB platform results in high benefits
as the hosts are volatile and replication reduces the chance of failure. We conclude that
(proactive) replication can be useful either when there is a wide range of host clock rates
and/or the hosts are volatile.
Figure VI.2: Waste Of Heuristics Using Proactive Replication On SDSC Grid.
Despite the performance improvement when creating a single replica, the waste
in resources is significant (see Figure VI.2), and is 29% on average and as high as ∼90%.
In loaded desktop grids especially, such waste is unacceptable and would result in a
dramatic decrease in overall system throughput. We develop heuristics that use reactive
replication to reduce the level of waste in the next section.
VI.D Reactive Replication Heuristics
Up to now, we have considered heuristics that place all replicas of a task in the queue as soon as the original task is scheduled, and this resulted in high waste.
In an effort to improve efficiency, we now consider heuristics that are discriminating in
deciding which tasks are replicated. We modify the EXCL-PRED heuristic to evaluate
certain criteria for each task before placing a replica in the queue, effectively delaying task
replication. EXCL-PRED-TO is similar to EXCL-PRED except it delays the creation
of replicas until the predicted application completion time passes. That is, whenever the
original task instance is scheduled, we associate with that task instance the predicted
application completion time. This completion time is determined using the makespan
predictor described in Section V.B.2, which uses the average effective compute rate per
host to predict when the application will complete. This predicted completion time is
then used as a “time-out” value; if by that time the task instance has not completed, we
create a replica and place it in the queue. This heuristic is optimistic in the sense that
it creates the replica only after it determines that the original task instance has failed
to complete by the predicted application completion time instead of replicating earlier.
The rationale is that we should not replicate tasks that have been scheduled to fast and
reliable hosts, and instead, we should only replicate when it has been determined that
the execution of the task instance is delaying application completion, i.e., when the task
instance’s execution goes past the predicted completion time. The heuristic effectively
only replicates when it is close to the completion time of the application.
Another heuristic we consider is EXCL-PRED-TO-SPD, which replicates
more aggressively than EXCL-PRED-TO but less aggressively than EXCL-PRED-DUP.
EXCL-PRED-TO-SPD creates a replica only after the minimum task completion time
has expired, i.e., after task size/host clock rate seconds have expired. The reasoning
is that since most hosts are unloaded most of the time, each host should usually be
completely available to execute a task. In the case where a task instance is not completed
in its expected execution time (e.g., because the task execution was suspended multiple
times or the host is slightly loaded), the heuristic assumes the task execution will delay
application completion and places a replica in the queue.
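The two timeout criteria can be sketched as follows; the instance fields and policy labels are illustrative assumptions, not the simulator's interface.

```python
def should_replicate(now, instance, policy, predicted_makespan):
    """Decide whether an uncompleted task instance warrants a replica.

    EXCL-PRED-TO waits until the predicted application completion
    time recorded at assignment has passed.  EXCL-PRED-TO-SPD waits
    only for the instance's minimum possible completion time, i.e.
    assignment time plus task size divided by the host's clock rate.
    """
    if policy == "TO":
        return now > predicted_makespan
    if policy == "TO-SPD":
        min_completion = instance["assigned_at"] + \
            instance["size"] / instance["host_clock"]
        return now > min_completion
    raise ValueError(policy)
```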
VI.D.1 Results and Discussion
The performance of EXCL-PRED-TO and EXCL-PRED-TO-SPD is similar to
the other more aggressive replication heuristics from Section VI.C (see Figure VI.3),
despite replicating tasks later during application execution and replicating tasks less
often; the mean difference in the average makespan between EXCL-PRED-DUP and
EXCL-PRED-TO is close to zero. This is due to the fact that the heuristics only replicate a
task instance when it is determined that the executing task instance will delay application
completion. Moreover, when a task instance is replicated, there is usually a fast and
stable host to complete the task instance quickly and reliably. We discuss this in further
detail in Section VI.E.4.
Figure VI.3: Performance of reactive replication heuristics on SDSC grid.
Also, the performance of EXCL-PRED-TO and EXCL-PRED-TO-SPD is remarkable because both heuristics match the performance of EXCL-PRED-DUP with much less waste (by as much as 86% on the SDSC platform). In all cases on the SDSC platform,
                      Platform
  Metric      SDSC      DEUG     LRI       UCB
  Makespan   +0.06%    -8.7%    -10.8%    -17.5%
  Waste      +86.2%    +39%     +71.4%    +26.1%
Table VI.2: Mean performance difference and waste difference between EXCL-PRED-
DUP and EXCL-PRED-TO.
EXCL-PRED-TO achieves the lowest waste of all the heuristics shown in Figure VI.4,
and is less wasteful than EXCL-PRED-TO-SPD by 65% on average. Again, we attribute
the efficiency of EXCL-PRED-TO to the makespan predictor, which forces the sched-
uler to wait as long as possible (without significantly delaying application execution)
before replicating a task. The results show that reactive replication can achieve high
performance gains at relatively low resource waste.
Figure VI.4: Waste of reactive replication heuristics on SDSC grid.
Table VI.2 summarizes the mean makespan difference and mean difference of
waste of EXCL-PRED-DUP and EXCL-PRED-TO on all platforms. A positive per-
centage means that EXCL-PRED-TO did better than EXCL-PRED-DUP. In terms of
mean makespan, EXCL-PRED-TO performs about 8.7% and 10.8% worse than EXCL-
PRED-DUP on the DEUG and LRI platforms, respectively. Although EXCL-PRED-TO
performs slightly worse on these platforms, it is much less wasteful (on average 39% and
71.4% less wasteful on the DEUG and LRI platforms). This is partly because EXCL-
PRED-TO can adjust to the volatility of the platforms. Because EXCL-PRED-TO does
not replicate until a task delays execution, which is unlikely in the LRI scenario, there
is much less waste with EXCL-PRED-TO than EXCL-PRED-DUP. In the opposite sce-
nario, where the platform is volatile, EXCL-PRED-TO will replicate tasks more often.
Nevertheless, we find that EXCL-PRED-TO performs worse (17.5% on average) than
EXCL-PRED-DUP because tasks have a relatively high chance of failing on the UCB
platform. When EXCL-PRED-TO replicates a task instance that has timed out, there
is a relatively high probability that the replica itself will fail, and so the benefits of
using EXCL-PRED-TO are less on the relatively volatile UCB compared to the other
platforms. In contrast to EXCL-PRED-TO, EXCL-PRED-DUP replicates tasks imme-
diately as soon as the original task instance is assigned to a host (versus waiting until
the predicted application completion time) so there is a smaller chance that both task
instances will fail and delay application completion. At the same time, the waste of
EXCL-PRED-TO is significantly less than EXCL-PRED-DUP by 26.1% on average.
In summary, we find that EXCL-PRED-TO in general performs similar to
EXCL-PRED-DUP (within 10% on average across all platforms) while causing much
less waste (on average, 55% less across all platforms). This is because EXCL-PRED-TO
only replicates when a task instance will delay application completion, and because the
replica is most often scheduled on a relatively fast and stable host. The exception is on
the UCB platform, where the resources are so volatile that replicating a task as soon as
the original task instance is assigned results in faster task completion than if timeouts
are used; in this case, EXCL-PRED-DUP performs 17.5% better than EXCL-PRED-TO.
VI.E Hybrid Replication Heuristics
In the previous sections, we designed and evaluated proactive and reactive
replication heuristics that replicate tasks either proactively or reactively in an effort
to reduce the probability of task failure near the end of application execution. In this
section, we investigate a hybrid approach for replication that replicates proactively those
tasks that have high chance of failing, while replicating reactively those tasks that have
not completed by a predicted completion time. Clearly, just combining the proactive
replication heuristic EXCL-PRED-DUP and the reactive replication heuristic EXCL-
PRED-TO would not be beneficial as EXCL-PRED-TO achieved similar performance as
EXCL-PRED-DUP but with far less waste on most platforms; EXCL-PRED-DUP was
wasteful because it indiscriminately replicated all tasks once in order of those assigned
to the slowest hosts. In contrast, we use a more refined method of determining which
task to replicate and how much to replicate with our hybrid heuristic. Our approach is
to use the probability of task completion on the previous day to predict the probability
of task completion on the following day. Using these predicted probabilities, we replicate
tasks until the predicted probabilities of task completion go below some threshold. We
describe the heuristic in detail below.
The REP-PROB heuristic uses the history of host availability to make in-
formed decisions regarding replication. Specifically, the heuristic prioritizes each host
according to its predicted probability of completing a task by the projected application
completion time. We use random incidence (as discussed in Section III.E.4) with the
previous day’s host traces to determine the predicted probability of task completion. The
projected application completion time is determined using the same makespan predictor
as EXCL-PRED-TO described in Section V.B.2.
Also, REP-PROB prioritizes each task according to its probability of completion by the predicted makespan given the set of hosts it has been assigned to; the
task with the lowest probability of completion is replicated on the host with the highest
probability. Regarding how many task instances to create, the heuristic could create a
single replica as in the EXCL-PRED-DUP heuristic. But if the two task instances were
both scheduled on slow and unreliable hosts, then the probability of task completion
would remain low and the task would require more replicas. Instead, REP-PROB uses
the probability of completion to estimate how many task replicas to create in order to
ensure the probability of task completion is greater than some threshold.
VI.E.1 Feasibility of Predicting Probability of Task Completion
To evaluate the feasibility of such an approach, we examine the stationarity
of the probability that a task of a given size completes from day to day. Figure VI.5
shows the probability of task completion per day for tasks 5, 15, and 35 minutes in length for each of the platforms. The graphs show the probabilities for all five business days in one week, starting with a Monday. We see that on each platform the probability of task completion is relatively constant, deviating from the previous day by no more than 10%. This provides evidence that the predicted values may be sufficiently close to the actual values.
Also, we calculate the prediction error of the probability of task completion for
each host from one day to the next. Figures VI.6(a), VI.6(b), VI.6(c), and VI.6(d) show
the CDF of prediction errors for the SDSC, DEUG, LRI, and UCB platforms. We find
in all of the platforms that at least 60% of the prediction errors are less than 25%. These
results combined with the evidence of host independence shown in Section III.E.5 made
us optimistic that we could compute the probability of task completion accurately.
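The day-to-day prediction error and the CDF reading reported above can be computed as in the following sketch (the helper names are illustrative).

```python
def prediction_errors(daily_probs):
    """Signed relative error when each day's per-host completion
    probability is predicted by the previous day's value."""
    return [(cur - prev) / prev
            for prev, cur in zip(daily_probs, daily_probs[1:])]

def fraction_within(errors, bound):
    """Fraction of prediction errors with magnitude below `bound`,
    i.e. a symmetric reading of the CDFs in Figure VI.6."""
    return sum(abs(e) < bound for e in errors) / len(errors)
```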
VI.E.2 Probabilistic Model of Task Completion
To create an accurate probabilistic model, we first created a simple deterministic finite automaton (DFA) to understand and clarify the various states of a task during
execution (see Figure VI.7). Note that the concept of availability used in the figure
refers to exec availability. First, the task begins execution (state 1). If the host fails
before the task can complete, the task fails (state 2), and we must wait until the host
becomes available again before beginning task execution again (state 1). If the host is
available long enough for the task to complete, the task completes (state 3).
With this model, it became apparent that using a geometric distribution to
model the probability that a task completes in a certain number of attempts would be
possible. By using a geometric distribution, we assume that each attempt to complete
a task instance on some host is independent of other attempts on the same host. In
particular, the probability of task completion can be computed using the following pa-
rameters:
[Four panels: (a) SDSC (08–12 Sep 2003), (b) DEUG (17–21 Jan 2005), (c) LRI (17–21 Jan 2005), and (d) UCB (07–11 Mar 1994), each plotting the probability of task completion per day for 5, 15, and 35 minute tasks.]
Figure VI.5: Probability of task completion per day for several task lengths.
Figure VI.6: CDF of prediction errors of the probability of task completion from one day to the next for 5, 15, and 35 minute tasks on a dedicated 1.5GHz host.
[States: (1) task begins execution, (2) task fails, (3) task completes. Transitions: host fails before task completion (1→2); host becomes available (2→1); host available long enough for the task to complete (1→3).]
Figure VI.7: Finite automaton for task execution.
[Timeline: starting from the current time, failed trials each of length L (the expected time to task failure plus the expected length of the unavailability interval) precede a successful execution of length X that completes by deadline D; here A = 4 trials.]
Figure VI.8: Timeline of task completion.
• H1, H2, . . . , HN : the set of N heterogeneous and volatile hosts.
• W1, W2, . . . , WT : the set of T tasks of an application to be scheduled.
• Ci: completion time of task Wi.
• ci,j : completion time of task instance j of task Wi.
• ri: number of instances of task Wi.
• D: the desired completion time of the application, as determined by our makespan
predictor, for example.
• X: execution time of a task on a particular host.
• A: starting from the current time, the number of attempts, i.e., trials, possible on
a particular host to complete a task by time D.
• L: the length of time for each failed attempt on a particular host.
• p: the probability of task completion (as computed by random incidence discussed
in Section III.E.4) for a particular host.
The parameters X,A, L, and p are all defined for a particular host Hm (and
should be written as XHm , AHm , LHm , and pHm , respectively), but for brevity we omit
the subscripts in our discussion below.
If a task is to complete by time D when executed on a particular host Hm, the
last attempt to complete a task must occur by time D−X. Thus, the number of attempts
A for task completion is given by b(D − X)/Lc + 1, where L is the time required for
each failed attempt (see Figure VI.8). In the DFA in Figure VI.7, L is the time the task
had been executing before failure just before entering state 2 from state 1 plus the time
before the host becomes available again, i.e., the length of the unavailability interval,
incurred when going from state 2 back to state 1. Ideally, L would be modelled by
the probability distribution of the task’s time to failure and the length of unavailability
intervals. However, constructing such a joint probability distribution is difficult as using
only a day’s worth of historical data results in a very sparse probability distribution over
multiple dimensions. So, as a simplification, we calculate L using the expected time to
task failure (which we can compute with random incidence) plus the expected length
of unavailability for a particular host (which we can derive from the traces). Then, the
probability that a task instance j of task Wi completes by time D can be estimated by:
P(c_{i,j} \le D) = \sum_{a=1}^{A} (1 - p)^{a-1} p,   (VI.1)

which sums, over a = 1, . . . , A, the probability that the task completes on the a-th attempt.
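Equation VI.1 can be evaluated directly, as in this sketch; the parameter names follow the definitions above.

```python
import math

def prob_complete_by(D, X, L, p):
    """Probability that a task instance completes by time D on one
    host (Equation VI.1).  X is the task's execution time on the
    host, L the time lost per failed attempt, and p the per-attempt
    completion probability.  The last attempt must occur by D - X,
    giving A = floor((D - X) / L) + 1 attempts."""
    if D < X:
        return 0.0  # not even one full execution fits before D
    A = math.floor((D - X) / L) + 1
    # Geometric sum: success on attempt a after a - 1 failures.
    return sum((1 - p) ** (a - 1) * p for a in range(1, A + 1))
```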
In Section III.E.5, we gave evidence that exec availability is independent among
hosts on certain platforms. Assuming that exec availability
is independent among hosts, the probability that a particular task Wi completes by time
D is estimated by:
P(C_i \le D) = P(\min_j(c_{i,j}) \le D)
             = 1 - P(\min_j(c_{i,j}) > D)
             = 1 - \prod_{j=1}^{r_i} P(c_{i,j} > D).   (VI.2)
Then, the probability that the application completes in time D can be estimated
by:
P(\max_i(C_i) \le D) = \prod_{i=1}^{T} P(C_i \le D).   (VI.3)
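Equations VI.2 and VI.3 translate directly into code; this sketch assumes the per-instance probabilities have already been computed with Equation VI.1.

```python
def prob_task_completes(instance_probs):
    """Equation VI.2: with independent hosts, a task completes by
    time D unless every one of its instances misses the deadline."""
    miss = 1.0
    for p in instance_probs:   # p = P(c_{i,j} <= D) per instance
        miss *= 1.0 - p
    return 1.0 - miss

def prob_app_completes(task_probs):
    """Equation VI.3: the application completes by time D only if
    every task does (again assuming independence)."""
    prod = 1.0
    for p in task_probs:
        prod *= p
    return prod
```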
So using the probability of completion for each host and desired completion
time, we can determine the amount of replication needed to achieve some minimum
probability threshold. Clearly, at any particular time during application execution, there
may not be enough hosts to replicate on in order to achieve the threshold. The heuristic
REP-PROB makes the best effort by replicating the task with lowest probability of
completion on the host with the highest, until no hosts remain; if a task
has no instances assigned, it is given the highest task priority to ensure that an instance
of each task is assigned before replicating.
While Equation VI.3 can be used to estimate the probability of application
completion in theory, in practice it is almost impossible to achieve given the high amount
of replication and number of hosts required. This can be shown by a simple back-of-the-envelope calculation to determine the number of instances per task required to achieve
some probability bound. That is, assume our application consists of 100 tasks to be
scheduled on the SDSC grid, and that our desired probability of application completion
P(\max_i(C_i) \le D) is 80%. Achieving this threshold requires that each task is completed with probability P(C_i \le D) = e^{\ln(0.8)/100} \approx 0.998, assuming that each task is completed with
equal probability. If a task instance fails with probability 20% (a realistic number as
shown in Section III.E.4), it would require four task instances for each task, totalling
400 task instances for the application with 100 tasks. Since there are only ∼200 hosts
in the SDSC platform, computing all task instances at once is not possible for even a
relatively small application. Furthermore, waste of 300% is extremely high and would
reduce the effective system throughput considerably. We confirmed these conclusions
in simulation for a range of application sizes (100, 200, 400 tasks) and task sizes (5,
15, 35 minutes on a dedicated 1.5GHz host); each task is replicated so often that the
application rarely completes by the predicted makespan. So instead of trying to achieve
a probability threshold per application, REP-PROB makes the best effort to achieve a
probability threshold per task using Equation VI.2.
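The back-of-the-envelope calculation above can be checked mechanically; the function below is a sketch that generalizes the arithmetic.

```python
import math

def replicas_needed(num_tasks, target_app, fail):
    """Replicas per task needed so that identical, independent tasks
    yield P(application completes) >= target_app, given a per-instance
    failure probability `fail`."""
    # Each task must complete with probability target_app**(1/num_tasks).
    per_task = math.exp(math.log(target_app) / num_tasks)
    r = 1
    while 1 - fail ** r < per_task:   # 1 - fail**r = P(some instance completes)
        r += 1
    return r
```

For 100 tasks, a target of 80%, and a 20% per-instance failure probability, this returns 4 replicas per task, i.e., 400 instances in total, confirming the estimate above.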
VI.E.3 REP-PROB Heuristic
A procedural outline of the REP-PROB heuristic is given below:
1. Predict the application completion time D using the makespan predictor described
in Section V.B.2.
2. Prioritize tasks according to the probability of task completion by time D estimated
by Equation VI.2. Unassigned tasks have the highest priority. Tasks that have
timed out have the second highest priority.
3. Prioritize hosts according to the probability of completing a task by time D.
4. While there are tasks remaining in the queue:
(a) Assign an instance of the task with the lowest probability of completion to
the host with the highest probability.
(b) Assign a timeout D to that task. If the task has not been completed by time
D, the task will be given the second highest possible priority (corresponding
to “timed-out” tasks).
(c) Recompute the task’s probability of completion.
(d) Remove the task from the queue if its probability of completion is above 80%.1
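The outline above can be sketched as a single assignment pass; the dictionaries and the probability helper are illustrative simplifications (the real heuristic also re-predicts D and handles timeouts).

```python
def rep_prob_schedule(tasks, hosts, threshold=0.8):
    """One pass of the REP-PROB assignment loop.  `tasks` maps a task
    id to the completion probabilities of its current instances;
    `hosts` maps a host id to its probability of completing a task by
    the predicted makespan D (computed elsewhere with Equation VI.1)."""
    def task_prob(ps):                  # Equation VI.2
        miss = 1.0
        for p in ps:
            miss *= 1.0 - p
        return 1.0 - miss
    free = sorted(hosts, key=hosts.get, reverse=True)  # best host first
    assignments = []
    while free:
        # Tasks still below the threshold are candidates; unassigned
        # tasks come first, then the lowest-probability task.
        pending = [t for t in tasks if task_prob(tasks[t]) < threshold]
        if not pending:
            break
        t = min(pending,
                key=lambda t: (len(tasks[t]) > 0, task_prob(tasks[t])))
        h = free.pop(0)
        tasks[t].append(hosts[h])       # new instance on the best host
        assignments.append((t, h))
    return assignments
```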
We hypothesize that REP-PROB should outperform EXCL-PRED-TO. REP-
PROB takes into account both host clock rate and host volatility when deciding which
task to replicate and which host to replicate on. As such, REP-PROB aggressively
replicates tasks that have a low probability of completion as soon as the original task
instance is assigned; this in turn reduces the chance that tasks scheduled on volatile
hosts will delay application completion. In contrast, EXCL-PRED-TO only replicates a
task if it has not been completed by the predicted makespan, and the replica is assigned
to a host based on its clock rate (disregarding the host's volatility). As such, tasks initially assigned to volatile hosts will not have replicas scheduled until late during the application execution, and this may result in delays in application completion. Also, replicas may
be assigned to volatile hosts (although the hosts may have relatively fast clock rates).
VI.E.4 Results and Discussion
Figure VI.9 shows the results for the SDSC platform for each application size,
while Table VI.3 shows the performance of REP-PROB relative to EXCL-PRED-TO
for all four platforms. A positive value in the table means that REP-PROB performed
that much better than EXCL-PRED-TO. (Figures for the other platforms are shown in
Appendix C.) Surprisingly, REP-PROB does not perform better than EXCL-PRED-TO
in the SDSC and DEUG platforms. In the one platform where REP-PROB does perform
significantly better than EXCL-PRED-TO, the performance difference on average is only
13%.
1 We tested a range of thresholds from 50-90% and found that a threshold of 80% is the most adequate in terms of improving application performance.
Figure VI.9: Performance of REP-PROB on SDSC grid.
Figure VI.10: Waste of REP-PROB on SDSC grid.
                      Platform
  Metric      SDSC     DEUG      LRI      UCB
  Makespan   -4%      -11.1%    +1.7%    +13.1%
  Waste      -140%    -37.5%    +3.3%    -19.2%
Table VI.3: Mean performance and waste difference between EXCL-PRED-TO and
REP-PROB.
The performance of EXCL-PRED-TO is similar to REP-PROB for several rea-
sons. First, there is strong correlation between the probability of completion by a partic-
ular time and clock rate as shown in Section III.E.6. Since EXCL-PRED-TO replicates
tasks on the hosts with the highest clock rates, these hosts will tend to have the highest
probability of completion by the projected completion time. Second, a large fraction of
the hosts in each platform are relatively stable. Figures VI.12(a), VI.12(b), and VI.12(c)
show the CDF of task failure rates for each host in the platform. For example, for a
fifteen minute task, the fraction of hosts with failure rates less than 20% for the SDSC,
DEUG, LRI, and UCB platforms are ∼60%, 75%, 100%, and 50% respectively. So for
the small fraction of tasks that timeout, the replica will most likely be scheduled on a
relatively stable host (especially since EXCL-PRED-TO will choose the host with the
fastest clock rate, which is correlated with the probability of task completion) and the
resulting probability of task failure will be dramatically lowered. For example, if the timed-out task has a 50% chance of failure and a replica is then scheduled on a host with a 20% chance of failure, then the probability that the task will fail is a mere 10%. The fact
that EXCL-PRED-DUP-2, EXCL-PRED-DUP-4, EXCL-PRED-DUP-8 did not improve
performance on the SDSC platform supports this claim (see Section VI.C). Moreover, by
comparing the number of laggers caused by task failures between EXCL-PRED-TO and
REP-PROB, we see little improvement in the number of laggers when the REP-PROB
heuristic is used. Figure VI.11 shows the number of laggers for applications scheduled by
the EXCL-PRED-TO and REP-PROB heuristics, and we can see from this figure that
the number of laggers caused by failures is usually similar; on average, REP-PROB has
only 0.66 fewer laggers than EXCL-PRED-TO. (We also see in Figure VI.11 that the number
of laggers for EXCL-PRED-TO and REP-PROB exceeds the number of laggers corre-
sponding to EXCL-S.5 for applications with 100 tasks that are 5 minutes in length; at the
same time, the mean makespans of EXCL-PRED-TO and REP-PROB are 56% better than the mean makespan of EXCL-S.5 on average. This discrepancy is due to the fact that
the IQRs for EXCL-PRED-TO and REP-PROB are shorter than EXCL-S.5’s, and so a
higher number of task instances are classified as laggers. So, when comparing the number
of laggers between one heuristic and another, one should also look at Figure VI.15, which
shows the mean makespans of each heuristic, to gain perspective.) Because there is not
a significant reduction in the number of laggers when the REP-PROB heuristic is used
and the mean makespans resulting from EXCL-PRED-TO and REP-PROB are similar,
the benefits of REP-PROB are dubious. Third, the (un)availability of one host with
respect to another can be correlated in some platforms and so the probability of task
completion computed is only a lower bound. The fact that availability of hosts in the
DEUG platform is correlated, as shown in Section III.9, may be one reason why EXCL-PRED-TO outperforms REP-PROB by ∼11% on that particular platform.
Moreover, REP-PROB wastes significantly more resources (as much as 140%
more than EXCL-PRED-TO) without much gain in performance (see Table VI.3). REP-
PROB naturally replicates more than EXCL-PRED-TO when the heuristic replicates
tasks with low probabilities of completion. One reason that this does not result in sig-
nificant performance improvement could be because of mispredictions in the probability
of task completion. Although a significant fraction of predictions may be within 25%
of the actual value (as discussed in Section VI.E.1), any misprediction that leaves a
task assigned to a volatile host unreplicated could be costly for the application. Also,
our assumption that the series of attempts to complete a task instance on a particular
host are independent may not be valid; by observing our traces, we found that a short
availability interval is often followed by another short availability interval.
Nevertheless, REP-PROB does perform slightly better than EXCL-PRED-TO on
the UCB platform. Because all the hosts in the UCB platform have the same clock rates
and EXCL-PRED-TO prioritizes hosts only by clock rates, EXCL-PRED-TO cannot
distinguish a stable host from a volatile one. REP-PROB on the other hand will prioritize
the hosts by their predicted probability of completion, and have an advantage in this
case. But, the performance improvement is limited again because a large fraction of the
hosts in the UCB platform are relatively stable.
VI.E.5 Evaluating the benefits of REP-PROB
The “Achilles’ heel” of EXCL-PRED-TO is the fact that it sorts only by clock
rates, and one can certainly construct pathological cases that make EXCL-PRED-TO
perform more poorly than REP-PROB. For example, one could imagine the scenario
where half the hosts are extremely volatile while the other half are extremely stable but
have slightly lower clock rates than the volatile hosts. In this case, EXCL-PRED-TO will
tend to schedule tasks to hosts with faster (albeit only slightly faster) clock rates, which
are also the most volatile; as a result, the tasks will tend to fail and delay application
completion. REP-PROB on the other hand will take into account host volatility and
schedule tasks to stable hosts.
To investigate this issue, we construct a new platform, half of which consists of
volatile hosts from the UCB platform. The clock rates of the UCB hosts are transformed
Figure VI.11: Cause of Laggers (IQR factor of 1) on SDSC Grid. 1→FCFS. 2→PRI-CR.
3→EXCL-S.5. 4→EXCL-PRED. 5→EXCL-PRED-TO. 6→REP-PROB. (Panels show the number
of laggers, broken down by cause (slow host or failed task), for applications of
100, 200, and 400 tasks with task sizes of 5, 15, and 35 minutes.)
Figure VI.12: CDF of task failure rates per host. (Three panels, for (a) 5 min.,
(b) 15 min., and (c) 35 min. tasks, each plotting the cumulative fraction of
hosts versus failure rate for the SDSC, DEUG, LRI, and UCB platforms.)
to follow a normal distribution with mean 1500MHz and standard deviation of 250MHz.
The other half of the new platform consists of stable hosts from the LRI cluster, which
is relatively homogeneous in terms of host clock rates. We then create a set of platforms
where we transform the clock rates of hosts from the LRI platform. Clearly, if the clock
rates of the stable LRI hosts are very low, then it will be better to schedule
tasks to the volatile UCB hosts, and EXCL-PRED-TO will perform better than REP-
PROB. If the clock rates of the stable LRI hosts are higher than UCB hosts, then it
will be better to schedule tasks to the stable and fast LRI hosts, and again EXCL-
PRED-TO will outperform REP-PROB. However, when clock rates of the LRI hosts are
“slightly” less than the clock rates of the UCB hosts, then REP-PROB has a chance of
outperforming EXCL-PRED-TO.
Specifically, we transform the clock rates of LRI hosts by -33%, -15%, -6%, +6%,
+15%, and +33% relative to the mean clock rate of UCB hosts (1500MHz), and we refer
to the resulting platforms as UCB-LRI-n33, UCB-LRI-n15, UCB-LRI-n06, UCB-LRI-
p06, UCB-LRI-p15, and UCB-LRI-p33, respectively. Then we run both EXCL-PRED-
TO and REP-PROB on each platform and determine how each heuristic performs. We
observe that REP-PROB performs at most ∼13% better than EXCL-PRED-TO, and only
for the platforms whose LRI clock rates lie between -15% and -6% of the mean
UCB clock rate (see Figure VI.13).
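The platform construction described above can be sketched as follows. This is a simplified illustration, not the actual experimental setup: host counts, the helper name, and the use of freshly sampled rates (rather than transformed trace data) are our assumptions; only the N(1500 MHz, 250 MHz) distribution and the percentage shifts come from the text.

```python
import random

def build_hybrid_platform(n_ucb, n_lri, shift_pct, seed=0):
    """Sketch of the hybrid UCB-LRI platform construction (helper name and
    host counts are ours). Volatile UCB hosts get clock rates drawn from
    N(1500 MHz, 250 MHz); stable LRI hosts get the 1500 MHz UCB mean
    shifted by shift_pct percent, as in the UCB-LRI-* platforms."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    ucb = [rng.gauss(1500.0, 250.0) for _ in range(n_ucb)]
    lri = [1500.0 * (1.0 + shift_pct / 100.0)] * n_lri
    return ucb, lri

# e.g., UCB-LRI-n15 shifts the stable hosts to 15% below the UCB mean
ucb, lri = build_hybrid_platform(50, 50, -15)
```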
This limited improvement in a relatively small set of hypothetical scenarios in-
dicates that REP-PROB will rarely outperform EXCL-PRED-TO, and any performance
difference is slight. Moreover, in practice on our real platforms, we find that REP-PROB
performs at most 13.1% better than EXCL-PRED-TO while causing 40% more waste
on average across all platforms. So, in general, we believe that EXCL-PRED-TO will
usually outperform or perform as well as REP-PROB, with the possibility of performing
slightly worse.
VI.F Estimating application performance
Estimates or bounds on the application makespan are useful for users submit-
ting applications. Using the results of our application simulations, we can give estimates
Figure VI.13: Performance difference between EXCL-PRED-TO and REP-PROB on the
transformed UCB-LRI platforms. (Performance difference in percent for 5, 15,
and 35 minute tasks, plotted against the clock rate mode of the transformed LRI
hosts, i.e., the percent deviation from the mean clock rate of the UCB hosts.)
of makespan for our best heuristic EXCL-PRED-TO and provide lower confidence inter-
vals for an application executed on each platform.
Table VI.4 shows the mean makespan of EXCL-PRED-TO on the SDSC plat-
form as well as the lower confidence intervals (80%, 90%, 95%) relative to the mean
makespan, standard deviation, and median. On the SDSC platform, we found that the
lower 80% confidence interval for application makespan is remarkably tight as it is less
than 8% away from the mean for all task sizes and numbers. The mean 80% lower
confidence intervals for the DEUG, LRI, UCB, GIMPS, and LRI-WISC platforms are
20%, 0.18%, 9%, 4%, and 3%, respectively, from the respective means. This means
that one could use the 80% lower confidence interval in Table VI.4 to get a
reasonably accurate prediction of the makespan within 20% of the mean.
(Nevertheless, the lower 95% confidence intervals are significantly wider,
deviating by as much as 60% from the mean.)
From the statistics of the empirical simulation data shown in Table VI.4, a user
could get an estimate of how long his/her application would take to execute on the SDSC
platform, and how much variance to expect. For example, an application with 200 tasks
that are 15 minutes in length on a dedicated 1.5GHz host should take about 50 minutes
to complete on the SDSC platform when scheduled with the EXCL-PRED-TO heuristic.
The actual makespan could be as much as 10 minutes longer than the predicted
mean (with 80% confidence).
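As a concrete sketch of how such an entry can be derived from simulation output, one could take the empirical quantile of the simulated makespans and express it relative to their mean. The function name and the nearest-rank quantile rule below are our assumptions; the dissertation's exact interval construction may differ.

```python
def makespan_bound(samples, level=0.80):
    """Return (mean, relative bound): the empirical `level`-quantile of
    simulated makespans, expressed as a fraction above/below the mean.
    Nearest-rank quantile; a sketch, not the exact construction used
    for Table VI.4."""
    xs = sorted(samples)
    mean = sum(xs) / len(xs)
    idx = min(len(xs) - 1, int(level * len(xs)))
    return mean, xs[idx] / mean - 1.0

# e.g., 100 simulated makespans of 1..100 seconds
mean, rel = makespan_bound(range(1, 101))
```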
                         Makespan statistics
Task number    Mean   80% c.i.   90% c.i.   95% c.i.   std. dev.   median

(a) 5 min. tasks
    100         676    +0.05      +0.14      +0.44        194         613
    200        1087    -0.02      +0.08      +0.31        768        3811
    400        1752    +0.02      +0.12      +0.27        240        1713

(b) 15 min. tasks
    100        1960    +0.07      +0.34      +0.59        618        1709
    200        3012    -0.02      +0.06      +0.19        356        2923
    400        4831    +0.03      +0.05      +0.10        524        4814

(c) 35 min. tasks
    100        4037    +0.02      +0.08      +0.31        768        3810
    200        6824    -0.005     +0.03      +0.16        600        6750
    400       10892    +0.04      +0.05      +0.10        766       10882

Table VI.4: Makespan statistics of EXCL-PRED-TO for the SDSC platform. Lower
confidence intervals are w.r.t. the mean. The mean, standard deviation, and
median are all in units of seconds.
VI.G Related Work
VI.G.1 Task replication
The authors in [43] use a probabilistic model similar to the one described in
Section VI.E to analyze various replication issues. The platform model used in
the study resembles ours in that the resources are shared, task preemption is
disallowed, and checkpointing is not supported. The application models were also
similar: one model was based on tightly-coupled applications, while the other
was based on loosely-coupled applications consisting of task parallel components
before each barrier synchronization. The authors then assume that the
probability of task completion follows a geometric distribution.
Despite the similarities in platform and application models, there were a number
of important differences between that study and our own. First, the results were based
on a discrete time model, where the unit of time is the length of the task l. That is,
if a task that began execution at time t fails, the task is started only after time t + l.
This assumption is made to ensure each “trial” is evenly spaced so that computing the
time to task completion is simplified. However, their assumption is problematic because
it places an unrealistic constraint on the time required to restart task execution. In
particular, in the case of a task failure, their model assumes that the expected time to
failure plus the expected period of unavailability) must equal the task length l and is thus
entirely dependent on the task length, which is a rare and improbable occurrence; The
second difference between that study and our own is that their platform model assumes
a homogeneous environment, and so their study does not consider the effect of using
hosts of different speeds when replicating.
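Under the geometric assumption of [43], the expected completion time has a simple closed form. The derivation below is our restatement of that model, with p denoting the per-trial success probability (a symbol we introduce, not one used in the cited study):

```latex
% Each trial occupies exactly l time units (the discrete-time assumption
% of [43]) and succeeds independently with probability p, so the number
% of trials N is geometric:
\begin{align*}
  \Pr[N = k] &= (1-p)^{k-1} p, \\
  \mathbb{E}[N] &= \frac{1}{p}, \qquad
  \mathbb{E}[T] = l\,\mathbb{E}[N] = \frac{l}{p}.
\end{align*}
```

This is exactly why the even-spacing assumption simplifies their analysis: the completion time is just l times a geometric random variable.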
The work in [50] examines analytically the costs of executing task parallel ap-
plications in desktop grid environments. The model assumes that after a machine is un-
available for some fixed number of time units, at least one unit of work can be completed.
Thus, the estimates for execution time are lower bounds. We believe the assumption is
too restrictive, especially since the sizes of availability intervals can be correlated in
time [62]; that is, a short availability interval (which would likely cause task failure) will
most likely be followed by another short availability interval.
Other studies of task replication [88, 74, 87] have focused on detecting errors
and ensuring correctness. Although many of these types of security methods have been
deployed by current systems, most are ad hoc and none are fail-proof. For
example, SETI@home recomputes, on a dedicated machine, any task that indicates a
positive signal, in order to prevent false positives. Another example is the
work described
in [74] where the author develops methods to give probabilistic guarantees on result
correctness, using a credibility metric for each worker. The results however are built
upon dubious and unsupported assumptions of the probabilities of task result error rates.
Given the numerous sources of error (e.g., hardware/software malfunction, malicious
attacks), creating probabilistic models of error rates may not be possible.
VI.G.2 Checkpointing
Task checkpointing is another means of dealing with task failures since the task
state can be stored periodically either on the local disk or on a remote checkpointing
server; in the event that a failure occurs, the application can be restarted from the last
checkpoint. In combination with checkpointing, process migration can be used to deal
with CPU unavailability or when a “better” host becomes available by moving the pro-
cess to another machine. As discussed earlier in Section VI.A, remote checkpointing or
process migration is most likely infeasible in Internet environments, as the application
can often consume hundreds of megabytes of memory and bandwidth over the Internet
is often limited. (Although our heuristics are evaluated using traces gathered solely from
enterprise environments, the heuristics were designed using our platform and application
models discussed in Section IV.B.1 to also function in Internet environments. Design-
ing heuristics that assume process migration capabilities would make them no longer
applicable to Internet environments.)
We investigate the effect of local checkpointing on application makespan. Specif-
ically, we assume that the EXCL-PRED heuristic (which does no replication) is enabled
with local checkpointing capabilities, and we refer to this heuristic as EXCL-PRED-
CHKPT. We also enable the optimal scheduler with checkpointing abilities and refer to
the resulting algorithm as OPTIMAL-CHKPT.
We assume that each checkpointing heuristic checkpoints every two and a half
minutes, and the cost of checkpointing is 15 seconds. Also, we assume that the cost of
restarting a task after a checkpoint has occurred is 15 seconds. We tried a range of other
values for the frequency and cost of checkpointing, and restart costs, and found the same
trends.
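The effect of these parameters can be illustrated with a simplified single-host model (ours, not the dissertation's simulator): progress since the last checkpoint is lost when an availability interval ends, a 15 s restart cost is paid when resuming, and each 150 s of work is followed by a 15 s checkpoint. The function and the (available, length) interval representation are our assumptions.

```python
def time_with_checkpointing(task_len, intervals,
                            period=150.0, ckpt_cost=15.0, restart_cost=15.0):
    """Elapsed time to finish a task of length task_len (seconds) on one
    host with local checkpointing. `intervals` is a list of
    (available, length) pairs; returns None if the trace ends first.
    A simplified sketch, not the dissertation's simulator."""
    elapsed, saved = 0.0, 0.0
    for avail, length in intervals:
        if not avail:
            elapsed += length
            continue
        t, done = 0.0, saved
        if saved > 0:
            t += restart_cost              # resume from last checkpoint
        while True:
            run = min(period, task_len - done)
            if done + run >= task_len:
                if t + run <= length:      # final stride fits: task done
                    return elapsed + t + run
                break
            if t + run + ckpt_cost > length:
                break                      # interval ends; progress since
                                           # the last checkpoint is lost
            t += run + ckpt_cost           # run a stride, then checkpoint
            done += run
            saved = done
        elapsed += length
    return None                            # trace ended before completion

# A 5-minute task on an always-available host pays one checkpoint: 315 s
t = time_with_checkpointing(300.0, [(True, 1000.0)])
```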
Figure VI.14: Performance of checkpointing heuristics on SDSC grid. (Mean
makespan in seconds versus task size in minutes on a dedicated 1.5GHz host, for
EXCL-PRED-DUP, EXCL-PRED-CHKPT, OPTIMAL, and OPTIMAL-CHKPT.)
Figure VI.14 shows the mean makespan for applications with 100 tasks of sizes
ranging from 15 to 300 minutes executed on the SDSC platform. (We also executed
applications of other sizes but found that most could not complete during business hours.)
In addition to plotting the performance of the checkpoint-enabled heuristics EXCL-
PRED-CHKPT and OPTIMAL-CHKPT, we plot the performance of EXCL-PRED-
DUP and OPTIMAL (which is the performance resulting from the optimal schedule)
for comparison. Note that the optimal schedule for a platform without checkpointing
capabilities (determined by OPTIMAL) can be different from the optimal schedule for
a platform where checkpointing is enabled (determined by OPTIMAL-CHKPT). For
example, for an extremely long task, the OPTIMAL algorithm may not be able to
complete a task at all whereas the OPTIMAL-CHKPT will be able to use a series of
availability intervals since little (if any) progress in task execution is lost when the host
fails.
We find that EXCL-PRED-CHKPT performs at least five times worse than
EXCL-PRED-DUP. OPTIMAL-CHKPT performs slightly worse than OPTIMAL for
task sizes ranging from about 15 to 225 minutes; for task sizes larger than 225 minutes,
OPTIMAL-CHKPT outperforms OPTIMAL slightly.
The poor performance of EXCL-PRED-CHKPT is due to the fact that a task
is not reassigned when it is assigned to a slow host or when the host becomes unavailable
for task execution. When the host becomes unavailable for task execution, it is typically
unavailable for long periods of time relative to the execution time of the application.
In particular, the mean lengths of unavailability intervals for the SDSC, LRI,
DEUG, and UCB platforms are 75, 225, 21, and 7 minutes, respectively. As a
result, task execution is delayed by the amount of time required before the host
becomes available
again for execution; for applications that require rapid turnaround, this is detrimen-
tal. OPTIMAL-CHKPT performs nearly as well as OPTIMAL because the omniscient
scheduler will avoid periods of exec unavailability, but it performs slightly worse for tasks
less than 225 minutes in length because of the overheads involved when checkpointing.
For task sizes greater than 225 minutes, OPTIMAL-CHKPT outperforms OPTIMAL
(without checkpointing enabled) as the costs of restarting a task from scratch due to exec
unavailability become higher than the overheads of using checkpointing. So while local
checkpointing is possible, we find that its benefits are limited for short-lived applications
given the relatively long lengths of unavailability intervals found in many real desktop
grid environments.
VI.H Summary
We studied a variety of approaches for improving performance by means of
replication. We used proactive, reactive, and hybrid approaches, and for each approach,
we examined the issues of which task to replicate, which host to replicate to, and how
much to replicate (see Table VI.5).
Our conclusion is that a reactive replication strategy that uses timeouts when
the execution time of a task goes past the predicted makespan is surprisingly superior to
more aggressive replication heuristics or heuristics that use dynamic historical informa-
tion to predict task completion rates. This conclusion can be explained by the fact that a
large portion of the hosts in each platform are stable, and that clock rates correlate
Heuristic             Which host    Which task                How many replicas
PRI-CR-DUP,           clock rate    clock rate                x1
EXCL-S.5-DUP,
EXCL-PRED-DUP
EXCL-PRED-DUP-2       clock rate    clock rate                x2
EXCL-PRED-DUP-4       clock rate    clock rate                x4
EXCL-PRED-DUP-8       clock rate    clock rate                x8
EXCL-PRED-TO          clock rate    on timeout via            x1
                                    predicted makespan
EXCL-PRED-TO-SPD      clock rate    on timeout according      x1
                                    to clock rate
REP-PROB              P (Ci ≤ D)    P (Ci ≤ D)                until above 80%

Table VI.5: Summary of replication heuristics.
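REP-PROB's "until above 80%" stopping rule can be sketched as follows, under the independence assumption that the text notes is only approximate (so the computed probability is a lower bound when host availabilities are correlated). Function and variable names are ours:

```python
def replicate_until(prob_complete, hosts, threshold=0.80):
    """Sketch of REP-PROB's stopping rule: add replicas, in decreasing
    order of predicted completion probability, until the chance that at
    least one instance finishes by the deadline exceeds the threshold.
    `prob_complete[h]` is the predicted P(Ci <= D) on host h. Assumes
    independent failures, which the text notes is only approximate."""
    chosen, p_all_fail = [], 1.0
    for h in sorted(hosts, key=lambda h: prob_complete[h], reverse=True):
        chosen.append(h)
        p_all_fail *= 1.0 - prob_complete[h]
        if 1.0 - p_all_fail >= threshold:
            break
    return chosen

# Two replicas suffice here: 1 - (1-0.75)(1-0.5) = 0.875 >= 0.80
replicas = replicate_until({'a': 0.5, 'b': 0.75, 'c': 0.1}, ['a', 'b', 'c'])
```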
strongly with task completion rates. Combining this with the fact that there are usually
many hosts relative to the number of tasks near the end of application execution (as
shown in Figure V.4), EXCL-PRED-TO demonstrates the best performance in terms of
reducing makespan and waste.
Figure VI.15 shows the mean makespan of EXCL-PRED-TO and REP-PROB
for the SDSC grid in addition to the best performing heuristics we examined in previous
chapters. We find that for the SDSC grid the fourth quartile of EXCL-PRED-TO is
on average 2.25 times shorter than the fourth quartile of EXCL-PRED, and EXCL-
PRED-TO performs better than EXCL-PRED by a factor of 1.49 on average. Compared
to the optimal schedule, EXCL-PRED-TO performs within a factor of 1.7 on the SDSC,
DEUG, and LRI platforms, and within a factor of 2.6 on the UCB platform. In addition
to achieving the best or close to best performance, EXCL-PRED-TO almost always re-
sults in the least (or close to the least) waste of all the replication heuristics (achieving
a mean waste of 6%, 33%, 9%, 71% on the SDSC, DEUG, LRI, and UCB platforms, re-
spectively); the large performance benefits of using reactive replication are often achieved
with little waste.
Figure VI.15: Length of task completion quartiles on SDSC Grid. 0→OPTIMAL.
1→FCFS. 2→PRI-CR. 3→EXCL-S.5. 4→EXCL-PRED. 5→EXCL-PRED-TO. 6→REP-PROB.
(Panels show the duration in seconds of the 1st through 4th task completion
quartiles for applications of 100, 200, and 400 tasks with task sizes of 5, 15,
and 35 minutes.)
Chapter VII
Scheduler Prototype
In this chapter, we describe our implementation of the best performing heuristic
EXCL-PRED-TO, and show that the scheduling model we used is feasible in a real
system. Our implementation of EXCL-PRED-TO is integrated with the open source
XtremWeb desktop grid software.
VII.A Overview of the XtremWeb Scheduling System
The architecture of the XtremWeb system matches the general architecture
of desktop grid systems described in Section II.B. We describe in further detail here
the components of the XtremWeb system that reside at the Application and Resource
Management Level since we modify these components for our scheduler.
After an application is submitted, the application manager periodically selects
a subset of tasks from a task pool and distributes them to a scheduler (or set of sched-
ulers). The scheduler is then responsible for the completion of tasks. Workers make a
request for work to the scheduler typically using Java RMI, although other methods of
communication through SSL and TCP-UDP are supported. The default scheduler in
XtremWeb schedules tasks to hosts in a FCFS fashion, i.e., schedules tasks to hosts in
the order in which they arrived. Upon completion, the worker will return the result to
the scheduler, which stores the result on the server’s disk and records the task completion
in the results database.
VII.B EXCL-PRED-TO Heuristic Design and Implementation
We replace the FCFS scheduler in XtremWeb with our EXCL-PRED-TO scheduler. This
involves a number of changes to the XtremWeb system, which we describe
below.
VII.B.1 Task Priority Queue
One potential hazard of replication is that the replicas could delay original task
instances from being executed. For example, suppose instances of a particular task are
replicated a high number of times and then placed in a work queue. Then suppose task
instances of a different task are placed in the work queue after instances of the first task.
Since task instances are assigned in the order that they are placed in the work queue,
the second task could “starve” as the workers are kept busy executing replicas of the
first task.
To reduce the chance of task starvation, our scheduler uses a two-level work
queue; we refer to the higher level queue as the primary queue and the lower level queue
as the secondary queue. When an application is submitted by the client, an instance of
each task is placed in the primary queue. For the EXCL-PRED-TO heuristic, a timeout
is associated with each original task instance when it is scheduled on a host. When
this time out expires, a task replica is placed in the secondary queue. When doing task
assignment, the scheduler will first schedule tasks in the primary queue before those in
the secondary queue in an effort to ensure that at least one instance of each task will
always be scheduled before any replicas.
To keep the number of replicas from growing too rapidly, only original task
instances are allowed to time out. Also, when the original task instance fails, a new
corresponding instance is placed in the primary queue; however, if a replica fails, nothing
more is done.
The task instance priority queues are implemented as fixed-sized lists within
the XtremWeb scheduler, and the lists act as a buffer between the database of tasks
and workers requesting task instances. Periodically, the primary priority queue is filled
with original task instances instantiated from tasks in the database. A task state thread
periodically checks the state of all the task instances in each priority queue. Given a
particular state, the task state thread causes the appropriate action to be taken. For
example, to implement the timeout mechanism, we set a timeout for each task instance
in the primary work queue. Periodically, the thread checks the state of each task and
if a task has timed out, it places a replica in the secondary queue.
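A minimal sketch of this two-level queue follows. Class and method names are ours, and the real implementation works against the XtremWeb task database with a separate task state thread rather than in-memory lists; the sketch only captures the ordering and timeout rules described above.

```python
class TwoLevelQueue:
    """Sketch of the scheduler's two-level work queue (names are ours,
    not XtremWeb's). Originals live in the primary queue; replicas
    created on timeout go to the secondary queue, so every original is
    handed out before any replica."""

    def __init__(self):
        self.primary, self.secondary = [], []
        self.running = {}  # task_id -> timeout deadline for the original

    def submit(self, task_id):
        self.primary.append(task_id)

    def next_task(self, now, timeout):
        # Expire timed-out originals into the secondary queue; only
        # originals are allowed to time out, so each times out once.
        for tid, deadline in list(self.running.items()):
            if now >= deadline:
                self.secondary.append(tid)
                del self.running[tid]
        if self.primary:
            tid = self.primary.pop(0)
            self.running[tid] = now + timeout
            return tid, 'original'
        if self.secondary:
            return self.secondary.pop(0), 'replica'
        return None

    def original_failed(self, task_id):
        # A failed original is resubmitted; failed replicas are dropped.
        self.running.pop(task_id, None)
        self.primary.append(task_id)
```

Under this scheme a starved task is impossible: replicas are only handed out when the primary queue is empty.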
VII.B.2 Makespan Predictor
As EXCL-PRED-TO depends on a makespan prediction, we use the formula
described in Section V.B.2 to predict the application’s makespan. This requires having a
predicted aggregate operations completed per second. This rate could be determined by
submitting either real or measurement tasks (consisting of some number of operations
per task) to all the workers for a short duration. For instance, XtremWeb records the
start and completion time each task instance. Given the number of operations per task,
a separate thread of the scheduler could be responsible for computing a daily average
over all hosts. Then, counting the number of tasks to be completed per application, the
thread could make a rough estimate of application completion and the main scheduling
thread could then use this estimate for the EXCL-PRED-TO heuristic. We found that
the aggregate operations per second remains relatively constant throughout time (see
Section V.B.2), and so this estimate could be accurate for several days.
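The estimate described above amounts to dividing total remaining work by the measured aggregate rate of the platform. The sketch below is one plausible reading of that computation; the function names and the per-host averaging rule are our assumptions, not the exact formula of Section V.B.2.

```python
def aggregate_rate(records):
    """Daily aggregate operations per second of the platform. `records`
    holds (host, ops, start, finish) for completed instances, as could
    be derived from XtremWeb's per-instance start/completion logs.
    One plausible reading: average each host's observed rate over the
    day, then sum across hosts."""
    by_host = {}
    for host, ops, start, finish in records:
        by_host.setdefault(host, []).append(ops / (finish - start))
    return sum(sum(rs) / len(rs) for rs in by_host.values())

def predict_makespan(ops_per_task, n_tasks, agg_ops_per_sec):
    """Rough makespan estimate: total work over aggregate rate."""
    return (ops_per_task * n_tasks) / agg_ops_per_sec

# e.g., two hosts averaging 10 and 30 ops/s give a 40 ops/s platform
rate = aggregate_rate([('a', 100, 0, 10), ('a', 200, 10, 30),
                       ('b', 300, 0, 10)])
```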
Chapter VIII
Conclusion
VIII.A Summary of Contributions
Desktop grids are an attractive platform for executing large computations be-
cause they offer a high return on investment by using the idle capacity of an existing
computing infrastructure. Projects from a wide range of scientific domains have utilized
TeraFlops of computing power offered by hundreds of thousands of mostly desktop PCs.
The applications in most of these projects are high-throughput, task parallel, and com-
pute bound. In this dissertation, we studied how to schedule an application that requires
rapid turnaround in an effort to broaden the types of applications executable in desktop
grid environments. To this end, we made the following contributions:
Measurement and characterization of real enterprise desktop grids. We char-
acterized four real desktop grid platforms using an accurate measurement technique that
captured performance exactly as what would be perceived by a real application. Using
this measurement data, we characterized the temporal structure of availability of each
platform as a whole and of individual hosts. Both the measurement data and character-
ization could be used to drive simulations or serve as the basis for forming predictive,
explanatory, or generative models. With respect to modelling, we found a number of
pertinent statistics. For instance, we found that task failure rate is correlated with task
length and that availability is not correlated with host clock rates.
Resource prioritization and exclusion heuristics. We used this characterization
to develop novel and effective resource prioritization and resource exclusion heuristics for
scheduling short-lived applications. We found that using static clock rate information
to prioritize hosts can often improve performance; however, the performance of using
prioritization alone depends on the number of tasks in the application relative to the
number of hosts, and whether tasks are assigned to “poor” hosts. We then adapted our
prioritization heuristic to exclude “poor” hosts from the application execution. We found
that using a fixed threshold to filter hosts was beneficial but application performance was
dependent on the distribution of the clock rates. To lessen this dependence, we developed
a heuristic that used a predicted makespan to eliminate hosts from application execution.
While this heuristic was less sensitive to the clock rate distribution, it was less aggressive
in exclusion, and for smaller applications it performed slightly worse. The benefit of
using the predicted makespan to eliminate hosts became more obvious after we combined
the heuristic with task replication.
Task replication heuristics. We studied the use of proactive, reactive, and hybrid
replication techniques by combining task replication with our best resource exclusion and
prioritization heuristics. We found that a heuristic that uses a makespan predictor with
reactive replication by means of timeouts is the most effective in practice; the makespan
predictor is essential for eliminating “poor” hosts and also for setting the timeouts of
each task such that waste is relatively low. The reason timeouts are so effective is that
platforms often have a large portion of relatively stable hosts. Because volatility is
negatively correlated with clock rates and our best replication heuristic prioritizes tasks
and hosts according to clock rates, the probability of failure is reduced dramatically after
replicating a task once. Surprisingly, this heuristic often achieves similar performance
with relatively less waste compared to other heuristics that replicate more aggressively
and/or use more dynamic information about the resources. Our best heuristic often
performs within a factor of 1.7 of optimal.
Scheduler prototype. We show the feasibility of our heuristic by implementing a
scheduler prototype in a real desktop grid system. This heuristic was incorporated into
the real open-source desktop grid system XtremWeb. We believe that the scheduler will
improve the performance of short-lived applications.
VIII.B Future Work
There are a number of ways in which this work can be extended in terms of
measurement and characterization, and the types of applications studied:
Characterization of Internet desktop grids. We designed our heuristics so that
they would be applicable and effective in Internet environments. However, because we did
not have traces of Internet desktop grids, we could not prove the heuristics’ effectiveness
in Internet environments. The collection of Internet desktop grid trace data is currently
being conducted by the Recovery Oriented Computing group at U.C. Berkeley [17].
Given that data, we would be able to evaluate our heuristics on Internet desktop grids.
Characterization of memory and network connectivity. Clearly, applications
use other resources in addition to the CPU. As an extension to our characterization of
exec availability, it would be useful to characterize other resource usage data, such as
memory allocation or network traffic. This would improve the accuracy of our platform
model.
Scheduling applications with dependencies. Another interesting class of appli-
cations is the class with dependencies among tasks. It would be interesting to use our
characterization data to study the costs and benefits of running applications with task de-
pendencies. The fact that hosts in some desktop grid environments appear independent
could simplify performance modelling of such applications. We believe our probabilistic
model of task completion described in Chapter VI will aid in the analysis of scheduling
applications with task dependencies.
Scheduling multiple applications on the same desktop grid. A desktop grid
application does not always have the luxury of using the entire platform exclusively. It
would be useful to investigate the scenario where multiple applications are competing for
the same set of resources. Given that the costs of using desktop resources for users that
submit applications can be quite low, applications are often very large; at the same time
there may be users that require rapid application turnaround. How to balance system
throughput and response time while promoting fairness among users is an interesting
research direction. Also, the performance of EXCL-PRED-TO was dependent on the
existence of stable hosts in the platform, and so if, in the scenario of
multiple applications, those hosts are being used by other applications, then
the REP-PROB heuristic may in fact prove beneficial compared to EXCL-PRED-TO.
Toward these ends, we believe that the work in this thesis will be a helpful
stepping stone for future desktop grid research.
Appendix A
Defining the IQR Factor
A.A IQR Sensitivity
Figure A.1: Cause of Laggers (IQR factor of .5) on SDSC Grid. 1→FCFS. 2→PRI-CR.
3→EXCL-S.5. 4→EXCL-PRED. 5→EXCL-PRED-TO. 6→REP-PROB. (Panels show the number
of laggers, by cause (slow host or failed task), for 100, 200, and 400 tasks per
application and task sizes of 5, 15, and 35 minutes.)
Figure A.2: Cause of Laggers (IQR factor of 1.5) on SDSC Grid. 1→FCFS.
2→PRI-CR. 3→EXCL-S.5. 4→EXCL-PRED. 5→EXCL-PRED-TO. 6→REP-PROB.
Appendix B
Additional Resource Selection
and Exclusion Results and
Discussion
Figure B.1: Performance of resource selection heuristics on the DEUG grid.
(Average makespan relative to optimal versus task length in minutes on a
dedicated 1.5GHz host, grouped by 100, 200, and 400 tasks per application, for
FCFS, CR, EXCL-S.5, and EXCL-PRED.)
On the UCB platform, the host clock rates are all identical. So EXCL-S.5 will
exclude all the hosts, and no results corresponding to EXCL-S.5 are shown. EXCL-PRED
and CR perform worse than FCFS mainly because they prioritize resources according to
Figure B.2: Performance of resource selection heuristics on the LRI grid.
(Average makespan relative to optimal versus task length, grouped by 100, 200,
and 400 tasks per application, for FCFS, CR, EXCL-S.5, and EXCL-PRED.)
Figure B.3: Performance of resource selection heuristics on the UCB grid.
(Average makespan relative to optimal versus task length, grouped by 100, 200,
and 400 tasks per application, for FCFS, CR, EXCL-S.5, and EXCL-PRED.)
the clock rates, whereas FCFS prioritizes resources according to their time of arrival in the queue. FCFS will therefore tend to assign tasks to the resources that have been available the longest (which tend to have a higher probability of task completion), whereas PRI-CR assigns tasks to hosts essentially at random with respect to availability. While FCFS outperforms PRI-CR in this particular scenario, we find that on all the other platforms PRI-CR outperforms or performs as well as FCFS.
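The operative difference between the two policies is just the key used to rank idle hosts. The sketch below is illustrative only; the `Host` fields and the host list are assumptions, not the simulator's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    clock_ghz: float    # host clock rate
    avail_since: float  # time at which the host entered the idle queue

def fcfs_pick(idle_hosts):
    # FCFS: give the task to the host that has been available the longest,
    # i.e. the one that entered the idle queue earliest. Long-available
    # hosts tend to be in long idle stretches, so the task has a higher
    # probability of completing there.
    return min(idle_hosts, key=lambda h: h.avail_since)

def pri_cr_pick(idle_hosts):
    # PRI-CR: give the task to the host with the highest clock rate. When
    # all clock rates are identical (as on UCB), every host ties and the
    # choice ignores availability history entirely.
    return max(idle_hosts, key=lambda h: h.clock_ghz)

hosts = [Host("a", 1.5, avail_since=10.0),
         Host("b", 1.5, avail_since=3.0)]
print(fcfs_pick(hosts).name)    # prints "b": idle since t=3.0, available longest
print(pri_cr_pick(hosts).name)  # prints "a": clock-rate tie, max() keeps the first
```

With identical clock rates, PRI-CR's ranking carries no information, which is exactly why FCFS's availability-based ordering wins on UCB.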
Appendix C

Additional Task Replication Results and Discussion
C.A Proactive Replication
Figure C.1: Performance of proactive replication heuristics on DEUG grid.
Similar to the replication heuristics described in the previous section, EXCL-PRED-DUP-TIME and EXCL-PRED-DUP-TIME-SPD are wasteful in their use of resources (see Figure C.4). The reason is that tasks are always replicated, regardless of
Figure C.2: Performance of proactive replication heuristics on LRI grid.
Figure C.3: Performance of proactive replication heuristics on UCB grid.
the speed or reliability of the host to which the original task instance was first assigned.
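The waste can be seen in a toy comparison between unconditional replication and a variant that checks the original placement first. The field names and thresholds below are illustrative assumptions, not the dissertation's actual heuristic definitions.

```python
def replicate_always(task):
    # Unconditional proactive replication in the style criticized above:
    # every task gets a duplicate, whatever its original host looks like.
    return True

def replicate_if_risky(task, fast_ghz=1.5, reliable_p=0.9):
    # Hypothetical conditional variant: duplicate only when the original
    # host is slow or has a low predicted probability of task completion.
    host = task["host"]
    return host["clock_ghz"] < fast_ghz or host["p_complete"] < reliable_p

tasks = [
    {"host": {"clock_ghz": 2.0, "p_complete": 0.95}},  # fast and reliable
    {"host": {"clock_ghz": 0.8, "p_complete": 0.95}},  # slow
    {"host": {"clock_ghz": 2.0, "p_complete": 0.50}},  # unreliable
]

waste_always = sum(replicate_always(t) for t in tasks) / len(tasks)
waste_risky = sum(replicate_if_risky(t) for t in tasks) / len(tasks)
print(waste_always)  # 1.0: every task replicated
print(waste_risky)   # about 0.67: the safe placement is spared a duplicate
```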
Figure C.4: Waste of proactive replication heuristics with EXCL-PRED-DUP-TIME and EXCL-PRED-DUP-TIME-SPD.
Waste for heuristic EXCL-S.5-DUP is not shown in Figure C.7 because the
hosts in UCB all had the same clock rates and so the heuristic excluded all hosts.
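One plausible reading of this degenerate case (an assumption for illustration; the precise definition of EXCL-S.5 is given in the main text) is that the heuristic drops every host whose clock rate falls at or below the platform's median rate:

```python
import statistics

def excl_s5(clock_rates_ghz):
    # Assumed EXCL-S.5-style rule: exclude any host whose clock rate is at
    # or below the median rate. On a platform where every clock rate is
    # identical, the median equals every host's rate, so no host survives.
    cutoff = statistics.median(clock_rates_ghz)
    return [g for g in clock_rates_ghz if g > cutoff]

print(excl_s5([1.0, 1.5, 2.0, 3.0]))  # prints [2.0, 3.0]: faster half kept
print(excl_s5([1.5, 1.5, 1.5, 1.5]))  # prints []: homogeneous rates, all excluded
```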
C.B Reactive Replication
C.C Hybrid Replication
Figure C.5: Waste of proactive replication heuristics on DEUG grid.
Figure C.6: Waste of proactive replication heuristics on LRI grid.
Figure C.7: Waste of proactive replication heuristics on UCB grid.
Figure C.8: Performance of proactive replication heuristics when varying replication level on SDSC grid.
Figure C.9: Performance of reactive replication heuristics on DEUG grid.
Figure C.10: Performance of reactive replication heuristics on LRI grid.
Figure C.11: Performance of reactive replication heuristics on UCB grid.
Figure C.12: Waste of reactive replication heuristics on DEUG grid.
Figure C.13: Waste of reactive replication heuristics on LRI grid.
Figure C.14: Waste of reactive replication heuristics on UCB grid.
Figure C.15: Performance of hybrid replication heuristic on DEUG grid.
Figure C.16: Performance of hybrid replication heuristic on LRI grid.
Figure C.17: Performance of hybrid replication heuristic on UCB grid.
Figure C.18: Waste of hybrid replication heuristic on DEUG grid.
Figure C.19: Waste of hybrid replication heuristic on LRI grid.
Figure C.20: Waste of hybrid replication heuristic on UCB grid.
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100             548     +0.18       +0.31       +0.37         119          530
200             816     +0.22       +0.27       +0.41         201          802
400            1355     +0.23       +0.31       +0.42         309         1321
(a) 5 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            1597     +0.24       +0.28       +0.33         375         1495
200            2341     +0.20       +0.29       +0.39         571         2308
400            3902     +0.21       +0.29       +0.41         875         3924
(b) 15 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            3917     +0.20       +0.28       +0.33         902         3853
200            5875     +0.18       +0.26       +0.36        1286         6126
400            9725     +0.17       +0.23       +0.27        1731         9710
(c) 35 min. tasks
Table C.1: Makespan statistics for the DEUG platform. Lower confidence intervals are
w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds.
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100             694     +0.21       +0.21       +0.21         131          643
200             959     +0.17       +0.17       +0.17         105          969
400            1699     +0.18       +0.32       +0.32         285         1587
(a) 5 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            1875     +0.23       +0.23       +0.23         350         1652
200            2692     +0.14       +0.14       +0.14         276         2666
400            4740     +0.20       +0.31       +0.31         787         4589
(b) 15 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            4289     +0.22       +0.22       +0.22         843         3758
200            6218     +0.12       +0.12       +0.12         577         6080
400           10806     +0.19       +0.31       +0.31        1830        10579
(c) 35 min. tasks
Table C.2: Makespan statistics for the LRI platform. Lower confidence intervals are
w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds.
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100             773     +0.10       +0.25       +0.39         140          720
200            1234     +0.06       +0.12       +0.29         223         1273
400         1707.27     +0.02       +0.14       +0.22         172         1657
(a) 5 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            2520     +0.16       +0.27       +0.34         514         2455
200            3108     +0.09       +0.21       +0.31         429         2853
400            4919     +0.04       +0.10       +0.18         448         4736
(b) 15 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            6085     +0.16       +0.29       +0.37        1327         6001
200            7989     +0.13       +0.20       +0.27        1272         7816
400           12054     +0.08       +0.14       +0.19        1278        11782
(c) 35 min. tasks
Table C.3: Makespan statistics for the UCB platform. Lower confidence intervals are
w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds.
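The per-row statistics in Tables C.1 through C.3 can be reproduced from raw makespan samples along the following lines. This is a sketch under a normal approximation with assumed z-scores; the dissertation's exact interval estimator may differ, and the sample makespans are made up for illustration.

```python
import statistics
from math import sqrt

# Two-sided normal z-scores for the three confidence levels in the tables.
Z = {0.80: 1.282, 0.90: 1.645, 0.95: 1.960}

def makespan_stats(samples):
    # Returns mean, std. dev., and median in seconds, plus confidence-
    # interval half-widths expressed as a fraction of the mean, matching
    # the "+0.xx" columns of the tables.
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    median = statistics.median(samples)
    n = len(samples)
    ci = {level: z * sd / sqrt(n) / mean for level, z in Z.items()}
    return mean, sd, median, ci

# Made-up makespan samples (seconds), for illustration only.
mean, sd, median, ci = makespan_stats([530, 548, 610, 470, 590, 560])
print(round(mean), round(median))      # sample mean and median in seconds
print(ci[0.80] < ci[0.90] < ci[0.95])  # wider intervals at higher confidence
```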