UNIVERSITY OF CALIFORNIA, SAN DIEGO
Scheduling Task Parallel Applications For Rapid Turnaround on Desktop Grids
A dissertation submitted in partial satisfaction of the
requirements for the degree Doctor of Philosophy
in Computer Science and Engineering
by
Derrick Kondo
Committee in charge:
Professor Henri Casanova, Co-Chairman
Professor Andrew A. Chien, Co-Chairman
Professor Phillip Bourne
Professor Larry Carter
Professor Rich Wolski
2005
Copyright
Derrick Kondo, 2005
All rights reserved.
The dissertation of Derrick Kondo is approved, and it is acceptable in quality and form for publication on microfilm:
Co-Chair
Co-Chair
University of California, San Diego
2005
TABLE OF CONTENTS
Signature Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Vita, Publications, and Fields of Study . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
I Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   A. Desktop Grids: Past and Present . . . . . . . . . . . . . . . . . . . . 2
   B. Prospects and Challenges . . . . . . . . . . . . . . . . . . . . . . . . 5
   C. Goal, Motivation, and Approach . . . . . . . . . . . . . . . . . . . . . 8
   D. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
II Desktop Grid System Design and Implementation: State of the Art . . . . . 12
   A. Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
   B. System Anatomy and Physiology . . . . . . . . . . . . . . . . . . . . . 14
      1. Client Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
      2. Application and Resource Management Level . . . . . . . . . . . . . 16
      3. Worker Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
         a. Worker Daemon . . . . . . . . . . . . . . . . . . . . . . . . . . 17
         b. Worker Sandbox . . . . . . . . . . . . . . . . . . . . . . . . . 19
      4. Design Trade-offs of Centralization . . . . . . . . . . . . . . . . 20
         a. Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
         b. Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . 22
III Resource Characterization . . . . . . . . . . . . . . . . . . . . . . . . 25
   A. The Ideal Resource Trace . . . . . . . . . . . . . . . . . . . . . . . 25
   B. Related Work on Resource Measurements and Modelling . . . . . . . . . . 27
      1. Host Availability . . . . . . . . . . . . . . . . . . . . . . . . . 27
      2. Host Load and CPU Utilization . . . . . . . . . . . . . . . . . . . 28
      3. Process Lifetimes . . . . . . . . . . . . . . . . . . . . . . . . . 29
   C. Trace Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
   D. Trace Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
      1. SDSC Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
      2. DEUG and LRI Traces . . . . . . . . . . . . . . . . . . . . . . . . 38
      3. UCB Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
   E. Characterization of Exec Availability . . . . . . . . . . . . . . . . . 40
      1. Number of Hosts Available Over Time . . . . . . . . . . . . . . . . 40
      2. Temporal Structure of Availability . . . . . . . . . . . . . . . . . 44
      3. Temporal Structure of Unavailability . . . . . . . . . . . . . . . . 47
      4. Task Failure Rates . . . . . . . . . . . . . . . . . . . . . . . . . 49
      5. Correlation of Availability Between Hosts . . . . . . . . . . . . . 50
      6. Correlation of Availability with Host Clock Rates . . . . . . . . . 54
   F. Characterization of CPU Availability . . . . . . . . . . . . . . . . . 57
      1. Aggregate CPU Availability . . . . . . . . . . . . . . . . . . . . . 57
      2. Per Host CPU Availability . . . . . . . . . . . . . . . . . . . . . 60
   G. An Example of Applying Characterization Results: Cluster Equivalence . 65
      1. System Performance Model . . . . . . . . . . . . . . . . . . . . . . 65
      2. Cluster Equivalence . . . . . . . . . . . . . . . . . . . . . . . . 67
   H. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
IV Resource Management: Methods, Models, and Metrics . . . . . . . . . . . . 72
   A. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
   B. Models and Instantiations . . . . . . . . . . . . . . . . . . . . . . . 75
      1. Platform model and instantiation . . . . . . . . . . . . . . . . . . 76
      2. Application model and instantiation . . . . . . . . . . . . . . . . 80
   C. Proposed Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 81
   D. Measuring and Analyzing Performance . . . . . . . . . . . . . . . . . . 82
      1. Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . 82
      2. Method of Performance Analysis . . . . . . . . . . . . . . . . . . . 83
   E. Computing the Optimal Makespan . . . . . . . . . . . . . . . . . . . . 87
      1. Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 88
      2. Single Availability Interval On A Single Host . . . . . . . . . . . 89
         a. Scheduling Algorithm . . . . . . . . . . . . . . . . . . . . . . 89
         b. Proof of Optimality . . . . . . . . . . . . . . . . . . . . . . . 90
      3. Multiple Availability Intervals On A Single Host . . . . . . . . . . 94
      4. Multiple Availability Intervals On Multiple Hosts . . . . . . . . . 95
      5. Optimal Makespan with Checkpointing Enabled . . . . . . . . . . . . 97
V Resource Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
   A. Resource Prioritization . . . . . . . . . . . . . . . . . . . . . . . . 99
      1. Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
      2. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . 101
   B. Resource Exclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 106
      1. Excluding Resources By Clock Rate . . . . . . . . . . . . . . . . . 106
      2. Using Makespan Predictions . . . . . . . . . . . . . . . . . . . . . 108
         a. Evaluation on Different Desktop Grids . . . . . . . . . . . . . . 110
   C. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
   D. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
VI Task Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
   A. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
   B. Measuring and Analyzing Performance . . . . . . . . . . . . . . . . . . 122
      1. Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . 122
      2. Method of Performance Analysis . . . . . . . . . . . . . . . . . . . 122
   C. Proactive Replication Heuristics . . . . . . . . . . . . . . . . . . . 122
      1. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . 123
   D. Reactive Replication Heuristics . . . . . . . . . . . . . . . . . . . . 126
      1. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . 127
   E. Hybrid Replication Heuristics . . . . . . . . . . . . . . . . . . . . . 129
      1. Feasibility of Predicting Probability of Task Completion . . . . . . 131
      2. Probabilistic Model of Task Completion . . . . . . . . . . . . . . . 131
      3. REP-PROB Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . 137
      4. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . 138
      5. Evaluating the benefits of REP-PROB . . . . . . . . . . . . . . . . 142
   F. Estimating application performance . . . . . . . . . . . . . . . . . . 145
   G. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
      1. Task replication . . . . . . . . . . . . . . . . . . . . . . . . . . 148
      2. Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
   H. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
VII Scheduler Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
   A. Overview of the XtremWeb Scheduling System . . . . . . . . . . . . . . 154
   B. EXCL-PRED-TO Heuristic Design and Implementation . . . . . . . . . . . 155
      1. Task Priority Queue . . . . . . . . . . . . . . . . . . . . . . . . 155
      2. Makespan Predictor . . . . . . . . . . . . . . . . . . . . . . . . . 156
VIII Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
   A. Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . 157
   B. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
A Defining the IQR Factor . . . . . . . . . . . . . . . . . . . . . . . . . . 161
   A. IQR Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B Additional Resource Selection and Exclusion Results and Discussion . . . . . 164
C Additional Task Replication Results and Discussion . . . . . . . . . . . . 167
   A. Proactive Replication . . . . . . . . . . . . . . . . . . . . . . . . . 167
   B. Reactive Replication . . . . . . . . . . . . . . . . . . . . . . . . . 169
   C. Hybrid Replication . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
VITA
1999    B.S. in Computer Science, Stanford University

2002    M.S. in Computer Science and Engineering, University of California, San Diego
PUBLICATIONS
D. Kondo, A. A. Chien, and H. Casanova. Resource Management for Rapid Application Turnaround on Enterprise Desktop Grids. In Proceedings of the ACM Conference on High Performance Computing and Networking (SC2005), November 2005, Pittsburgh, Pennsylvania.
D. Kondo and H. Casanova. Computing the Optimal Makespan for Jobs with Identical and Independent Tasks Scheduled on Volatile Hosts. Technical Report CS2004-0796, Dept. of Computer Science and Engineering, University of California at San Diego, July 2004.
D. Kondo, M. Taufer, C. Brooks, H. Casanova, and A. A. Chien. Characterizing and Evaluating Desktop Grids: An Empirical Study. In Proceedings of the International Parallel and Distributed Processing Symposium 2004, May 2004.
D. Kondo, H. Casanova, E. Wing, and F. Berman. Models and Scheduling Mechanisms for Global Computing Applications. In Proceedings of the International Parallel and Distributed Processing Symposium 2002, April 2002, Fort Lauderdale, Florida.
S. Joseph, M. Whirl, D. Kondo, H. Noller, and R. Altman. Calculation of the Relative Geometry of tRNAs in the Ribosome from Directed Hydroxyl-Radical Probing Data. RNA 6:220-232, 2000.
FIELDS OF STUDY
Major Field: Computer Science
   Studies in Parallel and Distributed Computing
   Professor Henri Casanova

Major Field: Computer Science
   Studies in Computational Biology
   Professor Russ B. Altman
LIST OF FIGURES
II.1  A Common Anatomy of Desktop Grid Systems . . . . . . . . . . . . . . . 16
II.2  CPU Availability During Task Execution . . . . . . . . . . . . . . . . 19
III.1  Distribution of “small” gaps (<2 min.) . . . . . . . . . . . . . . . . 34
III.2  Host clock rate distribution in each platform . . . . . . . . . . . . 38
III.3  Number of hosts available for a given week for each platform . . . . . 42
III.3  (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
III.4  Cumulative distribution of the length of availability intervals in terms of time for business hours and non-business hours . . . . . . . . . . 44
III.5  Cumulative distribution of the length of availability intervals normalized to total duration of availability in terms of time for business hours and non-business hours for the UCB platform . . . . . . . . . . . . 46
III.6  Cumulative distribution of the length of availability intervals in terms of operations for business hours and non-business hours . . . . . . 47
III.7  Unavailability intervals in terms of hours . . . . . . . . . . . . . . 48
III.8  Task failure rates during business hours . . . . . . . . . . . . . . . 49
III.9  Correlation of availability . . . . . . . . . . . . . . . . . . . . . 53
III.10 Percentage of time when CPU availability is above a given threshold, over all hosts, for business hours and non-business hours . . . . . . . . 58
III.11 CPU availability per host in SDSC platform . . . . . . . . . . . . . . 61
III.12 CPU availability per host in DEUG platform . . . . . . . . . . . . . . 62
III.13 CPU availability per host in LRI platform . . . . . . . . . . . . . . 63
III.14 CPU availability per host in UCB platform . . . . . . . . . . . . . . 64
III.15 Model of application work rate for the entire SDSC desktop grid, in number of operations per second versus task size, in number of minutes of dedicated CPU time on a 1.5GHz host . . . . . . . . . . . . . . . . . 67
III.16 Cluster equivalence of a desktop grid CPU as a function of the application task size. Two lines are shown, one for the resources on weekdays and one for weekends . . . . . . . . . . . . . . . . . . . . . . . . . . 68
III.17 Cumulative percentage of total platform computational power for SDSC hosts sorted by decreasing effectively delivered computational power and for hosts sorted by clock rates . . . . . . . . . . . . . . . . . . . . . 69
IV.1  Cumulative task completion vs. time . . . . . . . . . . . . . . . . . . 75
IV.2  Scheduling Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
IV.3  Cumulative clock rate distributions from real and simulated platforms . 79
IV.4  Laggers for an application with 400 tasks . . . . . . . . . . . . . . . 85
IV.5  INTG: helper function for the scheduling algorithm . . . . . . . . . . 90
IV.6  Scheduling algorithm over a single availability interval . . . . . . . 91
IV.7  An example of task execution for OPTINTV (higher) and OPTDELAY (lower) at the beginning of the job. Both jobs arrive at the same time. In the case of OPTINTV, the first task is scheduled immediately and an overhead of h is incurred. In the case of OPTDELAY, the scheduler waits for a period of w1 before scheduling the task . . . . . . . . . . . . . . 92
IV.8  An example of task execution for OPTINTV (higher) and OPTDELAY (lower) in the middle of the job . . . . . . . . . . . . . . . . . . . . . 93
IV.9  Scheduling algorithm over multiple availability intervals . . . . . . . 95
IV.10 Scheduling algorithm over multiple availability intervals over multiple hosts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
V.1  Subintervals denoted by the double arrows for each availability interval. The length of each subinterval is shown, and the subinterval lengths differ by 10 seconds . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
V.2  Performance of resource prioritization heuristics on the SDSC grid . . . 103
V.3  Complementary CDF of Prediction Error When Using Expected Operations or Time Per Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
V.4  Number of tasks to be scheduled (left y-axis) and hosts available (right y-axis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
V.5  Performance of heuristics using thresholds on SDSC grid . . . . . . . . 107
V.6  Heuristic performance on the SDSC grid . . . . . . . . . . . . . . . . . 110
V.7  Cause of Laggers (IQR factor of 1) on SDSC Grid. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED . . . . . . . . . . . . . . . . . . . . . . 112
V.8  Length of task completion quartiles on SDSC Grid. 0 → OPTIMAL. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED . . . . . . . . . . . . . 113
V.9  Heuristic performance on the GIMPS grid . . . . . . . . . . . . . . . . 114
V.10 Heuristic performance on the LRI-WISC grid . . . . . . . . . . . . . . . 115
VI.1  Performance Of Heuristics Combined With Replication On SDSC Grid . . . 124
VI.2  Waste Of Heuristics Using Proactive Replication On SDSC Grid . . . . . 125
VI.3  Performance of reactive replication heuristics on SDSC grid . . . . . . 127
VI.4  Waste of reactive replication heuristics on SDSC grid . . . . . . . . . 128
VI.5  Probability of task completion per day for several task lengths . . . . 132
VI.6  CDF of prediction errors of the probability of task completion from one day to the next for 5, 15, 35 minute tasks on a dedicated 1.5GHz host . . 133
VI.7  Finite automata for task execution . . . . . . . . . . . . . . . . . . 134
VI.8  Timeline of task completion . . . . . . . . . . . . . . . . . . . . . . 134
VI.9  Performance of REP-PROB on SDSC grid . . . . . . . . . . . . . . . . . 139
VI.10 Waste of REP-PROB on SDSC grid . . . . . . . . . . . . . . . . . . . . 139
VI.11 Cause of Laggers (IQR factor of 1) on SDSC Grid. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. 5 → EXCL-PRED-TO. 6 → REP-PROB . . . . . . 143
VI.12 CDF of task failure rates per host . . . . . . . . . . . . . . . . . . 144
VI.13 Performance difference between EXCL-PRED-TO and transformed UCB-LRI platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
VI.14 Performance of checkpointing heuristics on SDSC grid . . . . . . . . . 150
VI.15 Length of task completion quartiles on SDSC Grid. 0 → OPTIMAL. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. 5 → EXCL-PRED-TO. 6 → REP-PROB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.1  Cause of Laggers (IQR factor of .5) on SDSC Grid. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. 5 → EXCL-PRED-TO. 6 → REP-PROB . . . . . . 162
A.2  Cause of Laggers (IQR factor of 1.5) on SDSC Grid. 1 → FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. 5 → EXCL-PRED-TO. 6 → REP-PROB . . . . . . 163
B.1  Performance of resource selection heuristics on the DEUG grid . . . . . 164
B.2  Performance of resource selection heuristics on the LRI grid . . . . . . 165
B.3  Performance of resource selection heuristics on the UCB grid . . . . . . 165
C.1  Performance of proactive replication heuristics on DEUG grid . . . . . . 167
C.2  Performance of proactive replication heuristics on LRI grid . . . . . . 168
C.3  Performance of proactive replication heuristics on UCB grid . . . . . . 168
C.4  Waste of proactive replication heuristics with EXCL-PRED-DUP-TIME and EXCL-DUP-TIME-SPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
C.5  Waste of proactive replication heuristics on DEUG grid . . . . . . . . . 170
C.6  Waste of proactive replication heuristics on LRI grid . . . . . . . . . 170
C.7  Waste of proactive replication heuristics on UCB grid . . . . . . . . . 171
C.8  Performance of proactive replication heuristics when varying replication level on SDSC grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
C.9  Performance of reactive replication heuristics on DEUG grid . . . . . . 172
C.10 Performance of reactive replication heuristics on LRI grid . . . . . . . 172
C.11 Performance of reactive replication heuristics on UCB grid . . . . . . . 173
C.12 Waste of reactive replication heuristics on DEUG grid . . . . . . . . . 173
C.13 Waste of reactive replication heuristics on LRI grid . . . . . . . . . . 174
C.14 Waste of reactive replication heuristics on UCB grid . . . . . . . . . . 174
C.15 Performance of hybrid replication heuristic on DEUG grid . . . . . . . . 175
C.16 Performance of hybrid replication heuristic on LRI grid . . . . . . . . 175
C.17 Performance of hybrid replication heuristic on UCB grid . . . . . . . . 176
C.18 Waste of hybrid replication heuristic on DEUG grid . . . . . . . . . . . 176
C.19 Waste of hybrid replication heuristic on LRI grid . . . . . . . . . . . 177
C.20 Waste of hybrid replication heuristic on UCB grid . . . . . . . . . . . 177
LIST OF TABLES
I.1 Characteristics of desktop grid applications [63] . . . . . . . . . . . . . . . 7
III.1  Characteristics of desktop grid applications. (Deriv. denotes “derivable”) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
III.2  Correlation of host clock rate and other machine characteristics during business hours for the SDSC trace . . . . . . . . . . . . . . . . . . 55
III.3  Correlation of host clock rate and failure rate during business hours. Task size is in terms of minutes on a dedicated 1.5GHz host . . . . . . . 56
IV.1 Qualitative platform descriptions. . . . . . . . . . . . . . . . . . . . . . . . 79
VI.1  Mean performance difference relative to EXCL-PRED-DUP when increasing the number of replicas per task . . . . . . . . . . . . . . . . . . . . . 124
VI.2  Mean performance difference and waste difference between EXCL-PRED-DUP and EXCL-PRED-TO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
VI.3  Mean performance and waste difference between EXCL-PRED-TO and REP-PROB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
VI.4  Makespan statistics of EXCL-PRED-TO for the SDSC platform. Lower confidence intervals are w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds . . . . . . . . . . . . . . . . . . 147
VI.5 Summary of replication heuristics. . . . . . . . . . . . . . . . . . . . . . . 152
C.1  Makespan statistics for the DEUG platform. Lower confidence intervals are w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
C.2  Makespan statistics for the LRI platform. Lower confidence intervals are w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
C.3  Makespan statistics for the UCB platform. Lower confidence intervals are w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
ABSTRACT OF THE DISSERTATION
Scheduling Task Parallel Applications For Rapid Turnaround on Desktop Grids
by
Derrick Kondo
Doctor of Philosophy in Computer Science and Engineering
University of California, San Diego, 2005
Professors Henri Casanova and Andrew Chien, Co-Chairs
Since the early 1990’s, the largest distributed computing systems in the world
have been desktop grids, which use the idle cycles of mainly desktop PC’s to support large
scale computation. Despite the enormous computing power offered by such systems, the
range of supportable applications is largely limited to task parallel, compute-bound, and
high-throughput applications. This limitation is mainly because of the heterogeneity and
volatility of the underlying resources, which are shared with the desktop users. Our work
focuses on broadening the applications supportable by desktop grids, and in particular,
we focus on the development of scheduling heuristics to enable rapid turnaround for
short-lived applications.
To that end, the contributions of this dissertation are as follows. First, we mea-
sure and characterize four real enterprise desktop grid systems; such characterization is
essential for accurate modelling and simulation. Second, using the characterization, we
design scheduling heuristics that enable rapid application turnaround. These heuristics
are based on three scheduling techniques, namely resource prioritization, resource exclu-
sion, and task replication. We find that our best heuristic uses relatively static resource
information for prioritization and exclusion, and reactive task replication to achieve per-
formance within a factor of 1.7 of optimal. Third, we implement our best heuristic in a
real desktop grid system to demonstrate its feasibility.
Chapter I
Introduction
Since the late 1990’s, the largest distributed computing systems in the world
have been desktop grids, which aggregate the idle CPU cycles of mostly desktop PC’s to
support large scale computations. The main motivation for using desktop grids is that
these platforms offer high computational power at low cost. That is, one can reuse an
existing infrastructure of resources (e.g., systems staff, machine hardware) to support
large computational demands. Numerous studies have shown that desktops often have
CPU availability of 80% or more [62, 8], and as desktop PC’s are getting less expensive
and more prevalent [11, 6, 71], the savings in infrastructure costs when using the idle
cycles of desktop PC’s can be as high as a factor of five or ten [22].
Virtually all desktop grid applications that run in wide-area environments are
task parallel, compute-bound, and high-throughput. An application that is task parallel
consists of tasks that are independent of one another. A compute-bound application has
a high computation to communication ratio. A high-throughput application has many
more tasks than the number of hosts.
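The three properties above can be made concrete with a small numeric sketch. All of the figures below (operation counts, bandwidth, task and host counts) are invented for illustration; only the 1.5GHz reference host comes from the text.

```python
# Hypothetical illustration of "compute-bound" and "high-throughput";
# all numbers are invented for this example.

CLOCK_OPS_PER_SEC = 1.5e9    # a dedicated 1.5GHz host, as used in the text
LINK_BYTES_PER_SEC = 1e6     # assumed ~8 Mbps effective bandwidth

def compute_comm_ratio(ops_per_task, bytes_per_task):
    # Seconds of computation per second of communication for one task.
    compute_sec = ops_per_task / CLOCK_OPS_PER_SEC
    comm_sec = bytes_per_task / LINK_BYTES_PER_SEC
    return compute_sec / comm_sec

# A task needing 15 minutes of dedicated CPU time but only 1 MB of I/O:
ratio = compute_comm_ratio(ops_per_task=15 * 60 * CLOCK_OPS_PER_SEC,
                           bytes_per_task=1e6)
# 900 s of compute vs. 1 s of transfer -> ratio of 900: compute-bound.

n_tasks, n_hosts = 10_000, 200
high_throughput = n_tasks > n_hosts   # many more tasks than hosts
```

A ratio this lopsided is why such applications tolerate slow, wide-area links: communication cost is negligible next to computation.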
These desktop grids applications span a wide range of scientific domains in-
cluding computational biology [49, 87, 1, 38], climate modelling [2], physics [4, 5], as-
tronomy [86], and cryptography [3]. Desktop grids enable these applications to utilize
TeraFlops of computing power provided by hundreds of thousands of hosts at relatively
little cost, which has allowed these applications to explore enormous parameter spaces
and/or run simulations at high levels of detail that would otherwise be impossible. For
instance, the Folding@home [49] and Prediction@home [87] projects have resulted in
numerous published discoveries [95, 83, 82, 87] that have furthered our understanding
of protein folding and structure prediction.
The set of hosts scattered over the Internet that participate in these desktop
grid projects is incredibly diverse in terms of usage patterns, hardware configurations,
and network connectivity. For example, many home machines are used as little as 23
hours per month (which averages to 47 minutes per day) [60], while machines in
enterprise environments are often powered on for the entire day. The configurations of
hosts participating in the SETI@home desktop grid project span over 170 operating systems
(including version variants) and 160 CPU types (including family variants) [78]. The
network connectivity of hosts ranges from dial-up and cable/DSL to 10/100/1000 Mbps
Ethernet. Given this diversity of resources, developing a software infrastructure for
harnessing idle cycles has been a challenging endeavor.
I.A Desktop Grids: Past and Present
Soon after computers were first networked together, the notion of using the idle
cycles of desktop PC’s arose. In this section, we describe how desktop grid systems have
evolved since the early 1980’s when the first desktop grid systems were implemented and
deployed. We outline the design of each system and discuss its strengths and weaknesses.
The Xerox Worm [47] was one of the earliest desktop grid systems; it had basic
resource management and security schemes, as well as mechanisms that limited its
resource obtrusiveness. The worm spread itself among the ∼100 machines at Xerox
PARC by sequentially scanning through a list of resource addresses. For each address,
the worm segment would send a probe to a corresponding host, whose response would
indicate its availability. If the host was idle, the worm segment would replicate itself
and begin execution on the new host. During execution, the worm segment avoided disk
accesses entirely to limit its obtrusiveness. The applications that were run included a
telephone alarm clock, image display, and Ethernet network testing. The worm could
grow uncontrollably, and the worm’s only control mechanism was a special kill packet
that could be broadcasted to kill the worm entirely. In principle, the Xerox Worm with
little modification could work in the modern-day Internet, and a number of malicious
worms such as the Blaster worm have used similar spreading mechanisms. The main
challenge would be the controllability and manageability of such a worm, and decentralized
approaches to desktop grid computing are the focus of ongoing research [27, 64, 53].
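The Worm's scan-probe-replicate cycle can be sketched as a toy simulation. All names here (the host table, `probe`, `spread`, `kill_packet`) are hypothetical; the real worm probed machines over the PARC Ethernet rather than consulting an in-memory table.

```python
# Toy simulation of the Xerox Worm's scan-probe-replicate cycle.
# Everything here is a hypothetical stand-in for the real mechanism.

# Simulated address list: host address -> whether the host is idle.
hosts = {"pup-01": True, "pup-02": False, "pup-03": True}

def probe(address):
    # The real worm sent a probe packet and inspected the response;
    # here we simply consult the simulated table.
    return hosts[address]

def spread(addresses):
    # Sequentially scan the address list; a worm segment replicates
    # itself onto every host that reports itself idle.
    segments = []
    for addr in addresses:
        if probe(addr):
            segments.append(addr)  # segment begins execution here
    return segments

def kill_packet(segments):
    # The worm's only control mechanism: a broadcast kill packet
    # terminates every segment at once.
    segments.clear()

colony = spread(list(hosts))   # segments "running" on the idle hosts
kill_packet(colony)            # the broadcast kill empties the colony
```

The sketch also makes the control problem visible: short of the all-or-nothing kill packet, nothing bounds the colony's growth.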
In the mid 1990’s, Java applet-based systems such as Javelin [24] and Bayanihan [75]
allowed Java applications running in a secure sandboxed environment to harvest
the idle cycles of computers distributed across the Internet. Users would use their browsers
to download and execute tasks in the form of a portable Java applet. In addition to
portability, Java applets were executed in a sandbox by most browsers, which reduced the
risk of harmful code being downloaded and executed on the host machine. Despite the
many security mechanisms in place, there were no mechanisms that limited the obtru-
siveness of the running applet. The applet would be able to consume CPU and memory
resources without restriction, and limiting obtrusiveness may have been impossible, as it
often requires inspection of system performance counters, which applets are not allowed
to do. Moreover, the security mechanisms for Java applets could be too restrictive
for desktop grid applications; for example, applets could not read or write to the local file
system. These academic systems were tested only with relatively few hosts (about 30)
and so the robustness and scalability of such systems were never proven. Furthermore,
these systems lacked tools for manageability, which are essential for large systems; for
example, there was no way to ensure that the Java Runtime Environment supported by
the browser was up to date.
At the same time, in the mid 1990’s, the authors in [11] argued that networks
of shared workstations built from commodity components could have similar or even
better performance than Massively Parallel Processor machines (MPP’s) at a fraction
of the cost. The enormous volume at which commodity components were produced re-
duced the production costs dramatically, as the Gordon Bell rule stated that “doubling
volume reduces unit cost to 90%” [11]. Moreover, the engineering lag time of developing
specialized applications, operating systems and hardware for MPP’s made a network of
workstations (NOW’s) an attractive alternative. Indeed, as of November, 2004, commod-
ity clusters make up almost 60% of the list of top 500 supercomputers [89]. A plethora
of research has gone into supporting high performance computing on NOW’s, and to
this end, research in this area has ranged from cooperative file caching to implementing
RAID over a set of workstations [11]. In terms of harnessing the idle computing power of
desktops for compute-intensive tasks, the Condor distributed batch system is one of the
most relevant.
Since its inception in 1991, Condor has been one of the most extensively used
desktop grid systems in enterprise settings. Within the U.S., over 600 Condor pools
exist containing a total of 38,000 hosts [31], with each pool often containing hundreds
of hosts. (The Condor pool in the computer science department at the University of
Wisconsin has over 1000 machines.) The system supports remote checkpointing, process
migration, network data encryption, and recovery from faults of any component of the
system. Numerous operating systems are supported, including Windows (with limited
functionality) and UNIX variants, and installation does not require superuser privileges,
although not all features are supported in this case.
In Condor, a user submits his/her application through a submission daemon
(schedd). An execution daemon startd runs on each resource and is responsible for
managing the task execution, such as checkpointing or terminating the task if there is
other user activity. The Condor scheduler (matchmaker) determines which resources are
suitable for the application and vice versa, using the requirements of the application
(e.g., only machines with clock rates > 1 GHz) specified through the schedd and using
the requirements of the resources (e.g., the task can only run at night) specified through
the startd. Then, the schedd and startd contact each other to bind a task to a specific
resource.
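The two-way matching described above can be sketched as follows. This is a minimal illustration of the idea, not Condor's actual ClassAd syntax; the attribute names and the nighttime check are hypothetical.

```python
# Sketch of two-way matchmaking in the spirit of Condor's matchmaker.
# Attribute names and predicates are illustrative, not real ClassAds.

def matches(task_req, task_attrs, host_req, host_attrs):
    """A task/host pair matches only when each side's requirement
    predicate is satisfied by the other side's attributes."""
    return task_req(host_attrs) and host_req(task_attrs)

# Application requirement: only machines with clock rates > 1 GHz.
task_req = lambda host: host["clock_ghz"] > 1.0
task_attrs = {"owner": "alice", "hour_submitted": 22}

# Resource requirement: the task can only run at night (hypothetical check).
host_req = lambda task: task["hour_submitted"] >= 20
host_attrs = {"clock_ghz": 2.8}

result = matches(task_req, task_attrs, host_req, host_attrs)
print(result)  # True: both requirement predicates are satisfied
```

Once the matchmaker finds such a pair, the schedd and startd contact each other directly to bind the task, as described above.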
Because Condor was designed primarily for local area environments, running
in wide-area environments where hosts can be in private networks behind firewalls is
problematic. In particular, Condor often uses UDP instead of TCP for communication
which makes deployment across firewalls or through congested networks difficult. Al-
though there have been nascent efforts to address this problem [81], these new methods
have not yet been added to the stable release (as of September, 2004).
In the late 1990’s, the growth of the Internet exploded and many distributed
computing projects sought to exploit the TeraFlops of potential computing power offered
by tens of millions of hosts [38, 23, 49, 44, 86]. The largest and most well-known project
is SETI@home [86], which runs an embarrassingly parallel astronomy application that
currently utilizes about 20 TeraFlops from about 500,000 active desktops. In SETI@home,
a worker daemon runs on each participating host, which requests tasks from a central-
ized server. The worker daemon ensures that the task runs only when the host is idle.
Software on a centralized server consists of an HTTP server so that hosts behind fire-
walls can download/update tasks, and a database server, which stores the locations of
inputs, outputs, and various statistics about participants. The first implementation of
SETI@home was application-specific and had few tools for managing the system.
One impact of the SETI@home project was proving that people in significant
numbers were willing to donate the idle cycles of their desktops for large-scale computing
projects. A remarkable social phenomenon was that when SETI@home listed on the web
the top contributors to the project in terms of completed tasks, teams of enthusiastic
desktop users soon formed and sought to gain higher status in the rankings.
The success of SETI@home spurred numerous other projects, and also academic
and industrial endeavors for developing multi-application desktop grid software. In early
2000, academic desktop grid infrastructures such as XtremWeb [37] and BOINC [39] were
implemented. Also, several commercial companies, such as Entropia [28] and United De-
vices [90], were founded, and these companies developed industrial-grade desktop grid
software that was professionally tested and supported for the purpose of deploying task
parallel applications. Many of these systems have tools for large-scale system manage-
ment, and support user authentication and data encryption.
I.B Prospects and Challenges
As commodity compute, storage, and network technologies improve,
become less costly and more pervasive, desktop grids are increasingly attractive for
running large-scale applications. As of April, 2005, for about $400, one can purchase a
Dell Dimension 3000, which has a 2.8GHz Pentium Processor with a 533 MHz front side
bus, 512MB SDRAM at 400MHz, an 80GB Ultra ATA/100 Hard Drive (7200 RPM), and a
100Mbps network interface. For $215, one can buy a Dell PowerConnect 2716 16-Port
Gigabit Ethernet Switch. With $150,000, a company or university could purchase about
367 of these desktops and about 12 switches, excluding costs of installation, maintenance,
space, and power. Given that CPU availability in shared desktop environments is often
80% or more [62] and average free disk space [18] is about 50%, the resulting platform
would have an aggregate computing power close to 1 TeraFlop and about 15 TeraBytes
of disk space. Thus, desktop grids can have a high return on investment.
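The arithmetic behind these figures can be checked with a short back-of-the-envelope calculation. The prices, counts, and availability fractions come from the text; the per-host sustained compute rate in GFlops is an assumption chosen only to illustrate how the "close to 1 TeraFlop" aggregate arises.

```python
# Back-of-the-envelope check of the platform described above.
desktop_cost, switch_cost = 400, 215
n_desktops, n_switches = 367, 12
total_cost = n_desktops * desktop_cost + n_switches * switch_cost

cpu_availability = 0.80        # shared-desktop CPU availability [62]
free_disk_fraction = 0.50      # average free disk space [18]
disk_per_host_gb = 80
sustained_gflops = 3.5         # assumed per-host sustained rate (not from the text)

aggregate_tflops = n_desktops * cpu_availability * sustained_gflops / 1000
aggregate_disk_tb = n_desktops * disk_per_host_gb * free_disk_fraction / 1000

print(f"cost ${total_cost}, ~{aggregate_tflops:.1f} TFlop, ~{aggregate_disk_tb:.0f} TB")
```

The total comes to $149,380, within the $150,000 budget, with roughly 1 TeraFlop of compute and about 15 TeraBytes of disk, matching the figures above.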
We believe that desktops in enterprise settings are especially useful as they
contribute a significant fraction of the cumulative computing power of Internet desktop
grids. (Many desktop grid researchers [29, 69] have reported that the ratio of useful hosts
to non-useful hosts participating in desktop grid projects can be as low as 1 to 10.) This
is because the enterprise desktops have relatively high availability and usually constant
network connectivity. As such, many desktop grid companies such as Entropia [28],
and United Devices [90] have separate products that target enterprise environments
exclusively.
Desktop grids restricted to enterprise environments are attractive for several
other reasons. First, the hosts are often under the same administrative domain, or under
a limited number of domains, so the software configurations of hosts (e.g., operating
system and version, and software libraries) are similar. This can simplify deployment of
the software infrastructure and applications; developers do not need to make the
software or application portable for every combination of operating system, operating
system version, and programming library. Second, security in terms of ensuring that
the application executable and data have not been tampered with is less of an issue.
Presumably, the desktop users within the company or university are not malicious and
are not attempting to thwart the computation. This certainly does not preclude accidental harm
to the application, but it does reduce the risk of such an occurrence.
Although enterprise desktop grids are attractive, there exists a wide dispar-
ity between the structural complexity of applications runnable on MPP’s and current
desktop grid applications. Most Internet desktop grid applications are task parallel and
compute bound. Table I.1, obtained from [63], shows a list of typical applications run on
enterprise desktop grids. The third column in the table is the server bandwidth required
to support 1000 workers, and the fourth column is the maximum number of workers
assuming they can use only 20 Mbits/sec. Virtually all desktop grid applications deployed
over the Internet resemble the docking application shown in the table. That is, the
applications are compute-bound and task parallel, with task sizes on the order of kilobytes
or megabytes and run times on the order of minutes or hours. The higher capacity of
networks in enterprise environments allows applications with higher communication-to-
computation ratios to run on desktop grids, as shown in the lower rows of the table.
Moreover, the majority of applications run on desktop grids are high-throughput, i.e.,
these applications have many more tasks than hosts available.

  Application              Task run time   Task data size   Server bandwidth   Max. workers using
                                                            (1000 workers)     20 Mbits/sec
  Docking                  20 min.         1 MByte          6.67 Mbits/sec     2,998
  small data, med run      10 min.         1 MByte          13.3 Mbits/sec     1,503
  BLAST                    5 min.          10 MByte         264 Mbits/sec      75
  large data, large run    20 min.         20 MByte         132 Mbits/sec      150

Table I.1: Characteristics of desktop grid applications [63]
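The bandwidth figures in Table I.1 can be recomputed directly from the task run times and data sizes. The sketch below assumes decimal megabytes; its results agree with the table's figures to within about 1% rounding.

```python
# Recomputing Table I.1's server bandwidth and maximum worker counts
# from task run time and data size (decimal MBytes assumed).
apps = {  # name: (run time in minutes, task data size in MBytes)
    "Docking": (20, 1),
    "small data, med run": (10, 1),
    "BLAST": (5, 10),
    "large data, large run": (20, 20),
}
results = {}
for name, (run_min, size_mbyte) in apps.items():
    per_worker = size_mbyte * 8 / (run_min * 60)   # Mbits/sec per worker
    server_bw = per_worker * 1000                   # bandwidth for 1000 workers
    max_workers = 20 / per_worker                   # workers a 20 Mbits/sec server supports
    results[name] = (server_bw, max_workers)
    print(f"{name}: {server_bw:.2f} Mbits/sec, {max_workers:.0f} workers")
```

Note how the ratio of data size to run time, not either quantity alone, determines both the server bandwidth and the maximum worker count.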
The reason most applications deployed on desktop grids are task parallel, com-
pute bound and high-throughput is that the hosts are volatile and heterogeneous. The
hosts are volatile in the sense that CPU availability for the desktop grid application may
fluctuate dramatically over time because the host is shared with the user/owner of the
machine. The host is shared in such a way that the user/owner's activities (such as key-
board/mouse activity and other processes) get higher priority than the desktop grid task,
so the host cannot be reserved for any block of time. Moreover, hosts often have a wide range
of clock rates, which makes application deployment even more complicated.
I.C Goal, Motivation, and Approach
The goal of this thesis is to broaden the range of applications that can utilize
desktop grids. In particular, we focus on designing scheduling heuristics to enable rapid
application turnaround on enterprise desktop grids. By rapid application turnaround,
we mean turnaround on the order of minutes or hours (versus days or months, which is
typical of high-throughput applications run on desktop grids).
Our own experience in discrete-event simulation suggests that users often desire
turnaround within time windows that are hours or minutes in length, for example, having
results before the lunch hour, or by the next morning. This is true especially in industry
where results are required by short-term deadlines. Others have also indicated the need
for fast turnaround with respect to biological docking simulations [66]. Often these
simulations (especially simulations that explore a range of parameters) can be organized
into hundreds or thousands of independent tasks where each task consists of data input
sizes on the order of kilobytes, and each task takes on the order of minutes or hours to
run. Also, most applications from MPP workloads are less than a day in length, indicating
that short jobs are not uncommon [58].
Applications consisting of independent tasks with soft real-time requirements
are also commonly found in the area of interactive scientific visualization [57, 80, 40].
An example of such an application that requires rapid turnaround is on-line parallel to-
mography [80]. Tomography is the construction of 3-D models from 2-D projections, and
it is common in electron microscopy to use tomography to create 3-D images of biological
specimens. When an electron microscopist takes 2-D images of a specimen, the 3-D model
would ideally be refreshed after a series of projections, incorporating the additional in-
formation obtained from the new projections. After a refresh, the microscopist could
view the new 3-D model and then redirect his/her attention to a different area in the
specimen or correct the configuration of the microscope. Interactively viewing the model
after a set of projections allows the microscopist to converge on a correct model quickly,
and this, in turn, reduces the chance of damage to the sample from excessive exposure
to the electron beam [77].
In [80], the authors determine that on-line tomography is amenable to grid
computing environments (which include networks of workstations), and they develop
scheduling heuristics for supporting the soft deadline of the application. In particular,
the tomography application is embarrassingly parallel, as each 2-D projection can be de-
composed into independent slices that must be distributed to a set of resources for processing.
Each slice is on the order of kilobytes or megabytes in size, and there are typically hun-
dreds or thousands of slices per projection, depending on the size of each projection.
Ideally, the processing of a single projection can be completed while the user is acquiring
the next image from the microscope, which typically takes several minutes [46]. As such,
on-line parallel tomography could potentially be executed on desktop grids if there were
effective heuristics for meeting the application’s relatively stringent time demands.
In this thesis, we develop heuristics to allow applications requiring rapid turnaround
to utilize desktop grids effectively, focusing particularly on enterprise environments. Our
approach is to first develop a characterization of the volatility and heterogeneity in real
enterprise desktop grids. We then use this characterization to influence the design of
scheduling heuristics. Our heuristics are based on three scheduling techniques, namely
resource prioritization, resource exclusion, and task replication. Often, there is a large
difference in the effective compute rates of hosts in a desktop grid, so resource
prioritization causes tasks to be assigned to the best hosts first. Moreover, the
worst hosts can significantly impede application execution, and excluding such hosts
may remove the bottleneck. We examine various criteria by which to exclude some hosts
and never use them to run application tasks. Finally, replicating a task on multiple
hosts can be used to reduce the chance that a task fails and slows application execution.
This method has the drawback of wasting CPU cycles, which could be a problem if the
desktop grid is to be used by more than one application. We investigate several issues
pertaining to replication, including which task to replicate and which host to replicate
to.
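The interplay of the three techniques can be sketched as a simple scheduler. This is an illustrative toy, not the heuristics evaluated later in the thesis: the host attributes, the exclusion threshold, and the replication factor are all hypothetical.

```python
# Illustrative sketch of resource exclusion, resource prioritization,
# and task replication applied to a small host pool.

def schedule(tasks, hosts, exclude_below_ghz=1.0, replicas=2):
    # Resource exclusion: never assign tasks to the worst hosts.
    usable = [h for h in hosts if h["clock_ghz"] >= exclude_below_ghz]
    # Resource prioritization: assign tasks to the fastest hosts first.
    usable.sort(key=lambda h: h["clock_ghz"], reverse=True)
    assignments = []
    for task in tasks:
        # Task replication: run each task on several hosts so that one
        # failure does not slow the application, at the cost of wasted cycles.
        chosen, usable = usable[:replicas], usable[replicas:]
        assignments.extend((task, h["name"]) for h in chosen)
    return assignments

hosts = [{"name": "slow", "clock_ghz": 0.5},
         {"name": "mid", "clock_ghz": 1.5},
         {"name": "fast", "clock_ghz": 2.8}]
print(schedule(["t1"], hosts))  # [('t1', 'fast'), ('t1', 'mid')]
```

The "slow" host never receives work (exclusion), the fastest host is used first (prioritization), and each task lands on two hosts (replication).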
I.D Contributions
The crux of this dissertation can be summarized in the following thesis state-
ment:
"Scheduling heuristics based on resource prioritization, exclusion,
and reactive task replication techniques that use relatively static information
about resources can result in tremendous performance gains for task parallel,
compute-bound applications needing rapid turnaround."
To that end, the contributions of the thesis are as follows:
1. An accurate measurement and characterization of desktop grids.
We use a simple but novel method for measuring availability of resources in four
desktop grid platforms. This method records the availability that would be ex-
perienced by a real application. We then characterize the temporal structure of
CPU availability for each platform and individual resources, identifying important
similarities and differences. Our measurement and characterization can be useful
for creating generative, predictive, and explanatory models, driving desktop grid
simulations, and shaping the design of scheduling heuristics.
2. Effective resource management heuristics for rapid application turnaround.
Using the desktop grid characterization, we design heuristics based on three schedul-
ing techniques, namely resource prioritization, resource exclusion, and task repli-
cation.
We evaluate these heuristics through trace-driven simulations of four representa-
tive desktop grid configurations. We find that ranking desktop resources according
to their clock rates, without taking into account their availability history, is sur-
prisingly effective in practice. Our main result is that a heuristic that uses the
appropriate combination of resource prioritization, resource exclusion, and task
replication achieves performance often within a factor of 1.7 of optimal.
3. A scheduler prototype for scheduling applications requiring rapid turnaround.
We implement a scheduler prototype for such applications. Our implementation
demonstrates the feasibility of our heuristics in real settings.
The thesis is structured as follows. First, in Chapter II we will describe
the state of the art of desktop grid systems. Then, in Chapter III, we will describe our
measurement and characterization of a desktop grid system. We will detail the method
by which we made measurements and how this method differs from other studies. Then
in Chapter IV we will outline the design and evaluation of our scheduling heuristics for
rapid application turnaround by describing our simulation models, general scheduling
techniques, and performance metrics. In Chapter V we describe scheduling heuristics
that use prioritization and exclusion effectively for resource selection, and quantify their
performance according to the optimal schedule achievable by an omniscient scheduler.
Even with the best resource selection techniques, task failures can continue to impede
application execution and so in Chapter VI, we investigate methods for masking task
failures by means of task replication. We will examine issues such as when to replicate
and which host to replicate on. We then implement our best heuristic to demonstrate
its feasibility and describe the implementation in Chapter VII. Finally, in Chapter VIII,
we will summarize the conclusions and impact of the thesis.
Chapter II
Desktop Grid System Design and
Implementation: State of the Art
II.A Background
A desktop grid system consists of a large set of network-connected computa-
tional and storage resources that are harvested when unused for the purpose of large-scale
computations. The computational resources are usually shared with the users or owners
of the machines, who often demand priority over desktop grid applications. As a result,
the resources are unreserved in that the availability of any set of machines cannot be
guaranteed for any period of time. Moreover, the resources are often volatile due to
user activity, machine hardware failures, and network failures, for example, and these
factors in turn prevent tasks from running to completion. In addition to being volatile,
the resources are usually heterogeneous in terms of clock rate, memory and disk size and
speed, network connectivity, and other characteristics.
Terminology related to the components of desktop grids is defined as follows.
We use the term client to refer to the user that has an application for submission. To
utilize a desktop grid, a client submits an application, which consists of a set of tasks,
to the server. The scheduler on the server then assigns tasks to each available worker,
which is a daemon that manages task execution and runs on each host. We use the terms
host and resource synonymously.
The ideal desktop grid system would have the following characteristics:
1. Scalability: The throughput of the system should increase proportionally with
the number of resources.
2. Fault tolerance: The system must be tolerant of both server failure (for example,
data server crashes) and worker failure (for example, the user shutting off his/her
machine). (Traditionally, the term failure refers to a defect of hardware or software.
We use the term failure broadly to include all causes of task failure, including not
only failure of the host’s hardware or worker software, but also keyboard/mouse
activity that causes the worker to kill a running task.)
3. Security: The machine including its data, hardware, and processes must be pro-
tected from a misbehaving desktop grid application. Conversely, the application’s
executable, input, and output data, which may be proprietary, must be protected
from user inspection and corruption.
4. Manageability: Increasingly, human resources are more costly than computing
resources. Systems should provide tools for installing and updating workers eas-
ily, and also tools for managing applications and resources, and monitoring their
progress.
5. Unobtrusiveness: Since the desktop grid application shares the system with the
user, the user processes must have priority over the client’s. When the worker
detects user activity, the task should be suspended temporarily until the activity
subsides, or the task should be killed and restarted later when the host becomes
available again.
6. Usability: Integration of an application within a desktop grid system should be
as transparent as possible; in many cases, the complexity of the (legacy) program
or the fact that the source code is proprietary and is not available makes it difficult
to modify the code to use a desktop grid system.
II.B System Anatomy and Physiology
Currently, there exist a number of academic and industrial desktop grid systems
that harvest the idle cycles of desktop PC’s in Internet environments and/or enterprise
environments. We describe how these systems achieve (or fail to achieve) the design
goals described in the previous section. These systems share many features of architec-
tural design and organization, and we give an overview of the anatomy and physiology
of current systems, identifying commonalities and important differences at the client,
application and resource management, and worker levels (see Figure II.1, which reflects
logical organization of the various components of a desktop grid system). (Note that the
physical organization may be different than what is shown in Figure II.1. For example,
components of the client level often reside on the same host as the worker.)
At the Client Level, a user submits an application to a desktop grid, using tools
for controlling the application’s execution and monitoring its status. At the Application
and Resource Management Level, the application is then scheduled on workers, and
information about applications and workers is stored. At the Worker Level, the worker
ensures the application’s task executes transparently with respect to other user processes
on the hosts.
As an overview, we first give a procedural outline for the submission and exe-
cution of a desktop grid application, noting in parentheses where each action fits with
respect to Figure II.1:
1. The user that has an application to submit authenticates him/herself to the desktop
grid server. (Client Level & Application and Resource Management Level)
2. As an optional first step, the application input data (e.g., database of protein
sequences) is partitioned into work units, and then organized into batches of tasks.
(Client Level)
3. Task batches generated from either the client manager or the application itself are
sent to the application manager. Once the application is submitted, the client
manager can be used to control and monitor the application. (Client Level &
Application and Resource Management Level)
4. The application manager assigns the application to a scheduler that oversees its
completion. (Application and Resource Management Level)
5. The scheduler assigns work to the workers according to the application/worker
constraints and scheduling heuristic. (Application and Resource Management Level
& Worker Level)
6. When available, the worker computes its task and returns a result to the scheduler,
which relays it to the application manager after the application has been completed.
(Worker Level & Application and Resource Management Level)
7. The application manager tallies the results and returns them to the application or
client manager, which does post-processing as necessary. (Client Level)
We detail next the various components at each level shown in Figure II.1 that
are involved with the above procedure of application submission, management, and exe-
cution. When relevant, we inject in the discussion details about four particular systems
(namely Entropia [28], United Devices [90], XtremWeb [37] and BOINC [39]) used cur-
rently by large projects that incorporate hundreds to thousands of resources. Entropia
and United Devices are commercial companies that offer desktop grid software that is
professionally developed, tested, and supported. Both companies have separate products
tailored for either enterprise or Internet environments. In our discussion below, our ref-
erences to the Entropia or United Devices frameworks refer to the software designed for
Internet environments. XtremWeb and BOINC are open source Internet desktop grid
frameworks. The XtremWeb system is an academic project developed at the University
of Paris-Sud, and has been used on hundreds of machines for over 10 projects. The
BOINC system has been deployed over hundreds of thousands of hosts and is currently
used to support the SETI@home project as well as five other large projects.
II.B.1 Client Level
In order for a user to submit his/her application to the desktop grid system,
the user must register the application binary with the application manager by sending
the executable and specifying the access permissions. Then, the application’s input data
(stored in a database or as flat files, for example) must be partitioned and formatted
into tasks. Several systems such as XtremWeb [37], Entropia [63], United Devices [90],
and Nimrod-G [7] provide tools packaged as part of the client manager for creating tasks
from a range or set of parameters.

[Figure II.1: A Common Anatomy of Desktop Grid Systems. The figure shows the Client
Level (application, client manager), the Application and Resource Management Level
(application manager, scheduler, database), and the Worker Level (worker daemon,
sandbox, worker application).]
The client manager most often provides a command-line interface through which
the user can submit the tasks to the application manager. Another option offered by the
Entropia and United Devices systems is to use the application manager’s API to submit
tasks programmatically. After the application is submitted, the client manager can be
used to monitor the progress of the application and control its execution. Many systems
such as Entropia and XtremWeb provide the functionality of the client manager through
a web browser as well.
II.B.2 Application and Resource Management Level
When an application binary is submitted, the application manager creates a
corresponding entry in the application table of a relational database to record the path
to the corresponding binary, permissions for accessing information about the application,
and any constraints on the resources to which the application’s tasks can be scheduled
(e.g., minimum CPU speed, memory size). When a set of tasks is submitted, the ap-
plication manager creates an entry in the task table of the database to record which
application each task corresponds to and the paths to the corresponding input files on
the server.
Moreover, the application manager is responsible for supplying tasks of the
application to the scheduler, which oversees resource selection and binding. When the
scheduler receives a request from the worker, it makes a scheduling decision based on
information (such as CPU speed, memory size, disk space, network speed) about the
worker stored in the worker table in the database and the resource constraints of the
application. Then the scheduler packages an application binary with data inputs and
sends the inputs back to the worker in response. Most schedulers [39, 37] in current
systems assign tasks to resources in First-Come-First-Served (FCFS) order, and thus are
tailored towards high-throughput jobs. (The Entropia system uses a multi-level priority
queue for task assignment [63].)
Schedulers are passive in the sense that they cannot “push” tasks to workers;
instead the scheduler must wait for a worker to make a connection to the server before
being able to assign a task to it. This is due to the fact that hosts found on the
Internet (including enterprises) are often protected by firewalls that block all incoming
connections but usually allow some kinds of outgoing connections. Thus,
any connection made between the worker and the server must be initiated by the worker.
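The pull-based, FCFS behavior described above can be sketched as follows. The task and worker fields are illustrative; real systems carry richer constraint sets, but the control flow, in which the server only reacts to worker-initiated requests, is the same.

```python
# Sketch of a passive, pull-based FCFS scheduler: the server never pushes
# work; it only responds when a worker initiates a connection.

class FCFSScheduler:
    def __init__(self, tasks):
        self.pending = list(tasks)  # tasks kept in order of submission

    def handle_worker_request(self, worker):
        """Invoked only when a worker connects; returns the earliest-
        submitted task whose constraints the worker satisfies, or None."""
        for i, task in enumerate(self.pending):
            if worker["clock_ghz"] >= task.get("min_clock_ghz", 0):
                return self.pending.pop(i)
        return None  # nothing suitable; the worker retries later

sched = FCFSScheduler([{"id": 1, "min_clock_ghz": 2.0}, {"id": 2}])
print(sched.handle_worker_request({"clock_ghz": 1.0}))  # {'id': 2}
```

A slow worker skips over tasks it cannot satisfy but otherwise receives the oldest pending task, which is what tailors such schedulers to high-throughput workloads.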
After the worker successfully completes the task, it returns the result to the
application manager, which then records the completion in the results database table
(e.g., storing the time of completion and which user completed the task) and credits the
user whose host completed the task.
II.B.3 Worker Level
II.B.3.a Worker Daemon
On each host, a worker daemon runs in the background to control communi-
cation with the server and task execution on the host, while monitoring the machine’s
activity. The worker has a particular recruitment policy used to determine when a task
can execute, and when the task must be suspended or terminated. The recruitment
policy consists of a CPU threshold, a suspension time, and a waiting time. The CPU
threshold is some percentage of total CPU use for determining when a machine is con-
sidered idle. For example, in Condor, a machine is considered idle if the current CPU
use is below the default CPU threshold of 25%. The suspension time refers to the
duration that a task is suspended when the host becomes non-idle. A typical value for
the suspension time is 10 minutes. If a host is still non-idle after the suspension time
expires, the task is terminated. When a busy host becomes available again, the worker
waits for a fixed period of time of quiescence before starting a task; this period of time
is called the waiting time. In Condor, the default waiting time is 15 minutes.
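The recruitment policy can be sketched as a small state machine, using the Condor defaults quoted above. This is a simplified model: the caller is assumed to sample CPU use periodically and to reset `seconds_in_state` on each transition (and, in the waiting state, whenever activity is observed).

```python
# Sketch of a worker recruitment policy with the Condor defaults from the text.
CPU_THRESHOLD = 0.25        # host is considered idle below 25% CPU use
SUSPENSION_TIME = 10 * 60   # seconds suspended before the task is killed
WAITING_TIME = 15 * 60      # seconds of quiescence before (re)starting a task

def next_state(state, cpu_use, seconds_in_state):
    busy = cpu_use >= CPU_THRESHOLD
    if state == "running":
        return "suspended" if busy else "running"
    if state == "suspended":
        if not busy:
            return "running"  # activity subsided before the suspension expired
        return "killed" if seconds_in_state >= SUSPENSION_TIME else "suspended"
    if state == "waiting":
        return "running" if not busy and seconds_in_state >= WAITING_TIME else "waiting"
    return state  # "killed" is terminal until the host is reclaimed

print(next_state("running", 0.60, 0))       # suspended: host became busy
print(next_state("suspended", 0.60, 700))   # killed: suspension time expired
print(next_state("waiting", 0.05, 1000))    # running: waiting time elapsed
```

The three thresholds trade unobtrusiveness against wasted work: a lower CPU threshold or longer waiting time makes the worker less intrusive but leaves more idle cycles unharvested.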
Figure II.2 shows an example of the effect of recruitment policy on CPU avail-
ability. The task initially uses all the CPU for itself. Then, after some user key-
board/mouse activity, the task gets suspended. (The various causes of task termination
to enforce unobtrusiveness include user-level activity such as mouse/keyboard activity,
other CPU processes, and disk accesses, or even machine failures, such as a reboot, shut-
down or crash.) After the activity subsides and the suspension time expires, the task
resumes execution and completes. The worker then uploads the result to the server and
downloads a new task; this time is indicated by the interval labelled “gap”. The task
begins execution then gets suspended and eventually killed, again due to user activity;
usually all of the task’s progress is lost as most systems do not have system-level sup-
port for checkpointing. Later, after the host becomes available for task execution and
the waiting time expires, the task restarts; shortly after it begins execution, the host
is loaded with other processes, but because the CPU utilization is below the threshold,
the task continues executing, receiving only a slice of CPU time.
In addition to controlling the execution of the desktop grid application, the
worker daemons in XtremWeb and Entropia periodically poll the server to indicate the
current state of the worker (for example, running a task, or waiting for the machine to
be idle) and whether the host and worker are up. If a task has been assigned to a worker
and the worker stops sending heartbeats to the server, the worker is assumed to have
failed, and the task is reassigned to another worker.
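This heartbeat-based failure detection can be sketched as follows. The timeout value and data structures are illustrative; the systems above choose their own polling intervals.

```python
# Sketch of heartbeat-based failure detection for reassigning tasks.
HEARTBEAT_TIMEOUT = 120.0   # seconds of silence before a worker is presumed failed

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}    # worker id -> time of last heartbeat
        self.assignment = {}   # worker id -> task currently assigned to it

    def heartbeat(self, worker_id, now):
        self.last_seen[worker_id] = now

    def reap_failed(self, now):
        """Return the tasks of timed-out workers so they can be reassigned."""
        orphaned = []
        for worker_id, task in list(self.assignment.items()):
            if now - self.last_seen.get(worker_id, 0.0) > HEARTBEAT_TIMEOUT:
                orphaned.append(task)
                del self.assignment[worker_id]
        return orphaned

mon = HeartbeatMonitor()
mon.assignment["w1"] = "task-42"
mon.heartbeat("w1", now=0.0)
print(mon.reap_failed(now=60.0))   # []: worker still within the timeout
print(mon.reap_failed(now=300.0))  # ['task-42']: reassign this task
```

Because the worker initiates all connections, this is the server's only way to distinguish a busy worker from a dead one; a too-short timeout causes spurious reassignment, while a too-long one delays recovery.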
Figure II.2: CPU Availability During Task Execution.
II.B.3.b Worker Sandbox
To ensure the protection of the underlying host when a task is executing, several
systems provide some form of a sandboxed environment [19, 21]. In particular, Entropia
provides a virtual machine as a sandbox that guards the machine from errant worker
processes. The virtual machine is a user-level program that simulates the Windows
kernel, and the worker application runs as a thread of this virtual machine. When the
application runs and makes a system call, the virtual machine catches the system call
(presumably using a call analogous to Ptrace in Linux) and executes it in the simulated
environment. The virtual machine is configured to map application virtual file accesses
to file accesses on the actual machine. For example, many applications make changes to
the Windows registry, which could be potentially obtrusive to the host. The Entropia
virtual machine has a shadow registry within its installation directory to which writes
are made, thereby preventing modifications to the actual registry.
There are several other benefits of the virtual machine. The virtual machine
enables fine grain control of network, memory, disk, and computing resources used by
the application in order to limit its obtrusiveness. Also, the Entropia virtual machine
simplifies application integration into the desktop grid system by allowing any (propri-
etary) Windows executable to be run by the worker without any changes to the (legacy)
source code or recompiling to link with special libraries.
Besides Entropia, the XtremWeb research group investigated the use of a
user-level sandbox [19], which intercepts any system calls of the application, and for each
intercepted system call, runs a security check to ensure the call is valid before allowing
its execution. Specifically, XtremWeb deploys this method by using Ptrace to allow
a parent process to retain control over its child when specific operations are executed
by system calls. When a ptraced child process makes a system call, its execution is
paused, and the parent process can inspect the parameters of the call before allowing
its execution. If the child’s system call fails its parent’s check, the parent can kill the
child process. The drawback of this sandboxing technique is the overhead of at
least two context switches per intercepted call between the child and parent processes, so
applications with significant I/O will perform poorly on such systems.
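The parent's per-call security check can be illustrated in isolation from the ptrace plumbing. The sketch below is hypothetical (the policy table and function names are ours, not XtremWeb's); in a real sandbox the call name and arguments would be read out of the stopped child's registers after a PTRACE_SYSCALL stop.

```python
# Hypothetical sketch of the parent's security check in a ptrace-style
# user-level sandbox: each intercepted system call is validated against
# a policy before the child is allowed to resume.

ALLOWED_SYSCALLS = {"read", "write", "open", "close", "brk", "exit"}
WRITABLE_PREFIX = "/tmp/sandbox/"   # only this subtree may be modified

def check_syscall(name, args):
    """Return True if the intercepted call may execute; False means the
    parent should kill the child process."""
    if name not in ALLOWED_SYSCALLS:
        return False
    if name == "open":
        path, mode = args
        # deny opening files for writing outside the sandbox directory
        if "w" in mode and not path.startswith(WRITABLE_PREFIX):
            return False
    return True
```

Here the arguments are passed directly for clarity; the check itself is where the security policy lives, and each invocation corresponds to one pause/inspect/resume cycle of the traced child.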
Alternatively, the XtremWeb group considered using a kernel-level sandbox
technique, where a kernel patch is installed that adds hooks at the beginning or end
of particular kernel functions. A superuser can insert a module that implements these
hooks to define a specific security policy. The advantage of this method is that no context
switches are necessary. However, because this method requires root privileges, XtremWeb
did not implement it.
While sandboxes protect the host machine from a misbehaving application,
workers have several security mechanisms to protect the application (including its data)
from the user. To deter inspection, the application executable and data can be encrypted
with multiple keys to make examination difficult [28, 90] and modifications detectable.
II.B.4 Design Trade-offs of Centralization
At one end of the spectrum, a desktop grid can be completely centralized where
the client manager and all the components in the Application and the Resource Manage-
ment Layer are located on a single machine. At the other end, each host in the desktop
grid would be completely autonomous with little or no knowledge of other hosts or ap-
plications in the system. While most desktop grid systems are centralized, we identify
the trade-offs of centralized versus decentralized design with respect to the system goals
outlined in Section II.A, focusing particularly on scalability, server fault tolerance, task
result verification, and worker software manageability.
II.B.4.a Scalability
We focus on two aspects, namely resource management and application data
management. Two important parts of resource management are monitoring of the re-
sources to determine dynamic information such as CPU or network activity, and resource
selection. Several systems, such as NWS [93], use a hierarchical approach, which is
amenable to incorporation with desktop grid systems, to allow for scalable monitoring of
resources. Regarding resource selection, there exist several centralized systems [48]. For
example, the system in [48] can execute expressive resource queries (including ranking,
clustering) over a large set of attributes of millions of resources on the order of sec-
onds on a modest machine. The particular implementation used a relational database to
store the hierarchical structure of a set of resources, and using an XML database could
improve performance even further. Also, the authors in [68] showed that decentralized
resource management (specifically, monitoring and selection) is not always advantageous
performance-wise; they found that strategically placing 4-node server clusters to
support resource discovery results in performance comparable to that of decentralized
approaches based on distributed hash tables (DHT’s). In terms of ease of implementa-
tion, our own experience suggests that resource selection is greatly simplified if there is
a global view of the resources in the system, which is lost in a fully decentralized system.
At the same time, there have been several efforts to decentralize resource
monitoring and discovery to achieve scalability and fault tolerance, such as in SWORD [67],
Xenoservers [84], and GUARD [64]. The general approach in SWORD [67] and Xenoservers [84]
is to use DHT’s for distributing the data about resources and related queries among a
set of hosts. For example, to store data about host CPU availability, one host may store
values between 0 and 20%, another host may store values between 20 and 40% and so
on. Queries in the form of <attribute, value> are mapped to unique keys, which are
then routed to the host containing the corresponding data. The advantage of such an
approach is that it can tolerate host failures as the DHT will automatically restructure
itself as needed. The approach use in GUARD [64] is to create a “gossiping” protocol
based on distance vectors where resource information propagates automatically to a node
through its neighbors. The protocol is designed to be scalable and to withstand host
failures.
While benefits of decentralized resource management relative to centralized
management are debatable, one of the most limiting aspects of a centralized design is
application data management, in particular storage and distribution. In [63], the authors
show that an application with medium input sizes (10MB) and low execution times (5
minutes) requires significant bandwidth (264 Mbps) for a medium number of workers
(1000). Many applications can have much higher data input sizes, and distributing such
data inputs from a centralized server could be infeasible, and thus mandates decentralized
approaches, such as peer-to-peer (P2P) methods described in [85, 16, 72]. For example,
the Chord protocol provides a fast method by which to locate a datum stored on a set
of volatile hosts on a wide area. In particular, Chord is based on a distributed hash
table primitive that supports data lookups using only log(H) messages, where H is the
number of hosts in the system. The hosts in a DHT are organized using a logical overlay
that maps a unique id corresponding to a host to some position in the overlay. Each
node contains a routing table that indicates which of its neighbors are “closer” to the
datum. A datum has a unique identifier and is mapped to the “closest” node in the
logical overlay. Although there are many P2P methods for locating a datum on a set
of volatile hosts, linking the computation with the data (and addressing issues such as
locality) is still an open problem.
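The id-to-position mapping at the heart of such overlays can be sketched with consistent hashing. This is an illustrative sketch, not Chord's actual implementation: a real Chord node routes through finger tables to reach the responsible host in O(log H) hops, rather than scanning a sorted list of all hosts as done here.

```python
import hashlib

def ring_position(name, ring_bits=16):
    """Map a host name or datum key to a position on the logical ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** ring_bits)

def responsible_host(hosts, datum_key):
    """Return the host whose ring position is the first at or after the
    datum's position, wrapping around the ring (the "closest" node in
    Chord's sense)."""
    positions = sorted((ring_position(h), h) for h in hosts)
    k = ring_position(datum_key)
    for pos, host in positions:
        if pos >= k:
            return host
    return positions[0][1]  # wrap around the ring
```

Because host positions come from a hash, adding or removing a host only changes the assignment of keys in its immediate ring segment, which is what lets the overlay restructure itself after failures.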
II.B.4.b Fault Tolerance
Centralization can cause the server to become a single point of failure. To
avoid failure, we argue that replicating the server, i.e., components of the application
and resource management level, can reduce the probability of failure significantly. For
example, in 2001, the SETI@home server (including the web and data servers) became
nonfunctional 16 times [79]. (The causes of failure included hardware upgrades, updates
of the database or database software, RAID cards failing, electrical storms and repairs,
power outages, full disks, database failures, and rearranging of hardware.) Assuming
the server fails at a rate of 16 times per year, if the server was replicated on two servers
that had the same and independent failure rates, the probability that all servers fail at
once is less than 10^-8, or approximately once every 38,000 decades. The point is that
setting up a few extra servers or a server farm (versus a totally decentralized solution),
which several systems such as BOINC and XtremWeb support, could reduce the chance
of failure down to near-zero.
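A quick sanity check of this argument can be done with a back-of-the-envelope calculation. The mean outage duration used below (3 minutes) is our assumption; the dissertation does not state the duration behind its 10^-8 figure.

```python
# Back-of-the-envelope check of the replication argument: two independent
# replicas must be down simultaneously for the service to fail.

MINUTES_PER_YEAR = 365 * 24 * 60
failures_per_year = 16        # observed SETI@home server failure rate [79]
mean_outage_minutes = 3       # assumed, not stated in the source

# fraction of time a single server is unavailable
p_single = failures_per_year * mean_outage_minutes / MINUTES_PER_YEAR

# with an independent replica, both must be down at the same instant
p_both = p_single ** 2

print(f"single-server unavailability: {p_single:.2e}")
print(f"joint unavailability with 2 replicas: {p_both:.2e}")
```

Under this assumption the joint unavailability indeed falls below 10^-8, consistent with the claim that a small server farm drives the chance of total failure to near-zero.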
Even if a failure does occur, the effect is small since most Internet desktop grid
applications are high-throughput and an outage of a few hours is not significant as the
applications do not have stringent time requirements. Also, systems such as BOINC and
XtremWeb have mechanisms for graceful recovery. For example, when the data server
fails, all workers finish computing their tasks [10], and when the server comes back up,
it could become inundated by worker upload requests. To reduce the storm of requests,
BOINC and XtremWeb force exponential backoff of their workers when the server is
overloaded.
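The generic form of this backoff can be sketched as follows; the base delay, cap, and jitter are illustrative values of ours, not BOINC's or XtremWeb's actual parameters.

```python
import random

def backoff_delay(attempt, base=60.0, cap=86400.0):
    """Delay (seconds) before a worker's next upload attempt: the window
    doubles with each consecutive failure, saturates at `cap` (one day
    here), and a random draw within the window keeps workers from
    retrying in lockstep after a server outage."""
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)
```

The jitter is the important part: without it, every worker that observed the same outage would retry at the same instants, recreating the very request storm the backoff is meant to dissipate.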
Regarding task result verification, the result returned by a worker can often be
erroneous. In particular, the authors in [88] found that significant computation errors or
differences could be caused by hardware malfunctions, incorrect software modifications,
malicious attacks, or differences in floating-point hardware, libraries and compilers. The
error rates for two scientific applications (MFold and CHARMM) deployed over an Inter-
net desktop grid were 1.9% and 8.7%, respectively.
Task replication has been used as a means for detection and correction [88,
74, 87]. Multiple copies of a work unit are sent to different workers. When the results
are returned, they are compared and the result that appears most often or has been
computed by a credible worker is assumed to be correct. When a worker is found to
have computed a bad result, it is blacklisted to prevent the worker from affecting the
application further. Blacklisting a worker in a centralized system is trivial, but a fully
decentralized system could require a notification of each node hosting components of
the Application and Resource Management level in order to prevent the saboteur from
participating in the computation.
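The voting-and-blacklisting step can be sketched as below. This is a simplified majority vote; the worker-credibility weighting described in [88, 74, 87] is omitted, and all names are ours.

```python
from collections import Counter

def verify_results(results, blacklist):
    """results: list of (worker_id, value) pairs, one per replica of a
    work unit. Returns the strict-majority value, or None if there is
    no majority; workers that returned a deviating value are added to
    the blacklist so they receive no further work."""
    votes = Counter(v for w, v in results if w not in blacklist)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    if count <= len(results) // 2:   # no strict majority among replicas
        return None
    for worker, v in results:
        if v != value:
            blacklist.add(worker)
    return value
```

In a centralized system the blacklist is a single server-side set, which is why blacklisting is trivial there; a decentralized system would have to propagate this set to every node that hands out work.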
Regarding manageability, Entropia, XtremWeb, and United Devices all provide
command-line tools or web interfaces by which a single administrator can
manage applications, monitor their progress, or install workers and send updates. How
to effectively manage applications in a decentralized environment is still an open area
of research.
In summary, while decentralization certainly has many potential benefits, significant
challenges must be overcome before those benefits outweigh the costs of forgoing
centralization. Currently, there are at least two research efforts for developing a
decentralized desktop grid system, namely the Cluster Computing On the Fly system [53]
and the Organic Grid [27]. The authors in [53] propose the Cluster Computing On the Fly
system that uses distributed hash table techniques for locating available resources. The
authors in [27] describe a prototype of the Organic Grid, which is a fully decentralized
system based on mobile agents.
Chapter III
Resource Characterization
The measurement and characterization of desktop grids is useful for several
reasons. First, the data can be used for the performance evaluation of the entire system,
subsets, or individual hosts. For example, one could determine the aggregate compute
power of the entire desktop grid over time. Second, the data can be used to develop pre-
dictive, generative, or explanatory models [35]. For example, a predictive model could be
formed to predict the availability of a host given that it has been available for some pe-
riod of time. A generative model based on the data could be used to generate host clock
rate distribution or availability intervals for simulation of the platform. After showing
a precise fit between the data and the model, the model can often help explain certain
trends shown in the data. Third, the measurements themselves could be used to drive
simulation experiments. Fourth, the characterization should influence design decisions
for resource management heuristics. We discuss in this chapter our measurement tech-
nique for obtaining traces of several real desktop grids, and a statistical characterization
of each system as a whole and also individual hosts.
III.A The Ideal Resource Trace
The design and evaluation of our scheduling heuristics requires an accurate
characterization of a desktop grid system. An accurate characterization involves obtain-
ing detailed availability traces of the underlying resources. The term “availability” has
different meanings in different contexts and must be clearly defined for the problem at
hand [15]. In the characterization of these data sets, we distinguished among three types
of availability:
1. Host availability. This is a binary value that indicates whether a host is reachable,
which corresponds to the definition of availability in [18, 8, 15, 30, 76]. Causes of
host unavailability include power failure, or a machine shutoff, reboot, or crash.
2. Task execution availability. This is a binary value that indicates whether a task
can execute on the host or not, according to the worker’s recruitment policy. We
refer to task execution availability as exec availability in short. Causes of exec
unavailability include prolonged user keyboard/mouse activity, or a user compute-
bound process.
3. CPU availability. This is a percentage value that quantifies the fraction of the CPU
that can be exploited by a desktop grid application, which corresponds to the
definition in [12, 25, 80, 32, 92]. Factors that affect CPU availability include system
and user level compute-intensive processes.
Host unavailability implies exec unavailability, which implies CPU unavailabil-
ity. Clearly, if a host becomes unavailable (e.g., due to shutdown of the machine), then
no new task may begin execution and any executing task would fail. If there is a period
of task execution unavailability (e.g., due to keyboard/mouse activity), then the desktop
grid worker will stop the execution of any task, causing it to fail, and disallow a task to
begin execution; as a result of task execution unavailability, the task will observe zero
CPU availability.
However, CPU unavailability does not imply exec unavailability. For example,
a task could be suspended and therefore have zero CPU availability, but since the task
can resume and continue execution, the host is available in terms of task execution.
Similarly, exec unavailability does not imply host unavailability. For example, a task
could be terminated due to user mouse/keyboard activity, but the host itself could still
be up.
Given these definitions of availability, the ideal trace of availability would have
the following characteristics:
1. The trace would log CPU availability in terms of the CPU time a real application
would receive if it were executing on that host.
2. The trace would record exec availability, in particular when failures occur. From
this, one can find the temporal structure of availability intervals. We call the inter-
val of time in between two consecutive periods of exec unavailability an availability
interval.
3. The trace would determine the cause of the failures (e.g., mouse/user activity,
machine reboot or crash). This would enable statistical modeling (for the purpose
of prediction, for example) of each particular type of failure.
4. The trace would capture all system overheads. For example, some desktop grid
workers run within virtual machines [19, 21], and there may be overheads in terms
of start-up, system calls, and memory costs.
III.B Related Work on Resource Measurements and Modelling
Although a plethora of work has been done on the measurement and char-
acterization of host and CPU availability, there are two main deficiencies of this re-
lated research. First, the traces do not capture all causes of task failures (e.g., users’
mouse/keyboard activity), and inferring task failures and the temporal characteristics of
availability from such traces is difficult. Second, the traces may reflect idiosyncrasies of
the OS [92, 32] instead of showing the true CPU contention for a running task.
In this section, we highlight the shortfalls of the trace methods used in these
studies, and explain why many of the statistical models founded on this trace data
are inapplicable to desktop grids. Table III.1 summarizes methods of the representative
studies and the shortfalls.
III.B.1 Host Availability
Several traces have been obtained that log host availability throughout time.
In [20], the authors designed a sensor that periodically records each machine’s uptime
from the /proc file system and used this sensor to monitor 83 machines in a student
lab. In [56], the authors periodically made RPC calls to rpc.statd, which runs as part
of the Network File System (NFS), on 1170 hosts connected to the Internet (see the
row corresponding to the data set Long in Table III.1). A response to the call indicated
the host was up and a missing response indicated a failure. In [15], a prober runs on
the Overnet peer-to-peer file-sharing system looking up host ID’s. A machine with a
corresponding ID is available if it responds to the probe; about 2,400 machines were
monitored in this fashion. The authors in [30, 76] determine availability by periodically
probing IP addresses in a Gnutella system.
Using traces that record only host availability for the purpose of modeling
desktop grids is problematic because it is hard to relate uptimes to CPU cycles usable
by a desktop grid application. Several factors can affect an application’s running time
on a desktop grid, which include not only host availability but also CPU load and user
activity. Thus, traces that indicate only uptime are of dubious use for performance
modeling of desktop grids or for driving simulations.
III.B.2 Host Load and CPU Utilization
There are numerous data sets containing traces of host load or CPU utilization
on groups of workstations. Host load is usually measured by taking a moving average
of the number of processes in the ready queue maintained by the operating system’s
scheduler, whereas CPU utilization is often measured by the CPU time or clock cycles
per time interval received by each process. Since host load is correlated with CPU
utilization, we discuss both types of studies in this section.
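The host-load metric described above is typically an exponential moving average of run-queue samples, which can be sketched as follows (the smoothing constant is an arbitrary choice of ours, not the value used in [35]):

```python
def ema_load(samples, alpha=0.2):
    """Exponential moving average of run-queue lengths, the usual "host
    load" metric. `alpha` controls how quickly old samples are forgotten;
    returns the smoothed series, one value per raw sample."""
    avg = None
    series = []
    for s in samples:
        avg = s if avg is None else alpha * s + (1 - alpha) * avg
        series.append(avg)
    return series
```

The smoothing is what makes host load a lagging indicator: a sudden burst of processes only shows up gradually, which is one reason such traces cannot pinpoint the instant a desktop grid task would actually fail.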
The CPU availability traces described in [62, 92, 93] are obtained using the
UNIX tools ps and vmstat, which scan the /proc filesystem to monitor processes. In
particular, the authors in [62] used ps to measure CPU availability on about 13 VAXsta-
tionII workstations over a period of 3 months, and then later monitored 20 workstations
over a period of 4 months (see the row corresponding to the data set Condor in Ta-
ble III.1). Then they post-processed the data to determine each machine’s unavailability
intervals. A host was considered unavailable if its CPU utilization by user processes
went over 25%. They assumed a waiting time of 1 minute for the first 3 month period
of traces, and 5 minutes for the second 4-month trace period. Similarly, the authors in [35]
measured host load by periodically using the exponential moving average of the number
of processes in the ready queue recorded by the kernel (see the row corresponding to
the data set Dinda in Table III.1). The study was based on about 38 machines over a 1
week period in August 1997. In contrast to previous studies on UNIX systems, the work
in [18] measured CPU availability from 3908 Windows NT machines over 18 days by
periodically inspecting windows performance counters to determine fractions of cycles
used by processes other than the idle process.
None of the trace data sets mentioned above record the various events that
would cause application task failures (e.g., keyboard/mouse activity) nor are the data
sets immune to OS idiosyncrasies. For example, most UNIX process schedulers (the
Linux kernel, in particular) give long running processes low priority. So, if a long running
process were running on a CPU, a sensor would determine that the CPU is completely
unavailable. However, if a desktop grid task had been running on that CPU, the task
would have received a sizable chunk of CPU time. Furthermore, processes may have a
fixed low priority. Although the cause of task failures could be inferred from the data,
doing so is not trivial and may not be accurate.
III.B.3 Process Lifetimes
In [45], the authors conduct an empirical study on process lifetimes, and pro-
pose a function that fits the measured distribution of lifetimes. Using this model, they
determine which process should be migrated and when. Inferring the temporal struc-
ture of availability from this model of process lifetimes would be difficult because it is
not clear how to determine the starting point of each process in time in relationship
to one another. Moreover, the study did not monitor keyboard/mouse activity, which
significantly impacts availability intervals [73] in addition to CPU load.
Data Set    | OS           | Trace dates                           | # of hosts                                     | Method                  | User base                                                               | CPU threshold | Waiting time | Suspend time | <10 yrs old? | Host avail.? | Exec avail.? | True CPU avail.?
Long [56]   | UNIX         | 3 months, 1995                        | 1170 hosts                                     | RPC calls to rpc.statd  | mix over the Internet                                                   | N/A           | N/A          | N/A          | Y            | Y            | N            | N
Dinda [32]  | Digital UNIX | 1 week, 8/97                          | 38 hosts                                       | moving load average     | front-end, interactive, batch hosts, compute servers, desktops, cluster | N/A           | N/A          | N/A          | Y            | Y            | N            | N
Condor [62] | 4.2BSD Unix  | 7 months total, 9/86-1/87, 9/87-12/89 | 13 VAXstationII workstations, 20 workstations  | load average via ps     | faculty, system programmers, graduate students                          | 25%           | 5 min, 1 min | none         | N            | Y            | Y            | N
SDSC        | Windows      | 1 month total                         | ∼220 hosts                                     | real measurement tasks  | secretaries, conference rooms, administrators, research staff           | 20%           | 10 min       | 10 min       | Y            | Y            | Y            | Y
XtremWeb    | Linux        | 1 month, 1/05                         | ∼100 hosts                                     | real measurement tasks  | cluster, students                                                       | 10%           | 30 sec       | 0            | Y            | Y            | Y            | Y
UCB [12]    | Ultrix 4.2a  | 46 days, 2/94-3/94                    | 85 DEC 5000 workstations                       | user-level daemon       | EE/CS grad students                                                     | 5%            | 1 min        | none         | N            | Y            | Deriv.       | N

Table III.1: Characteristics of desktop grid trace data sets. (Deriv. denotes “derivable”)
III.C Trace Method
We gather traces by submitting measurement tasks to a desktop grid system
that are perceived and executed as real tasks. These tasks perform computation and
periodically write their computation rates to file. This method requires that no other
desktop grid application be running, and allows us to measure exactly the compute power
that a real, compute-bound application would be able to exploit. Our measurement
technique differs from previously used methods in that the measurement tasks consume
the CPU cycles as a real application would.
During each measurement period, we keep the desktop grid system fully loaded
with requests for our CPU-bound, fixed-time length tasks, most of which were around 10
minutes in length. The desktop grid worker running on each host ensured that these tasks
did not interfere with the desktop user and that the tasks were suspended/terminated as
necessary; the resource owners were unaware of our measurement activities. Each task of
fixed time length consists of an infinite loop that performs a mix of integer and floating
point operations. A dedicated 1.5GHz Pentium processor can perform 110.7 million such
operations per second. Every 10 seconds, a task evaluates how much work it has been
able to achieve in the last 10 seconds, and writes this measurement to a file. These files
are retrieved by the desktop grid system and are then assembled to construct a time
series of CPU availability in terms of the number of operations that were available to
the desktop grid application within every 10 second interval.
For the Windows version of the measurement task, we implement the timing aspect
of a task using the Windows multimedia timer. The timer is implemented by spawning
a high-priority thread that sets a kernel timer and then blocks. When the thread wakes
up, it executes our callback routine posting the number of operations completed during
the last 10 seconds in the Windows application’s message queue, and then sleeps. (Note
that the message queue is sufficiently large to preclude overfilling of the queue during
the 15 minutes of measurements per task.) The frequency at which the kernel interrupts
to check the timer expiration is set when initializing the timer. We tried various fre-
quencies from once every millisecond to once every second, and noticed little difference
in the total operations logged per time period. Thus, we used a timer resolution of 1 ms
assuming the overhead of using the timer is negligible. Regardless, the overhead of using
the timer should be constant across time intervals, and so the number of operations per
time interval would be equally affected.
We implement the computational aspect of a task by iteratively performing
integer and floating-point calculations within an integer array and a double array. Intra-loop
dependencies were added to prevent compiler optimization. The sizes of the arrays (60
and 224 bytes, respectively) were small enough to fit in cache, and so we excluded the
costs of memory accesses from our traces. The Linux version of the measurement task
was implemented in a similar manner.
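The structure of such a measurement task can be sketched as follows. This is an illustrative Python rendition, not the actual code: the real tasks were native Windows and Linux binaries, and the array sizes here (15 ints, 28 doubles) merely mirror the 60- and 224-byte cache-resident arrays described above.

```python
import time

def measurement_task(report_interval=10.0, duration=None, log=print):
    """Sketch of the measurement task: a compute loop over two small
    in-cache arrays with intra-loop dependencies, counting operations
    and logging the count achieved within each report interval."""
    ints = [1] * 15           # ~60 bytes of integers
    flts = [1.0] * 28         # 224 bytes of doubles
    ops = 0
    start = last = time.monotonic()
    while duration is None or time.monotonic() - start < duration:
        for i in range(len(ints)):
            # each update depends on the previous value, preventing
            # the work from being optimized away
            ints[i] = (ints[i] * 3 + i) % 1009
            flts[i] = flts[i] * 1.0001 + 0.5
            ops += 2
        now = time.monotonic()
        if now - last >= report_interval:
            log(ops)          # operations achieved since the last report
            ops = 0
            last = now
```

Collecting the logged counts from every host and concatenating them yields exactly the time series of per-interval CPU availability that the traces are built from.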
The main advantage of obtaining traces in this fashion is that the application
experiences host and CPU availability exactly as any real desktop grid application would.
This method is not susceptible to OS idiosyncrasies because the logging is done by
a CPU-bound task actually running on the host itself. Also, this approach captures
all the various causes of task failures, including but not limited to mouse/keyboard
activity, operating system failures, and hardware failures, and the resulting trace reflects
the temporal structure of availability intervals caused by these failures. Moreover, our
method takes into account overhead, limitations, and policies of accessing the resources
via the desktop grid infrastructure.
Every measurement method has weaknesses, and our method is certainly not
flawless. One weakness compared to the ideal trace data set is that we cannot identify the
specific causes of failures, and so we cannot distinguish failures caused by user activity
versus power failures, for example. This, in turn, could make stochastic failure prediction
models more difficult to derive, as one source of failure could skew the distribution of
another source. Nevertheless, all types of failures are still subsumed in our traces, in
contrast to other studies that often omit many types of desktop failures as described in
earlier sections.
Also, the tasks were executed by means of a desktop grid worker, which used a
particular recruitment policy. This means that the trace may be biased to the particular
worker settings used in the specific deployment. However, with knowledge of these
settings, it is straightforward to infer reliably at which points in the trace the bias
occurs and thus possible to remove such bias. After removing the bias, one could
post-process the traces according to any other CPU-based recruitment policy to determine the
corresponding CPU availability. This makes it possible to collect data using our trace
method from a desktop grid with one recruitment policy, and then simulate another
desktop grid with a different recruitment policy using the same set of traces with minor
adjustments.
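Such post-processing can be sketched as a simple filter over the trace; the function name and threshold below are hypothetical, chosen only to illustrate a CPU-based recruitment policy.

```python
def apply_recruitment_policy(avail, threshold=0.9):
    """Re-derive availability under a hypothetical CPU-based recruitment
    policy: the worker suspends the task whenever less than `threshold`
    of the CPU is available, so those intervals contribute no useful
    cycles. `avail` is the per-interval CPU availability as a fraction
    (one entry per 10-second measurement interval)."""
    return [a if a >= threshold else 0.0 for a in avail]
```

Simulating a grid with a stricter policy than the one used during collection is then just a matter of running every host's trace through such a filter before feeding it to the simulator.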
III.D Trace Data Sets
Using the previously described method, we collected data sets from two desktop
grids. One of these desktop grids consisted of desktop PC’s at the San Diego
Supercomputer Center (SDSC) and ran the commercial Entropia [28] desktop grid software.
We refer to the data collected from the SDSC environment as the SDSC trace. The other
desktop grid consisted of desktop PC’s at the University of Paris South, and ran the open
source XtremWeb [37] desktop grid software. The Xtremweb desktop grid incorporated
machines from two different environments. The first environment was a cluster used
by a computer science research group for running parallel applications and benchmarks,
and we refer to the data set collected from this cluster as the LRI trace. The second
environment consisted of desktop PC’s in classrooms used by first-year undergraduates,
and we refer to the data set as the DEUG trace. Finally, we obtained the traces described
in [12] which were measured using a different trace method and refer to this data set as
the UCB trace. (We describe this method in Section III.C.) The advantages of using
these data sets versus others are highlighted in Table III.1, and labelled as data sets
SDSC, XtremWeb, and UCB in the table.
The traces that we obtained from our measurements contain gaps. This is
expected as desktop resources become unavailable for a variety of reasons, such as the
rebooting or powering off of hosts, local processes using 100% of the CPU, the desktop
grid worker detecting mouse or keyboard activity, or the user actively pausing the worker.
However, we observe that a very large fraction (≥ 95%) of these gaps are shorter than
2 minutes. Figures III.1(a) and III.1(b) plot the distribution of these small gaps
for the Entropia desktop grid at SDSC and the Xtremweb desktop grid at the University
of Paris-Sud respectively. The average small gap length is 35.9 seconds on the Entropia
grid, and 19.5 seconds on the Xtremweb grid.
[Figure III.1 consists of two histograms plotting the number of gaps versus gap length (0–120 sec): (a) SDSC and (b) Xtremweb.]
Figure III.1: Distribution of “small” gaps (<2 min.).
After careful examination of our traces, we found that these short gaps occur
exclusively in between the termination of a task and the beginning of a new task on the
same host. We thus conclude that these small gaps do not correspond to actual exec
unavailability, but rather are due to the delay of the desktop grid system for starting a
new task. In the SDSC grid, the majority of the gaps are spread in the approximate range
of 5 to 60 seconds (see Figure III.1(a)). The sources of this overhead include various
system costs of receiving, scheduling, and sending a task as well as an actual built-in
limitation that prevents the system from sending tasks to resources too quickly. That
is, the Entropia server enforces a delay between the time it receives a request from the
worker to the time it sends a task to that worker. This is to limit the damaging effect of
the “black-hole” problem where a worker does not correctly execute tasks, and instead,
repeatedly and frequently sends requests for more tasks from the server. Without the
artificial task sending delay, the result of the “black-hole” problem is applications with
thousands of tasks that completed instantly but erroneously.
In the XtremWeb grid, the majority of the gaps are between 0 to 5 seconds and
40 to 45 seconds in length. When the Xtremweb worker is available to execute a task,
it sends a request to the server. If the server is busy or there is no task to execute, the
worker is told to make another request after a certain period of time (43 seconds) has
expired. This explains the bimodal distribution of gaps length in the XtremWeb system.
Therefore, these small availability gaps observed in both the Entropia and
XtremWeb grids would not be experienced by the tasks of a real application, but only in
between tasks. Consequently, we eliminated all gaps that were under 2 minutes in our
traces by performing linear interpolation.
Specifically, we interpolate gaps under 2 minutes in length using the following
method. Let prevnumops be the number of operations measured in the subinterval of
length prevduration that ends just before the gap begins. Let postnumops be the number
of operations measured in the subinterval of length postduration that begins just after
the gap ends. (prevduration and postduration are most often 10 seconds in length since
measurements are made usually every 10 seconds.) Let gapduration be the gap length,
and let gapnumops be the interpolated number of operations that are available during
the gap. We calculate gapnumops using a weighted average of the rate of operations
completed immediately before and after the gap so that the rate of operations during
the longer of the two subintervals carries more weight in the interpolation:
gapnumops = gapduration × [ (prevnumops / prevduration) × (prevduration / (prevduration + postduration))
                          + (postnumops / postduration) × (postduration / (prevduration + postduration)) ]
Usually, the subintervals immediately preceding and following the gap are 10 seconds in
length and so the interpolated rate is just the average of the rates before and after the
gap.
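The interpolation can be expressed directly in code (the function name is ours):

```python
def interpolate_gap(prevnumops, prevduration, postnumops, postduration, gapduration):
    """Number of operations attributed to a small (<2 min) gap: a weighted
    average of the operation rates just before and just after the gap,
    where the longer of the two subintervals carries proportionally more
    weight."""
    total = prevduration + postduration
    rate = (prevnumops / prevduration) * (prevduration / total) \
         + (postnumops / postduration) * (postduration / total)
    return gapduration * rate
```

Algebraically the weights cancel against the rate denominators, so the formula reduces to gapduration × (prevnumops + postnumops) / (prevduration + postduration); when both subintervals are 10 seconds long, this is simply the gap length times the average of the two rates.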
A small portion of the gaps larger than 2 minutes may be also attributed to the
server delay and this means that our post-processed traces may be slightly optimistic.
Note that although we use interpolation, we use the average small gap length in our
performance models, which we describe in Section III.G, to account for the server delay.
For a real application, the gaps may be larger due to transfer of input data files necessary
for task execution. Such transfer cost could be added to our average small gap length
and thus easily included in the performance model developed in Section III.G.
The weakness of interpolating the relatively small gaps is that this in effect
masks short failures less than 2 minutes in length. Failures due to fast machine reboots
for example could be overlooked using this interpolation method.
III.D.1 SDSC Trace
The first data set was collected using the Entropia DCGridTMdesktop grid
software system deployed at SDSC for a cumulative period of about 1 month across 275
hosts. We conducted measurements during four distinct time periods: from 8/18/03 until
8/22/03, from 9/3/03 until 9/17/03, from 9/23/03 until 9/26/03, and from 10/3/03 until
10/6/03, for a total of approximately 28 days of measurements. For our characterization
and simulation experiments, we use the longest continuous period of trace measurements,
which was the two-week period from 9/3/03 to 9/17/03.
Of the 275 hosts, 30 are used by secretaries, 20 are public hosts that are avail-
able in SDSC’s conference rooms, 12 are used by system administrators, and the remain-
ing are used by SDSC staff scientists and researchers. The hosts are all on the same
class C network, with most clients having a 10Mbit/sec connection and a few having
a 100Mbit/sec connection. All hosts are desktop resources that run different flavors
of WindowsTM. The Entropia server was running on a dual-processor XEON 500MHz
machine with 1GB of RAM.
To validate the measurements made by our monitoring application on the SDSC
grid, we isolated a small set of machines on which we accessed system counters to deter-
mine the CPU utilization of each process while our monitoring task was running on each
host controlled through the Entropia worker daemon. In particular, while continuously
sending our monitoring tasks to the Entropia system, we used the Windows Management
Instrumentation (WMI) to remotely access the system counters of seven machines at the
SDSC to determine the clock ticks devoted to each process running on the host. Only
a limited set of machines could be accessed for about 2 hours on September 12, 2003
as we needed superuser privileges to use WMI. Note that because this method of moni-
toring the machines went through the network to record each machine’s system counter
readings, measurements could be delayed due to network congestion. We compared the
WMI measurements with the task measurements, and found that the availability and
non-availability intervals recorded by the monitoring tasks corresponded to the times at
which the task appeared in the list of processes found through WMI. Moreover, the CPU
availability measured by the monitoring task closely matched the CPU utilization
measured by the WMI queries.
During our experiments, about 200 of the 275 hosts were effectively running
the Entropia worker (on the other hosts, the users presumably disabled the worker) and
we obtained measurements for these hosts. Their clock rates ranged from 179MHz up
to 3.0GHz, with an average of 1.19GHz. Figure III.2 shows the cumulative distribution
function (CDF) of clock rates. The curve is not continuous: for instance, no host has
a clock rate between 1GHz and 1.45GHz. The curve is also skewed: for instance, over
30% of the hosts have clock rates between 797MHz and 863MHz, which represents under
3.5% of the clock rate range.
An interesting feature of the Entropia system is its use of a Virtual Machine
(VM) to insulate application tasks from the resources. While this VM technology is
critical for security and protection issues, it also enables fine-grained control
of an executing task in terms of the resources it uses, such as limiting CPU, memory, and
disk usage, and restricting I/O, threads, processes, etc. The design principle is that an
application should use as much of the host’s resources as possible while not interfering
with local processes. One benefit is that this allows an application to use from 0% to
100% of the CPUs with all possible values in between. It is this CPU availability that
we measure. Note that our measurements could easily be post-processed to evaluate a
desktop grid system that only allows application tasks to run on hosts with, say, more
than 90% available CPU.
One weakness of our measurement method used for the SDSC data set is that
the resolution of the traces is limited to the length of the task. That is, in Entropia,
when a task is terminated, the task’s output is lost and as a result, the trace data is
lost at the same time. Consequently, the unavailability intervals observed in the data
set are pessimistic by at most the task length, and the statistical analysis may suffer
from some periodicity. However, we believe we can still use the data set for modelling
and simulation, and we cross-validate our findings using three other data sets where this
limitation in measurement method was removed.
Another weakness is that the interpolation of gaps may have hidden short
failures, such as reboots. However, the SDSC system administrators recorded only
seven reboots after applying Windows patches during the entire 1-month trace period.
(In particular, server reboot at 6PM on 8/18/03, reboot of all desktops at 6:30PM on
8/21, reboot of all desktops on 9/5/03 at 3AM, possible reboot of all desktops if user
was prompted on 9/7, server reboot at 6PM on 9/8, reboot of all machines at 11PM on
9/10/03, and reboot of all machines on 9/11/03 (not simultaneous).) Although this is
only a lower bound, we believe that the hosts were rebooted infrequently, given that there
was usually a single desktop allocated to each user, and in that sense the desktops were
dedicated systems. So the phenomenon described in [20] of undergraduates sitting
at a desktop and rebooting the machine to “clean” the system of remote users causing
high load does not occur.
Finally, because the Entropia system at SDSC was shared with other users, we
could only take measurements a few days at a time. By taking traces over consecutive
days rather than weeks or months, we do not track the long-term (e.g., monthly) churn
of machines and could lose the long-term temporal structure of availability.
Figure III.2: Host clock rate distribution in each platform (CDF; x-axis: clock rate
in MHz, y-axis: fraction of hosts; one curve each for SDSC, DEUG, LRI, and UCB).
III.D.2 DEUG and LRI Traces
The second data set was collected using the XtremWeb desktop grid software
continuously over about a 1 month period (1/5/05 - 1/30/05) on a total of about 100
hosts at the University of Paris-Sud. In particular, XtremWeb was deployed on a cluster
(denoted LRI) with a total of 40 hosts, and in a classroom (denoted DEUG) with another
40 hosts. The LRI cluster is used by the XtremWeb research group and other
researchers for performance evaluation and for running scientific applications. The DEUG
classroom hosts are used by first-year students. Typically, the classroom hosts are turned
off when not in use, and they are turned on on weekends only if a class is held then.
The XtremWeb worker was modified to keep the output of a running task if it
had failed, and to return the partial output to the server. This removes any bias due to
the failure of a fixed-sized task, and the period of availability logged by the task would
be identical to that observed by a real desktop grid application.
Compared to the clock distribution of hosts in the SDSC platform, the hosts in
the DEUG and LRI platforms have relatively homogeneous clock rates (see Figure III.2).
A large mode in the clock rate distribution for the DEUG platform occurs at 2.4GHz,
which is also the median; almost 70% of the hosts have clock rates of 2.4GHz. The clock
rates in the DEUG platform range from 1.6GHz to 2.8GHz. In the LRI platform, a mode
in the clock rate distribution occurs at about 2GHz, which is also the median; about
65% of the hosts have clock rates at that speed. The range of clock rates is 1.5GHz
to 2GHz.
III.D.3 UCB Trace
We also obtained an older data set first reported in [12], which used a different
measurement method. The traces were collected using a daemon that logged CPU and
keyboard/mouse activity every 2 seconds over a 46-day period (2/15/94 - 3/31/94) on 85
hosts. The hosts were used by graduate students in the EE/CS department at UC Berkeley.
We use the largest continuously measured period between 2/28/94 and 3/13/94. The
traces were post-processed to reflect the availability of the hosts for a desktop grid
application using the following desktop grid settings. A host was considered available
for task execution if the CPU average over the past minute was less than 5%, and there
had been no keyboard/mouse activity during that time. A recruitment period of 1 minute
was used, i.e., a busy host was considered available 1 minute after the activity subsided.
Task suspension was disabled; if a task had been running, it would immediately fail with
the first indication of user activity.
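The post-processing rules above can be sketched as follows. This is a simplified illustration (names are ours), assuming the trace is a list of (cpu_load, user_active) samples taken every 2 seconds:

```python
def ucb_available(samples, recruitment_s=60, cpu_threshold=0.05):
    """For each 2-second sample of (cpu_load, user_active), decide whether
    the host counts as available under the UCB post-processing rules:
    CPU average over the past minute below the threshold and no
    keyboard/mouse activity in that window. Because availability requires
    a full quiet minute, the 1-minute recruitment period falls out of the
    same sliding window."""
    window = recruitment_s // 2          # samples per minute at 2 s resolution
    out = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        cpu_ok = sum(s[0] for s in recent) / len(recent) < cpu_threshold
        idle_ok = not any(s[1] for s in recent)
        out.append(cpu_ok and idle_ok)
    return out
```

A host that was busy at sample i becomes available again only once a full recruitment window has elapsed, matching the rule that a busy host is considered available 1 minute after activity subsides.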
The clock rates of hosts in the UCB platform were all identical, but extremely
slow relative to the other platforms. To make the traces usable in our simulation
experiments, we transform the clock rate of each host to 1.5GHz (see Figure III.2), which
is a modest and reasonable value relative to the clock rates found in the other platforms,
and close to the clock rate of the hosts in the LRI platform.
The UCB data set is usable for desktop grid characterization because its
measurement method took into account the primary factors affecting availability,
namely keyboard/mouse activity and CPU load. As mentioned
previously, the method of determining CPU availability may not be as accurate as our
application-level method of submitting real tasks to the desktop grid system. However,
given that the desktop grid settings are relatively strict (e.g., a host is considered busy
if 5% of the CPU is used), we believe that the result of post-processing is most likely
accurate. The one weakness of this data set is that it is more than 10 years old, and
host usage patterns might have changed during that time. We use this data set to
show that in fact many characteristics of desktop grids have remained constant over the
years. Note that the UCB trace is the only previously existing data set that tracked user
mouse/keyboard activity, which is why it is usable for our desktop grid simulations.
III.E Characterization of Exec Availability
In this section, we characterize in detail exec availability and in our discussion,
the term availability will denote exec availability unless noted otherwise. We report and
discuss aggregate statistics over all hosts in each platform, and when relevant, we also
describe per host statistics.
III.E.1 Number of Hosts Available Over Time
We observed the total number of hosts available over time to determine at which
times during the week and during each day machines are the most volatile. This can
be useful for determining periods of interest when testing various scheduling heuristics.
Figures III.3(a), III.3(b), III.3(c), and III.3(d) show the number of available hosts for
a one week period for the SDSC, DEUG, LRI, and UCB traces respectively. The first
date shown in each figure corresponds to a Sunday, and the series of dates proceeds until
Saturday. Each date shown corresponds to 12:01AM on the particular day. The number
of hosts on the y-axis represents the number of hosts that had at least a single
availability interval during a one-hour range for the SDSC, DEUG, and LRI platforms, and
during a one-minute range for the UCB platform, since the unavailability intervals on this
platform tended to be much smaller than in the rest of the platforms.
With the exception of the LRI trace, we observe a diurnal cycle of volatility
beginning in general during weekday business hours. That is, during the business hours,
the variance in the number of machines over time is relatively high, and during
non-business hours, the number becomes relatively stable. In the case of the UCB and SDSC
traces, the number of machines usually decreases during business hours, whereas
in the DEUG trace, the number of machines can increase or decrease. This difference
in trends can be explained culturally. Most machines in enterprise environments in the
U.S. tend to be powered on throughout the day, and so any fluctuations in the number of
hosts are usually downward. In contrast, in Europe, machines are often powered
off when not in use during business hours (to save power or to reduce fire hazards, for
example), and as a result, the fluctuations can be upward.
Given that students and staff scientists form the majority of the user base at
SDSC and UCB, we believe the cause of volatility during business hours is primarily
keyboard/mouse activity by the user, or perhaps short compilations of
programming code rather than long computations, which can be run on clusters or
supercomputers at SDSC or UCB. This is supported by observations in other similar
CPU availability studies [73, 62]. In the DEUG platform, the volatility is most likely
due to machines being powered on and off, in addition to interactive desktop activity.
The number of hosts in the LRI trace (see Figure III.3(c)) does not follow any
diurnal cycle. This trend can be explained by the user base of the cluster, i.e., computer
science researchers that submit long running batch jobs to the cluster. The result is that
hosts tend to be unavailable in groups at a time, which is reflected by the large drop in
host number on 1/10/05. Moreover, there is little interactive use of the cluster, and so
Figure III.3: Number of hosts available for a given week for each platform:
(a) SDSC (07-Sep-2003 to 14-Sep-2003), (b) DEUG (09-Jan-2005 to 16-Jan-2005),
(c) LRI (09-Jan-2005 to 16-Jan-2005), (d) UCB (26-Feb-1994 to 05-Mar-1994).
Axes: time vs. total number of hosts available.
the number of hosts over time remains relatively constant. Also, the LRI cluster (with
the exception of a handful of nodes, possibly front-end nodes) is turned off on weekends,
which explains the drops on Saturday and Sunday.
We refer to the daily time period during which the set of hosts is most volatile as
business hours. After close examination of the number of hosts over time, we determine
that times delimiting business hours for the SDSC, DEUG, and UCB platforms are 9AM-
5PM, 6AM-6PM, and 10AM-5PM respectively. Regarding the LRI platform, we make
no distinction between non-business hours and business hours.
III.E.2 Temporal Structure of Availability
The successful completion of a task is directly related to the size of availabil-
ity intervals, i.e., intervals between two consecutive periods of unavailability. Here we
show the distributions of various types of availability intervals for each platform, which
characterize its volatility.
Figure III.4: Cumulative distribution of the length of availability intervals (in
hours) for business hours and non-business hours. (a) Business hours; mean interval
lengths: SDSC 2.04, DEUG 0.48, LRI 23.54, UCB 0.17 hours. (b) Non-business hours;
mean interval lengths: SDSC 10.23, DEUG 17.42, UCB 0.82 hours.
In Figure III.4, we show the length of the availability intervals in terms of
hours over all hosts in each platform. Figures III.4(a) and III.4(b) show the intervals
for business hours and non-business hours respectively, where the business hours for each
platform are as defined in Section III.E.1. In the case that a host is available
continuously during the entire business-hour or non-business-hour period, we truncate
the intervals at the beginning and end of the respective period. We do not plot
availability intervals for LRI during non-business hours, i.e., the weekend, because most
of the machines were turned off then.
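The truncation of intervals at period boundaries can be sketched as follows (an illustrative helper with names of our choosing, not code from the dissertation):

```python
def clip_intervals(intervals, period_start, period_end):
    """Truncate availability intervals at the boundaries of a business-hour
    (or non-business-hour) period, so that an interval spanning the whole
    period is counted as a single period-length interval. Intervals are
    (start, end) pairs on a common time axis; those entirely outside the
    period are dropped."""
    clipped = []
    for start, end in intervals:
        s, e = max(start, period_start), min(end, period_end)
        if s < e:
            clipped.append((s, e))
    return clipped
```

For example, with business hours 9–17, an interval (8, 18) is counted as (9, 17) and an interval (1, 2) is dropped.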
Comparing the interval lengths during business hours to those during non-business
hours, we observe, as expected, that the lengths tend to be much longer during non-business
hours (at least 5 times longer). For business hours, we observe that the UCB platform tends
to have the shortest availability intervals (µ ≈ 10 minutes), whereas DEUG and SDSC have
relatively longer intervals (µ ≈ half an hour and two hours, respectively). The LRI
platform by far exhibits the longest intervals (µ ≈ 23.5 hours).
The UCB platform has the shortest length most likely because the CPU thresh-
old of 5% is relatively low. The authors in [62] observed that system daemons can often
cause load up to 25%, and so this could potentially increase the frequency of availability
interruptions. In addition, the UCB platform was used interactively by students, and
often keyboard/mouse activity can cause momentary short bursts of 100% CPU activity.
Since the hosts in the DEUG and SDSC platforms were also used interactively, their
intervals are also relatively short. We surmise that the long intervals of the LRI
platform are a result of the cluster's workload, which often consists of periods of high
activity followed by periods of low activity. So, when the cluster is not in use, the
nodes tend to be available for longer periods of time.
In summary, the lower the CPU threshold, the shorter the availability intervals.
Availability intervals tend to be shorter in interactive environments, and intervals tend
to be longer during non-business hours than business hours.
In Figures III.4(a) and III.4(b), the CDF corresponding to the UCB trace during
business hours appears quite similar to the CDF during non-business hours, whereas the
number of hosts shown in Figure III.3(d) varies considerably during business hours versus
non-business hours. The reason for this discrepancy is that the CDF does not weight
the distribution according to the total sum of availability. For example, consider the
following two data sets A = {1, 1, 1, 1, 1, 10, 10, 10, 10, 10} and B = {1, 1, 1, 1,
1, 10 , 10, 10, 10, 1000} where each element is availability length using the same time
Figure III.5: Cumulative distribution of the length of availability intervals,
normalized to the total duration of availability, for business hours and non-business
hours on the UCB platform.
units across both sets. The CDFs of A and B within the x-range of [0, 10] will appear
quite similar, but in fact the platform from which B was derived is considerably more
stable given that it has an availability interval of 1000 time units. In Figure III.5, we
show the cumulative distribution where each interval length is weighted by its
contribution to the total sum of availability. We can then see that a larger portion of
the availability intervals during business hours are smaller in comparison to the
intervals during non-business hours.
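The weighting described above can be sketched in Python (an illustrative helper, names ours): each interval contributes its own length to the cumulative fraction, rather than counting once.

```python
def weighted_cdf(lengths):
    """CDF of interval lengths where each interval is weighted by its own
    length, i.e. by its contribution to the total availability. Returns
    (sorted_lengths, cumulative_fractions)."""
    xs = sorted(lengths)
    total = sum(xs)
    running, fracs = 0.0, []
    for x in xs:
        running += x
        fracs.append(running / total)
    return xs, fracs
```

Using data set B from the text, the unweighted CDF reaches 0.9 by length 10, but the weighted CDF reaches only 45/1045 ≈ 0.043 there, exposing how much of the total availability the single 1000-unit interval carries.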
While this data is interesting for applications that require hosts to be reachable
for a given period of time (e.g., content distribution) and could be used to confirm and
extend some of the work in [18, 15, 30, 76], it is less relevant to our problem of scheduling
compute-intensive tasks. Indeed, from the perspective of a compute-bound application,
a 1GHz host that is available for 2 hours with average 80% CPU availability is less
attractive than, say, a 2GHz host that is available for 1 hour with average 100% CPU
availability.
By contrast, Figure III.6 plots the cumulative distribution of the availability
intervals, both for business hours and non-business hours, but in terms of the number
of operations performed. So instead of showing availability interval durations, the x-
axis shows the number of operations that can be performed during the interval, which is
computed using our measured CPU availability. This directly quantifies the performance
that an application can extract from a host, factoring in the heterogeneity of hosts.
Other major trends in the data are as expected, with hosts and CPUs more available
during non-business hours than during business hours. This empirical data enables us to
quantify task failure rates and to develop a performance model (which we describe later
in Section III.G).
Figure III.6: Cumulative distribution of the length of availability intervals in
terms of operations for business hours and non-business hours. (a) Business hours;
mean interval lengths: SDSC 0.80, DEUG 0.26, LRI 10.35, UCB 0.07 trillion operations.
(b) Non-business hours; mean interval lengths: SDSC 4.25, DEUG 9.65, UCB 0.33
trillion operations.
III.E.3 Temporal Structure of Unavailability
When scheduling an application, it is also useful to know how long a host is
typically unavailable, i.e., unable to execute a task. Given two hosts with identical
availability interval lengths, one would prefer the host with smaller unavailability
intervals.
Using availability and unavailability interval data, one can predict whether a host has a
high chance of completing a task by a certain time, for example.
Figure III.7 shows the CDF of the length of unavailability intervals in terms
of hours during business hours and non-business hours for each platform. Note that
although a platform may exhibit a heavy-tailed distribution, it does not necessarily
mean that the platform is generally less available. (We describe CPU availability later
in Section III.F.1.)
We observe several distinct trends for each platform. First, for the SDSC
platform, we notice that the unavailability intervals are longer during business hours
than non-business hours. This can be explained by the fact that on weekends patches were
installed and backups were done, and several short unavailability intervals could result
if these were done as a batch. Second, for the DEUG platform, we found that unavailability
intervals tend to be much shorter during business hours (µ ≈ 20 min) than during
non-business hours (µ ≈ 32 hours). The explanation is that many of the machines are
turned off at night or on weekends, resulting in long periods of unavailability. The fact
that during non-business hours more than 50% of DEUG's unavailability intervals are less
than a half hour in length could be explained by a few machines still being used
interactively during non-business hours. Third, for the LRI platform, we notice that 60%
of unavailability intervals are less than 1 hour in length. This could be due to the fact
that most jobs submitted to clusters or MPPs tend to be quite short in length [58].
Lastly, for the UCB platform, the CDF of unavailability intervals appears similar across
business and non-business hours. We believe this is because the platform's user base
consists of students, and while fewer machines are in use during non-business hours, the
pattern in which the students use the machines is the same (i.e., using the
keyboard/mouse and running short processes), resulting in nearly identical distributions.
Figure III.7: Cumulative distribution of the length of unavailability intervals in
terms of hours. (a) Business hours; means: SDSC 1.26, DEUG 0.36, LRI 3.76, UCB 0.12
hours. (b) Non-business hours; means: SDSC 1.26, DEUG 31.95, UCB 0.10 hours.
III.E.4 Task Failure Rates
Figure III.8: Task failure rates during business hours. (a) Aggregate failure rate
vs. task size (minutes on a dedicated 1.5GHz machine), with least-squares correlation
coefficients ρ: SDSC 0.992, DEUG 0.988, LRI 0.999, UCB 0.982. (b) Cumulative
distribution of per-host failure rates for 35-minute tasks.
Based on our characterization of the temporal structure of resource availability,
it is possible to derive the expected task failure rate, that is, the probability that a
host will become unavailable before a task completes, from the distribution of the number
of operations performed in between periods of unavailability (from the data shown in
Figure III.6(a)), based on random incidence. To calculate the failure rate, we choose
hundreds of thousands of random points during periods of exec availability in the traces.
At each point, we determine whether a task of a given size (i.e., number of operations)
would run to completion given that the host was available for task execution. If so, we
count the trial as a success; otherwise, we count the trial as a failure. We do not count
tasks that are started during periods of exec unavailability, since we assume the worker
in this case would not be connected to the scheduler, and so scheduling a task at that
point would not be possible.
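This random-incidence estimate can be sketched as follows. The sketch assumes the availability intervals are already expressed as operation counts (as in Figure III.6); function and parameter names are ours:

```python
import random

def task_failure_rate(avail_intervals_ops, task_ops, trials=100_000, seed=0):
    """Estimate the probability that a task of `task_ops` operations fails,
    by dropping random start points into availability intervals (random
    incidence): a start point lands in an interval with probability
    proportional to that interval's size, and the task succeeds only if
    enough operations remain between the start point and the interval's
    end."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Pick an interval weighted by its size, then a uniform point in it.
        interval = rng.choices(avail_intervals_ops,
                               weights=avail_intervals_ops)[0]
        start = rng.uniform(0, interval)
        if interval - start < task_ops:
            failures += 1
    return failures / trials
```

For instance, if every interval holds 10 units of work, a 5-unit task fails about half the time, and a 20-unit task always fails.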
Figure III.8(a) shows the expected task failure rates computed for business
hours for each platform. Also, the least squares line for each platform is superimposed
with a dotted magenta line, and the correlation coefficient ρ is shown. For illustration
purposes, the x-axis shows task sizes not in number of operations, but in execution time
on a dedicated 1.5GHz host, from 5 minutes up to almost 7 hours. A maximum task
size of 350 minutes was chosen so that a significant number of task executions could be
simulated during business hours.
The expected task failure rate is strongly dependent on the task lengths. (The
weekends show similar linear trends, albeit the failure rates are lower.) The platforms
with failure rates from lowest to highest are LRI, SDSC, DEUG, and UCB, which agrees
with the ordering of each platform shown in Figure III.6(a). It appears that in all
platforms the task failure rate increases with task size and that the increase is almost
linear; the lowest correlation coefficient is 0.98, indicating that there exists a strong linear
relationship between task size and failure rate. By using the least squares fit, we can
define a closed-form model of the aggregate performance attainable by a high-throughput
application on the corresponding desktop grid (see Section III.G). Note, however, that
the task failure rate for larger tasks will eventually plateau as it approaches one.
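The least-squares fit used for this closed-form model can be computed directly. A minimal sketch (our own helper, not the authors' code) of fitting failure rate as a linear function of task size:

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares fit of ys (failure rates) against xs (task
    sizes); returns (slope, intercept) of the fitted line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx
```

Given measured (task size, failure rate) pairs, the returned slope and intercept define the near-linear model discussed above, which is then capped at a failure rate of one for very large tasks.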
While Figure III.8(a) shows the aggregate task failure rate of the system, Fig-
ure III.8(b) shows the cumulative distribution of the failure rate per host in each platform,
using a particular task size of 35 minutes on a dedicated 1.5GHz host. The heavier the
tail, the more volatile the hosts in the platform. Overall, the distributions appear quite
skewed. That is, a majority of the hosts are relatively stable. For example, with the
DEUG platform, about 75% of the hosts have failure rates of 20% or less. The UCB
platform is the least skewed, but even so, over 70% of the hosts have failure rates of 40%
or less. The fact that most hosts have relatively low failure rates can affect scheduling
tremendously, and we discuss this effect later in Chapter V.
Surprisingly, in Figure III.8(a), SDSC has lower task failure rates than UCB,
yet in Figure III.8(b), SDSC has a larger fraction of hosts with failure rates 20% or less
compared to UCB. The discrepancy can be explained by the fact that UCB still has a
larger fraction of hosts with failure rates of 40% or more than SDSC does; after
averaging, SDSC has lower failure rates.
III.E.5 Correlation of Availability Between Hosts
An assumption that permeates fault tolerance research in large scale systems
is that resource failure rates can be modelled with independent and identically
distributed (i.i.d.) probability distributions. Moreover, a number of analytical studies
of desktop grids assume
that exec availability is i.i.d. [51, 50, 43] to simplify probability calculations. We inves-
tigate the validity of such assumptions with respect to exec availability in desktop grid
systems.
First, we studied the independence of exec availability across hosts by calculat-
ing the correlation of exec availability between pairs of hosts. Specifically, we compared
the availability for each pair of hosts by adding 1 if both machines were available or both
machines were unavailable, and subtracting 1 if one host was available and the other
one was not. This method was used by the authors in [18] to study the correlation of
availability among thousands of hosts at Microsoft.
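The pairwise scoring just described can be sketched as follows (an illustrative version with names of our choosing), assuming both hosts' availability traces are sampled on a common time grid:

```python
def pairwise_match_score(avail_a, avail_b):
    """Correlation-like score between two hosts' availability traces:
    +1 for each instant where the hosts match (both available or both
    unavailable), -1 for each mismatch, normalized by the trace length
    so the result lies in [-1, 1]."""
    assert len(avail_a) == len(avail_b)
    score = sum(1 if a == b else -1 for a, b in zip(avail_a, avail_b))
    return score / len(avail_a)
```

Two hosts that are always in the same state score 1, hosts always in opposite states score -1, and an even split scores 0, matching the axes of Figure III.9.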
Figure III.9 shows the cumulative fraction of host pairs that are below a particular
correlation coefficient. The line labelled “trace” in the legend indicates the correlation
calculated according to the traces. The line labelled “min” indicates the minimum pos-
sible correlation given the percent of time each machine is available, and likewise for the
line labelled “max”.
In Figure III.9(b), corresponding to the DEUG platform, the point (0.4, 0.6)
means that for 60% of all possible host pairings, the percent of time that the two
hosts' availability “matched” was 40% or less than the time the hosts' availability
“mismatched”. The difference between points (0.4, 0.6) and (0.66, 0.6) means that for the
same fraction of host pairings, there is 26% more “matching” possible among the hosts.
The difference between points (-0.2, 0.1) and (-0.8, 0.1) indicates that for 10% of the
host pairings, there could have been at most 60% more “mismatching” availability. The
point (-0.4, 0.05) means that for 5% of the host pairings, the availability “mismatched”
40% or more of the time more often than it “matched”.
Overall, we can see that in all the platforms at least 60% of the host pairings
had positive correlation, which indicates that a host is often available or unavailable
when another host is available or unavailable respectively. However, Figures III.9(a)
and III.9(d) also show that this correlation is due to the fact that most hosts are usually
available (when combined with the fact that 80% of the time, hosts have CPU availability
of 80% or higher), which is reflected by how closely the trace line follows the random
line in the figure. That is, if two hosts are available (or unavailable) most of the time,
they are more likely to be available (or unavailable) at the same time, even randomly. As
a result, the correlation observed in the traces would in fact occur randomly because
the hosts have such high availability. This in turn gives strong evidence (although not
completely sufficient) that the exec availabilities of hosts in the SDSC and UCB platforms
are independent.
We believe the high likelihood of independence of exec availability in the SDSC
and UCB traces is primarily due to the user base of the hosts. As mentioned previously,
the user base of the SDSC consisted primarily of research scientists and administrators
who we believe used the Windows hosts primarily for word processing, surfing the Inter-
net, and other tasks that did not directly affect the availability of other hosts. Similarly,
for the UCB platform, we believe the students used the hosts primarily for short
durations, which is evident from the short availability intervals. So, because the primary factors
causing host unavailability, i.e., user processes and keyboard/mouse activity, are often
independent from one machine to another in desktop environments, we observe that exec
availability in desktop grids is often not significantly correlated.
On the other hand, Figures III.9(b) and III.9(c) show different trends where the
line corresponding to the trace differs significantly with respect to correlation (as much
as 20%) from the line corresponding to random correlation. The weak correlation of the
DEUG trace is due to the particular configuration of the classroom machines. These
machines had wake-on-LAN enabled Ethernet adapters, which allowed the machines to
be turned on remotely. The system administrators had configured these machines to
“wake” every hour if they had been turned off by a user. Since most machines are
turned off when not in use, many machines were awakened at the same time, resulting
in the weak correlation of availability. We believe that this wake-on-LAN configuration
is specific to the configuration of the machines in DEUG, and that in general, machine
availability is independent in desktop grids where keyboard/mouse activity is high.
Hosts in the LRI platform also show significant correlation relative to random.
This behavior is expected as batch jobs submitted to the cluster tend to consume a large
number of nodes simultaneously, and consequently, the nodes are unavailable for desktop
grid task execution at the same times.
The independence result is supported by the host availability study performed
at Microsoft reported in [18]. In this study, the authors monitored the uptimes of about
[Figure III.9 contains four panels, (a) SDSC, (b) DEUG, (c) LRI, and (d) UCB, each plotting
the fraction of host pairings (y-axis) against the correlation coefficient (x-axis, -1 to 1),
with curves labeled min, trace, random, and max.]

Figure III.9: Correlation of availability.
50,000 desktop machines over a one-week period, and found that the correlation of host
availability matched random correlation. Host unavailability implies exec unavailability,
and so our measurements subsume host unavailability in addition to the primary factors
that cause exec unavailability, i.e., CPU load and keyboard/mouse activity. We show
that exec availability between hosts is not significantly correlated when taking into account
the effects of CPU load and keyboard/mouse activity, in addition to host availability.
Another difference between our study and the Microsoft study is that the Mi-
crosoft study analyzed the correlation of the 50,000 machines as a whole and did not
separate the 50,000 desktop machines into departmental groups (such as hosts used by
administrators, software development groups, and management, for example). So it is
possible that some correlation within each group was hidden. In contrast, the groups
of hosts that we analyzed each had a relatively homogeneous user base, and our result is
stronger in the sense that we rule out the possibility of correlation among hosts with
heterogeneous user bases.
Other studies of host availability have in fact shown correlation of host failures [9]
due to power outages or network switch failures. While such failures can clearly affect
application execution, we do not believe these types of failures are the dominant cause
of task failures for desktop grid applications. Instead, the most common cause of task
failures is either high CPU load or keyboard/mouse activity [73], and our study, in
contrast to the others, directly takes into account how these factors affect exec availability.
So although host availability may be correlated, this correlation is significantly
weakened by the major causes of exec unavailability.
The evidence for host independence can simplify the reliability analysis of these
platforms. In particular, we use this result to simplify the calculation of the probability
of multiple host failures, which we describe in Chapter VI.
III.E.6 Correlation of Availability with Host Clock Rates
We hypothesized that a host’s clock rate would be a good indicator of host
performance. Intuitively, host speed could be correlated with a number of other machine
characteristics. For example, the faster a host’s clock rate is, the faster it should complete
a task, and the lower the failure rate should be. Or the faster a host’s clock rate is, the
X                    Y                                              ρ
---------------------------------------------------------------------
clock rate           mean availability interval length (time)  -0.0174
log(clock rate)      mean availability interval length (time)  -0.0311
clock rate           % of time unavailable                      0.1106
log(clock rate)      % of time unavailable                      0.1178
clock rate           mean availability interval length (ops)    0.5585
log(clock rate)      mean availability interval length (ops)    0.5196
clock rate           task failure rate (15 minute task)        -0.2489
log(clock rate)      task failure rate (15 minute task)        -0.2728
clock rate           P(complete 15 min task in 15 min)          0.859
log(clock rate)      P(complete 15 min task in 15 min)          0.792
Table III.2: Correlation of host clock rate and other machine characteristics during
business hours for the SDSC trace.
more often a user will be using that particular host, and the less available the host will
be. Surprisingly, host speed is not as correlated with these factors as we first believed.
Table III.2 shows the correlation of clock rate with various measures of host
availability for the SDSC trace. (Because the clock rates in the other platforms were
roughly uniform, we could not calculate the correlation.) Since clock rates often increase
exponentially throughout time, we also compute the correlation of the log of clock rates
with the other factors. We compute the correlation between clock rate and the mean
time per availability interval to capture the relationship between clock rate and the
temporal structure of availability. However, it could be the case that a host with very small
availability intervals is available most of the time. So, we also compute the correlation
between clock rate and the percent of time each host is unavailable. We find that there is
little correlation between the clock rate and mean length of CPU availability intervals
in terms of time (see rows 1 and 2 in Table III.2), or percent of the time the host
is unavailable (see rows 3 and 4). We explain this by the fact that many desktops are
used most of the time for intermittent and brief tasks, for example word processing, and
so even machines with relatively low clock rates can have high unavailability (for example
due to frequent mouse/keyboard activity), which results in availability similar to that of
faster hosts. Moreover, the majority of desktops are distributed across office rooms, so
desktop users do not always have the option of choosing a faster desktop to use.
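The correlations reported in Table III.2 can be reproduced with a straightforward computation. The sketch below uses hypothetical per-host data for illustration; only the structure of the calculation, pairing clock rate and log(clock rate) with each per-host metric, mirrors the table:

```python
import numpy as np

def clock_rate_correlations(clock_rates, metrics):
    """Correlate host clock rate (and its log) with per-host metrics,
    mirroring the row layout of Table III.2.

    clock_rates: array of per-host clock rates (MHz)
    metrics: dict mapping a metric name to a per-host array
    """
    rows = []
    for name, values in metrics.items():
        for label, x in (("clock rate", clock_rates),
                         ("log(clock rate)", np.log(clock_rates))):
            rho = np.corrcoef(x, values)[0, 1]  # Pearson correlation
            rows.append((label, name, rho))
    return rows

# Hypothetical per-host data for illustration only.
rng = np.random.default_rng(1)
clock = rng.uniform(300, 3000, size=200)
metrics = {
    "mean availability interval length (time)": rng.exponential(2.6, 200),
    "% of time unavailable": rng.uniform(0, 0.5, 200),
}
for label, name, rho in clock_rate_correlations(clock, metrics):
    print(f"{label:18s} {name:45s} {rho:+.4f}")
```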
Task size   Failure rate        ρ
---------------------------------
 5          0.063         -0.2053
10          0.097         -0.2482
15          0.132         -0.2489
20          0.155         -0.2624
25          0.177         -0.2739
30          0.197         -0.280
35          0.220         -0.2650
Table III.3: Correlation of host clock rate and failure rate during business hours. Task
size is in terms of minutes on a dedicated 1.5GHz host.
What matters more to an application than the time per interval is the opera-
tions per interval, how it affects the task failure rate, and whether it is related to the
clock rate of the host. We compute the correlation between clock rate and CPU avail-
ability in terms of the mean operations per interval and failure rate for a 15 minute
task. There is only weak positive correlation between clock rate and the mean number
of operations per interval (see rows 5 and 6), and weak negative correlation between the
clock rate and failure rate (see rows 7 and 8). Any possibility of strong correlation would
have been weakened by the randomness of user activity. Nevertheless, the factors are
not independent of clock rate because hosts with faster clock rates tend to have more
operations per availability interval, thus increasing the chance that a task will complete
during that interval.
Furthermore, in rows 9 and 10 of Table III.2, we see the relationship between
clock rate and rate of successful task completion within a certain amount of time. In
particular, rows 9 and 10 show the fairly strong positive correlation between clock rate
and the probability that a task completes in 15 minutes or less. The size of the task is
15 minutes when executed on a dedicated 1.5GHz host. (We also computed the correlation
for other task sizes and found similar correlation coefficients). Clearly, whether a task
completes in a certain amount of time is related to clock rate. However, this relationship
is slightly weakened due to the randomness of exec unavailability, as unavailability could
cause a task executing on a relatively fast host to fail. One implication of this correlation
shown in rows 5-10 is that a scheduling heuristic based on using host clock rates for task
assignment may be effective.
The correlation between host clock rate and task failure rate should increase
with task size (until all hosts have very high failure rates close to 1). Short tasks that
have a very low mean failure rate (near zero) over all hosts will naturally have low
correlation. As the task size increases, the failure rate will be more correlated with clock
rates since in general the faster hosts will be able to finish the tasks sooner. Table III.3
shows the correlation coefficient between clock rate and failure rate for different task
sizes. There is only weak negative correlation between host speed and task failure rate,
and it increases in general with the task size as expected. Again, we believe the weak
correlation is partly due to the randomness of keyboard and mouse activity on each
machine. One consequence of this result with respect to scheduling is that the larger the
task size, the more important it is to schedule the tasks on hosts with faster clock rates.
We also recomputed the correlation for only those hosts with failure rates
greater than 10%, 20% and 30% in an effort to remove those hosts that were exceptionally
available, but did not find significant changes in the correlation coefficients. Removing
certain hosts from the correlation calculation left relatively few hosts remaining, so it is
not clear if there was enough data to make a meaningful calculation.
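The dependence of failure rate on task size can be illustrated with a simplified estimator: a task fails if the availability interval it runs in delivers fewer operations than the task needs. This sketch uses hypothetical interval data, and is a simplification of the dissertation's method, which also accounts for where within an interval a task starts; it shows why larger tasks fail more often:

```python
import numpy as np

def task_failure_rate(intervals_ops, task_ops):
    """Estimate the failure rate for tasks of a given size.

    intervals_ops: operations delivered during each availability
        interval of a host (one entry per interval)
    task_ops: task size in operations

    A task started at the beginning of an interval fails if that
    interval delivers fewer operations than the task needs.
    """
    intervals_ops = np.asarray(intervals_ops)
    return float((intervals_ops < task_ops).mean())

# Hypothetical interval sizes, in operations.
intervals = [5e9, 1e9, 3e9, 0.5e9]
rates = [task_failure_rate(intervals, s * 1e8) for s in (1, 20, 60)]
# rates grows with task size, which is what drives the trend in Table III.3.
```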
III.F Characterization of CPU Availability
While the temporal structure of availability directly impacts task execution,
it is also useful to observe the CPU availability of the entire system for understanding
how the temporal structure of availability affects system performance as a whole, and
how the CPU availability per availability interval fluctuates, if at all. That is, aggregate
CPU availability puts availability and unavailability intervals into perspective, capturing
the effect of both types of intervals on the compute power of the system, as statistics of
either availability intervals or unavailability intervals do not necessarily reflect the
characteristics of the other.
III.F.1 Aggregate CPU Availability
An estimate of the computational power (i.e., number of cycles) that can be de-
livered to a desktop grid application is given by an aggregate measure of CPU availability.
[Figure III.10 contains two panels, (a) Business hours and (b) Non-business hours, each
plotting the percentage of time above threshold (y-axis) against the availability threshold
in percent (x-axis), with one curve per platform.]
Figure III.10: Percentage of time when CPU availability is above a given threshold, over
all hosts, for business hours and non-business hours.
For each data point in our measurements (over all hosts), we computed how often CPU
availability is above a given threshold for both business hours and non-business hours
on each platform. Figures III.10(a) and III.10(b) plot the frequency of CPU availability
being over a threshold for threshold values from 0% to 100%: the data point (x, y)
means that y% of the time CPU availability is over x%. For instance, the graphs show
that CPU availability for the SDSC platform is over 80% about 80% of the time during
business hours, and 95% of the time during non-business hours.
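The curves of Figure III.10 are empirical complementary distribution functions of the pooled availability samples. A minimal sketch follows, with made-up samples chosen to echo the SDSC observation that availability is 80% or higher about 80% of the time:

```python
import numpy as np

def time_above_threshold(cpu_avail, thresholds):
    """For each threshold x, the fraction of measurements with CPU
    availability strictly greater than x, pooled over all hosts.

    cpu_avail: flat array of availability samples in [0, 100]
    """
    cpu_avail = np.asarray(cpu_avail)
    return np.array([(cpu_avail > x).mean() for x in thresholds])

# Illustrative samples: 80 measurements at 90% availability,
# 15 at 0%, and 5 at 50%.
samples = np.concatenate([np.full(80, 90.0), np.full(15, 0.0), np.full(5, 50.0)])
thresholds = np.arange(0, 101, 10)
curve = time_above_threshold(samples, thresholds)
# curve[t] is the y-value plotted against thresholds[t] in Figure III.10;
# it is non-increasing by construction.
```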
In general, CPU availability tends to be higher during non-business hours than
business hours. For example, during business hours, the SDSC and DEUG platforms have
zero CPU availability about 15% of the time, whereas during non-business hours, the same
hosts in the same platforms almost always have CPU availability greater than
zero. During business hours, we observe that SDSC and DEUG have an initial drop
from a threshold of 0% to 1%. We believe the CPU unavailability indicated by this drop
is primarily due to exec unavailability caused by user activity/processes on the hosts in
the system. Other causes of CPU unavailability include the suspension of the desktop
grid task or brief bursts of CPU unavailability that are not long enough to push the load
on the machine above the CPU threshold. The exceptions are the UCB and LRI
platforms, which show no such drop. The UCB platform is level from 0 to 1% because
the worker's CPU threshold is relatively stringent (5%), resulting in common but very
brief unavailability intervals, which have little impact on the overall CPU availability of
the system. One reason that the LRI plot is virtually constant between 0 and
1% is that the cluster is lightly loaded, so most hosts' CPUs are available most of the
time.
After this initial drop between threshold of 0 and 1%, the curves remain almost
constant until the respective CPU threshold is reached. The levelness is an artifact of
the worker’s CPU threshold used to determine when to terminate a task. That is, the
only reason our compute-bound task (with little and constant I/O) would have a CPU
availability less than 100% is if there were other processes of other users running on the
system. When CPU usage of the host’s user(s) goes above the worker’s CPU threshold,
the task is terminated, resulting in the virtually constant line from the 1% threshold to
(100% - worker’s CPU threshold). (Note in the case of UCB, we assume that either the
host’s CPU is completely available or unavailable, which is valid given that the worker’s
CPU threshold was 5%.) The reason that the curves are not completely constant during
this range is possibly because of system processes that briefly use the CPU, but not long
enough to increase the moving average of host load to cause the desktop grid task to be
terminated.
Between the threshold of (100% - worker’s CPU threshold) and 100%, most of
the curves have a downward slope. Again, the reason this slope occurs is that system
processes (which often use 25% of the CPU [52]) are running simultaneously. On average,
for the SDSC platform, CPUs are completely unavailable 19% of the time during weekdays
and 3% of the time during weekends. We also note that both curves are relatively flat
for CPU availability between 1% and 80%, denoting that hosts rarely exhibit
availabilities in that range.
Other studies have obtained similar data about aggregate CPU availability in
desktop grids [28, 55]. While such characterizations make it possible to obtain coarse
estimates of the power of the desktop grid, it is difficult to relate them directly to what
a desktop grid application can hope to achieve. In particular, the understanding of host
availability patterns, that is, the statistical properties of the duration of time intervals
during which an application can use a host, and a characterization of how much power
that host delivers during these time intervals, are key to obtaining quantitative measures
of the utility of a platform to an application. We develop such a characterization in the
next chapter.
III.F.2 Per Host CPU Availability
While aggregate CPU availability statistics reflect the overall availability of the
system, it is possible that some hosts are less available than others. Here we show CPU
availability per host to reveal any potential imbalance.
Figures III.11, III.12, III.13, and III.14 show the CPU availability per host.
Each vertical bar corresponds to the CPU availability for a particular host, where the
hosts are sorted by clock rate along the x-axis. In each bar, there are sub-bars that
correspond to the percentage of time the host's CPU availability fell in a particular range.
In Figures III.11 and III.12, we observe heavy imbalance in terms of CPU unavailability.
Moreover, it appears that CPU availability does not always increase with host clock rate
as several of the slowest 50% of the hosts have CPU unavailability greater than or equal
to 40% of the time. In Figure III.13, which corresponds to the LRI platform, we see
that the system is relatively underutilized as few of the hosts in the LRI cluster have
CPU unavailability greater than 5%. In Figure III.14, which corresponds to the UCB
platform, we observe that the system is also underutilized as none of the hosts have CPU
unavailability greater than 5% of the time; we believe this is a result of the system’s low
CPU threshold. In summary, the CPU availability does not show strong correlation
with clock rate, which is reaffirmed by Table III.3, and the amount of unavailability is
strongly dependent on the user base and worker’s criteria for idleness. One implication
of this result is that simple probabilistic models of desktop grid systems (such as those
described in [43, 50]) that assume hosts have constant frequencies of unavailability and
constant failure rates are insufficient for modelling these complex systems.
[Stacked-bar figure: nodes sorted by clock rate (x-axis) versus the percentage of time
spent within each CPU availability range, in 10% bins from 0-10% to 90-100% (y-axis).]

Figure III.11: CPU availability per host in SDSC platform.
[Stacked-bar figure in the same format as Figure III.11.]

Figure III.12: CPU availability per host in DEUG platform.
[Stacked-bar figure in the same format as Figure III.11.]

Figure III.13: CPU availability per host in LRI platform.
[Stacked-bar figure in the same format as Figure III.11.]

Figure III.14: CPU availability per host in UCB platform.
III.G An Example of Applying Characterization Results:
Cluster Equivalence
The results of our measurement and characterization study have numerous uses
for desktop grid modelling and simulation. In this section, we give an example of how
one could use our characterization to derive a performance model of a desktop grid
system, quantifying the negative impact that heterogeneous and volatile resources have
on system throughput. We measure the impact and utility using a cluster equivalence
metric: given an N-node desktop grid, what is the equivalent M-node dedicated cluster?
We focus our analysis on the SDSC platform since the DEUG, LRI, and UCB platforms
each had less than 100 hosts, but our models and utility metrics are applicable to the
other platforms as well.
III.G.1 System Performance Model
In Section III.E.4, we found that task failure rate was strongly correlated with
task size, and so we could use a linear function of task size to model the task failure
rate. In this section, we use this task failure rate function in a performance model for the
desktop grid, and with this model, determine the grid’s cluster equivalence for a high-
throughput application. In particular, we propose a model for an application’s expected
work rate (that is, the number of useful operations performed per unit time), W(s), given
a uniform task size s (number of operations per task expressed as minutes on a dedicated
1.5GHz host) as follows.
From our measurements we can determine the average overhead, g, between
consecutive tasks scheduled on the same resource due to the desktop grid server (see
Section III.D). Using the method described in Section III.E.4, we can compute the task
failure rate, f(s), as a function of the task size. We can also estimate the average com-
pute rate in operations per second for a host in the desktop grid, r, by computing the
average delivered operations per second for each host using our availability traces, and
taking an average over all hosts. W(s) is the number of operations per second for an
application using N hosts in the desktop grid and is given by:
W(s) = N × r(1 − f(s)) / (1 + (r/s)g),     (III.1)

where r(1 − f(s)) is the effective compute rate accounting for failures and (r/s)g is the
relative overhead for scheduling each task.
We can then instantiate the above model with data obtained from any of the
desktop grid platforms to determine the corresponding work rate. As an example, we
compute W (s) for the SDSC grid with N = 220 hosts. The average overhead g for
scheduling a task was characterized to be 36.9 seconds on the SDSC platform (see Sec-
tion III.D). Our least squares fit to the weekday task failure rate (as calculated using the
method described in Section III.E.4) for the SDSC data was f(s) = (0.0031)s + 0.0142.
The average weekday work rate r (taking into account unavailability and host load) per
host was 31.6 million operations per second. Substituting into equation III.1, we obtain
a closed-form expression for W (s) for the SDSC platform:
W(s) = (6863257896 − 21582572·s) / (1 + 1946229374/(1136·s))     (III.2)
Figure III.15 plots this rate W (s) for a range of task sizes executed on the
SDSC grid on weekdays, and also plots the rate for weekends. On weekdays, for task
sizes below 13 minutes per-host, progress increases rapidly as the task size compensates
for the fixed overhead. However, as task size increases further, the per-host progress
decreases as the penalty of additional task failures wastes some of the CPU cycles. This
trend is also exhibited for weekends, but the longer availability intervals enable compute
rates to improve up to task sizes around 25 minutes in length. Thus, for both weekdays
and weekends, the trade-off between overhead and failures produces an optimal task size,
which is 13 and 25 minutes respectively. Note that these are the number of minutes that
a task execution would require on a dedicated 1.5GHz host, so the effective execution
times experienced on the SDSC Entropia grid range from approximately five times longer
to two times shorter.
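The trade-off described above can be checked numerically. The sketch below instantiates Equation III.1 with the reported SDSC weekday parameters; the constant converting a task size in minutes on a dedicated 1.5GHz host into operations is an assumed value, not one restated in the text, so the absolute work-rate values are illustrative while the shape of the curve and the interior optimum follow the model:

```python
import numpy as np

# SDSC weekday parameters from Section III.G.1.
N = 220                               # hosts
r = 31.6e6                            # average delivered ops/sec per host
g = 36.9                              # scheduling overhead per task (seconds)
f = lambda s: 0.0031 * s + 0.0142     # failure rate vs. task size s (minutes)

# Assumed conversion from "minutes on a dedicated 1.5GHz host" to
# operations; the exact benchmark constant is not restated in the text.
OPS_PER_MIN_1500MHZ = 2.0e9

def work_rate(s):
    """Equation III.1: aggregate useful ops/sec for task size s (minutes)."""
    s_ops = s * OPS_PER_MIN_1500MHZ   # task size in operations
    return N * r * (1.0 - f(s)) / (1.0 + r * g / s_ops)

sizes = np.arange(1, 51)
best = sizes[np.argmax(work_rate(sizes))]
# The per-task overhead (favoring large tasks) and the failure rate
# (favoring small tasks) trade off to produce an interior optimum; with
# these numbers it falls near the 13 minutes cited for weekdays.
```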
[Line plot: work rate in operations per second (y-axis) versus task size in minutes
(x-axis, 0 to 50), with one curve for weekdays and one for weekends.]

Figure III.15: Model of application work rate for entire SDSC desktop grid, in number
of operations per second versus task size, in number of minutes of dedicated CPU time
on a 1.5GHz host.
III.G.2 Cluster Equivalence
To characterize the impact of resource volatility in a desktop grid on usable
performance, we use a cluster equivalence utility metric, which was first introduced
in [8]. That is, for a given desktop environment (and corresponding temporal CPU
availability), what fraction of a dedicated cluster CPU is each desktop CPU worth to
an application? With this information, we can establish for a desktop grid the size of
a dedicated cluster to which its performance is equivalent. More precisely: “Given an
N -host desktop grid, how many nodes of a dedicated cluster, M , with comparable CPU
clock rates, are required such that the two platforms have equal utility?” We define
M/N as the cluster equivalence ratio of the desktop grid. Because the objective is to
quantify the performance impact of resource volatility, we normalize assuming that the
CPU clock rate of each node in the cluster is equal to the mean CPU clock rate in the
desktop grid.¹
It is clear from our desktop grid measurements that the cluster equivalence
ratio depends on the application’s structure and characteristics. Here we consider only
task parallel applications with various task sizes. The higher the task size, the lower
the cluster equivalence ratio, since the application becomes more subject to failures (see
Figure III.15).

¹Numerous industrial interactions by one of the committee members suggest that this is true in many
companies.
[Two-panel figure: (a) Whole SDSC desktop grid and (b) SDSC desktop grid as of 2001.
Each panel plots the cluster equivalence ratio (left y-axis) and the equivalent number of
cluster nodes (right y-axis) against subjob size in minutes (x-axis, 0 to 50), with curves
for weekdays and weekends.]

Figure III.16: Cluster equivalence of a desktop grid CPU as a function of the application
task size. Two lines are shown, one for the resources on weekdays and one for weekends.
We compute the cluster equivalence for a range of application task sizes, as
shown in Figure III.16(a). These curves are essentially scaled versions of those in
Figure III.15. The data points on this graph can be used to determine the effective cluster
CPUs that the SDSC desktop grid delivers. For example, for 56-million-operation tasks
(approximately 25 minutes on a 1.5 GHz CPU), the performance of the 220-node SDSC
Entropia desktop grid is equivalent to a 209-node cluster on weekends, and to a 160-node
cluster on weekdays.
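The cluster equivalence computation itself is simple once W(s) is known. In the sketch below, r_dedicated (the delivered rate of a dedicated cluster node at the grid's mean clock rate) and the example work rate are assumed values, chosen only to land near the 160-node weekday figure quoted above:

```python
# Cluster equivalence: how many dedicated nodes M match the grid's
# delivered work rate W(s)?
N = 220                       # hosts in the desktop grid
r_dedicated = 33.3e6          # assumed ops/sec for one dedicated node

def cluster_equivalence(W_s):
    """Return (M, M/N): the dedicated-cluster size M whose aggregate
    rate M * r_dedicated equals the grid's work rate W(s), and the
    cluster equivalence ratio M/N."""
    M = W_s / r_dedicated
    return M, M / N

# Example: an assumed weekday work rate for ~25-minute tasks.
M, ratio = cluster_equivalence(5.33e9)
# M comes out near the ~160 dedicated nodes cited for weekdays.
```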
For comparison, Figure III.16(b) shows the cluster equivalence metric computed
for a subset of the desktop grid that excludes the most recent machines (in this case, the
153 machines produced after the year 2001, which had clock rates higher than 1GHz).
The mean clock rate of this subset of hosts was approximately 700MHz. We observe
that the trends are similar to that seen in Figure III.16(a). In fact, the average relative
difference between the cluster equivalence ratios for the entire desktop grid and the
subset, over all task sizes, is approximately 10%.
The fact that our cluster equivalence metric is relatively consistent for different
[Line plot: cumulative percentage of total power (y-axis) versus the top percentage of
sorted hosts (x-axis), with one curve for hosts sorted by delivered power and one for
hosts sorted by clock rate.]

Figure III.17: Cumulative percentage of total platform computational power for SDSC
hosts sorted by decreasing effectively delivered computational power and for hosts sorted
by clock rates.
subsets of the desktop grid is explained by Figure III.17. This figure plots the cumulative
percentage of operations delivered by a subset of the entire platform, corresponding to
an increasing percentage of the sorted hosts. In other words, data point (x, y) on the
graph means that x% of the hosts (taking the most “useful” hosts first) deliver y% of
the compute operations of the entire platform. Hosts are sorted either by number
of delivered operations per second (as computed from our measurements) or by the
corresponding clock rate, as seen in the two curves in Figure III.17. We can see that the
two curves are strikingly similar. This indicates that the average availability patterns
of the hosts in our platform over our measurement period are uncorrelated with host
clock rates (as shown earlier in this chapter). This in turn explains why our cluster equivalence
metric is consistent for the whole platform and a subset containing only older machines.
Interestingly, we also find that the curves in Figure III.17, while not linear, are
only moderately skewed (as compared to the dotted line in the figure). For instance,
the 30% most useful hosts deliver 50% of the overall compute power. Similarly, the 30%
least useful hosts deliver approximately 14% of the overall compute power. Note that
this skew is not so high as to justify using only a small fraction of the resources.
III.H Summary
In this chapter, we described a simple trace method that measured CPU avail-
ability in a way that reflects the same CPU availability that could be experienced by a
real application. We used this method to gather data sets on three platforms with distinct
clock rate distributions and user types. In addition, we obtained a fourth data set gath-
ered earlier by the authors in [12]. Then, we derived several useful statistics regarding
exec availability and CPU availability for each system as a whole and individual hosts.
The findings of our characterization study can be summarized as follows. First,
we found that even in the most volatile platform with the strictest host recruitment policy,
availability intervals tend to be 10 minutes in length or greater, while the mean length
over all platforms with interactive users was about 2.6 hours. We showed in Section III.G
that using relatively short task lengths of 10 minutes to utilize such availability intervals
does not significantly harm the aggregate achievable performance of the application, even
with the server’s enforced task assignment overhead.
Second, we found that task failure rates on each system were correlated with
the task size, and so task failure rate can be approximated as a linear function of task
size. This in turn allowed us to construct a closed-form performance model of the system,
which we describe in Section III.G.
Third, we observed that on platforms with interactive users, exec availability
tends to be independent across hosts. However, this independence is affected significantly
by the configuration of the hosts; for example wake-on-LAN enabled Ethernet adapters
can cause correlated availability among hosts. Also, in platforms used to run batch jobs,
availability is significantly correlated.
Fourth, the availability interval lengths in terms of time are not correlated with
clock rates, nor is the percentage of time a host is unavailable. This means that hosts with
faster clock rates are not necessarily used more often. Nevertheless, interval lengths in
terms of operations and task failure rates are correlated with clock rates. This indicates
that selecting resources according to clock rates may be beneficial.
Finally, we studied the CPU availability of the resources. We found that be-
cause of the recruitment policies of each worker, the CPU availability of the hosts is
either above 80% or zero. Regarding the CPU availability per host, there is wide varia-
tion of availability from host to host, especially in the platforms with interactive users.
As a result, even in platforms with hosts of identical clock rates, there is significant
heterogeneity in terms of the performance of each host with respect to the application.
Chapter IV
Resource Management: Methods,
Models, and Metrics
IV.A Introduction
In the previous chapter, we measured and characterized four desktop grids, and
virtually all of the statistics that we computed influence our design and implementation
of scheduling heuristics, which we discuss in the remaining chapters. In this chapter,
we outline the scheduling techniques on which these heuristics are based. In addition,
we describe our platform and application models and instantiations, simulation method,
and performance metrics used to evaluate our scheduling heuristics.
We consider the problem of scheduling applications at the application and re-
source management level on an enterprise desktop grid, which consists of volatile hosts
within a LAN. LAN’s are often found within a corporation or university, and several
companies such as Entropia and United Devices have specifically targeted these LAN’s
as a platform for supporting desktop grid applications. Enterprise desktop grids are an
attractive platform for large scale computation because the hosts usually have better
connectivity with 100Mbps Ethernet for example and have relatively less volatility and
heterogeneity than desktop grids that span the entire Internet. Nevertheless, compared
to dedicated clusters, enterprise grids are volatile and heterogeneous platforms, and so
the main challenge is then to develop fault-tolerant, scalable, and efficient scheduling
heuristics. Although we evaluate our heuristics in enterprise environments (since we
have traces from only enterprise desktop grids), we design the heuristics so that they
would be applicable to Internet environments as well. This has a number of consequences
with respect to the platform and application models, which we describe in Section IV.B.
The most commonly used scheduling method in desktop grid systems [63,
37, 39] is First-Come-First-Serve (FCFS). Desktop grids, such as SETI@home, FOLD-
ING@home, and FIGHTAIDS@home, have typically been used to run high-throughput
jobs, where the performance metric is the aggregate work rate of the system over weeks
or months. As such, the highest aggregate work rate can be achieved by assigning tasks
to hosts in a FCFS manner as the start-up and the wind-down phases of application
executions are negligible compared to the steady-state phase. When the number of tasks
is far greater than the number of hosts, the scheduler should allocate tasks to as many
resources as possible, and so there is no issue of resource selection.
Although desktop grids are commonly used for high-throughput jobs, desktop
grids are also an attractive platform for supporting rapid application turnaround on
the order of minutes or hours. As discussed in Chapter I, numerous applications from
computational biology or graphics (such as interactive scientific visualization [34]) require
rapid turnaround. In addition, most applications from MPP workloads are less than a
day in length [58], and applications in a company’s workload often require relatively
rapid turnaround within a day’s time [29].
Supporting rapid turnaround for applications on volatile desktop grids is chal-
lenging for a number of reasons. First, the resources are heterogeneous in terms of clock
rates, memory and disks sizes, and network connectivity, for example. The distribution
of clock rates in many typical desktop grids often spans over an order of magnitude.
Assuming no system capability for task preemption, an application with equally-sized
tasks, where the number of tasks is roughly the number of hosts, could potentially suffer
from severe load imbalance if the last few tasks are allocated to slow hosts near the end
of application completion, delaying the application completion as most of the hosts sit
needlessly idle.
Second, the resources are shared with the desktop user or owner in a way that
any subset of machines cannot be reserved for a block of time. Without dedicated
access to a set of machines, unplanned interruptions in task execution make scheduling
more difficult than if the resources were completely dedicated. In particular, because the
desktop user activities are given priority over the desktop grid application, the desktop is
volatile as a result of fluctuating CPU and host availability, which can result in frequent
task failures. For example, a task that takes 35 minutes to run on a dedicated 1.5GHz
host has a 22% failure rate calculated by means of random incidence over a trace data
set collected from the desktop grid at SDSC (as shown in Section III.E.4). In a system
without checkpointing support, failures near the end of application execution can result
in poor application performance because a failed task must be restarted from scratch,
which in turn delays application completion.
These causes of volatility have significant effects on applications that require
rapid turnaround. In Figure IV.1, we show the cumulative number of tasks completed
over time observed using trace-driven simulations, given a server that schedules tasks to
hosts in a FCFS manner. These results are obtained for applications with 100, 200,
and 400 tasks run on a platform with about 190 hosts, where each task would execute for
15 minutes on a dedicated 1.5GHz processor, and the simulated platform is driven by
our SDSC trace (see Section IV.B for a detailed description of our simulation methodology).
In each of the three curves there is an initial hump as the system reaches steady
state, after which throughput increases roughly linearly. The cumulative throughput
then reaches a plateau, which accounts for an increasingly large fraction of application
makespan as the number of tasks decreases. For the application with 100 tasks, 90% of
the tasks are completed in about 39 minutes, but the application does not finish until
79 minutes have passed, which is almost identical to the makespan for a much larger
application with 400 tasks.
There are two main causes of this plateau. The first cause is task failures that
occur near the completion of the application. When a task fails, it must be started from
scratch, and when this occurs near the end of the application execution, it will delay
the application’s completion. The second cause is tasks assigned to slow hosts. Once a
task is assigned to a slow host, a FCFS scheduler without task preemption or replication
capabilities will be forced to wait until the slow host completes the result.
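The second cause can be illustrated with a minimal sketch (in Python, using hypothetical per-host task times and ignoring failures; the real experiments are trace-driven): each ready host pulls the next task, and without preemption or replication a single slow host that grabs one of the last tasks dominates the makespan.

```python
import heapq

def fcfs_makespan(minutes_per_task, num_tasks):
    """Makespan of a pull-based FCFS schedule with no preemption.

    `minutes_per_task[i]` is the (hypothetical) time host i needs per task.
    The earliest-available host always receives the next task.
    """
    # Priority queue of (time the host next becomes free, host index).
    ready = [(0.0, i) for i in range(len(minutes_per_task))]
    heapq.heapify(ready)
    finish = 0.0
    for _ in range(num_tasks):
        t, i = heapq.heappop(ready)
        done = t + minutes_per_task[i]
        finish = max(finish, done)
        heapq.heappush(ready, (done, i))
    return finish

# Three fast hosts and one ten-times-slower host, four tasks: the slow
# host's single task dominates the makespan.
assert fcfs_makespan([15, 15, 15, 150], 4) == 150
```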
As the number of tasks gets large when compared to the number of hosts in
the platform, the plateau becomes less significant, thus justifying the use of a FCFS
strategy. However, for applications with a relatively small number of tasks, resource
selection could improve the performance of short-lived applications significantly.
Figure IV.1: Cumulative task completion vs. time.
We design various resource management methods to address this scheduling
problem, which we describe in the next section.
IV.B Models and Instantiations
In order to design and evaluate scheduling heuristics, we create models of the
desktop grid systems and targeted applications that capture only the most relevant char-
acteristics. These models enable computationally tractable simulations of a number of
heuristics for a large range of applications and platforms. We then instantiate these
models with our desktop grid traces described in Chapter III. That is, we implement a
discrete event simulator based on the application and platform models, and then drive
the simulator using the traces described in Chapter III that were collected from four
real desktop grid platforms as well as other representative grid configurations. We use
simulation for studying resource selection on desktop grids as direct experimentation
does not allow controlled and thus repeatable experiments. In true desktop grid fashion,
Figure IV.2: Scheduling Model
our simulations are deployed using the XtremWeb [37] desktop grid system. We describe
these system and application models and their instantiations in this section, referenc-
ing the Client, Application and Resource Management, and Worker levels described in
Chapter II and shown in Figure II.1.
IV.B.1 Platform model and instantiation
At the application and resource management level, we assume a scheduler that
maintains a queue of tasks to be scheduled and a ready queue of available workers (see
Figure IV.2). As workers become available, they notify the server, and the scheduler on
the server places the workers’ corresponding task requests in the ready queue. During
resource selection, the scheduler examines the ready queue to determine the possible
choices for task assignment.
Because the hosts are volatile and heterogeneous, the size of the host ready
queue changes dramatically during application execution as workers are assigned tasks
(and thus removed from the ready queue), and as workers of different speeds and avail-
ability complete tasks and notify the server. The host ready queue is usually only a small
subset of all the workers, since workers only notify the server when they are available for
task execution.
At the Worker level, we assume that the worker running on each host periodi-
cally sends a heartbeat to the server that indicates the state of the task. We assume the
worker sends a heartbeat every minute to indicate whether the task is running or has
failed, as is done in the XtremWeb system.
We do not assume that the system provides remote checkpointing abilities.
All Internet desktop grid systems lack remote checkpointing. One reason is that a
significant number of machines (as high as 88% in 2000 [13]) are connected with dial-up
modems and so transferring large core dumps (often ≥ 512MB in size [65]) over a wide
area quickly is not feasible. However, the task itself could save its state persistently on
disk so that if the task is terminated, the task can revert to its previous state at the next
idle period, losing little of its past computation. Consequently, in Chapter VI, we also
consider the case where the task itself can checkpoint its state onto the local disk.
Also, we do not assume the server can cancel a task once it has been scheduled
on a worker. The reason for this is that resource access is limited, as firewalls are usually
configured to block all incoming connections (precluding incoming RPCs) and to allow
only outgoing connections (often on a restricted set of ports, such as port 80). As such, the
heuristics cannot preempt a task once it has been assigned, and workers must take the
initiative to request tasks from the server.
Our platform model deviates significantly from traditional grid scheduling mod-
els [14, 25, 26, 41]. The scheduling model used in most grid scheduling research is often
a collection of tasks ready to be assigned to a pool of resources. A grid scheduler can
devise a plan that determines when applications, including their data, will be placed at
specific resources to minimize application makespan. However, in desktop grid environ-
ments, devising a (static) plan for task assignment may be futile because tasks cannot
be pushed to workers. Moreover, the pool of resources from which to select from can
vary dynamically over time depending on which workers are available. So even if plan is
devised, the set of resources chosen may not be available at the time of assignment.
We use this platform model for each of the desktop grids mentioned in Chap-
ter III, namely the SDSC, DEUG, LRI and UCB platforms. We instantiated each plat-
form model with about 200 hosts per desktop grid with the availability defined by the
corresponding traces. SDSC was the only platform with about 200 hosts; the remaining
platforms had significantly fewer. To compensate, we aggregated host traces from
different days until there were approximately 200 host traces per platform. After
aggregation, there were at least seven full days of traces to be used to drive simulations.
As shown in a number of studies [8, 62] including our own, hosts during weekday
business hours often exhibit higher and more variable load than during off-peak hours
on weekday nights and weekends. As such, all simulations were performed using traces
captured during business hours which varied depending on the platform. (9AM-6PM
for SDSC, 6AM-6PM for DEUG, all day for LRI, and 10AM-5PM for UCB.) We be-
lieve our heuristics would perform relatively the same during off-peak hours when host
performance is more predictable, although the performance difference might be lessened.
Given the diversity of desktop configurations, we compare the performance of
our heuristics on two other configurations representative of Internet and multi-cluster
desktop grids. Because we do not have access to these types of desktop grids, we were
unable to gather traces for these types of platforms. Nevertheless, many desktop grid
projects [44, 54] publicly report the clock rates of participating hosts. So instead of
using real traces, we transform the clock rates of hosts in the SDSC grid to reflect
the distribution of clock rates found in a particular platform, and transform the CPU
availability trace corresponding to each host accordingly.
For example, Internet desktop grids that utilize machines both in the enterprise
and home settings usually have many more slow hosts than fast hosts, and so the host
speed distribution is heavily weighted toward low clock rates. We used host CPU statistics collected from the GIMPS
Internet-wide project [44] to determine the distribution of clock rates, which ranged from
25MHz to 3.4GHz. Other projects such as Folding@home and FightAids@home show
similar distributions [91].
Much work [42, 37, 70] in desktop grids has focused on using resources found
in multiple labs. Recently, [54] reports the use of XtremWeb [37] at a student lab in LRI
with nine 1.8GHz machines, and a Condor cluster in WISC with fifty 600MHz machines
and seventy-three 900MHz machines. We use the configuration specified in that paper
to model the multi-cluster scenario. We plot the cumulative clock rate distribution
functions for our additional two platform scenarios in Figure IV.3(b). For each of these
distributions, we ran simulations using the SDSC desktop grid traces but transforming
host clock speeds accordingly.
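The transformation can be sketched as a rank-preserving remapping of clock rates (a simplification of the procedure described above; the exact mapping used in the experiments may differ, and the rates below are illustrative):

```python
def transform_clock_rates(sdsc_rates, target_rates):
    """Give the i-th slowest SDSC host the i-th smallest rate drawn from
    the target distribution, so each host keeps its availability trace
    but delivers operations at the new rate."""
    assert len(sdsc_rates) == len(target_rates)
    ranked_targets = sorted(target_rates)
    # Host indices sorted from slowest to fastest original rate.
    order = sorted(range(len(sdsc_rates)), key=lambda i: sdsc_rates[i])
    new_rates = [0.0] * len(sdsc_rates)
    for rank, host in enumerate(order):
        new_rates[host] = ranked_targets[rank]
    return new_rates

# The relative ordering of hosts is preserved; only the distribution changes.
assert transform_clock_rates([1500, 300, 800], [25, 3400, 1000]) == [3400, 25, 1000]
```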
Our justification for these clock rate transformations is that the availability
Figure IV.3: Cumulative clock rate distributions from real platforms ((a): SDSC, DEUG, LRI, UCB) and simulated platforms ((b): LRI-WISC, GIMPS).
Platform    Range of Clock Rates    Volatility
SDSC        high                    med
DEUG        low                     med
LRI         low                     low
UCB         low                     high
GIMPS       very high               med
LRI-WISC    bimodal                 med
Table IV.1: Qualitative platform descriptions.
interval size is independent of host clock rate as found in Chapter III. So for a desktop
grid with a similar user base as SDSC (i.e., researchers, administrative assistants that
work from 9AM-5PM), we expect that hosts will have similar availability interval lengths,
regardless of CPU speed.
In summary, Table IV.1 shows the type of platforms on which we evaluate each
of our heuristics. As clock rates and host volatility are the primary sources of poor
application performance, we explore real and hypothetical platforms with a wide range
of clock rates and volatility levels that are representative of real desktop grid systems. In
particular, we examine the cases where a platform has medium volatility and a wide
range of clock rates, as well as those with a low range of clock rates but a wide range of volatility levels.
IV.B.2 Application model and instantiation
The majority of desktop grid applications are high-throughput and embarrass-
ingly parallel. That is, the applications have many more tasks relative to the number
of hosts, and have no dependencies nor communication among tasks. In an effort to
broaden the set of applications supportable on desktop grids, we also study the schedul-
ing of applications that have more stringent time demands, i.e., require turnaround on
the order of minutes or hours versus days or months, and have a number of tasks on
the order of the number of hosts. The experience of the Sprite developers [36], our own
characterization of availability in Chapter III, and numerous interactions with industrial
companies by one of the committee members [29] suggest that desktop grids within the
enterprise are often underutilized, and so the scenario in which the number of resources
is of an order of magnitude comparable to the number of tasks is not uncommon. As
such, we investigate techniques for scheduling applications that consist of T independent
and identically-sized tasks on N volatile hosts, where T is on the order of N. Applications
contain 100, 200, or 400 tasks, which correspond to roughly half, equal to, and
double the number of hosts, respectively, and also to the long, medium, and short plateaus
in the cumulative number of tasks completed during application execution, as seen in
Figure IV.1.
In addition, we vary the lengths of the tasks, which affects the failure rate of
the application’s tasks. We experiment with tasks that would exhibit 5, 15, and 35
minutes of execution time on a dedicated 1.5GHz host. Each of these task sizes has a
corresponding failure rate when scheduled on the set of resources during business hours.
As described in Chapter III, we determined the failure rate of a task given its size using
random incidence over the entire trace period. That is, in the collected traces, we chose
many thousands of random points to start the execution of a task and noted whether the
task would run to completion or would meet a host failure. Task failure rate increases
linearly with task size, from a minimum of 6.33% for a 5-minute task, to 13.2% for a
15-minute task, to a maximum of 22% for a 35-minute task. A maximum task size
of 35 minutes is chosen so that a significant number of applications can complete when
scheduled during the business hours of a single weekday.
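The random-incidence computation can be sketched as follows (a simplification that represents a host's trace as hypothetical (start, end) availability intervals in minutes and ignores fluctuating CPU availability within an interval):

```python
import random

def estimate_failure_rate(intervals, task_minutes, samples=10000, seed=0):
    """Estimate a task's failure rate by random incidence: pick many random
    start points within the availability intervals and count how often the
    interval ends before the task finishes."""
    rng = random.Random(seed)
    lengths = [end - start for start, end in intervals]
    failures = 0
    for _ in range(samples):
        # Choose an interval weighted by its length, then a point inside it,
        # so start points are uniform over all available time.
        start, end = rng.choices(intervals, weights=lengths, k=1)[0]
        t = rng.uniform(start, end)
        if end - t < task_minutes:  # host becomes unavailable mid-task
            failures += 1
    return failures / samples

# Longer tasks fail more often, consistent with the rates reported above.
trace = [(0, 60), (90, 120), (150, 400)]
assert estimate_failure_rate(trace, 5) < estimate_failure_rate(trace, 35)
```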
We assume the tasks are compute bound. Since we focus on applications with
small data input/output sizes on the order of kilobytes or megabytes, and on fast networks
in enterprise environments, we do not take network effects into account in the completion
time of the application.
IV.C Proposed Approaches
We consider four general approaches for resource management:
Resource Prioritization – One way to do resource selection is to sort hosts in the
ready queue according to some criteria (e.g., by clock rate, by the number of cycles
delivered in the past) and to assign tasks to the “good” hosts first. Such prioritization
has no effect when the number of tasks left to execute is greater than the number of hosts
in the ready queue. However, when there are fewer tasks to execute than ready hosts,
typically at the end of application execution, prioritization is a simple way of avoiding
the “bad” hosts.
Resource Exclusion Using a Fixed Threshold – A simple way to select resources
is to exclude some hosts and never use them to run application tasks. Filtering can
be based on a simple criterion, such as excluding hosts with clock rates below some threshold.
Often, the distribution of resource clock rates is so skewed [44, 91] that the slowest
hosts significantly impede application completion, and so excluding them can potentially
remove this bottleneck. However, using a fixed threshold can unintentionally exclude
relatively slow hosts that could have contributed to application completion, and we
address this deficiency with the next approach.
Resource Exclusion via Makespan Prediction – A more sophisticated resource
exclusion strategy consists of removing hosts that would not complete a task, if assigned
to them, before some expected application completion time. In other words, it may
be possible to obtain an estimate of when the application could reasonably complete,
and not use any host that would push the application execution beyond this estimate.
The advantage of this method compared to blindly excluding resources with a fixed
threshold is that it should not be as sensitive to the distribution of clock rates. That
is, relatively slow hosts that could contribute to application completion would not be
excluded unnecessarily, in contrast to the previous method that uses a fixed threshold.
Task Replication – Even with the best resource selection method, task failures near
the end of application execution are almost inevitable. To deal with such failures, we
examine the effect of replicating multiple instances of a particular task and assigning
them to different hosts; replicating a task may increase the chance that at least one task
instance will complete.
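For concreteness, the first two approaches can be combined in a short sketch (the host-record format and the clock-rate criterion are illustrative assumptions):

```python
def select_host(ready_hosts, min_clock_rate=None):
    """Pick a host from the ready queue: exclude hosts below a fixed
    clock-rate threshold (resource exclusion), then prefer the fastest
    remaining host (resource prioritization)."""
    candidates = [h for h in ready_hosts
                  if min_clock_rate is None or h[1] >= min_clock_rate]
    if not candidates:
        return None  # hold the task until an acceptable host is ready
    return max(candidates, key=lambda h: h[1])

ready = [("a", 300), ("b", 1500), ("c", 800)]  # (name, clock rate in MHz)
assert select_host(ready) == ("b", 1500)
assert select_host(ready, min_clock_rate=2000) is None
```

Note that, as discussed above, prioritization only matters when the ready queue holds more hosts than there are remaining tasks.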
We study these approaches in the order that they are listed above. The ap-
proaches themselves are listed in increasing order with respect to complexity and costs,
which will become evident in the following chapters. For each approach, we examine a
wide range of scheduling heuristics that vary in complexity and the quality of information
used about the resources (such as static or dynamic information). For each heuristic,
we identify the costs and benefits when increasing the heuristic’s complexity or quality
of information about the resources. Using the best heuristic found with a particular
approach, we then study how to augment this heuristic, addressing its weaknesses, using the
subsequent approach. This inductive method of heuristic design allows us to examine a
manageable number of heuristics.
In each stage of our heuristic design, we evaluate and compare our heuristics
in simulation, which we detail in the next section.
IV.D Measuring and Analyzing Performance
For each experiment (i.e., for a particular number of tasks, and a task size),
we simulated our competing resource management heuristics for applications starting at
different times during business hours. We ran each experiment for about 150 such starting
times and averaged the results to obtain statistically significant measurements. In this section, we
discuss how we evaluate the performance of each application scheduled with a particular
heuristic, and how we automatically analyzed the performance of each heuristic.
IV.D.1 Performance metrics
While application makespan is a good metric to compare results achieved with
different scheduling heuristics, we wish to compare it to the execution time that could
be achieved by an oracle that has full knowledge of future host availabilities. Our oracle
works as follows. First, it determines the soonest time that each host would complete a
task, by looking at the future availability traces and scheduling the task as soon as the
host is available. Then, it selects the host that completes the task the soonest, and it
repeats this process until all tasks have been completed. This greedy algorithm results
in an optimal schedule, and we compare the performance of our heuristics using the ratio
of the makespan for a particular heuristic to the optimal makespan that is achieved by
our oracle. The optimality of the greedy algorithm is easy to see intuitively. Nevertheless,
we prove this formally in the last section of this chapter, as much of our analysis
is based on the results of this algorithm. Note that our work in upcoming chapters
focuses on minimizing the overall elapsed execution time, or makespan, of a single parallel
application, rather than trying to optimize the performance of multiple, competing
applications. Nevertheless, the heuristics we develop to schedule a single application
provide key elements for designing effective multi-application scheduling strategies (e.g.,
for doing appropriate space-sharing among applications, for selecting which resources are
used for which application, for deciding the task duplication level for each application).
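The oracle's greedy procedure can be sketched as follows (a simplification assuming full CPU availability within each interval and no per-task overhead; the per-host interval format is an assumption of this sketch):

```python
def oracle_makespan(hosts, num_tasks, task_ops):
    """Optimal makespan via the greedy oracle: pack tasks back-to-back into
    each host's availability intervals (a task cut short by the end of an
    interval is lost and restarted in the next one), then take the time of
    the num_tasks-th earliest completion across all hosts.

    `hosts` is a list of per-host interval lists (start, end, ops_per_minute).
    """
    completions = []
    for intervals in hosts:
        host_done = 0
        for start, end, speed in intervals:
            task_minutes = task_ops / speed
            t = start
            # Fit as many whole tasks as possible into this interval; no
            # single host ever needs to contribute more than num_tasks.
            while t + task_minutes <= end and host_done < num_tasks:
                t += task_minutes
                completions.append(t)
                host_done += 1
    completions.sort()
    if len(completions) < num_tasks:
        return None  # trace too short to complete the application
    return completions[num_tasks - 1]

# Host 0 is available for 30 minutes; host 1 is interrupted at minute 10.
hosts = [[(0, 30, 1.0)], [(0, 10, 1.0), (12, 40, 1.0)]]
assert oracle_makespan(hosts, 3, 10) == 20
```

The makespan ratio used to evaluate a heuristic would then be its measured makespan divided by this oracle value.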
IV.D.2 Method of Performance Analysis
To determine the cause of a poorly performing heuristic, we visually inspect the
execution trace of a subset of all applications scheduled by the heuristic. That is, for
each host, we graph the time at which it starts each task. If the task completes, we graph
its completion time, and if it fails, we plot the time of failure. However, analyzing the
performance using visual inspection of each application’s execution trace is not possible
due to the high number (thousands) of applications executed in simulation. So to
supplement our visual analysis of application execution traces, we develop a simple
automated approach to determine the causes of poor performance.
As discussed in Section IV.A, delays in task completion near the end of applica-
tion execution can result in a plateau of task completion rate and thus, poor performance;
we refer to tasks completed extraordinarily late during application execution as laggers.
After visual inspection of numerous application execution traces, we hypothesize
that the causes of these laggers are hosts with relatively low clock rates, and task failures
that occur near the end of application execution. We confirm this hypothesis in the
following manner. First, we use an automated method to find laggers in application
executions that have been coordinated by a FCFS scheduler. Then we automatically
classify the cause of each lagger to be either slow host clock rate or task failure. We
find that a high percentage of the laggers (>70%) are caused by either low host clock
rates or task failures, thus giving strong evidence for our hypothesis. After confirming
the hypothesis, we use this automated method to determine the impact of slow host
clock rates and task failures on application execution when other scheduling heuristics
are used. We describe the automated method in detail below.
To determine the cause of poorly performing applications automatically, we mine
the simulation logs, determining the number of laggers and classifying each lagger by a
particular cause, i.e., low clock rate or task failure. First, we classify completed tasks
as laggers by determining the interquartile range (IQR) of task completion. The IQR is
defined as the range between the lower quartile (25th percentile) and upper quartile (75th
percentile) of task completion times, excluding task executions that fail to complete. We
use the method defined by Mendenhall and Sincich in [59] for finding sample quantiles,
where each quantile is an actual data point. (This method is less biased than other
methods that take the average of the data points.) Then, we multiply the IQR by a
certain factor (which we term IQR factor) and add the result to the upper quartile
to give a lagger threshold. If a task is completed after the threshold, it is classified
as a lagger. In particular, assuming an IQR factor F, if there exists a lagger after
the threshold, then the task completion rate over the last quartile of tasks is at most
1/(2F) of the rate over the interquartile range. So laggers signify a dramatic
decrease in task completion rate.
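The lagger-detection step can be sketched as follows (assuming one common reading of the Mendenhall-Sincich rule: the lower quartile at position (n+1)/4 rounded half up and the upper quartile at 3(n+1)/4 rounded half down, each taken as an actual data point):

```python
import math

def lagger_threshold(completion_times, iqr_factor=1.0):
    """Threshold = Q3 + iqr_factor * IQR, with quartiles taken as actual
    data points of the sorted completion times."""
    xs = sorted(completion_times)
    n = len(xs)
    q1_pos = math.floor((n + 1) / 4 + 0.5)       # round half up
    q3_pos = math.ceil(3 * (n + 1) / 4 - 0.5)    # round half down
    q1, q3 = xs[q1_pos - 1], xs[q3_pos - 1]      # positions are 1-indexed
    return q3 + iqr_factor * (q3 - q1)

def find_laggers(completion_times, iqr_factor=1.0):
    thr = lagger_threshold(completion_times, iqr_factor)
    return [t for t in completion_times if t > thr]

# Eleven tasks finish steadily and one finishes extraordinarily late.
assert find_laggers(list(range(1, 12)) + [100]) == [100]
```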
Figure IV.4 shows the cumulative throughput for an application with 400 tasks
as execution progresses. In the figure, the first and third quartiles of task completions
are labelled, showing the IQR. Using an IQR factor of 1, the figure also shows where the
lagger threshold is with respect to the third quartile. The tasks that finish execution
after the threshold are considered laggers.
Our rationale for using the IQR to define the lagger threshold is that the task
completion rate during the IQR is a close approximation to the optimal. This is because
application execution enters steady state during the IQR as each available host is assigned
Figure IV.4: Laggers for an application with 400 tasks.
a task. If T ≥ N, the task completion rate during the IQR is guaranteed to be
optimal; this follows trivially from our proof of the optimal scheduling algorithm discussed
in Section IV.E.
An alternative is to find the standard deviation of application makespans,
and then to define a lagger threshold according to some factor of the standard deviation.
However, by the very nature of the laggers, they tend to be relatively extreme outliers
in terms of task completion times, and the standard deviation could be too sensitive to
these outliers; an extreme lagger could cause the standard deviation to be quite high
and using a lagger threshold based on the standard deviation could then result in many
false negatives. Our approach of using quantiles is less affected by these extreme laggers.
Another alternative method is to classify the last X% of completed tasks in an application
as laggers. However, in the case of an optimal application execution, this method would
classify the last X% of tasks as laggers, and yield relatively high false positives.
One question related to our lagger analysis is how to choose a suitable IQR
factor F . Clearly, an extremely low IQR factor would yield several false positives, and
an extremely high IQR factor would miss most of the true laggers, limiting our analysis
to an insignificant number of laggers. To determine the set of possible IQR factors to
use, we conducted a simple sensitivity analysis of the number of laggers determined
by the IQR factor. The lowest possible IQR factor is about .5, since the steady state
and maximum task completion rate usually occurs during the IQR. We found that the
maximum IQR factor was about 1.5; for IQR factors greater than 1.5, there were
near-zero laggers for all the applications. Within the range of IQR factors of .5 and 1.5, we
found that the number of laggers decreases only gradually as the factor is increased. We
chose an intermediate IQR factor of 1 and found that using values of .5 and 1.5 does not
significantly change the distribution of laggers (see Appendix A).
After we find the set of laggers for each application, we determine the cause
of each lagger as follows. To determine if the cause is a task failure near the end of
application completion, we look at all tasks completed after the 75th percentile; if the
task fails after that point, we conclude that failure is at least one cause of the lagger.
To determine if the cause is a slow clock rate, we use the clock rate of the slowest host
used in the corresponding optimal application execution. That is, for an application
that begins execution at a particular time, we run the optimal execution (using an
omniscient scheduler) and find the slowest host used in that execution. The clock rate
of the slowest host is then used to classify whether a lagger was assigned to a slow host
or not. The advantage of using this method of classification is that it actually confirms
that a faster host could have been used by the scheduling heuristic by looking at the
application’s optimal execution. Another advantage is that this method determines a
clock rate threshold for each instance of application execution. So, if only relatively
fast hosts are used during the optimal application execution, the resulting clock rate
threshold will also tend to be higher.
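This classification step can be sketched as follows (the lagger record format — task id, clock rate of the host that ran it, and the times of any failed attempts — is an assumption of this sketch):

```python
def classify_laggers(laggers, q3_time, slowest_optimal_rate):
    """Attribute cause(s) to each lagger: a task failure after the 75th
    percentile of completion times, and/or assignment to a host slower
    than the slowest host used in the optimal execution."""
    causes = {}
    for task_id, host_rate, failure_times in laggers:
        c = set()
        if any(t > q3_time for t in failure_times):
            c.add("task failure")
        if host_rate < slowest_optimal_rate:
            c.add("slow host")
        causes[task_id] = c
    return causes

res = classify_laggers(
    [(1, 500, [80]), (2, 2000, []), (3, 400, [10])],
    q3_time=60, slowest_optimal_rate=1000)
assert res[1] == {"task failure", "slow host"}  # both causes apply
assert res[2] == set()                          # neither known cause
assert res[3] == {"slow host"}
```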
In our comparison of heuristics, we do not consider the number of laggers as
we found weak correlation (correlation coefficient is usually less than .22) between the
number of laggers and makespan for the FCFS scheduling method when a relatively low
IQR factor of .5 is used. (By using a low IQR factor of .5 we ensure that all laggers are
counted.) The weak correlation is caused by the application waiting for the completion of
a task scheduled on an extremely slow host when the rest of the tasks have already been
completed (as shown for the application with 100 tasks in Figure IV.1). In this case, the
number of laggers is relatively small, but the effect on application makespan of the sole
lagger is tremendous.
Another reason not to consider the number of laggers is the weak correlation
between the mean application makespan and mean number of laggers across the set of
scheduling heuristics; that is, a heuristic that results in a lower mean makespan could in
fact have more laggers than a different heuristic that results in a higher mean makespan.
The reason for this is that the IQR is a metric relative to the total makespan of the
application; as the mean makespan for a particular heuristic decreases, the IQR itself
decreases, which in turn lowers the threshold defined by the 75th quantile, the IQR, and
the IQR factor F . A lower threshold could raise the chance that a host will complete
a task after that threshold, since the clock rate distribution and CPU availability of
the hosts in the platform remains fixed. Thus, to supplement our lagger analysis, we
also show the absolute lengths of the time intervals delimited by the first,
second, third, and fourth quartiles of task completion times. While we could have used an
absolute IQR for all heuristics (for example, the IQR resulting from the FCFS scheduling
method), the IQR for FCFS can be much higher (as much as three times higher, although
in general the IQR’s tend to be similar) than the IQR’s for the other heuristics, and this
could result in several false negatives. So, instead we use define a relative IQR for each
heuristic.
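As an illustration, the relative per-heuristic threshold can be computed as follows (a minimal sketch in Python; the function name and the nearest-rank quantile estimate are our assumptions, not taken from the dissertation's simulator):

```python
def count_laggers(completion_times, F=0.5):
    """Count tasks completing after Q3 + F * IQR, where Q1 and Q3 are the
    25th and 75th quantiles of the task completion times."""
    xs = sorted(completion_times)
    n = len(xs)
    # Nearest-rank quantile estimate (an assumption; any standard
    # quantile method could be substituted here).
    q1 = xs[int(0.25 * (n - 1))]
    q3 = xs[int(0.75 * (n - 1))]
    threshold = q3 + F * (q3 - q1)
    return sum(1 for t in xs if t > threshold)
```

For example, with completion times [1, 2, 3, 4, 100] and F = .5, the threshold is 4 + .5·(4 − 2) = 5, so only the task finishing at time 100 counts as a lagger.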
IV.E Computing the Optimal Makespan
We prove that the greedy algorithm proposed in Section IV.D.1 results in the
optimal makespan for jobs with identical and independent tasks scheduled on volatile
hosts. The optimal algorithm consists of starting a task on each host as soon as possible
(during an availability interval), and assigning a new task to a host as soon as the previous
task on that host completes. If a task fails because the end of the availability interval is
reached before all necessary operations have been delivered to the task, then the task is
restarted from scratch at the beginning of the next availability interval. In this greedy
fashion, the availability intervals of all hosts can be “filled” with an infinite number
of tasks, with tasks packed together as tightly as possible. These tasks are sorted by
increasing completion time, and we pick the first T tasks. The assignment of these
T tasks to hosts corresponds to the optimal schedule. While this schedule is intuitively
optimal, one may wonder whether inserting delays before or in between tasks could
be beneficial in order to match the overhead periods of length h with periods during
which host CPUs exhibit low, or in fact 0%, availability. In the following sections we
give a formal description of the algorithm and formal proofs of its optimality, first for a
single availability interval on a single host, then for multiple availability intervals on a
single host, and finally for multiple availability intervals on multiple hosts, which is the
general case. While, a posteriori, the proof is straightforward, the algorithm’s optimality
is still worth proving formally since it is used heavily in the following chapters.
Our approach is as follows. After defining the problem formally in Section IV.E.1,
we first show optimality within a single availability interval on a single host in Sec-
tion IV.E.2. Then, in Section IV.E.3, we show optimality for multiple availability inter-
vals separated by failures on a single host. In Section IV.E.4, we show the optimality
for multiple availability intervals across multiple hosts. Finally, in Section IV.E.5, we
consider a variation of the problem that allows for task checkpointing.
IV.E.1 Problem Statement
Consider a job that consists of T tasks, and consider N hosts with variable
CPU availability described by traces as in the previous chapter, and with possibly dif-
ferent maximum amounts of operations delivered per time unit. We denote by f(t) the
instantaneous number of operations delivered at instant t. Although Figure II.2 depicts
an availability trace as continuous, it is in fact a discrete step function. We denote by
∫_{a}^{b} f(t) dt the number of operations that would be delivered by the host to the desk-
top grid application between time a and time b, provided that the interval [a, b] is fully
contained within an availability interval.
All the tasks are of identical computational cost and independent of one another.
We denote by S the task size in number of operations, and h is the overhead in seconds
for scheduling each task, which is incurred before computation can begin. (We have
observed this overhead in practice, as explained in Chapter III.) Finally, we denote by
fm(t) the instantaneous number of operations delivered at time t on some host m, which
is fully known given a trace (see the previous section). Recall that although in practice
function fm would not be known, in this chapter we focus on developing an optimal
schedule that could be achieved by an omniscient algorithm that has foreknowledge of
future host availability for all hosts. The scheduling problem is to assign the tasks to
the hosts so that the job’s makespan is minimized.
IV.E.2 Single Availability Interval On A Single Host
We first focus on optimally scheduling tasks within a single availability interval.
We formalize the algorithm and then prove its optimality.
IV.E.2.a Scheduling Algorithm
Let us first define a helper function, INTG, that takes five arguments as input: a
number of operations, numop; an overhead in seconds, overhead; a task start time, a; an
upper bound on the task finish time, b; and a CPU availability function, f, that corresponds
to a single availability interval. INTG returns the time at which a task of size numop,
started at time a on a host whose CPU availability is described by function f, incurring
overhead overhead before it can actually start computing, would complete if it would
complete before time b, or −1 otherwise. It is assumed that time a lies inside the single
availability interval defined by function f. In other words, function INTG returns, if it
exists, a time t < b such that ∫_{a+overhead}^{t} f(x) dx = numop, or −1 if such a t does not exist. We
show an implementation of INTG in pseudo-code in Figure IV.5. Note that the pseudo-
code shows a discrete implementation; the loop increments the value of the local variable t by
one, assuming the step size of the trace’s step function is 1 second. Intuitively, function
INTG will be used by our scheduling algorithm to see whether a task can actually “fit”
inside an availability interval when started somewhere in that interval.
The greedy algorithm OPTINTV given in Figure IV.6 computes the schedule
described informally in Section IV.D.1. It takes the following parameters:
T : the number of tasks to be scheduled,
h: the overhead for starting a task,
S: the task size in number of operations,
f : the CPU availability function
[a, b]: the absolute start and end times of the host availability interval we
consider,
Algorithm : INTG(numop, overhead, a, b, f)
sum← 0
for t← a + overhead to b
    sum← sum + f(t)
    if sum = numop
        return (t)
return (−1)
Figure IV.5: INTG: helper function for the scheduling algorithm.
A: an array that stores the time at which each task begins, to be filled in,
B: an array that stores the time at which each task completes, to be filled in,
and returns the number of tasks that could not be scheduled in the availability interval,
out of the T tasks. From the pseudo-code it is easy to see that OPTINTV schedules tasks
from the very beginning of the availability interval, and a task is scheduled immediately
after the previous task completes. The duration of each task execution is computed via
a call to the INTG helper function.
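Both routines can be rendered as runnable Python over a discrete 1-second trace (a sketch with conventions of our own: completion times are returned as a list rather than filled into arrays A and B, and the completion test is `total >= numop` rather than strict equality so that discrete per-second operation counts cannot step past the target):

```python
def intg(numop, overhead, a, b, f):
    """Earliest time t <= b at which a task of numop operations, started
    at time a with `overhead` seconds of setup, completes on a host whose
    per-second delivered operations are given by f(t); -1 if it cannot."""
    total = 0
    for t in range(a + overhead, b + 1):
        total += f(t)
        if total >= numop:
            return t
    return -1

def optintv(T, h, S, f, a, b):
    """Greedily pack up to T tasks of size S into the availability
    interval [a, b]; each task starts as soon as the previous completes.
    Returns (completion_times, number_of_unscheduled_tasks)."""
    done, t = [], a
    for i in range(T):
        c = intg(S, h, t, b, f)
        if c < 0:                 # the next task does not fit
            return done, T - i
        done.append(c)
        t = c                     # schedule the next task immediately
    return done, 0
```

For a fully idle host delivering one operation per second (`f = lambda t: 1`), three 5-operation tasks with a 1-second overhead packed into [0, 20] complete at times 5, 10, and 15.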
IV.E.2.b Proof of Optimality
Let [t1, t2, t3, ..., tT] denote the times at which each task begins execution in the
schedule computed by the OPTINTV algorithm. Note that t1 is just the beginning of the
availability interval. Let [e1, e2, e3, ..., eT] be the task execution times without counting
the overhead, such that ti = ti−1 + h + ei−1, for 2 ≤ i ≤ T.
Consider another schedule obtained by an algorithm, which we call OPTDE-
LAY, that does not start each task as early as possible. In other words, the algorithm
adds a time delay, wi ≥ 0, before starting task i, for 1 ≤ i ≤ T. Let [t′1, t′2, t′3, ..., t′T]
be the times at which tasks start execution and [e′1, e′2, e′3, ..., e′T] be the task execution
times without counting the overhead, in the OPTDELAY schedule. We prove that the
OPTDELAY schedule is never better than the OPTINTV schedule, which then guar-
antees that the OPTINTV schedule is optimal within an availability interval. Let the
Algorithm : OPTINTV(T, h, S, f, a, b, A, B)
t← a
A[1]← a
for i← 2 to T + 1
    t← INTG(S, h, t, b, f)
    if t ≥ 0
        // The time at which task i− 1 completes is the time at which task i
        // is scheduled
        B[i− 1]← t
        if i < T + 1
            A[i]← t
    else return (T − (i− 2))
end for
return (0)
Figure IV.6: Scheduling algorithm over a single availability interval.
proposition P (k) be that OPTINTV schedules k tasks optimally. We prove P (k) by
induction.
Base case – Let us assume that P (1) does not hold, meaning that t1 + h + e1 >
t′1 + w1 + h + e′1. This situation is depicted in Figure IV.7. For convenience, let c1 and
c′1 denote the completion times of the task under the OPTINTV and the OPTDELAY
schedules, respectively, so that our assumption is c1 > c′1.
Figure IV.7: An example of task execution for OPTINTV (top) and OPTDELAY
(bottom) at the beginning of the job. Both jobs arrive at the same time. In the case of
OPTINTV, the first task is scheduled immediately and an overhead of h is incurred.
In the case of OPTDELAY, the scheduler waits for a period of w1 before scheduling the
task.
For the OPTINTV schedule we can write that:

S = ∫_{t1+h}^{t1+h+e1} f(t) dt,

which just means that, during its execution, the task consumes exactly the number of
operations needed. We can write that:

∫_{t1+h}^{t1+h+e1} f(t) dt = ∫_{t1+h}^{t1+h+w1} f(t) dt + ∫_{t1+h+w1}^{c′1} f(t) dt + ∫_{c′1}^{c1} f(t) dt.
First, we note that the second integral in the right-hand side of the above equation is
equal to S, since it corresponds to the full computation of the task in the OPTDELAY
schedule. Second, we note that the third integral is strictly positive. Indeed, if it were
equal to zero, then the number of operations delivered by the host to the task during the
[c′1, c1] interval would be zero, meaning that no useful computation would be performed
on that interval in the OPTINTV schedule. Therefore, in the OPTINTV schedule, the
completion time c1 would in fact be less than or equal to c′1, which contradicts our
hypothesis. As a result, we have:
S = ∫_{t1+h}^{t1+h+e1} f(t) dt > S,
which is a contradiction. We conclude that P (1) holds.
Inductive case – Let us assume that P(j) holds, and let us prove P(j + 1). The
execution timelines for both the OPTINTV and OPTDELAY schedules are depicted in
Figure IV.8, with tj+1 less than or equal to t′j+1 due to P(j). As in the base case, cj+1 and
c′j+1 denote the completion times of task j + 1 under both schedules.
Figure IV.8: An example of task execution for OPTINTV (top) and OPTDELAY
(bottom) in the middle of the job.
Suppose cj+1 > c′j+1. For the OPTINTV schedule, we can write that:

S = ∫_{tj+1+h}^{cj+1} f(t) dt,

which just means that S operations are delivered to the application task during its
execution. We can split the above integral as follows:

∫_{tj+1+h}^{cj+1} f(t) dt = ∫_{tj+1+h}^{t′j+1+wj+1+h} f(t) dt + ∫_{t′j+1+wj+1+h}^{c′j+1} f(t) dt + ∫_{c′j+1}^{cj+1} f(t) dt.
The first integral is valid because wj+1 ≥ 0 and tj+1 ≤ t′j+1 (due to property P(j)).
Following the same argument as in the base case, the last integral in the right-hand side
of the above equation is strictly positive (otherwise cj+1 ≤ c′j+1). The second integral
is equal to S, since this is the number of operations delivered to task j + 1 during its
execution in the OPTDELAY schedule. We then obtain that

S = ∫_{tj+1+h}^{cj+1} f(t) dt > S,

which is a contradiction. Therefore cj+1 ≤ c′j+1, and property P(j + 1) holds, which
completes our proof by induction.
IV.E.3 Multiple Availability Intervals On A Single Host
In this section, we consider scheduling tasks during multiple intervals of avail-
ability, whose start and stop times are denoted by [ai, bi]. Without loss of generality, we
can ignore all availability intervals during which a single task cannot complete, i.e., those for
which ∫_{ai}^{bi} f(t) dt < S. We also consider an infinite number of availability intervals for
the host, or at least a number large enough to accommodate all T tasks. The scheduling
algorithm OPTMINTV, shown in Figure IV.9, takes the following parameters:

T : the number of tasks to be scheduled,
h: the overhead for starting a task,
S: the task size in number of operations,
f : the CPU availability function,
C: an array that stores the start times of all the tasks, to be filled in,
D: an array that stores the completion times of all the tasks, to be filled in.
Let P(k) be the property that OPTMINTV schedules k tasks optimally. We prove that
P(k) is true by induction for k ≥ 1.
Base Case – P(1) is true because OPTMINTV(1,...) is equivalent to running OPTINTV(1,...)
for the first availability interval, which we know leads to an optimal schedule
from Section IV.E.2.
P (2) – When there are two tasks to schedule, either both tasks can be scheduled in the
first availability interval, or if both tasks cannot finish in the first availability interval,
one task must be scheduled in the first interval and the other scheduled in the second.
In the former case, OPTMINTV(2,...) is equivalent to OPTINTV(2,...) for the first
interval, and the result is optimal as proved in the previous section. In the latter case, a
Algorithm : OPTMINTV(T, h, S, f, C, D)
numleft← T
i← 1
while numleft ≠ 0
numleft← OPTINTV(numleft, h, S, f, ai, bi, A, B)
C ← CONCAT(C,A)
D ← CONCAT(D, B)
i← i + 1
end while
Figure IV.9: Scheduling algorithm over multiple availability intervals.
task is scheduled at the beginning of the first and at the beginning of the second interval.
This results in the optimal schedule since the earliest the second task can execute (and
finish) is at the beginning of the second interval.
Inductive Case – Let us assume that P(j) is true. Then OPTMINTV(j,...) gives
the optimal schedule for the first j tasks. If OPTMINTV must schedule another task,
OPTMINTV will either place it in the same interval as the jth task, or, if that
is not possible, it will place it at the beginning of the following interval. The former case
results in an optimal schedule since OPTINTV will schedule the (j+1)th task optimally in
that interval, and the resulting makespan for all j + 1 tasks will also be optimal. The
latter case results in an optimal schedule since we know that OPTINTV(1,...) is optimal
on the following availability interval. Therefore, P(j + 1) is true.
It follows that P (k) is true for all k ≥ 1, and thus OPTMINTV computes an
optimal schedule.
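The chaining of OPTINTV across intervals can be rendered as a short Python sketch (our own illustration, not the dissertation's code; availability is modeled as a list of (a, b) intervals over a discrete 1-second trace, and completion times are returned as a list):

```python
def intg(numop, overhead, a, b, f):
    # Earliest t <= b at which numop operations finish; -1 if impossible.
    total = 0
    for t in range(a + overhead, b + 1):
        total += f(t)
        if total >= numop:
            return t
    return -1

def optmintv(T, h, S, f, intervals):
    """Greedily fill each availability interval (a, b) in turn with tasks
    of S operations and per-task overhead h; return the completion times
    of the T tasks, moving to the next interval when a task cannot fit."""
    done = []
    for a, b in intervals:
        t = a
        while len(done) < T:
            c = intg(S, h, t, b, f)
            if c < 0:
                break             # interval exhausted, try the next one
            done.append(c)
            t = c
        if len(done) == T:
            break
    return done
```

For example, with a 1 op/s host, 5-operation tasks, a 1-second overhead, and intervals [0, 12] and [20, 40], the first interval accommodates two tasks (completing at 5 and 10) and the remaining two complete at 25 and 30.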
IV.E.4 Multiple Availability Intervals On Multiple Hosts
The algorithm for scheduling tasks across multiple availability intervals over
multiple hosts, OPTIMAL, shown in Figure IV.10, takes the following parameters:

T : the number of tasks to be scheduled;
h: the overhead for starting a task;
S: the task size in number of operations;
v: the arrival time of the job;
E: an N × T matrix that stores the start times of the T tasks scheduled on each
of the N hosts; each row of E is computed by a call to OPTMINTV;
F : an N × T matrix that stores the completion times of the T tasks scheduled
on each of the N hosts; each row of F is computed by a call to OPTMINTV;

and returns the total makespan. We assume that the functions describing CPU avail-
abilities for each host, fm for m = 1, ..., N, are known. OPTIMAL uses a local variable,
I, which is a 1 × N array that stores the index of the last completed task for each host.
Finally, we define the argmin operator in the classical way for a series, say {xi}i=1,...,n,
by x_argmin(x) = min_{i=1,...,n} xi.
Algorithm : OPTIMAL(N, T, h, S, v, E, F )
// schedule T tasks on each host and determine each task’s completion time
for i← 1 to N
    OPTMINTV(T, h, S, fi, C, D)
    E[i]← C
    F [i]← D
    I[i]← 1
// select the T tasks that complete the soonest
j ← T
while j > 0
    i← argmin_{i∈1,...,N} (F [i][I[i]])
    I[i]← I[i] + 1
    j ← j − 1
return (F [i][I[i]− 1]− v)

Figure IV.10: Scheduling algorithm over multiple availability intervals over multiple
hosts.
Since OPTMINTV(k,...) schedules each task optimally, OPTIMAL(N,k,...) selects the
k tasks that complete the soonest, resulting in the optimal schedule.
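The selection step of OPTIMAL, which picks across all hosts the T tasks that complete soonest, amounts to merging the per-host sorted completion-time lists and taking the first T entries. A minimal Python sketch (our own illustration; function and parameter names are assumptions):

```python
import heapq

def optimal_makespan(completions_per_host, T, v):
    """completions_per_host[m]: sorted completion times of the tasks
    greedily packed on host m (one list per host); v: job arrival time.
    Returns the completion time of the T-th earliest task, minus v."""
    merged = heapq.merge(*completions_per_host)   # lazy k-way merge
    last = None
    for _, c in zip(range(T), merged):
        last = c
    return last - v
```

For instance, with per-host completion lists [5, 10, 25] and [7, 14, 21], the four earliest completions are 5, 7, 10, and 14, so a 4-task job arriving at time 0 has makespan 14.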
IV.E.5 Optimal Makespan with Checkpointing Enabled
In this section, we consider the scenario where a desktop grid system is able to
support local task checkpointing and restart. We assume that when a task encounters a
failure, it is always restarted on the machine where it began execution, i.e., we do not
consider process migration. Since the greedy algorithm that accounts for checkpointing
and its proof of optimality are similar to those described in the previous sections, we
give only a high-level description of the new optimal scheduling algorithm and proof
sketch of its optimality. For our discussion of checkpointing, we define the following new
parameters:
p: the overhead in terms of time for checkpointing a task
r: the overhead in terms of time for restarting a task from its checkpoint.
s: the number of operations to be completed before a checkpoint is performed.
We make the following changes to OPTINTV to account for checkpointing.
Because checkpointing is enabled, we can view intervals of failures as periods of 0%
CPU availability. So, a host’s trace can be treated as a single continuous availability
interval, on which we can use OPTINTV to schedule tasks. After a task is scheduled and
begins execution, a checkpointing overhead of p is incurred after every s operations
are completed. If during execution a task encounters a failure, whatever progress was made
since the last checkpoint is lost, and an overhead of r is incurred to restart the task from
the last checkpoint.
We can reduce the problem of scheduling tasks with checkpointing enabled to
the problem of scheduling tasks without checkpointing as follows. Consider a single task
k scheduled within an availability interval, i.e., the task’s execution does not encounter
any failures. When k is scheduled, it first incurs an overhead of h. Then, after every s
operations are completed, an overhead of p is incurred due to checkpointing. So, one can
treat the task k as ⌈S/s⌉ subtasks, ⌊S/s⌋ of which are of size s and ⌈S/s⌉ − ⌊S/s⌋ of which
are of size S − s·⌊S/s⌋. The first subtask is scheduled with an overhead of h; thereafter,
each subtask is “scheduled” with an overhead of p. If no failures are encountered during
task execution, then OPTINTV achieves the optimal schedule by an argument similar to
the one used in Section IV.E.2, and the same is true if we use OPTINTV to schedule
multiple tasks (treated as batches of subtasks) in the same availability interval.
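The subtask decomposition can be checked with a few lines of Python (an illustrative sketch; the function name is ours):

```python
import math

def checkpoint_subtasks(S, s):
    """Split a task of S operations into checkpoint-delimited subtasks:
    floor(S/s) subtasks of size s, plus one remainder subtask of size
    S - s*floor(S/s) when S is not a multiple of s."""
    full, rem = divmod(S, s)
    sizes = [s] * full + ([rem] if rem else [])
    # Sanity checks matching the text: ceil(S/s) subtasks covering S ops.
    assert len(sizes) == math.ceil(S / s) and sum(sizes) == S
    return sizes
```

For example, a task of S = 10 operations with checkpoints every s = 3 operations becomes four subtasks of sizes 3, 3, 3, and 1.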
If a failure is encountered during task execution, then whatever progress was made
since the last checkpoint is lost, an overhead of r is incurred immediately after the
failure, and the task is restarted from the last checkpoint. By the same argument used to
prove OPTMINTV optimal in Section IV.E.3, OPTINTV would have scheduled subtasks in the
previous availability interval optimally, and so starting a subtask at the beginning of the
next availability interval (and then incurring an overhead of r before execution) gives an
optimal schedule. Finally, we can replace OPTMINTV with OPTINTV in OPTIMAL
since failures are viewed as 0% CPU availability, and the resulting algorithm achieves
the optimal schedule over all hosts.
In conclusion, we have shown that a greedy algorithm that has full knowledge
of future host and CPU availabilities achieves the optimal makespan when scheduling a
job with identical and independent tasks on a volatile desktop grid. To the best of our
knowledge, previous work has not dealt with the case where CPU availability fluctuates
between 0 and 100% while at the same time taking into account host heterogeneity and
failures. Note also that although our algorithm achieves the optimal makespan, it does
not necessarily achieve optimal execution time, since delaying a task might allow it to
encounter periods of higher CPU availability. An interesting extension of this work is to
consider the multiple-job scenario, where minimizing execution time (versus makespan)
could be beneficial to system performance.
Chapter V
Resource Selection
We investigate various heuristics for resource selection, which involves deciding
which resources to use and which resources to exclude. Regarding the former issue, we
focus on resource prioritization techniques that use the “good” hosts first. Regarding the
latter issue, we study resource exclusion techniques that filter the “bad” hosts, which
might impede the application’s completion, out of the execution entirely. We evaluate these
heuristics first on the SDSC grid, which contains volatile hosts that exhibit a wide range
of clock rates. We also report the results of heuristics run on the other platforms when
applicable and interesting.
V.A Resource Prioritization
V.A.1 Heuristics
We examine three methods for resource prioritization using different levels of
information about the hosts, from virtually no information to comprehensive historical
statistics derived from our traces for each host, and we evaluate each method using
trace-driven simulation. For the PRI-CR method, hosts in the server’s ready queue
are prioritized by their clock rates. Similar to PRI-CR, PRI-CR-WAIT sorts hosts
by clock rates, but the scheduler waits for a fixed period of 10 minutes before assigning
tasks to hosts. The rationale is that collecting a pool of ready hosts before making task
assignments can improve host selection. The scheduler stops waiting if the ratio of ready
hosts to tasks is above some threshold so that resource selection is executed immediately
after a large pool of resources exists in the queue. A threshold ratio of 10 to 1 was used
in all our experiments. We experimented with other values for the fixed waiting period
and the above ratio, but obtained similar results.
In contrast to PRI-CR and PRI-CR-WAIT, which use static information
about the hosts, the method PRI-HISTORY uses dynamic information, i.e., the history
of a host’s past performance, to predict its future performance. Specifically, for each
host, the scheduler calculates the expected operations per availability interval (that is,
how many operations can be executed between two host failures) using the previous
weekday’s trace. For a particular availability interval, a task may begin execution any-
where in that interval, and a task has a higher probability of completing within a longer
interval than a shorter one. So longer intervals should be weighted more than shorter
ones when calculating the expected operations per interval. We take this into account by
considering all possible subinterval starting points in 10-second increments within
each availability interval. For each availability interval, this results in subintervals that
begin every 10 seconds in the availability interval and end at the interval’s stopping
point (see Figure V.1).
Figure V.1: Subintervals denoted by the double arrows for each availability interval. The
length of each subinterval is shown, and the subinterval lengths differ by 10 seconds.
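This subinterval weighting can be sketched as follows (our own illustration, assuming a host that is fully idle and delivers `rate` operations per second during its availability intervals):

```python
def expected_ops_per_interval(intervals, rate, step=10):
    """Average the operations deliverable over every subinterval that
    starts at `step`-second increments within an availability interval
    and ends at the interval's stopping point; longer intervals thus
    contribute more (and larger) subintervals to the average."""
    ops = []
    for a, b in intervals:
        t = a
        while t < b:
            ops.append((b - t) * rate)   # ops from subinterval [t, b]
            t += step
    return sum(ops) / len(ops)
```

For a single 20-second interval at 1 op/s, the subintervals deliver 20 and 10 operations, giving an expected 15 operations per interval.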
The expected operations per interval is then used to determine in which of
two priority queues a host is placed. If the expected number of operations per interval
is greater than or equal to the number of operations of an application task, then on
average the task should execute until completion, and so the host is placed in the higher
of the two priority queues. Otherwise, the host is put in the low priority queue, which
corresponds to the hosts on which the task is not expected to run until completion.
Within each queue, the hosts are prioritized according to the expected operations per
interval divided by the expected operations per second; as a result, hosts in each queue are
prioritized according to their speed. The higher priority queue lists hosts on which the
task is expected to complete, and faster hosts (in terms of operations per interval) have
higher priority. The lower priority queue lists hosts on which the task is not expected to
complete, and faster hosts have higher priority. When scheduling, PRI-HISTORY will
check the higher priority queue first and select the host with the highest priority, i.e., the
fastest expected speed. If the higher priority queue is empty, PRI-HISTORY will check
the lower priority queue and select the host with the highest priority, i.e., the fastest expected
speed.
V.A.2 Results and Discussion
For the SDSC platform, Figure V.2 shows the mean makespan of the three
heuristics (PRI-CR, PRI-HISTORY, PRI-CR-WAIT), and the mean makespan of the
FCFS heuristic, all normalized to the mean optimal execution time for applications with
100, 200, and 400 tasks of lengths 5, 15, and 35 minutes on a dedicated 1.5GHz host. The
bold dotted line in the figure represents the normalized mean makespan of the optimal
algorithm. Recall that these averages are obtained from about one hundred fifty distinct
experiments.
To explain the performance of the heuristics, we use both visual analysis of
particular application execution traces and the automated method described in
Section IV.D.2 to give additional, more concrete evidence for our conclusions. Our
lagger analysis shown in Figure V.7 focuses on FCFS and on PRI-CR, which we find to
be the best resource prioritization heuristic. We also show in that figure other heuristics,
which we discuss later in Section V.B.
Figure V.7 shows the classification of laggers as either caused by slow hosts or
task failures for each heuristic and application size. The height of each bar corresponds
to the mean number of laggers for an application with a particular number of tasks
and task size. For a particular bar, the height of each sub-bar represents the number
of laggers caused by slow hosts or task failures. We find that the poor performance
of FCFS is predominantly caused by slow hosts, and that the other heuristics achieve
better performance by eliminating these slow hosts, as we discuss below. The reduction
in laggers caused by slow hosts corresponds to a reduction in laggers caused by task
failures, as we showed that the task failure rate is correlated with host clock rate.
The effect of eliminating laggers on application makespan is shown in Fig-
ure V.8. The height of each bar corresponds to the mean makespan of applications
with a particular task size and number. For a particular bar, the sub-bars represent the
lengths of each quartile of task completion times. We observe that the lengths of the first,
second, and third quartiles of each heuristic are approximately equal to the optimal’s;
this is because tasks are completed at a steady, near-optimal rate. (It appears
that some quartiles are missing for the optimal algorithm, when in fact the quartiles are
too small to be visible.) The difference in makespans is primarily due to the length of
the fourth quartile, which in turn is caused by the reduction of laggers resulting from task
execution on slow hosts. We discuss below how each prioritization heuristic eliminates
laggers.
The general trend shown in Figure V.2 is that the larger the number of tasks in
the application the closer the achieved makespans are to the optimal, which is expected
since for larger number of tasks resource selection is not as critical to performance and
a greedy method approaches the optimal.
Focusing on those applications with 200 and 400 tasks, we notice that the prior-
itization heuristics perform no better than FCFS. The reason FCFS performs so well is that
hosts that appear most often and earliest in the queue tend to have high task com-
pletion rates, as clock rates are negatively correlated with task failure rates. The reason
PRI-CR and PRI-HISTORY perform similarly to FCFS is that clock rate and expected
number of operations per interval are weakly correlated with task completion rate (as
shown in Chapter III), and so the prioritized hosts in the ready queue are in a similar
order as those under FCFS. This is reflected by the similar number and proportion of
laggers caused by slow hosts or task failures, as shown in Figure V.7. PRI-CR-WAIT
Figure V.2: Performance of resource prioritization heuristics on the SDSC grid. Mean
makespans of FCFS, PRI-CR, PRI-HISTORY, and PRI-CR-WAIT are shown relative
to optimal, grouped by task length (5, 15, and 35 minutes on a dedicated 1.5GHz host)
and by the number of tasks per application (100, 200, and 400).
performs poorly for small 5-minute tasks and improves thereafter, but never surpasses
PRI-CR. The initial waiting period of 10 minutes is costly for the 100-task / 5-minute ap-
plication, which takes about 6 minutes to complete in the optimal case. As the task size
increases (along with application execution time), the penalty incurred by waiting for
host requests is lessened, but since most hosts are already in the request queue when the
application is first submitted, PRI-CR-WAIT performs almost identically to PRI-CR
and is no better. Figure V.4 provides additional insight into why PRI-CR-WAIT is
largely ineffectual. This figure shows the number of available hosts and the number of
tasks that are yet to be scheduled over time for a typical execution. Initially,
there are about 150 hosts available and 400 tasks for execution, and this immediately
drops to 0 hosts and about 250 tasks as each available host gets assigned a task. One
can see that it is usually the case that either there are far more tasks to schedule than
ready hosts or far more ready hosts than tasks to schedule. In the former scenario,
PRI-CR-WAIT performs exactly as PRI-CR. In the latter case, waiting does not give
the algorithm more choice in selecting resources.
We noticed different trends when the number of tasks in the application was
roughly half the number of hosts. The reason FCFS performs so poorly is that initially
there are 100 tasks to schedule on about 200 hosts, and because FCFS chooses 100 hosts
randomly, some slow hosts are chosen, which causes a reduction in application
performance. In contrast, PRI-CR excludes the slowest 50% of the resources from the
computation, as also shown in Figure V.7.
Surprisingly, PRI-HISTORY performs poorly compared to PRI-CR, which uses
static instead of dynamic information. We found that the availability interval size, both
in terms of time and in terms of operations, was not stationary across weekdays, and so
the expected operations per interval is a poor predictor of performance for certain hosts.
We determined the per-host prediction error from one day to the next as follows. For
each host, we calculated the mean number of operations per interval on a given weekday
during business hours. We then took the absolute value of the difference between a host’s
mean on one particular day and the next. In Figure V.3(a), we show the complementary
cumulative distribution function of the prediction error of the expected time per interval for
each host. That is, the figure plots the fraction of prediction errors greater than some
length of time. We can see that 80% of the prediction errors are 50 minutes in length or
more. The mean prediction error is 109 minutes and the median
error is 122 minutes. Given that many applications are less than an hour in length, the
high prediction error could be problematic.
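The per-host error computation can be sketched as follows (an illustration with our own names; `day1` and `day2` map each host to its mean operations, or time, per interval on consecutive weekdays):

```python
def prediction_errors(day1, day2):
    """Absolute day-over-day prediction error per host."""
    return {h: abs(day2[h] - day1[h]) for h in day1}

def ccdf(errors, x):
    """Fraction of prediction errors strictly greater than x
    (the complementary cumulative distribution function at x)."""
    vals = list(errors.values())
    return sum(1 for e in vals if e > x) / len(vals)
```

For instance, if a host's mean shifts from 100 to 160 between days, its error is 60; evaluating the CCDF at several values of x then yields curves like those in Figure V.3.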
Moreover, in Figure V.3(b), we show the complementary CDF of the prediction
error of the expected operations per interval for each host. That is, the figure plots the fraction
of prediction errors greater than some quantity of operations delivered per interval.
We find that 80% of the prediction errors are equivalent to 40 minutes or more on a
dedicated 1.5GHz host. The mean prediction error is 99 minutes
and the median error is 85 minutes. Again, the high prediction error is significant given
that many applications are less than an hour in length (and since PRI-HISTORY will
tend to use hosts with high expected operations per interval). Similarly, the authors
of [94, 34] found that using a host’s mean performance over long durations does
not reflect the dynamism of CPU availability, and thus is a poor predictor.
We also compared the prediction error of the compute rate per host estimated
using the expected operations and time length per interval. Since hosts are usually
completely idle as shown in Section III.F, the rate itself was predicted correctly. So we
attribute the poor performance of PRI-HISTORY to the poor operations per interval
predictions, which cause hosts to be put in the wrong priority queues.
Figure V.3: Complementary CDF of Prediction Error When Using Expected Operations
or Time Per Interval. (a) E[time per interval], in minutes: mean 109 min, std dev 54 min,
median 122 min. (b) E[ops per interval], in minutes on a dedicated 1.5GHz host: mean
99 min, std dev 75 min, median 85 min.
In summary, if the number of tasks is greater than or equal to the number
of hosts, there is little benefit of prioritization over FCFS since the fastest and most
available hosts will naturally request tasks the soonest. Also, waiting to collect a pool
of available hosts does not improve resource selection and only delays task assignment.
This is because during application execution there are usually either far more tasks than
hosts or far fewer tasks than hosts; in either case, waiting is not beneficial.
If the number of tasks is less than the number of hosts, PRI-CR works as well
as or better than PRI-HISTORY since the expected number of operations per interval
tends to be unpredictable for certain hosts. PRI-CR works well only because the slowest
hosts are excluded from the computation, and so, prioritization resulting in exclusion
can improve performance. On average PRI-CR is 1.65 times better than FCFS for
applications with 100 tasks on the SDSC platform.
In conclusion, we see that although PRI-CR outperforms FCFS consistently, re-
source prioritization still leads to performance that is far from the optimal (by more than
a factor of 4 for an application with one hundred 5-minute tasks). Looking at the task schedules
in detail, we noticed that using the slowest hosts significantly limited performance, and
we address this issue through heuristics described in the next section.
Figure V.4: Number of tasks to be scheduled (left y-axis) and hosts available (right
y-axis) over time, in minutes.
V.B Resource Exclusion
To prevent slower hosts from delaying application completion, we developed
several heuristics that exclude hosts from the computation using a variety of criteria.
All these heuristics use only host clock rates to obtain lower bounds on task completion
time (as we have seen that the expected operations or time per interval is not a good
predictor of future performance). All of the resource exclusion heuristics prioritize re-
sources according to their clock rates since we found in the previous section that PRI-CR
performed the best out of all the prioritization heuristics.
V.B.1 Excluding Resources By Clock Rate
Our first group of heuristics excludes hosts whose clock rates are lower than
the mean clock rate over all hosts (1.2GHz for the SDSC platform) minus some factor
of the standard deviation of clock rates (730MHz for the SDSC platform) for the entire
duration of the computation. The heuristics EXCL-S1.5, EXCL-S1, EXCL-S.5, and
EXCL-S.25 exclude those hosts according to a threshold that is 1.5, 1, .5, or .25 standard
deviations below the mean clock rate, respectively.
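A minimal sketch of this family of heuristics, assuming `hosts` maps host names to clock rates in MHz (the function name `excl_s` and the data layout are illustrative, not from the dissertation's scheduler):

```python
from statistics import mean, pstdev

def excl_s(hosts, factor):
    """EXCL-S<factor>: exclude hosts whose clock rate falls more than
    `factor` standard deviations below the platform's mean clock rate,
    for the entire duration of the computation."""
    rates = list(hosts.values())
    threshold = mean(rates) - factor * pstdev(rates)
    return {name: rate for name, rate in hosts.items() if rate >= threshold}

platform = {"a": 500, "b": 1000, "c": 1500, "d": 2000}  # MHz
survivors = excl_s(platform, 1)   # EXCL-S1 drops the 500MHz host
```

A larger `factor` keeps more hosts; the trade-off explored below is between excluding too many useful hosts and retaining too many slow ones.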
Figure V.5 shows the performance of the heuristics on the SDSC platform, and
we see that in all cases at least one of the exclusion heuristics improves performance
relative to PRI-CR. In most cases, the minimum makespan occurs at a threshold of .5
or 1; EXCL-S.5 effectively eliminates almost all of the laggers caused by slow hosts (see
Figure V.7). The makespan increases for higher or lower thresholds as too many useful
hosts or too few useless hosts are excluded from the computation. Usually, EXCL-S.25
excludes so many hosts that it not only removes the useless hosts but also excludes some
of the useful ones; the exception is the application with 100 tasks, which is equal to
roughly half the number of hosts. For this particular desktop grid platform, excluding
those hosts with clock rates .25 standard deviations below the mean will leave slightly more than half of the
hosts and thus filtering in this case does not hurt performance. EXCL-S1.5 excludes too
few hosts, and the remaining useless hosts hurt the application makespan.
Figure V.5: Performance of heuristics using thresholds on SDSC grid. Average makespan
relative to optimal for FCFS, PRI-CR, EXCL-S1.5, EXCL-S1, EXCL-S.5, and EXCL-S.25,
for task lengths of 5, 15, and 35 minutes (on a dedicated 1.5GHz host) and applications
with 100, 200, and 400 tasks.
In conclusion, resource exclusion can be significantly beneficial. In the above
experiments, it performs on average 1.49 times better than FCFS on the SDSC desktop
grid. For the SDSC platform, EXCL-S.5 has the particular threshold that yields the best
performance; on average, EXCL-S.5 performs 8%, 30%, and 6% better than PRI-CR for
applications with 100, 200, and 400 tasks respectively because many of the hosts with
slow clock rates are eliminated from the computation. However in other platforms with
different clock rate distributions, the fixed threshold may not be adequate. Figure V.10
shows the performance of EXCL-S.5 compared to FCFS on the multi-cluster LRI-WISC
platform. For applications with 400 tasks, we see the negative effect of using a fixed
threshold with EXCL-S.5 as the resulting performance is worse than FCFS. This is
because EXCL-S.5 excludes hosts with 900MHz clock rates, which contribute a significant
fraction of the platform’s overall compute power. So larger applications scheduled with
EXCL-S.5 exhibit worse performance than with FCFS. In the next section, we propose
strategies that use a makespan predictor to filter hosts in a way that is less sensitive
to the clock rate distribution, and compare it to EXCL-S.5 for different desktop grid
configurations.
V.B.2 Using Makespan Predictions
To avoid the pitfalls of using a fixed threshold, such as the clock rate .5 standard
deviations below the mean in the case of EXCL-S.5, we develop a heuristic
where the scheduler uses more sensitive criteria for eliminating hosts. Specifically, the
heuristic predicts the application’s makespan, and then excludes those resources that
cannot complete a task by the projected completion time. Our rationale is that the
definition of a “slow” host should vary with the application size (or number of tasks to
be completed during runtime), instead of the distribution of clock rates. That is, large
applications with many tasks relative to the number of hosts should use most of the
hosts as long as they do not delay application completion, whereas small applications
with fewer tasks than hosts should use only a small subset of hosts that can complete a
task by the application’s projected makespan.
To predict the makespan, we compute the average operations completed per
second for each host taking into account host load and availability using the traces and
then computing the average over all hosts (call this average r). If N is the number
of hosts in the desktop grid, we assume a platform with N hosts of speed r, and then
estimate the optimal execution time for the entire application with T tasks of size s in
operations via w_r = ⌈T/N⌉ · (s/r). The rationale behind this prediction method is that
the optimal schedule will never encounter task failures. So host unavailability and CPU
speed are the two main factors influencing application execution time, and these factors
are accounted for by r. In addition, we account for the granularity at which tasks can
be completed with ⌈T/N⌉.
To assess the quality of our predictor w_r, we compared the optimal execution
time with the predicted time for tasks 5, 15, and 35 minutes in size and applications
with 100, 200, and 400 tasks. The average error over 1,400 experiments is 7.0% with a
maximum of 10%. The satisfactory accuracy of the prediction can be explained by the
fact that the total computational power of the grid remains relatively constant, although
the individual resources may have availability intervals of unpredictable lengths. To show
this, we computed the number of operations delivered during weekday business hours in
5 minute increments, aggregated over all hosts. We found that the coefficient of variation
of the operations available per 5 minute interval was 13%. This relatively low variation
in aggregate computational power makes the accurate predictions of wr possible.
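Under the definitions above, the predictor w_r = ⌈T/N⌉ · (s/r) can be computed directly; `predict_makespan` is a hypothetical name, and `host_rates` is assumed to hold each host's trace-derived average delivered rate in operations per second:

```python
import math

def predict_makespan(T, s, host_rates):
    """w_r = ceil(T/N) * (s/r): estimated optimal makespan (seconds) for
    T tasks of s operations each on N hosts, where r is the mean
    effective compute rate (ops/sec, accounting for load and
    availability) over all hosts."""
    N = len(host_rates)
    r = sum(host_rates) / N
    return math.ceil(T / N) * (s / r)

# 100 tasks of 300 ops on 50 hosts averaging 1 op/sec:
# each host runs ceil(100/50) = 2 tasks of 300 s each -> 600 s
```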
The heuristic EXCL-PRED uses the makespan prediction, and also adaptively
changes the prediction as application execution progresses. In particular, the heuristic
starts off with a makespan computed with wr, and then after every N tasks are com-
pleted, it recomputes the projected makespan. We choose to recompute the prediction
after N tasks are completed for the following reasons. On one extreme, a static predic-
tion computed only once in the beginning is prone to errors due to resource availability
variations. At the other extreme, recomputing the prediction every second would not
be beneficial since it would create a moving target and slide the prediction back (until a
multiple of N tasks are completed).
If the application is near completion and the predicted completion time is too
early, then there is a risk that almost all hosts get excluded. So, if there are still
tasks remaining at time pred − .95 ∗ meanops, where pred is the predicted application
completion time and meanops is the mean clock rate over all hosts, the EXCL-PRED
heuristic reverts to PRI-CR at that time. This ensures that EXCL-PRED switches to
PRI-CR when it is clear that most hosts will not complete a task by the predicted
completion time. Note that if the heuristic waited until time pred (versus pred − .95 ∗
meanops) before switching to PRI-CR, it would result in poor resource utilization as seen
in some of our early simulations, since most hosts are available and excluded by time
pred. Therefore, waiting until time pred before making task assignments via PRI-CR
would cause most hosts to sit needlessly idle.
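The per-request decision of EXCL-PRED, including the revert rule, might be sketched as follows (all names are illustrative; `pred` is the current makespan prediction and `meanops` the quantity used in the .95 cutoff above):

```python
def excl_pred_decide(now, clock_rate, task_size, pred, meanops, reverted):
    """EXCL-PRED's decision for a host requesting a task at time `now`
    (seconds). Returns (assign, reverted). Once within .95 * meanops of
    the predicted completion time `pred`, revert to PRI-CR and assign a
    task to every requesting host. Otherwise assign only if the host's
    lower-bound finish time (from its clock rate alone, ignoring load
    and failures) meets the predicted makespan."""
    if reverted or now >= pred - 0.95 * meanops:
        return True, True                       # PRI-CR: no exclusion
    finish_lb = now + task_size / clock_rate    # best-case completion
    return finish_lb <= pred, False
```

A host that cannot finish even in the best case by `pred` is excluded; late in the run, the revert rule keeps available hosts from sitting needlessly idle.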
V.B.2.a Evaluation on Different Desktop Grids
We tested and evaluated our heuristics in simulation on all the desktop grid
platforms described in Section III.D. We focus our discussion here on the platforms on
which we found remarkable results, namely the SDSC, GIMPS, and LRI-WISC platforms,
and report the results of the other platforms in Appendix B. In particular, since all of
our heuristics use only clock rate information for resource selection or exclusion, the
heuristics usually produced similar results on platforms whose hosts have relatively
similar clock rates (e.g., the DEUG and LRI platforms), with the exception of the
UCB platform (see Appendix B).
Figure V.6: Heuristic performance on the SDSC grid. Average makespan relative to
optimal for FCFS, PRI-CR, EXCL-S.5, and EXCL-PRED, for task lengths of 5, 15, and
35 minutes (on a dedicated 1.5GHz host) and applications with 100, 200, and 400 tasks.
Figure V.6 shows that on the SDSC grid PRI-CR performs nearly as well as
EXCL-PRED or EXCL-S.5 for applications with 100 or 400 tasks, but performs more
than 23% worse than EXCL-S.5 for applications with 200 tasks. The performance of
PRI-CR depends greatly on the number of tasks in the application and whether this number
causes the slow hosts to be excluded from the computation. For the application with 100
tasks, the slow hosts get excluded and so PRI-CR does relatively well (see Figure V.7).
However, for the application with 200 tasks, PRI-CR assigns tasks to slow hosts, which
then impede application completion. For the application with 400 tasks, there are enough
tasks such that most hosts are kept busy with computation while tasks on the slow hosts complete.
In contrast to PRI-CR, the exclusion heuristics perform relatively well for all
application sizes. Figure V.6 shows that EXCL-PRED usually performs as well as EXCL-
S.5 on the machines at SDSC, but there is no clear advantage for using EXCL-PRED; for
the particular distribution of clock rates in the SDSC desktop grid, EXCL-S.5 appears
to have the particular threshold that yields the best performance. Of all the heuristics,
EXCL-S.5 eliminates the highest percentage of laggers caused by slow hosts; the reduc-
tion in the percent of laggers caused by slow hosts is as high as ∼60%. EXCL-PRED
has slightly more laggers caused by slow hosts than EXCL-S.5 as it is less aggressive in
filtering hosts than EXCL-S.5. Consequently, EXCL-PRED performs 13% more poorly
than EXCL-S.5 for the application with two-hundred 15-minute tasks. We have found
after close inspection of our traces and the laggers that this is because of a handful of
relatively slow hosts that finish execution past the projected makespan and/or task failures
on these slow hosts occurring near the end of the application. For the application with
400 tasks, the delay is hidden as there are enough tasks to keep other hosts busy until
the slow hosts can finish task execution. For the application with 100 tasks, the rela-
tively slow and unstable hosts get filtered out as there are fewer tasks than hosts and
the heuristic prioritizes resources by clock rate.
Using the same reasoning for the SDSC platform, we can explain why EXCL-
S.5 outperforms EXCL-PRED for the GIMPS desktop grid (see Figure V.9), which like
the SDSC grid has a left heavy distribution of resource clock rates. On the GIMPS
resources, applications scheduled with FCFS or PRI-CR often cannot finish during the
weekday business hours period, i.e., have application completion times greater than 8
hours, because of the use of extremely slow resources. So, slow hosts, especially in
Internet desktop grids with a left-heavy distribution of clock rates, are detrimental to
the performance of both FCFS and PRI-CR.
Although EXCL-S.5 performs the best for the SDSC and GIMPS desktop grids,
Figure V.7: Cause of Laggers (IQR factor of 1) on SDSC Grid. 1 → FCFS. 2 → PRI-CR.
3 → EXCL-S.5. 4 → EXCL-PRED. Panels show the number of laggers caused by a slow
host versus a failed task, for applications of 100, 200, and 400 tasks with 5, 15, and
35 minute task lengths.
Figure V.8: Length of task completion quartiles on SDSC Grid. 0 → OPTIMAL. 1 →
FCFS. 2 → PRI-CR. 3 → EXCL-S.5. 4 → EXCL-PRED. Panels show the duration
(seconds) of the 1st through 4th quartiles, for applications of 100, 200, and 400 tasks
with 5, 15, and 35 minute task lengths.
Figure V.9: Heuristic performance on the GIMPS grid. Average makespan relative to
optimal for FCFS, PRI-CR, EXCL-S.5, and EXCL-PRED.
the threshold used by EXCL-S.5 is inadequate for different desktop grid platforms, and
the filtering criteria and adaptiveness of EXCL-PRED are advantageous in the other sce-
narios. In particular, EXCL-PRED either performs the same as or outperforms EXCL-
S.5 for the multi-cluster LRI-WISC platform. For the application with 400 tasks (see
Figure V.10), EXCL-PRED outperforms EXCL-S.5 in the case of the LRI-WISC by
17%. EXCL-S.5 in the LRI-WISC desktop grid excludes all 600MHz hosts, which con-
tribute significantly to the platform’s overall computing power. In general, the longer
the steady state phase of the application, the better EXCL-PRED performs with re-
spect to EXCL-S.5, since EXCL-S.5 excludes useful resources some of which are utilized
by EXCL-PRED. This explains why EXCL-PRED performs better than EXCL-S.5 for
applications with more tasks and larger task sizes. While PRI-CR does as well as EXCL-
PRED, clearly PRI-CR is not as effective on other platforms, especially those with a left
heavy distribution of clock rates.
In conclusion, using a makespan prediction can prevent unnecessary exclusion of
useful resources. However, this method is sometimes too conservative in the elimination
of hosts, especially for shorter applications.
Figure V.10: Heuristic performance on the LRI-WISC grid. Average makespan relative
to optimal for FCFS, PRI-CR, EXCL-S.5, and EXCL-PRED.
V.C Related Work
Since the emergence of grid platforms, resource selection on heterogeneous,
shared, and dynamic systems has been the focus of intense investigation. However, desk-
top grids compared with traditional grid systems incorporating mainly a set of clusters
and/or MPP’s are much more heterogeneous and volatile as reflected by the results in
Chapter III. Consequently, the platform models used in grid scheduling are inadequate
for desktop grids.
One example of this inadequacy is the typical model of resource availability.
As discussed in Chapter III, availability models based on host or CPU availability, such
as those described in [32, 56, 92] do not accurately reflect the availability of resources
as perceived by a desktop grid application. So, the scheduling heuristics designed and
evaluated with these models are inapplicable to desktop grid environments.
For example, the work in [33] describes a system for scheduling soft real-time
tasks using statistical predictors of host load. The system presents to a user confidence
intervals for the running time of a task. These confidence intervals are formed using time
series analysis of historical information about host load. However, the work assumes a
homogeneous environment and disregards task failures caused by user activity (as the
system does not specifically target desktop grid environments). As such, the effectiveness
of the system on desktop grids is questionable.
Another example is the work in [80], which studies the problem of scheduling
tasks on a computational grid for the purpose of online tomography. The application
consist of multiple independent tasks that must be scheduled in quasi-real-time on a
shared network of workstations and/or MPP’s. To this end, the authors formalize the
scheduling problem as a constrained optimization problem. The scheduling heuristics
then construct a plan given the constraints of the user (e.g., requirement of feedback
within a certain time) and the characteristics of the application (e.g., data input size).
Although the scheduling model considers the fact that hosts can be loaded, the model
does not consider task failures. Given the high task failure rates in desktop grid systems,
the same heuristics executed on desktop grids will likely suffer from poor performance.
The work described in [61] is the most relevant in terms of desktop grid schedul-
ing. The author investigates the problem of scheduling multiple independent compute-
bound applications that have soft-deadline constraints on the Condor desktop grid sys-
tem. Each “application” in this study consists of a single task. The issue addressed in the
paper is how to prioritize multiple applications having soft deadlines so that the highest
number of deadlines can be met. The author uses two approaches. One approach is to
schedule the application with the closest deadline first. Another approach is to deter-
mine whether the task will complete by the deadline using a history of host availability
from the previous day, and then to randomly choose a task that is predicted to complete
by the deadline. The author finds that a combined approach of scheduling the task
that is expected to complete with the closest deadline is the best method. Although the
platform model in that study considers shared and volatile hosts, the platform model
assumes that the hosts have identical clock rates and that the platform supports check-
pointing. So, the study did not determine the impact of relatively slow hosts or task failures
on execution for a set of tasks; likewise, the author did not study the effect of resource
prioritization (e.g., according to clock rates) or resource exclusion.
V.D Summary
In this chapter, we investigated the use of two resource selection techniques for
improving application makespan, namely resource prioritization and resource exclusion.
We found that resource prioritization could improve application performance, but that
the improvement varied greatly with the number of tasks per application; if the applica-
tion consisted of many tasks, then the tasks would inevitably be assigned to slow hosts,
which limited performance. When the number of tasks was about equal to or greater than
the number of hosts, there was little benefit of prioritization over FCFS. The most capable
hosts tended to request tasks the most often, and so FCFS performed almost as well as
any of the prioritization heuristics we studied. Moreover, waiting for a pool of host re-
quests to collect before performing resource selection only delayed application execution.
When the number of tasks was less than the number of hosts, prioritization resulting in
exclusion of poor hosts improved performance. PRI-CR on average performed 1.65 times
better than FCFS for applications with 100 tasks. We found that using static clock rate
information was more useful than using relatively dynamic information about the length
of availability intervals; the mean availability interval length is a poor predictor of host
performance.
We then studied heuristics to eliminate these slow hosts from the application
execution. Our exclusion heuristics used either a fixed threshold (with respect to the
platform’s mean clock rate) by which to filter hosts, or an adaptive threshold based
on the application’s predicted makespan. When using a fixed threshold, the exclusion
heuristics achieved high performance gains; EXCL-S.5, which was the best performing
fixed-threshold heuristic on the SDSC platform, performed 1.49 times better than FCFS
on the SDSC grid. However, exclusion using a fixed threshold can sometimes degrade
performance, depending on the distribution of host speeds. We then studied another
heuristic that excluded resources according to a predicted makespan. That is, periodically,
the heuristic EXCL-PRED made a makespan prediction, and excluded those
hosts that could not complete a task by the predicted makespan. For the SDSC and the
GIMPS platforms, EXCL-PRED proved to be too conservative in its exclusion of
resources and performed up to 1.14 times worse. However, on the multi-cluster LRI-WISC,
EXCL-PRED performed up to 1.19 times better, especially for longer applications that
can make more use of the slower hosts incorporated by EXCL-PRED but excluded by
EXCL-S.5. We will see in the next chapter how EXCL-PRED can be combined with
task replication so that it performs best (or close to best) on all platforms.
Chapter VI
Task Replication
VI.A Introduction
In the previous chapter, we explored a range of heuristics that determined on
which host to schedule a task. However, even if the best resource selection method is
used, performance degradation due to task failures is still possible. The relatively long
last quartile of task completion times of the best performing heuristic compared to the
last quartile of the optimal (which was as much as 20 minutes or 13.8 times shorter)
indicates there is much room for improvement. In this chapter, we augment the resource
selection and exclusion heuristics described previously to use task replication techniques
for dealing with failures.
We define task replication as the assignment of multiple task instances of a
particular task to a set of hosts; a task is the application’s unit of work and a task
instance is the corresponding application executable and data inputs to be assigned to
a host. We refer to the first task instance created as the original and the replicated
task instances as replicas. By assigning multiple task instances to hosts, the probability
of all tasks failing can be reduced. Also, replication can be a means of adapting to
dynamic host arrivals (as most desktop grid systems do not support process migration);
for example, in the case where a task has been assigned to a relatively slow host but a
fast host arrives shortly thereafter, a task can be replicated on the fast host (as opposed
to migrated) to accelerate task completion.
Task replication is a plausible technique for coping with task failures and delays
for at least two reasons. First, there is often an abundance of resources available com-
pared to the amount of work to be completed. At one point in the SETI@home project,
there were more participants than actual tasks to distribute and so the scheduler began
replicating tasks just to keep the participants busy [86]. In the Sprite project [36], the
authors noted that the use of idle hosts is limited by the lack of applications instead
of the lack of hosts. Finally, personal communication with one of the committee mem-
bers [29] suggests that desktop grids within enterprises are often underutilized. Because
there is little contention for resources among applications, replication is often a plausible
option.
Second, task replication is relatively easier to implement and deploy than check-
pointing or process migration because replication requires no modification of the appli-
cation nor the hosts’ operating system. With little modification, schedulers in most
desktop grid systems [37, 39, 87] can support task replication; only simple bookkeeping
details for each task instance need to be added (see Chapter VII). In contrast, imple-
mentation of system-level checkpointing and process migration often requires integration
with the kernel (and is often highly specific to the kernel version) [52, 36], which is not
always possible considering the wide range of operating systems (and versions) on hosts
found in enterprise and Internet desktop grids [78]. Moreover, remote checkpointing
often requires servers to store checkpoints, and process migration often involves moving
the entire state of the application across different hosts. Considering that hosts with
memory sizes of 512MB are common and the relatively low data transfer speeds capable
through the Internet, remote checkpointing or process migration across Internet desk-
top grids may not be practical or feasible, especially for applications that require rapid
application turnaround.
In order to replicate tasks effectively, we investigate the following issues:
1. Which task to replicate and which host to replicate to. If a task instance is already
running on a fast and stable host, replicating the task on a different host with a
lower clock rate or less availability clearly will not improve performance. We study
different methods of choosing which task to replicate and on which host to schedule
a replica.
2. How much to replicate. Clearly, task throughput tends to decrease inversely with
the amount of replication. The reason is simply because if r task instances of a
particular task are assigned then the effective amount of work increases by a factor
of r, and so throughput is reduced by a factor of 1/r. On one extreme, a task
could be replicated only once, and on another extreme a task could be replicated
on all available hosts. We determine the performance improvement and waste for
various levels of replication.
Regarding the issue of when to replicate during an application’s execution, all
of our heuristics only replicate when there are more hosts than tasks. Applications
that have a number of tasks larger than the number of hosts will often have a steady-
state phase, and replicating during this steady state phase will usually not improve
makespan, and only delay task completion. The fact that the length of time in the
first three quartiles of application execution is close to the optimal supports our claim
that replication is unnecessary during this phase (see Figure V.7). Thus, we only use
replication after the point at which the number of available hosts is greater than the
number of tasks remaining, scheduling replicas only when there is a surplus of hosts, and
in this way, reducing the chance that a replicated task will delay the execution of another
task. Replicating anytime sooner could cause a host to do redundant work when there are
more unscheduled tasks than hosts, and thereby cause a delay in application completion.
We examine the above replication issues with respect to three broad approaches
for task replication, namely proactive, reactive, and hybrid approaches. With proactive
replication, multiple instances of each task are created initially and assigned as hosts
become available. Proactive replication techniques are aggressive in the sense that repli-
cation is done before a delay in application completion time has occurred. In contrast,
with reactive replication, the heuristics replicate a task only when the task’s completion
has been delayed and its execution is delaying completion; in this sense, the heuristics
are reactive. Finally, we develop a heuristic that uses a hybrid approach for replicating
tasks that either have a high risk of delaying application completion or are currently
delaying completion; as such, the heuristic uses both proactive and reactive replication
techniques.
VI.B Measuring and Analyzing Performance
VI.B.1 Performance metrics
Similar to Chapter V, we continue to use makespan relative to optimal as the
performance metric. In addition, we use waste, which is the percent of tasks replicated
(including those that fail), to quantify the expense of wasting CPU cycles. A replication
heuristic that has high waste would be problematic if the entire desktop grid is loaded
and multiple applications are competing for resources. (Note that the reason we did not
consider the heuristics with replication in the previous chapter is that replication is not
always an option when there is high resource contention among multiple applications in
the system.)
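One plausible formalization of the waste metric, assuming it counts replica instances (including those that later fail) beyond the one original per task, as a percentage of the application's task count (names and layout are illustrative):

```python
def waste(num_tasks, instances_assigned):
    """Percent of tasks replicated: task instances dispatched beyond
    the one original per task, including instances that fail, as a
    percentage of the number of tasks in the application."""
    return 100.0 * (instances_assigned - num_tasks) / num_tasks

# 100 tasks, 150 instances dispatched in total -> 50% waste
```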
VI.B.2 Method of Performance Analysis
In general, we use the same techniques of lagger analysis used in Chapter V.
However, in our analysis of laggers, we take into account replication as follows. We
define a task instance as the executable and data of a particular task assigned to a host.
Replication involves assigning multiple instances of a task each to a different host. When
task replication is used, some task instances might complete before the lagger threshold
while others complete after the threshold. To address this scenario, we only classify task
instances of a task as laggers if the completion times of all task instances of that task fall
after the lagger threshold. In this way, if instances of a task have completed before the
lagger threshold, any instances of the task completed after the threshold are excluded
from lagger analysis. If a task instance is classified as a lagger, we consider all of the
instances of the corresponding task in the lagger analysis in order to assess why each
task instance was completed late.
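The classification rule can be sketched as follows, assuming `completions` maps each task to the completion times of all its instances (with `math.inf` standing in for instances that failed); the names are illustrative:

```python
import math

def lagger_tasks(completions, threshold):
    """A task counts as a lagger only if every one of its instances
    completed after the lagger threshold; if any instance beat the
    threshold, all of the task's instances are excluded from lagger
    analysis."""
    return {task for task, times in completions.items()
            if all(t > threshold for t in times)}

runs = {"t1": [10, 55], "t2": [60, 70], "t3": [math.inf]}
late = lagger_tasks(runs, 40)   # t1 is excluded: one instance finished at 10
```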
VI.C Proactive Replication Heuristics
We augment the heuristics PRI-CR, EXCL-S.5, and EXCL-PRED described in
Chapter V to use replication and refer to the new heuristics as PRI-CR-DUP, EXCL-
S.5-DUP, and EXCL-PRED-DUP respectively. When scheduling an application,
each of the heuristics will create two instances (one original and one replica) of each
task and place them into a priority queue. Replicas are scheduled from this queue only
when the number of hosts available is greater than the number of tasks to schedule. The
tasks are prioritized according to the clock rate of the host to which the original task instance was assigned, so task instances assigned to slower hosts are replicated first. The heuristics PRI-CR-DUP, EXCL-S.5-DUP, and EXCL-PRED-DUP differ
by the set of hosts considered for task assignment as described in Chapter V.
All the heuristics discussed above prioritize tasks according to the clock rate
of the host to which the original task instance was assigned. We study other criteria
for selecting which task to schedule. EXCL-PRED-DUP-TIME is similar to EXCL-PRED-DUP except that replicas of the original task instances assigned farthest in the past are scheduled first; such task instances most likely failed or are stuck
on slow hosts. EXCL-PRED-DUP-TIME-SPD prioritizes the tasks according to the
time the first task instance was assigned plus the shortest possible completion time of
the task, i.e., the task size divided by the host’s maximum compute rate. Since most
hosts are available most of the time, we expect that most hosts should complete tasks in
the shortest possible time, and the heuristic replicates those tasks that take longer and
whose execution has been delayed.
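The three replica orderings can be sketched as follows; the field names (`host_clock`, `assigned_at`, `size`) are illustrative simplifications, not the simulator's interface.

```python
def replica_queue(tasks, criterion="clock"):
    """Order tasks for replication under the three priority criteria.

    Each task records the clock rate of the host its original instance
    was assigned to, the time of that assignment, and its size in
    seconds on a dedicated host (hypothetical fields).
    """
    def key(t):
        if criterion == "clock":        # PRI-CR-DUP family: the task
            return t["host_clock"]      # on the slowest host first
        if criterion == "time":         # EXCL-PRED-DUP-TIME: the
            return t["assigned_at"]     # oldest assignment first
        # EXCL-PRED-DUP-TIME-SPD: assignment time plus the shortest
        # possible completion time (task size / host compute rate)
        return t["assigned_at"] + t["size"] / t["host_clock"]
    return [t["id"] for t in sorted(tasks, key=key)]
```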
The heuristics above only create one replica for each task. We also study the
effect on application performance of varying the number of times a task is replicated. We
vary the number of replicas created by EXCL-PRED-DUP to be 2, 4, and 8 for heuristics
EXCL-PRED-DUP-2, EXCL-PRED-DUP-4, and EXCL-PRED-DUP-8, respectively.
VI.C.1 Results and Discussion
The addition of replication to each heuristic invariably improves performance significantly, by 35% on average (see Figure VI.1), for the SDSC platform. Somewhat surprisingly, the performance of each of the replication heuristics is similar, regardless
of which set of hosts is excluded. We attribute this to the fact that when replication is
done near the end of the application there are far more hosts than tasks and of these
hosts, several are fast and stable. So replication is done on the same set of relatively fast
hosts for each heuristic (about 20% of the hosts have clock rates greater than 2GHz),
Figure VI.1: Performance Of Heuristics Combined With Replication On SDSC Grid.
                        Number of tasks
  Heuristic             100       200       400
  EXCL-PRED-DUP-2      +2.6%    +0.15%    +3.4%
  EXCL-PRED-DUP-4      +6.7%    -3.0%     +2.0%
  EXCL-PRED-DUP-8      +4.3%    -7.4%     -4.1%
Table VI.1: Mean performance difference relative to EXCL-PRED-DUP when increasing
the number of replicas per task.
and excluding slow resources at this point is ineffectual.
We find that after a task is replicated once, replicating more often does not
improve performance. Table VI.1 shows the mean performance difference between EXCL-PRED-DUP and EXCL-PRED-DUP-2, EXCL-PRED-DUP-4, and EXCL-PRED-DUP-8 for applications with 100, 200, and 400 tasks. The maximum mean improvement
relative to EXCL-PRED-DUP over all heuristics is 6.7%. The lack of performance im-
provement is partly due to the fact that replicating a task once dramatically decreases
the probability of failure since there are many fast hosts available near the end of ap-
plication execution. Thereafter, creating more replicas will not significantly reduce the
probability of failure. Moreover, if too many replicas are created, then a large fraction of the hosts will be doing redundant work, preventing useful work from being done and thus
degrading performance. This explains the performance degradation shown in Table VI.1
of the EXCL-PRED-DUP-4 and EXCL-PRED-DUP-8 heuristics.
Several of the trends described for the SDSC platform match those found on the DEUG, LRI, and UCB platforms, which we summarize here and show in Appendix C. The performance improvement resulting from replication on the DEUG
platform is less than the improvement found with the SDSC platform because the DEUG
host clock rates are relatively homogeneous compared to the SDSC host clock rates.
Little improvement is found on the LRI platform because the hosts are both stable and
have homogeneous clock rates. Replication on the UCB platform results in high benefits
as the hosts are volatile and replication reduces the chance of failure. We conclude that
(proactive) replication can be useful either when there is a wide range of host clock rates
and/or the hosts are volatile.
Figure VI.2: Waste Of Heuristics Using Proactive Replication On SDSC Grid.
Despite the performance improvement when creating a single replica, the waste
in resources is significant (see Figure VI.2), and is 29% on average and as high as ∼90%.
In loaded desktop grids especially, such waste is unacceptable and would result in a
dramatic decrease in overall system throughput. We develop heuristics that use reactive
replication to reduce the level of waste in the next section.
VI.D Reactive Replication Heuristics
Up to now, we have considered heuristics that place all replicas of a task in the queue as soon as the original task is scheduled, and this resulted in high waste.
In an effort to improve efficiency, we now consider heuristics that are discriminating in
deciding which tasks are replicated. We modify the EXCL-PRED heuristic to evaluate
certain criteria for each task before placing a replica in the queue, effectively delaying task
replication. EXCL-PRED-TO is similar to EXCL-PRED except it delays the creation
of replicas until the predicted application completion time passes. That is, whenever the
original task instance is scheduled, we associate with that task instance the predicted
application completion time. This completion time is determined using the makespan
predictor described in Section V.B.2, which uses the average effective compute rate per
host to predict when the application will complete. This predicted completion time is
then used as a “time-out” value; if by that time the task instance has not completed, we
create a replica and place it in the queue. This heuristic is optimistic in the sense that
it creates the replica only after it determines that the original task instance has failed
to complete by the predicted application completion time instead of replicating earlier.
The rationale is that we should not replicate tasks that have been scheduled to fast and
reliable hosts, and instead, we should only replicate when it has been determined that
the execution of the task instance is delaying application completion, i.e., when the task
instance’s execution goes past the predicted completion time. The heuristic effectively
only replicates when it is close to the completion time of the application.
Another heuristic we consider is EXCL-PRED-TO-SPD, which replicates
more aggressively than EXCL-PRED-TO but less aggressively than EXCL-PRED-DUP.
EXCL-PRED-TO-SPD creates a replica only after the minimum task completion time
has expired, i.e., after task size/host clock rate seconds have expired. The reasoning
is that since most hosts are unloaded most of the time, each host should usually be
completely available to execute a task. In the case where a task instance is not completed
in its expected execution time (e.g., because the task execution was suspended multiple
times or the host is slightly loaded), the heuristic assumes the task execution will delay
application completion and places a replica in the queue.
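The two timeout criteria can be sketched as follows; the instance fields and policy labels are illustrative assumptions, not the simulator's interface.

```python
def should_replicate(now, instance, policy, predicted_makespan):
    """Decide whether an uncompleted task instance warrants a replica.

    EXCL-PRED-TO waits until the predicted application completion
    time recorded at assignment has passed.  EXCL-PRED-TO-SPD waits
    only for the instance's minimum possible completion time, i.e.
    assignment time plus task size divided by the host's clock rate.
    """
    if policy == "TO":
        return now > predicted_makespan
    if policy == "TO-SPD":
        min_completion = instance["assigned_at"] + \
            instance["size"] / instance["host_clock"]
        return now > min_completion
    raise ValueError(policy)
```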
VI.D.1 Results and Discussion
The performance of EXCL-PRED-TO and EXCL-PRED-TO-SPD is similar to
the other more aggressive replication heuristics from Section VI.C (see Figure VI.3),
despite replicating tasks later during application execution and replicating tasks less
often; the mean difference in the average makespan between EXCL-PRED-DUP and
EXCL-PRED-TO is close to zero. This is due to the fact that the heuristics only replicate a
task instance when it is determined that the executing task instance will delay application
completion. Moreover, when a task instance is replicated, there is usually a fast and
stable host to complete the task instance quickly and reliably. We discuss this in further
detail in Section VI.E.4.
Figure VI.3: Performance of reactive replication heuristics on SDSC grid.
Also, the performance of EXCL-PRED-TO and EXCL-PRED-TO-SPD is remarkable because both heuristics match the performance of EXCL-PRED-DUP with much less waste (by as much as 86% on the SDSC platform). In all cases on the SDSC platform,
                      Platform
  Metric      SDSC      DEUG     LRI       UCB
  Makespan   +0.06%    -8.7%    -10.8%    -17.5%
  Waste      +86.2%    +39%     +71.4%    +26.1%
Table VI.2: Mean performance difference and waste difference between EXCL-PRED-
DUP and EXCL-PRED-TO.
EXCL-PRED-TO achieves the lowest waste of all the heuristics shown in Figure VI.4,
and is less wasteful than EXCL-PRED-TO-SPD by 65% on average. Again, we attribute
the efficiency of EXCL-PRED-TO to the makespan predictor, which forces the sched-
uler to wait as long as possible (without significantly delaying application execution)
before replicating a task. The results show that reactive replication can achieve high
performance gains at relatively low resource waste.
Figure VI.4: Waste of reactive replication heuristics on SDSC grid.
Table VI.2 summarizes the mean makespan difference and mean difference of
waste of EXCL-PRED-DUP and EXCL-PRED-TO on all platforms. A positive per-
centage means that EXCL-PRED-TO did better than EXCL-PRED-DUP. In terms of
mean makespan, EXCL-PRED-TO performs about 8.7% and 10.8% worse than EXCL-
PRED-DUP on the DEUG and LRI platforms, respectively. Although EXCL-PRED-TO
performs slightly worse on these platforms, it is much less wasteful (on average 39% and
71.4% less wasteful on the DEUG and LRI platforms). This is partly because EXCL-
PRED-TO can adjust to the volatility of the platforms. Because EXCL-PRED-TO does
not replicate until a task delays execution, which is unlikely in the LRI scenario, there
is much less waste with EXCL-PRED-TO than EXCL-PRED-DUP. In the opposite sce-
nario, where the platform is volatile, EXCL-PRED-TO will replicate tasks more often.
Nevertheless, we find that EXCL-PRED-TO performs worse (17.5% on average) than
EXCL-PRED-DUP because tasks have a relatively high chance of failing on the UCB
platform. When EXCL-PRED-TO replicates a task instance that has timed out, there
is a relatively high probability that the replica itself will fail, and so the benefits of
using EXCL-PRED-TO are less on the relatively volatile UCB compared to the other
platforms. In contrast to EXCL-PRED-TO, EXCL-PRED-DUP replicates tasks imme-
diately as soon as the original task instance is assigned to a host (versus waiting until
the predicted application completion time) so there is a smaller chance that both task
instances will fail and delay application completion. At the same time, the waste of
EXCL-PRED-TO is significantly less than EXCL-PRED-DUP by 26.1% on average.
In summary, we find that EXCL-PRED-TO in general performs similar to
EXCL-PRED-DUP (within 10% on average across all platforms) while causing much
less waste (on average, 55% less across all platforms). This is because EXCL-PRED-TO
only replicates when a task instance will delay application completion, and because the
replica is most often scheduled on a relatively fast and stable host. The exception is on
the UCB platform, where the resources are so volatile that replicating a task as soon as
the original task instance is assigned results in faster task completion than if timeouts
are used; in this case, EXCL-PRED-DUP performs 17.5% better than EXCL-PRED-TO.
VI.E Hybrid Replication Heuristics
In the previous sections, we designed and evaluated proactive and reactive
replication heuristics that replicate tasks either proactively or reactively in an effort
to reduce the probability of task failure near the end of application execution. In this
section, we investigate a hybrid approach for replication that replicates proactively those
tasks that have high chance of failing, while replicating reactively those tasks that have
not completed by a predicted completion time. Clearly, just combining the proactive
replication heuristic EXCL-PRED-DUP and the reactive replication heuristic EXCL-
PRED-TO would not be beneficial as EXCL-PRED-TO achieved similar performance as
EXCL-PRED-DUP but with far less waste on most platforms; EXCL-PRED-DUP was
wasteful because it indiscriminately replicated all tasks once in order of those assigned
to the slowest hosts. In contrast, we use a more refined method of determining which
task to replicate and how much to replicate with our hybrid heuristic. Our approach is
to use the probability of task completion on the previous day to predict the probability
of task completion on the following day. Using these predicted probabilities, we replicate
tasks until the predicted probabilities of task completion go below some threshold. We
describe the heuristic in detail below.
The REP-PROB heuristic uses the history of host availability to make in-
formed decisions regarding replication. Specifically, the heuristic prioritizes each host
according to its predicted probability of completing a task by the projected application
completion time. We use random incidence (as discussed in Section III.E.4) with the
previous day’s host traces to determine the predicted probability of task completion. The
projected application completion time is determined using the same makespan predictor
as EXCL-PRED-TO described in Section V.B.2.
Also, REP-PROB prioritizes each task according to its probability of completion by the predicted makespan given the set of hosts it has been assigned to; the
task with the lowest probability of completion is replicated on the host with the highest
probability. Regarding how many task instances to create, the heuristic could create a
single replica as in the EXCL-PRED-DUP heuristic. But if the two task instances were
both scheduled on slow and unreliable hosts, then the probability of task completion
would remain low and the task would require more replicas. Instead, REP-PROB uses
the probability of completion to estimate how many task replicas to create in order to
ensure the probability of task completion is greater than some threshold.
VI.E.1 Feasibility of Predicting Probability of Task Completion
To evaluate the feasibility of such an approach, we examine the stationarity
of the probability that a task of a given size completes from day to day. Figure VI.5
shows the probability of task completion per day for tasks 5, 15, and 35 minutes in length for each of the platforms. The graphs show the probabilities for all five business days in one week, starting with a Monday. We see that on each platform the probability of task completion is relatively constant, deviating from the previous day by no more than 10%. This provides evidence that the predicted values may be sufficiently close to the actual values.
Also, we calculate the prediction error of the probability of task completion for
each host from one day to the next. Figures VI.6(a), VI.6(b), VI.6(c), and VI.6(d) show
the CDF of prediction errors for the SDSC, DEUG, LRI, and UCB platforms. We find
in all of the platforms that at least 60% of the prediction errors are less than 25%. These
results combined with the evidence of host independence shown in Section III.E.5 made
us optimistic that we could compute the probability of task completion accurately.
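The day-to-day prediction error and the CDF reading reported above can be computed as in the following sketch (the helper names are illustrative).

```python
def prediction_errors(daily_probs):
    """Signed relative error when each day's per-host completion
    probability is predicted by the previous day's value."""
    return [(cur - prev) / prev
            for prev, cur in zip(daily_probs, daily_probs[1:])]

def fraction_within(errors, bound):
    """Fraction of prediction errors with magnitude below `bound`,
    i.e. a symmetric reading of the CDFs in Figure VI.6."""
    return sum(abs(e) < bound for e in errors) / len(errors)
```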
VI.E.2 Probabilistic Model of Task Completion
To create an accurate probabilistic model, we first created a simple deterministic finite automaton (DFA) to understand and clarify the various states of a task during
execution (see Figure VI.7). Note that the concept of availability used in the figure
refers to exec availability. First, the task begins execution (state 1). If the host fails
before the task can complete, the task fails (state 2), and we must wait until the host
becomes available again before beginning task execution again (state 1). If the host is
available long enough for the task to complete, the task completes (state 3).
With this model, it became apparent that using a geometric distribution to
model the probability that a task completes in a certain number of attempts would be
possible. By using a geometric distribution, we assume that each attempt to complete
a task instance on some host is independent of other attempts on the same host. In
particular, the probability of task completion can be computed using the following pa-
rameters:
[Four panels: (a) SDSC (08–12 Sep 2003), (b) DEUG (17–21 Jan 2005), (c) LRI (17–21 Jan 2005), and (d) UCB (07–11 Mar 1994), each plotting the probability of task completion per day for 5, 15, and 35 minute tasks.]
Figure VI.5: Probability of task completion per day for several task lengths.
Figure VI.6: CDF of prediction errors of the probability of task completion from one day to the next for 5, 15, and 35 minute tasks on a dedicated 1.5GHz host.
[States: (1) task begins execution, (2) task fails, (3) task completes. Transitions: host fails before task completion (1→2); host becomes available (2→1); host available long enough for the task to complete (1→3).]
Figure VI.7: Finite automaton for task execution.
[Timeline: starting from the current time, failed trials each of length L (the expected time to task failure plus the expected length of the unavailability interval) precede a successful execution of length X that completes by deadline D; here A = 4 trials.]
Figure VI.8: Timeline of task completion.
• H1, H2, . . . , HN : the set of N heterogeneous and volatile hosts.
• W1, W2, . . . , WT : the set of T tasks of an application to be scheduled.
• Ci: completion time of task Wi.
• ci,j : completion time of task instance j of task Wi.
• ri: number of instances of task Wi.
• D: the desired completion time of the application, as determined by our makespan
predictor, for example.
• X: execution time of a task on a particular host.
• A: starting from the current time, the number of attempts, i.e., trials, possible on
a particular host to complete a task by time D.
• L: the length of time for each failed attempt on a particular host.
• p: the probability of task completion (as computed by random incidence discussed
in Section III.E.4) for a particular host.
The parameters X,A, L, and p are all defined for a particular host Hm (and
should be written as XHm , AHm , LHm , and pHm , respectively), but for brevity we omit
the subscripts in our discussion below.
If a task is to complete by time D when executed on a particular host Hm, the
last attempt to complete a task must occur by time D−X. Thus, the number of attempts
A for task completion is given by b(D − X)/Lc + 1, where L is the time required for
each failed attempt (see Figure VI.8). In the DFA in Figure VI.7, L is the time the task
had been executing before failure just before entering state 2 from state 1 plus the time
before the host becomes available again, i.e., the length of the unavailability interval,
incurred when going from state 2 back to state 1. Ideally, L would be modelled by
the probability distribution of the task’s time to failure and the length of unavailability
intervals. However, constructing such a joint probability distribution is difficult as using
only a day’s worth of historical data results in a very sparse probability distribution over
multiple dimensions. So, as a simplification, we calculate L using the expected time to
task failure (which we can compute with random incidence) plus the expected length
of unavailability for a particular host (which we can derive from the traces). Then, the
probability that a task instance j of task Wi completes by time D can be estimated by:
P(c_{i,j} \le D) = \sum_{a=1}^{A} (1 - p)^{a-1} p,   (VI.1)

which sums, over a = 1, . . . , A, the probability that the task completes on the a-th attempt.
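Equation VI.1 can be evaluated directly, as in this sketch; the parameter names follow the definitions above.

```python
import math

def prob_complete_by(D, X, L, p):
    """Probability that a task instance completes by time D on one
    host (Equation VI.1).  X is the task's execution time on the
    host, L the time lost per failed attempt, and p the per-attempt
    completion probability.  The last attempt must occur by D - X,
    giving A = floor((D - X) / L) + 1 attempts."""
    if D < X:
        return 0.0  # not even one full execution fits before D
    A = math.floor((D - X) / L) + 1
    # Geometric sum: success on attempt a after a - 1 failures.
    return sum((1 - p) ** (a - 1) * p for a in range(1, A + 1))
```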
In Section III.E.5, we gave evidence that exec availability is independent among
hosts on certain platforms. Assuming that exec availability
is independent among hosts, the probability that a particular task Wi completes by time
D is estimated by:
P(C_i \le D) = P(\min_j(c_{i,j}) \le D)
             = 1 - P(\min_j(c_{i,j}) > D)
             = 1 - \prod_{j=1}^{r_i} P(c_{i,j} > D).   (VI.2)
Then, the probability that the application completes in time D can be estimated
by:
P(\max_i(C_i) \le D) = \prod_{i=1}^{T} P(C_i \le D).   (VI.3)
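Equations VI.2 and VI.3 translate directly into code; this sketch assumes the per-instance probabilities have already been computed with Equation VI.1.

```python
def prob_task_completes(instance_probs):
    """Equation VI.2: with independent hosts, a task completes by
    time D unless every one of its instances misses the deadline."""
    miss = 1.0
    for p in instance_probs:   # p = P(c_{i,j} <= D) per instance
        miss *= 1.0 - p
    return 1.0 - miss

def prob_app_completes(task_probs):
    """Equation VI.3: the application completes by time D only if
    every task does (again assuming independence)."""
    prod = 1.0
    for p in task_probs:
        prod *= p
    return prod
```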
So using the probability of completion for each host and desired completion
time, we can determine the amount of replication needed to achieve some minimum
probability threshold. Clearly, at any particular time during application execution, there
may not be enough hosts to replicate on in order to achieve the threshold. The heuristic
REP-PROB makes the best effort by replicating the task with lowest probability of
completion on the host with the highest, until no hosts remain; if a task
has no instances assigned, it is given the highest task priority to ensure that an instance
of each task is assigned before replicating.
While Equation VI.3 can be used to estimate the probability of application
completion in theory, in practice it is almost impossible to achieve given the high amount
of replication and number of hosts required. This can be shown by a simple back-of-the-envelope calculation to determine the number of instances per task required to achieve
some probability bound. That is, assume our application consists of 100 tasks to be
scheduled on the SDSC grid, and that our desired probability of application completion
P(\max_i(C_i) \le D) is 80%. Achieving this threshold requires that each task is completed with probability P(C_i \le D) = e^{\ln(0.8)/100} \approx 0.998, assuming that each task is completed with
equal probability. If a task instance fails with probability 20% (a realistic number as
shown in Section III.E.4), it would require four task instances for each task, totalling
400 task instances for the application with 100 tasks. Since there are only ∼200 hosts
in the SDSC platform, computing all task instances at once is not possible for even a
relatively small application. Furthermore, waste of 300% is extremely high and would
reduce the effective system throughput considerably. We confirmed these conclusions
in simulation for a range of application sizes (100, 200, 400 tasks) and task sizes (5,
15, 35 minutes on a dedicated 1.5GHz host); each task is replicated so often that the
application rarely completes by the predicted makespan. So instead of trying to achieve
a probability threshold per application, REP-PROB makes the best effort to achieve a
probability threshold per task using Equation VI.2.
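The back-of-the-envelope calculation above can be checked mechanically; the function below is a sketch that generalizes the arithmetic.

```python
import math

def replicas_needed(num_tasks, target_app, fail):
    """Replicas per task needed so that identical, independent tasks
    yield P(application completes) >= target_app, given a per-instance
    failure probability `fail`."""
    # Each task must complete with probability target_app**(1/num_tasks).
    per_task = math.exp(math.log(target_app) / num_tasks)
    r = 1
    while 1 - fail ** r < per_task:   # 1 - fail**r = P(some instance completes)
        r += 1
    return r
```

For 100 tasks, a target of 80%, and a 20% per-instance failure probability, this returns 4 replicas per task, i.e., 400 instances in total, confirming the estimate above.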
VI.E.3 REP-PROB Heuristic
A procedural outline of the REP-PROB heuristic is given below:
1. Predict the application completion time D using the makespan predictor described
in Section V.B.2.
2. Prioritize tasks according to the probability of task completion by time D estimated
by Equation VI.2. Unassigned tasks have the highest priority. Tasks that have
timed out have the second highest priority.
3. Prioritize hosts according to the probability of completing a task by time D.
4. While there are tasks remaining in the queue:
(a) Assign an instance of the task with the lowest probability of completion to
the host with the highest probability.
(b) Assign a timeout D to that task. If the task has not been completed by time
D, the task will be given the second highest possible priority (corresponding
to “timed-out” tasks).
(c) Recompute the task’s probability of completion.
(d) Remove the task from the queue if its probability of completion is above 80%.1
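The outline above can be sketched as a single assignment pass; the dictionaries and the probability helper are illustrative simplifications (the real heuristic also re-predicts D and handles timeouts).

```python
def rep_prob_schedule(tasks, hosts, threshold=0.8):
    """One pass of the REP-PROB assignment loop.  `tasks` maps a task
    id to the completion probabilities of its current instances;
    `hosts` maps a host id to its probability of completing a task by
    the predicted makespan D (computed elsewhere with Equation VI.1)."""
    def task_prob(ps):                  # Equation VI.2
        miss = 1.0
        for p in ps:
            miss *= 1.0 - p
        return 1.0 - miss
    free = sorted(hosts, key=hosts.get, reverse=True)  # best host first
    assignments = []
    while free:
        # Tasks still below the threshold are candidates; unassigned
        # tasks come first, then the lowest-probability task.
        pending = [t for t in tasks if task_prob(tasks[t]) < threshold]
        if not pending:
            break
        t = min(pending,
                key=lambda t: (len(tasks[t]) > 0, task_prob(tasks[t])))
        h = free.pop(0)
        tasks[t].append(hosts[h])       # new instance on the best host
        assignments.append((t, h))
    return assignments
```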
We hypothesize that REP-PROB should outperform EXCL-PRED-TO. REP-
PROB takes into account both host clock rate and host volatility when deciding which
task to replicate and which host to replicate on. As such, REP-PROB aggressively
replicates tasks that have a low probability of completion as soon as the original task
instance is assigned; this in turn reduces the chance that tasks scheduled on volatile
hosts will delay application completion. In contrast, EXCL-PRED-TO only replicates a
task if it has not been completed by the predicted makespan, and the replica is assigned
to a host based on its clock rate (disregarding the host's volatility). As such, tasks initially assigned to volatile hosts will not have replicas scheduled until late during the application execution, and this may result in delays in application completion. Also, replicas may
be assigned to volatile hosts (although the hosts may have relatively fast clock rates).
VI.E.4 Results and Discussion
Figure VI.9 shows the results for the SDSC platform for each application size,
while Table VI.3 shows the performance of REP-PROB relative to EXCL-PRED-TO
for all four platforms. A positive value in the table means that REP-PROB performed
that much better than EXCL-PRED-TO. (Figures for the other platforms are shown in
Appendix C.) Surprisingly, REP-PROB does not perform better than EXCL-PRED-TO
in the SDSC and DEUG platforms. In the one platform where REP-PROB does perform
significantly better than EXCL-PRED-TO, the performance difference on average is only
13%.
1 We tested a range of thresholds from 50-90% and found that a threshold of 80% is the most adequate in terms of improving application performance.
Figure VI.9: Performance of REP-PROB on SDSC grid.
Figure VI.10: Waste of REP-PROB on SDSC grid.
                      Platform
  Metric      SDSC     DEUG      LRI      UCB
  Makespan   -4%      -11.1%    +1.7%    +13.1%
  Waste      -140%    -37.5%    +3.3%    -19.2%
Table VI.3: Mean performance and waste difference between EXCL-PRED-TO and
REP-PROB.
The performance of EXCL-PRED-TO is similar to REP-PROB for several rea-
sons. First, there is strong correlation between the probability of completion by a partic-
ular time and clock rate as shown in Section III.E.6. Since EXCL-PRED-TO replicates
tasks on the hosts with the highest clock rates, these hosts will tend to have the highest
probability of completion by the projected completion time. Second, a large fraction of
the hosts in each platform are relatively stable. Figures VI.12(a), VI.12(b), and VI.12(c)
show the CDF of task failure rates for each host in the platform. For example, for a
fifteen minute task, the fraction of hosts with failure rates less than 20% for the SDSC,
DEUG, LRI, and UCB platforms are ∼60%, 75%, 100%, and 50% respectively. So for
the small fraction of tasks that timeout, the replica will most likely be scheduled on a
relatively stable host (especially since EXCL-PRED-TO will choose the host with the
fastest clock rate, which is correlated with the probability of task completion) and the
resulting probability of task failure will be dramatically lowered. For example, if the timed-out task has a 50% chance of failure and a replica is then scheduled on a host with a 20% chance of failure, then the probability that the task will fail is a mere 10%. The fact
that EXCL-PRED-DUP-2, EXCL-PRED-DUP-4, EXCL-PRED-DUP-8 did not improve
performance on the SDSC platform supports this claim (see Section VI.C). Moreover, by
comparing the number of laggers caused by task failures between EXCL-PRED-TO and
REP-PROB, we see little improvement in the number of laggers when the REP-PROB
heuristic is used. Figure VI.11 shows the number of laggers for applications scheduled by
the EXCL-PRED-TO and REP-PROB heuristics, and we can see from this figure that
the number of laggers caused by failures is usually similar; on average, REP-PROB has
only 0.66 fewer laggers than EXCL-PRED-TO. (We also see in Figure VI.11 that the number
of laggers for EXCL-PRED-TO and REP-PROB exceeds the number of laggers corre-
sponding to EXCL-S.5 for applications with 100 tasks that are 5 minutes in length; at the
same time, the mean makespans of EXCL-PRED-TO and REP-PROB are 56% better than the mean makespan of EXCL-S.5 on average. This discrepancy is due to the fact that
the IQRs for EXCL-PRED-TO and REP-PROB are shorter than EXCL-S.5’s, and so a
higher number of task instances are classified as laggers. So, when comparing the number
of laggers between one heuristic and another, one should also look at Figure VI.15, which
shows the mean makespans of each heuristic, to gain perspective.) Because there is not
a significant reduction in the number of laggers when the REP-PROB heuristic is used
and the mean makespans resulting from EXCL-PRED-TO and REP-PROB are similar,
the benefits of REP-PROB are dubious. Third, the (un)availability of one host with
respect to another can be correlated in some platforms and so the probability of task
completion computed is only a lower bound. The fact that availability of hosts in the
DEUG platform is correlated, as shown in Section III.9, may be one reason why EXCL-PRED-TO outperforms REP-PROB by ∼11% on that particular platform.
Moreover, REP-PROB wastes significantly more resources (as much as 140%
more than EXCL-PRED-TO) without much gain in performance (see Table VI.3). REP-
PROB naturally replicates more than EXCL-PRED-TO when the heuristic replicates
tasks with low probabilities of completion. One reason that this does not result in sig-
nificant performance improvement could be because of mispredictions in the probability
of task completion. Although a significant fraction of predictions may be within 25%
of the actual value (as discussed in Section VI.E.1), any misprediction that leaves a
task assigned to a volatile host unreplicated could be costly for the application. Also,
our assumption that the series of attempts to complete a task instance on a particular
host are independent may not be valid; by observing our traces, we found that a short
availability interval is often followed by another short availability interval.
Nevertheless, REP-PROB does perform slightly better than EXCL-PRED-TO on
the UCB platform. Because all the hosts in the UCB platform have the same clock rates
and EXCL-PRED-TO prioritizes hosts only by clock rates, EXCL-PRED-TO cannot
distinguish a stable host from a volatile one. REP-PROB on the other hand will prioritize
the hosts by their predicted probability of completion, and have an advantage in this
case. But, the performance improvement is limited again because a large fraction of the
hosts in the UCB platform are relatively stable.
VI.E.5 Evaluating the benefits of REP-PROB
The “Achilles’ heel” of EXCL-PRED-TO is the fact that it sorts only by clock
rates, and one can certainly construct pathological cases that make EXCL-PRED-TO
perform more poorly than REP-PROB. For example, one could imagine the scenario
where half the hosts are extremely volatile while the other half are extremely stable but
have slightly lower clock rates than the volatile hosts. In this case, EXCL-PRED-TO will
tend to schedule tasks to hosts with faster (albeit only slightly faster) clock rates, which
are also the most volatile; as a result, the tasks will tend to fail and delay application
completion. REP-PROB on the other hand will take into account host volatility and
schedule tasks to stable hosts.
To investigate this issue, we construct a new platform, half of which consists of
volatile hosts from the UCB platform. The clock rates of the UCB hosts are transformed
Figure VI.11: Cause of Laggers (IQR factor of 1) on SDSC Grid. 1→FCFS. 2→PRI-CR.
3→EXCL-S.5. 4→EXCL-PRED. 5→EXCL-PRED-TO. 6→REP-PROB. (Panels show the number
of laggers, broken down by cause (slow host or failed task), for applications of
100, 200, and 400 tasks with task sizes of 5, 15, and 35 minutes.)
Figure VI.12: CDF of task failure rates per host. (Three panels, for (a) 5 min.,
(b) 15 min., and (c) 35 min. tasks, each plotting the cumulative fraction of
hosts versus failure rate for the SDSC, DEUG, LRI, and UCB platforms.)
to follow a normal distribution with mean 1500MHz and standard deviation of 250MHz.
The other half of the new platform consists of stable hosts from the LRI cluster, which
is relatively homogeneous in terms of host clock rates. We then create a set of platforms
where we transform the clock rates of hosts from the LRI platform. Clearly, if the clock
rates of the stable LRI hosts are very low, then it will be better to schedule
tasks to the volatile UCB hosts, and EXCL-PRED-TO will perform better than REP-
PROB. If the clock rates of the stable LRI hosts are higher than UCB hosts, then it
will be better to schedule tasks to the stable and fast LRI hosts, and again EXCL-
PRED-TO will outperform REP-PROB. However, when clock rates of the LRI hosts are
“slightly” less than the clock rates of the UCB hosts, then REP-PROB has a chance of
outperforming EXCL-PRED-TO.
Specifically, we transform the clock rates of LRI hosts by -33%, -15%, -6%, +6%,
+15%, and +33% relative to the mean clock rate of UCB hosts (1500MHz), and we refer
to the resulting platforms as UCB-LRI-n33, UCB-LRI-n15, UCB-LRI-n06, UCB-LRI-
p06, UCB-LRI-p15, and UCB-LRI-p33, respectively. Then we run both EXCL-PRED-
TO and REP-PROB on each platform and determine how each heuristic performs. We
observe that REP-PROB performs at most ∼13% better than EXCL-PRED-TO, and only
for the platforms whose LRI clock rates lie between -15% and -6% of the mean
UCB clock rate (see Figure VI.13).
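The platform construction described above can be sketched as follows. This is a simplified illustration, not the actual experimental setup: host counts, the helper name, and the use of freshly sampled rates (rather than transformed trace data) are our assumptions; only the N(1500 MHz, 250 MHz) distribution and the percentage shifts come from the text.

```python
import random

def build_hybrid_platform(n_ucb, n_lri, shift_pct, seed=0):
    """Sketch of the hybrid UCB-LRI platform construction (helper name and
    host counts are ours). Volatile UCB hosts get clock rates drawn from
    N(1500 MHz, 250 MHz); stable LRI hosts get the 1500 MHz UCB mean
    shifted by shift_pct percent, as in the UCB-LRI-* platforms."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    ucb = [rng.gauss(1500.0, 250.0) for _ in range(n_ucb)]
    lri = [1500.0 * (1.0 + shift_pct / 100.0)] * n_lri
    return ucb, lri

# e.g., UCB-LRI-n15 shifts the stable hosts to 15% below the UCB mean
ucb, lri = build_hybrid_platform(50, 50, -15)
```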
This limited improvement in a relatively small set of hypothetical scenarios in-
dicates that REP-PROB will rarely outperform EXCL-PRED-TO, and any performance
difference is slight. Moreover, in practice on our real platforms, we find that REP-PROB
performs at most 13.1% better than EXCL-PRED-TO while causing 40% more waste
on average across all platforms. So, in general, we believe that EXCL-PRED-TO will
usually outperform or perform as well as REP-PROB, with the possibility of performing
slightly worse.
VI.F Estimating application performance
Estimates or bounds on the application makespan are useful for users submit-
ting applications. Using the results of our application simulations, we can give estimates
Figure VI.13: Performance difference between EXCL-PRED-TO and REP-PROB on the
transformed UCB-LRI platforms. (Performance difference in percent for 5, 15,
and 35 minute tasks, plotted against the clock rate mode of the transformed LRI
hosts, i.e., the percent deviation from the mean clock rate of the UCB hosts.)
of makespan for our best heuristic EXCL-PRED-TO and provide lower confidence inter-
vals for an application executed on each platform.
Table VI.4 shows the mean makespan of EXCL-PRED-TO on the SDSC plat-
form as well as the lower confidence intervals (80%, 90%, 95%) relative to the mean
makespan, standard deviation, and median. On the SDSC platform, we found that the
lower 80% confidence interval for application makespan is remarkably tight as it is less
than 8% away from the mean for all task sizes and numbers. The mean 80% lower
confidence intervals for the DEUG, LRI, UCB, GIMPS, and LRI-WISC platforms are
20%, 0.18%, 9%, 4%, and 3%, respectively, from the respective means. This means
that one could use the 80% lower confidence interval in Table VI.4 to get a
reasonably accurate prediction of the makespan within 20% of the mean.
(Nevertheless, the lower 95% confidence intervals are significantly wider,
deviating by as much as 60% from the mean.)
From the statistics of the empirical simulation data shown in Table VI.4, a user
could get an estimate of how long his/her application would take to execute on the SDSC
platform, and how much variance to expect. For example, an application with 200 tasks
that are 15 minutes in length on a dedicated 1.5GHz host should take about 50 minutes
to complete on the SDSC platform when scheduled with the EXCL-PRED-TO heuristic.
The actual makespan could be as much as 10 minutes longer than the predicted
mean (with 80% confidence).
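As a concrete sketch of how such an entry can be derived from simulation output, one could take the empirical quantile of the simulated makespans and express it relative to their mean. The function name and the nearest-rank quantile rule below are our assumptions; the dissertation's exact interval construction may differ.

```python
def makespan_bound(samples, level=0.80):
    """Return (mean, relative bound): the empirical `level`-quantile of
    simulated makespans, expressed as a fraction above/below the mean.
    Nearest-rank quantile; a sketch, not the exact construction used
    for Table VI.4."""
    xs = sorted(samples)
    mean = sum(xs) / len(xs)
    idx = min(len(xs) - 1, int(level * len(xs)))
    return mean, xs[idx] / mean - 1.0

# e.g., 100 simulated makespans of 1..100 seconds
mean, rel = makespan_bound(range(1, 101))
```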
                         Makespan statistics
Task number    Mean   80% c.i.   90% c.i.   95% c.i.   std. dev.   median

(a) 5 min. tasks
    100         676    +0.05      +0.14      +0.44        194         613
    200        1087    -0.02      +0.08      +0.31        768        3811
    400        1752    +0.02      +0.12      +0.27        240        1713

(b) 15 min. tasks
    100        1960    +0.07      +0.34      +0.59        618        1709
    200        3012    -0.02      +0.06      +0.19        356        2923
    400        4831    +0.03      +0.05      +0.10        524        4814

(c) 35 min. tasks
    100        4037    +0.02      +0.08      +0.31        768        3810
    200        6824    -0.005     +0.03      +0.16        600        6750
    400       10892    +0.04      +0.05      +0.10        766       10882

Table VI.4: Makespan statistics of EXCL-PRED-TO for the SDSC platform. Lower
confidence intervals are w.r.t. the mean. The mean, standard deviation, and
median are all in units of seconds.
VI.G Related Work
VI.G.1 Task replication
The authors in [43] use a probabilistic model similar to the one described in
Section VI.E to analyze various replication issues. The platform model used in
the study resembles ours in that the resources are shared, task preemption is
disallowed, and checkpointing is not supported. The application models were also
similar: one model was based on tightly-coupled applications, while the other
was based on loosely-coupled applications consisting of task parallel components
before each barrier synchronization. The authors then assume that the
probability of task completion follows a geometric distribution.
Despite the similarities in platform and application models, there were a number
of important differences between that study and our own. First, the results were based
on a discrete time model, where the unit of time is the length of the task l. That is,
if a task that began execution at time t fails, the task is started only after time t + l.
This assumption is made to ensure each “trial” is evenly spaced so that computing the
time to task completion is simplified. However, their assumption is problematic because
it places an unrealistic constraint on the time required to restart task execution. In
particular, in the case of a task failure, their model assumes that the expected time to
failure plus the expected period of unavailability) must equal the task length l and is thus
entirely dependent on the task length, which is a rare and improbable occurrence; The
second difference between that study and our own is that their platform model assumes
a homogeneous environment, and so their study does not consider the effect of using
hosts of different speeds when replicating.
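Under the geometric assumption of [43], the expected completion time has a simple closed form. The derivation below is our restatement of that model, with p denoting the per-trial success probability (a symbol we introduce, not one used in the cited study):

```latex
% Each trial occupies exactly l time units (the discrete-time assumption
% of [43]) and succeeds independently with probability p, so the number
% of trials N is geometric:
\begin{align*}
  \Pr[N = k] &= (1-p)^{k-1} p, \\
  \mathbb{E}[N] &= \frac{1}{p}, \qquad
  \mathbb{E}[T] = l\,\mathbb{E}[N] = \frac{l}{p}.
\end{align*}
```

This is exactly why the even-spacing assumption simplifies their analysis: the completion time is just l times a geometric random variable.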
The work in [50] examines analytically the costs of executing task parallel ap-
plications in desktop grid environments. The model assumes that after a machine is un-
available for some fixed number of time units, at least one unit of work can be completed.
Thus, the estimates for execution time are lower bounds. We believe the assumption is
too restrictive, especially since the sizes of availability intervals can be correlated in
time [62]; that is, a short availability interval (which would likely cause task failure) will
most likely be followed by another short availability interval.
Other studies of task replication [88, 74, 87] have focused on detecting errors
and ensuring correctness. Although many of these types of security methods have been
deployed by current systems, most are ad hoc and none are fail-proof. For
example, SETI@home recomputes, on a dedicated machine, any task that indicates a
positive signal, in order to prevent false positives. Another example is the
work described
in [74] where the author develops methods to give probabilistic guarantees on result
correctness, using a credibility metric for each worker. The results however are built
upon dubious and unsupported assumptions of the probabilities of task result error rates.
Given the numerous sources of error (e.g., hardware/software malfunction, malicious
attacks), creating probabilistic models of error rates may not be possible.
VI.G.2 Checkpointing
Task checkpointing is another means of dealing with task failures since the task
state can be stored periodically either on the local disk or on a remote checkpointing
server; in the event that a failure occurs, the application can be restarted from the last
checkpoint. In combination with checkpointing, process migration can be used to deal
with CPU unavailability or when a “better” host becomes available by moving the pro-
cess to another machine. As discussed earlier in Section VI.A, remote checkpointing or
process migration is most likely infeasible in Internet environments, as the application
can often consume hundreds of megabytes of memory and bandwidth over the Internet
is often limited. (Although our heuristics are evaluated using traces gathered solely from
enterprise environments, the heuristics were designed using our platform and application
models discussed in Section IV.B.1 to also function in Internet environments. Design-
ing heuristics that assume process migration capabilities would make them no longer
applicable to Internet environments.)
We investigate the effect of local checkpointing on application makespan. Specif-
ically, we assume that the EXCL-PRED heuristic (which does no replication) is enabled
with local checkpointing capabilities, and we refer to this heuristic as EXCL-PRED-
CHKPT. We also enable the optimal scheduler with checkpointing abilities and refer to
the resulting algorithm as OPTIMAL-CHKPT.
We assume that each checkpointing heuristic checkpoints every two and a half
minutes, and the cost of checkpointing is 15 seconds. Also, we assume that the cost of
restarting a task after a checkpoint has occurred is 15 seconds. We tried a range of other
values for the frequency and cost of checkpointing, and restart costs, and found the same
trends.
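The effect of these parameters can be illustrated with a simplified single-host model (ours, not the dissertation's simulator): progress since the last checkpoint is lost when an availability interval ends, a 15 s restart cost is paid when resuming, and each 150 s of work is followed by a 15 s checkpoint. The function and the (available, length) interval representation are our assumptions.

```python
def time_with_checkpointing(task_len, intervals,
                            period=150.0, ckpt_cost=15.0, restart_cost=15.0):
    """Elapsed time to finish a task of length task_len (seconds) on one
    host with local checkpointing. `intervals` is a list of
    (available, length) pairs; returns None if the trace ends first.
    A simplified sketch, not the dissertation's simulator."""
    elapsed, saved = 0.0, 0.0
    for avail, length in intervals:
        if not avail:
            elapsed += length
            continue
        t, done = 0.0, saved
        if saved > 0:
            t += restart_cost              # resume from last checkpoint
        while True:
            run = min(period, task_len - done)
            if done + run >= task_len:
                if t + run <= length:      # final stride fits: task done
                    return elapsed + t + run
                break
            if t + run + ckpt_cost > length:
                break                      # interval ends; progress since
                                           # the last checkpoint is lost
            t += run + ckpt_cost           # run a stride, then checkpoint
            done += run
            saved = done
        elapsed += length
    return None                            # trace ended before completion

# A 5-minute task on an always-available host pays one checkpoint: 315 s
t = time_with_checkpointing(300.0, [(True, 1000.0)])
```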
Figure VI.14: Performance of checkpointing heuristics on SDSC grid. (Mean
makespan in seconds versus task size in minutes on a dedicated 1.5GHz host, for
EXCL-PRED-DUP, EXCL-PRED-CHKPT, OPTIMAL, and OPTIMAL-CHKPT.)
Figure VI.14 shows the mean makespan for applications with 100 tasks of sizes
ranging from 15 to 300 minutes executed on the SDSC platform. (We also executed
applications of other sizes but found that most could not complete during business hours.)
In addition to plotting the performance of the checkpoint-enabled heuristics EXCL-
PRED-CHKPT and OPTIMAL-CHKPT, we plot the performance of EXCL-PRED-
DUP and OPTIMAL (which is the performance resulting from the optimal schedule)
for comparison. Note that the optimal schedule for a platform without checkpointing
capabilities (determined by OPTIMAL) can be different from the optimal schedule for
a platform where checkpointing is enabled (determined by OPTIMAL-CHKPT). For
example, for an extremely long task, the OPTIMAL algorithm may not be able to
complete a task at all whereas the OPTIMAL-CHKPT will be able to use a series of
availability intervals since little (if any) progress in task execution is lost when the host
fails.
We find that EXCL-PRED-CHKPT performs at least five times worse than
EXCL-PRED-DUP. OPTIMAL-CHKPT performs slightly worse than OPTIMAL for
task sizes ranging from about 15 to 225 minutes; for task sizes larger than 225 minutes,
OPTIMAL-CHKPT outperforms OPTIMAL slightly.
The poor performance of EXCL-PRED-CHKPT is due to the fact that a task
is not reassigned when it is assigned to a slow host or when the host becomes unavailable
for task execution. When the host becomes unavailable for task execution, it is typically
unavailable for long periods of time relative to the execution time of the application.
In particular, the mean lengths of unavailability intervals for the SDSC, LRI,
DEUG, and UCB platforms are 75, 225, 21, and 7 minutes, respectively. As a
result, task execution is delayed by the amount of time required before the host
becomes available
again for execution; for applications that require rapid turnaround, this is detrimen-
tal. OPTIMAL-CHKPT performs nearly as well as OPTIMAL because the omniscient
scheduler will avoid periods of exec unavailability, but it performs slightly worse for tasks
less than 225 minutes in length because of the overheads involved when checkpointing.
For task sizes greater than 225 minutes, OPTIMAL-CHKPT outperforms OPTIMAL
(without checkpointing enabled) as the costs of restarting a task from scratch due to exec
unavailability become higher than the overheads of using checkpointing. So while local
checkpointing is possible, we find that its benefits are limited for short-lived applications
given the relatively long lengths of unavailability intervals found in many real desktop
grid environments.
VI.H Summary
We studied a variety of approaches for improving performance by means of
replication. We used proactive, reactive, and hybrid approaches, and for each approach,
we examined the issues of which task to replicate, which host to replicate to, and how
much to replicate (see Table VI.5).
Our conclusion is that a reactive replication strategy that uses timeouts when
the execution time of a task goes past the predicted makespan is surprisingly superior to
more aggressive replication heuristics or heuristics that use dynamic historical informa-
tion to predict task completion rates. This conclusion can be explained by the fact that a
large portion of the hosts in each platform are stable, and that clock rates correlate
Heuristic             Which host    Which task                How many replicas
PRI-CR-DUP,           clock rate    clock rate                x1
EXCL-S.5-DUP,
EXCL-PRED-DUP
EXCL-PRED-DUP-2       clock rate    clock rate                x2
EXCL-PRED-DUP-4       clock rate    clock rate                x4
EXCL-PRED-DUP-8       clock rate    clock rate                x8
EXCL-PRED-TO          clock rate    on timeout via            x1
                                    predicted makespan
EXCL-PRED-TO-SPD      clock rate    on timeout according      x1
                                    to clock rate
REP-PROB              P (Ci ≤ D)    P (Ci ≤ D)                until above 80%

Table VI.5: Summary of replication heuristics.
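REP-PROB's "until above 80%" stopping rule can be sketched as follows, under the independence assumption that the text notes is only approximate (so the computed probability is a lower bound when host availabilities are correlated). Function and variable names are ours:

```python
def replicate_until(prob_complete, hosts, threshold=0.80):
    """Sketch of REP-PROB's stopping rule: add replicas, in decreasing
    order of predicted completion probability, until the chance that at
    least one instance finishes by the deadline exceeds the threshold.
    `prob_complete[h]` is the predicted P(Ci <= D) on host h. Assumes
    independent failures, which the text notes is only approximate."""
    chosen, p_all_fail = [], 1.0
    for h in sorted(hosts, key=lambda h: prob_complete[h], reverse=True):
        chosen.append(h)
        p_all_fail *= 1.0 - prob_complete[h]
        if 1.0 - p_all_fail >= threshold:
            break
    return chosen

# Two replicas suffice here: 1 - (1-0.75)(1-0.5) = 0.875 >= 0.80
replicas = replicate_until({'a': 0.5, 'b': 0.75, 'c': 0.1}, ['a', 'b', 'c'])
```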
strongly with task completion rates. Combining this with the fact that there are usually
many hosts relative to the number of tasks near the end of application execution (as
shown in Figure V.4), EXCL-PRED-TO demonstrates the best performance in terms of
reducing makespan and waste.
Figure VI.15 shows the mean makespan of EXCL-PRED-TO and REP-PROB
for the SDSC grid in addition to the best performing heuristics we examined in previous
chapters. We find that for the SDSC grid the fourth quartile of EXCL-PRED-TO is
on average 2.25 times shorter than the fourth quartile of EXCL-PRED, and EXCL-
PRED-TO performs better than EXCL-PRED by a factor of 1.49 on average. Compared
to the optimal schedule, EXCL-PRED-TO performs within a factor of 1.7 on the SDSC,
DEUG, and LRI platforms, and within a factor of 2.6 on the UCB platform. In addition
to achieving the best or close to best performance, EXCL-PRED-TO almost always re-
sults in the least (or close to the least) waste of all the replication heuristics (achieving
a mean waste of 6%, 33%, 9%, 71% on the SDSC, DEUG, LRI, and UCB platforms, re-
spectively); the large performance benefits of using reactive replication are often achieved
with little waste.
Figure VI.15: Length of task completion quartiles on SDSC Grid. 0→OPTIMAL.
1→FCFS. 2→PRI-CR. 3→EXCL-S.5. 4→EXCL-PRED. 5→EXCL-PRED-TO. 6→REP-PROB.
(Panels show the duration in seconds of the 1st through 4th task completion
quartiles for applications of 100, 200, and 400 tasks with task sizes of 5, 15,
and 35 minutes.)
Chapter VII
Scheduler Prototype
In this chapter, we describe our implementation of the best performing heuristic
EXCL-PRED-TO, and show that the scheduling model we used is feasible in a real
system. Our implementation of EXCL-PRED-TO is integrated with the open source
XtremWeb desktop grid software.
VII.A Overview of the XtremWeb Scheduling System
The architecture of the XtremWeb system matches the general architecture
of desktop grid systems described in Section II.B. We describe in further detail here
the components of the XtremWeb system that reside at the Application and Resource
Management Level since we modify these components for our scheduler.
After an application is submitted, the application manager periodically selects
a subset of tasks from a task pool and distributes them to a scheduler (or set of sched-
ulers). The scheduler is then responsible for the completion of tasks. Workers make a
request for work to the scheduler typically using Java RMI, although other methods of
communication through SSL and TCP-UDP are supported. The default scheduler in
XtremWeb schedules tasks to hosts in a FCFS fashion, i.e., schedules tasks to hosts in
the order in which they arrived. Upon completion, the worker will return the result to
the scheduler, which stores the result on the server’s disk and records the task completion
in the results database.
VII.B EXCL-PRED-TO Heuristic Design and Implementation
We replace the FCFS scheduler in XtremWeb with our EXCL-PRED-TO scheduler. This
involves a number of changes to the XtremWeb system, which we describe
below.
VII.B.1 Task Priority Queue
One potential hazard of replication is that the replicas could delay original task
instances from being executed. For example, suppose instances of a particular task are
replicated a high number of times and then placed in a work queue. Then suppose task
instances of a different task are placed in the work queue after instances of the first task.
Since task instances are assigned in the order that they are placed in the work queue,
the second task could “starve” as the workers are kept busy executing replicas of the
first task.
To reduce the chance of task starvation, our scheduler uses a two-level work
queue; we refer to the higher level queue as the primary queue and the lower level queue
as the secondary queue. When an application is submitted by the client, an instance of
each task is placed in the primary queue. For the EXCL-PRED-TO heuristic, a timeout
is associated with each original task instance when it is scheduled on a host. When
this time out expires, a task replica is placed in the secondary queue. When doing task
assignment, the scheduler will first schedule tasks in the primary queue before those in
the secondary queue in an effort to ensure that at least one instance of each task will
always be scheduled before any replicas.
To keep the number of replicas from growing too rapidly, only original task
instances are allowed to time out. Also, when the original task instance fails, a new
corresponding instance is placed in the primary queue; however, if a replica fails, nothing
more is done.
The task instance priority queues are implemented as fixed-sized lists within
the XtremWeb scheduler, and the lists act as a buffer between the database of tasks
and workers requesting task instances. Periodically, the primary priority queue is filled
with original task instances instantiated from tasks in the database. A task state thread
periodically checks the state of all the task instances in each priority queue. Given a
particular state, the task state thread causes the appropriate action to be taken. For
example, to implement the timeout mechanism, we set a timeout for each task instance
in the primary work queue. Periodically, the thread checks the state of each task and
if a task has timed out, it places a replica in the secondary queue.
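A minimal sketch of this two-level queue follows. Class and method names are ours, and the real implementation works against the XtremWeb task database with a separate task state thread rather than in-memory lists; the sketch only captures the ordering and timeout rules described above.

```python
class TwoLevelQueue:
    """Sketch of the scheduler's two-level work queue (names are ours,
    not XtremWeb's). Originals live in the primary queue; replicas
    created on timeout go to the secondary queue, so every original is
    handed out before any replica."""

    def __init__(self):
        self.primary, self.secondary = [], []
        self.running = {}  # task_id -> timeout deadline for the original

    def submit(self, task_id):
        self.primary.append(task_id)

    def next_task(self, now, timeout):
        # Expire timed-out originals into the secondary queue; only
        # originals are allowed to time out, so each times out once.
        for tid, deadline in list(self.running.items()):
            if now >= deadline:
                self.secondary.append(tid)
                del self.running[tid]
        if self.primary:
            tid = self.primary.pop(0)
            self.running[tid] = now + timeout
            return tid, 'original'
        if self.secondary:
            return self.secondary.pop(0), 'replica'
        return None

    def original_failed(self, task_id):
        # A failed original is resubmitted; failed replicas are dropped.
        self.running.pop(task_id, None)
        self.primary.append(task_id)
```

Under this scheme a starved task is impossible: replicas are only handed out when the primary queue is empty.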
VII.B.2 Makespan Predictor
As EXCL-PRED-TO depends on a makespan prediction, we use the formula
described in Section V.B.2 to predict the application’s makespan. This requires having a
predicted aggregate operations completed per second. This rate could be determined by
submitting either real or measurement tasks (consisting of some number of operations
per task) to all the workers for a short duration. For instance, XtremWeb records the
start and completion time each task instance. Given the number of operations per task,
a separate thread of the scheduler could be responsible for computing a daily average
over all hosts. Then, counting the number of tasks to be completed per application, the
thread could make a rough estimate of application completion and the main scheduling
thread could then use this estimate for the EXCL-PRED-TO heuristic. We found that
the aggregate operations per second remains relatively constant throughout time (see
Section V.B.2), and so this estimate could be accurate for several days.
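The estimate described above amounts to dividing total remaining work by the measured aggregate rate of the platform. The sketch below is one plausible reading of that computation; the function names and the per-host averaging rule are our assumptions, not the exact formula of Section V.B.2.

```python
def aggregate_rate(records):
    """Daily aggregate operations per second of the platform. `records`
    holds (host, ops, start, finish) for completed instances, as could
    be derived from XtremWeb's per-instance start/completion logs.
    One plausible reading: average each host's observed rate over the
    day, then sum across hosts."""
    by_host = {}
    for host, ops, start, finish in records:
        by_host.setdefault(host, []).append(ops / (finish - start))
    return sum(sum(rs) / len(rs) for rs in by_host.values())

def predict_makespan(ops_per_task, n_tasks, agg_ops_per_sec):
    """Rough makespan estimate: total work over aggregate rate."""
    return (ops_per_task * n_tasks) / agg_ops_per_sec

# e.g., two hosts averaging 10 and 30 ops/s give a 40 ops/s platform
rate = aggregate_rate([('a', 100, 0, 10), ('a', 200, 10, 30),
                       ('b', 300, 0, 10)])
```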
Chapter VIII
Conclusion
VIII.A Summary of Contributions
Desktop grids are an attractive platform for executing large computations be-
cause they offer a high return on investment by using the idle capacity of an existing
computing infrastructure. Projects from a wide range of scientific domains have utilized
TeraFlops of computing power offered by hundreds of thousands of mostly desktop PCs.
The applications in most of these projects are high-throughput, task parallel, and com-
pute bound. In this dissertation, we studied how to schedule an application that requires
rapid turnaround in an effort to broaden the types of applications executable in desktop
grid environments. To this end, we made the following contributions:
Measurement and characterization of real enterprise desktop grids. We char-
acterized four real desktop grid platforms using an accurate measurement technique that
captured performance exactly as what would be perceived by a real application. Using
this measurement data, we characterized the temporal structure of availability of each
platform as a whole and of individual hosts. Both the measurement data and character-
ization could be used to drive simulations or serve as the basis for forming predictive,
explanatory, or generative models. With respect to modelling, we found a number of
pertinent statistics. For instance, we found that task failure rate is correlated with task
length and that availability is not correlated with host clock rates.
Resource prioritization and exclusion heuristics. We used this characterization
to develop novel and effective resource prioritization and resource exclusion heuristics for
scheduling short-lived applications. We found that using static clock rate information
to prioritize hosts can often improve performance; however, the performance of using
prioritization alone depends on the number of tasks in the application relative to the
number of hosts, and whether tasks are assigned to “poor” hosts. We then adapted our
prioritization heuristic to exclude “poor” hosts from the application execution. We found
that using a fixed threshold to filter hosts was beneficial but application performance was
dependent on the distribution of the clock rates. To lessen this dependence, we developed
a heuristic that used a predicted makespan to eliminate hosts from application execution.
While this heuristic was less sensitive to the clock rate distribution, it was less aggressive
in exclusion, and for smaller applications it performed slightly worse. The benefit of
using the predicted makespan to eliminate hosts became more obvious after we combined
the heuristic with task replication.
Task replication heuristics. We studied the use of proactive, reactive, and hybrid
replication techniques by combining task replication with our best resource exclusion and
prioritization heuristics. We found that a heuristic that uses a makespan predictor with
reactive replication by means of timeouts is the most effective in practice; the makespan
predictor is essential for eliminating “poor” hosts and also for setting the timeouts of
each task such that waste is relatively low. The reason timeouts are so effective is that
platforms often have a large portion of relatively stable hosts. Because volatility is
negatively correlated with clock rates and our best replication heuristic prioritizes tasks
and hosts according to clock rates, the probability of failure is reduced dramatically after
replicating a task once. Surprisingly, this heuristic often achieves similar performance
with relatively less waste compared to other heuristics that replicate more aggressively
and/or use more dynamic information about the resources. Our best heuristic often
performs within a factor of 1.7 of optimal.
Scheduler prototype. We show the feasibility of our heuristic by implementing a
scheduler prototype in a real desktop grid system. This heuristic was incorporated into
the real open-source desktop grid system XtremWeb. We believe that the scheduler will
improve the performance of short-lived applications.
VIII.B Future Work
There are a number of ways in which this work can be extended in terms of
measurement and characterization, and the types of applications studied:
Characterization of Internet desktop grids. We designed our heuristics so that
they would be applicable and effective in Internet environments. However, because we did
not have traces of Internet desktop grids, we could not prove the heuristics’ effectiveness
in Internet environments. The collection of Internet desktop grid trace data is currently
being conducted by the Recovery Oriented Computing group at U.C. Berkeley [17].
Given that data, we would be able to evaluate our heuristics on Internet desktop grids.
Characterization of memory and network connectivity. Clearly, applications
use other resources in addition to the CPU. As an extension to our characterization of
exec availability, it would be useful to characterize other resource usage data, such as
memory allocation or network traffic. This would improve the accuracy of our platform
model.
Scheduling applications with dependencies. Another interesting class of appli-
cations is the class with dependencies among tasks. It would be interesting to use our
characterization data to study the costs and benefits of running applications with task de-
pendencies. The fact that hosts in some desktop grid environments appear independent
could simplify performance modelling of such applications. We believe our probabilistic
model of task completion described in Chapter VI will aid in the analysis of scheduling
applications with task dependencies.
Scheduling multiple applications on the same desktop grid. A desktop grid
application does not always have the luxury of using the entire platform exclusively. It
would be useful to investigate the scenario where multiple applications are competing for
the same set of resources. Given that the costs of using desktop resources for users that
submit applications can be quite low, applications are often very large; at the same time
there may be users that require rapid application turnaround. How to balance system
throughput and response time while promoting fairness among users is an interesting
research direction. Also, the performance of EXCL-PRED-TO was dependent on the
existence of stable hosts in the platform, and so if, in the scenario of
multiple applications, those hosts are being used by other applications, then
the REP-PROB heuristic may in fact prove beneficial compared to EXCL-PRED-TO.
Toward these ends, we believe that the work in this thesis will be a helpful
stepping stone for future desktop grid research.
Appendix A
Defining the IQR Factor
A.A IQR Sensitivity
Figure A.1: Cause of Laggers (IQR factor of .5) on SDSC Grid. 1→FCFS. 2→PRI-CR.
3→EXCL-S.5. 4→EXCL-PRED. 5→EXCL-PRED-TO. 6→REP-PROB. (Panels show the number
of laggers, by cause (slow host or failed task), for 100, 200, and 400 tasks per
application and task sizes of 5, 15, and 35 minutes.)
Figure A.2: Cause of Laggers (IQR factor of 1.5) on SDSC Grid. 1→FCFS.
2→PRI-CR. 3→EXCL-S.5. 4→EXCL-PRED. 5→EXCL-PRED-TO. 6→REP-PROB.
Appendix B
Additional Resource Selection
and Exclusion Results and
Discussion
Figure B.1: Performance of resource selection heuristics on the DEUG grid.
(Average makespan relative to optimal versus task length in minutes on a
dedicated 1.5GHz host, grouped by 100, 200, and 400 tasks per application, for
FCFS, CR, EXCL-S.5, and EXCL-PRED.)
On the UCB platform, the host clock rates are all identical. So EXCL-S.5 will
exclude all the hosts, and no results corresponding to EXCL-S.5 are shown. EXCL-PRED
and CR perform worse than FCFS mainly because they prioritize resources according to
Figure B.2: Performance of resource selection heuristics on the LRI grid.
(Average makespan relative to optimal versus task length, grouped by 100, 200,
and 400 tasks per application, for FCFS, CR, EXCL-S.5, and EXCL-PRED.)
Figure B.3: Performance of resource selection heuristics on the UCB grid.
(Average makespan relative to optimal versus task length, grouped by 100, 200,
and 400 tasks per application, for FCFS, CR, EXCL-S.5, and EXCL-PRED.)
the clock rates, whereas FCFS prioritizes resources according to their time of arrival in the queue. FCFS will therefore tend to assign tasks to the resources that have been available the longest (which tend to have a higher probability of task completion), whereas PRI-CR assigns tasks to hosts essentially at random with respect to availability. While FCFS outperforms PRI-CR in this particular scenario, we find that on all the other platforms PRI-CR outperforms or performs as well as FCFS.
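The operative difference between the two policies is just the key used to rank idle hosts. The sketch below is illustrative only; the `Host` fields and the host list are assumptions, not the simulator's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    clock_ghz: float    # host clock rate
    avail_since: float  # time at which the host entered the idle queue

def fcfs_pick(idle_hosts):
    # FCFS: give the task to the host that has been available the longest,
    # i.e. the one that entered the idle queue earliest. Long-available
    # hosts tend to be in long idle stretches, so the task has a higher
    # probability of completing there.
    return min(idle_hosts, key=lambda h: h.avail_since)

def pri_cr_pick(idle_hosts):
    # PRI-CR: give the task to the host with the highest clock rate. When
    # all clock rates are identical (as on UCB), every host ties and the
    # choice ignores availability history entirely.
    return max(idle_hosts, key=lambda h: h.clock_ghz)

hosts = [Host("a", 1.5, avail_since=10.0),
         Host("b", 1.5, avail_since=3.0)]
print(fcfs_pick(hosts).name)    # prints "b": idle since t=3.0, available longest
print(pri_cr_pick(hosts).name)  # prints "a": clock-rate tie, max() keeps the first
```

With identical clock rates, PRI-CR's ranking carries no information, which is exactly why FCFS's availability-based ordering wins on UCB.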
Appendix C

Additional Task Replication Results and Discussion
C.A Proactive Replication
Figure C.1: Performance of proactive replication heuristics on DEUG grid.
Similar to the replication heuristics described in the previous section, EXCL-PRED-DUP-TIME and EXCL-PRED-DUP-TIME-SPD are wasteful in their use of resources (see Figure C.4). The reason is that tasks are always replicated, regardless of
Figure C.2: Performance of proactive replication heuristics on LRI grid.
Figure C.3: Performance of proactive replication heuristics on UCB grid.
the speed or reliability of the host to which the original task instance was first assigned.
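The waste can be seen in a toy comparison between unconditional replication and a variant that checks the original placement first. The field names and thresholds below are illustrative assumptions, not the dissertation's actual heuristic definitions.

```python
def replicate_always(task):
    # Unconditional proactive replication in the style criticized above:
    # every task gets a duplicate, whatever its original host looks like.
    return True

def replicate_if_risky(task, fast_ghz=1.5, reliable_p=0.9):
    # Hypothetical conditional variant: duplicate only when the original
    # host is slow or has a low predicted probability of task completion.
    host = task["host"]
    return host["clock_ghz"] < fast_ghz or host["p_complete"] < reliable_p

tasks = [
    {"host": {"clock_ghz": 2.0, "p_complete": 0.95}},  # fast and reliable
    {"host": {"clock_ghz": 0.8, "p_complete": 0.95}},  # slow
    {"host": {"clock_ghz": 2.0, "p_complete": 0.50}},  # unreliable
]

waste_always = sum(replicate_always(t) for t in tasks) / len(tasks)
waste_risky = sum(replicate_if_risky(t) for t in tasks) / len(tasks)
print(waste_always)  # 1.0: every task replicated
print(waste_risky)   # about 0.67: the safe placement is spared a duplicate
```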
Figure C.4: Waste of proactive replication heuristics with EXCL-PRED-DUP-TIME and EXCL-PRED-DUP-TIME-SPD.
Waste for heuristic EXCL-S.5-DUP is not shown in Figure C.7 because the
hosts in UCB all had the same clock rates and so the heuristic excluded all hosts.
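One plausible reading of this degenerate case (an assumption for illustration; the precise definition of EXCL-S.5 is given in the main text) is that the heuristic drops every host whose clock rate falls at or below the platform's median rate:

```python
import statistics

def excl_s5(clock_rates_ghz):
    # Assumed EXCL-S.5-style rule: exclude any host whose clock rate is at
    # or below the median rate. On a platform where every clock rate is
    # identical, the median equals every host's rate, so no host survives.
    cutoff = statistics.median(clock_rates_ghz)
    return [g for g in clock_rates_ghz if g > cutoff]

print(excl_s5([1.0, 1.5, 2.0, 3.0]))  # prints [2.0, 3.0]: faster half kept
print(excl_s5([1.5, 1.5, 1.5, 1.5]))  # prints []: homogeneous rates, all excluded
```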
C.B Reactive Replication
C.C Hybrid Replication
Figure C.5: Waste of proactive replication heuristics on DEUG grid.
Figure C.6: Waste of proactive replication heuristics on LRI grid.
Figure C.7: Waste of proactive replication heuristics on UCB grid.
Figure C.8: Performance of proactive replication heuristics when varying replication level on SDSC grid.
Figure C.9: Performance of reactive replication heuristics on DEUG grid.
Figure C.10: Performance of reactive replication heuristics on LRI grid.
Figure C.11: Performance of reactive replication heuristics on UCB grid.
Figure C.12: Waste of reactive replication heuristics on DEUG grid.
Figure C.13: Waste of reactive replication heuristics on LRI grid.
Figure C.14: Waste of reactive replication heuristics on UCB grid.
Figure C.15: Performance of hybrid replication heuristic on DEUG grid.
Figure C.16: Performance of hybrid replication heuristic on LRI grid.
Figure C.17: Performance of hybrid replication heuristic on UCB grid.
Figure C.18: Waste of hybrid replication heuristic on DEUG grid.
Figure C.19: Waste of hybrid replication heuristic on LRI grid.
Figure C.20: Waste of hybrid replication heuristic on UCB grid.
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100             548     +0.18       +0.31       +0.37         119          530
200             816     +0.22       +0.27       +0.41         201          802
400            1355     +0.23       +0.31       +0.42         309         1321
(a) 5 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            1597     +0.24       +0.28       +0.33         375         1495
200            2341     +0.20       +0.29       +0.39         571         2308
400            3902     +0.21       +0.29       +0.41         875         3924
(b) 15 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            3917     +0.20       +0.28       +0.33         902         3853
200            5875     +0.18       +0.26       +0.36        1286         6126
400            9725     +0.17       +0.23       +0.27        1731         9710
(c) 35 min. tasks
Table C.1: Makespan statistics for the DEUG platform. Lower confidence intervals are
w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds.
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100             694     +0.21       +0.21       +0.21         131          643
200             959     +0.17       +0.17       +0.17         105          969
400            1699     +0.18       +0.32       +0.32         285         1587
(a) 5 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            1875     +0.23       +0.23       +0.23         350         1652
200            2692     +0.14       +0.14       +0.14         276         2666
400            4740     +0.20       +0.31       +0.31         787         4589
(b) 15 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            4289     +0.22       +0.22       +0.22         843         3758
200            6218     +0.12       +0.12       +0.12         577         6080
400           10806     +0.19       +0.31       +0.31        1830        10579
(c) 35 min. tasks
Table C.2: Makespan statistics for the LRI platform. Lower confidence intervals are
w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds.
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100             773     +0.10       +0.25       +0.39         140          720
200            1234     +0.06       +0.12       +0.29         223         1273
400         1707.27     +0.02       +0.14       +0.22         172         1657
(a) 5 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            2520     +0.16       +0.27       +0.34         514         2455
200            3108     +0.09       +0.21       +0.31         429         2853
400            4919     +0.04       +0.10       +0.18         448         4736
(b) 15 min. tasks
Makespan statistics
Task number    Mean    80% c.i.    90% c.i.    95% c.i.    std. dev.    median
100            6085     +0.16       +0.29       +0.37        1327         6001
200            7989     +0.13       +0.20       +0.27        1272         7816
400           12054     +0.08       +0.14       +0.19        1278        11782
(c) 35 min. tasks
Table C.3: Makespan statistics for the UCB platform. Lower confidence intervals are
w.r.t. the mean. The mean, standard deviation, and median are all in units of seconds.
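The per-row statistics in Tables C.1 through C.3 can be reproduced from raw makespan samples along the following lines. This is a sketch under a normal approximation with assumed z-scores; the dissertation's exact interval estimator may differ, and the sample makespans are made up for illustration.

```python
import statistics
from math import sqrt

# Two-sided normal z-scores for the three confidence levels in the tables.
Z = {0.80: 1.282, 0.90: 1.645, 0.95: 1.960}

def makespan_stats(samples):
    # Returns mean, std. dev., and median in seconds, plus confidence-
    # interval half-widths expressed as a fraction of the mean, matching
    # the "+0.xx" columns of the tables.
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    median = statistics.median(samples)
    n = len(samples)
    ci = {level: z * sd / sqrt(n) / mean for level, z in Z.items()}
    return mean, sd, median, ci

# Made-up makespan samples (seconds), for illustration only.
mean, sd, median, ci = makespan_stats([530, 548, 610, 470, 590, 560])
print(round(mean), round(median))      # sample mean and median in seconds
print(ci[0.80] < ci[0.90] < ci[0.95])  # wider intervals at higher confidence
```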