GRID RESOURCE AVAILABILITY PREDICTION-BASED
SCHEDULING AND TASK REPLICATION
BY
BRENT ROOD
BS, State University of New York at Binghamton, 2005
MS, State University of New York at Binghamton, 2007
DISSERTATION
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science
in the Graduate School of Binghamton University
State University of New York
2011
© Copyright by Brent Rood 2011
All Rights Reserved
Accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science
in the Graduate School of Binghamton University
State University of New York
2010
May 13, 2011
Dr. Michael J. Lewis, Chair and Faculty Advisor
Department of Computer Science, Binghamton University

Dr. Madhusudhan Govindaraju, Member
Department of Computer Science, Binghamton University

Dr. Kenneth Chiu, Member
Department of Computer Science, Binghamton University

Dr. Yu Chen, Outside Examiner
Department of Electrical and Computer Engineering, Binghamton University
Abstract
The frequent and volatile unavailability of volunteer-based Grid computing resources challenges
Grid schedulers to make effective job placements. The manner in which host resources become
unavailable will have different effects on different jobs, depending on their runtime and their ability
to be checkpointed or replicated. A multi-state availability model can help improve scheduling
performance by capturing the various ways a resource may be available or unavailable to the Grid.
This dissertation uses a multi-state model and analyzes a machine availability trace in terms of that
model. Several prediction techniques then forecast resource transitions into the model’s states.
This study analyzes the accuracy of proposed predictors, which outperform existing approaches.
Later chapters propose and study several classes of schedulers that utilize the predictions, and
a method for combining scheduling factors. Scheduling results characterize the inherent tradeoff
between job makespan and the number of evictions due to resource unavailability, and demonstrate
how prediction-based schedulers can navigate this tradeoff under various scenarios. Schedulers can
use prediction-based job replication techniques to replicate those jobs that are most likely to fail.
The proposed prediction-based replication strategies outperform others, as measured by improved
makespan and fewer redundant operations. Multi-state availability predictors can provide information
that allows distributed schedulers to be more efficient — as measured by a new efficiency metric —
than others that blindly replicate all jobs or some static percentage of jobs. PredSim, a simulation-
based framework, facilitates the study of these topics.
Dedication
I dedicate this dissertation to all the people in this world who have cared for me and taken me
into their hearts. To my mom. For watching me grow and helping me when I fell down. For giving
me this incredible feeling that she will always be there for me. I hope she knows what that means
to me, and that I’ll always be there for her.
To Marisa. Whose gentleness and smile helped me through the hardest of times. Her friendship
and understanding taught me what it really means to love. I will forever carry our moments with
me in my heart.
To Holly. For showing me the beauty of a sunset and for always lending an ear when the world
seemed so big and the road seemed so long. For teaching me how the love of a friend is the brightest
light in this darkness.
And when you have reached the mountain top, then you shall begin to climb.
And when the earth shall claim your limbs, then shall you truly dance.
- Khalil Gibran
Acknowledgements
It is no exaggeration to say that this work would not exist if it were not for the guidance
and insight of my advisor, Michael J. Lewis. I would like to acknowledge him for inspiring me to
reach and strive for novelty. For showing me the thrill of publication and the tempering beauty of
failure. For pushing me to better myself and, in a very profound way, changing the course of my life.
Contents
List of Tables ix
List of Figures x
1 Introduction 1
1.1 Characterizing Availability 2
1.2 Predicting Resource Unavailability 3
1.3 Prediction-based Scheduling 4
1.4 Prediction-based Replication 6
1.5 Load-adaptive Replication 8
1.6 Multi-functional Device Characterization and Scheduling 9
1.7 Simulation Tool 10
1.8 Summary 11

2 Related Work 12
2.1 Resource Characterization 12
2.1.1 MFD Characterization and Burstiness 13
2.2 Prediction 14
2.3 Scheduling 16
2.4 Replication 17

3 Availability Model and Analysis 19
3.1 Availability Model 19
3.1.1 Condor 19
3.1.2 Unavailability Types 20
3.1.3 Grid Application Diversity 21
3.2 Availability Analysis 22
3.2.1 Trace Methodology 22
3.2.2 Condor Pool Characteristics 23
3.2.3 Classification Implications 29

4 Multi-State Availability Prediction 32
4.1 Prediction Methodology 32
4.2 Predictor Accuracy Comparison 34
4.2.1 Prediction Method Analysis 35
4.2.2 Analysis of Related Work 37

5 Prediction-Based Scheduling 39
5.1 Scheduling Methodology 39
5.2 Reliability Performance Relationship 41
5.3 Prediction Based Scheduling Techniques 42
5.3.1 Scoring Technique Performance Comparison 43
5.3.2 Prediction Duration 46
5.4 Multi-State Prediction Based Scheduling 47
5.4.1 Multi-State Prediction Based Scheduling 47
5.4.2 Predictor Quality vs. Scheduling Result 48
5.4.3 Scheduling Performance Comparison 49

6 Prediction-Based Replication 52
6.1 Replication Experimental Setup 54
6.2 Static Replication Techniques 55
6.3 Prediction-based Replication 56
6.3.1 Replication Score Threshold 57
6.3.2 Load Adaptive Motivation 59
6.3.3 Load Adaptive Techniques 61

7 Load Adaptive Replication 64
7.1 Load Adaptive Experimental Setup 65
7.2 Prediction based Replication 65
7.3 Load Adaptive Replication 66
7.3.1 Replication Strategy List 69
7.3.2 Replication Index Function 70
7.3.3 SR Parameter Analysis 71
7.4 Sliding Replication Performance 72
7.4.1 Prediction-based Replication Comparison 72
7.4.2 SR Under Various Workloads 73

8 Multi-Functional Device Utilization 75
8.1 Multi-Functional Devices 75
8.2 Unconventional Computing Environment Challenges 77
8.3 Characterizing State Transitions 80
8.4 Trace Analysis 81
8.4.1 Overall State Behavior 82
8.4.2 Transitional Behavior 82
8.4.3 Durational Behavior 83
8.5 A Metric for Burstiness 85
8.5.1 Burstiness Test 85
8.5.2 Degree of Burstiness 86
8.5.3 Burstiness Characterization 86
8.6 Burstiness-Aware Scheduling Strategies 87
8.6.1 Experiment Setup 87
8.6.2 Empirical Evaluation 89

9 Simulator Tool 93
9.1 Simulation Background 93
9.1.1 Simulation Based Research 93
9.1.2 Distributed System Simulators 95
9.2 PredSim Capabilities and Tools 97
9.2.1 Capabilities 97
9.2.2 Tools 97
9.3 PredSim System Design 98
9.3.1 Input Components 99
9.3.2 User Specified Components 102
9.3.3 PredSim Components 108
9.4 PredSim Performance 109

10 Conclusions 112
10.1 Summary 112
10.2 Future Work 114

Bibliography 117
List of Tables
3.1 Machine Classification Information 29
3.2 Machine Classification Scheduling Results 31
9.1 Prediction Interface 103
9.2 Scheduling Interface 104
9.3 Replication Interface 105
9.4 Job Queuing Interface 107
List of Figures
1.1 Thesis flow chart 11
3.1 Multi-state Availability Model: Each resource resides in and transitions between four availability states, depending on the local use, reliability, and owner-controlled sharing policy of the machine. 20
3.2 Machine State over Time 23
3.3 Total Time Spent in Each Availability State 24
3.4 Number of Transitions versus Hour of the Day 24
3.5 Number of Transitions versus Day of the Week 25
3.6 State Duration versus Hour of the Day 26
3.7 State Duration versus Day of the Week 26
3.8 Machine Count versus State Duration 27
3.9 Machine Count versus Average Number of Transitions Per Day 28
3.10 Failure Rate vs. Job Duration 30
4.1 Transitional Predictor Accuracy 35
4.2 Weighting Technique Comparison 36
4.3 Prediction Accuracy by Duration 37
5.1 As schedulers consider performance more than reliability (by increasing Tradeoff Weight W), makespan decreases, but the number of evictions increases. 42
5.2 The resource scoring decision's effectiveness depends on the length of jobs. For longer jobs, schedulers should emphasize speed over predicted reliability. 44
5.3 Scoring decision effectiveness depends on job checkpointability. Schedulers that treat checkpointable jobs differently than non-checkpointable jobs (S7 and S9) suffer more evictions, but mostly for checkpointable jobs; therefore, makespan remains low. 45
5.4 Schedulers can improve results by considering predictions for intervals longer than the actual job lengths. This is especially true for job lengths between approximately 48 and 72 hours. 46
5.5 TRF and Ren MTTF show similar trends for both makespan and evictions. For any particular makespan, TRF causes fewer evictions, and for any number of evictions, TRF achieves improved makespan. 47
5.6 TRF causes fewer evictions than Ren (coupled with the PPS scheduler) for all Prediction Length Weights. TRF also produces a shorter average job makespan than all Ren predictors, except the one-day predictor (which produces 45% more evictions). 48
5.7 TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for jobs up to 100 hours long. 49
5.8 TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for loads up to 40,000 jobs. 50
6.1 Checkpointability Based Replication: Makespan (left) and extra operations (right) for a variety of replication strategies, two of which are based on checkpointability of jobs, across four different load levels. 56
6.2 Replication Score Threshold's effect on replication effectiveness - Low Load 58
6.3 Replication Score Threshold's effect on replication effectiveness - Medium High Load 60
6.4 Replication strategy performance across a variety of system loads 61
6.5 Load adaptive replication techniques 62
7.1 Replication Score Threshold's effect on replication effectiveness across various loads 67
7.2 Load adaptive replication with random replication probability 67
7.3 Sliding Replication's load adaptive approach 68
7.4 Sliding Replication list comparison 70
7.5 Sliding Replication function comparison 71
7.6 Sliding Replication index analysis 72
7.7 Static vs. adaptive replication performance comparison 73
7.8 Sliding Replication under the Grid5000 workload 74
7.9 Sliding Replication under the NorduGrid workload 74
8.1 MFD count per floor at a large pharmaceutical company 76
8.2 MFD count per floor at a division of a large computer company 77
8.3 Trace data from four MFDs in a large corporation. The devices are Available over 98% of the time, and requests for local use are bursty and intermittent. 78
8.4 MFD Availability Model: MFDs may be completely Available for Grid jobs (when they are idle), Partially Available for Grid jobs (when they are emailing, faxing, or scanning), Busy or unavailable (when they are printing), or Down. 80
8.5 MFD state transitions by day of week 83
8.6 MFD state transitions by hour of day 83
8.7 MFD state duration by day of week 83
8.8 MFD state duration by hour of day 84
8.9 Number of occurrences of each state duration 84
8.10 Burstiness versus Day of Week and Hour of Day 87
8.11 Comparison of MFD schedulers as Grid job length increases 89
8.12 Determination of Grid job lengths with 95% confidence intervals for given failure rates 91
8.13 Comparison of MFD schedulers as Grid job load increases 91
9.1 PredSim System Architecture 99
9.2 PredSim execution time versus the number of jobs executed 110
9.3 PredSim execution time versus the number of schedulers investigated 111
Chapter 1
Introduction
The functionality, composition, utilization, and size of large scale distributed systems continue to
evolve. The largest Grids and test beds under centralized administrative control—including Ter-
aGrid [74], EGEE [21], Open Science Grid (OSG) [26], and PlanetLab [49]—vary considerably in
terms of the number of sites and the extent of the resources contained by each site. Various peer-to-
peer (P2P) [6] and Public Resource Computing (PRC) [4] systems allow users to connect and donate
their individual machines, and to administer their systems themselves. Cloud-based systems, consisting of sets of distributed resources for use or rental by remote users, are also entering the mainstream. Amazon EC2 [77] is a prominent cloud-based solution in which users can rent computational hardware. Eucalyptus is a private cloud computing implementation designed to operate on computer clusters behind web-based portals [48]. Hybrids of these Grid models, cloud-based systems, and middleware will continue to emerge as well [1, 22]. Future Grids will contain dedicated high
performance clusters, individual less-powerful machines, and even a variety of alternative devices
such as PDAs, sensors, and instruments. They will run Grid middleware, such as Condor [42] and
Globus [24], that enables a wide range of policies for contributing resources to the Grid.
This continued increase in functional heterogeneity will make Grid scheduling—mapping jobs
onto available constituent Grid resources—even more challenging, partly because different resources
will exhibit varying unavailability characteristics. For example, laptops may typically be turned on
and off more frequently, and may join and leave the network more often. Individual workstation
and desktop owners may turn their machines off at night, or shut down and restart to install new
software more often than a server’s system administrator might. Even the same kinds of resources
will exhibit different availability characteristics. CS department research clusters may be readily
available to the Grid, whereas a small cluster from a physics department may not allow remote Grid
jobs to execute unless the cluster is otherwise idle. Site autonomy has long been recognized to be
an important Grid attribute [39]. In the same vein, even owners of individual resources will exhibit
different usage patterns and implement different policies for how available they make their machines.
If Grid schedulers do not account for these important differences, they will make poor mapping
decisions that will undermine their effectiveness. Schedulers that know when, why, and how the re-
sources become unavailable, can be much more effective, especially if this knowledge is coupled with
information about job characteristics. For example, long-running jobs that do not implement check-
pointing may require highly available host machines. Checkpointable jobs that require heavyweight
checkpoints may prefer resources that are reclaimed by users, rather than those that unexpectedly
become unavailable, thereby making an on-demand checkpoint possible. This is less important for
jobs that produce lightweight checkpoints that can be produced on shorter notice. Easily replicable
processes without side effects might perform nearly as well on the less powerful and more intermit-
tently available resources, leaving the more available machines for jobs that cannot deal as well with
machine unavailability (at least under moderate to high contention for Grid resources).
This thesis describes building Grid schedulers that use the unavailability and performance char-
acteristics of target resources when making scheduling decisions, a goal that requires progress in
several directions. The thesis is organized as follows. Chapter 3 describes a multi-state availabil-
ity model and examines a University of Notre Dame trace in terms of that model to establish the
volatility and to characterize the availability of the underlying resources. Chapter 4 examines several
prediction techniques to forecast the future states of a resource’s availability. Chapter 5 investigates
using different approaches to schedule jobs with different characteristics based on those predictions,
by attempting to choose resources that are least likely to become unavailable. Chapter 6 then explores the efficacy of using the availability predictors in a new way: deciding which jobs to replicate, with the goal of replicating those jobs most likely to experience resource unavailability. Chapter 7 investigates how replication techniques can adapt to changing system load, and shows how load can affect replication policy selection. Chapter 8
investigates predicting and scheduling in the unconventional environment of multi-functional devices
located in industry settings. Lastly, Chapter 9 describes the simulator tool that produced the results
in this work.
1.1 Characterizing Availability
To build Grid schedulers that consider the unavailability characteristics of target resources, resources
must be traced to uncover why and how they become unavailable. As described in Chapter 2, most
availability traces focus primarily on when resources become unavailable, and detail the patterns
based purely on availability and unavailability. This study identifies different types of unavailability,
and analyzes the patterns of that unavailability.
It is also beneficial to classify resources in a way that availability predictors can take advantage of,
and design and build such predictors. Knowing that Grid resources become unavailable differently
and even predicting when and how they do is important only to the extent that schedulers can
exploit this information. Therefore, schedulers must be built that use both availability predictions
(which are based on unavailability characteristics) and application characteristics.
Section 3.1 identifies four resource availability states, dividing unavailable resources into separate
categories based on why they are unavailable. This allows resources that have abruptly become
unreachable to be differentiated from resources whose user has reclaimed them. Events that trigger
resources to transition between availability states are described. Applications are then classified
by characteristics that could help schedulers, especially in terms of those machines’ unavailability
characteristics.
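The four-state idea can be sketched as a small state machine. The state and event names below are illustrative assumptions rather than the model's actual labels (Chapter 3 defines those), but they show how a scheduler can distinguish a machine reclaimed by its owner from one that abruptly became unreachable.

```python
from enum import Enum, auto

class AvailState(Enum):
    """Four illustrative availability states (names are assumptions)."""
    AVAILABLE = auto()      # idle and accepting Grid jobs
    USER_PRESENT = auto()   # owner has reclaimed the machine for local use
    CPU_BUSY = auto()       # local load exceeds the sharing policy's threshold
    UNREACHABLE = auto()    # machine abruptly left the network

# Hypothetical events that trigger transitions between availability states.
TRANSITIONS = {
    (AvailState.AVAILABLE, "keyboard_activity"): AvailState.USER_PRESENT,
    (AvailState.AVAILABLE, "load_above_threshold"): AvailState.CPU_BUSY,
    (AvailState.AVAILABLE, "heartbeat_lost"): AvailState.UNREACHABLE,
    (AvailState.USER_PRESENT, "idle_timeout"): AvailState.AVAILABLE,
    (AvailState.CPU_BUSY, "load_below_threshold"): AvailState.AVAILABLE,
    (AvailState.UNREACHABLE, "heartbeat_resumed"): AvailState.AVAILABLE,
}

def next_state(state, event):
    """Apply an observed event; unrecognized events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

In such a model, a transition into USER_PRESENT is a graceful eviction (a checkpoint can be requested), while a transition into UNREACHABLE loses any unsaved work.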
Section 3.2 describes traces that expose unavailability characteristics more explicitly than current
traces. The results presented identify the availability states that resources inhabit, as well as the
transitions that resources make between these states. Section 3.2 reports sample data from four
months of traces from the University of Notre Dame’s Condor pool.
To summarize, the resource characterization contributions include
• a new way of analyzing trace data, which better exposes resource unavailability characteristics,
• sample data from one such Condor-based trace, and
• a new classification approach that enables effective categorization of resources into availability
states.
These results make a strong case for studying how a Grid’s constituent resources become un-
available, for using that information to predict how individual resources may behave in the future,
and for building that information into future Grid schedulers, especially those that will operate in
increasingly eclectic and functionally heterogeneous Grids.
1.2 Predicting Resource Unavailability
An availability predictor, which forecasts the availability of candidate resources for the period of time
that an application is likely to run on them, can help schedulers make better application placement
decisions. Clearly, selecting resources that will remain available for the duration of application
runtimes would improve the performance of both individual applications and of the Grid as a whole.
Resource categorization work demonstrates the potential for using past history to organize resources
into categories that define their availability behavior, and then using schedulers that take advantage
of this resource categorization to make better scheduling decisions [59]. However, due to the low
granularity of the classes, the static nature of each machine's class, and the fact that the classes do
not recognize patterns in a machine’s availability, on-demand availability prediction proved to be an
improvement over the classification approach. Importantly, this work uses a “multi-state” availability
model, which captures not just when and whether a resource becomes unavailable, but how it does
so. This distinction allows schedulers to make decisions that improve overall performance, when
they are configured to also consider application characteristics. For example, if some particular
resource is expected to become unavailable because its owner reclaims the machine for local use
(as opposed to the machine becoming unreachable unexpectedly), then that machine may be an
attractive candidate for scheduling an application that can take an on-demand checkpoint.
Chapter 4 describes investigations on predicting the availability of resources. Several kinds
of predictors are described which use different criteria to predict resource availability. The best
performing predictor considers how resources transition between the proposed model’s unavailability
states, and outperforms other approaches in the literature by 4.6% in prediction accuracy. As shown
in later chapters, simulated scheduling results indicate that this accuracy difference significantly
decreases both makespan and operations lost due to eviction.
To summarize, the resource availability prediction contributions include
• introducing new multi-state availability prediction algorithms that effectively forecast future
resource availability behavior for the proposed multi-state availability model,
• evaluating the most effective configurations of the predictors in this set, and
• producing significant prediction accuracy increases across a variety of prediction lengths com-
pared to existing prediction approaches.
These results demonstrate the feasibility of predicting the future availabilities of Grid resources
which can in turn be used by schedulers to make better job placement decisions.
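As a concrete, simplified illustration of availability prediction, the empirical estimator below computes, from a resource's logged availability spells, the fraction that lasted at least as long as a requested duration. It is a hypothetical stand-in for the transition-based predictors developed in Chapter 4, not the dissertation's actual algorithm; the history format is an assumption.

```python
def p_available_for(history, duration):
    """Estimate the probability that a currently available resource remains
    available for `duration` more hours.

    `history` is a list of (state, spell_length_hours) pairs observed for the
    resource; this counting predictor simply asks how often past availability
    spells survived at least `duration` hours.
    """
    avail_spells = [length for state, length in history if state == "available"]
    if not avail_spells:
        return 0.0  # no evidence of availability; treat as unreliable
    survived = sum(1 for length in avail_spells if length >= duration)
    return survived / len(avail_spells)
```

A scheduler could then request a prediction for (at least) the expected job runtime and prefer resources with higher survival probability.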
1.3 Prediction-based Scheduling
Schedulers that effectively predict the availability of constituent resources, and use these predictions
to make scheduling decisions, can improve application performance. Ignoring resource availability
characteristics can lead to longer application makespans due to wasted operations [19]. Even more
directly, it can adversely affect application reliability by favoring faster but less available resources
that cannot complete jobs before becoming unavailable or being reclaimed from the Grid by their
owners. Unfortunately, performance and reliability vary inversely [19]; favoring one necessarily
undermines the other.
For this reason, Grid schedulers must consider both reliability and performance in making
scheduling decisions. This requirement is true for current Grids, and will become more impor-
tant as resource characteristics become increasingly diverse. Resource availability predictors can be
used to determine resource reliability and this in turn can be combined with measures of resource
performance when selecting resources. Since performance depends on competing resource load dur-
ing the lifetime of the application, schedulers must make use of load monitors and predictors [80],
which can be centralized, or can be distributed using various dissemination approaches [20].
In scheduling for performance and reliability, it is critical to investigate the effect on both ap-
plication makespan and the number of wasted operations due to application evictions, which can
occur when a resource executing an application becomes unavailable. Evictions and wasted op-
erations are important because of their direct “cost” within a Grid economy, or simply because
they essentially deny the use of the resource by another local or Grid application. The proposed
approach to Grid scheduling involves analyzing resource availability history and predicting future
resource (un)availability, monitoring and considering current load, storing static resource capability
information, and considering all of these factors when placing applications. For scheduling, the best
availability predictor (Chapter 4) is used to investigate the effects of weighing resource speed, load,
and reliability in a variety of ways, to decrease makespan and increase application reliability.
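The weighing of speed, load, and reliability can be sketched as a linear combination controlled by a tradeoff weight. This is a minimal illustration of the idea only; the actual scoring techniques evaluated in Chapter 5 differ, and the parameter names and normalization below are assumptions.

```python
def score(speed, load, p_reliable, w):
    """Combine performance and predicted reliability into one resource score.

    speed:      normalized machine speed in [0, 1]
    load:       competing load in [0, 1]
    p_reliable: predicted probability of staying available for the job
    w:          tradeoff weight in [0, 1]; higher values favor performance
    """
    performance = speed * (1.0 - load)
    return w * performance + (1.0 - w) * p_reliable

def pick_resource(candidates, w):
    """Choose the (speed, load, p_reliable) candidate with the best score."""
    return max(candidates, key=lambda c: score(*c, w))
```

Sweeping w from 0 to 1 traces out the makespan/eviction tradeoff: at w = 0 the scheduler picks the most reliable machine regardless of speed, and at w = 1 the fastest, least loaded machine regardless of predicted reliability.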
Chapter 5 also investigates different approaches to scheduling applications with varying characteristics. In particular, results are presented for scheduling checkpointable jobs to consider speed and
load more heavily than reliability. This is done because their eviction involves fewer wasted oper-
ations when compared with non-checkpointable jobs due to their ability to save state and resume
execution elsewhere.
The simulation-based performance study presented here begins by establishing the inherent per-
formance/reliability scheduling tradeoff for a real world environment. The performance of several
different schedulers that consider both reliability and performance is then characterized. Subsequent
sections develop and explore the idea of varying the requested prediction duration by demonstrating
its effect on application execution performance. The study focuses on scheduling for the two compet-
ing metrics of performance and reliability by characterizing the effects of (i) treating checkpointable
jobs differently from “non-checkpointable” jobs, and (ii) varying the checkpointability, length, and
number of jobs. These approaches are compared with the only other multi-state availability predic-
tor and scheduler, designed by Ren et al. [55]. Section 5.4.1 characterizes the relative performance
and configurability of the two competing approaches. Results show that Ren’s scheduler causes
up to 51% more evicted jobs while simultaneously increasing average job makespan by 18% when
compared with the best performing scheduler proposed here.
To summarize, the prediction-based scheduling contributions include
• establishing the performance/reliability scheduling tradeoff with real world traces,
• introducing several schedulers that consider both reliability and performance,
• varying prediction duration and showing its effect on job execution performance,
• characterizing the effect of scheduling checkpointable/non-checkpointable jobs differently, and
• evaluating the proposed schedulers against related scheduling approaches under a variety of job loads, lengths, and checkpointabilities.
The scheduling results presented here demonstrate the feasibility of using resource availability
predictions to aid schedulers in choosing reliable resources and to decrease average job makespan.
1.4 Prediction-based Replication
Chapter 6 investigates the efficacy of using the availability predictors in a completely new way,
namely to decide which jobs to replicate. The basis for this idea is that checkpointing and replication
are two of the primary tools for dealing with resource unavailability. The system cannot generally
add checkpointability; the application programmer must do that. So whereas schedulers can (and
should) take advantage of knowing which jobs are checkpointable (Chapter 5 explores this idea),
they cannot proactively add reliability by increasing the checkpointability of the job mix.
The system can, however, replicate some jobs in an attempt to deal with possible resource
unavailability. Replicating a job can benefit performance in two ways. First, jobs are scheduled
onto resources using imperfect ranking metrics that may or may not reflect how fast a job will run
on a machine. Therefore, when a job runs simultaneously on more than one resource, the application makespan is determined by the earliest completion time among all replicas. Because completion time depends on unpredictable load and usage patterns, a replica can potentially improve the makespan
of the job. Second, replicated job executions also help deal with unavailability: when one resource becomes unavailable, the adverse effect on the performance of the jobs it runs is reduced if replicas complete without experiencing resource unavailability.
Replication does not come without a cost, however. Within a Grid economy, it is likely that
applications will need to pay for Grid use on a per job, per process, or per operation basis. Therefore,
replication can cost extra, assuming redundant jobs are counted separately (which seems likely).
Replication can also have an adverse indirect effect, as some of the highest ranked resources could
be used for executing replicas, leaving only “worse” (by whatever metric the scheduler uses to rank
jobs) resources for subsequent jobs.
For this reason, Chapter 6 tests the hypothesis that availability predictors can help select the
right jobs to replicate. This approach improves overall average job makespan, reduces the redundant
operations needed for the same improved makespan, or both. Chapter 6 explores several replication
strategies, including strategies that make copies of all jobs, some fixed percentage of jobs, long
jobs, jobs that are not checkpointable, and several combinations. These strategies are tested under
a variety of loads using two availability traces [29][59]. Chapter 6 explores the effectiveness of
replicating only the jobs that are mapped onto resources that fall below some reliability threshold.
That is, the predictor is asked to forecast the likelihood that a resource will complete a particular
job without becoming unavailable. If this predicted probability value is too low, the job is replicated.
If the resource is deemed reliable enough, the job is not replicated.
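A minimal sketch of this threshold test, under the assumption that the predictor exposes a per-placement completion probability (the `predict_completion` interface and the 0.8 threshold are hypothetical):

```python
def jobs_to_replicate(placements, predict_completion, threshold=0.8):
    """Return the placements whose predicted completion probability
    falls below the reliability threshold.

    placements: list of (job, resource) pairs already made by the scheduler.
    predict_completion(resource, job): assumed to return the predicted
        probability that `resource` stays available long enough to
        finish `job` without becoming unavailable.
    """
    return [(job, res) for job, res in placements
            if predict_completion(res, job) < threshold]

# Toy predictor for illustration: reliability stored on the resource.
predict = lambda res, job: res["p_avail"]
placements = [({"id": 1}, {"p_avail": 0.95}),
              ({"id": 2}, {"p_avail": 0.60})]
```

Here only the job placed on the 0.60-reliability resource would be replicated; the job on the reliable resource runs without a replica, saving redundant operations.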
To summarize, the prediction-based replication contributions include
• describing a new metric, replication efficiency, to analyze replication effectiveness,
• analyzing a replication strategy that uses availability predictions to determine when to replicate jobs, and investigating its effect on both makespan and replication efficiency,
• describing how system load affects replication effectiveness, and
• proposing three different static load adaptive replication techniques, which consider the system
load and the desired metric to decide which replication strategy to use.
Prediction-based replication results show that using resource availability predictions to repli-
cate those jobs that are most likely to experience unavailability can increase job performance and
efficiency.
1.5 Load-adaptive Replication
Choosing whether to replicate a job is further complicated by the current load on the system.
As system load increases, selective replication strategies that make fewer replicas produce larger
makespan improvements than more aggressive strategies [62]. The replication strategy described in
Section 6.3.3 consists of a static set of four replication techniques ranging from least selective to
most selective, allowing the current load on the system to determine which replication technique to
use. The four replication strategies are chosen statically based on performance data from a single
synthetic workload. This technique does not perform as well when faced with different real world
workloads. Chapter 7 proposes a replication strategy that adapts to changing system load and
improves on the previously proposed technique.
Section 7.3 introduces a new load adaptive strategy, Sliding Replication (SR). SR first determines
the current load on the system and uses this value as input to a function that outputs an index into
an ordered list of replication strategies. The selected replication strategy determines whether the
job should be replicated. As load changes, the replication strategy changes. This approach attempts
to dynamically discover the most appropriate replication technique given the current load. Real
world resource availability and workload traces are used to test SR’s performance versus previously
proposed load adaptive and non-adaptive replication techniques. The results show that SR produces
comparable average job makespan improvements while decreasing the number of wasted operations,
as measured by replication efficiency (see Section 6.3). SR improves efficiency by an average of
347% when compared with a non-load adaptive approach and by 70% when compared with a less
sophisticated load adaptive approach.
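The core mechanism of SR, mapping measured load onto an ordered strategy list, can be sketched as follows. The linear mapping and the strategy names are illustrative placeholders, not SR's actual function or parameters, which Section 7.3 develops in full:

```python
def select_strategy(system_load, strategies):
    """Map current system load in [0, 1] onto an ordered list of
    replication strategies (least to most selective).

    As load rises, a more selective strategy (fewer replicas) is chosen.
    """
    index = min(int(system_load * len(strategies)), len(strategies) - 1)
    return strategies[index]

# A hypothetical ordering, least to most selective:
STRATEGIES = ["replicate-all", "replicate-long-jobs",
              "replicate-unreliable-only", "replicate-none"]
```

Under light load the system can afford aggressive replication; as load climbs, the selected index slides toward strategies that create fewer replicas.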
To summarize, the load-adaptive replication contributions include
• showing how the changing system load of real-world job traces affects replication policy performance,
• proposing a load-adaptive replication approach, Sliding Replication, which measures system
load and selects the most appropriate replication policy based on the current system load,
• investigating how varying the parameters of Sliding Replication affects its performance, and
• testing SR’s performance against a related work replication approach in two real-world resource
trace environments and with two real-world job traces.
Load adaptive replication results demonstrate how varying system load affects replication policy
performance. Load-adaptive replication can adapt to changing load by selecting the replication
policy most appropriate for the current system load.
1.6 Multi-functional Device Characterization and Scheduling
The traces and workloads for the results summarized in Sections 1.1 through 1.5 are all derived from conventional, workstation-based environments. To explore the applicability of the proposed approaches elsewhere, this work also leverages unconventional computing environments, defined here as collections of devices that are not general-purpose workstations. In particular, the following facts are exploited:
1. Special purpose instruments and devices are increasingly being outfitted to include processors,
memory, and peripherals that make them capable of supporting high performance computing,
not just for their intended purpose.
2. Large corporations often deploy dozens, hundreds, and even over a thousand such devices
within a single floor, building, or campus, respectively. Moreover, these devices are connected
to one another with high speed networks.
3. The fast processors and large amounts of memory are often needed to handle peak loads, not sustained, long-running applications. Combined with the inherent characteristics of the local job request mix, this leaves the devices idle for a significant percentage of the time (95% to 98%).
These characteristics provide the potential to federate and harvest the unused cycles of special-purpose devices for compute jobs that require high-performance platforms. Chapter 8 investigates the
viability of using Multi-Function Devices (MFDs) as a Grid for running computationally intensive
tasks under the challenging constraints described above.
To summarize, the MFD characterization and scheduling contributions include
• developing a multi-state availability model and burstiness metric to capture the dynamic and
preemptive nature of jobs encountered on MFDs,
• gathering an extensive MFD availability trace from a representative environment and analyzing
it to determine the states in the proposed model,
• proposing an online and computationally efficient test for burstiness using the trace, and
• showing, through trace-based simulations, that MFDs can be used as computational resources
and demonstrating the scheduling improvements that can be achieved by considering resource
burstiness in making job placements.
This MFD characterization work demonstrates the viability of using idle MFDs as computation
resources. In particular, statistical confidence intervals are computed on the lengths of Grid jobs
that would suffer a selected preemption rate. This assists in selecting Grid jobs appropriate for the
bursty environment. On average across all loads studied, the proposed MFD scheduling technique comes within 7.6% of a pseudo-optimal scheduler's makespan improvement and produces an average makespan improvement of 18.3% over a random scheduler.
1.7 Simulation Tool
All the previous research topics require a platform on which to examine their behavior. Current
simulation-based approaches are adequate for some of the experiments that do not use multi-state
prediction, but no existing system could test multi-state prediction-based scheduling. PredSim is a new experimental platform capable of handling resources that exhibit different types of unavailability. PredSim consists of a variety of input, user-specified, and core components.
These components provide a framework for testing prediction, scheduling and replication approaches.
The key contributions of PredSim include
• providing functionality for multi-state based resource availability prediction, scheduling, task
replication and job execution management in a purpose-built environment,
• accepting and combining workload and resource availability traces in multiple formats and
supplying facilities for synthetically creating both jobs and resources,
• providing a “Predictor Accuracy” mode designed to isolate and determine a prediction algo-
rithm’s accuracy across a variety of circumstances, and
• providing a library of predictors, schedulers, replication policies, and job queuing strategies, with the flexibility to add new modules.
PredSim has been a helpful tool in researching fault tolerant distributed computing approaches.
Figure 1.1: Thesis flow chart
1.8 Summary
This thesis explores the idea of using resource reliability information to improve distributed system
performance. Figure 1.1 demonstrates the logical flow of this work. Resource traces provide availabil-
ity behavior characterization and demonstrate underlying resource volatility. This characterization
information motivates prediction approaches that forecast resource availability. Schedulers use the
prediction information, alongside performance information, to select resources for application execution. Resource availability predictions are then used to choose which jobs are most suitable for replication. These
replication approaches can incorporate current system load in determining which jobs to replicate.
The results presented in this thesis demonstrate that schedulers that judiciously use resource
reliability information provided by availability predictors can improve system performance by se-
lecting resources that are least likely to become unavailable for application execution, by matching
an application’s characteristics to the availability behavior of resources, and by efficiently choosing
those jobs that can most benefit from replication.
Chapter 2
Related Work
Related work resides in five broad categories, namely (i) resource tracing and characterization, (ii)
burstiness and MFD scheduling, (iii) prediction of resource state (availability and load), (iv) Grid
scheduling, especially approaches that consider reliability and availability, and (v) job replication.
This chapter is organized accordingly.
2.1 Resource Characterization
Condor [42] is the best-known cycle-stealing system. Its ability to utilize idle resources with minimal technical and administrative overhead has enabled its long-term popularity and success. Resource
management systems like Condor form the basis for the resource characterization work in Chapter 3.
Several papers describe traces of workstation pools. Acharya et al. [2] take three 14-day traces
at three university campuses, and analyze them for availability over time, and for the fraction of
time a cluster of k workstations is free. Bolosky et al. [9] trace more than 51,000 machines at
Microsoft, and report the uptime distribution, the available machine count over time, and temporal
correlations among uptime, CPU load per time of the week, and lifetime. Arpaci et al. [8] describe
job submission by time of day and length of job, report machine availability according to time of
day, and analyze the cost of process migration.
Other traces address host availability and CPU load in Grid environments. Kondo et al. [34][36]
analyze Grid traces from an application perspective, by running fixed length CPU bound tasks. The
authors analyze availability and unavailability durations during business and non-business hours,
in terms of time and number of operations [34]. They also show the cumulative distribution of
availability intervals versus time and number of operations [36]. The authors derive expected task
failure rates but state that the traces are unable to uncover the specific causes of unavailability
(e.g. user activity vs. host failure). The Condor traces used in this work borrow from this general
approach.
Ryu and Hollingsworth [64] describe fine grained cycle stealing (FGCS) by examining a Network
Of Workstations (NOW) trace [8] and a Condor trace for non-idle time and its causes (local CPU
load exceeding a threshold, local user presence, and the time the Grid must wait after a user has
left before utilizing the resource). Anderson and Fedak [5] analyze BOINC hosts for computational
power, disk space, network throughput, and number of hosts available over time (in terms of churn,
lifetime, and arrival history). They also briefly study on-fraction, connected-fraction, active-fraction,
and CPU efficiency. Finally, Chun and Vahdat [16] report PlanetLab all pairs ping data, detailing
resource availability distribution and mean time to failure (MTTF).
Unlike the work described above, the traces investigated in this work are designed to uncover the
causes of unavailability. The trace data is analyzed and used to classify machines into the proposed
availability states. Individual resources are analyzed for how they enter the various states over time
(by both day and hour). Less extensive traces are intended primarily to motivate the approach to
availability awareness and prediction in Grid scheduling proposed in this work.
2.1.1 MFD characterization and Burstiness
As mentioned above, Acharya et al. [2], Bolosky et al. [9], and Arpaci et al. [8] analyze the availability of workstations over time in various academic and industry settings. In these classes of resources, however, the dynamics of intermittent job arrivals (burstiness) and preemption are either absent or not explicitly analyzed; their analysis is unique to this thesis.
Kondo et al. [34][36], Ryu and Hollingsworth [64] and Anderson and Fedak [5] examine resource
behavior but state that the traces are unable to uncover the specific causes of unavailability (e.g.
user activity vs. host failure). Unlike the work described above, the traces gathered for this work
are designed not only to uncover the causes of unavailability, but also to examine them in the envi-
ronment of MFDs. Burstiness and preemptive job arrivals in workstation pools have received little
attention in the literature possibly because such resources are seldom suitable for distributed com-
puting, especially parallel computing. In the literature, the volatility of any type of computational
resource is mitigated by modeling the future availability of resources with predictors (and using such
predictors for inline scheduling, as described below). The approach proposed in Chapter 8 characterizes the magnitude of the disturbance that intermittent, preemptive job arrivals impose on Grid jobs, and normalizes it across a pool of resources.
2.2 Prediction
Prediction algorithms have been used in many aspects of computing, including the prediction of
future CPU load and resource availability. The most notable prediction system for Grids and
networked computers is Network Weather Service (NWS) [80]. NWS uses a large set of linear
models for host load prediction, and combines them in a mixture-of-experts approach that chooses
the best performing predictor. The RPS toolkit [18] uses a set of linear host load predictors. One
such common linear technique is called BM(p), or Sliding Window, which records the states occupied
in the last N intervals and averages those values.
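A minimal sketch of such a BM(p)/Sliding Window predictor; the class name and the use of a simple arithmetic mean over the window are illustrative:

```python
from collections import deque

class SlidingWindowPredictor:
    """BM(p)-style predictor: the forecast for the next interval is
    the mean of the values observed in the last p intervals."""

    def __init__(self, p):
        self.window = deque(maxlen=p)

    def observe(self, value):
        self.window.append(value)  # the oldest value drops out automatically

    def predict(self):
        return sum(self.window) / len(self.window)

predictor = SlidingWindowPredictor(p=3)
for load in (0.2, 0.4, 0.6, 0.8):
    predictor.observe(load)
```

After the fourth observation, only the last three values (0.4, 0.6, 0.8) remain in the window, so the prediction is their mean, 0.6.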
Computer and system hardware failure has been extensively studied in previous work [66][69][68].
Schroeder et al. analyze nine years of failure data from the Los Alamos National Laboratory, including
more than 23,000 machine failures. The authors study the root causes of failure and the mean
time to failure of the resources [68]. Sahoo et al. investigate system failures from a network of
400 heterogeneous servers for patterns and distribution information [66]. Gibson et al. analyze disk
replacement data from high-performance computing sites and Internet sites for mean-time-to-failure distributions [69].
It is not the purpose of this work to specifically target hardware failures. Instead, this approach
analyzes and predicts the reasons machines become unavailable to the Grid, including resource
failure, user presence and high local CPU load. Unavailability due to resource failure, while not
uncommon, is less frequent than other forms of unavailability. For this reason, the prediction
techniques proposed here forecast all types of unavailability, not just hardware failure unavailability.
Importantly, this work attempts to predict the future occurrence of individual instances of resource
unavailability rather than fitting failure data to probability distributions.
In availability prediction, Ren et al. [55] [54] [56] use empirical host CPU utilization and resource
contention traces to develop the only other multi-state model, prediction technique, and multi-state
prediction based scheduler for resource availability. Their multi-state availability model includes five
states, three of which are based on the CPU load level (which resides in one of three zones); the two
other states indicate memory thrashing and resource unavailability. This model neither captures
user presence on a machine, nor allows for differentiation between graceful and ungraceful process
eviction. For prediction, the authors count the transitions from available to each state during the
previous N days to produce a Markov chain for state transition predictions. These transition counts
determine the probability of transitioning to each state from the available state. The Transitional
N-Day Equal Weight (TDE) predictor (Chapter 4) uses a similar approach, but weighs each day’s
transitions separately by first computing the probabilities for each of the N days and then combining
them. This is in contrast to the Ren predictor, which sums all the transitions through the N days
and then computes the probability according to their algorithm. Experiments show that the TDE
method more effectively combines many days without allowing a single day’s behavior to dominate
the average. Also, Ren only counts the number of transitions to each state from available (i.e.
three counters), uses these counts to determine the probability of exiting available for each state,
then computes the probability of staying available as one minus the sum of the three exit state
probabilities. In contrast, the TDE approach counts the number of times a resource completes a
period of time equal to the prediction length during resource history analysis. This additional state
counter is summed with the non-available state counters and then each state count is divided by the
total count to produce the four probabilities. This technique allows the predictor to more accurately
determine how often a resource completes the time interval required by the application. To further
differentiate the approaches proposed in this work, the most successful scheduling predictor examines
the most recent N hours of behavior instead of the time interval in question on the previous N days.
The proposed predictors also employ several transition weighting schemes for further prediction
improvement and enhanced scheduling results. These differences significantly influence scheduling
results.
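The difference between the two counting schemes can be sketched as follows. This is a deliberately simplified illustration: it uses only two states, omits TDE's completed-interval counter and transition-weighting schemes, and the function names are hypothetical.

```python
def ren_style(daily_counts):
    """Sum transition counts across all N days, then normalize once."""
    n_states = len(daily_counts[0])
    totals = [sum(day[s] for day in daily_counts) for s in range(n_states)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

def tde_style(daily_counts):
    """Compute per-day probabilities first, then average across days,
    so one unusually busy day cannot dominate the estimate."""
    n_days = len(daily_counts)
    n_states = len(daily_counts[0])
    probs = [0.0] * n_states
    for day in daily_counts:
        day_total = sum(day)
        for s in range(n_states):
            probs[s] += day[s] / day_total / n_days
    return probs

# Two states (stay available vs. exit available), two days of counts;
# day 2 saw far more transitions than day 1.
history = [[1, 1], [8, 2]]
```

For this history, summing first yields probabilities of (0.75, 0.25), dominated by the busy second day, whereas averaging per-day probabilities yields (0.65, 0.35), giving the quiet first day equal weight.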
Mickens and Noble [45] [44] [46] use variations and combinations of saturating counters and
linear predictors, including a hybrid approach similar to NWS’s mixture-of-experts, to predict the
likelihood of a host being available for various look ahead periods. A Saturating Counter predictor
increases a resource-specific counter during periods of availability, and decreases it during unavail-
ability. A History Counter predictor gives each resource 2N saturating counters, one for each of the
possible availability histories dictated by the last N availability sampling periods. Predictions are
made by consulting the applicable counter value associated with the availability exhibited in the last
N sampling periods.
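A sketch of a History Counter predictor along these lines; the counter width, initialization, and upper-half decision rule are illustrative choices, not Mickens and Noble's exact parameters:

```python
class HistoryCounterPredictor:
    """One saturating counter per possible N-bit availability history
    (2^N counters per resource)."""

    def __init__(self, n, max_count=7):
        self.n = n
        self.max_count = max_count
        self.counters = [max_count // 2] * (2 ** n)
        self.history = 0  # last n availability samples, packed as bits

    def update(self, available):
        # Saturate the counter selected by the current history, then
        # shift this sample into the history register.
        c = self.counters[self.history]
        self.counters[self.history] = (min(c + 1, self.max_count)
                                       if available else max(c - 1, 0))
        mask = (1 << self.n) - 1
        self.history = ((self.history << 1) | int(available)) & mask

    def predict_available(self):
        # Predict availability when the counter for the current
        # history sits in its upper half.
        return self.counters[self.history] > self.max_count // 2
```

A resource that has been steadily available drives the relevant counters up and is predicted available; a steady run of unavailability drives them down and flips the prediction.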
Pietrobon and Orlando [50] use regressive analysis of past job executions to predict whether a
job will succeed. Nurmi et al. [47] model machine availability using Condor traces and an Inter-
net host availability dataset, attempt to fit Weibull, hyper-exponential, and Pareto models to the
availability duration data, and evaluate them in terms of “goodness-of-fit” tests. They then provide
confidence interval predictions for availability durations based on model-fitting [16]. Similarly, Kang
and Grimshaw filter periodic unavailabilities out of resource availability traces and then apply sta-
tistical models to the remaining availability data [30]. Finally, several machine learning techniques
use categorical time-series data to predict rare target events by mining event sets that frequently
precede them [76] [78] [65].
Chapter 4’s prediction approach differs from the current work by developing novel state transition
counting schemes, by proposing and investigating numerous methods for determining which parts
of a resource’s historical data to analyze, and by using its predictions to forecast a unique set of
availability states. Several transition weighting techniques are developed and are applied to these
algorithms to further increase prediction accuracy. Multi-state availability prediction is used to
forecast the occurrence of the model’s states, and shows that the improved accuracy provided by
the proposed predictors produces significantly improved scheduling results.
2.3 Scheduling
Most Grid scheduling research attempts to decrease job makespan or to increase throughput. For
example, Kondo et al. [32] explore scheduling in a volunteer computing environment. Braun et al. [11]
explore eleven heuristics for mapping tasks onto a heterogeneous system, including min-min, max-
min, genetic algorithms and opportunistic load balancing. Cardinale and Casanova [13] use queue
length feedback to schedule divisible load jobs with minimal job turnaround time. GrADS [17] uses
performance predictions from NWS, along with a resource speed metric, to help reduce execution
time.
Fewer projects focus on scheduling for reliability. Kartik and Murphy [31] calculate the optimal
set of processor assignments based on expected node failure rates, to maximize the chance of task
completion. Qin et al. [51] investigate a greedy approach for scheduling task graphs onto a hetero-
geneous system to reduce reliability cost and to maximize the chance of completion without failure.
Similarly, Srinivasan and Jha [72] use a greedy approach to maximize reliability when scheduling
task graphs onto a distributed system.
Unfortunately, scheduling only for reliability undermines makespan, and scheduling only on the
fastest or least loaded machines can be detrimental due to the performance ramifications of resource
unavailability. Dogan and Ozguner [19] develop a greedy scheduling approach that ranks resources
in terms of execution speed and failure rate, weighing performance and reliability in different ways.
Their work does not use availability predictions, and assumes that each synthetic resource follows
a Poisson failure probability with no load variation, and that machines which become unavailable
never restart. All of these assumptions are removed in the work described in this thesis, which uses
availability and load traces from real resources.
Amin et al. [3] use an objective function to maximize reliability while still meeting a real time
deadline. They search a scheduling table for a set of homogeneous non-dedicated processors to
execute tandem real-time tasks. These authors also assume a constant failure rate per processor.
Some distributed scheduling techniques use availability prediction to allocate tasks. Kondo et
al. [33] examine behavior on the previous weekday to improve the chances of picking a host that will
remain available long enough to complete a task’s operations. Ren et al. [55] also examine scheduling
jobs with their Ren N-day predictor. The Ren MTTF scheduler first calculates each resource’s mean
time to failure (MTTF_i) by adding the probabilities of exiting available for time t, as t goes from 0
to infinity. It then calculates resource i's effective task length (ETL_i) as:

ETL_i = MTTF_i · CR_i · (1 − L_i)

where CR_i is resource i's clock rate and L_i is its predicted average CPU load. The algorithm then
selects the resource with the smallest ETL from the resources with MTTF values that are larger
than the ETL. If no such resources exist, it selects the resource with the minimum job completion
time, considering resource availability. The scheduling approach in Chapter 5 is most similar to
Ren et al.’s, and differs in the following ways: (i) it considers how and why a resource may become
unavailable, and attempts to exploit varying consequences of different kinds of unavailability, (ii) it
schedules checkpointable and non-checkpointable jobs differently, to improve overall performance,
and (iii) it explicitly analyzes and schedules for the tradeoff between performance and reliability.
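The Ren selection rule just described can be sketched as follows. The dictionary layout is hypothetical, and the fallback branch is simplified here (it stands in for the minimum-completion-time rule):

```python
def ren_select(resources):
    """Pick a resource per the Ren MTTF rule: among resources whose
    MTTF exceeds their effective task length (ETL), choose the one
    with the smallest ETL.

    Each resource carries mttf (predicted mean time to failure),
    clock_rate (CR_i), and load (predicted average CPU load, L_i).
    """
    for r in resources:
        r["etl"] = r["mttf"] * r["clock_rate"] * (1.0 - r["load"])
    eligible = [r for r in resources if r["mttf"] > r["etl"]]
    if eligible:
        return min(eligible, key=lambda r: r["etl"])
    # Simplified fallback: best effective task length overall, standing
    # in for "minimum job completion time considering availability".
    return max(resources, key=lambda r: r["etl"])

resources = [
    {"name": "A", "mttf": 100, "clock_rate": 2.0, "load": 0.5},  # ETL = 100
    {"name": "B", "mttf": 200, "clock_rate": 1.0, "load": 0.5},  # ETL = 100
    {"name": "C", "mttf": 300, "clock_rate": 1.0, "load": 0.2},  # ETL = 240
]
```

Resource A is excluded because its MTTF does not exceed its ETL; of the remaining two, B has the smaller ETL and is selected.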
2.4 Replication
Replication and checkpoint-restart are widely studied techniques for improving fault tolerance and
performance. Data replication makes and distributes copies of files in distributed file sharing systems
or data Grids [53][37]. These techniques strive to give users and jobs more efficient access to data by
moving it closer, and to mitigate the effects of resource unavailability. Some work considers replica
location when scheduling tasks. Santos-Neto et al. [67] schedule data-intensive jobs, and introduce
Storage Affinity, a heuristic scheduling algorithm that exploits a data reuse pattern to consider data
transfer costs and ultimately reduce job makespan.
Task replication makes copies of jobs, again for both fault tolerance and performance. Li et
al. [40] strive to increase throughput and decrease Grid job execution time, by determining the
optimal number of task replicas for a simulated and dynamic resource environment. Their analytical
model determines the minimum number of replicas needed to achieve a certain task completion
probability at a specified time. They compare dynamic rescheduling with replication, and extend
the replication technique to include an N-out-of-M scheduling strategy for Monte Carlo applications.
Similarly, Litke et al. [41] present a task replication scheme for a mobile Grid environment. They
model resources according to a Weibull reliability function, and estimate the number of task replicas
needed for certain levels of fault tolerance. The authors use a knapsack formulation for scheduling,
to maximize system utilization and profit, and evaluate their approach through simulation.
Silva et al. [70] investigate scheduling independent tasks in a heterogeneous computational Grid
environment, without host speed, load, and job size information; the authors use replication to
cope with dynamic resource unavailability. Workqueue with Replication (WQR) first schedules all
incoming jobs, then uses the remaining resources for replicas. The authors use simulation to compare WQR, with various maximum numbers of replicas (1x, 2x, 3x, etc.), against Dynamic FPLTF [43] and Sufferage [15]. Anglano et al. [7] later extend this work with a technique called
WQR Fault Tolerant (WQR FT), which adds checkpointing to the algorithm. In WQR a failed task
is abandoned and never restarted, whereas WQR FT adds automatic task restart to keep the number
of replicas of each task above a certain threshold. Tasks may use periodic checkpoints upon restart.
Fujimoto et al. [25] develop RR, a dynamic scheduling algorithm for independent coarse grained
tasks; RR defines a ring of tasks that is scanned in round robin order to place new tasks and
replicas. The authors compare their technique with five others, concluding that RR performed next
to the best without knowledge of resource speed or load, even when compared with techniques that
utilize such information.
Others investigate the relationship between checkpoint-restart and replication. Weissman [79]
develops performance models for quantitatively comparing the separate use of the two techniques
in a Grid environment. Similarly, Ramakrishnan et al. [52] compare checkpoint-restart and task
replication, first analytically determining the costs of each strategy, and then providing a framework
that enables plug-and-play of resource behavior to study the effects of each fault-tolerance technique
under various parameters.
Importantly, none of the related work uses on-demand individual resource availability prediction
or application characteristics such as length and checkpointability to determine whether a task
replica should be created. In contrast, the work in this thesis uses a mix of checkpointable and
non-checkpointable jobs, and drives its simulations with real-world resource traces that capture
resource unavailability. This thesis also introduces a new metric for studying the efficiency of
replication strategies. The proposed replication techniques are then shown to improve upon results
obtained by existing techniques.
Chapter 3
Availability Model and Analysis
Resources in non-dedicated Grids oscillate between being available and unavailable to the Grid.
When and how they do so depends on the availability characteristics of the machines, the policies
of resource owners, the scheduling policies and mechanism of the Grid middleware, and the charac-
teristics of the Grid’s offered job load. Section 3.1 identifies four availability states and Section 3.2
analyzes a trace to uncover how resources behave according to that model.
3.1 Availability Model
This section identifies four availability states, and several job characteristics that could influence
their ability to tolerate resource faults. This discussion focuses on an analysis of Condor [42], but
the results translate to any system with Condor’s basic properties of (i) non-dedicated distributed
resource sharing, and (ii) a mechanism that allows resource owners to dictate when and how their
machines are used by the Grid. Section 3.1.1 first discusses the Condor system as a motivation for
the availability model, Section 3.1.2 then presents an availability model, and Section 3.1.3 discusses
how different types of Grid applications respond to the types of availability described in the model.
3.1.1 Condor
Condor [42] harnesses idle resources from clusters, organizations, and even multi-institutional Grid
environments (via flocking and compatibility with Globus [24]) by integrating resource management,
monitoring, scheduling, and job queuing components. Condor can automatically create process
checkpoints for migration. Condor manages non-dedicated resources, and allows individual owners
to set their own policies for how and when they are used, as described below. Default policies
dictate the behavior of resources in the absence of customized user policies, and attempt to minimize
Condor’s disturbance of local users and processes.
Figure 3.1: Multi-state Availability Model: Each resource resides in and transitions between four
availability states, depending on the local use, reliability, and owner-controlled sharing policy of the
machine. Edges mark the transitions: a user arriving at or departing from the machine, local load
rising above or falling below the CPU threshold, and the machine becoming unreachable or reachable.
By default, Condor starts jobs only on resources that have been idle for 15 minutes, that are not
running another Condor job, and whose local load is less than 30%. Running jobs remain subject to
Condor’s policies. If the keyboard is touched or if CPU load from local processes exceeds 50% for 2
minutes, Condor halts the process but leaves it in memory, suspended (if its image size is less than
10 MB). Condor resumes suspended jobs after 5 minutes of idle time, and when local CPU load falls
below 30%. If a job is suspended for longer than 10 minutes or if its image exceeds 10 MB, Condor
gives it 10 minutes to gracefully vacate, and then terminates it. Condor may also evict a job for a
higher priority job, or if Condor itself is shut down.
Condor defaults dictate that jobs whose checkpoint images are smaller than 60 MB checkpoint every
6 hours; those with larger images checkpoint every 12 hours. By default, Condor delivers checkpoints back
to the machine that submits the job.
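Read together, the default policies above amount to a small decision procedure. The following sketch encodes them; the numeric constants mirror the thresholds quoted in this section, while the function names and structure are illustrative rather than Condor's actual policy-expression language:

```python
# Hypothetical encoding of Condor's default start/suspend policy.
SUSPEND_LOAD = 0.50          # suspend if local CPU load exceeds 50% for 2 minutes
START_LOAD = 0.30            # start jobs only if local load is below 30%
START_IDLE_SECS = 15 * 60    # ... and the machine has been idle 15 minutes
MAX_SUSPEND_IMAGE_MB = 10    # larger images are vacated rather than suspended

def can_start(idle_secs, local_load, running_condor_job):
    """May Condor start a job on this resource under the default policy?"""
    return (idle_secs >= START_IDLE_SECS
            and local_load < START_LOAD
            and not running_condor_job)

def on_activity(local_load, load_high_secs, keyboard_touched, image_mb):
    """What happens to a running job when the resource becomes busy?
    (A suspension lasting over 10 minutes also escalates to vacate;
    that timer is omitted from this sketch.)"""
    if keyboard_touched or (local_load > SUSPEND_LOAD and load_high_secs >= 120):
        if image_mb < MAX_SUSPEND_IMAGE_MB:
            return "suspend"   # halted but left in memory
        return "vacate"        # 10 minutes to checkpoint gracefully, then kill
    return "continue"
```

Because owners may override any of these defaults per machine, resources in the same pool can exhibit very different availability behavior.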
3.1.2 Unavailability Types
Condor’s mechanism suggests a model that encompasses the following four availability states, de-
picted in Figure 3.1:
• Available: An Available machine is currently running with network connectivity, more than
15 minutes of idle time, and a local CPU load of less than the CPU threshold. It may or may
not be running a Condor job.
• User Present: A resource transitions to this state if the keyboard or mouse is touched, indi-
cating that the machine has a local user.
• CPU Threshold Exceeded: A machine enters this state if the local CPU load increases above
some owner-defined threshold, due to new or currently running jobs.
• Down: Finally, if a machine fails or becomes unreachable, it directly transitions to Down.
These states differentiate the types of unavailability. In the context of this work, resource un-
availability refers to the User Present, CPU Threshold Exceeded and Down states. If a job has been
suspended for 15 minutes or the machine is shut down, this is called a graceful transition to
unavailable; a transition directly to Down is ungraceful. This model is motivated by Condor's mechanism,
but can reflect the policies that resource owners apply. For example, if an owner allows Condor jobs
even when the user is present, the machine never enters the User Present state. Increasing the local
CPU threshold decreases the time spent in the CPU Threshold Exceeded state, assuming similar
usage patterns. The model can also reflect the resource’s job rank and suspension policy by showing
when jobs are evicted directly without first being suspended.
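Assuming the four states and the graceful/ungraceful distinction just described, the model can be captured in a few lines (names are illustrative):

```python
from enum import Enum

class State(Enum):
    """The four availability states of Figure 3.1."""
    AVAILABLE = "Available"
    USER_PRESENT = "User Present"
    CPU_EXCEEDED = "CPU Threshold Exceeded"
    DOWN = "Down"

# The three kinds of unavailability identified in this section.
UNAVAILABLE = {State.USER_PRESENT, State.CPU_EXCEEDED, State.DOWN}

def is_graceful(next_state):
    """A transition to User Present or CPU Threshold Exceeded gives the job
    a chance to checkpoint before eviction (graceful); a direct move to
    Down does not (ungraceful)."""
    return next_state in (State.USER_PRESENT, State.CPU_EXCEEDED)
```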
3.1.3 Grid Application Diversity
Grid jobs vary in their ability to tolerate faults. A checkpointable job need not be restarted from
the beginning if its host resource transitions gracefully to unavailable. Only Condor jobs that run
in the “standard universe” [75] are checkpointable. This requires re-linking with Condor's condor_compile tool,
which does not allow jobs with multiple processes, interprocess communication, extensive network
communication, file locks, multiple kernel level threads, files open for both reading and writing, or
Java or PVM applications [75]. These restrictions demonstrate that only some Grid jobs will be
checkpointable. Another important factor is job runtime; Grid jobs may complete in a few seconds,
or may require many hours or even days [8]. Longer jobs will experience more faults, increasing the
importance of their varied ability to deal with them.
Grid resources will have different characteristics in terms of how long they stay in each availability
state, how often they transition between the states, and which states they transition to. Different
jobs will behave differently on different resources. If a standard universe job is suspended and then
eventually gracefully evicted, it could checkpoint and resume on another machine. An ungraceful
transition requires using the most recent periodic checkpoint. A job that is not checkpointable must
restart from the beginning, even when gracefully evicted.
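The eviction cases above can be summarized in a short, hypothetical helper that returns how much completed work survives when a host becomes unavailable:

```python
def resume_point(checkpointable, graceful, elapsed, checkpoint_period):
    """How much completed work (in the job's own time units) survives an
    eviction? Illustrative sketch mirroring the cases described above."""
    if not checkpointable:
        return 0.0                       # must restart from the beginning
    if graceful:
        return elapsed                   # checkpoint on demand, then migrate
    # Ungraceful: fall back to the most recent periodic checkpoint.
    return (elapsed // checkpoint_period) * checkpoint_period
```

For example, a checkpointable job evicted ungracefully seven hours in, with a six-hour checkpoint period, resumes from hour six and loses one hour of work.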
The point is that job characteristics, including checkpointability and expected runtime, can
influence the effectiveness of scheduling those jobs on resources that behave differently according to
their transitions between the availability states identified in Section 3.1.2.
3.2 Availability Analysis
The characterization and analysis of Grid resources can be used to provide insight into resource
behavior for multiple purposes. The data can be used by researchers to better understand resource
behavior so that more realistic availability models can be built for simulation based experimentation.
More importantly for this work, the gathered traces can be directly used to drive simulations to test
various prediction, scheduling and replication strategies. Furthermore, analysis of the availability
patterns can identify trends and behaviors that can be exploited to create more accurate availability
predictors and scheduling strategies.
3.2.1 Trace Methodology
For this study, a four month Condor resource pool trace at the University of Notre Dame was
accessed, organized, and analyzed. The trace consists of time-stamped CPU load (as a percentage)
and idle time (in seconds). Condor records these measurements approximately every 16 minutes,
and makes them available via the condor_status command. Idle times of zero imply user presence,
and the absence of data indicates that the machine was down.
The data recorded by Condor precludes distinguishing an intentional shutdown from an unexpected
machine failure, because both appear identical in the data. Since the Down state is relatively
uncommon, this study conservatively assumes that all transitions to the Down state are ungraceful.
Also, a machine is considered to be in the User Present state only after the user has been present
for 5 minutes; this policy filters out
short unavailability intervals that lead to application suspension, not graceful eviction. A machine
is considered to be in the CPU Threshold Exceeded state if its local (i.e. non-Condor) load is
above 50%. Otherwise, a machine that is online with no user present and CPU load below 50%
is considered Available. This includes machines currently running Condor jobs, which are clearly
available for use by the Grid. On SMP machines, this study follows Condor’s approach of treating
each processor as a separate resource.
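Under these assumptions, mapping one trace sample to a state might look like the following sketch (the function name and sample encoding are hypothetical; the 5-minute user-presence filter operates across consecutive samples and is omitted here):

```python
def infer_state(sample, cpu_threshold=0.50):
    """Map one condor_status measurement to an availability state.
    `sample` is None when no data was recorded (machine Down);
    otherwise a (local_load, idle_secs) pair."""
    if sample is None:
        return "Down"
    local_load, idle_secs = sample
    if idle_secs == 0:
        return "User Present"            # short presences filtered upstream
    if local_load > cpu_threshold:       # non-Condor load above 50%
        return "CPU Threshold Exceeded"
    return "Available"                   # may still be running a Condor job
```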
The trace data was processed and analyzed to categorize resources according to the states proposed
in the multi-state model. Again, the goal is to identify trends that enable multi-state availability
prediction and scheduling.

Figure 3.2: Machine State over Time (four panels plotting, over the days of the trace, the number of
machines that are Available, have a User Present, have the CPU Threshold Exceeded, and are Down)
3.2.2 Condor Pool Characteristics
First, the pool of machines is examined as a whole. This enables conclusions about the machines'
aggregate behavior.
Pool State over Time
Figure 3.2 depicts the number of machines in each availability state over time. Gaps in the data
indicate brief intervals between the four months' data. The data shows a diurnal pattern: availability
peaks at night and dips during the day, though rarely below 300 available machines. This indicates
workday usage patterns and a policy of leaving machines turned on overnight, corroborated by the
diurnal User Present pattern. The number of machines that occupy the CPU Threshold Exceeded
state is less reflective of daily patterns. Local load appears to be less predictable than user presence,
and the number of Down machines also does not exhibit obvious patterns.
Figure 3.3 depicts the overall percentage of time spent by the collective pool of resources in
each availability state. The Available state encompasses 62.9% of the total time, followed by Down
(29.7%), User Present (3.6%) and CPU Threshold Exceeded (3.7%).
Figure 3.3: Total Time Spent in Each Availability State (pie charts of the percentage of total time,
and of non-idle time, spent in each state)
Figure 3.4: Number of Transitions versus Hour of the Day (four panels: transitions to Available,
User Present, CPU Threshold Exceeded, and Down)
Daily and Hourly Transitions and Durations
This section examines how often resources transition between states, and how long they reside in
those states. The data is reported as a function of both the day of the week and the hour of the
day. State transitions can affect the execution of an application because they can trigger application
suspension, lead to checkpointing and migration, or even require restart. Figure 3.4 shows the
number of transitions to each state by hour of day.
Again, the numbers of transitions to both the Available and User Present states show clear
diurnal patterns. Users return to their machines most often at around 3pm, and most seldom at 5am.
Machines also frequently become available at midnight and 10 AM. Transitions to the CPU Threshold
Exceeded and Down states also seem to exhibit similar (but slightly less regular) patterns.

Figure 3.5: Number of Transitions versus Day of the Week (four panels: total transitions to Available,
User Present, CPU Threshold Exceeded, and Down, by weekday)
Figure 3.5 reports the daily pattern of machine transitions among states. Transitions to Available,
User Present and CPU Threshold Exceeded fall from highs at the beginning of the week to a low
on the weekend. Down transitions show less regular patterns.
Figure 3.6 investigates state duration and its patterns. It plots the average duration that re-
sources remain in each state, beginning at each hour of the day. Expectedly, the average duration
of availability is at its lowest in the middle of the day, since this is when users are present.
Interestingly, CPU Threshold Exceeded and Down durations both reach their minimum at mid-day,
meaning that although machines are not typically completely free during the day, they also are not
running long CPU-intensive non-Condor jobs.
Down duration is expectedly higher at night, when users abandon their machines until morning.
The duration of the CPU Threshold Exceeded state is at its lowest during the day, most likely due
to short bursts of daytime activity and longer, non-Condor jobs being run overnight.
Finally, Figure 3.7 examines the weekly behavior of the duration that resources reside in each
state. The average length of time spent in the Available state increases throughout the week with
the exception of Thursday, while User Present duration decreases during the week. CPU Threshold
Exceeded duration seems to be less consistent. The Down state duration exhibits fairly consistent
behavior with the exception of Thursday.
Figure 3.6: State Duration versus Hour of the Day (average duration of each state, by the hour at
which the interval begins)
Figure 3.7: State Duration versus Day of the Week (average duration of each state, by weekday)
Figure 3.8: Machine Count versus State Duration (cumulative counts of machines by average
duration in each state)
Individual Machine Characteristics
This section examines the characteristics of individual resources in the pool. Figure 3.8 and Fig-
ure 3.9 plot cumulative machine counts such that an (x, y) point in the graph indicates that y
machines exhibit x or fewer (i) average hours per day in a state (Figure 3.8) or (ii) number of tran-
sitions per day to a state (Figure 3.9). The first aspect explored is how long, on average, each node
remains in each state.
Figure 3.8 shows the distribution according to the average time spent in each of the four states.
For Availability, the data suggests three distinct categories: 144 resources average 9.4 hours or less,
328 average between 9.4 and 195 hours, and the top 31 resources average more than 195 hours.
For User Present, 173 resources average less than 30 minutes of user presence; of those, 14 have
no user activity. The remaining 330 resources have longer (over 30 minutes) average durations of
user presence. For CPU Threshold Exceeded, 49 have no local load occurrences, 40 resources have
short bursts of 60 minutes or less, 353 resources have longer average usage durations (but less than 6
hours), and the remaining 61 resources remain in their high usage state for over 6 hours on average.
Average time in the Down state shows that 288 resources have durations less than 25 hours; 96
never transition to the Down state. The top 38 resources have Down durations exceeding 1,700
hours (mostly offline).
Figure 3.9: Machine Count versus Average Number of Transitions Per Day (cumulative counts of
machines by average daily transitions to each state)
Figure 3.9 examines the average number of resource transitions to each state per resource per
day. 78.7% of resources transition to Available fewer than twice per day. The top 2% transition
over 4.5 times on average. As reported earlier, 14 machines (2.7%) have no user activity. On the
other hand, 399 machines (79.3%) infrequently experience local users becoming present (less than 2
transitions to User Present per day). Users frequently come and go on the remaining 104 resources
(20.6%), with some of these experiencing up to 7 transitions to User Present on average per day.
CPU Threshold Exceeded transitions are bimodal, with 468 machines (93%) having 1 or fewer
transitions to CPU Threshold Exceeded per day (49 of those have none). The remaining machines
average more than 1 transition and peak at 3.7 transitions each day. Finally, Down is relatively
uncommon with 96 resources (19%) never entering the Down state, 341 (67.7%) transitioning to
Down less than once on average, and only the top 5 machines (1%) having two transitions or more
on average.
Machine Classification
This section classifies resources by considering multiple state characteristics simultaneously. This
approach organizes resources based on which types of applications they would be most useful in
executing. In particular, the average availability duration dictates whether an application is likely to
complete; how a machine transitions to unavailable (gracefully or ungracefully) determines whether
the job can checkpoint before migrating.

                              Graceful Transitions: High     Graceful Transitions: Low
Avg. Availability Duration    Ungraceful: High   Low         Ungraceful: High   Low
High                                 0            0                 44           90
Medium High                          0            0                 30          104
Medium Low                          22           18                 47           47
Low                                 98           32                  3            2

Table 3.1: Machine Classification Information
Machines are classified based on average availability duration and the number of graceful and
ungraceful transitions to unavailable per day. Table 3.1 reports the number of machines that fit these
categories. The thresholds for the Availability categories are 65.5 (High), 39.7 (Medium High), and
6.5 (Medium Low) hours. More than 1.6 graceful transitions to unavailable, per day, are considered
to be High, and two or more ungraceful transitions during the four months is considered High.
Roughly 36% of machines have Medium High or High availability durations with both a low
number of graceful and ungraceful transitions, making them highly available and unlikely to expe-
rience any kind of job eviction. In contrast, about 24% of resources exhibit low availability and a
high number of graceful transitions; about 75% of these have high ungraceful transitions. The low
availability group would be most appropriate for shorter running checkpointable applications. The
last significant category (approximately 8.7%) has Medium Low availability and few graceful and
ungraceful transitions. This group lends itself to shorter running applications with or without the
ability to checkpoint.
3.2.3 Classification Implications
This section demonstrates the implications of Table 3.1’s resource classification distribution. To test
how applications would succeed or fail (a failed application is defined as an application that expe-
riences resource unavailability before it completes execution) on resources in the different classes,
a series of 100 jobs is simulated, with job runtimes equally distributed between approximately
5 minutes and 12 hours (on a machine running at 2500 MIPS). The simulation used each
resource’s local load at each point in the trace, along with the MIPS score that Condor calculates.
Each job was begun at a random available point in each resource’s trace and then the execution was
simulated. This was done 1000 times for each job duration on each resource.
Results are included for the resources classified as (i) low availability with high graceful and
ungraceful transitions (18% of the machines), (ii) medium-low availability with low graceful
transitions but high ungraceful transitions (8.7%), (iii) medium-high availability with few graceful
and ungraceful transitions (19.3%), and (iv) high availability, also with few graceful and ungraceful
transitions (16.7%). These classes represent a broad spectrum of the classification scheme and 62.7%
of the total resources. They appear in bold in Table 3.1, and are labeled (i) LA-HG-HUG, (ii)
MLA-LG-HUG, (iii) MHA-LG-LUG, and (iv) HA-LG-LUG in Figure 3.10.

Figure 3.10: Failure Rate vs. Job Duration (left: graceful failure rate; right: ungraceful failure rate;
both versus job duration in instructions, for the four resource classes)
Figure 3.10 shows that graceful failure rates climb more rapidly for less available resources. The
low availability resources also have a much higher ungraceful failure rate versus the high and medium
high availability resources, which have a very low ungraceful failure rate.
These results are important because of the diversity of Grid applications, as described in Sec-
tion 3.1.3. Checkpointable applications can tolerate graceful evictions by checkpointing on demand
and resuming elsewhere, without duplicating work. However, these same applications are more sen-
sitive to ungraceful evictions; the amount of lost work due to an ungraceful transition to the Down
state depends on the checkpoint period. On the other hand, applications that do not checkpoint are
equally vulnerable to graceful or ungraceful transitions; both require a restart. Therefore, resources
in different classes should host different types of applications.
Intuitively, to make the best matches under high demand for resources, a scheduler should match
the job duration with the expected availability duration of the machine; machines that are typically
available for longer durations should be reserved, in general, for longer running jobs. And non-
checkpointable jobs should run on resources that are not likely to transition to Down ungracefully.
Finally, machines with high rates of ungraceful transition to Down should perhaps be reserved for
highly replicable jobs.
30
                              Random Scheduler   Classifier Scheduler   Improvement
Completed Jobs                      1883.2              1936.0              2.27%
Gracefully Evicted Jobs             1244.0               492.0             60.45%
Ungracefully Evicted Jobs             28.6                14.3             50.0%
Avg. Execution Time (secs.)        71209.0             48656.0             31.6%

Table 3.2: Machine Classification Scheduling Results
A second simulation demonstrates the importance of the classes. 1000 jobs were created of ran-
dom length (equally distributed between 5 minutes and 12 hours), half of which were checkpointable
with a period of six hours. Two schedulers are tested: a scheduler that maps jobs to resources
randomly, and one that prioritizes classes according to Table 3.1. The simulator iterates through
the trace, injecting an identical random job set and scheduling it with each scheduler. Job execution
considers each resource’s dynamic local load and MIPS score. When a resource that is executing a
job becomes unavailable in the trace, the job is rescheduled. If the job is checkpointable and the
transition is graceful, the job takes a fresh checkpoint and migrates. An ungraceful transition to
unavailability requires restarting from the last successful periodic checkpoint. Non-checkpointable
jobs must always restart from their beginning.
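The rescheduling rules above can be condensed into a toy simulation loop. This is an illustrative sketch, not the dissertation's simulator; it ignores load and MIPS scaling and encodes unavailability as a list of (run_secs, graceful) events:

```python
def simulate(job_len, checkpointable, ckpt_period, events):
    """Wall-clock seconds to finish a job of job_len seconds of work.
    Each event means: the job runs run_secs, then the host becomes
    unavailable (gracefully or not); the last event is assumed long
    enough for the job to finish."""
    done, wall = 0.0, 0.0
    for run_secs, graceful in events:
        if done + run_secs >= job_len:
            return wall + (job_len - done)       # finishes this interval
        wall += run_secs
        if not checkpointable:
            done = 0.0                           # full restart
        elif graceful:
            done += run_secs                     # fresh checkpoint, migrate
        else:
            # ungraceful: resume from the last periodic checkpoint
            done = ((done + run_secs) // ckpt_period) * ckpt_period
    raise ValueError("trace ended before job completed")
```

A 10-second job interrupted gracefully after 6 seconds finishes in 10 seconds of wall time if checkpointable, but 16 if it must restart from scratch.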
The classifier used months one and two to classify the machines; schedulers ran in months three
and four. Each subsequent month was simulated three times and the average values were taken
for that month. Table 3.2 reports the sum of those averages across both months. Jobs experience
more evictions, and therefore take an average of 31.6% longer, when the scheduler ignores the
resources' unavailability characteristics.
Chapter 4
Multi-State Availability Prediction
Predicting the future availability of a resource can be extremely useful for many purposes. First,
resource volatility can have a negative impact on applications executing on those resources. If the
resource on which an application is executing becomes unavailable, the application will have to
restart from the beginning of its execution on another resource. This wastes valuable cycles and
increases application makespan. Prediction can allow the scheduler to choose resources that are
least likely to become unavailable and avoid these application restarts. Second, replication can
also be used to mitigate task failure by creating a copy of an application. Since the system can
support only a finite number of replicas, prediction can identify which applications are most likely
to experience resource unavailability, so that only those jobs are replicated. This can increase the
efficiency of the system. Third, when storing
data on distributed systems, resource unavailability can be predicted to ensure that data remains
accessible even in the presence of resource unavailability. This work studies using predictions for
both scheduling and replication in later chapters. This chapter first proposes prediction strategies
in Section 4.1, then analyzes their accuracy in various configurations in Section 4.2.1, and finally
compares these predictors' accuracy with that of predictors from related work in Section 4.2.2.
4.1 Prediction Methodology
A multi-state prediction algorithm takes as input a length of time (e.g. estimated application execu-
tion time), and uses a resource's availability history to predict the probabilities of that resource
next exiting the Available state into each of the non-available states, or remaining available through-
out the interval; these probabilities sum to 100%. The proposed availability predictor outputs four
probabilities, one each for entering Figure 3.1’s User Present, CPU Threshold Exceeded and Down
states next, and one for the probability of completing the interval without leaving the Available
state (Completion). It is often useful to think of the highest computed probability for a given
prediction as its surety, because the larger that probability, the more sure the predictor is of its
forecast.
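A predictor's output can thus be represented as four probabilities plus a derived surety. A minimal sketch, with assumed names:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """Output of a multi-state predictor for one resource and one
    requested interval. The four probabilities sum to 1."""
    user_present: float
    cpu_exceeded: float
    down: float
    completion: float

    def surety(self):
        """The largest probability: how sure the predictor is of its forecast."""
        return max(self.user_present, self.cpu_exceeded,
                   self.down, self.completion)
```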
Since a prediction algorithm examines a resource’s historical availability data, the two important
aspects that differentiate predictors are (i) which portion of the resource's history the predictor
analyzes and (ii) how the predictor analyzes that section of availability history. The approaches presented here
are classified along these two dimensions; by combining when and how historical data is analyzed, a
range of availability prediction strategies is created.
In determining when the predictor should analyze, this study takes two different approaches.
Previous work [54] advocates examining a resource’s behavior on previous days during the same
interval. Therefore, the first approach investigated examines a resource’s past N days of availability
behavior during the interval being predicted. This approach is called N-Day or Day. Another
approach examines a resource’s most recent N hours of activity immediately preceding the prediction
(N-Recent).
The proposed predictors analyze the segments of resource availability history in two ways. The
first makes predictions based on the resource’s state durational behavior by calculating the percentage
of time spent in each state during the analysis period. The rationale is that the state that is occupied
the majority of the time could be the most likely next state. The second and more successful approach
considers a resource’s transitional behavior by counting the number of transitions from Available
to each of the other states. Also, for every interval of time a resource was available, the predictor
counts how many times the resource could have completed the requested job duration (e.g. a ten
hour availability interval and a two hour job means five completions). The probabilities for each
exit state (as well as the completion state) are calculated by summing each state’s transition count
and dividing the total number of transitions for each state by the total number of transitions. For
both the durational and transitional approaches, when combining different sections of availability
(e.g. when combining the N days’ probabilities together), the probabilities for each state are then
averaged.
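A sketch of the transitional approach over a single slice of history follows; the state names match Figure 3.1, while the interval encoding and function name are assumptions:

```python
def transitional_predict(history, job_secs):
    """Transitional prediction over one slice of history. `history` is an
    ordered list of (state, duration_secs) intervals. Counts exits from
    Available into each other state, plus how many times each available
    interval could have held a job of job_secs (e.g. a ten-hour interval
    and a two-hour job count as five completions). Returns probabilities
    that sum to 1."""
    counts = {"User Present": 0, "CPU Threshold Exceeded": 0,
              "Down": 0, "Completion": 0}
    for i, (state, dur) in enumerate(history):
        if state != "Available":
            continue
        counts["Completion"] += int(dur // job_secs)
        if i + 1 < len(history):
            counts[history[i + 1][0]] += 1   # exit state of this interval
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()} if total else counts
```

When several slices are combined (e.g. the same interval on each of N days), the per-slice probabilities are averaged, as described above.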
The Transitional schemes weight all transitions equally (the Equal weighting scheme); weighting
transitions according to when they occurred is also investigated. The predictors can weight transitions
according to the time of day (Time), in response to previous observations that time of day correlates with
future behavior (Chapter 3) [59]. This means that for the Time weighting scheme, the closer a
transition time is to the time of day at which the prediction occurs, the higher the weight of that
transition. In other words, if a prediction is made at 3pm, then transitions that occur closest to
3pm today and on other days have the largest weights. Lastly, Freshness weighting gives a higher
weight to transitions that occur closer to the time of the prediction and likewise, a smaller weight
to events that occurred further in the past.
Section 2.2 evaluates related work predictors for comparison. Since linear regression [18] is such a
prevalent method in many disciplines of prediction, this study includes both Sliding Window Single
and Sliding Window Multi. Because states are categorical, they must be converted to numerical
values. The following conventions are used. In the single state version of sliding window (SW
Single), the Available state is assigned to 1 and any other state as 0. In the multi-state version (SW
Multi), the Available state is assigned to 1, User Present as 2, CPU Threshold Exceeded as 3 and
Down as 5. Counter based predictors such as those used for resource availability prediction [45] are
evaluated, including Saturating Counter and History Counter. Just as in [45], a 2-bit saturating
counter is updated every hour. The Completion predictor, which always predicts the machine will
complete the interval in the Available state is used as a baseline comparison. Finally, Ren’s N-Day
multi-state availability predictor (called simply Ren) is implemented for comparison (see Section 2.2
for a full description of the Ren predictor) [55].
4.2 Predictor Accuracy Comparison
This section evaluates the predictors for accuracy, focusing on those that analyze a resource's
history with the transitional method; both the N-Day and the N-Recent hours approaches for
selecting which portion of a resource's history to examine are considered. Results
for the durational method are excluded due to its inferior performance. This study uses six months
of availability data from the 603 node Condor pool at the University of Notre Dame as its data set.
This includes two additional months of data not reported in Chapter 3. This trace was taken during
the months of January through June of 2007 and is analyzed in Section 3.2. Again, measurements
for CPU load, resource availability and user presence were taken every 15 minutes for each resource,
and the states of availability are inferred according to the proposed multi-state model.
Each predictor performed an identical set of 500,000 predictions, each on a random resource at a
random time during the six month trace. Predictor accuracy is defined as the ratio of correct
predictions to the total number of predictions. A correct prediction is one for which the machine is
predicted to exit on a certain non-available state and it does, or for which the machine is predicted
to remain available throughout the interval, and it does.
[Plot: prediction accuracy (%) vs. number of days analyzed; series: TDE and Ren at prediction lengths of 12, 25, 60, and 120 hours]
Figure 4.1: Transitional Predictor Accuracy
4.2.1 Prediction Method Analysis
Figure 4.1 depicts the accuracy of the Transitional N-Day with Equal transition weights (TDE)
predictor and the Ren N-Day predictor as a function of the number of days each analyzes, for predictions
with lengths uniformly distributed between five minutes and the number of hours indicated in the
legend. Recall that TDE examines the resource’s state transition behavior (Transitional) during
the requested prediction interval starting at the same time during each of the previous N days
(N-Day). It then calculates the state exit probabilities based on the number of transitions from
Available to each of the states separately for each day, combining the resulting probabilities for
each day. For predictions of lengths between five minutes and 12, 25, 60 and 120 hours, the graph
shows that TDE exhibits an increase in accuracy when analyzing any number of days in comparison
with Ren. For example, for predictions between 5 minutes and 25 hours, TDE’s accuracy increases
through 16 days, at which point it peaks at 78.3%. In this case, Ren peaks at 73.7% and then falls
abruptly when considering more than one day. Notice that Ren’s predictor decreases in accuracy
when analyzing more days whereas TDE can better incorporate new information into its prediction.
For all prediction lengths, TDE initially increases in accuracy, then levels off when acquiring more
[Plot: prediction accuracy (%) vs. number of hours analyzed; series: TRE, TRF, TRT]
Figure 4.2: Weighting Technique Comparison
information (considering more days) whereas Ren becomes less accurate the more days it considers.
Overall, TDE is 4.6% more accurate than the Ren predictor.
Figure 4.2 examines the Transitional Recent hours (TR) predictor configured to analyze the
most recent N hours of availability with various transition weighting (Equal, Freshness and Time
of day) schemes for predictions between five minutes and 25 hours (other lengths are not examined
due to space constraints). Again, this predictor examines the resource’s state transition behavior
(Transitional) but in contrast to TDE, does so for the last N-hours before the time that the prediction
is made (Recent). For all weighting schemes, as the predictor considers more hours, accuracy
increases dramatically at first; the gains then taper off, accuracy reaches a maximum, and finally it slowly declines.
The Freshness (TRF) weighting scheme provides the best performance, reaching the highest accuracy
of 77.3% when examining the past 48 hours of behavior. TRF weights each transition t (W (t))
according to the following formula:
W (t) = M · (Tt/L)
M is the “recentness weight” set by the user (the default value is five), Tt is the length of time
[Plot: prediction accuracy (%) vs. prediction duration (hours); series: TRF, TDE, Ren 1-day, Completion, Saturating Counter, History Counter, Sliding Window Single, Sliding Window Multi]
Figure 4.3: Prediction Accuracy by Duration
that elapsed from the beginning of the analysis interval to the transition t, and L is the total analysis
interval length. In the next section, analysis focuses on the TDE and TRF predictors due to their
high accuracy (78.3% and 77.3% respectively).
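The Freshness weighting can be sketched directly from the formula above; the representation of transitions as (elapsed-time, exit-state) pairs is an illustrative assumption:

```python
def freshness_weight(t_elapsed, analysis_length, m=5.0):
    """W(t) = M * (T_t / L): transitions later in the analysis window
    (closer to the prediction time) receive proportionally more weight.
    M defaults to 5, the default 'recentness weight' from the text."""
    return m * (t_elapsed / analysis_length)

def weighted_exit_probs(transitions, analysis_length, m=5.0):
    """transitions: list of (t_elapsed_hours, exit_state) pairs.
    Returns freshness-weighted exit-state probabilities."""
    if not transitions:
        return {}
    weights = {}
    for t, state in transitions:
        w = freshness_weight(t, analysis_length, m)
        weights[state] = weights.get(state, 0.0) + w
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}
```

Because the weights are normalized, the constant M cancels in the probabilities themselves; it matters only when freshness-weighted counts are combined with unweighted ones.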
4.2.2 Analysis of Related Work
This section compares the accuracy of the best performing predictors, TDE and TRF, to sev-
eral existing approaches from the literature, namely the Saturating and History Counter predic-
tors [45] [44] [46], the Multi-State and Single State Sliding Window predictors [18], the Ren predic-
tor [55] [54] [56], and the Completion predictor.
Figure 4.3 depicts predictor accuracy versus prediction length, for predictions up to 120 hours.
The Counter-based predictors, Completion predictor, and Sliding Window predictors perform sim-
ilarly to one another, compete well with TRF, TDE and Ren for predictions up to 20 hours, and
then sharply decline, never leveling off. In contrast, the TRF, TDE, and Ren predictors decrease
initially and then level off; the rate of decrease for Ren is somewhat larger as the prediction length
increases past 60 hours. The Ren predictor initially has lower accuracy, with accuracy decreasing
even faster than the Completion predictor for prediction durations shorter than 19 hours. During
these short intervals, the Ren predictor is up to 5.6% less accurate than the TDE predictor, and
5.2% less accurate than TRF.
Figure 4.3 demonstrates that for predictions shorter than 19 hours, TDE is the most accurate,
accounting for its 1% increase in accuracy over TRF in Section 4.2 for predictions between five
minutes and 25 hours. However, TRF becomes the most accurate predictor for predictions longer
than 42 hours, reaching an accuracy increase of 9.8% over Ren and 3.1% over TDE for predictions of
120 hours. Chapter 5’s scheduling results demonstrate that this large difference in accuracy for longer
predictions is critical. Chapter 5 focuses its prediction-based scheduling analysis on TRF because of
the improved schedules it produces. Predictors that perform better for long term predictions lead
to the best scheduling results when used with the schedulers proposed in Chapter 5, even when
scheduling shorter jobs. As the results demonstrate in Section 5.4.2, this explains why using the
TRF predictor decreases makespan and the number of job evictions when compared to using the
Ren predictor, even when using an identical prediction-based scheduler.
Chapter 5
Prediction-Based Scheduling
Resource selection — choosing the resource on which to execute a given task — is a complex
scheduling problem. Resource selection can depend on the length of the job, its ability to take a
checkpoint, the load on the system, the availability of resources, the resource's reliability, and possible
time constraints on application completion [61][58]. Additionally, the performance metric can vary
from reduced job makespan or increased job reliability, to an increased overall system efficiency
and throughput. These factors lead to a complex scheduling environment involving many different
tradeoffs. The resource selection problem is further complicated by resource volatility. Unexpected
resource unavailability can have a large impact on scheduling performance. Schedulers that are
aware of this resource volatility can make more informed decisions on job placement, avoiding less
reliable resources.
This chapter focuses on dealing with resource volatility through informed job placement, to reduce
job evictions by avoiding resource unavailability, which in turn can decrease average job makespan.
To that end, this chapter investigates scheduling jobs with the aid of resource availability predictions
such as those introduced in Chapter 4. This chapter is organized as follows. First, Section 5.1
explains the simulator setup and scheduling methodology. Section 5.2 demonstrates the inherent
tradeoff between scheduling for reliability and scheduling for performance, Section 5.3 investigates
various prediction-based scheduling heuristics and their performance across various job lengths and
checkpointabilities. Lastly, Section 5.4 compares the prediction based scheduler’s results to various
prediction-based and non-prediction-based scheduling approaches across job lengths.
5.1 Scheduling Methodology
This work investigates various scheduling techniques through simulation-based experiments that vary
the number of jobs, their characteristics, and the schedulers’ scoring techniques. The simulations in
this chapter are driven by the Notre Dame Condor pool (approximately 600 nodes) six month trace.
A certain number of jobs are simulated executing on these machines, each utilizing its own recorded
availability and load measurements. The simulation based experiments presented here create and
insert each application at a random time during the six month simulation, such that application
injection is uniformly distributed across the six months. The simulation assigns each application
a duration (i.e. a number of operations needed for completion); application duration is uniformly
distributed between two durations such as between five minutes and 25 hours (the duration is an
estimate based on the execution of the application on a resource with an average MIPS speed with
no load). This uniform distribution of job insertion times and durations allows the experiments
to test the quality of the predictor and scheduler combination with equal weight for all prediction
durations and prediction start times, without emphasizing a particular duration or start time. In
all simulations in this chapter, 25% of the applications are checkpointable; that is, they can take an
on-demand checkpoint in five minutes if the user returns or if the local CPU load rises above 50%,
leading to eviction and rescheduling. The remaining 75% non-checkpointable jobs need to start over
upon eviction.
Each machine’s MIPS score (as defined by Condor’s benchmarking tool), and the load and avail-
ability state information contained in the trace, influence the simulated running of the application.
During each simulator tick, the simulator calculates the number of operations completed by each
working resource, and updates the records for each executing application. If a resource executing
an application leaves the Available state (as per the trace), effectively evicting the application, the
executing application joins the back of the application queue for rescheduling. The number of op-
erations that remain to be completed for this evicted application depends on the checkpointability
of the application and the type of eviction (Section 3.1.3). All waiting applications (scheduled or
rescheduled) remain queued until they reach the head of the queue, at which point they are resched-
uled. During each simulator tick, the scheduler places applications off the head of the queue until
the next application cannot be scheduled (e.g. because no resources are available).
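The per-tick progress and eviction accounting can be sketched as follows; the reading of MIPS as millions of operations per second, the load-scaled effective speed, and the eviction semantics (graceful exits allow a checkpoint, ungraceful exits do not) are assumptions consistent with the text, and the five-minute checkpoint cost is omitted for brevity:

```python
def tick_progress(mips, load, tick_secs):
    """Operations completed in one tick by a working resource,
    assuming MIPS = millions of operations per second scaled by
    the idle fraction of the CPU."""
    return mips * (1.0 - load) * tick_secs * 1e6

def on_eviction(remaining_ops, total_ops, checkpointable, graceful):
    """Operations left after an eviction: a checkpointable job that
    receives a graceful eviction (user return or CPU threshold)
    resumes where it left off; otherwise the job starts over."""
    if checkpointable and graceful:
        return remaining_ops
    return total_ops
```

An evicted job's remaining-operation count would be updated with `on_eviction` before it rejoins the back of the queue.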
To facilitate resource-score-based scheduling (e.g. prediction-based scheduling as well as other
scheduling heuristics), this work defines a scheduling algorithm called the Prediction Product Score (PPS) Sched-
uler. The PPS Scheduler scores each available resource and maps the job at the head of the queue
onto the resource with the highest score (Ties are broken arbitrarily.) The resource scoring policy
(how each resource is scored) defines that scheduler’s placement policy. Various resource scoring
approaches are examined throughout the experiments presented here.
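The PPS selection step itself is simple; as a sketch (the dictionary layout of a resource record is an illustrative assumption), it amounts to a maximum over the scored available resources:

```python
def pps_select(resources, score):
    """PPS: score each available resource with the supplied scoring
    policy and return the highest scorer. Ties are broken arbitrarily
    (here, by max()'s first-wins behavior)."""
    available = [r for r in resources if r["state"] == "avail"]
    if not available:
        return None  # job stays at the head of the queue
    return max(available, key=score)
```

Each resource scoring approach examined below is just a different `score` function passed to this selector.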
Scheduling quality is analyzed according to average job makespan and evictions. Makespan
is the time from submission to completion; average makespan is calculated across all jobs in the
simulation. For evictions, the simulator counts the total number of times that jobs need to be
rescheduled because a resource running a job transitions from available to one of the unavailable
states.
5.2 Reliability Performance Relationship
This section establishes the inherent tradeoff between scheduling for reliability and for performance
for a real world environment. In the context of this work, resource reliability refers to the ability
of a resource to consistently complete applications without experiencing unavailability and is used
informally. It is not used formally as in other publications [51]. Schedulers that favor performance
consider the static capability of target machines, along with current and predicted load conditions,
to decrease makespan. Other schedulers may instead consider the past reliability and behavior of
machines to predict future availability and to increase the number of jobs that complete successfully
without interruption due to machines going down or owners reclaiming resources. Schedulers may
consider both factors, but cannot in general optimize simultaneously for both performance-based
metrics like makespan, and reliability-based metrics like the number of evictions or the number
of operations that must be re-executed due to eviction (i.e. “operations lost”) [19]. In fact, the
interplay between these two metrics is often quite complex. Choosing a reliable resource can decrease
job evictions and by virtue of that, decrease job makespan (increase performance). Along the same
lines, choosing a fast resource can lead to a shorter job execution time, reducing the likelihood of
experiencing a job eviction (increase reliability). In the real world, there is a large variety in the
reliability and performance of resources and often a scheduler cannot find both a high performance
and high reliability resource. In these cases, a scheduler may have to choose between a slow reliable
resource or a fast unreliable resource or anywhere in between. Section 5.3 explores heuristics for
dealing with this scheduling problem. This section demonstrates the tradeoff itself.
To investigate the reliability-performance tradeoff, 6000 simulated applications are executed on
these machines according to the methodology outlined in Section 5.1. Application durations are
uniformly distributed between five minutes and 25 hours. The PPS chooses from among resources
for application execution (Section 5.1) and scores each available resource according to the following
expression (the resource with the highest score is chosen for application execution):
RSi = (1 − W ) · Pi[COMPLETE] + W · (MIPSi/MIPSmax) · (1 − Li)
[Plots: average makespan (hours) vs. tradeoff weight; total job evictions vs. tradeoff weight]
Figure 5.1: As schedulers consider performance more than reliability (by increasing Tradeoff Weight W), makespan decreases, but the number of evictions increases.
• Pi[COMPLETE] is resource i’s predicted probability of completing the job interval without
failure, according to the TRF predictor (Section 4.2),
• MIPSi is the resource’s processor speed,
• MIPSmax is the highest processor speed of all resources (for normalization), and
• Li is the resource’s current processor load.
Figure 5.1 depicts the effect of varying the Tradeoff Weight and hence the relative influence of
reliability or performance in scheduling. As performance is considered more prominently, makespan
decreases but the number of evictions increases. In the middle of the plot, a tradeoff weight of 0.5 does
achieve makespan within 6.7% of the lowest makespan on the curve, while simultaneously coming
within 18.1% of the fewest number of evictions. Nevertheless, the makespan slope is uniformly
negative, and the evictions slope is uniformly positive.
5.3 Prediction Based Scheduling Techniques
This section explores in more detail a wider range of resource ranking strategies that consider resource
performance and reliability. Section 5.3.1 studies the effect of considering various combinations of
CPU speed, load, and reliability; the section also introduces the idea of scheduling checkpointable
jobs differently from non-checkpointable jobs. The experiments then describe how the best per-
forming scheduler from Section 5.3.1 behaves as the length of the interval for which the predictor
forecasts is varied (Section 5.3.2).
5.3.1 Scoring Technique Performance Comparison
In this Section, the PPS scheduler is configured with a range of resource scoring approaches, which
utilize some or all of the following factors.
• CPUSpeed(MIPS) : the resource’s Condor MIPS rating
• CurrentLoad(L) : the machine utilization information at scheduling time
• CompletionProbability(P [COMPLETE]) : The predicted probability of completing the pro-
jected job execution time without becoming unavailable. This is one minus the sum of the
probabilities of the three unavailability states
• UngracefulEvictionProbability(Pi[UNGRACE]) : The predicted probability of exiting job
execution directly to Down (with no chance for a checkpoint).
• Checkpointability : Whether the job could take an on-demand checkpoint before being evicted
from a machine (CKPT) or not (NON CKPT)
Considering whether or not the scoring system incorporates each of the first three criteria defines
eight different kinds of scoring, ranging from considering none of the criteria (Random scheduling),
to including them all. Then, if checkpointable jobs are scheduled differently from non checkpointable
jobs, many more possible approaches emerge. There is also the matter of how to incorporate each
factor into the score. Based on intuition (some combinations make more sense than others) and
background experimental results, the PPS scheduler (Section 5.1) is configured with the following
resource scoring approaches. The expression provides the formula for computing each resource’s
score, RSi
• S0 : MIPSi · (1 − Li)
• S1 : Pi[COMPLETE]
• S2 : MIPSi · Pi[COMPLETE]
• S3 : MIPSi · (1 − Li) · Pi[COMPLETE]
• S7 : CKPT: MIPSi · (1 − Li) · (1 − Pi[UNGRACE]); NON-CKPT: MIPSi · (1 − Li) · Pi[COMPLETE]
• S9 : CKPT: MIPSi · (1 − Li); NON-CKPT: MIPSi · (1 − Li) · Pi[COMPLETE]
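One reading of these scoring variants can be sketched as a single dispatch function; the resource-record layout is an illustrative assumption:

```python
def score(policy, r, job_ckpt):
    """Resource score RS_i under each scoring variant. r bundles the
    resource's MIPS, current load, and the two predicted probabilities;
    job_ckpt says whether the job being placed is checkpointable."""
    mips, load = r["mips"], r["load"]
    perf = mips * (1.0 - load)  # load-discounted speed
    if policy == "S0":
        return perf
    if policy == "S1":
        return r["p_complete"]
    if policy == "S2":
        return mips * r["p_complete"]
    if policy == "S3":
        return perf * r["p_complete"]
    if policy == "S7":
        # checkpointable jobs only fear ungraceful (direct-to-Down) exits
        rel = (1.0 - r["p_ungrace"]) if job_ckpt else r["p_complete"]
        return perf * rel
    if policy == "S9":
        # checkpointable jobs ignore reliability entirely
        rel = 1.0 if job_ckpt else r["p_complete"]
        return perf * rel
    raise ValueError(policy)
```

Passing this (with a fixed policy and job) as the scoring function to the PPS selector reproduces each scheduler studied below.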
[Plots: makespan vs. job length (hours); evictions vs. job length (hours); series: S0, S1, S3, S7, S9]
Figure 5.2: The resource scoring decision's effectiveness depends on the length of jobs. For longer jobs, schedulers should emphasize speed over predicted reliability.
S0, S1, S2, and S3 schedule checkpointable jobs the same way they schedule non-checkpointable
jobs with S0 considering speed, S1 considering reliability, and S2 and S3 considering both. However,
the fact that checkpointable jobs react differently to various types of resource unavailability suggests
that scheduling them differently could improve overall Grid performance. S7 and S9 are considered
for this reason.
This section executes simulations for the same environment and setup as in Section 5.1, but
varies the maximum job length from 12 to 120 hours. Pi[COMPLETE] (an availability prediction)
is made by TRF (Section 4.2) and used as input to the resource scoring decisions that utilize it. The
results obtained and presented in the remainder of the paper are based on simulating each scenario
once. However, the consistency and repeatability of the results has been verified through repetition.
Figure 5.2 plots average makespan vs. maximum job length, and evictions vs. maximum job
length for all of the proposed resource scoring decisions as percentage difference versus S0. Percent-
ages are used for comparison throughout this chapter. The percentage difference of one scheduler
versus a chosen baseline scheduler is plotted for each particular metric (e.g. operations lost or evic-
tions). Percentages are used to emphasize the differences in scheduling quality and to facilitate visual
comparison of performance, given the large range of y values. The text includes the magnitude of
the results (e.g. the actual number of evictions) in several places.
S1, which considers only reliability, performs significantly worse than all the others in terms
of makespan for all maximum job lengths, but also for number of evictions for longer jobs. This
demonstrates that schedulers must consider resource performance when scheduling. The figure also
shows that makespans are relatively similar for the other scheduling approaches, but that S3 does
the best job of avoiding evictions. For example, for jobs up to 84 hours, S3 obtains 5417 evictions
[Plots: makespan, evictions, operations lost, and percent of total operations lost vs. checkpointability (%); series: S0, S1, S3, S7, S9]
Figure 5.3: Scoring decision effectiveness depends on job checkpointability. Schedulers that treat checkpointable jobs differently than non-checkpointable jobs (S7 and S9) suffer more evictions, but mostly for checkpointable jobs; therefore, makespan remains low.
whereas S1, S7 and S9 obtain approximately 6450 evictions (19% more). For a shorter job length of
6 hours, S3 causes 860 evictions and S7 and S9 cause about 1070 evictions increasing the disparity
to 24%. These results demonstrate that schedulers must consider both reliability and performance
while scheduling, to simultaneously produce fewer evictions and shorter job makespan.
Figure 5.2's results are for 25% checkpointable jobs. When that percentage varies, the value of
considering reliability in the scoring function changes, as does the value of scoring checkpointable
and non-checkpointable jobs differently.
Figure 5.3 plots makespan, evictions, operations lost, and percentage of total operations lost,
as the percentage difference compared with S0. As the percentage of checkpointable jobs increases,
makespan difference increases more quickly for S1 and S3, compared to schedulers S7 and S9, which
schedule non-checkpointable jobs on more reliable resources. Figure 5.3 shows that the difference in
the number of evictions does increase more for S7 and S9; these evictions, however, are increasingly
for checkpointable jobs, so the effect on makespan is small. When a small percentage of jobs are
checkpointable, schedulers do well to consider reliability (like S3). The hybrid approaches achieve
balance between reliability and performance, at each extreme.
The remainder of this chapter considers only the case where 25% jobs are checkpointable, and
therefore primarily uses S3, which performs similarly to S7 and S9 for this checkpointability mix.
[Plots: makespan vs. job length (hours); evictions vs. job length (hours); series: interval multipliers 0.25, 0.5, 1, 2, 3, and Comp]
Figure 5.4: Schedulers can improve results by considering predictions for intervals longer than the actual job lengths. This is especially true for job lengths between approximately 48 and 72 hours.
5.3.2 Prediction Duration
In tests described so far, the scheduler asks the predictor about behavior over a prediction interval
that is intended to match application runtime. A prediction for the next N hours may not necessarily
reflect the best information for scheduling an N hour job. This section investigates the effect of
scheduling an N hour job using predictions made for the next (M x N) hours, where M is the
“interval multiplier.” M is set at 0.25, 0.5, 1, 2, 3, with 0.25 plotted as a baseline, along with
two “hybrid” multipliers. The “Comparison” (Comp or TRF-Comp) scheduler uses M=3 for jobs
less than 28 hours, and M=0.25 for longer jobs. The “Performance” (Perf or TRF-Perf) scheduler
ignores reliability on jobs less than 28 hours (instead selecting the fastest, least loaded resources);
for longer jobs, it uses M=0.25.
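One reading of the multiplier rules described above, as a sketch (the 28-hour threshold and M values come from the text; the function shape is an assumption):

```python
def prediction_interval(job_hours, scheduler="comp"):
    """Length of the interval the predictor is asked about when
    scheduling a job of job_hours. Comp uses M=3 for jobs under 28
    hours and M=0.25 for longer ones. Perf ignores reliability for
    short jobs (returning None: pick the fastest, least loaded
    resource instead); for longer jobs it uses M=0.25."""
    if scheduler == "comp":
        m = 3.0 if job_hours < 28 else 0.25
        return m * job_hours
    if scheduler == "perf":
        return None if job_hours < 28 else 0.25 * job_hours
    raise ValueError(scheduler)
```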
Figure 5.4 shows that larger multipliers perform better for jobs up to 28 hours in duration. How-
ever for longer jobs, making predictions for intervals longer than the application runtime increases
the number of evictions and the average job makespan. In fact, the longer the job, the smaller
the optimal multiplier. This observation motivated the design of the hybrid scheduler, which does
well in keeping evictions low, through 80-hour jobs, and makespan low, especially for longer jobs.
For example, for short jobs of length 6 hours, M=0.25 produces 1173 evictions, M=1 produces 878
evictions and M=3 produces 754 evictions. For 84 hour jobs, M=0.25 has 80,118 evictions, M=1
has 84,662 and M=3 has 87,433. These figures support the conclusion that the longer the job, the
shorter the optimal prediction length multiplier should be.
[Plots: average job makespan (hours) and evictions vs. number of days analyzed (Ren MTTF); average job makespan (hours) and evictions vs. prediction length weight (TRF)]
Figure 5.5: TRF and Ren MTTF show similar trends for both makespan and evictions. For any particular makespan, TRF causes fewer evictions, and for any number of evictions, TRF achieves improved makespan.
5.4 Multi-State Prediction Based Scheduling
This section compares the proposed predictor and scheduler combination (the PPS scheduler utiliz-
ing the TRF predictor, PPS-TRF) with Ren’s multi-state prediction based scheduler (described in
Section 2.3) [56], in terms of the ability to tradeoff reliability and performance. To isolate the effect
of the predictor from that of the scheduler, PPS-TRF is compared with Ren’s predictor using the
PPS scheduler.
5.4.1 Multi-State Prediction Based Scheduling
This section explores the Transitional Recent-hours Freshness-weighted (TRF) predictor using the
S3 prediction-based scheduler (Section 5.3.1) and Ren’s MTTF scheduler (see Section 2.3 for a
full description of Ren MTTF). The number of days that Ren’s predictor uses is varied, while
simultaneously varying TRF’s interval multiplier (Section 5.3.2) for comparison. Both of these
parameters allow each scheduler to trade off reliability and performance.
Figure 5.5 shows the average makespan and number of evictions obtained by TRF and Ren
MTTF as the interval multiplier varies for TRF and the number of days analyzed varies for Ren
[Plots: average job makespan and evictions vs. prediction length weight; series: Ren 1, 2, 4, 8, 12, 16, and TRF]
Figure 5.6: TRF causes fewer evictions than Ren (coupled with the PPS scheduler) for all Prediction Length Weights. TRF also produces a shorter average job makespan than all Ren predictors, except the one day predictor (which produces 45% more evictions).
MTTF. Varying each parameter allows that scheduler to trade off performance (lower makespan)
for reliability (fewer evictions). In selecting a point for Ren MTTF on one curve, however, TRF
does better in terms of the other metric. For example, for approximately 1500 evictions, TRF has
average makespan that is 27% lower. And for makespan of 13 hours, TRF has 52% fewer evictions.
5.4.2 Predictor Quality vs. Scheduling Result
To better understand how predictors affect scheduling, PPS is tested with both the TRF predictor
and with Ren N-Day [55][54] for a variety of interval multipliers (Prediction Length Weights). 6000
jobs are simulated ranging from 5 minutes to 25 hours in duration, 25% of which are checkpointable.
The Ren predictor’s N value is varied, as is the prediction length weights (i.e. multipliers, as outlined
in Section 5.3.2). The weights are varied between 1 and 7, and follow the simulation setup explained
in Section 5.1. The PPS scheduler with the S3 resource scoring approach is used for resource selection
(Section 5.3.1).
Figure 5.6 illustrates the percentage difference in makespan and evictions relative to TRF. For all multiplication
factors, TRF produces the fewest job evictions. Both predictors produce the fewest evictions with
the 7x multiplier; Ren 2-Day produces 1,340 evictions and TRF produces 1,103 (21% fewer). For the
1x multiplier, Ren 12-Day (1,534 evictions) comes closest to TRF’s 1,420 evictions (within 8%). For
makespan, Ren 1-Day beats TRF by at most 18% lower makespan. However, Ren 1-Day sacrifices
reliability and produces about 45% more evictions. Thus, TRF reduces the number of evicted jobs
and can obtain the highest job reliability. Ren 1-Day coupled with the PPS scheduler can obtain a
[Plots: makespan and evictions vs. job length (hours); top panels compare TRF-Comp, TRF-Perf, Ren MTTF-1/4/8/16, History Counter, and Sliding Window; bottom panels compare TRF-Comp, TRF-Perf, Random, CPU-Speed, Pseudo-optimal, and S0]
Figure 5.7: TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for jobs up to 100 hours long.
shorter makespan than TRF, but only with a large loss in job reliability.
5.4.3 Scheduling Performance Comparison
This section further compares the best performing scheduling approaches, TRF-Comp and TRF-
Perf (using the S3 resource scoring approach), to other scheduling methods. To more thoroughly
understand the characteristics of these schedulers in a variety of conditions, tests are performed
with a diverse set of job lengths. Results are reported for simulating 6,000 jobs, 25% of which are
checkpointable, over the six month Notre Dame trace. The jobs range from five minutes to the job
length indicated on the x-axis.
Figure 5.7’s top two graphs compare the TRF schedulers to Ren-MTTF (with a variety of days
analyzed) and to the History and Sliding Window prediction-based schedulers. The graphs plot the
percentage difference in both makespan and the number of evictions, vs. TRF-Comp. The History
and Sliding Window predictors also use the Comp scheduler with the S3 resource scoring approach.
TRF-Comp maintains comparable makespan as job length increases, peaking at roughly 11%
higher than Sliding Window, History, TRF-Perf and Ren MTTF-1 for jobs of up to 40 hours in
length. For this same length, TRF-Comp achieves 60% fewer evictions than the next most reliable
scheduler, Ren MTTF-1. For all job lengths up to 40 hours, TRF-Comp achieves at least 15% fewer
[Figure 5.8 comprises four panels plotting makespan and evictions vs. job load (number of jobs): prediction-based schedulers (top) and traditional schedulers (bottom).]
Figure 5.8: TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for loads up to 40,000 jobs.
evictions when compared with the most reliable scheduler, Ren MTTF-16; average job makespan
simultaneously decreases by 20% (27.3 hours versus 32.9 hours). TRF-Comp also decreases the
number of evictions by at least 57% compared with all other schedulers, for jobs up to 6 hours long
(355 evictions versus 557). TRF-Perf comes within 1% of the shortest makespan (Sliding Window)
for shorter lengths, and achieves the shortest makespan for jobs of 80+ hours.
The bottom two graphs in Figure 5.7 compare TRF-Comp and TRF-Perf with non-prediction-
based scheduling approaches, including a Pseudo-optimal scheduler that selects the available resource
that will execute the application in the smallest execution time, without failure, based on omniscient
future knowledge of resource availability. When all machines would become unavailable before
completing the application, the Pseudo-optimal scheduler chooses the fastest available resource in
terms of MIPS speed. TRF-Comp's average job makespan tracks that of the non-optimal schedulers,
and it produces the fewest evictions for all job lengths, by a margin of at least 110%.
Figure 5.8 investigates the effect of varying the load on the system (the number of jobs). The job
length is fixed to be uniformly distributed between five minutes and 25 hours, and the number
of jobs injected into the system over the six month trace varies from 100 to 50,000. Again, the top two
graphs compare the TRF schedulers to other prediction-based schedulers, including Ren-MTTF. The
figure demonstrates that TRF-Comp produces fewer evictions across all loads except in the high
load case of 50,000 jobs. TRF-Perf ties the lowest average job makespan across all loads, improving
on TRF-Comp by approximately 10%. The bottom two graphs compare the two TRF schedulers
with traditional scheduling approaches. TRF-Comp produces the lowest number of evictions across
all loads with the exception of the Pseudo-optimal scheduler. Lastly, TRF and the other traditional
schedulers produce comparable makespan improvements across all loads.
Chapter 6
Prediction-Based Replication
Prediction-based resource selection is an important tool for dealing with resource unavailability. In
choosing the “best” resources for executing each job, the scheduler can greatly reduce the number
of job evictions and their effect. However, encountering job evictions due to unforeseen resource
volatility is virtually inevitable. Job replication can be used to reduce the detrimental effect of
resource unavailability and benefit performance in two ways. First, replication can be used to
mitigate the effects of job eviction by creating copies of jobs, so that even when one copy of a job
is evicted, one of its replicas may still complete without encountering resource unavailability.
Since an evicted job must find another resource on which to execute, possibly restarting its
execution from the beginning (for non-checkpointable jobs), replicas that do not experience resource
unavailability can reduce the job's makespan. Second, when jobs are scheduled using imperfect
resource ranking heuristics, future resource performance cannot always be anticipated. For example,
the local load on a resource may suddenly fluctuate, causing the job currently executing on that
resource to take longer to complete than anticipated at scheduling time. Replication allows a job
to execute on multiple resources, so that if one resource completes a replica before the original
copy finishes, the job obtains a shorter makespan and an earlier completion time. In these two
ways, replication can be seen as a scheduling tool that not only deals with unpredictable resource
performance but also aids in handling volatile resource availability.
One of the main difficulties with replication is the cost associated with it. Replication has two
main costs in this context. First, because there are a limited number of total resources in any
distributed system, creating replicas of jobs will reduce the available resource pool and the replicas
may take desirable resources away from waiting non-replicated jobs. This can lead to situations in
which the desirable resources (fast, available, etc.) in a system are taken by a few jobs and their
replicas, leaving only less desirable resources for future incoming jobs. Second, in situations such as
Grid economies, creating replicas can have a real world cost where a user may have to pay per job,
per process or per cycle. These two costs mean that users must be judicious in their selection of
which jobs to replicate.
This chapter therefore sets out to test the hypothesis that the TRF availability predictor (Chap-
ter 4) can help select the right jobs to replicate, and can improve overall average job makespan,
reduce the redundant operations needed for the same improved makespan, or both. The replica-
tion techniques are tested in a range of system loads and job lengths while measuring average job
makespan and the number of extra operations performed (overhead). This chapter tests two classes
of replication techniques:
• Static techniques (Section 6.2) replicate based on the characteristics of the job being scheduled
• Prediction-Based techniques (Section 6.3) consider forecasts about the future availability and
load of resources
This chapter first investigates static replication techniques which replicate based on the charac-
teristics of the job being scheduled. These techniques are then improved by incorporating availability
predictions made on the resources chosen to execute a job in order to make better job replication
decisions. Each replication technique is analyzed in terms of a new metric, efficiency. Prediction
accuracy is analyzed for its effect on the quality of the replication decisions. Lastly, Section 6.3.2
investigates system load's effect on the performance of the proposed replication techniques.
Section 6.3.3 proposes three static load adaptive replication techniques which consider overall system
load and desired metric in determining which replication strategy to utilize (Chapter 7 goes on to
introduce an improved load adaptive replication approach).
All replication experiments in this chapter use the PPS Scheduler described in Chapter 5.1,
augmented to support replication. Upon placing each job, the scheduler uses the replication policy
to determine how many replicas to make. In these replication experiments, the scheduler makes 0 or
1 replicas. When a task or its replica completes, the system terminates other copies, freeing up the
resources executing them. Schedulers whose replication policies require an availability prediction
use the TRF predictor, unless specified otherwise.
Scheduling quality is analyzed according to average job makespan, extra operations, and replica-
tion efficiency, which is defined and discussed later. Task lengths are defined in terms of the number
of operations needed to complete them; extra operations refers to the number of additional opera-
tions that the system performs for a job, including all “lost” operations due to eviction, and any
operations performed by replicas (or initially scheduled jobs) that do not ultimately finish because
some other copy of that job finished first.
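As a small illustration of this accounting (a hypothetical helper with made-up numbers; operation counts are abstract work units):

```python
def extra_operations(ops_lost_to_evictions, ops_by_losing_copies):
    """Total operations the system performs for a job beyond its own length.

    ops_lost_to_evictions: ops discarded by evictions and later redone
    ops_by_losing_copies:  ops performed by copies (replicas or the
                           original) that did not finish first
    """
    return sum(ops_lost_to_evictions) + sum(ops_by_losing_copies)

# One eviction discarded 300 ops of progress, and the losing replica had
# performed 450 ops when the winning copy finished: 750 extra ops.
print(extra_operations([300], [450]))  # -> 750
```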
6.1 Replication Experimental Setup
Replication results are based on two traces, the six month Notre Dame Condor availability trace
of 600 nodes analyzed in Section 3.2 [59] and also a trace from the SETI@home desktop Grid
system [29].
Again, for the Condor simulations (as explained in Section 5.1), the MIPS score, the load on each
processor, and the (un)availability states (included or implied by the trace data) all influence the
simulated running of the applications and their replicas. Resources are considered available if they
are running, connected, have no user present, and have a local CPU load below 30%, as these are the
default settings for the Condor system [75]. A resource may only be assigned one task for execution
at a time. Similarly, for the SETI simulations the double precision floating point speed and the
(un)availability states (implied by the trace data) influence the execution of the applications. In the
SETI trace, a machine is considered available if it is eligible for executing a Grid application. This is
in accordance with the resource owner’s settings which may dictate that a machine is only available
if the CPU is idle and the user is not present. Since the SETI trace contains 226,208 resources over
1.5 years, 600 resources are randomly chosen and the simulation executes only over the first 6 months
of the trace, to compare directly with the results obtained from the Condor trace simulations.
Applications are simulated by inserting them at random times throughout the six months, with
durations uniformly distributed between five minutes and 25 hours (this follows the same simula-
tion/scheduling methodology as defined in Section 5.1).1 A uniform distribution demonstrates how
each replication technique performs for a wide range of job lengths, without emphasizing a particular
set of lengths. 25% of the applications are checkpointable, and can take an on-demand checkpoint
in five minutes, if the user reclaims a resource, or if the local CPU load exceeds 50%. If the resource
directly transitions to an unavailable state (for example, by becoming unreachable), no checkpoint
can be taken and the job must restart from its last periodic checkpoint.2 Non-checkpointable appli-
cations must restart from the beginning of their execution when evicted regardless of the eviction
type. In both cases, an evicted job is added to the job queue and immediately rescheduled on an
available resource.
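The eviction rules above can be condensed into a small sketch (hypothetical helper; progress is tracked in minutes, and the five-minute on-demand checkpoint cost is omitted for brevity):

```python
def restart_point(progress_min, checkpointable, graceful_eviction,
                  ckpt_interval_min=360):
    """Minutes of completed work a job retains after an eviction.

    graceful_eviction: True when the resource gives warning (user reclaim,
    or local CPU load exceeding 50%), allowing an on-demand checkpoint;
    False for an abrupt transition (e.g. the resource becomes unreachable),
    which forces a fall-back to the last periodic checkpoint (Condor's
    default interval is 6 hours = 360 minutes).
    """
    if not checkpointable:
        return 0                     # restart from the beginning
    if graceful_eviction:
        return progress_min          # on-demand checkpoint preserves all work
    return (progress_min // ckpt_interval_min) * ckpt_interval_min

print(restart_point(500, True, False))   # -> 360 (last 6-hour checkpoint)
print(restart_point(500, False, True))   # -> 0
```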
The PPS scheduler, as defined in Section 5.1, is used to schedule jobs. Again, jobs are scheduled
each round (once every 3 minutes) by removing and scheduling the job at the head of the queue
until a job can no longer be scheduled. For each job, each available resource is scored and the
resource with the highest score executes the job. Resource i's score, RS(i), is computed as follows
(scoring approach S0, as defined in Section 5.3.1):

RS(i) = m_i · (1 − l_i)

where m_i is resource i's MIPS score and l_i is the CPU load currently on resource i.

1 The duration is determined by the runtime on an unloaded resource with average speed.
2 Periodic checkpoints are taken every 6 hours (Condor's default).
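As a concrete illustration, this scoring rule can be sketched as follows (a minimal sketch; the resource names, MIPS scores, and loads are hypothetical, not values from the simulator):

```python
def resource_score(mips, cpu_load):
    """S0 scoring: favor fast resources with low current CPU load."""
    return mips * (1.0 - cpu_load)

def select_resource(resources):
    """Pick the available resource with the highest S0 score.

    `resources` maps resource name -> (MIPS score, CPU load in [0, 1]).
    """
    return max(resources, key=lambda name: resource_score(*resources[name]))

# Hypothetical pool: a slower but idle machine outscores a fast, loaded one.
pool = {"fast-busy": (3000, 0.8), "slow-idle": (1200, 0.05)}
print(select_resource(pool))  # -> slow-idle (score ~1140 vs. ~600)
```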
When a job is first scheduled, the scheduler determines whether a single replica of the task should be
created, based on the current replication policy.3 If the scheduler chooses to create a replica, an
identical copy of the job is then scheduled. This replica is not considered for further replication.
When any replica of a task completes, all other replicas of the task are located and terminated,
freeing up the resources executing those replicas.
Some replication policies use an availability prediction. For this the TRF predictor, as described
in Section 4.2.1 [60], is used to forecast resource availability. TRF takes as input the expected
duration of the job and the name of a resource, and returns a percentage from 0 to 100 to represent
the predicted probability that the resource will complete the expected duration without becoming
unavailable.
In these experiments, the total number of jobs in the system is varied using 1K, 14K, 27K, and
40K total jobs over the 6 month traces, in four separate sets of simulations. This translates to 0.001,
0.13, 0.26, and 0.39 Jobs per Resource per Day, respectively. These same values were used for the
rest of the tests described in this chapter. They are referred to as the Low, Medium Low, Medium
High, and High load cases.
6.2 Static Replication Techniques
This section explores the effect that replicating jobs based on checkpointability can have on both
extra operations and on job makespan. The total number of jobs varies from the Low to the High
load cases, as defined in Section 6.1.
Figure 6.1 includes the following replication policies:
• 1x: Replicates each job exactly once. If either the main job or the replica runs on a resource
that becomes unavailable, it is rescheduled but never re-replicated. Thus, two versions of each
job are always running.
• Non-Ckpt: Replicates only jobs that are not checkpointable.
• Ckpt: Replicates only jobs that are checkpointable.4
3 Although results have been obtained for an increased number of replicas per job, this study limits the number of replicas to one per job due to space constraints.
4 This replication policy is not based on intuition, but instead serves as a useful comparison for the Non-Ckpt policy.
[Figure 6.1 plots percent makespan improvement (left) and percent extra operations (right) vs. job load for the No-Rep, 1x, Non-Ckpt, Ckpt, and 50% Probability policies.]
Figure 6.1: Checkpointability-Based Replication: makespan (left) and extra operations (right) for a variety of replication strategies, two of which are based on the checkpointability of jobs, across four different load levels.
• 50% Probability: Replicates half of the jobs, at random.
• No-Rep: Does not create any replicas
Figure 6.1 illustrates that under low loads, increasing replicas achieves more makespan improve-
ment, a benefit that falls off for higher loads because replication forces subsequent jobs to use even
less desirable resources. Replicating only non-checkpointable jobs (Non-Ckpt) improves makespan
for all but the highest load test, and at significantly less cost than 1x replication. Moreover, repli-
cating the half of the jobs that are non-checkpointable is better than replicating half of the jobs at
random, which indicates that using checkpointability for replication policies has benefit, as expected.
6.3 Prediction-based Replication
As shown in the previous section, replicating non-checkpointable jobs yields larger rewards than
replicating checkpointable jobs due to the inherent fault tolerance of checkpointable jobs. Replicating
longer jobs also yields higher improvements, although those results are not included here.
These replication strategies can be improved by identifying and replicating the jobs that are
most likely to experience resource unavailability. For this work, a resource is selected based on
its projected near-future performance (speed and load) as described in Section 6.1, and jobs are
replicated based on the probability of the job completing on that resource. Jobs that are more likely
to experience resource unavailability are replicated to decrease job makespan. For this reason, it is
important to consider makespan improvement and overhead within the same metric. This metric
quantifies how much of makespan improvement each replication policy achieves per replica created.
Replication Efficiency is defined as:
Replication Efficiency = Makespan Improvement / Replicas Per Job
where Makespan Improvement is the percentage makespan improvement over not creating any repli-
cas. As defined, increasing makespan improvement with the same number of replicas will increase
efficiency, and increasing replicas to get the same makespan improvement will decrease efficiency.
Another way of considering efficiency is to see how much makespan improvement can be achieved
with a certain number of replicas. The best replication strategies will make replicas of the “right”
jobs, and achieve more improvement for the same cost (number of replicas).
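A worked example with hypothetical numbers: a policy that improves makespan by 12% while creating 0.5 replicas per job is more efficient than one achieving 15% improvement with a full replica per job:

```python
def replication_efficiency(makespan_improvement_pct, replicas_per_job):
    """Percentage makespan improvement (vs. no replication) per replica created."""
    return makespan_improvement_pct / replicas_per_job

print(replication_efficiency(12.0, 0.5))  # -> 24.0
print(replication_efficiency(15.0, 1.0))  # -> 15.0
```

By this metric the first policy is the better choice for a frugal user, even though the second achieves the larger raw makespan improvement.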
Replication efficiency is important in any environment that applies a cost to inserting and exe-
cuting jobs on a system. One user may wish to complete a job as soon as possible (reduce makespan)
regardless of cost, whereas a frugal user may only wish to replicate if the replica created is likely to
provide significant makespan improvement; the frugal user may choose an extremely efficient repli-
cation policy. Efficiency can also be useful to administrators in choosing which replication policy is
most suited to the goals of their system. Efficient replication strategies reduce the overhead and in-
crease the throughput of the overall system by reducing the number of wasted operations performed
by replicas that do not complete or that are evicted. This section analyzes methods for using the
availability predictions to make better replication decisions in terms of both makespan improvement
and efficiency.
6.3.1 Replication Score Threshold
This section first investigates how the prediction threshold at which the system makes a replica
affects performance. The Replication Score Threshold (RST) is the predicted completion probability,
as calculated by TRF, below which the system makes a replica, and above which it does not. As
demonstrated in this section, RST can affect both makespan improvement and replication efficiency.
For example, a system with an RST of 100 replicates all jobs scheduled on resources with a predicted
completion probability below 100%.5 Additionally, the RST-L-NC strategy chooses jobs to replicate
based on their predicted completion probability (RST), their length (longer jobs are replicated), and
whether they are checkpointable (only Non-Checkpointable jobs are replicated). A job length of
greater than ten hours was chosen based on empirical analysis of replication approaches with a
varying length parameter. Non-fixed length approaches have not been investigated and are left to
future work. These results are compared with a policy (1x) that replicates each job, and a policy
5 An RST of 100 will not replicate all jobs, but only jobs whose predicted probability of completion is below 100%. If a resource has not yet become unavailable within the period of time being traced, the predictor may output a 100% probability of completion.
[Figure 6.2 plots percent makespan improvement and replication efficiency vs. RST for the Condor (top) and SETI (bottom) traces, comparing 1x, WQR-FT, RST, and RST-L-NC.]
Figure 6.2: Replication Score Threshold's effect on replication effectiveness (Low Load).
that replicates on available resources only (WQR-FT, described in Section 2.4) [7].
To summarize,
• 1x: Creates a single replica of each job
• RST: Given an availability prediction from the TRF predictor, replicates all jobs whose pre-
dicted completion probability falls below the configured RST value
• RST-L-NC: Replicates all non-checkpointable (NC) jobs that are longer than ten hours (L),
and given an availability prediction from the TRF predictor, whose predicted completion
probability falls below the configured RST value
• WQR-FT: First schedules all incoming jobs, then uses the remaining resources for replicas
by creating up to one replica of each job until either all jobs have a single replica executing or
there are no longer any available resources
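Each of the threshold-based policies above reduces to a simple per-job predicate. A minimal sketch (hypothetical function and parameter names; `completion_prob` stands in for the 0-100 value TRF returns for the chosen resource):

```python
def should_replicate(policy, completion_prob, job_hours, checkpointable,
                     rst=100, length_threshold_hours=10):
    """Decide whether to create the single replica under the given policy."""
    if policy == "1x":
        return True                       # always replicate once
    if policy == "RST":
        return completion_prob < rst      # replicate risky placements only
    if policy == "RST-L-NC":
        return (completion_prob < rst
                and job_hours > length_threshold_hours
                and not checkpointable)   # long, non-checkpointable, risky
    return False

# A 20-hour non-checkpointable job on a shaky resource gets a replica:
print(should_replicate("RST-L-NC", 55, 20, False, rst=70))  # -> True
# The same job, if checkpointable, does not:
print(should_replicate("RST-L-NC", 55, 20, True, rst=70))   # -> False
```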
Figure 6.2 plots makespan and efficiency versus RST in the “Low” load case. The top two graphs
represent the results from executing the simulation on the Condor trace and the bottom two graphs
present the results from the SETI trace.
As RST increases, more jobs are replicated as more predicted completion probabilities fall below
the RST. In the low load case, this increases makespan improvement for both RST and RST-L-NC.
The effect is more prominent in the Condor simulations, but still evident in the SETI simulations.
1x and WQR-FT achieve the highest makespan improvement in the Condor simulations but are
matched by RST in SETI.
The efficiency graphs exhibit the opposite trend. For all RST values, both RST and RST-L-NC
produce higher efficiencies than WQR-FT and 1x. In fact, RST-L-NC produces a 39.5 (Condor)
and a 47.5 (SETI) replication efficiency versus WQR-FT’s efficiency of 21 (for both SETI and
Condor). RST-L-NC’s efficiency outperforms RST due to the incorporation of job characteristics
in choosing to replicate longer, non-checkpointable jobs. The more intelligently and selectively the
scheduler chooses which jobs to replicate, the higher the achieved efficiency. Clearly, considering
both predicted completion probability and an application’s characteristics yields the most efficient
strategy. For example in SETI, RST matches the makespan improvement of WQR-FT but creates
35% fewer replicas while RST-L-NC comes within 6% of the makespan improvement of WQR-FT
while creating 67% fewer replicas. Similarly in Condor, RST-L-NC comes within 6% of the makespan
improvement of WQR-FT but creates 76.8% fewer replicas.
6.3.2 Load Adaptive Motivation
This section examines how increasing load on the system influences the replication policies’ effec-
tiveness. Figure 6.3 plots makespan and efficiency versus RST in the “Medium High” load case.
Again, the top two graphs represent the results from the Condor simulations and the bottom two
graphs represent the results from the SETI simulations. Under Medium High load, increasing RST
actually decreases makespan improvement; creating additional replicas in higher load situations
denies other jobs access to attractive resources. RST70 (RSTi, where i is a value between 0 and 100,
replicates all jobs whose predicted probability of completion is below i%), for example, produces the
largest makespan improvement for Condor, and RST40 produces the largest makespan improvement
for SETI. RST and RST-L-NC produce much higher efficiencies than either 1x or WQR-FT across
all RST values in both the Condor and SETI simulations. At this load level, RST-L-NC produces
a 30 (Condor) and a 20 (SETI) replication efficiency versus WQR-FT’s 7 (Condor) and 8.5 (SETI),
respectively. In SETI, RST-L-NC comes within 0.5% of the makespan improvement of WQR-FT
while creating 58% as many replicas and RST improves on the makespan by 2.5% and creates 13.7%
fewer replicas. For Condor, RST-L-NC improves on the makespan of WQR-FT by 1.5% while cre-
ating 72.5% fewer replicas. Again, RST-L-NC produces higher replication efficiency than RST due
[Figure 6.3 plots percent makespan improvement and replication efficiency vs. RST for the Condor (top) and SETI (bottom) traces, comparing 1x, WQR-FT, RST, and RST-L-NC.]
Figure 6.3: Replication Score Threshold's effect on replication effectiveness (Medium High Load).
to its higher replica creation selectivity. Figure 6.3 demonstrates that under a higher load, selective
replication strategies produce larger makespan improvements. Even under low load (Figure 6.2),
these more selective and efficient replication techniques result in more improvement per replica.
Figure 6.4 further investigates the effect that varying the load in the system has on the perfor-
mance achieved by the replication techniques proposed in Section 6.3.1. Figure 6.4 plots the RST
replication technique with RST values of both 100% and 40% and compares them with not replicat-
ing (No Replication) and replicating a random 20% of jobs (20% Replication) only for the Condor
case due to space constraints (although the result holds for the SETI case). As load increases, the
makespan improvement and efficiency of all replication strategies initially increases slightly but then
falls as Med-High and High load levels are reached. Intuitively, this is because under lower loads
replicas will not interfere by taking quality resources away from other jobs. As load increases, repli-
cas can cause new jobs to wait. Another important result is comparing the proposed techniques to
blind replication. For example, the RST40-L-NC technique creates almost exactly the same number
of replicas as randomly replicating 20% of jobs across the load levels. Given the same number of
replicas, the RST technique achieves a higher makespan improvement and efficiency across all load
levels; it is choosing the “right” jobs to replicate.
[Figure 6.4 plots percent makespan improvement, efficiency, and the number of replicas created vs. job load for the Condor trace, comparing No-Rep, RST100-L-NC, RST40-L-NC, and 20% Probability.]
Figure 6.4: Replication strategy performance across a variety of system loads.
Figure 6.4 also shows that the replication strategy that provides the highest achievable makespan
or efficiency varies based on the load of the system. For example, depending on the load level, either
RST40-L-NC or RST100-L-NC produces the largest makespan improvement.
6.3.3 Load Adaptive Techniques
Since load varies unpredictably, the replication policy should adapt dynamically. Because different
users may target different metrics, the replication policy must also vary by job. This suggests three
load adaptive replication techniques that respond to varying load levels by choosing the replication
strategy that best achieves each of three desired metrics. A load measurement system counts the
number of jobs submitted to the system in the past day. This value then determines the current
load level and helps select a replication strategy. Low load is defined as fewer than 40 jobs per
day, Med-Low load as fewer than 115 jobs per day, Med-High load as fewer than 200 jobs per day
and High load as 200 or more jobs a day. RST-based strategies consider different combinations of
checkpointability and job length depending on the current system load and desired metric.
• Performance: Reduce average job makespan across loads
– Low load: 4x - Create 4 replicas of each job
– Med-Low load: RST100 - Replicate a job if the completion probability is less than 100%
[Figure 6.5 plots percent makespan improvement and efficiency vs. job load for the Condor (top) and SETI (bottom) traces, comparing No-Rep, WQR-FT, Performance-Rep, Efficiency-Rep, and Compromise-Rep.]
Figure 6.5: Load adaptive replication techniques.
– Med-High load: RST40-NC - Replicate a job if the completion probability is less than
40% and the job is non-checkpointable
– High load: No replication
• Efficiency: Increase replication efficiency across loads
– Low, Med-Low, Med-High: RST40-L-NC - Replicate a job if the completion probability
is less than 40%, the job is non-checkpointable and the length is over 10 hours
– High load: No replication
• Compromise: Compromise between average job makespan and efficiency across loads
– Low load: RST100 - Replicate a job if the completion probability is less than 100%
– Med-Low load: RST100 - Replicate a job if the completion probability is less than 100%
– Med-High load: RST40-NC - Replicate a job if the completion probability is less than
40% and the job is non-checkpointable
– High load: No replication
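Taken together, the three techniques amount to a lookup from the measured load level and the user's target metric to a replication strategy. A minimal sketch, using the load thresholds and strategy tables above (strategy names are the labels from the text):

```python
def load_level(jobs_last_24h):
    """Map the past-day job count to the load levels of Section 6.3.3."""
    if jobs_last_24h < 40:
        return "Low"
    if jobs_last_24h < 115:
        return "Med-Low"
    if jobs_last_24h < 200:
        return "Med-High"
    return "High"

POLICY = {
    "Performance": {"Low": "4x", "Med-Low": "RST100",
                    "Med-High": "RST40-NC", "High": "No-Rep"},
    "Efficiency":  {"Low": "RST40-L-NC", "Med-Low": "RST40-L-NC",
                    "Med-High": "RST40-L-NC", "High": "No-Rep"},
    "Compromise":  {"Low": "RST100", "Med-Low": "RST100",
                    "Med-High": "RST40-NC", "High": "No-Rep"},
}

def choose_strategy(metric, jobs_last_24h):
    """Select the replication strategy for the current load and metric."""
    return POLICY[metric][load_level(jobs_last_24h)]

print(choose_strategy("Performance", 30))   # -> 4x
print(choose_strategy("Efficiency", 150))   # -> RST40-L-NC
print(choose_strategy("Compromise", 250))   # -> No-Rep
```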
Figure 6.5 explores the results of the three proposed load adaptive replication policies across
varying load levels and compares their results to the closest related work replication strategy,
WQR-FT. Again, the top two graphs represent the results from the Condor simulations and the bottom
two illustrate the SETI results. For both cases, all of the load adaptive techniques produce the
largest (or extremely close to the largest) achievable value for their chosen metric at each given
load level with the exception of the high load case for SETI. The Efficiency technique produces the
largest efficiency across all load levels (except in the high load case in SETI) while simultaneously
producing the smallest makespan improvement. The Performance technique produces the opposite
trend. The Compromise technique produces an intermediate makespan improvement and efficiency
across all loads. The results demonstrate that these replication techniques can determine the replication
strategy that is best suited to achieving the desired metric at various loads.
Compared to WQR-FT, all three load adaptive techniques produce higher replication efficiencies
across all loads except for the Performance technique in the low load case and the high load case in
SETI. The Efficiency technique produces an average replication efficiency increase of 10.8 across all
loads with a maximum increase of 22 compared with WQR-FT. Similarly, the Efficiency technique
produces at most an increase of 25 in efficiency for the Condor simulations and an average increase
of 15 across all loads compared with WQR-FT. The Performance technique beats the makespan
improvement achieved by WQR-FT by an average of 1.27% across all loads in the Condor simulations
while simultaneously achieving a higher efficiency in all but the low load case. The Performance
technique also improves on the makespan improvement of WQR-FT in all but the high load case
in SETI. The Compromise technique improves or matches the efficiency and makespan achieved by
WQR-FT across all loads with the exception of the high load case in the SETI simulations.
Chapter 7
Load Adaptive Replication
The previous chapter investigated using availability predictors to decide which jobs to replicate. The
goal of this prediction-based replication approach is to replicate those jobs which are most likely
to experience resource unavailability. These jobs gain the most from the fault tolerance that
replication provides and thus produce the largest benefit to the system. As seen in Chapter 6, it is
especially important to choose the “right” jobs to replicate because of the costs associated with
replication. As seen in Section 6.3.2, the load on the system has a large effect on the performance of a
replication strategy. In particular, as load increases, jobs are competing for fewer and fewer resources.
Given this lack of available resources under higher loads, replication strategies must become even
more selective about which jobs are replicated. This is in contrast to lower load situations which
can benefit from more aggressive replication policies.
Section 6.3.3 proposed three static load adaptive replication policies. These policies measure the
load on the system in the past 24 hours and choose the replication policies appropriate for that
load out of a set of four replication policies. The major drawback to this approach is that these
replication policies are chosen statically ahead of time based on testing a range of policies with a
given resource availability trace and workload. The “best” performing policy is then chosen for that
load level range. Unfortunately, this means the load-adaptive approach is tailored to that specific
availability and workload. The four strategies are statically chosen. As shown in this chapter, this
strategy does not handle new workloads or new resource environments well. An additional issue is
that only having four load levels means a relatively high granularity of load adaption. This large
granularity is not able to account for the wide range of loads within each granularity level.
This chapter proposes a new load adaptive approach which addresses these issues and delivers
improved performance. The chapter is organized as follows. Section 7.2 investigates the effect that
varying system load has on the performance achieved by the prediction-based replication techniques
proposed in the previous chapter and motivates the need for load-adaptive replication. Section 7.3
introduces a load adaptive replication approach, Sliding Replication (SR), and tests various
configurations for SR in Sections 7.3.1, 7.3.2 and 7.3.3. Lastly, Section 7.4 tests SR against the
load adaptive approach proposed in Section 6.3.3 and then against WQR-FT using two real world
load traces.
7.1 Load Adaptive Experimental Setup
The same simulation methodology described in Section 5.1 is used for studying these load adaptive
replication policies. This methodology dictates how resources are considered available based on the
availability trace. Section 6.1’s experimental setup is used, including the Prediction Product
Score (PPS) scheduler with the following resource ranking function:
RS(i) = m_i · (1 − l_i)
where m_i is resource i’s MIPS score and l_i is the CPU load currently on resource i. When a job is first
scheduled, the scheduler determines if a single replica of the task should be created based on the
current replication policy. The Condor and SETI resource availability traces are used to dictate
when and how resources become unavailable.
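As a sketch, the ranking function above can be written directly; the function and variable names below are illustrative, not taken from the simulator:

```python
def resource_score(mips, cpu_load):
    """RS(i) = m_i * (1 - l_i): a resource's raw MIPS rating scaled by
    the fraction of its CPU not already occupied by local load."""
    return mips * (1.0 - cpu_load)

# A fast but half-loaded machine can rank below a slower idle one:
# resource_score(2000, 0.5) -> 1000.0, resource_score(1200, 0.0) -> 1200.0
```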
These experiments are driven by real world workload traces from the Grid Workload Archive (GWA) [27],
an online repository of 11 distributed system workload traces including TeraGrid, NorduGrid, LCG,
Sharcnet, Das-2, Auvergrid and Grid5000. The Grid5000 and NorduGrid
traces are used in the load-adaptive replication simulations. Grid5000 consists of between 10,000
and 25,000 system cores with over 1 million jobs. NorduGrid consists of over 25,000 system cores
with between 500,000 and 1 million total job submissions. NorduGrid therefore represents a lower
overall load level.
7.2 Prediction-based Replication
Section 6.3.1 describes a replication strategy that utilizes predictions to choose which jobs to repli-
cate [62]: those that are most likely to experience resource unavailability. Again, if a job’s
likelihood of failure exceeds some threshold, it is replicated. The Replication Score Threshold (RST)
is the predicted completion probability, as calculated by the TRF predictor (Section 4.2.1), below
which the system makes a replica, and above which it does not. As shown in the last chapter, RST
affects both makespan improvement and replication efficiency.
This section motivates the need for load aware replication strategies and investigates the effect
of varying RST. A pool of synthetic jobs is simulated by inserting each job at a random time
throughout the six month Condor trace, with application durations uniformly distributed between
five minutes and 24 hours using the simulation methodology described in Section 5.1. The number
of jobs is varied according to the “Low” (1000 jobs) and “Med-High” loads (27,000 jobs) described
in Section 6.1.¹ The goal is to determine the effect that system load and RST have on replication
performance.
Figure 7.1 plots makespan and efficiency versus RST for “Low” and “Med-High” load. Under
“Low” load, as RST increases, more jobs are replicated because more predicted completion proba-
bilities fall below the RST; this increases makespan improvement. The efficiency graphs exhibit the
opposite trend: as RST increases, each extra job replica contributes slightly less to makespan improve-
ment, leading to lower efficiency.
Under Med-High load, increasing RST actually decreases makespan improvement; creating
additional replicas in higher load situations denies other jobs access to attractive resources. RST-i,
where i is a value between 0 and 100, replicates all jobs whose predicted probability of completion
is below i%; RST70, for example, produces the largest makespan improvement for Condor.
Efficiency decreases as RST increases, but with a sharper slope than in the Low load case, indicating
that the additional replicas decrease efficiency more under the higher load.
Figure 7.1 illustrates that the replication strategy that creates the largest makespan improvement
varies with the load on the system. As load increases, excess replicas steal quality resources from non-
replicated jobs; schedulers should become more selective about which jobs to replicate. Figure 7.1
shows that varying RST can be used as a tool to determine the selectivity of the replication policy.
Higher RST values translate to less selectivity, and lower RST values mean more selectivity (fewer
jobs being replicated). This idea is used later in developing a load adaptive replication approach.
7.3 Load Adaptive Replication
Sliding Replication (SR) is a load-adaptive replication approach that uses the current load on the
system to influence its replication strategy choice. One simple load adaptive replication approach is
to directly translate load to a probability of replication. A function takes system load as input and
produces output that dictates the probability of replication. The Div-0.7 load adaptive replication
approach calculates the probability of replication (RP) with the following equation:
¹ “Med-Low” and “High” load scenarios are not included in this study for clarity.
[Figure 7.1 panels: Makespan Improv. vs. RST (Condor) and Efficiency vs. RST (Condor), with series RST: Low Load and RST: Med-High Load.]
Figure 7.1: Replication Score Threshold’s effect on replication effectiveness across various loads
[Figure 7.2 panels: Makespan Improv. vs. Job Load (Condor) and Efficiency vs. Job Load (Condor), with series Div-0.7, Thresh-0.3, and Thresh-0.4.]
Figure 7.2: Load adaptive replication with random replication probability
Figure 7.3: Sliding Replication’s load adaptive approach
RP = (1 − (JDR/0.7)) · 100
where JDR is the number of jobs per day per available resource. JDR is calculated by examining the
most recent 24 hours of behavior. The static value 0.7 represents a typical upper bound on the JDR
measurement for a high load scenario. Div-0.7 is compared with Thresh-0.3 and Thresh-0.4 which
replicate all jobs if the JDR measurement is below 0.3 or 0.4, respectively. Figure 7.2 analyzes the
performance of these three approaches in terms of makespan improvement and efficiency using the
Condor resource availability trace paired with the Grid5000 job workload trace. Importantly, the
Grid5000 workload does vary load throughout its traced time in terms of how many jobs are injected
into the system, but this does not affect the overall load the trace dictates onto the system. To
simulate a range from lower overall system loads to higher system loads, the simulation chooses a
different percentage of jobs from the trace to execute in relation to the desired load level. K% means
that the simulator keeps a random K% of jobs from the trace. In Figure 7.2, Thresh-0.4 produces
the highest makespan improvement while also producing the highest efficiency for the two higher
load scenarios. Thresh-0.3 produces the highest efficiency for the two lower load scenarios.
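These policies can be sketched as follows, under the assumption that JDR is measured externally over the trailing 24 hours (function names are illustrative):

```python
def div_replication_probability(jdr, cap=0.7):
    """Div-0.7: RP = (1 - JDR/0.7) * 100, clamped to [0, 100]. Low load
    (small JDR) yields a high replication probability; load at or above
    the 0.7 cap yields none."""
    return max(0.0, min(100.0, (1.0 - jdr / cap) * 100.0))

def thresh_replicate(jdr, threshold):
    """Thresh-0.3 / Thresh-0.4: replicate every job while the measured
    JDR is below the threshold, otherwise replicate nothing."""
    return jdr < threshold
```

Div-0.7 degrades replication smoothly with load, while the Thresh policies switch replication off entirely past their cutoff.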
However, these load adaptive replication strategies only choose how many jobs to replicate given
the current load level. These results can be further improved if the replication strategy also chooses
the best jobs to replicate. For example, for a given load, replicating the jobs that will fail will result
in higher gains than replicating jobs that will run to completion without replication. This is the
motivation behind the Sliding Replication (SR) approach. SR not only attempts to regulate how
many replicas are produced given the current load level, but given that number of replicas, it seeks
to choose the proper subset of jobs to replicate, attempting to choose those jobs that are most likely
to fail.
SR employs prediction based replication strategies that replicate all jobs whose predicted likeli-
hood of failure exceeds a certain threshold. Since a single failure threshold will replicate a similar
percentage of jobs across most loads, and since load-adaptive replication approaches seek to vary
the percentage of jobs replicated with system load, SR first uses load to determine which strategy
to use to decide whether to replicate. This choice of strategy influences both the number of replicas
and which jobs get replicated.
Figure 7.3 illustrates the Sliding Replication strategy’s methodology. Sliding Replication uses an
ordered list of replication strategies. The current load is passed to an Indexing Function to deter-
mine an index into that Ordered Replication Strategy List. The index corresponds to a particular
replication strategy which is used to determine if the job is replicated.
7.3.1 Replication Strategy List
This section investigates various ordered lists of replication strategies. As shown in Figure 7.1,
higher system loads require more selective replication strategies. The first intuitive solution is to
propose an ordered list of replication strategies called 21, in which the strategies are ordered from the
most aggressive (always replicate) to the most selective (only make a replica of a job if its completion
probability is below 40%, it is non-checkpointable, and its duration is over 10 hours). This list is
drawn from the replication strategies previously described in Chapter 6 [62]. The indexed list is as follows:
• Index 0: 1x: Creates a single replica of the job
• Index 1-7: RST: Given an availability prediction from the TRF predictor, replicates all jobs
whose predicted completion probability falls below 100% (index 1) to 40% (index 7)
• Index 8-14: RST-NC: Replicates all non-checkpointable (NC) jobs whose completion prob-
ability (as predicted by TRF) falls below 100% (index 8) to 40% (index 14)
• Index 15-21: RST-L-NC: Replicates all non-checkpointable (NC) jobs that are longer than
ten hours (L), and given an availability prediction from the TRF predictor, whose predicted
completion probability falls below 100% (index 15) to 40% (index 21)
• Index > 21: 0x: Never replicates
21 is compared with three other ordered lists:
• 21_rev: Uses the same list as 21 except in reverse order (most selective to most aggressive)
• 7: Uses the same list as 21 except only the first 7 indexes. Any index larger than 7 indicates
the job will not be replicated.
[Figure 7.4 panels: Average Job Makespan vs. Replication List and Evictions vs. Replication List, comparing the 21, 21_rev, 7, and 21_rand lists.]
Figure 7.4: Sliding Replication list comparison
• 21_rand: Has 21 indexes but uses probabilistic replication. The probability that each replication
strategy in the 21 list has of creating a replica is measured; then, for the index chosen, the strategy
randomly creates replicas with a probability equal to that of the corresponding strategy in the
21 list. This isolates choosing the “right” jobs to replicate from merely choosing the right number
of jobs to replicate at each load.
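The indexed 21 list can be condensed into a single decision function, sketched below (the thresholds step down by 10 points per index within each band of seven; all names are illustrative):

```python
def should_replicate(index, completion_prob, checkpointable, duration_hours):
    """Replication decision for one job under the '21' ordered strategy list.
    Index 0 always replicates (1x); indices 1-7 are RST thresholds stepping
    from 100% down to 40%; 8-14 (RST-NC) additionally require the job to be
    non-checkpointable; 15-21 (RST-L-NC) also require duration > 10 hours;
    any index past 21 never replicates (0x)."""
    if index <= 0:
        return True
    if index > 21:
        return False
    band = (index - 1) // 7                    # 0: RST, 1: RST-NC, 2: RST-L-NC
    threshold = 100 - 10 * ((index - 1) % 7)   # 100, 90, ..., 40
    if band >= 1 and checkpointable:
        return False
    if band == 2 and duration_hours <= 10:
        return False
    return completion_prob < threshold
```

Higher indices are generally more selective, which is what lets an indexing function translate system load into replication selectivity.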
Figure 7.4 demonstrates the performance of all four ordered replication strategy lists for average
job makespan and number of evictions using the Condor availability trace and 100% of the jobs from
the Grid5000 workload. The 21 list strategy produces the fewest job evictions and lowest average
job makespan. Beating 21_rev demonstrates that making fewer replicas under higher loads is better,
and beating 21_rand demonstrates that the list of policies is making replicas of the right jobs.
7.3.2 Replication Index Function
This section investigates varying the function used by SR to alter the condition under which indices
are chosen. SR can use both linear and exponential functions when calculating these indices. The
exponential function (Exp-X) calculates an index into the ordered list of strategies as follows:
Index = (JDR · 10)^X
X determines the strategy’s sensitivity to system load. Smaller values of X result in smaller indices
and therefore more replicas. The linear function (Lin-X) is:
Index = (JDR · 10) · X
Again, lower X values result in lower indices and more replicas.
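Both indexing functions can be sketched as follows (truncating the result to an integer index is an assumption; the rounding rule is not specified above):

```python
def exp_index(jdr, x):
    """Exp-X: Index = (JDR * 10) ** X. Larger X maps the same load to a
    deeper (more selective) position in the strategy list."""
    return int((jdr * 10) ** x)

def lin_index(jdr, x):
    """Lin-X: Index = (JDR * 10) * X."""
    return int((jdr * 10) * x)
```

For instance, at JDR = 0.4, Exp-2.2 already lands at index 21 (the most selective strategy), while Exp-1.1 stays near index 4.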
[Figure 7.5 panels: Makespan Improv. vs. Job Load and Efficiency vs. Job Load, for Exp-1.1/1.6/2.2 and Lin-0.75/1/1.25/1.5.]
Figure 7.5: Sliding Replication function comparison
Figure 7.5 compares linear indexing (Lin-X) in the top two graphs with exponential indexing
(Exp-X) in the bottom two, for the Condor resource availability trace paired with the Grid5000
job workload trace. The percentage of jobs injected from the Grid5000 workload is varied from
25% to 100%. The X parameter is also varied in both equations. For both functions, larger values
of X produce higher replication efficiencies, especially at higher loads. The exponential expression
produces similar but slightly higher replication efficiency when compared with the linear function.
7.3.3 SR Parameter Analysis
This section analyzes how the X parameter in the exponential index expression affects which index,
and therefore, which replication policies are chosen. The simulation executes the Condor trace
coupled with the Grid5000 workload. The X parameter is varied among 1.1, 1.6 and 2.2 to see which
replication policies (indices) are chosen throughout the simulation.
Figure 7.6 plots the frequency with which each value of the X parameter selects each index
and how many replicas are created at each index. Values over 21 result in no replicas and are not
shown. The only unexpected result is index 8 (RST-40) creating a larger number of replicas than
the surrounding strategies.
[Figure 7.6 panels: Frequency Count vs. Index and Replicas Created vs. Index, for Exp-1.1, Exp-1.6, and Exp-2.2.]
Figure 7.6: Sliding Replication index analysis
7.4 Sliding Replication Performance
This section compares Sliding Replication with other replication strategies. Section 7.4.1 compares
SR against the prediction-based load-adaptive replication approach Compromise-Rep, proposed in
Section 6.3.3. Section 7.4.2 then compares SR with WQR-FT using two real world workload traces.
7.4.1 Prediction-based Replication Comparison
This section compares the SR strategy with the previously proposed replication strategy Compromise-
Rep as defined in Section 6.3.3 [62]. Recall that Compromise-Rep measures the number of jobs in the
last day; if the number of jobs is less than 115, it replicates a job if its completion probability
is lower than 100%. If the number of jobs is greater than or equal to 115 but less than 200, it creates
a replica if the job’s completion probability is less than 40% and the job is non-checkpointable. Any
more than 200 jobs per day and Compromise-Rep will not create a replica. To compare these two
strategies, again the Condor availability trace coupled with the Grid5000 workload is used. Again,
to simulate a range from lower overall system loads to higher system loads, different percentages of
jobs are used from the Grid5000 trace in relation to the desired load level. K% means that a random
K% of jobs are kept from the trace.
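Compromise-Rep's static thresholds can be sketched as a simple decision function (illustrative names; the job count is measured over the trailing 24 hours):

```python
def compromise_rep(jobs_last_day, completion_prob, checkpointable):
    """Compromise-Rep: choose one of three static policies from the measured
    daily job count. Under 115 jobs/day, replicate anything not certain to
    finish; from 115 to 199, replicate only non-checkpointable jobs whose
    predicted completion probability is below 40%; at 200+, never replicate."""
    if jobs_last_day < 115:
        return completion_prob < 100
    if jobs_last_day < 200:
        return completion_prob < 40 and not checkpointable
    return False
```

The hard 115/200 cutoffs are exactly what SR's sliding index replaces with a continuous mapping from load to selectivity.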
Figure 7.7 shows the results of the comparison between the two load adaptive replication strate-
gies across a variety of loads. The left subgraph depicts makespan improvement over creating no
replicas. For makespan improvement, SR (Exp-2.2) outperforms Compromise-Rep under low loads;
under other loads the two strategies perform similarly. For replication efficiency, SR significantly
outperforms Compromise-Rep for all loads.
[Figure 7.7 panels: Makespan Improv. vs. Job Load (Condor) and Efficiency vs. Job Load (Condor), comparing Exp-2.2 and Compromise-Rep.]
Figure 7.7: Static vs. adaptive replication performance comparison
7.4.2 SR Under Various Workloads
This section analyzes SR for a variety of workloads and availability traces across various exponential
values and compares it to the WQR-FT replication strategy (defined in Section 2.4) [7]. SR is tested
using three values of X: 1.1, 1.6, and 2.2. Figure 7.8 depicts results from the Grid5000
workload under both the Condor (top two graphs) and SETI (bottom two) availability traces.
For Condor, SR produces a comparable makespan improvement to WQR-FT (2.75% on average
across the loads) while producing a 225% increase in replication efficiency on average, indicating SR
creates significantly fewer replicas to produce a similar makespan improvement. WQR-FT creates
a replica if any resource is available, an aggressive but potentially inefficient strategy that improves
makespan unless replicas leave significantly fewer quality resources for future jobs.
For SETI, SR produces slightly over 1% makespan improvement over WQR-FT for all load levels
while simultaneously increasing replication efficiency by an average of 547%. Increasing the value
of X increases the replication efficiency in both the Condor and SETI simulations while slightly
decreasing makespan improvement across all load levels.
Figure 7.9 depicts results from the NorduGrid workload. Again, the top two graphs represent
results from the Condor trace and the bottom two from SETI. SR produces a higher makespan
improvement and replication efficiency in all but the SETI lower load case.
[Figure 7.8 panels: Makespan Improv. vs. Job Load and Efficiency vs. Job Load, for Condor (top) and SETI (bottom), comparing Exp-1.1, Exp-1.6, Exp-2.2, and WQR-FT.]
Figure 7.8: Sliding Replication under the Grid5000 workload
[Figure 7.9 panels: Makespan Improv. vs. Job Load and Efficiency vs. Job Load, for Condor (top) and SETI (bottom), comparing Exp-1.1, Exp-1.6, Exp-2.2, and WQR-FT.]
Figure 7.9: Sliding replication under the NorduGrid workload
Chapter 8
Multi-Functional Device Utilization
Conventional workstations and clusters are increasingly supporting distributed computing, including
applications that require distributed file systems [9][10] and parallel processing [64][34][36]. For com-
putationally intensive workloads, schedulers can characterize and predict workstation availability,
and use the information to effectively use pools of distributed idle resources as done in previous chap-
ters [64][34][54][58]. In fact, many projects have successfully harvested idle cycles from workstations
and general purpose compute clusters.
This chapter describes efforts to leverage unconventional computing environments, defined here as
collections of devices that do not comprise general purpose workstations or machines. In particular,
the aim is to leverage special purpose instruments and devices that are often deployed in great
numbers by corporations and academic institutions. These devices have increasingly fast processors
and large amounts of memory, making them capable of supporting high performance computing.
They are often connected by high speed networks, making them capable of receiving and sending
large data intensive jobs. Importantly, and as seen in Section 8.4, these devices often have significant
amounts of idle time (95% to 98% idle). These characteristics provide potential to federate and
harvest the unused cycles of special purpose devices for compute jobs that require high performance
platforms [57].
8.1 Multi-Functional Devices
One concrete example of an unconventional environment containing highly capable special purpose
devices is a pharmaceutical company and its multi-function document processing devices (MFDs).
MFDs, workgroup printers, and coupled servers are common in businesses, corporations, and aca-
demic institutions. All three types of institutions fulfill printing and document creation needs with
fleets of such devices, which provide ever increasing speed, functionality, and print quality.
[Figure 8.1: histogram, x-axis # of devices per floor, y-axis frequency.]
Figure 8.1: MFD count per floor at a large pharmaceutical company
To support this additional functionality—including email, scanning, faxing, copying, document
character recognition, concurrency and workflow management, and more—the devices now contain
processing power, RAM, secondary storage, and even GPUs, making them as computationally ca-
pable as general purpose servers. A single MFD, for example, includes at least a 1 GHz processor,
2 GB of memory, 80 GB HDD, Gigabit Ethernet, other processing cores and daughter boards that
function as co-processors, phone and fax capability, and mechanical components. Such a device
would currently cost approximately $25,000. This configuration compares favorably to a typical
high performance workstation or even one or more cluster nodes (the components of typical Grid
and cloud computing platforms).
Large organizations deploy a significant number of such devices, in part because to be useful they
must be located in physical proximity to users. Figure 8.1 includes the number of MFDs on each
floor of a large pharmaceutical company, a customer of Xerox Corporation. Each bar represents
the number of floors of a building that contain the number of MFDs indicated on the X axis. The
company has a total of 1655 MFDs, and over 65% of the floors in its buildings have five or more
such MFDs.
As a second example, MFDs from a business division of a large company from the computer
industry (managed by Xerox Corporation) are depicted in Figure 8.2. This organization consists of
14 floors split across several buildings.¹ The organization has a total of 80 MFDs, 54% of the floors
have five or more MFDs, and 67% of the buildings have 13 or more MFDs.
¹ A typical building’s floor is 40,000 square feet.
[Figure 8.2: histogram, x-axis # of devices per floor, y-axis frequency.]
Figure 8.2: MFD count per floor at a division of a large computer company.
8.2 Unconventional Computing Environment Challenges
Unfortunately, simply recognizing a new hardware platform and plugging in existing cycle harvesting
solutions will not work. Unconventional computing environments may differ significantly from the
environments that are normally used to support Grid jobs. These differences require new approaches
and solutions. In particular, the differences include
• characteristics of the local jobs that the environments support,
• expectations that local users have for their availability and performance,
• the effect on Grid jobs when the resources are reclaimed for local use, and
• the kinds of Grid jobs the devices can support well (when they are not processing local jobs).
These differences are addressed below.
Local jobs: Desktop and workstation availability can track human work patterns; stretches
of work keep the machines busy, with short periods corresponding to breaks. In contrast, shared
printing resources (workgroup printers and MFDs) and print servers can be available for a relatively
high percentage of time, but may be frequently interrupted to process short jobs. Moreover, these
interruptions tend to arrive in bursts, as shown later in the chapter. Figure 8.3 depicts the burstiness
of job arrivals of a sample MFD, by tracking the state of the device through one day. When the
resource is printing, it is shown in the figure as Busy; when it is processing email, faxing, or scanning
a document, it is shown as Partially Available; and when it is idle, it is shown as Available. Figure 8.3
shows that MFDs in large organizations can be available for Grid use 95% or more of the time. But
device usage patterns exhibit volatility; short high priority jobs arrive intermittently.
[Figure 8.3: timeline of MFD State (Available, Partially Available, Busy) vs. Time of day, 10am-2pm.]
Figure 8.3: Trace data from four MFDs in a large corporation. The devices are Available over 98% of the time, and requests for local use are bursty and intermittent.
Local user expectations, and the effect on Grid jobs: Workstations and cluster nodes can
often suspend Grid or cloud jobs when a local job preempts them, keeping the Grid jobs in memory
in case the resource may soon become idle again. Further, in conventional environments, the Grid
job may be able to take an on-demand checkpoint, or even migrate to another server. In the MFD
environment, by contrast, local users demand fast turnaround time for their jobs, and local jobs
typically require the full capabilities of the MFD. Therefore, when a user requests service from an MFD, any Grid jobs must immediately
be halted, without the possibility of checkpointing or migrating.
Types of jobs: Partly because of these local user expectations, MFDs are better able to support
shorter Grid jobs (e.g. tens of minutes) that can be squeezed into relatively short periods of avail-
ability. These relatively short Grid jobs may be related to printing, raster (image) processing, or
other services such as optical character recognition (OCR). These jobs are processing intensive, but
they can be divided amongst the devices in a floor or building. The jobs naturally lend themselves
to being processed in a page parallel fashion (MFD jobs typically arrive as a number of separate
pages). For example, a Grid job that is squeezed into some period of availability might represent
several pages of a long OCR job from a nearby MFD. For the OCR example, a job that may have
taken five to ten minutes might then require under two minutes if cycles from five nearby MFDs
could be harvested.
A doctor’s office is another example of an environment with non-traditional HPC requirements.
Several federated MFDs could effectively support a processing intensive job such as OCR and archiv-
ing of patient medical records. During every visit, patients fill in and sign privacy notices, prior
medical history, and other identifying information. For these to be automatically correlated to elec-
tronic medical records, OCR is necessary. At the same time, doctors’ offices print and fax frequently,
making the job traffic to MFDs highly intermittent and varied. MFDs are better suited to scanning,
handling paper forms, image related workflows, and secure prescription printing at the doctor’s of-
fice. However, as outlined above, they benefit from acting in concert for processing intensive OCR
jobs whenever a subset of devices detect a period of availability. Traditional compute clusters are
not only overkill for this domain, but they also do not handle paper workflows such as form scanning
and prescription printing.
This chapter characterizes this domain and proposes a burstiness-aware scheduling technique
designed to schedule Grid or cloud jobs to make better use of cycles that are left free in this en-
vironment. The devices also occasionally need bursts of computational power, perhaps from one
another, to perform image processing, including raster processing, image corrections and enhance-
ments, and other matrix operations that could be parallelized using known techniques [23] [10].
Sharing the burden of demanding jobs would allow fewer devices to meet the peak needs of bursty
job arrivals. Using known programming models [10] also enables leveraging the devices for general
purpose distributed computations.
This chapter investigates the viability of using MFDs as a Grid for running computationally
intensive tasks under the challenging constraints described above. Section 8.3 first defines an MFD
multi-state availability model similar to the multi-state availability model presented in Section 3.1.1.
The key difference is that this new model is geared towards the availability states of unconventional
computational resources such as MFDs whereas the previous model represents the states of tradi-
tional general purpose computational resources (e.g. workstations). Next, Section 8.4 gathers an
extensive MFD availability trace from a representative environment and analyzes it to determine
how MFDs in the real world behave in terms of the states in the proposed model. Section 8.5
proposes an online and computationally efficient test for MFD “burstiness” using the trace. This
proposed burstiness metric is used to capture the dynamic and preemptive nature of jobs encoun-
tered on the devices. The burstiness metric shows how MFDs can be used as computational resources
and what scheduling improvements can be achieved by accounting for resource burstiness in making
Grid job placements. Section 8.6.2 computes statistical confidence intervals on the lengths of Grid
jobs that would suffer a selected preemption rate. This assists in selecting Grid jobs appropriate for
[Figure 8.4 diagram: states Available to Grid, Scan/Fax/Email, Printing, and Unavailable.]
Figure 8.4: MFD Availability Model: MFDs may be completely Available for Grid jobs (when they are idle), Partially Available for Grid jobs (when they are emailing, faxing, or scanning), Busy or unavailable (when they are printing), or Down.
the bursty environment. On average across all loads depicted, the proposed scheduling technique
comes within 7.6% makespan improvement of a pseudo-optimal scheduler and produces an average
of 18.3% makespan improvement over the random scheduler.
8.3 Characterizing State Transitions
This section develops an availability model with which to categorize the types of availability and
unavailability that an unconventional computational resource such as an MFD experiences. In
particular, this section is interested in how the patterns of a core set of native MFD jobs (core
or local jobs) affect the availability of CPU cycles to a second, elastic set of Grid jobs, which are
assumed to run at a lower priority. Devices oscillate between being available and unavailable
for Grid jobs, depending on the core job load they experience from local users. MFDs can occupy
four availability states based on processing their core job load. As shown, these states can have a
significant impact on Grid jobs.
Figure 8.4 identifies the four availability states suggested by MFD activity. An Available machine
is currently running with network connectivity, and is not performing any other document related
task such as printing, scanning, faxing, emailing, character recognition or image enhancement. It
may or may not be running a Grid application in this case. A device may transition to a Partially
Available state if a local user begins to use it for scanning, faxing, or email. These tasks require
some but not all of the device’s resources. When an MFD is printing a document, it typically
requires the full capability of the machine’s memory and processing power, and so a separate Busy
state is designated for MFDs that are currently processing print jobs. Finally, if an MFD fails or
becomes unreachable, it is in the Down state. These states differentiate the types of availability and
unavailability a resource will experience.
When scheduling jobs onto a Grid of MFDs, it is best to avoid devices that are likely to become
Busy before the application has finished executing. If the Busy state is entered, the Grid job must
be preempted, losing all progress. This assumption reflects the expectations of typical
MFD users, who are highly concerned with local job turnaround time. Entering the Partially
Available (e.g. scan/fax/email) state also affects Grid jobs, which will be left with fewer cycles.
However, because functionalities such as scanning, faxing and emailing require significantly fewer
computational resources compared to printing, Grid jobs need not be immediately halted; they may
continue to execute with a significantly smaller percentage of the total CPU capacity while the user’s
scan, fax, or email job completes.
Given the different types of unavailability a device may experience, applications that attempt to
utilize MFDs as computational resources should avoid the devices most likely to enter the Busy
state, because of the large number of cycles lost when an application must restart on a different
resource. Those applications should also avoid an MFD likely to enter the Partially Available state
because of the resulting slowdown, although the consequences of entering this state are much less
severe than those of entering the Busy state.
A scheduler that is aware of each MFD’s behavior and likelihood of entering these states can
make better Grid job placement decisions by attempting to avoid the Busy state altogether and
by limiting the number of times applications are slowed down by MFDs that enter the Partially
Available state. Section 8.5 introduces a metric called burstiness, which allows schedulers to exploit
the fact that transitions into Busy and Partially Available states are typically clustered together in
time.
8.4 Trace Analysis
A 153-day MFD job trace was gathered from four MFDs located at a large enterprise company
between mid-2008 and mid-2009 and then analyzed. The trace consists of a series of time-stamped MFD jobs.
Each job includes the job type (print, scan, fax or email), job name, number of pages, submit time,
completion time and the user submitting the job. This data directly implies the Busy and Partially
Available states. The trace only identifies when jobs are submitted and completed; it therefore
precludes determining exactly when the Down state was entered. As results show, MFDs are rarely
Down, and these particular MFDs were more than 99% available, by service agreement; MFDs were
Available if they were not otherwise being used. This section analyzes this trace data to characterize
MFDs according to the states of the proposed model with the goal of identifying trends that could
enhance the scheduling of Grid jobs.
8.4.1 Overall State Behavior
To determine how much of an MFD’s time is available to Grid jobs, Figure 8.3 depicts the overall
percentage of time the four MFDs spend in each availability state, along with an example of that
availability (recall that an MFD is Available whenever it is not otherwise being used). Figure 8.3a
shows that availability is the dominant state, with
98.2% of the MFD’s overall time being spent in the Available state; just 1.8% of the time is spent in
the Busy and Partially Available states. Of the time spent in the non-available states, shown in
the right sub-graph, 85.7% is Busy and the remaining 14.3% is Partially Available. Figure 8.3 also
shows an example of the typical behavior of an MFD during mid-day on a
weekday. It is mostly available but has sporadic bursts of unavailability. Thus, MFDs are highly
available resources with a large potential for use by Grid jobs.
Although there are more than 1,200 instances of availability of more than one hour, two-thirds
of the availability periods during daytime hours are 20 minutes or less. For daytime Grid jobs, it
can therefore be difficult to find an uninterrupted availability period. Even though there are many
available time periods during the day, scheduling greedily to fill most of the availability period might
lead to frequent application failures and restarts. As a result, the failure rate gets compounded—it
includes the failure rate for restarted jobs as well. Therefore, Section 8.6.2 conducts an experiment
to determine the size of Grid jobs that the MFD pool can handle well—that is, for which the pool
can keep Grid job failure rates between 5% and 10%.
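The compounding can be made concrete: if each attempt of a job fails independently with probability p, the number of attempts follows a geometric distribution with mean 1/(1 − p). A small illustrative calculation (the probabilities here are hypothetical, not measured from the trace):

```python
def expected_attempts(p_fail):
    """Expected number of times a job must be (re)started if each attempt
    independently fails with probability p_fail (geometric distribution)."""
    if not 0 <= p_fail < 1:
        raise ValueError("p_fail must be in [0, 1)")
    return 1.0 / (1.0 - p_fail)

# At a 10% failure rate a job averages ~1.11 attempts; at 50% it averages 2.
# Keeping failure rates in the 5-10% range keeps wasted work modest.
```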
8.4.2 Transitional Behavior
This section analyzes the behavior of the MFDs in terms of when and what types of transitions they
make. State transitions can affect the execution of an application because they can trigger reduced
application performance (the Partially Available state) or even require application restart (the Busy
state). Figure 8.5 shows the number of transitions to each state by the day of the week. Transitions
to Available, Busy and Partially Available states are highest on Monday, taper off slightly by the
Figure 8.5: MFD state transitions by day of week
Figure 8.6: MFD state transitions by hour of day
end of the week, and are significantly reduced on weekends.
Figure 8.6 shows the number of transitions to each state by hour of day. The numbers of
transitions to each of the three states show clear diurnal patterns, and are similar to one another.
Transitions peak at around 10am and 2pm with a trough between the two peaks corresponding to the
lunch break. Transitions are much lower overnight. Thus, MFD utilization follows strong weekly
and diurnal patterns. These patterns lend themselves to increased predictability and scheduling
potential.
8.4.3 Durational Behavior
This section analyzes the behavior of the MFDs in terms of how long they remain in each state
relative to the day of the week and the hour of the day. Longer periods of availability would allow
Figure 8.7: MFD state duration by day of week

Figure 8.8: MFD state duration by hour of day

Figure 8.9: Number of occurrences of each state duration
longer Grid jobs to complete without interruption.
Figure 8.7 depicts the average duration of each state according to the day of the week. The
average duration of an Available state is quite low during the weekdays at approximately 1.4 hours
but becomes much larger during the weekend reaching a peak of 20 hours on Saturday. Figure 8.8
investigates the average duration of each state according to the hour of the day. As expected, the
average duration of the Available state is at its lowest in the middle of the day, since this is when
users are most likely to utilize the MFDs. The duration of the Busy state shows no such trends.
These results indicate that longer jobs are best scheduled later in the day when availability periods
are typically much longer.
Figure 8.9 shows the distribution of durations spent in each state. It plots the number of
occurrences of each duration versus the length of that duration. The most significant result shown
in Figure 8.9 is the duration of the Available state: the vast majority of availability periods are
extremely short, lasting 20 minutes or less, with a long tail indicating decreasing occurrences of
longer available durations.
8.5 A Metric for Burstiness
This section proposes a burstiness metric and an algorithm with which to determine the burstiness
of a given resource. Burstiness captures temporary surges in activity that occur sporadically in
the MFD domain. MFD use tends to occur in “bursts”: it is clustered in time and lasts for short
durations, which makes bursts difficult to detect and counter effectively. Frequent bursts impede
the completion of any long Grid job and also make scheduling short jobs difficult, because shorter
jobs are purged or restarted frequently, causing resource thrashing and wasted network and processor
cycles.
Therefore, to perform high-performance distributed Grid computing on devices that are bursty
(due to preemption by higher-priority local jobs), burstiness can be used as a form of availability
prediction. The proposed burstiness metric enables the system to schedule jobs when the burstiness
of a resource is low, greatly lowering the likelihood of job eviction.
8.5.1 Burstiness Test
The following notations are used:
At any given time tnow, let T refer to a recently elapsed time window over which burstiness is
to be estimated (T is set to 2 hours). Let µ and σ refer, respectively, to the mean and standard
deviation of the time between jobs (that is, the interarrival time) in the time period T.
To quantify burstiness at time tnow, the algorithm looks back over the period T and estimates
(a) whether the most recent block of time within T (blocks are of length µ, the average interarrival
time between jobs in T) has experienced more transitions to the device’s Busy state than other
blocks of time in T; and (b) whether the spread in the most recent block is significantly different
from that of the others in T. These comparisons detect whether the most recent block has had an
abnormally high job arrival rate, with spreads just tight enough to frequently disrupt Grid tasks.
More formally, the steps are as follows:
1. Starting from tnow, go back and divide the period T into contiguous blocks of time each of
length µ, the average interarrival time. Call each time block bp where p is the index of the
block counting backwards from tnow.
2. Let the number of job arrivals in a block bp be np.
3. Compute the mean and standard deviation of the number of job arrivals per block, across all
blocks, and refer to them as µB and σB .
4. For a given block p, the spread in jobs is given by the standard deviation σp of the time
between jobs.
5. Comparison 1: Determine whether np is more than one σB above µB (meaning there are
relatively more transitions into the unavailable state in this time block).
6. Comparison 2: Determine whether σp < σ (meaning the jobs are more clustered and less
spread out than normally observed within period T ).
8.5.2 Degree of Burstiness
If the answers to both Comparison 1 and Comparison 2 are Yes, then the block bp is considered
bursty; its degree of burstiness is defined in Equation 8.3. Otherwise, the block bp does not exhibit
burstiness. Two dimensionless quantities characterize burstiness
quantitatively. The strength S of the burst is defined as
Sp = np / µB    (8.1)
where a quantity greater than one denotes that a time block p is experiencing a burst. Spread ratio
(D) is defined as
Dp = σ / σp    (8.2)
where a quantity greater than one denotes that the time block p is experiencing a tighter cluster
than the whole time history window.
The burstiness of a window p is defined as
Bp = Sp + Dp (8.3)
Other forms of this metric could weight the two quantities in Equation 8.3 differently from one another.
One drawback of this metric is that bursty events may spill over into adjacent time blocks {bp, bp−1}
and/or {bp, bp+1}. This is addressed in future work by establishing time block combination strate-
gies.
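Putting the test of Section 8.5.1 and the quantities Sp, Dp, and Bp together, one possible sketch follows. Two assumptions here are my own, not the dissertation's: timestamps are in seconds, and blocks that are empty or contain a single job fall back to the window-wide spread σ:

```python
import statistics

def burstiness(arrival_times, t_now, T=2 * 3600):
    """B_p = S_p + D_p for the most recent block of length mu before t_now.
    arrival_times: sorted job-arrival timestamps in seconds.
    Returns 0.0 when the window is too sparse to form blocks."""
    window = [t for t in arrival_times if t_now - T <= t <= t_now]
    if len(window) < 3:
        return 0.0
    gaps = [b - a for a, b in zip(window, window[1:])]
    mu = statistics.mean(gaps)        # mean interarrival time in T
    sigma = statistics.stdev(gaps)    # spread of interarrival times in T
    if mu <= 0:
        return 0.0
    # Step 1: divide T into contiguous blocks of length mu, counting back
    # from t_now; collect n_p (arrivals) and sigma_p (spread) per block.
    n_blocks = max(1, int(T // mu))
    counts, spreads = [], []
    for p in range(n_blocks):
        lo, hi = t_now - (p + 1) * mu, t_now - p * mu
        block = [t for t in window if lo < t <= hi]
        counts.append(len(block))
        block_gaps = [b - a for a, b in zip(block, block[1:])]
        spreads.append(statistics.stdev(block_gaps)
                       if len(block_gaps) > 1 else sigma)  # fallback (mine)
    mu_B = statistics.mean(counts)            # mean arrivals per block
    n_p, sigma_p = counts[0], spreads[0]      # most recent block
    S_p = n_p / mu_B if mu_B > 0 else 0.0     # strength (Eq. 8.1)
    D_p = sigma / sigma_p if sigma_p > 0 else 0.0  # spread ratio (Eq. 8.2)
    return S_p + D_p                          # B_p (Eq. 8.3)
```

A window of evenly spaced arrivals scores low, while a tight cluster of arrivals just before t_now drives both Sp and Dp, and hence Bp, well above it.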
8.5.3 Burstiness Characterization
To quantify the burstiness of the devices, Figure 8.10 computes burstiness values at 10^7 sample
points (slightly less than one sample every 2 seconds over the trace). Figure 8.10 plots the calcu-
Figure 8.10: Burstiness versus Day of Week and Hour of Day
lated burstiness values versus both the time of the day and the day of the week for the four MFDs.
The left subgraph illustrates that the device burstiness peaks at about mid-day and dramatically falls
off early and late in the day. The right graph indicates that burstiness reaches a peak on Monday,
at which point it slowly tapers off until it falls dramatically on the weekends. These results confirm
the intuition that MFDs experience maximum burstiness on weekdays around mid-day. As shown in
the next section, the burstiness of a resource closely follows its transitional behavior. Hence it is a
good indicator of volatility and can be extremely useful in scheduling to avoid preemptions brought
on by MFD transitions to unavailable states (e.g. printing).
8.6 Burstiness-Aware Scheduling Strategies
This section discusses the viability of scheduling jobs on MFDs given the volatility they exhibit.
This section proposes a new scheduling technique that leverages the burstiness metric proposed in
Section 8.5 to make better job placement decisions. These placement decisions reduce the number of
evictions/preemptions experienced by jobs and thereby reduce the average job makespan, increasing
the overall efficiency of the system. When a preemption is encountered, the executing job will halt
and restart execution elsewhere (thereby increasing makespan).
8.6.1 Experiment Setup
To investigate the scheduling techniques proposed in this section, the following sections simulate
jobs executing on the MFDs; each MFD uses its own recorded availability in accordance with the
gathered (core job) workload trace. The simulations in this chapter insert Grid jobs (the aforementioned elastic
class) at random times in the trace such that for the overall simulation, job injection is uniformly
87
distributed for the considered period of 153 days. However, jobs are only injected between 9am
and 5pm during weekdays to reflect the types of jobs that are expected to be executed on such a
system as discussed earlier. The simulation assigns each job a duration (i.e. a number of operations
needed for completion) where the duration is an estimate based on the execution of the job on an
idle device with average CPU speed (device attributes are not unlike server attributes and are based
on published design/model specifications).
Similarly to the simulation setup in Section 5.1, the CPU speed, the load on each MFD, and
the (un)availability states (determined by the trace) influence the running of the jobs. MFDs are
considered available if they are not executing a print job (are not in the Busy state). An MFD
may only be assigned one task for execution at a time. During each simulator tick, the simulator
calculates the number of operations completed by each working MFD, and updates the records for
each executing job. If a device executing a job enters the Busy state due to the arrival of a core job,
the Grid job is preempted and put back in the queue for rescheduling.
MFDs are assumed to have no load on their CPU unless the Partially Available state is entered.
While in this state, an executing job will not be preempted, but the job imposes a load of 0.5 on
the processor. Note, however, that this state can model any portion of availability. In this case,
jobs executing on an MFD in the Partially Available state will effectively take twice as long as they
would if they were run on an MFD in the Available state.
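Under these assumptions, the simulator's per-tick progress update reduces to a simple rate adjustment (a sketch with illustrative names, not PredSim's API):

```python
def ops_completed_this_tick(ops_per_tick, state):
    """Operations a Grid job completes in one simulator tick.  The job runs
    at full speed when the MFD is Available, at half speed when it is
    Partially Available (the local task imposes a load of 0.5), and makes
    no progress when the MFD is Busy (preempted) or Down."""
    if state == "available":
        return ops_per_tick
    if state == "partially_available":
        return ops_per_tick / 2  # the job effectively takes twice as long
    return 0
```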
Section 8.6.2 tests the following scheduling strategies:
• Random: Randomly selects an MFD for execution
• Upper Bound: Selects the available MFD that will execute the job in the smallest execution
time, without preemption, based on global knowledge of MFD availability (used for comparison
purposes only; such global knowledge will not usually be available). When all MFDs would
become unavailable before completing the job, the scheduler chooses the fastest available MFD.
• Burst: For each job, each MFD’s burstiness is calculated at the time the job is being scheduled.
The MFD with the smallest burstiness value is greedily selected for job execution.
• Burst+Speed: For each job, each MFD’s burstiness Bi is calculated at the time the job is
being scheduled. The algorithm greedily chooses the MFD with the highest score RSi, where
the score of MFD i is calculated according to the following expression:
Figure 8.11: Comparison of MFD schedulers as Grid job length increases
RSi = ((100 − Bi) / 100) · (CPUmax / CPUi)    (8.4)
where Bi is MFD i’s burstiness, CPUi is its CPU speed, and CPUmax is the highest CPU
speed of all resources (for normalization).
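Transcribing Equation 8.4 directly, the greedy Burst+Speed selection might be sketched as follows (the tuple format and function name are illustrative assumptions of mine):

```python
def burst_speed_select(mfds):
    """Greedily pick the MFD with the highest score RS_i per Equation 8.4.
    `mfds` is a list of (name, burstiness, cpu_speed) tuples."""
    cpu_max = max(speed for _, _, speed in mfds)  # for normalization
    def score(entry):
        _, b_i, cpu_i = entry
        return (100 - b_i) / 100 * (cpu_max / cpu_i)
    return max(mfds, key=score)[0]

# Between two equally fast MFDs, the less bursty one scores higher.
```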
Random is a scheduler with no information and represents the least sophisticated scheduling ap-
proach (a lower bound). Upper Bound is a scheduler with global knowledge and represents an upper
bound on possible scheduling results (maximum makespan improvement with the fewest number of
preemptions). Both Burst and Burst+Speed are greedy schedulers: Burst takes only MFD volatility
into account, whereas Burst+Speed considers both MFD volatility and speed.
8.6.2 Empirical Evaluation
This section investigates the performance of the proposed scheduling strategies with two types of
experiments.
Impact of (Grid) Job Length
Figure 8.11 illustrates the performance of the scheduling strategies as average job length varies. Job
length is the time required to complete the job on a machine with an average dedicated CPU. 1,000
jobs are injected randomly and uniformly over the trace while varying the average length of the jobs.
For each simulation, the four schedulers are given an identical set of jobs to schedule. In the trace
with the shortest jobs, job lengths vary from five to 30 minutes (average length of 17.5 minutes). In
the trace with the longest jobs, job lengths vary from five minutes to six hours (average length of
3.04 hours). Intermediate average lengths varied jobs between five minutes and 76, 147, 218, and
289 minutes, respectively.
89
Figure 8.11’s left subgraph plots the percentage makespan improvement over the random sched-
uler versus the average job length. Figure 8.11 demonstrates that simply utilizing the burstiness
metric (Burst) to consider the volatility of resources and avoid preemptions greatly reduces the av-
erage job makespan by 24% for the shorter average job length case and 3.6% as average job length
increases to 3 hours. Considering volatility and resource speed when scheduling with Burst+Speed
further reduces the average job makespan by 36.6% for the shorter average job length case and 7.5%
as average job length increases to three hours. Burst+Speed comes within an average of 7.6% of
Upper Bound’s makespan improvement across all job lengths; it comes within 13.7% for shorter jobs
gradually improving to within 4.3% for longer jobs. Burst+Speed produces an average makespan
improvement of 18.3% over the random scheduler. This indicates that the burstiness metric closely
captures the true volatility of the devices and allows schedulers to choose MFDs that are far less
likely to experience print preemptions, greatly increasing the efficiency of the system by lowering
average job makespan. The overall trend for all schedulers is that as the average job length increases,
the possible makespan improvement over random scheduling lessens. This is because longer jobs are
simply more likely to fail and given the volatility of the devices in general, even an ideal scheduler
such as Upper Bound cannot avoid preemptions for these longer jobs.
Figure 8.11’s right subgraph plots the number of preemptions (jobs interrupted by the MFD
encountering a print job and being forced off the MFD) versus the average job length. As average
job length increases, all scheduling strategies naturally encounter more job preemptions. Note the
direct correlation between preemptions encountered and average job makespan—fewer preemptions
reduce the average job makespan. Interestingly, Burst and Burst+Speed encounter a similar number
of preemptions through the range of average job lengths with Burst+Speed encountering slightly
more. Incorporating CPU speed into the MFD selection criteria (in Burst+Speed) greatly increases
job makespan improvement (from a 3.8% improvement for longer jobs to a 12% improvement for
shorter jobs) over only considering volatility with Burst.
Figure 8.12’s left subgraph shows the maximum job length such that it will only be preempted
k% of the time. Ten replications of the aforementioned experiments were executed, and the Box-Whisker
plots in Figure 8.12a provide 95% confidence intervals. For example, using the Burst+Speed
scheduler, the diagram indicates with 95% confidence that a job of 7.5 minutes will
only fail around k = 10% of the time. Users could read the figures directly off the graph to choose
Grid job lengths that could be run on the system, given some tolerance for application failure and a
particular statistical confidence interval. Figure 8.12’s right subgraph shows how the average failure
rate varies with job length as computed from the ten experiment replications mentioned above.
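For reference, a 95% confidence interval over such replications can be computed in the standard way; the sketch below uses the normal approximation (the dissertation presents the intervals graphically via Box-Whisker plots):

```python
import math
import statistics

def ci95(samples):
    """Mean and half-width of an approximate 95% confidence interval for
    the mean of `samples` (normal approximation, z = 1.96)."""
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width
```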
Figure 8.12: Determination of Grid job lengths with 95% confidence intervals for given failure rates

Figure 8.13: Comparison of MFD schedulers as Grid job load increases
Impact of (Grid) Job Load
Figure 8.13 investigates the effect of varying the load. This test fixes the job length between five
minutes and one hour (average Grid job length of 32.5 minutes) and varies the number of jobs in-
jected over the course of the trace from 1,000 to 20,000 jobs. Again, Figure 8.13’s left subgraph plots
average job makespan improvement as the load increases using the random scheduler as a baseline.
Overall, the makespan improvement decreases as job load increases: as MFD contention increases,
eventually all MFDs become utilized under every scheduler, preventing a scheduler from choosing
an MFD that will not experience preemption. The schedulers therefore converge on the same
average job makespan. In the low load situations, Burst produces a 19.2% makespan improvement while
Burst+Speed produces a 31.4% makespan improvement compared to the Random scheduler, coming
within 13.3% of Upper Bound’s 44.7% makespan improvement. As load increases, the makespan
improvement gap between these techniques lessens, but Burst consistently produces a makespan
improvement over Random, and in turn Burst+Speed consistently produces a makespan improvement
over Burst. On average across all loads depicted, Burst+Speed comes within 4.3% makespan
improvement of the Upper Bound and produces an average of 7.11% makespan improvement over the
random scheduler, with the gap lessening as load increases.
Figure 8.13’s right subgraph plots the number of preemptions produced by each scheduling strat-
egy versus the job load on the system. As job load increases, all scheduling strategies produce more
preemptions eventually converging on roughly the same number of preemptions in the extremely
high load case. The largest gaps come at lower loads with the gap narrowing as load increases.
Again, the figure demonstrates that the number of preemptions produced by a scheduling strategy
directly influences the average job makespan of the jobs. In all job load cases, Random produces the
largest number of preemptions followed by Burst+Speed, then Burst and finally Upper Bound.
These scheduling results demonstrate that the Burst+Speed scheduling mechanism greatly re-
duces the number of preemptions experienced by jobs compared with the random scheduler and
thereby decreases the average job makespan for a variety of job lengths and job load situations.
Chapter 9
Simulator Tool
This chapter discusses the motivation and implementation of the PredSim simulation framework.
PredSim was used to obtain all the results presented throughout this work. It is a discrete-event,
multi-purpose simulator with capabilities ranging from analyzing resource availability traces to test-
ing job scheduling and availability prediction algorithms. PredSim’s primary purpose is to allow
users to easily add or remove components such as predictors and schedulers for easy comparison.
PredSim also enables researchers to use real-world resource availability and workload traces to drive
their simulations, thereby better approximating the actual behavior of a real system. In this way,
PredSim allows for fast prototyping and development of reliability-aware and fault-tolerant prediction
and scheduling techniques driven by real-world trace scenarios.
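PredSim's API is not listed here; purely to illustrate the pluggable, discrete-event design this chapter describes, a minimal core might look like the following (all names are hypothetical, not PredSim's):

```python
import heapq

class Simulator:
    """Minimal discrete-event core with swappable components, in the spirit
    of the design described above."""
    def __init__(self, scheduler, predictor):
        self.scheduler = scheduler    # pluggable scheduling policy
        self.predictor = predictor    # pluggable availability predictor
        self.now = 0.0
        self._events = []             # (time, seq, callback) min-heap
        self._seq = 0                 # tie-breaker so callbacks never compare

    def at(self, time, callback):
        """Schedule `callback(sim)` to fire at simulated `time`."""
        heapq.heappush(self._events, (time, self._seq, callback))
        self._seq += 1

    def run(self):
        """Process events in time order, e.g. job arrivals and MFD state
        transitions driven by an availability trace."""
        while self._events:
            self.now, _, cb = heapq.heappop(self._events)
            cb(self)
```

Trace-driven operation then amounts to pre-loading the event queue with the time-stamped state transitions from a recorded availability trace.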
9.1 Simulation Background
This section presents background information on simulation based research. Section 9.1.1 discusses
the state of the art and fundamental limitations and Section 9.1.2 discusses related distributed
system simulators.
9.1.1 Simulation Based Research
Simulation based results are used in a variety of fields such as construction, manufacturing, industrial
processes, environmental sciences and computer science to drive effective resource management,
production and research. Simulators are particularly useful to researchers for the following reasons.
• Scalability: Simulators, when implemented well, can scale to incorporate many more resources
and jobs than may be feasible or practical to achieve in the real world. The number of jobs
injected into the system can exceed observed system load, to test the system under more
stressful, rarely seen circumstances, which may become more common in the future.
• Ease of Deployment: The overhead of creating, distributing and setting up a real world
system can be prohibitively costly and time consuming. Simulations have the advantage of an
extremely low cost of deployment. Furthermore, testing unproven ideas in a real world system
can cause an unnecessary slowdown or loss of service to the users of the system.
• Repeatability and Consistency: As opposed to real-world situations, in simulation-based
testing a given scenario can be run repeatedly until an optimal or close-to-optimal solution is
found. Such testing and retesting allows techniques to be fine-tuned and, furthermore, provides
the demonstrable repeatability essential to the scientific process.
This allows researchers to test competing ideas in a consistent environment with common
scenarios to determine the most effective solution(s).
• Speed of Results: Many testing scenarios can take months or years to run to completion.
Testing these scenarios in the real world means that results come extremely slowly. Further-
more, it means that the turnaround time for testing, modifying the technique and retesting
becomes extremely tedious. Simulation can allow these long scenarios to be tested in minutes
or hours, giving the researcher extremely timely feedback with which to further their work.
This allows techniques to be developed several orders of magnitude faster.
• Accessibility: Simulators can easily be made available to the research community and dis-
tributed to interested parties. This is especially important given that it is difficult to have
ready access to an actual Grid system. This ease of access and distribution can allow more
collaboration between researchers on a problem.
Although simulators have many benefits, they still have limitations and weaknesses compared
with other approaches such as real world testing.
• Compatibility: Simulators are often written for the express purposes devised by their authors.
In particular, each design is often unique, and results from one simulator may not be directly
comparable with results from another.
• Bugs: Human error in the coding can cause the simulator to produce inaccurate results.
• Artificiality: The scenarios examined in simulation can be far-fetched and not representative
of any foreseeable scenario. This may lead to results being misinterpreted.
• Applicability: Even the most thorough and exhaustive simulation cannot possibly represent
the complexities of the real world. By their very nature, simulations are simplifications of
real-world situations that not only make many assumptions about how the virtual world will
operate, but also make simplifying assumptions to facilitate their development.
In the end, nothing can completely take the place of real-world testing, but simulation-based
research has proved to be an invaluable tool.
9.1.2 Distributed System Simulators
Computer Science as a field has developed a whole range of simulation based tools for use by
researchers in all of its various subfields. For example, SimOS can simulate an entire computer
including operating system and application programs to study aspects of computer architecture
and operating systems [63]. TOSSIM is a simulation based framework in which mobile sensor
devices executing the TinyOS software system can be virtually deployed to investigate ad-hoc routing
protocols and environmental data gathering techniques [38]. The range and functionality of these
simulators in quite extensive so in this section focuses on related Grid-based simulation work.
Distributed and Grid systems have evolved to become powerful platforms for solving large scale
problems in science and other related fields. As these systems become larger and more complex, the
management of the resources they comprise becomes more important. Coupled with the difficulty
in obtaining access to a large scale Grid testbed, it becomes imperative to provide researchers with
tools for devising and analyzing algorithms before they are deployed in real world environments.
With this aim, many distributed and Grid system simulators have been developed [14] [12] [73] [71].
This section summarizes the most relevant related work.
GridSim, the most closely related simulator, is a Java-based discrete-event Grid simulation toolkit
developed at the University of Melbourne to enable the study of distributed resource monitoring
and application execution [12]. GridSim was specifically designed to incorporate unavailability of
Grid resources during runtime, and provides the functionality to utilize workload traces taken from
real world environments to create a more realistic Grid simulation. GridSim also enables the user to
specify the Grid network topology complete with background traffic. GridSim is built on top of the
SimJava discrete event infrastructure and provides the modeling of resources and jobs executing on
the discrete events defined by the underlying SimJava layer. GridSim incorporates application mod-
eling and configuration, resource configuration, job management and resource brokers and schedulers
to allow for the system to be highly configurable for the user and to enable simulations which are
reflective of their real world counterparts.
SimGrid is another event-driven simulation toolkit designed to provide a testbed for the evaluation
of distributed system scheduling algorithms [14]. It enables the researcher to study application
behavior in a heterogeneous distributed environment with the key aim of rapid prototyping and evalu-
ation of different scheduling approaches. It incorporates the ability to model resource availability be-
havior based on synthetic resource and network behavior models fine-tuned with parameter-constants
or by inputting resource availability traces. Furthermore, SimGrid supports task dependencies in
the form of Directed Acyclic Graphs (DAG) jobs. DAGSim is build on top of this infrastructure
and is used to evaluate scheduling algorithms for DAG-structured applications [28].
Bricks is a java-based discrete-event simulation system geared towards providing components
for scheduling and monitoring applications through the use of resource and network monitoring
tools [73]. It consists of several components including a ServerMonitor, NetworkMonitor, Scheduling
Unit and Network Predictor. These components work in tandem to monitor resource and network
activity and then actively make application scheduling decisions.
The MicroGrid toolkit provides a platform on which to run Globus applications, allowing the
evaluation of middleware and applications in a Grid based environment [71]. MicroGrid creates a
virtual Grid infrastructure and supports scalable clustered resources with configurable Grid resource
performance attributes. The key functionality provided is to enable users to accurately simulate the
execution of Globus applications by presenting the applications with an identical Application Pro-
gram Interface (API) such that they will behave accurately in the simulation. In this way, MicroGrid
allows applications that have been designed to use the Globus Grid middleware infrastructure to be
tested before deployment.
The most fundamental difference between the previous approaches and PredSim is that PredSim
allows a multi-state resource availability model. As noted in Section 3.1, resources may become
unavailable in a multitude of different ways and because of the different effects that these unavail-
abilities can have on executing applications, it’s important that a simulator incorporate these types
of unavailability. Alongside this functionality, PredSim supports simulating checkpointable applica-
tions including periodic and on demand checkpoints along with checkpoint migration. Also, PredSim
is purpose-built around supporting availability prediction-based scheduling approaches and as such
has built in services for monitoring and cataloging resource history to enable prediction. Addition-
ally, PredSim has a built in mode for testing a predictor’s accuracy which can be used to isolate the
accuracy of that predictor under a given scenario. PredSim also has built in mechanisms for creating
and managing task replicas to test prediction and non-prediction based replication techniques.
9.2 PredSim Capabilities and Tools
This section summarizes the primary capabilities and features of PredSim and discusses its included
tools.
9.2.1 Capabilities
PredSim provides the following core functionalities:
• Multi-state availability model compatible
• Task scheduling and prediction accuracy modes
• User specified components such as schedulers and predictors can easily be added
• Compatible with FTA and Condor resource availability trace formats
• GWA workload format capable
• Fully configurable synthetic resource availability and task creation options
• Provides statistical and resource information services for analysis
• Automatic execution of ranges of scenarios (load, resource characteristics, etc)
In these ways, PredSim creates a functional testbed on which to compare prediction, prediction-
based scheduling and replication techniques.
9.2.2 Tools
In addition to the functionalities provided by the core PredSim modules, PredSim also includes
several tools for further analysis and ease of use.
Multi-exec Tool
Often it becomes important to test a given technique in multiple environments under a series of
conditions. The core modules of PredSim allow the simulator to automatically execute a range of
conditions for a series of schedulers/predictors, such as varying load, checkpointability or replicatability
within a given trace. However, it is also important to test techniques under different resource
availability sets and workloads. Multi-exec allows the user to set PredSim to test a given technique
or set of techniques utilizing different resource availability and workload traces; within each
trace, the conditions for that simulation can then be varied. Additionally, each run can be repeated a
configurable number of times to, for example, develop confidence intervals for the results.
Combine Tool
Given a series of simulation runs, the Combine Tool allows the output files to be combined into a
single set of output files. This can be useful for aggregating results and understanding the overall
behavior of techniques in a variety of circumstances and environments. Also, sometimes results can
be inconsistent between runs. Executing many copies of a run and then combining the results can
be helpful in determining the underlying performance trends.
Availability Analysis Tool
Section 3.2 investigated the availability behavior of the Notre Dame Condor pool. This data was
created by the Availability Analysis Tool which inputs a resource availability trace and outputs a
whole range of availability analysis data. It incorporates the ability to analyze both single state and
multi-state availability traces. The analysis data it produces includes resource behavior versus day
of the week and time of day. It also analyzes aggregate resource behavior such as time spent and
transitions to each state over a trace. This tool can be useful for describing and characterizing the
multi-state availability behavior of a set of traced resources.
9.3 PredSim System Design
At its core, PredSim was designed to allow and account for the different types of availability and
unavailability that a resource may exhibit. Section 3.1 discussed these different states of availability
including the user becoming present on the machine, the local load going over a certain threshold
and the machine becoming unavailable unexpectedly (ungraceful failure). PredSim was designed
to incorporate these different types of unavailability while having the functionality to simulate a
variety of application types. PredSim supports both application checkpointing and migration in-
cluding periodic checkpointing and on-demand checkpointing. PredSim also has built in services
for supporting resource availability prediction and prediction-based scheduling and task replication
algorithms.
Figure 9.1 shows the architectural components of PredSim grouped into three categories: Input
Components, User Specified Components and PredSim Components. The following sections discuss
the design of these three component groups.
Figure 9.1: PredSim System Architecture
9.3.1 Input Components
In order to test prediction algorithms and scheduling techniques, both resource availability
information and task information must be available to the simulator. PredSim provides multiple
methods for accepting and/or generating this information.
Resource Input
As noted earlier, PredSim is designed from the ground up to support a multi-state resource
availability model. As such, its primary resource input format is that of Condor-formatted resource
availability traces. By default, the Condor distributed system logs resource availability information
including user presence information, local CPU load measurements and overall resource connectiv-
ity/availability. This information can be extracted from the system via the condor_stats command.
An analysis tool included with PredSim then analyzes the Condor-formatted trace
information and converts it into a format readable by PredSim. The PredSim format consists of
newline separated unix epoch, availability state and CPU load items. The availability state is an
integer from 1 to 5 indicating the availability state according to the multi-state model. In this way,
the input format retains all the information about resource availability, user presence and CPU load.
PredSim also accepts resource availability information stored in the Failure Trace Archive (FTA)
format [35]. The FTA is an online repository of 26 resource availability traces, all stored in a common
format. Each FTA-formatted trace consists of a series of tables output to files contained within a
folder. These text files contain information on the nodes, such as CPU speed and available memory, as well
as availability information. The key input file is event_trace.tab, which contains newline-delimited
event information corresponding to each resource's availability via unix epoch time stamped events.
Unfortunately, the FTA format does not differentiate the user-present state,
so PredSim adjusts by simply ignoring that state.
PredSim stores all input traces in memory via a set of time stamped resource states.
Each resource's availability history is represented as an array of events. Each event contains a unix
time stamp, an integer representing its availability state according to the presented multi-state
availability model and a CPU load float value. FTA traces are converted to this format by parsing
through event_trace.tab and, for each resource, recording periods of availability or unavailability
as dictated by the event information contained in the FTA trace.
Using real world resource availability traces is typically the preferred method for algorithm
testing, but often a researcher will want to design his or her own resource availability scenario
which is not present in any existing resource availability trace. For this reason, PredSim includes
a synthetic resource creation component. This component allows the user to create a set of resources
and is fully configurable by the user. It incorporates the following parameters:
• Quantity: Dictates the total number of resources
• Availability: The user can specify the exact availability characteristics of each machine
individually or by setting the “class” of each resource. The “class” dictates the probability
distributions a resource will follow with regard to its availability behavior, including:

– State Duration: Each state is given a parameter for a probability distribution. This
parameter dictates, once a state is entered, how long the resource will stay in that state.
Distributions include Weibull and Uniform.
– State Frequency: Each state is also given a parameter for a probability distribution
indicating how likely that state is to be entered. This dictates how frequently each state
is entered.
• Performance: The speed and load on each resource is also configurable by the user.
– Resource speed: Speed, specified in terms of Millions of Instructions Per Second
(MIPS), of each resource can be individually specified, or the “class” of the resource
can dictate its speed.
– CPU Load: The local load on the processor is a configurable parameter. It can be set
to follow a distribution or always remain at zero.
Task Input
The other key input component is the set of tasks the simulation system will execute. PredSim
includes two methods for this input component. The first method is accepting workload
traces in the Grid Workload Archive (GWA) format [27]. The GWA is an online repository of 11
workloads, all stored in a common format. Importantly, the GWA format indicates both when jobs
arrive and the length of each job. The length of a job is logged as the number of seconds it took to
execute. This approach has the drawback that the execution time of the job will be dependent on
the speed and local load of the machine on which it executed. For this reason, PredSim assumes that
all the machines executing jobs on the traced system are the same speed. Another key drawback
of the format is that it does not indicate whether jobs are checkpointable or replicatable. To bypass this,
PredSim allows the user to input a traced workload and then apply checkpointable or replicatable
attributes to jobs as the user sees fit. For example, the user could specify that 25% of the input
jobs are checkpointable, which would cause PredSim to randomly assign 25% of the jobs the
attribute “checkpointable”.
As with resource availability traces, sometimes a desired workload trace scenario may not exist.
In this case, PredSim allows a synthetic set of jobs to be created and executed on the system.
PredSim allows the user to adjust the following job attributes:
• Quantity: The total number of jobs injected into the system over the course of the simulation.
• Job Length: The total number of operations required to complete the job can be given as a
range of lengths. The length of each job can be chosen via a probability distribution in that
range. Job length distributions include uniform and Pareto.
• Checkpointability: Determines what percentage of jobs are checkpointable.
• Replicatability: Determines what percentage of jobs are replicatable.
• Day or Night: Whether jobs are injected only during the day or may be injected at any time.
Combining Resource and Task Inputs
Given a resource availability trace and a workload trace, PredSim must match up these two
input components, which may lead to certain difficulties. For example, the traces may be taken at
different times. To adjust for these differences, PredSim aligns the beginning of each trace
in time. Additionally, traces may have different lengths; in these cases, PredSim shortens the
longer of the two traces and throws out the excess. A third issue is slightly more complex. When a
workload is gathered from one system and then applied to another traced system, the two systems may be
different sizes. For this reason, PredSim can dynamically scale the workload to match the number
of resources in the availability trace. It does so by calculating the number of jobs per resource for
the workload; then, given the number of resources in the availability trace, it scales the number
of jobs per resource from the workload by randomly throwing out excess jobs. The last concern is
that of varying load level. Often, a researcher will want to test an approach for a variety of load
levels, but a given workload trace will only specify a narrow range of loads. PredSim can dynamically
adjust the load level of a GWA input trace by only keeping a certain percentage of its jobs. This
enables testing with a given workload type as specified by a workload trace while still facilitating
tests for a large variety of loads.
9.3.2 User Specified Components
PredSim strives to be versatile by giving the user the freedom to swap system and
algorithmic components in and out. PredSim provides the user with a library of built-in techniques for a given
component but also allows new components to be added.
Availability Predictors
PredSim defines an interface for prediction techniques in Table 9.1. The user can define a new
prediction module using this interface or use/modify existing prediction modules. The input interface
specifies the resource to predict, a future interval of time for which the prediction is to be made
and the availability history of the resource. The predictor then outputs four probabilities which
sum to 100%. These probabilities represent the likelihood of the resource entering each respective
unavailability state, or completing the interval in the available state. Single-state predictors need not
return four probabilities but can just specify the Completion and Ungraceful Eviction probabilities.
The simulator comes with a whole range of built-in multi-state and single-state resource
availability predictors and prediction tools. These built-in prediction modules can be used to develop
Input Interface
• Resource: Index of resource for which the prediction is to be made
• Current Time: The current simulation time in unix epoch format
• Prediction Length: The length of the interval to make the prediction for, in number of minutes
• Resource History: An array of event structs containing unix epoch stamped availability state history information for the resource

Output Interface
• Resource: Index of resource for which the prediction was made
• Probability Outputs:
  – Completion: probability that the machine will remain available for the duration of the interval
  – Ungraceful Eviction: probability that the machine will fail ungracefully
  – User Eviction (optional): probability that the user will return to the machine
  – CPU Eviction (optional): probability that the local CPU load on the machine will exceed the threshold
• Likely Availability Duration (optional): The predicted duration that the resource will remain available
• Predicted CPU Load (optional): The predicted average CPU load during the period of availability

Table 9.1: Prediction Interface
new prediction techniques and include:
• Prediction Time Frame: This tool dictates which points in a resource’s history to examine,
such as the previous X hours or the prediction period during the last N
days.

• Availability Analysis Tools: Once the period of resource history to analyze is chosen, the
user can then specify how those periods of history are analyzed. Analysis techniques
include durational examination and transition counting schemes.
• Transition Weighting Schemes: If transition weighting is the chosen form of historical
analysis, the transitions can be weighted according to a variety of factors such as time of day,
day of week and freshness.
The included prediction techniques can be parameterized and combined in various configurations
to provide the user with a wide range of availability predictors (such as those proposed and investi-
gated in Chapter 4 like TRF and TDE). These predictors can also be compared with built-in related
work prediction modules such as Ren, Saturating Counter, History Counter, Sliding Window Single
and Sliding Window Multi (discussed in detail in Chapter 4).
Input Interface
• Job:
  – Job number: Unique job identifier
  – Injection time: Unix time stamp of the time the job was injected into the system
  – Job length: Number of operations required to complete the job
  – Checkpointability: Boolean value indicating if the job is checkpointable
  – Replicatability: Boolean value indicating if the job is replicatable
• Resource Information Array:
  – Availability: The current availability state of the resource
  – MIPS: The speed of the resource in Millions of Instructions Per Second
  – CPU Load: The current local load on the resource’s processor

Output Interface
• Resource: Index of resource chosen to execute the job; -1 indicates the job could not be scheduled.

Table 9.2: Scheduling Interface
Scheduling Algorithms
The user must address how resources are chosen for the tasks injected into the system.
PredSim’s scheduling interface is defined in Table 9.2. The input to a scheduler is a job struct
containing information about the job, including its unique identifier, injection time, the number of
operations to complete the job and its ability to be checkpointed or replicated. A scheduler can
then select from the provided resource array based on whatever criteria it chooses
and return the selected resource via an index value.
PredSim accommodates new scheduling modules but also has a set of built-in modules, including:
• Random: Randomly choose an available resource.
• CPU-Speed: Choose the resource with the fastest CPU and lowest local CPU load.
• Pseudo-optimal: With omnipotent future knowledge, choose the resource which will com-
plete the task the earliest without resource unavailability.
• Resource Score Based Schedulers: Score all resources according to some expression, then
choose the resource with the highest score. An existing scoring method can be chosen or a
new one can be implemented.
• Prediction-Based Schedulers: Any prediction module implemented in the system can be
used by the scheduler for choosing which resource to allocate the task.
– Performance/Reliability Tradeoff Schedulers: These prediction-based schedulers
can be set on a sliding scale to determine what emphasis they place on performance,
reliability or both. Pre-defined schedulers set to known tradeoff values (e.g. Comp,
Perf, etc.) are also included.

Input Interface
• Job:
  – Job length: Number of operations required to complete the job
  – Checkpointability: Boolean value indicating if the job is checkpointable
  – Replicatability: Boolean value indicating if the job is replicatable
• Completion Probability: Predicted probability that the machine executing the job will remain available for the duration of the job
• Scheduled Resource Information:
  – MIPS: The speed of the resource in Millions of Instructions Per Second
  – CPU Load: The current local load on the resource’s processor
• System Load: Current system load measured as the number of jobs per day per available resource

Table 9.3: Replication Interface
– Prediction Length Algorithms: The length of the prediction to be made for a job of
length X can be modified by any number of included expressions.
• Ren MTTF scheduler: Ren et al. Mean Time Till Failure scheduler [55].
PredSim’s Scheduling components can be chosen by the user from the included library of schedul-
ing algorithms, created from included scheduling algorithm tools or added by the user.
Task Replication Algorithms
PredSim also incorporates functionality to support task replication. The replication algorithm
not only chooses which jobs are to be replicated, but also how many replicas of each job are to be
created. Currently, PredSim’s replication functionality is built into the job scheduling infrastructure.
When a job is scheduled, the replication component can then choose to create any number of replicas
of that job or none at all. Table 9.3 lists the input interface to the replication component. The choice
to replicate or not can be based on the job’s characteristics, the predicted probability of the job
completing without failure, the performance characteristics of the resource chosen to execute the
job or based on the current system load. The replication policy has no output interface because it
directly queues any replicas it wishes to create and has no return values.
These factors can be combined in numerous ways by the user or pre-built replication policies can
be used. The replication algorithm components include:
• Static Techniques: Replication policy is determined by some static characteristic:
– Fixed Percentage: Replicate a fixed percentage of jobs.
– Ckpt/Non: Replicate jobs based on their checkpointability (e.g. replicate all non-
checkpointable jobs).
– Static Multiple: Create a pre-defined number of replicas for each job (e.g. 3 replicas
per job).
• Prediction-based Techniques: Replicate jobs based on their host resource’s reliability:
– Replication Score Threshold (RST): Replicate jobs based on their predicted chance
of completion.
– RST and Job Characteristics: Replicate jobs based on their predicted chance of
completion and their characteristics, such as checkpointability and length.
• WQR and WQR-FT: WorkQueue with Replication and WorkQueue with Replication-Fault
Tolerant [70].
• Static Load Adaptive: Prediction-based Replication at set failure thresholds for each load
level.
• Sliding Replication: Prediction-based Load Adaptive Replication on a sliding scale (see
Chapter 7).
Tools and statistics are also provided to analyze the effectiveness and replication efficiency of each
replication policy. Additionally, load measurement services are provided for use with load adaptive
replication approaches.
Job Queuing Strategies
When a job is injected into the PredSim system, it is placed in a queue. Jobs are then taken from
the head of the queue and scheduled onto available resources. The order in which the jobs are
queued and scheduled can have a significant effect on performance. The user can specify a queuing
component based on the interface shown in Table 9.4. The queuing strategy is provided with the
job to be queued which contains meta-information about the job for scheduling purposes. A pointer
to the head of the job queue is also provided. No output interface is specified as it is assumed that
the job will be queued for execution.
PredSim includes the following queuing strategy components:
• Rear: New jobs always go to the rear of the queue.
Input Interface
• Job:
  – Job number: Unique job identifier
  – Injection time: Unix time stamp of the time the job was injected into the system
  – Job length: Number of operations required to complete the job
  – Checkpointability: Boolean value indicating if the job is checkpointable
  – Replicatability: Boolean value indicating if the job is replicatable
• Job Queue Pointer: A pointer to the head of the job queue linked list

Table 9.4: Job Queuing Interface
• Front: Jobs are placed at the front of the queue.
• Min-Min: The queue is arranged from shortest to longest job.
• Max-Max: The queue is arranged from longest to shortest job.
• Min-Min Ckpt: Checkpointable jobs are scheduled before non-checkpointable jobs with
shorter jobs being scheduled before longer jobs in each sub-queue.
• Oldest: Oldest jobs (according to initial injection time) are placed at the head of the queue.
• Newest: Youngest jobs (according to initial injection time) are placed at the head of the
queue.
These queuing strategies may be used or new queuing components can be added by the user.
The system can also be outfitted to schedule in batch mode instead of on-demand queue mode. In
batch mode, jobs are scheduled at given intervals (e.g. every 15 minutes) instead of when they are
first injected.
Job Killing Strategies
Load on a system can vary over time. When this occurs, decisions made
under previous conditions may no longer be relevant. This is particularly true for
replication decisions. As discussed in Chapters 6 and 7, the load on the system greatly affects the
performance of the current replication strategy. Particularly when load increases, if many replicas
were created under the lower-load conditions, there may be an inadequate number of available
resources left for executing incoming tasks. In these situations, it becomes beneficial to “kill”
previously created replicas to free up resources to execute new tasks. PredSim includes a replica killing
component to facilitate this adaptive approach.
9.3.3 PredSim Components
PredSim takes both the input components and the User Specified components and executes the sim-
ulation utilizing its core components. This section summarizes the functionality of each of PredSim’s
core components.
Resource Monitoring
The resource monitoring service keeps track of all the resources in the system, not only gathering
statistics about their uptime, availability and job execution performance, but also providing resource
information to other components. Resource monitoring entails keeping track of which resources are
available or unavailable and logging that information. When resources become unavailable, the job
management service must be notified so that the proper actions can be taken such as checkpointing,
migration or re-queuing. The service also provides information to predictors regarding resource
availability history for prediction analysis. Lastly, the service provides information to the scheduling
component notifying it which resources are available for job execution.
Job Management
The Job Management component handles jobs by first injecting them into the system according to
the workload trace component specified by the user. When a job is injected, it is first placed in a
queue according to the current user-supplied queuing component. Periodically these jobs are
scheduled from the queue according to the user-supplied scheduling component. The job management
component also monitors job progress by calculating how many operations have been completed
by each executing job per simulator tick while also handling resource unavailability. When a re-
source executing a job becomes unavailable, the job management component enables a checkpoint
to be taken (if the job is checkpointable and the transition to unavailability was graceful) and then
places that job back into the queue for migration/rescheduling. Completed jobs are placed into a
completed job queue, execution statistics are logged and those resources are freed for further task
execution. When the simulation ends, those jobs still executing or in the queue are removed by the
job management system and their statistics are logged.
Information Services
Within the simulator, the Information Services component provides low-level data to other components within the
system by logging various aspects of system execution. Before jobs are injected into the system, they
are stored here and then passed to the Job Management system at the proper time. Also, resource
availability trace events are stored until they are presented to the Resource Monitoring component.
Additionally, techniques which use omnipotent future knowledge (e.g. the pseudo-optimal scheduler)
can gather future resource information from this component. Information Services also tracks high-level
simulation information, such as the current simulation time, current simulation run and current
scheduler/predictor, which is made available to other components as needed.
Statistics
It is important to provide users with feedback about the operation and performance of the system.
This information can be used to compare techniques’ effectiveness and develop new techniques.
PredSim logs a significant amount of information throughout each run cataloging system events.
This simulation feedback is then made available to the user in summary form on the terminal as
well as in a series of statistical output files. PredSim logs high level information on graceful and
ungraceful evictions, completed jobs, replicas created, operations lost and checkpoints taken. More
detailed information is also logged and output to files including scheduling results versus day, hour
and job length as well as individual machine statistics for evictions, job completions and uptime.
Scheduling Framework
The Scheduling Framework makes all resource allocation and deallocation decisions. It provides the Job Management component with the set of Scheduling Components and Replication Components used to decide which resources receive each job and which jobs are replicated.
9.4 PredSim Performance
This section investigates PredSim's performance. Because it is critically important that PredSim deliver timely results, this analysis focuses on the factors that influence simulation completion time. Execution time is most affected by factors that scale up the simulation size; when comparing scheduling techniques against each other over sets of jobs, the two main scaling factors are the number of schedulers being compared and the number of jobs being scheduled. This section reports execution times with PredSim running on a dual-processor machine with 2.66 GHz dual-core Xeon 5150 processors and 8 GB of RAM, running Debian 6.0.1.
Figure 9.2 investigates the effect of varying the number of jobs executed in the simulation on simulation execution time.

[Figure 9.2: PredSim execution time versus the number of jobs executed. Two subgraphs, "Synthetic Workload" and "Grid5000 GWA Workload," plot execution time in minutes (0–50) against the number of jobs (0–4 x 10^4) for the Condor and Seti traces.]

The simulations execute a single scheduler: the PPS scheduler using
the TRF predictor discussed in Section 5.2. The number of jobs varies along the X-axis, and two availability traces drive the simulations: the Condor trace and the FTA-formatted Seti trace. The left subgraph, driven by a synthetic workload, shows that increasing the job load on the system leads to a linear increase in execution time; the Seti trace takes significantly longer at all load levels. The right subgraph plots the same experimental setup executing the Grid5000 GWA workload, which completes significantly faster than the synthetic workload because the Grid5000 trace contains shorter jobs. In both scenarios, even under extremely high load, PredSim's execution time does not exceed 45 minutes.
Figure 9.3 investigates how simulation execution time is affected by the number of schedulers being compared against each other. The simulator executes 1000 synthetic jobs using both the Condor and the FTA-formatted Seti traces. PredSim's execution time increases linearly with the number of schedulers under both availability traces, demonstrating good scalability; execution time does not exceed 22 minutes even when 15 separate scheduling algorithms are compared.
[Figure 9.3: PredSim execution time versus the number of schedulers investigated. Execution time in minutes (0–25) plotted against the number of schedulers (0–15) with the synthetic workload, for the Condor and Seti traces.]
Chapter 10
Conclusions
10.1 Summary
The volatility of resources can have serious performance consequences for executing applications. To
categorize this volatility, Section 3.1 introduces four resource availability states, which capture why
and how (not just when) resources become unavailable over time. This approach suggests predictors
and schedulers that can distinguish between graceful and ungraceful transitions to unavailability.
Section 3.2 describes four months of data from a university Condor pool, examines how its resources behave in terms of these states over time, and shows how resources can be classified according to trends in that behavior. Section 3.2.3 demonstrates by simulation that mapping jobs of different lengths onto resources in these different categories has a significant effect on whether the jobs run to completion.
Chapter 4 introduces four different methods for predicting the future multi-state availability of
a resource. These algorithms combine two methods of analyzing a resource’s behavior (Duration in
each state, and the number of Transitions to that state) with two methods to select which periods
of a resource’s history to analyze (recent Days at the same time of day, and Recent hours). Five
different weighting techniques further enhance the predictor’s accuracy: Recentness, Time of Day,
Day of Week, Recentness & Day of Week, and All. Section 4.2 analyzes the proposed prediction methods to determine the most effective algorithm, and then tests the most successful predictors against several predictors from related work. The results demonstrate that the TDE predictor is the most accurate of the proposed algorithms, and is 4.6% more accurate than the best-performing related-work predictor, the Ren predictor [54].
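As a rough illustration of the transition-counting flavor of these predictors, the probability of entering each state can be estimated from the frequency of transitions observed over a window of recent history. This is a simplified sketch under that assumption, not the dissertation's exact TDE/TRF algorithms, and it omits the weighting techniques entirely:

```python
from collections import Counter

def predict_next_state(history, window):
    """Estimate next-state probabilities for a resource by counting
    transitions into each state over the most recent `window` samples.
    `history` is a chronological list of availability-state labels."""
    recent = history[-window:]
    transitions = Counter(
        new for old, new in zip(recent, recent[1:]) if new != old
    )
    total = sum(transitions.values())
    if total == 0:               # no transitions observed in the window:
        return {recent[-1]: 1.0}  # predict the resource stays in its state
    return {state: n / total for state, n in transitions.items()}
```

The full predictors additionally weight history by recentness, time of day, and day of week, and can analyze state durations rather than transition counts.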
These availability predictors allow schedulers to consider resource availability when allocating resources. Chapter 5 shows that prediction-based schedulers that are aware of the various types of unavailability can exploit this information to make better job placement decisions. The TRF-Comp scheduler outperforms traditional schedulers and other schedulers that trade off reliability for performance, including the only other scheduler based on multi-state availability prediction; the results show reduced makespan and fewer evictions across a variety of scenarios.
However, even with reliability-aware scheduling, resource unavailability remains inevitable. Checkpointing and replication are the two primary tools Grids can use to mitigate its effects. Schedulers can replicate jobs to reduce the impact of resource unavailability and ultimately reduce makespan. However, replication consumes additional Grid cycles, which carry a direct cost within a Grid economy and an indirect cost in tying up attractive resources, especially under relatively high load. Chapter 6 describes techniques that use availability predictions to inform a scheduler's decision about whether to replicate a job. Availability predictors forecast resource behavior, allowing schedulers to consider resource reliability when making replication decisions. Section 6.3 introduces the notion of "replication efficiency," which combines the benefit and cost of replication into a single metric, and studies a variety of replication policies under various loads, using simulations based on two representative Grid resource traces to demonstrate the improved performance that the TRF predictor and the proposed replication techniques produce. A strategy that considers process checkpointability, job length, predicted resource reliability, and current system load makes the best use of additional replicated operations to reduce job makespan efficiently.
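The flavor of such an efficiency metric can be illustrated as makespan improvement bought per extra replicated operation. This is only a hedged sketch of the idea; Section 6.3 gives the study's actual definition, which may differ:

```python
def replication_efficiency(base_makespan, repl_makespan, extra_ops):
    """Illustrative benefit/cost ratio for replication: how much makespan
    reduction each additional replicated operation buys. A higher value
    means replication spent its extra cycles more productively."""
    if extra_ops == 0:
        return 0.0  # no replicas executed, so no cost and no ratio to report
    return (base_makespan - repl_makespan) / extra_ops
```

A metric of this shape lets policies that replicate aggressively be compared fairly against conservative ones: a policy that shaves little makespan while burning many extra operations scores poorly even if its absolute makespan is lower.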
Replication decisions are also affected by the current load on the system. Section 7.3 proposes Sliding Replication (SR), a load-adaptive replication technique that dynamically chooses an appropriate prediction-based replication strategy given the load on the system. Section 6.3.3 compares the proposed technique's performance to previously proposed prediction-based and non-prediction-based load-adaptive replication strategies, and shows that SR produces comparable makespan improvement while significantly reducing the number of extra operations required. These replication techniques can improve the Grid user experience not only by reducing the time it takes tasks to complete but also, through intelligent job replication, by significantly reducing the real-world cost incurred by executing replicas on the Grid.
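A load-adaptive policy of this kind might be sketched as follows. The thresholds and policy names here are purely illustrative assumptions; the actual Sliding Replication strategy and its parameters are defined in Section 7.3:

```python
def choose_replication_policy(load):
    """Slide from aggressive to conservative replication as system load
    rises (load is the fraction of resources busy, 0.0-1.0). Thresholds
    are hypothetical, not the dissertation's tuned values."""
    if load < 0.3:
        return "replicate-all"         # idle cycles are cheap: replicate freely
    elif load < 0.7:
        return "replicate-unreliable"  # replicate only jobs on risky resources
    else:
        return "no-replication"        # resources are scarce: do not replicate
```

The design intuition is that replication's cost is opportunity cost: under light load, extra copies occupy cycles that would otherwise be wasted, while under heavy load the same copies displace queued jobs.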
Chapter 8 extends these fault-tolerant computing paradigms to an unconventional distributed computing environment. Section 8.3 characterizes the availability of multi-function devices (MFDs) and demonstrates their applicability as Grid computation devices for non-data-intensive tasks. This includes developing an MFD-specific multi-state availability model, characterizing when the various types of jobs arrive, proposing a generic metric to describe their burstiness, and scheduling around that burstiness. Section 8.6 proposes a computationally efficient method for online calculation and scheduling. These burstiness-aware scheduling strategies significantly improve makespan and reduce the percentage of Grid job preemptions.
Chapter 9 introduces and describes PredSim, the experimental testbed used in this study. PredSim is an event-driven task scheduling simulation platform designed specifically for multi-state prediction, scheduling, and replication. It accepts real-world resource availability and workload traces as input, includes modes for testing prediction accuracy and task scheduling algorithms, and provides detailed statistical feedback to the user.
10.2 Future Work
This section lists future research directions grouped by category.
1. Prediction
(a) Investigate uses for multi-state availability prediction beyond job placement and replication decisions (e.g., data placement, and the frequency and timing of checkpointing).
(b) Predict time until resource unavailability (not just completion probability during a given
interval).
(c) Predict time until recovery (useful, for example, when choosing whether to suspend or migrate a job).
(d) Use patterns in uptime behavior instead of just counting transitions.
(e) Predict rare events such as machine hardware failure differently than user presence and
local CPU load, perhaps by using a separate prediction method.
(f) Account for reasons an application may have failed previously, such as recurring issues (for example, insufficient system RAM for the application to complete).
(g) Use a mixture-of-experts prediction approach, in which a pool of predictors makes predictions for each resource and the predictions from the predictor that has performed best for that particular resource are used for scheduling.
2. Scheduling
(a) Determine the criteria that make a predictor useful for scheduling. Results indicate that overall accuracy is not necessarily the best indicator of a predictor's usefulness in scheduling.
(b) Investigate prediction-based scheduling techniques that do not require the expected runtime of the application.
(c) Actively migrate executing applications that are expected to experience resource unavailability to other resources as those resources become available.
(d) Investigate waiting to schedule jobs until “better” resources become available instead of
scheduling them immediately.
(e) Investigate scheduling sets of jobs with a variety of priorities. Higher priority jobs could
be assigned to faster resources that experience less resource unavailability.
(f) Develop a non-greedy scheduling approach that attempts to match the available resource
pool to the current job pool using batch queuing.
(g) Schedule I/O-bound jobs instead of compute-bound jobs (considering data placement and availability, etc.).
3. Replication
(a) Investigate prediction-based replication policies that can create more than a single replica for each job.
(b) Investigate replication with different underlying scheduling policies; currently, schedulers place jobs after considering only resource speed and load, not, for example, reliability.
(c) Investigate replicating jobs after they have begun running, upon eviction, or when a new
resource becomes available.
4. Trace
(a) Gather new traces from other sources and test the ideas proposed in this study in those environments.
(b) Develop synthetic availability trace models whose volatility can be varied, and investigate how the previously proposed techniques behave as the volatility and heterogeneity of the synthetic resources change over time.
5. Meta-scheduling
(a) Investigate policies and techniques that could support availability prediction in a meta-scheduling environment, such as several clusters managed by a meta-scheduler.
6. Real-world Implementation
(a) Implement the prediction, scheduling and replication techniques proposed in this study
in a real distributed or Grid system and investigate their performance.
Bibliography
[1] Abu-Ghazaleh, N., and Lewis, M. Toward self organizing grids. In International Conference on High Performance Distributed Computing Hot Topics Session (2006), pp. 324–327.
[2] Acharya, A., Edjlali, G., and Saltz, J. The utility of exploiting idle workstations for parallel computation. In SIGMETRICS (1997), pp. 225–234.
[3] Amin, A., Ammar, R., and Gokhale, S. An efficient method to schedule tandem of real-time tasks in cluster computing with possible processor failures. In Symposium on Computers and Communications (2003), p. 1207.
[4] Anderson, D. BOINC: A system for public-resource computing and storage. In IEEE/ACM Workshop on Grid Computing (2004), pp. 4–10.
[5] Anderson, D., and Fedak, G. The computational and storage potential of volunteer computing. In International Symposium on Cluster Computing and the Grid (2006), pp. 73–80.
[6] Androutsellis-Theotokis, S., and Spinellis, D. A survey of peer-to-peer content distribution technologies. ACM Computing Surveys 36, 4 (2004), 335–371.
[7] Anglano, C., and Canonico, M. Fault-tolerant scheduling for bag-of-tasks grid applications. In Advances in Grid Computing – EGC 2005 (2005), pp. 630–639.
[8] Arpaci, R., Dusseau, A., Vahdat, A., Liu, L., Anderson, T., and Patterson, D. The interaction of parallel and sequential workloads on a network of workstations. In International Conference on Measurement and Modeling of Computer Systems (1995), pp. 267–278.
[9] Bolosky, W., Douceur, J., Ely, D., and Theimer, M. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In SIGMETRICS (2000), pp. 34–43.
[10] Borthakur, D. The Hadoop Distributed File System: Architecture and Design. The Apache Software Foundation, 2007.
[11] Braun, T., Siegel, H., and Beck, N. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. Journal of Parallel and Distributed Computing 61, 6 (2001), 810–837.
[12] Buyya, R., and Murshed, M. GridSim: A toolkit for the modelling and simulation of distributed resource management and scheduling for grid computing. The Journal of Concurrency and Computation: Practice and Experience (CCPE) (2002).
[13] Cardinale, Y., and Casanova, H. An evaluation of job scheduling strategies for divisible loads on grid platforms. In High Performance Computing and Simulation Conference (2006), pp. 705–712.
[14] Casanova, H. SimGrid: A toolkit for the simulation of application scheduling. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2001) (2001).
[15] Casanova, H., Zagorodnov, D., Berman, F., and Legrand, A. Heuristics for scheduling parameter sweep applications in grid environments. In HCW '00: Proceedings of the 9th Heterogeneous Computing Workshop (Washington, DC, USA, 2000), IEEE Computer Society, p. 349.
[16] Chun, B., and Vahdat, A. Workload and failure characterization on a large-scale federated testbed. Tech. Rep. IRB-TR-03-040, Intel Research Berkeley, 2003.
[17] Dail, H., Casanova, H., and Berman, F. A decoupled scheduling approach for grid application development environments. Journal of Parallel and Distributed Computing 63, 5 (2003), 505–524.
[18] Dinda, P., and O'Hallaron, D. An extensive toolkit for resource prediction in distributed systems. Tech. Rep. CMU-CS-99-138, Carnegie Mellon University, 1999.
[19] Dogan, A., and Ozguner, F. Biobjective scheduling algorithms for execution time–reliability trade-off in heterogeneous computing systems. The Computer Journal 48, 3 (2005), 300–314.
[20] Erdil, D. C., and Lewis, M. J. Grid resource scheduling with gossiping protocols. In International Conference on Peer to Peer Computing (2007), pp. 193–202.
[21] Enabling Grids for E-sciencE (EGEE). http://public.eu-egee.org/, 2008.
[22] Foster, I., and Iamnitchi, A. On death, taxes, and the convergence of peer-to-peer and grid computing. In International Workshop on Peer-To-Peer Systems (2003).
[23] Foster, I., Kesselman, C., and Tuecke, S. The anatomy of the grid: Enabling scalable virtual organizations. International Journal of Supercomputer Applications 15 (2001).
[24] Frey, J., Tannenbaum, T., Livny, M., Foster, I., and Tuecke, S. Condor-G: A computation management agent for multi-institutional grids. In International Conference on High Performance Distributed Computing (2001), pp. 55–63.
[25] Fujimoto, N., and Hagihara, K. A comparison among grid scheduling algorithms for independent coarse-grained tasks. In International Symposium on Applications and the Internet (2004), IEEE Computer Society Press, pp. 674–680.
[26] Open Science Grid. http://www.opensciencegrid.org/, 2008.
[27] Iosup, A., Li, H., Jan, M., Anoep, S., Dumitrescu, C., Wolters, L., and Epema, D. H. J. The grid workloads archive. Future Gener. Comput. Syst. 24, 7 (2008), 672–686.
[28] Jarry, A., and Casanova, H. DAGSim: A simulator for DAG scheduling algorithms. In Laboratoire de l'Informatique du Parallélisme (2000).
[29] Javadi, B., Kondo, D., Vincent, J., and Anderson, D. Mining for statistical availability models in large-scale distributed systems: An empirical study of SETI@home. In 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) (September 2009).
[30] Kang, W., and Grimshaw, A. S. Failure prediction in computational grids. In Simulation Symposium (2007), pp. 275–282.
[31] Kartik, S., and Murthy, C. Task allocation algorithms for maximizing reliability of distributed computing systems. IEEE Transactions on Computers 41, 9 (1992), 1156–1168.
[32] Kondo, D., Anderson, D., and McLeod, J. Performance evaluation of scheduling policies for volunteer computing. In International Conference on e-Science (2007), pp. 415–422.
[33] Kondo, D., Chien, A., and Casanova, H. Resource management for rapid application turnaround on enterprise desktop grids. In International Conference on High Performance Computing (2004), p. 17.
[34] Kondo, D., Fedak, G., Cappello, F., Chien, A. A., and Casanova, H. Resource availability in enterprise desktop grids. Tech. Rep. 00000994, INRIA, 2006.
[35] Kondo, D., Javadi, B., Iosup, A., and Epema, D. The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. In IEEE International Symposium on Cluster Computing and the Grid (2010), pp. 398–407.
[36] Kondo, D., Taufer, M., Brooks III, C., Casanova, H., and Chien, A. Characterizing and evaluating desktop grids: An empirical study. In IEEE International Parallel and Distributed Processing Symposium (2004), p. 26b.
[37] Lamehamedi, H., Szymanski, B., and Shentu, Z. Data replication strategies in grid environments. In Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP '02) (2002), pp. 378–383.
[38] Levis, P., Lee, N., Welsh, M., and Culler, D. TOSSIM: Accurate and scalable simulation of entire TinyOS applications. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems (SenSys '03) (New York, NY, USA, 2003), ACM, pp. 126–137.
[39] Lewis, M., and Grimshaw, A. The core Legion object model. In International Conference on High Performance Distributed Computing (1996), pp. 551–561.
[40] Li, Y., and Mascagni, M. Improving performance via computational replication on a large-scale computational grid. In CCGRID '03: Proceedings of the 3rd International Symposium on Cluster Computing and the Grid (Washington, DC, USA, 2003), IEEE Computer Society, p. 442.
[41] Litke, A., Skoutas, D., Tserpes, K., and Varvarigou, T. Efficient task replication and management for adaptive fault tolerance in mobile grid environments. Future Gener. Comput. Syst. 23, 2 (2007), 163–178.
[42] Litzkow, M., Livny, M., and Mutka, M. Condor – a hunter of idle workstations. In International Conference on Distributed Computing Systems (1988), pp. 104–111.
[43] Menasce, D. A., Saha, D., da Silva Porto, S. C., Almeida, V. A. F., and Tripathi, S. K. Static and dynamic processor scheduling disciplines in heterogeneous parallel architectures. J. Parallel Distrib. Comput. 28, 1 (1995), 1–18.
[44] Mickens, J., and Noble, B. Predicting node availability in peer-to-peer networks. In International Conference on Measurement and Modeling of Computer Systems (2005).
[45] Mickens, J., and Noble, B. Exploiting availability prediction in distributed systems. In Networked Systems Design and Implementation (2006), pp. 73–86.
[46] Mickens, J., and Noble, B. Improving distributed system performance using machine availability prediction. In International Conference on Measurement and Modeling of Computer Systems Performance Evaluation Review (2006).
[47] Nurmi, D., Brevik, J., and Wolski, R. Modeling machine availability in enterprise and wide-area distributed computing environments. In Euro-Par (2005), pp. 432–441.
[48] Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L., and Zagorodnov, D. The Eucalyptus open-source cloud-computing system. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID '09) (Washington, DC, USA, 2009), IEEE Computer Society, pp. 124–131.
[49] PlanetLab: An open platform for developing, deploying, and accessing planetary-scale services. http://www.planet-lab.org/, 2008.
[50] Pietrobon, V., and Orlando, S. Performance fault prediction models. Tech. Rep. CS-2004-3, University of Venice, 2004.
[51] Qin, X., Jiang, H., Xie, C., and Han, Z. Reliability-driven scheduling for real-time tasks with precedence constraints in heterogeneous distributed systems. In International Conference on Parallel and Distributed Computing (2000), pp. 617–623.
[52] Ramakrishnan, L., and Reed, D. A. Performability modeling for scheduling and fault tolerance strategies for scientific workflows. In HPDC '08: Proceedings of the 17th International Symposium on High Performance Distributed Computing (New York, NY, USA, 2008), ACM, pp. 23–34.
[53] Ranganathan, K., and Foster, I. Identifying dynamic replication strategies for a high performance data grid. In Proc. of the International Grid Computing Workshop (2001), pp. 75–86.
[54] Ren, X., and Eigenmann, R. Empirical studies on the behavior of resource availability in fine-grained cycle sharing systems. In International Conference on Parallel Processing (2006), pp. 3–11.
[55] Ren, X., Lee, S., Eigenmann, R., and Bagchi, S. Resource failure prediction in fine-grained cycle sharing systems. In International Conference on High Performance Distributed Computing (2006).
[56] Ren, X., Lee, S., Eigenmann, R., and Bagchi, S. Prediction of resource availability in fine-grained cycle sharing systems: Empirical evaluation. Journal of Grid Computing 5, 2 (2007), 173–195.
[57] Rood, B., Gnanasambandam, N., Lewis, M., and Sharma, N. Toward high performance computing in unconventional computing environments. In Workshop on Challenges of Large Applications in Distributed Environments (in conjunction with HPDC 2010) (2010).
[58] Rood, B., and Lewis, M. Scheduling on the grid via multi-state resource availability prediction. In International Conference on Grid Computing (2008).
[59] Rood, B., and Lewis, M. Multi-state grid resource availability characterization. In International Conference on Grid Computing (2007), pp. 42–49.
[60] Rood, B., and Lewis, M. Resource availability prediction for improved grid scheduling. In Workshop on Advances in High-Performance E-Science Middleware and Applications (in conjunction with e-Science 2008) (2008).
[61] Rood, B., and Lewis, M. Grid resource availability prediction-based scheduling and task replication. Journal of Grid Computing (2009).
[62] Rood, B., and Lewis, M. Availability prediction based scheduling strategies for grid environments. In CCGrid 2010: The 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (2010).
[63] Rosenblum, M., Herrod, S. A., Witchel, E., and Gupta, A. Complete computer system simulation: The SimOS approach. IEEE Parallel and Distributed Technology (1995).
[64] Ryu, K., and Hollingsworth, J. Unobtrusiveness and efficiency in idle cycle stealing for PC grids. In IEEE International Parallel and Distributed Processing Symposium (2004), p. 62a.
[65] Sahoo, R., Oliner, A., Rish, I., Gupta, M., Moreira, J., Ma, S., Vilalta, R., and Sivasubramaniam, A. Critical event prediction for proactive management in large-scale computer clusters. In Special Interest Group on Knowledge Discovery and Data Mining (2003), pp. 426–435.
[66] Sahoo, R. K., and Squillante, M. S. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of the 2004 International Conference on Dependable Systems and Networks (2004), pp. 772–781.
[67] Santos-Neto, E., Cirne, W., Brasileiro, F., and Lima, R. Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids. In Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing (2004), pp. 210–232.
[68] Schroeder, B., and Gibson, G. A. A large-scale study of failures in high-performance computing systems. In Proc. 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2006) (2006), pp. 249–258.
[69] Schroeder, B., and Gibson, G. A. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, 2007.
[70] Silva, D. P. D., Cirne, W., and Brasileiro, F. V. Trading cycles for information: Using replication to schedule bag-of-tasks applications on computational grids. In Proc. of Euro-Par 2003 (2003), pp. 169–180.
[71] Song, H. J., Liu, X., Jakobsen, D., Bhagwan, R., Zhang, X., Taura, K., and Chien, A. The MicroGrid: A scientific tool for modelling computational grids. In IEEE Supercomputing (SC2000) (2000).
[72] Srinivasan, S., and Jha, N. Safety and reliability-driven task allocation in distributed systems. In International Conference on Parallel and Distributed Systems (1999), pp. 238–251.
[73] Takefusa, A., Matsuoka, S., and Nakada, H. Overview of a performance evaluation system for global computing scheduling algorithms. In 8th International Symposium on High Performance Distributed Computing (HPDC8) (1999).
[74] TeraGrid. http://www.teragrid.org, 2008.
[75] University of Wisconsin-Madison. Condor Version 7.0.4 Manual, 2008.
[76] Vilalta, R., and Ma, S. Predicting rare events in temporal domains. In International Conference on Data Mining (2002), p. 474.
[77] Wang, L., Tao, J., Kunze, M., Castellanos, A. C., Kramer, D., and Karl, W. Scientific cloud computing: Early definition and experience. In 2008 10th IEEE International Conference on High Performance Computing and Communications (2008), pp. 825–830.
[78] Weiss, G., and Hirsh, H. Learning to predict rare events in categorical time-series data. In International Conference on Machine Learning (1998), pp. 83–90.
[79] Weissman, J. B. Fault tolerant computing on the grid: What are my options? Tech. Rep., University of Texas at San Antonio, 1998.
[80] Wolski, R., Spring, N., and Hayes, J. The network weather service: A distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems 15 (1999), 757–768.