GRID RESOURCE AVAILABILITY PREDICTION-BASED
SCHEDULING AND TASK REPLICATION
BY
BRENT ROOD
BS, State University of New York at Binghamton, 2005
MS, State University of New York at Binghamton, 2007
DISSERTATION
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science
in the Graduate School of Binghamton University
State University of New York
2011
© Copyright by Brent Rood 2011
All Rights Reserved
Accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science
in the Graduate School of Binghamton University
State University of New York
2010
May 13, 2011
Dr. Michael J. Lewis, Chair and Faculty Advisor
Department of Computer Science, Binghamton University

Dr. Madhusudhan Govindaraju, Member
Department of Computer Science, Binghamton University

Dr. Kenneth Chiu, Member
Department of Computer Science, Binghamton University

Dr. Yu Chen, Outside Examiner
Department of Electrical and Computer Engineering, Binghamton University
Abstract
The frequent and volatile unavailability of volunteer-based Grid computing resources challenges
Grid schedulers to make effective job placements. The manner in which host resources become
unavailable will have different effects on different jobs, depending on their runtime and their ability
to be checkpointed or replicated. A multi-state availability model can help improve scheduling
performance by capturing the various ways a resource may be available or unavailable to the Grid.
This dissertation uses a multi-state model and analyzes a machine availability trace in terms of that
model. Several prediction techniques then forecast resource transitions into the model’s states.
This study analyzes the accuracy of proposed predictors, which outperform existing approaches.
Later chapters propose and study several classes of schedulers that utilize the predictions, and
a method for combining scheduling factors. Scheduling results characterize the inherent tradeoff
between job makespan and the number of evictions due to resource unavailability, and demonstrate
how prediction-based schedulers can navigate this tradeoff under various scenarios. Schedulers can
use prediction-based job replication techniques to replicate those jobs that are most likely to fail.
The proposed prediction-based replication strategies outperform others, as measured by improved
makespan and fewer redundant operations. Multi-state availability predictors can provide information
that allows distributed schedulers to be more efficient — as measured by a new efficiency metric —
than others that blindly replicate all jobs or some static percentage of jobs. PredSim, a simulation-
based framework, facilitates the study of these topics.
Dedication
I dedicate this dissertation to all the people in this world who have cared for me and taken me
into their hearts. To my mom. For watching me grow and helping me when I fell down. For giving
me this incredible feeling that she will always be there for me. I hope she knows what that means
to me, and that I’ll always be there for her.
To Marisa. Whose gentleness and smile helped me through the hardest of times. Her friendship
and understanding taught me what it really means to love. I will forever carry our moments with
me in my heart.
To Holly. For showing me the beauty of a sunset and for always lending an ear when the world
seemed so big and the road seemed so long. For teaching me how the love of a friend is the brightest
light in this darkness.
And when you have reached the mountain top, then you shall begin to climb.
And when the earth shall claim your limbs, then shall you truly dance.
- Khalil Gibran
Acknowledgements
It is no exaggeration to say that this work would not exist if it were not for the guidance
and insight of my advisor, Michael J. Lewis. I would like to acknowledge him for inspiring me to
reach and strive for novelty. For showing me the thrill of publication and the tempering beauty of
failure. For pushing me to better myself and, in a very profound way, changing the course of my life.
Contents
List of Tables ix
List of Figures x
1 Introduction 1
1.1 Characterizing Availability 2
1.2 Predicting Resource Unavailability 3
1.3 Prediction-based Scheduling 4
1.4 Prediction-based Replication 6
1.5 Load-adaptive Replication 8
1.6 Multi-functional Device Characterization and Scheduling 9
1.7 Simulation Tool 10
1.8 Summary 11

2 Related Work 12
2.1 Resource Characterization 12
2.1.1 MFD Characterization and Burstiness 13
2.2 Prediction 14
2.3 Scheduling 16
2.4 Replication 17

3 Availability Model and Analysis 19
3.1 Availability Model 19
3.1.1 Condor 19
3.1.2 Unavailability Types 20
3.1.3 Grid Application Diversity 21
3.2 Availability Analysis 22
3.2.1 Trace Methodology 22
3.2.2 Condor Pool Characteristics 23
3.2.3 Classification Implications 29

4 Multi-State Availability Prediction 32
4.1 Prediction Methodology 32
4.2 Predictor Accuracy Comparison 34
4.2.1 Prediction Method Analysis 35
4.2.2 Analysis of Related Work 37

5 Prediction-Based Scheduling 39
5.1 Scheduling Methodology 39
5.2 Reliability Performance Relationship 41
5.3 Prediction Based Scheduling Techniques 42
5.3.1 Scoring Technique Performance Comparison 43
5.3.2 Prediction Duration 46
5.4 Multi-State Prediction Based Scheduling 47
5.4.1 Multi-State Prediction Based Scheduling 47
5.4.2 Predictor Quality vs. Scheduling Result 48
5.4.3 Scheduling Performance Comparison 49

6 Prediction-Based Replication 52
6.1 Replication Experimental Setup 54
6.2 Static Replication Techniques 55
6.3 Prediction-based Replication 56
6.3.1 Replication Score Threshold 57
6.3.2 Load Adaptive Motivation 59
6.3.3 Load Adaptive Techniques 61

7 Load Adaptive Replication 64
7.1 Load Adaptive Experimental Setup 65
7.2 Prediction based Replication 65
7.3 Load Adaptive Replication 66
7.3.1 Replication Strategy List 69
7.3.2 Replication Index Function 70
7.3.3 SR Parameter Analysis 71
7.4 Sliding Replication Performance 72
7.4.1 Prediction-based Replication Comparison 72
7.4.2 SR Under Various Workloads 73

8 Multi-Functional Device Utilization 75
8.1 Multi-Functional Devices 75
8.2 Unconventional Computing Environment Challenges 77
8.3 Characterizing State Transitions 80
8.4 Trace Analysis 81
8.4.1 Overall State Behavior 82
8.4.2 Transitional Behavior 82
8.4.3 Durational Behavior 83
8.5 A Metric for Burstiness 85
8.5.1 Burstiness Test 85
8.5.2 Degree of Burstiness 86
8.5.3 Burstiness Characterization 86
8.6 Burstiness-Aware Scheduling Strategies 87
8.6.1 Experiment Setup 87
8.6.2 Empirical Evaluation 89

9 Simulator Tool 93
9.1 Simulation Background 93
9.1.1 Simulation Based Research 93
9.1.2 Distributed System Simulators 95
9.2 PredSim Capabilities and Tools 97
9.2.1 Capabilities 97
9.2.2 Tools 97
9.3 PredSim System Design 98
9.3.1 Input Components 99
9.3.2 User Specified Components 102
9.3.3 PredSim Components 108
9.4 PredSim Performance 109

10 Conclusions 112
10.1 Summary 112
10.2 Future Work 114

Bibliography 117
List of Tables
3.1 Machine Classification Information 29
3.2 Machine Classification Scheduling Results 31
9.1 Prediction Interface 103
9.2 Scheduling Interface 104
9.3 Replication Interface 105
9.4 Job Queuing Interface 107
List of Figures
1.1 Thesis flow chart 11
3.1 Multi-state Availability Model: Each resource resides in and transitions between four availability states, depending on the local use, reliability, and owner-controlled sharing policy of the machine. 20
3.2 Machine State over Time 23
3.3 Total Time Spent in Each Availability State 24
3.4 Number of Transitions versus Hour of the Day 24
3.5 Number of Transitions versus Day of the Week 25
3.6 State Duration versus Hour of the Day 26
3.7 State Duration versus Day of the Week 26
3.8 Machine Count versus State Duration 27
3.9 Machine Count versus Average Number of Transitions Per Day 28
3.10 Failure Rate vs. Job Duration 30
4.1 Transitional Predictor Accuracy 35
4.2 Weighting Technique Comparison 36
4.3 Prediction Accuracy by Duration 37
5.1 As schedulers consider performance more than reliability (by increasing Tradeoff Weight W), makespan decreases, but the number of evictions increases. 42
5.2 The resource scoring decision's effectiveness depends on the length of jobs. For longer jobs, schedulers should emphasize speed over predicted reliability. 44
5.3 Scoring decision effectiveness depends on job checkpointability. Schedulers that treat checkpointable jobs differently than non-checkpointable jobs (S7 and S9) suffer more evictions, but mostly for checkpointable jobs; therefore, makespan remains low. 45
5.4 Schedulers can improve results by considering predictions for intervals longer than the actual job lengths. This is especially true for job lengths between approximately 48 and 72 hours. 46
5.5 TRF and Ren MTTF show similar trends for both makespan and evictions. For any particular makespan, TRF causes fewer evictions, and for any number of evictions, TRF achieves improved makespan. 47
5.6 TRF causes fewer evictions than Ren (coupled with the PPS scheduler) for all Prediction Length Weights. TRF also produces a shorter average job makespan than all Ren predictors, except the one-day predictor (which produces 45% more evictions). 48
5.7 TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for jobs up to 100 hours long. 49
5.8 TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for loads up to 40,000 jobs. 50
6.1 Checkpointability Based Replication: Makespan (left) and extra operations (right) for a variety of replication strategies, two of which are based on checkpointability of jobs, across four different load levels. 56
6.2 Replication Score Threshold's effect on replication effectiveness - Low Load 58
6.3 Replication Score Threshold's effect on replication effectiveness - Medium High Load 60
6.4 Replication strategy performance across a variety of system loads 61
6.5 Load adaptive replication techniques 62
7.1 Replication Score Threshold's effect on replication effectiveness across various loads 67
7.2 Load adaptive replication with random replication probability 67
7.3 Sliding Replication's load adaptive approach 68
7.4 Sliding Replication list comparison 70
7.5 Sliding Replication function comparison 71
7.6 Sliding Replication index analysis 72
7.7 Static vs. adaptive replication performance comparison 73
7.8 Sliding Replication under the Grid5000 workload 74
7.9 Sliding Replication under the NorduGrid workload 74
8.1 MFD count per floor at a large pharmaceutical company 76
8.2 MFD count per floor at a division of a large computer company 77
8.3 Trace data from four MFDs in a large corporation. The devices are Available over 98% of the time, and requests for local use are bursty and intermittent. 78
8.4 MFD Availability Model: MFDs may be completely Available for Grid jobs (when they are idle), Partially Available for Grid jobs (when they are emailing, faxing, or scanning), Busy or unavailable (when they are printing), or Down. 80
8.5 MFD state transitions by day of week 83
8.6 MFD state transitions by hour of day 83
8.7 MFD state duration by day of week 83
8.8 MFD state duration by hour of day 84
8.9 Number of occurrences of each state duration 84
8.10 Burstiness versus Day of Week and Hour of Day 87
8.11 Comparison of MFD schedulers as Grid job length increases 89
8.12 Determination of Grid job lengths with 95% confidence intervals for given failure rates 91
8.13 Comparison of MFD schedulers as Grid job load increases 91
9.1 PredSim System Architecture 99
9.2 PredSim execution time versus the number of jobs executed 110
9.3 PredSim execution time versus the number of schedulers investigated 111
Chapter 1
Introduction
The functionality, composition, utilization, and size of large scale distributed systems continue to
evolve. The largest Grids and test beds under centralized administrative control—including Ter-
aGrid [74], EGEE [21], Open Science Grid (OSG) [26], and PlanetLab [49]—vary considerably in
terms of the number of sites and the extent of the resources contained by each site. Various peer-to-
peer (P2P) [6] and Public Resource Computing (PRC) [4] systems allow users to connect and donate
their individual machines, and to administer their systems themselves. Cloud-based systems, consisting of sets of distributed resources for use or rental by remote users, are also entering the mainstream. Amazon EC2 [77] is a prominent cloud-based solution in which users can rent computational hardware. Eucalyptus is a private cloud computing implementation designed to operate on computer clusters behind web-based portals [48]. Hybrids of these Grid models, cloud-based systems, and middleware will continue to emerge as well [1, 22]. Future Grids will contain dedicated high
performance clusters, individual less-powerful machines, and even a variety of alternative devices
such as PDAs, sensors, and instruments. They will run Grid middleware, such as Condor [42] and
Globus [24], that enables a wide range of policies for contributing resources to the Grid.
This continued increase in functional heterogeneity will make Grid scheduling—mapping jobs
onto available constituent Grid resources—even more challenging, partly because different resources
will exhibit varying unavailability characteristics. For example, laptops may typically be turned on
and off more frequently, and may join and leave the network more often. Individual workstation
and desktop owners may turn their machines off at night, or shut down and restart to install new
software more often than a server’s system administrator might. Even the same kinds of resources
will exhibit different availability characteristics. CS department research clusters may be readily
available to the Grid, whereas a small cluster from a physics department may not allow remote Grid
jobs to execute unless the cluster is otherwise idle. Site autonomy has long been recognized to be
an important Grid attribute [39]. In the same vein, even owners of individual resources will exhibit
different usage patterns and implement different policies for how available they make their machines.
If Grid schedulers do not account for these important differences, they will make poor mapping
decisions that will undermine their effectiveness. Schedulers that know when, why, and how the re-
sources become unavailable, can be much more effective, especially if this knowledge is coupled with
information about job characteristics. For example, long-running jobs that do not implement check-
pointing may require highly available host machines. Checkpointable jobs that require heavyweight
checkpoints may prefer resources that are reclaimed by users, rather than those that unexpectedly
become unavailable, thereby making an on-demand checkpoint possible. This is less important for
jobs that produce lightweight checkpoints that can be produced on shorter notice. Easily replicable
processes without side effects might perform nearly as well on the less powerful and more intermit-
tently available resources, leaving the more available machines for jobs that cannot deal as well with
machine unavailability (at least under moderate to high contention for Grid resources).
This thesis describes building Grid schedulers that use the unavailability and performance char-
acteristics of target resources when making scheduling decisions, a goal that requires progress in
several directions. The thesis is organized as follows. Chapter 3 describes a multi-state availabil-
ity model and examines a University of Notre Dame trace in terms of that model to establish the
volatility and to characterize the availability of the underlying resources. Chapter 4 examines several
prediction techniques to forecast the future states of a resource’s availability. Chapter 5 investigates
using different approaches to schedule jobs with different characteristics based on those predictions,
by attempting to choose resources that are least likely to become unavailable. Chapter 6 then explores the efficacy of using the availability predictors in a new way: deciding which jobs to replicate, with the goal of replicating those jobs most likely to experience resource unavailability. Chapter 7 investigates how replication techniques can adapt to changing system load, and shows how load can affect replication policy selection. Chapter 8
investigates predicting and scheduling in the unconventional environment of multi-functional devices
located in industry settings. Lastly, Chapter 9 describes the simulator tool that produced the results
in this work.
1.1 Characterizing Availability
To build Grid schedulers that consider the unavailability characteristics of target resources, resources
must be traced to uncover why and how they become unavailable. As described in Chapter 2, most
availability traces focus primarily on when resources become unavailable, and detail the patterns
based purely on availability and unavailability. This study identifies different types of unavailability,
and analyzes the patterns of that unavailability.
It is also beneficial to classify resources in a way that availability predictors can take advantage of,
and design and build such predictors. Knowing that Grid resources become unavailable differently
and even predicting when and how they do is important only to the extent that schedulers can
exploit this information. Therefore, schedulers must be built that use both availability predictions
(which are based on unavailability characteristics) and application characteristics.
Section 3.1 identifies four resource availability states, dividing unavailable resources into separate
categories based on why they are unavailable. This allows resources that have abruptly become
unreachable to be differentiated from resources whose user has reclaimed them. Events that trigger
resources to transition between availability states are described. Applications are then classified
by characteristics that could help schedulers, especially in terms of those machines’ unavailability
characteristics.
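The four-state idea can be sketched as a small state machine. The state and event names below are illustrative assumptions rather than the model's actual labels (Chapter 3 defines those), but they show how a scheduler can distinguish a machine reclaimed by its owner from one that abruptly became unreachable.

```python
from enum import Enum, auto

class AvailState(Enum):
    """Four illustrative availability states (names are assumptions)."""
    AVAILABLE = auto()      # idle and accepting Grid jobs
    USER_PRESENT = auto()   # owner has reclaimed the machine for local use
    CPU_BUSY = auto()       # local load exceeds the sharing policy's threshold
    UNREACHABLE = auto()    # machine abruptly left the network

# Hypothetical events that trigger transitions between availability states.
TRANSITIONS = {
    (AvailState.AVAILABLE, "keyboard_activity"): AvailState.USER_PRESENT,
    (AvailState.AVAILABLE, "load_above_threshold"): AvailState.CPU_BUSY,
    (AvailState.AVAILABLE, "heartbeat_lost"): AvailState.UNREACHABLE,
    (AvailState.USER_PRESENT, "idle_timeout"): AvailState.AVAILABLE,
    (AvailState.CPU_BUSY, "load_below_threshold"): AvailState.AVAILABLE,
    (AvailState.UNREACHABLE, "heartbeat_resumed"): AvailState.AVAILABLE,
}

def next_state(state, event):
    """Apply an observed event; unrecognized events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

In such a model, a transition into USER_PRESENT is a graceful eviction (a checkpoint can be requested), while a transition into UNREACHABLE loses any unsaved work.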
Section 3.2 describes traces that expose unavailability characteristics more explicitly than current
traces. The results presented identify the availability states that resources inhabit, as well as the
transitions that resources make between these states. Section 3.2 reports sample data from four
months of traces from the University of Notre Dame’s Condor pool.
To summarize, the resource characterization contributions include
• a new way of analyzing trace data, which better exposes resource unavailability characteristics,
• sample data from one such Condor-based trace, and
• a new classification approach that enables effective categorization of resources into availability
states.
These results make a strong case for studying how a Grid’s constituent resources become un-
available, for using that information to predict how individual resources may behave in the future,
and for building that information into future Grid schedulers, especially those that will operate in
increasingly eclectic and functionally heterogeneous Grids.
1.2 Predicting Resource Unavailability
An availability predictor, which forecasts the availability of candidate resources for the period of time
that an application is likely to run on them, can help schedulers make better application placement
decisions. Clearly, selecting resources that will remain available for the duration of application
runtimes would improve the performance of both individual applications and of the Grid as a whole.
Resource categorization work demonstrates the potential for using past history to organize resources
into categories that define their availability behavior, and then using schedulers that take advantage
of this resource categorization to make better scheduling decisions [59]. However, due to the low
granularity of the classes, the static nature of each machine's class, and the fact that the classes do
not recognize patterns in a machine’s availability, on-demand availability prediction proved to be an
improvement over the classification approach. Importantly, this work uses a “multi-state” availability
model, which captures not just when and whether a resource becomes unavailable, but how it does
so. This distinction allows schedulers to make decisions that improve overall performance, when
they are configured to also consider application characteristics. For example, if some particular
resource is expected to become unavailable because its owner reclaims the machine for local use
(as opposed to the machine becoming unreachable unexpectedly), then that machine may be an
attractive candidate for scheduling an application that can take an on-demand checkpoint.
Chapter 4 describes investigations on predicting the availability of resources. Several kinds
of predictors are described which use different criteria to predict resource availability. The best
performing predictor considers how resources transition between the proposed model’s unavailability
states, and outperforms other approaches in the literature by 4.6% in prediction accuracy. As shown
in later chapters, simulated scheduling results indicate that this accuracy difference significantly
decreases both makespan and operations lost due to eviction.
To summarize, the resource availability prediction contributions include
• introducing new multi-state availability prediction algorithms that effectively forecast future
resource availability behavior for the proposed multi-state availability model,
• evaluating the most effective configurations of the predictors in this set, and
• producing significant prediction accuracy increases across a variety of prediction lengths com-
pared to existing prediction approaches.
These results demonstrate the feasibility of predicting the future availabilities of Grid resources
which can in turn be used by schedulers to make better job placement decisions.
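As a concrete, simplified illustration of availability prediction, the empirical estimator below computes, from a resource's logged availability spells, the fraction that lasted at least as long as a requested duration. It is a hypothetical stand-in for the transition-based predictors developed in Chapter 4, not the dissertation's actual algorithm; the history format is an assumption.

```python
def p_available_for(history, duration):
    """Estimate the probability that a currently available resource remains
    available for `duration` more hours.

    `history` is a list of (state, spell_length_hours) pairs observed for the
    resource; this counting predictor simply asks how often past availability
    spells survived at least `duration` hours.
    """
    avail_spells = [length for state, length in history if state == "available"]
    if not avail_spells:
        return 0.0  # no evidence of availability; treat as unreliable
    survived = sum(1 for length in avail_spells if length >= duration)
    return survived / len(avail_spells)
```

A scheduler could then request a prediction for (at least) the expected job runtime and prefer resources with higher survival probability.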
1.3 Prediction-based Scheduling
Schedulers that effectively predict the availability of constituent resources, and use these predictions
to make scheduling decisions, can improve application performance. Ignoring resource availability
characteristics can lead to longer application makespans due to wasted operations [19]. Even more
directly, it can adversely affect application reliability by favoring faster but less available resources
that cannot complete jobs before becoming unavailable or being reclaimed from the Grid by their
owners. Unfortunately, performance and reliability vary inversely [19]; favoring one necessarily
undermines the other.
For this reason, Grid schedulers must consider both reliability and performance in making
scheduling decisions. This requirement is true for current Grids, and will become more impor-
tant as resource characteristics become increasingly diverse. Resource availability predictors can be
used to determine resource reliability and this in turn can be combined with measures of resource
performance when selecting resources. Since performance depends on competing resource load dur-
ing the lifetime of the application, schedulers must make use of load monitors and predictors [80],
which can be centralized, or can be distributed using various dissemination approaches [20].
In scheduling for performance and reliability, it is critical to investigate the effect on both ap-
plication makespan and the number of wasted operations due to application evictions, which can
occur when a resource executing an application becomes unavailable. Evictions and wasted op-
erations are important because of their direct “cost” within a Grid economy, or simply because
they essentially deny the use of the resource by another local or Grid application. The proposed
approach to Grid scheduling involves analyzing resource availability history and predicting future
resource (un)availability, monitoring and considering current load, storing static resource capability
information, and considering all of these factors when placing applications. For scheduling, the best
availability predictor (Chapter 4) is used to investigate the effects of weighing resource speed, load,
and reliability in a variety of ways, to decrease makespan and increase application reliability.
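The weighing of speed, load, and reliability can be sketched as a linear combination controlled by a tradeoff weight. This is a minimal illustration of the idea only; the actual scoring techniques evaluated in Chapter 5 differ, and the parameter names and normalization below are assumptions.

```python
def score(speed, load, p_reliable, w):
    """Combine performance and predicted reliability into one resource score.

    speed:      normalized machine speed in [0, 1]
    load:       competing load in [0, 1]
    p_reliable: predicted probability of staying available for the job
    w:          tradeoff weight in [0, 1]; higher values favor performance
    """
    performance = speed * (1.0 - load)
    return w * performance + (1.0 - w) * p_reliable

def pick_resource(candidates, w):
    """Choose the (speed, load, p_reliable) candidate with the best score."""
    return max(candidates, key=lambda c: score(*c, w))
```

Sweeping w from 0 to 1 traces out the makespan/eviction tradeoff: at w = 0 the scheduler picks the most reliable machine regardless of speed, and at w = 1 the fastest, least loaded machine regardless of predicted reliability.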
Chapter 5 also investigates different approaches to scheduling applications with varying characteristics. In particular, results are presented for scheduling checkpointable jobs to consider speed and
load more heavily than reliability. This is done because their eviction involves fewer wasted oper-
ations when compared with non-checkpointable jobs due to their ability to save state and resume
execution elsewhere.
The simulation-based performance study presented here begins by establishing the inherent per-
formance/reliability scheduling tradeoff for a real world environment. The performance of several
different schedulers that consider both reliability and performance is then characterized. Subsequent
sections develop and explore the idea of varying the requested prediction duration by demonstrating
its effect on application execution performance. The study focuses on scheduling for the two compet-
ing metrics of performance and reliability by characterizing the effects of (i) treating checkpointable
jobs differently from “non-checkpointable” jobs, and (ii) varying the checkpointability, length, and
number of jobs. These approaches are compared with the only other multi-state availability predic-
tor and scheduler, designed by Ren et al. [55]. Section 5.4.1 characterizes the relative performance
and configurability of the two competing approaches. Results show that Ren’s scheduler causes
up to 51% more evicted jobs while simultaneously increasing average job makespan by 18% when
compared with the best performing scheduler proposed here.
To summarize, the prediction-based scheduling contributions include
• establishing the performance/reliability scheduling tradeoff with real world traces,
• introducing several schedulers that consider both reliability and performance,
• varying prediction duration and showing its effect on job execution performance,
• characterizing the effect of scheduling checkpointable/non-checkpointable jobs differently, and
• evaluating the proposed schedulers against related scheduling approaches under a variety of job loads, lengths, and checkpointabilities.
The scheduling results presented here demonstrate the feasibility of using resource availability
predictions to aid schedulers in choosing reliable resources and to decrease average job makespan.
1.4 Prediction-based Replication
Chapter 6 investigates the efficacy of using the availability predictors in a completely new way,
namely to decide which jobs to replicate. The basis for this idea is that checkpointing and replication
are two of the primary tools for dealing with resource unavailability. The system cannot generally
add checkpointability; the application programmer must do that. So whereas schedulers can (and
should) take advantage of knowing which jobs are checkpointable (Chapter 5 explores this idea),
they cannot proactively add reliability by increasing the checkpointability of the job mix.
The system can, however, replicate some jobs in an attempt to deal with possible resource
unavailability. Replicating a job can benefit performance in two ways. First, jobs are scheduled
onto resources using imperfect ranking metrics that may or may not reflect how fast a job will run
on a machine. Therefore, when a job runs simultaneously on more than one resource, the application makespan is determined by the earliest completion time among all replicas. Because completion time depends on unpredictable load and usage patterns, a replica can potentially improve the makespan
of the job. Second, replicated job executions also help deal with unavailability: when one resource becomes unavailable, the adverse effect on the performance of the jobs it runs is reduced if replicas complete without experiencing resource unavailability.
Replication does not come without a cost, however. Within a Grid economy, it is likely that
applications will need to pay for Grid use on a per job, per process, or per operation basis. Therefore,
replication can cost extra, assuming redundant jobs are counted separately (which seems likely).
Replication can also have an adverse indirect effect, as some of the highest ranked resources could
be used for executing replicas, leaving only “worse” (by whatever metric the scheduler uses to rank
jobs) resources for subsequent jobs.
For this reason, Chapter 6 tests the hypothesis that availability predictors can help select the
right jobs to replicate. This approach improves overall average job makespan, reduces the redundant
operations needed for the same improved makespan, or both. Chapter 6 explores several replication
strategies, including strategies that make copies of all jobs, some fixed percentage of jobs, long
jobs, jobs that are not checkpointable, and several combinations. These strategies are tested under
a variety of loads using two availability traces [29][59]. Chapter 6 explores the effectiveness of
replicating only the jobs that are mapped onto resources that fall below some reliability threshold.
That is, the predictor is asked to forecast the likelihood that a resource will complete a particular
job without becoming unavailable. If this predicted probability value is too low, the job is replicated.
If the resource is deemed reliable enough, the job is not replicated.
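A minimal sketch of this threshold test, under the assumption that the predictor exposes a per-placement completion probability (the `predict_completion` interface and the 0.8 threshold are hypothetical):

```python
def jobs_to_replicate(placements, predict_completion, threshold=0.8):
    """Return the placements whose predicted completion probability
    falls below the reliability threshold.

    placements: list of (job, resource) pairs already made by the scheduler.
    predict_completion(resource, job): assumed to return the predicted
        probability that `resource` stays available long enough to
        finish `job` without becoming unavailable.
    """
    return [(job, res) for job, res in placements
            if predict_completion(res, job) < threshold]

# Toy predictor for illustration: reliability stored on the resource.
predict = lambda res, job: res["p_avail"]
placements = [({"id": 1}, {"p_avail": 0.95}),
              ({"id": 2}, {"p_avail": 0.60})]
```

Here only the job placed on the 0.60-reliability resource would be replicated; the job on the reliable resource runs without a replica, saving redundant operations.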
To summarize, the prediction-based replication contributions include
• describing a new metric, replication efficiency, to analyze replication effectiveness,
• analyzing a replication strategy that uses availability predictions to determine when to replicate jobs, and investigating its effect on both makespan and replication efficiency,
• describing how system load affects replication effectiveness, and
• proposing three different static load adaptive replication techniques, which consider the system
load and the desired metric to decide which replication strategy to use.
Prediction-based replication results show that using resource availability predictions to repli-
cate those jobs that are most likely to experience unavailability can increase job performance and
efficiency.
1.5 Load-adaptive Replication
Choosing whether to replicate a job is further complicated by the current load on the system.
As system load increases, selective replication strategies that make fewer replicas produce larger
makespan improvements than more aggressive strategies [62]. The replication strategy described in
Section 6.3.3 consists of a static set of four replication techniques ranging from least selective to
most selective, allowing the current load on the system to determine which replication technique to
use. The four replication strategies are chosen statically based on performance data from a single
synthetic workload. This technique does not perform as well when faced with different real world
workloads. Chapter 7 proposes a replication strategy that adapts to changing system load and
improves on the previously proposed technique.
Section 7.3 introduces a new load adaptive strategy, Sliding Replication (SR). SR first determines
the current load on the system and uses this value as input to a function that outputs an index into
an ordered list of replication strategies. The selected replication strategy determines whether the
job should be replicated. As load changes, the replication strategy changes. This approach attempts
to dynamically discover the most appropriate replication technique given the current load. Real
world resource availability and workload traces are used to test SR’s performance versus previously
proposed load adaptive and non-adaptive replication techniques. The results show that SR produces
comparable average job makespan improvements while decreasing the number of wasted operations,
as measured by replication efficiency (see Section 6.3). SR improves efficiency by an average of
347% when compared with a non-load adaptive approach and by 70% when compared with a less
sophisticated load adaptive approach.
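The core mechanism of SR, mapping measured load onto an ordered strategy list, can be sketched as follows. The linear mapping and the strategy names are illustrative placeholders, not SR's actual function or parameters, which Section 7.3 develops in full:

```python
def select_strategy(system_load, strategies):
    """Map current system load in [0, 1] onto an ordered list of
    replication strategies (least to most selective).

    As load rises, a more selective strategy (fewer replicas) is chosen.
    """
    index = min(int(system_load * len(strategies)), len(strategies) - 1)
    return strategies[index]

# A hypothetical ordering, least to most selective:
STRATEGIES = ["replicate-all", "replicate-long-jobs",
              "replicate-unreliable-only", "replicate-none"]
```

Under light load the system can afford aggressive replication; as load climbs, the selected index slides toward strategies that create fewer replicas.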
To summarize, the load-adaptive replication contributions include
• showing how the changing system load of real-world job traces affects replication policy performance,
• proposing a load-adaptive replication approach, Sliding Replication, which measures system
load and selects the most appropriate replication policy based on the current system load,
• investigating how varying the parameters of Sliding Replication affects its performance, and
• testing SR’s performance against a related work replication approach in two real-world resource
trace environments and with two real-world job traces.
Load adaptive replication results demonstrate how varying system load affects replication policy
performance. Load-adaptive replication can adapt to changing load by selecting the replication
policy most appropriate for the current system load.
1.6 Multi-functional Device Characterization and Scheduling
The traces and workloads for the results summarized in Sections 1.1 through 1.5 are all derived from conventional, workstation-based environments. To explore the applicability of the proposed approaches elsewhere, this work also leverages unconventional computing environments, defined here as collections of devices that are not general-purpose workstations. In particular, the following facts are exploited:
1. Special purpose instruments and devices are increasingly being outfitted to include processors,
memory, and peripherals that make them capable of supporting high performance computing,
not just for their intended purpose.
2. Large corporations often deploy dozens, hundreds, and even over a thousand such devices
within a single floor, building, or campus, respectively. Moreover, these devices are connected
to one another with high speed networks.
3. The fast processors and large amounts of memory are often needed to handle peak loads, not sustained, long-running applications. Combined with the inherent characteristics of the local job request mix, this leaves the devices idle for a significant percentage of the time (95% to 98%).
These characteristics provide the potential to federate and harvest the unused cycles of special-purpose devices for compute jobs that require high-performance platforms. Chapter 8 investigates the
viability of using Multi-Function Devices (MFDs) as a Grid for running computationally intensive
tasks under the challenging constraints described above.
To summarize, the MFD characterization and scheduling contributions include
• developing a multi-state availability model and burstiness metric to capture the dynamic and
preemptive nature of jobs encountered on MFDs,
• gathering an extensive MFD availability trace from a representative environment and analyzing
it to determine the states in the proposed model,
• proposing an online and computationally efficient test for burstiness using the trace, and
• showing, through trace-based simulations, that MFDs can be used as computational resources
and demonstrating the scheduling improvements that can be achieved by considering resource
burstiness in making job placements.
This MFD characterization work demonstrates the viability of using idle MFDs as computation
resources. In particular, statistical confidence intervals are computed on the lengths of Grid jobs
that would suffer a selected preemption rate. This assists in selecting Grid jobs appropriate for the
bursty environment. On average across all loads studied, the proposed MFD scheduling technique comes within 7.6% of a pseudo-optimal scheduler's makespan improvement and produces an average makespan improvement of 18.3% over a random scheduler.
1.7 Simulation Tool
All the previous research topics require a platform on which to examine their behavior. Current
simulation-based approaches are adequate for some of the experiments that do not use multi-state
prediction, but no existing system could test multi-state prediction-based scheduling. PredSim is a new experimental platform capable of handling resources that exhibit different types of unavailability. PredSim consists of a variety of input, user-specified, and core components.
These components provide a framework for testing prediction, scheduling and replication approaches.
The key contributions of PredSim include
• providing functionality for multi-state based resource availability prediction, scheduling, task
replication and job execution management in a purpose-built environment,
• accepting and combining workload and resource availability traces in multiple formats and
supplying facilities for synthetically creating both jobs and resources,
• providing a “Predictor Accuracy” mode designed to isolate and determine a prediction algo-
rithm’s accuracy across a variety of circumstances, and
• providing a library of predictors, schedulers, replication policies, and job queuing strategies, with the flexibility to add new modules.
PredSim has been a helpful tool in researching fault tolerant distributed computing approaches.
Figure 1.1: Thesis flow chart
1.8 Summary
This thesis explores the idea of using resource reliability information to improve distributed system
performance. Figure 1.1 demonstrates the logical flow of this work. Resource traces provide availabil-
ity behavior characterization and demonstrate underlying resource volatility. This characterization
information motivates prediction approaches that forecast resource availability. Schedulers use the
prediction information, alongside performance information, to select resources for application execution. Resource availability predictions are then used to choose which jobs are most suitable for replication. These
replication approaches can incorporate current system load in determining which jobs to replicate.
The results presented in this thesis demonstrate that schedulers that judiciously use resource
reliability information provided by availability predictors can improve system performance by se-
lecting resources that are least likely to become unavailable for application execution, by matching
an application’s characteristics to the availability behavior of resources, and by efficiently choosing
those jobs that can most benefit from replication.
Chapter 2
Related Work
Related work resides in five broad categories, namely (i) resource tracing and characterization, (ii)
burstiness and MFD scheduling, (iii) prediction of resource state (availability and load), (iv) Grid
scheduling, especially approaches that consider reliability and availability, and (v) job replication.
This chapter is organized accordingly.
2.1 Resource Characterization
Condor [42] is the best-known cycle-stealing system. Its ability to utilize idle resources with minimal technical and administrative overhead has enabled its long-term popularity and success. Resource
management systems like Condor form the basis for the resource characterization work in Chapter 3.
Several papers describe traces of workstation pools. Acharya et al. [2] take three 14-day traces
at three university campuses, and analyze them for availability over time, and for the fraction of
time a cluster of k workstations is free. Bolosky et al. [9] trace more than 51,000 machines at
Microsoft, and report the uptime distribution, the available machine count over time, and temporal
correlations among uptime, CPU load per time of the week, and lifetime. Arpaci et al. [8] describe
job submission by time of day and length of job, report machine availability according to time of
day, and analyze the cost of process migration.
Other traces address host availability and CPU load in Grid environments. Kondo et al. [34][36]
analyze Grid traces from an application perspective, by running fixed length CPU bound tasks. The
authors analyze availability and unavailability durations during business and non-business hours,
in terms of time and number of operations [34]. They also show the cumulative distribution of
availability intervals versus time and number of operations [36]. The authors derive expected task
failure rates but state that the traces are unable to uncover the specific causes of unavailability
(e.g. user activity vs. host failure). The Condor traces used in this work borrow from this general
approach.
Ryu and Hollingsworth [64] describe fine grained cycle stealing (FGCS) by examining a Network
Of Workstations (NOW) trace [8] and a Condor trace for non-idle time and its causes (local CPU
load exceeding a threshold, local user presence, and the time the Grid must wait after a user has
left before utilizing the resource). Anderson and Fedak [5] analyze BOINC hosts for computational
power, disk space, network throughput, and number of hosts available over time (in terms of churn,
lifetime, and arrival history). They also briefly study on-fraction, connected-fraction, active-fraction,
and CPU efficiency. Finally, Chun and Vahdat [16] report PlanetLab all pairs ping data, detailing
resource availability distribution and mean time to failure (MTTF).
Unlike the work described above, the traces investigated in this work are designed to uncover the
causes of unavailability. The trace data is analyzed and used to classify machines into the proposed
availability states. Individual resources are analyzed for how they enter the various states over time
(by both day and hour). Less extensive traces are intended primarily to motivate the approach to
availability awareness and prediction in Grid scheduling proposed in this work.
2.1.1 MFD characterization and Burstiness
As mentioned above, Acharya et al. [2], Bolosky et al. [9], and Arpaci et al. [8] analyze the availability of workstations over time in various academic and industry settings. In these classes of resources, however, the dynamics of intermittent job arrivals (burstiness) and preemption are either absent or not explicitly analyzed; their analysis is unique to this thesis.
Kondo et al. [34][36], Ryu and Hollingsworth [64] and Anderson and Fedak [5] examine resource
behavior but state that the traces are unable to uncover the specific causes of unavailability (e.g.
user activity vs. host failure). Unlike the work described above, the traces gathered for this work
are designed not only to uncover the causes of unavailability, but also to examine them in the envi-
ronment of MFDs. Burstiness and preemptive job arrivals in workstation pools have received little
attention in the literature possibly because such resources are seldom suitable for distributed com-
puting, especially parallel computing. In the literature, the volatility of any type of computational
resource is mitigated by modeling the future availability of resources with predictors (and using such
predictors for inline scheduling, as described below). The approach proposed in Chapter 8 characterizes the magnitude of the disturbance that intermittent, preemptive job arrivals impose on Grid jobs, and normalizes it across a pool of resources.
2.2 Prediction
Prediction algorithms have been used in many aspects of computing, including the prediction of
future CPU load and resource availability. The most notable prediction system for Grids and
networked computers is Network Weather Service (NWS) [80]. NWS uses a large set of linear
models for host load prediction, and combines them in a mixture-of-experts approach that chooses
the best performing predictor. The RPS toolkit [18] uses a set of linear host load predictors. One
such common linear technique is called BM(p), or Sliding Window, which records the states occupied
in the last N intervals and averages those values.
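A minimal sketch of such a BM(p)/Sliding Window predictor; the class name and the use of a simple arithmetic mean over the window are illustrative:

```python
from collections import deque

class SlidingWindowPredictor:
    """BM(p)-style predictor: the forecast for the next interval is
    the mean of the values observed in the last p intervals."""

    def __init__(self, p):
        self.window = deque(maxlen=p)

    def observe(self, value):
        self.window.append(value)  # the oldest value drops out automatically

    def predict(self):
        return sum(self.window) / len(self.window)

predictor = SlidingWindowPredictor(p=3)
for load in (0.2, 0.4, 0.6, 0.8):
    predictor.observe(load)
```

After the fourth observation, only the last three values (0.4, 0.6, 0.8) remain in the window, so the prediction is their mean, 0.6.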
Computer and system hardware failure has been extensively studied in previous work [66][69][68].
Schroeder et al. analyze nine years of failure data from the Los Alamos National Laboratory, including
more than 23,000 machine failures. The authors study the root causes of failure and the mean
time to failure of the resources [68]. Sahoo et al. investigate system failures from a network of
400 heterogeneous servers for patterns and distribution information [66]. Gibson et al. analyze disk
replacement data from high-performance computing sites and Internet sites for mean-time-to-failure distributions [69].
It is not the purpose of this work to specifically target hardware failures. Instead, this approach
analyzes and predicts the reasons machines become unavailable to the Grid, including resource
failure, user presence and high local CPU load. Unavailability due to resource failure, while not
uncommon, is less frequent than other forms of unavailability. For this reason, the prediction
techniques proposed here forecast all types of unavailability, not just hardware failure unavailability.
Importantly, this work attempts to predict the future occurrence of individual instances of resource
unavailability rather than fitting failure data to probability distributions.
In availability prediction, Ren et al. [55] [54] [56] use empirical host CPU utilization and resource
contention traces to develop the only other multi-state model, prediction technique, and multi-state
prediction based scheduler for resource availability. Their multi-state availability model includes five
states, three of which are based on the CPU load level (which resides in one of three zones); the two
other states indicate memory thrashing and resource unavailability. This model neither captures
user presence on a machine, nor allows for differentiation between graceful and ungraceful process
eviction. For prediction, the authors count the transitions from available to each state during the
previous N days to produce a Markov chain for state transition predictions. These transition counts
determine the probability of transitioning to each state from the available state. The Transitional
N-Day Equal Weight (TDE) predictor (Chapter 4) uses a similar approach, but weighs each day’s
transitions separately by first computing the probabilities for each of the N days and then combining
them. This is in contrast to the Ren predictor, which sums all the transitions through the N days
and then computes the probability according to their algorithm. Experiments show that the TDE
method more effectively combines many days without allowing a single day’s behavior to dominate
the average. Also, Ren only counts the number of transitions to each state from available (i.e.
three counters), uses these counts to determine the probability of exiting available for each state,
then computes the probability of staying available as one minus the sum of the three exit state
probabilities. In contrast, the TDE approach counts the number of times a resource completes a
period of time equal to the prediction length during resource history analysis. This additional state
counter is summed with the non-available state counters and then each state count is divided by the
total count to produce the four probabilities. This technique allows the predictor to more accurately
determine how often a resource completes the time interval required by the application. To further
differentiate the approaches proposed in this work, the most successful scheduling predictor examines
the most recent N hours of behavior instead of the time interval in question on the previous N days.
The proposed predictors also employ several transition weighting schemes for further prediction
improvement and enhanced scheduling results. These differences significantly influence scheduling
results.
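The difference between the two counting schemes can be sketched as follows. This is a deliberately simplified illustration: it uses only two states, omits TDE's completed-interval counter and transition-weighting schemes, and the function names are hypothetical.

```python
def ren_style(daily_counts):
    """Sum transition counts across all N days, then normalize once."""
    n_states = len(daily_counts[0])
    totals = [sum(day[s] for day in daily_counts) for s in range(n_states)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

def tde_style(daily_counts):
    """Compute per-day probabilities first, then average across days,
    so one unusually busy day cannot dominate the estimate."""
    n_days = len(daily_counts)
    n_states = len(daily_counts[0])
    probs = [0.0] * n_states
    for day in daily_counts:
        day_total = sum(day)
        for s in range(n_states):
            probs[s] += day[s] / day_total / n_days
    return probs

# Two states (stay available vs. exit available), two days of counts;
# day 2 saw far more transitions than day 1.
history = [[1, 1], [8, 2]]
```

For this history, summing first yields probabilities of (0.75, 0.25), dominated by the busy second day, whereas averaging per-day probabilities yields (0.65, 0.35), giving the quiet first day equal weight.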
Mickens and Noble [45] [44] [46] use variations and combinations of saturating counters and
linear predictors, including a hybrid approach similar to NWS’s mixture-of-experts, to predict the
likelihood of a host being available for various look ahead periods. A Saturating Counter predictor
increases a resource-specific counter during periods of availability, and decreases it during unavail-
ability. A History Counter predictor gives each resource 2N saturating counters, one for each of the
possible availability histories dictated by the last N availability sampling periods. Predictions are
made by consulting the applicable counter value associated with the availability exhibited in the last
N sampling periods.
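A sketch of a History Counter predictor along these lines; the counter width, initialization, and upper-half decision rule are illustrative choices, not Mickens and Noble's exact parameters:

```python
class HistoryCounterPredictor:
    """One saturating counter per possible N-bit availability history
    (2^N counters per resource)."""

    def __init__(self, n, max_count=7):
        self.n = n
        self.max_count = max_count
        self.counters = [max_count // 2] * (2 ** n)
        self.history = 0  # last n availability samples, packed as bits

    def update(self, available):
        # Saturate the counter selected by the current history, then
        # shift this sample into the history register.
        c = self.counters[self.history]
        self.counters[self.history] = (min(c + 1, self.max_count)
                                       if available else max(c - 1, 0))
        mask = (1 << self.n) - 1
        self.history = ((self.history << 1) | int(available)) & mask

    def predict_available(self):
        # Predict availability when the counter for the current
        # history sits in its upper half.
        return self.counters[self.history] > self.max_count // 2
```

A resource that has been steadily available drives the relevant counters up and is predicted available; a steady run of unavailability drives them down and flips the prediction.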
Pietrobon and Orlando [50] use regressive analysis of past job executions to predict whether a
job will succeed. Nurmi et al. [47] model machine availability using Condor traces and an Inter-
net host availability dataset, attempt to fit Weibull, hyper-exponential, and Pareto models to the
availability duration data, and evaluate them in terms of “goodness-of-fit” tests. They then provide
confidence interval predictions for availability durations based on model-fitting [16]. Similarly, Kang
and Grimshaw filter periodic unavailabilities out of resource availability traces and then apply sta-
tistical models to the remaining availability data [30]. Finally, several machine learning techniques
use categorical time-series data to predict rare target events by mining event sets that frequently
precede them [76] [78] [65].
Chapter 4’s prediction approach differs from the current work by developing novel state transition
counting schemes, by proposing and investigating numerous methods for determining which parts
of a resource’s historical data to analyze, and by using its predictions to forecast a unique set of
availability states. Several transition weighting techniques are developed and are applied to these
algorithms to further increase prediction accuracy. Multi-state availability prediction is used to
forecast the occurrence of the model’s states, and shows that the improved accuracy provided by
the proposed predictors produces significantly improved scheduling results.
2.3 Scheduling
Most Grid scheduling research attempts to decrease job makespan or to increase throughput. For
example, Kondo et al. [32] explore scheduling in a volunteer computing environment. Braun et al. [11]
explore eleven heuristics for mapping tasks onto a heterogeneous system, including min-min, max-
min, genetic algorithms and opportunistic load balancing. Cardinale and Casanova [13] use queue
length feedback to schedule divisible load jobs with minimal job turnaround time. GrADS [17] uses
performance predictions from NWS, along with a resource speed metric, to help reduce execution
time.
Fewer projects focus on scheduling for reliability. Kartik and Murphy [31] calculate the optimal
set of processor assignments based on expected node failure rates, to maximize the chance of task
completion. Qin et al. [51] investigate a greedy approach for scheduling task graphs onto a hetero-
geneous system to reduce reliability cost and to maximize the chance of completion without failure.
Similarly, Srinivasan and Jha [72] use a greedy approach to maximize reliability when scheduling
task graphs onto a distributed system.
Unfortunately, scheduling only for reliability undermines makespan, and scheduling only on the
fastest or least loaded machines can be detrimental due to the performance ramifications of resource
unavailability. Dogan and Ozguner [19] develop a greedy scheduling approach that ranks resources
in terms of execution speed and failure rate, weighing performance and reliability in different ways.
Their work does not use availability predictions, and assumes that each synthetic resource follows
a Poisson failure probability with no load variation, and that machines which become unavailable
never restart. All of these assumptions are removed in the work described in this thesis, which uses
availability and load traces from real resources.
Amin et al. [3] use an objective function to maximize reliability while still meeting a real time
deadline. They search a scheduling table for a set of homogeneous non-dedicated processors to
execute tandem real-time tasks. These authors also assume a constant failure rate per processor.
Some distributed scheduling techniques use availability prediction to allocate tasks. Kondo et
al. [33] examine behavior on the previous weekday to improve the chances of picking a host that will
remain available long enough to complete a task’s operations. Ren et al. [55] also examine scheduling
jobs with their Ren N-day predictor. The Ren MTTF scheduler first calculates each resource’s mean
time to failure (MTTF_i) by adding the probabilities of exiting available for time t, as t goes from 0
to infinity. It then calculates resource i's effective task length (ETL_i) as:

ETL_i = MTTF_i · CR_i · (1 − L_i)

where CR_i is resource i's clock rate and L_i is its predicted average CPU load. The algorithm then
selects the resource with the smallest ETL from the resources with MTTF values that are larger
than the ETL. If no such resources exist, it selects the resource with the minimum job completion
time, considering resource availability. The scheduling approach in Chapter 5 is most similar to
Ren et al.’s, and differs in the following ways: (i) it considers how and why a resource may become
unavailable, and attempts to exploit varying consequences of different kinds of unavailability, (ii) it
schedules checkpointable and non-checkpointable jobs differently, to improve overall performance,
and (iii) it explicitly analyzes and schedules for the tradeoff between performance and reliability.
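The Ren selection rule just described can be sketched as follows. The dictionary layout is hypothetical, and the fallback branch is simplified here (it stands in for the minimum-completion-time rule):

```python
def ren_select(resources):
    """Pick a resource per the Ren MTTF rule: among resources whose
    MTTF exceeds their effective task length (ETL), choose the one
    with the smallest ETL.

    Each resource carries mttf (predicted mean time to failure),
    clock_rate (CR_i), and load (predicted average CPU load, L_i).
    """
    for r in resources:
        r["etl"] = r["mttf"] * r["clock_rate"] * (1.0 - r["load"])
    eligible = [r for r in resources if r["mttf"] > r["etl"]]
    if eligible:
        return min(eligible, key=lambda r: r["etl"])
    # Simplified fallback: best effective task length overall, standing
    # in for "minimum job completion time considering availability".
    return max(resources, key=lambda r: r["etl"])

resources = [
    {"name": "A", "mttf": 100, "clock_rate": 2.0, "load": 0.5},  # ETL = 100
    {"name": "B", "mttf": 200, "clock_rate": 1.0, "load": 0.5},  # ETL = 100
    {"name": "C", "mttf": 300, "clock_rate": 1.0, "load": 0.2},  # ETL = 240
]
```

Resource A is excluded because its MTTF does not exceed its ETL; of the remaining two, B has the smaller ETL and is selected.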
2.4 Replication
Replication and checkpoint-restart are widely studied techniques for improving fault tolerance and
performance. Data replication makes and distributes copies of files in distributed file sharing systems
or data Grids [53][37]. These techniques strive to give users and jobs more efficient access to data by
moving it closer, and to mitigate the effects of resource unavailability. Some work considers replica
location when scheduling tasks. Santos-Neto et al. [67] schedule data-intensive jobs, and introduce
Storage Affinity, a heuristic scheduling algorithm that exploits a data reuse pattern to consider data
transfer costs and ultimately reduce job makespan.
Task replication makes copies of jobs, again for both fault tolerance and performance. Li et
al. [40] strive to increase throughput and decrease Grid job execution time, by determining the
optimal number of task replicas for a simulated and dynamic resource environment. Their analytical
model determines the minimum number of replicas needed to achieve a certain task completion
probability at a specified time. They compare dynamic rescheduling with replication, and extend
the replication technique to include an N-out-of-M scheduling strategy for Monte Carlo applications.
Similarly, Litke et al. [41] present a task replication scheme for a mobile Grid environment. They
model resources according to a Weibull reliability function, and estimate the number of task replicas
needed for certain levels of fault tolerance. The authors use a knapsack formulation for scheduling,
to maximize system utilization and profit, and evaluate their approach through simulation.
Silva et al. [70] investigate scheduling independent tasks in a heterogeneous computational Grid
environment, without host speed, load, and job size information; the authors use replication to
cope with dynamic resource unavailability. Workqueue with Replication (WQR) first schedules all
incoming jobs, then uses the remaining resources for replicas. The authors use simulation to compare WQR, with various maximum numbers of replicas (1x, 2x, 3x, etc.), against Dynamic FPLTF [43] and Sufferage [15]. Anglano et al. [7] later extend this work with a technique called
WQR Fault Tolerant (WQR FT), which adds checkpointing to the algorithm. In WQR a failed task
is abandoned and never restarted, whereas WQR FT adds automatic task restart to keep the number
of replicas of each task above a certain threshold. Tasks may use periodic checkpoints upon restart.
Fujimoto et al. [25] develop RR, a dynamic scheduling algorithm for independent coarse grained
tasks; RR defines a ring of tasks that is scanned in round robin order to place new tasks and
replicas. The authors compare their technique with five others, concluding that RR performed next
to the best without knowledge of resource speed or load, even when compared with techniques that
utilize such information.
Others investigate the relationship between checkpoint-restart and replication. Weissman [79]
develops performance models for quantitatively comparing the separate use of the two techniques
in a Grid environment. Similarly, Ramakrishnan et al. [52] compare checkpoint-restart and task
replication, first analytically determining the costs of each strategy, and then providing a framework
that enables plug-and-play of resource behavior to study the effects of each fault-tolerance technique
under various parameters.
Importantly, none of the related work uses on-demand individual resource availability prediction
or application characteristics such as length and checkpointability to determine whether a task
replica should be created. In contrast, the work in this thesis uses a mix of checkpointable and
non-checkpointable jobs, and drives its simulations with real-world resource traces that capture
resource unavailability. This thesis also introduces a new metric for studying the efficiency of
replication strategies. The proposed replication techniques are then shown to improve upon results
obtained by existing techniques.
Chapter 3
Availability Model and Analysis
Resources in non-dedicated Grids oscillate between being available and unavailable to the Grid.
When and how they do so depends on the availability characteristics of the machines, the policies
of resource owners, the scheduling policies and mechanism of the Grid middleware, and the charac-
teristics of the Grid’s offered job load. Section 3.1 identifies four availability states and Section 3.2
analyzes a trace to uncover how resources behave according to that model.
3.1 Availability Model
This section identifies four availability states, and several job characteristics that could influence
their ability to tolerate resource faults. This discussion focuses on an analysis of Condor [42], but
the results translate to any system with Condor’s basic properties of (i) non-dedicated distributed
resource sharing, and (ii) a mechanism that allows resource owners to dictate when and how their
machines are used by the Grid. Section 3.1.1 first discusses the Condor system as a motivation for
the availability model, Section 3.1.2 then presents an availability model, and Section 3.1.3 discusses
how different types of Grid applications respond to the types of availability described in the model.
3.1.1 Condor
Condor [42] harnesses idle resources from clusters, organizations, and even multi-institutional Grid
environments (via flocking and compatibility with Globus [24]) by integrating resource management,
monitoring, scheduling, and job queuing components. Condor can automatically create process
checkpoints for migration. Condor manages non-dedicated resources, and allows individual owners
to set their own policies for how and when they are used, as described below. Default policies
dictate the behavior of resources in the absence of customized user policies, and attempt to minimize
Condor’s disturbance of local users and processes.
Figure 3.1: Multi-state Availability Model: Each resource resides in and transitions between four
availability states, depending on the local use, reliability, and owner-controlled sharing policy of the
machine. Edges mark the transitions: a user arriving at or departing from the machine, local load
rising above or falling below the CPU threshold, and the machine becoming unreachable or reachable.
By default, Condor starts jobs only on resources that have been idle for 15 minutes, that are not
running another Condor job, and whose local load is less than 30%. Running jobs remain subject to
Condor’s policies. If the keyboard is touched or if CPU load from local processes exceeds 50% for 2
minutes, Condor halts the process but leaves it in memory, suspended (if its image size is less than
10 MB). Condor resumes suspended jobs after 5 minutes of idle time, and when local CPU load falls
below 30%. If a job is suspended for longer than 10 minutes or if its image exceeds 10 MB, Condor
gives it 10 minutes to gracefully vacate, and then terminates it. Condor may also evict a job for a
higher priority job, or if Condor itself is shut down.
Condor defaults dictate that jobs whose checkpoint images are smaller than 60 MB checkpoint every
6 hours; those with larger images checkpoint every 12 hours. By default, Condor delivers checkpoints back
to the machine that submits the job.
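Read together, the default policies above amount to a small decision procedure. The following sketch encodes them; the numeric constants mirror the thresholds quoted in this section, while the function names and structure are illustrative rather than Condor's actual policy-expression language:

```python
# Hypothetical encoding of Condor's default start/suspend policy.
SUSPEND_LOAD = 0.50          # suspend if local CPU load exceeds 50% for 2 minutes
START_LOAD = 0.30            # start jobs only if local load is below 30%
START_IDLE_SECS = 15 * 60    # ... and the machine has been idle 15 minutes
MAX_SUSPEND_IMAGE_MB = 10    # larger images are vacated rather than suspended

def can_start(idle_secs, local_load, running_condor_job):
    """May Condor start a job on this resource under the default policy?"""
    return (idle_secs >= START_IDLE_SECS
            and local_load < START_LOAD
            and not running_condor_job)

def on_activity(local_load, load_high_secs, keyboard_touched, image_mb):
    """What happens to a running job when the resource becomes busy?
    (A suspension lasting over 10 minutes also escalates to vacate;
    that timer is omitted from this sketch.)"""
    if keyboard_touched or (local_load > SUSPEND_LOAD and load_high_secs >= 120):
        if image_mb < MAX_SUSPEND_IMAGE_MB:
            return "suspend"   # halted but left in memory
        return "vacate"        # 10 minutes to checkpoint gracefully, then kill
    return "continue"
```

Because owners may override any of these defaults per machine, resources in the same pool can exhibit very different availability behavior.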
3.1.2 Unavailability Types
Condor’s mechanism suggests a model that encompasses the following four availability states, de-
picted in Figure 3.1:
• Available: An Available machine is currently running with network connectivity, more than
15 minutes of idle time, and a local CPU load of less than the CPU threshold. It may or may
not be running a Condor job.
• User Present: A resource transitions to this state if the keyboard or mouse is touched, indi-
cating that the machine has a local user.
• CPU Threshold Exceeded: A machine enters this state if the local CPU load increases above
some owner-defined threshold, due to new or currently running jobs.
• Down: Finally, if a machine fails or becomes unreachable, it directly transitions to Down.
These states differentiate the types of unavailability. In the context of this work, resource un-
availability refers to the User Present, CPU Threshold Exceeded and Down states. If a job has been
suspended for 15 minutes or the machine is shut down, this is called a graceful transition to
unavailable; a transition directly to Down is ungraceful. This model is motivated by Condor's mechanism,
but can reflect the policies that resource owners apply. For example, if an owner allows Condor jobs
even when the user is present, the machine never enters the User Present state. Increasing the local
CPU threshold decreases the time spent in the CPU Threshold Exceeded state, assuming similar
usage patterns. The model can also reflect the resource’s job rank and suspension policy by showing
when jobs are evicted directly without first being suspended.
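Assuming the four states and the graceful/ungraceful distinction just described, the model can be captured in a few lines (names are illustrative):

```python
from enum import Enum

class State(Enum):
    """The four availability states of Figure 3.1."""
    AVAILABLE = "Available"
    USER_PRESENT = "User Present"
    CPU_EXCEEDED = "CPU Threshold Exceeded"
    DOWN = "Down"

# The three kinds of unavailability identified in this section.
UNAVAILABLE = {State.USER_PRESENT, State.CPU_EXCEEDED, State.DOWN}

def is_graceful(next_state):
    """A transition to User Present or CPU Threshold Exceeded gives the job
    a chance to checkpoint before eviction (graceful); a direct move to
    Down does not (ungraceful)."""
    return next_state in (State.USER_PRESENT, State.CPU_EXCEEDED)
```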
3.1.3 Grid Application Diversity
Grid jobs vary in their ability to tolerate faults. A checkpointable job need not be restarted from
the beginning if its host resource transitions gracefully to unavailable. Only Condor jobs that run
in the “standard universe” [75] are checkpointable. This requires re-linking with Condor's condor_compile tool,
which does not allow jobs with multiple processes, interprocess communication, extensive network
communication, file locks, multiple kernel level threads, files open for both reading and writing, or
Java or PVM applications [75]. These restrictions demonstrate that only some Grid jobs will be
checkpointable. Another important factor is job runtime; Grid jobs may complete in a few seconds,
or may require many hours or even days [8]. Longer jobs will experience more faults, increasing the
importance of their varied ability to deal with them.
Grid resources will have different characteristics in terms of how long they stay in each availability
state, how often they transition between the states, and which states they transition to. Different
jobs will behave differently on different resources. If a standard universe job is suspended and then
eventually gracefully evicted, it could checkpoint and resume on another machine. An ungraceful
transition requires using the most recent periodic checkpoint. A job that is not checkpointable must
restart from the beginning, even when gracefully evicted.
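The eviction cases above can be summarized in a short, hypothetical helper that returns how much completed work survives when a host becomes unavailable:

```python
def resume_point(checkpointable, graceful, elapsed, checkpoint_period):
    """How much completed work (in the job's own time units) survives an
    eviction? Illustrative sketch mirroring the cases described above."""
    if not checkpointable:
        return 0.0                       # must restart from the beginning
    if graceful:
        return elapsed                   # checkpoint on demand, then migrate
    # Ungraceful: fall back to the most recent periodic checkpoint.
    return (elapsed // checkpoint_period) * checkpoint_period
```

For example, a checkpointable job evicted ungracefully seven hours in, with a six-hour checkpoint period, resumes from hour six and loses one hour of work.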
The point is that job characteristics, including checkpointability and expected runtime, can
influence the effectiveness of scheduling those jobs on resources that behave differently according to
their transitions between the availability states identified in Section 3.1.2.
3.2 Availability Analysis
The characterization and analysis of Grid resources can be used to provide insight into resource
behavior for multiple purposes. The data can be used by researchers to better understand resource
behavior so that more realistic availability models can be built for simulation based experimentation.
More importantly for this work, the gathered traces can be directly used to drive simulations to test
various prediction, scheduling and replication strategies. Furthermore, analysis of the availability
patterns can identify trends and behaviors that can be exploited to create more accurate availability
predictors and scheduling strategies.
3.2.1 Trace Methodology
For this study, a four month Condor resource pool trace at the University of Notre Dame was
accessed, organized, and analyzed. The trace consists of time-stamped CPU load (as a percentage)
and idle time (in seconds). Condor records these measurements approximately every 16 minutes,
and makes them available via the condor_status command. Idle times of zero imply user presence,
and the absence of data indicates that the machine was down.
The data recorded by Condor precludes distinguishing an intentional shutdown from an unexpected
machine failure, because both appear identical in the data. Since the Down state is relatively
uncommon, this study conservatively assumes that all transitions to the Down state are ungraceful.
Also, a machine is considered to be in the User Present state only after the user has been present
for 5 minutes; this policy filters out
short unavailability intervals that lead to application suspension, not graceful eviction. A machine
is considered to be in the CPU Threshold Exceeded state if its local (i.e. non-Condor) load is
above 50%. Otherwise, a machine that is online with no user present and CPU load below 50%
is considered Available. This includes machines currently running Condor jobs, which are clearly
available for use by the Grid. On SMP machines, this study follows Condor’s approach of treating
each processor as a separate resource.
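Under these assumptions, mapping one trace sample to a state might look like the following sketch (the function name and sample encoding are hypothetical; the 5-minute user-presence filter operates across consecutive samples and is omitted here):

```python
def infer_state(sample, cpu_threshold=0.50):
    """Map one condor_status measurement to an availability state.
    `sample` is None when no data was recorded (machine Down);
    otherwise a (local_load, idle_secs) pair."""
    if sample is None:
        return "Down"
    local_load, idle_secs = sample
    if idle_secs == 0:
        return "User Present"            # short presences filtered upstream
    if local_load > cpu_threshold:       # non-Condor load above 50%
        return "CPU Threshold Exceeded"
    return "Available"                   # may still be running a Condor job
```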
The trace data was processed and analyzed to categorize resources according to the states proposed
in the multi-state model. Again, the goal is to identify trends that enable multi-state availability
prediction and scheduling.

Figure 3.2: Machine State over Time (four panels plotting, over the days of the trace, the number of
machines that are Available, have a User Present, have the CPU Threshold Exceeded, and are Down)
3.2.2 Condor Pool Characteristics
First, the pool of machines is examined as a whole. This enables conclusions about the machines'
aggregate behavior.
Pool State over Time
Figure 3.2 depicts the number of machines in each availability state over time. Gaps in the data
indicate brief intervals between the four months' data. The data shows a diurnal pattern: availability
peaks at night and dips during the day, though rarely below 300 available machines. This indicates
workday usage patterns and a policy of leaving machines turned on overnight, corroborated by the
diurnal User Present pattern. The number of machines that occupy the CPU Threshold Exceeded
state is less reflective of daily patterns. Local load appears to be less predictable than user presence,
and the number of Down machines also does not exhibit obvious patterns.
Figure 3.3 depicts the overall percentage of time spent by the collective pool of resources in
each availability state. The Available state encompasses 62.9% of the total time, followed by Down
(29.7%), User Present (3.6%) and CPU Threshold Exceeded (3.7%).
Figure 3.3: Total Time Spent in Each Availability State (pie charts of the percentage of total time,
and of non-idle time, spent in each state)
Figure 3.4: Number of Transitions versus Hour of the Day (four panels: transitions to Available,
User Present, CPU Threshold Exceeded, and Down)
Daily and Hourly Transitions and Durations
This section examines how often resources transition between states, and how long they reside in
those states. The data is reported as a function of both the day of the week and the hour of the
day. State transitions can affect the execution of an application because they can trigger application
suspension, lead to checkpointing and migration, or even require restart. Figure 3.4 shows the
number of transitions to each state by hour of day.
Again, the numbers of transitions to both the Available and User Present states show clear
diurnal patterns. Users return to their machines most often at around 3pm, and most seldom at 5am.
Machines also frequently become available at midnight and 10 AM. Transitions to the CPU Threshold
Exceeded and Down states also seem to exhibit similar (but slightly less regular) patterns.

Figure 3.5: Number of Transitions versus Day of the Week (four panels: total transitions to Available,
User Present, CPU Threshold Exceeded, and Down, by weekday)
Figure 3.5 reports the daily pattern of machine transitions among states. Transitions to Available,
User Present and CPU Threshold Exceeded fall from highs at the beginning of the week to a low
on the weekend. Down transitions show less regular patterns.
Figure 3.6 investigates state duration and its patterns. It plots the average duration that re-
sources remain in each state, beginning at each hour of the day. Expectedly, the average duration
of availability is at its lowest in the middle of the day, since this is when users are present.
Interestingly, CPU Threshold Exceeded and Down durations both reach their minimum at mid-day,
meaning that although machines are not typically completely free during the day, they also are not
running long CPU-intensive non-Condor jobs.
Down duration is expectedly higher at night, when users abandon their machines until morning.
The duration of the CPU Threshold Exceeded state is at its lowest during the day, most likely due
to short bursts of daytime activity and longer, non-Condor jobs being run overnight.
Finally, Figure 3.7 examines the weekly behavior of the duration that resources reside in each
state. The average length of time spent in the Available state increases throughout the week with
the exception of Thursday, while User Present duration decreases during the week. CPU Threshold
Exceeded duration seems to be less consistent. The Down state duration exhibits fairly consistent
behavior with the exception of Thursday.
Figure 3.6: State Duration versus Hour of the Day (average duration of each state, by the hour at
which the interval begins)
Figure 3.7: State Duration versus Day of the Week (average duration of each state, by weekday)
Figure 3.8: Machine Count versus State Duration (cumulative counts of machines by average
duration in each state)
Individual Machine Characteristics
This section examines the characteristics of individual resources in the pool. Figure 3.8 and Fig-
ure 3.9 plot cumulative machine counts such that an (x, y) point in the graph indicates that y
machines exhibit x or fewer (i) average hours per day in a state (Figure 3.8) or (ii) number of tran-
sitions per day to a state (Figure 3.9). The first aspect explored is how long, on average, each node
remains in each state.
Figure 3.8 shows the distribution according to the average time spent in each of the four states.
For Availability, the data suggests three distinct categories: 144 resources average 9.4 hours or less,
328 average between 9.4 and 195 hours, and the top 31 resources average more than 195 hours.
For User Present, 173 resources average less than 30 minutes of user presence; of those, 14 have
no user activity. The remaining 330 resources have longer (over 30 minutes) average durations of
user presence. For CPU Threshold Exceeded, 49 have no local load occurrences, 40 resources have
short bursts of 60 minutes or less, 353 resources have longer average usage durations (but less than 6
hours), and the remaining 61 resources remain in their high usage state for over 6 hours on average.
Average time in the Down state shows that 288 resources have durations less than 25 hours; 96
never transition to the Down state. The top 38 resources have Down durations exceeding 1,700
hours (mostly offline).
Figure 3.9: Machine Count versus Average Number of Transitions Per Day (cumulative counts of
machines by average daily transitions to each state)
Figure 3.9 examines the average number of resource transitions to each state per resource per
day. 78.7% of resources transition to Available fewer than twice per day. The top 2% transition
over 4.5 times on average. As reported earlier, 14 machines (2.7%) have no user activity. On the
other hand, 399 machines (79.3%) infrequently experience local users becoming present (less than 2
transitions to User Present per day). Users frequently come and go on the remaining 104 resources
(20.6%), with some of these experiencing up to 7 transitions to User Present on average per day.
CPU Threshold Exceeded transitions are bimodal, with 468 machines (93%) having 1 or fewer
transitions to CPU Threshold Exceeded per day (49 of those have none). The remaining machines
average more than 1 transition and peak at 3.7 transitions each day. Finally, Down is relatively
uncommon with 96 resources (19%) never entering the Down state, 341 (67.7%) transitioning to
Down less than once on average, and only the top 5 machines (1%) having two transitions or more
on average.
Machine Classification
This section classifies resources by considering multiple state characteristics simultaneously. This
approach organizes resources based on which types of applications they would be most useful in
executing. In particular, the average availability duration dictates whether an application is likely to
complete; how a machine transitions to unavailable (gracefully or ungracefully) determines whether
the job can checkpoint before migrating.

                              Graceful Transitions: High     Graceful Transitions: Low
Avg. Availability Duration    Ungraceful: High   Low         Ungraceful: High   Low
High                                 0            0                 44           90
Medium High                          0            0                 30          104
Medium Low                          22           18                 47           47
Low                                 98           32                  3            2

Table 3.1: Machine Classification Information
Machines are classified based on average availability duration and the number of graceful and
ungraceful transitions to unavailable per day. Table 3.1 reports the number of machines that fit these
categories. The thresholds for the Availability categories are 65.5 (High), 39.7 (Medium High), and
6.5 (Medium Low) hours. More than 1.6 graceful transitions to unavailable, per day, are considered
to be High, and two or more ungraceful transitions during the four months is considered High.
Roughly 36% of machines have Medium High or High availability durations with both a low
number of graceful and ungraceful transitions, making them highly available and unlikely to expe-
rience any kind of job eviction. In contrast, about 24% of resources exhibit low availability and a
high number of graceful transitions; about 75% of these have high ungraceful transitions. The low
availability group would be most appropriate for shorter running checkpointable applications. The
last significant category (approximately 8.7%) has Medium Low availability and few graceful and
ungraceful transitions. This group lends itself to shorter running applications with or without the
ability to checkpoint.
3.2.3 Classification Implications
This section demonstrates the implications of Table 3.1’s resource classification distribution. To test
how applications would succeed or fail (a failed application is defined as an application that expe-
riences resource unavailability before it completes execution) on resources in the different classes,
a series of 100 jobs is simulated, with job runtimes equally distributed between approximately
5 minutes and 12 hours (on a machine running at 2500 MIPS). The simulation used each
resource’s local load at each point in the trace, along with the MIPS score that Condor calculates.
Each job was begun at a random available point in each resource’s trace and then the execution was
simulated. This was done 1000 times for each job duration on each resource.
Results are included for the resources classified as (i) low availability with high graceful and
ungraceful transitions (18% of the machines), (ii) medium-low availability with low graceful
transitions but high ungraceful transitions (8.7%), (iii) medium-high availability with few graceful
and ungraceful transitions (19.3%), and (iv) high availability, also with few graceful and ungraceful
transitions (16.7%). These classes represent a broad spectrum of the classification scheme and 62.7%
of the total resources. They appear in bold in Table 3.1, and are labeled (i) LA-HG-HUG, (ii)
MLA-LG-HUG, (iii) MHA-LG-LUG, and (iv) HA-LG-LUG in Figure 3.10.

Figure 3.10: Failure Rate vs. Job Duration (left: graceful failure rate; right: ungraceful failure rate;
both versus job duration in instructions, for the four resource classes)
Figure 3.10 shows that graceful failure rates climb more rapidly for less available resources. The
low availability resources also have a much higher ungraceful failure rate versus the high and medium
high availability resources, which have a very low ungraceful failure rate.
These results are important because of the diversity of Grid applications, as described in Sec-
tion 3.1.3. Checkpointable applications can tolerate graceful evictions by checkpointing on demand
and resuming elsewhere, without duplicating work. However, these same applications are more sen-
sitive to ungraceful evictions; the amount of lost work due to an ungraceful transition to the Down
state depends on the checkpoint period. On the other hand, applications that do not checkpoint are
equally vulnerable to graceful or ungraceful transitions; both require a restart. Therefore, resources
in different classes should host different types of applications.
Intuitively, to make the best matches under high demand for resources, a scheduler should match
the job duration with the expected availability duration of the machine; machines that are typically
available for longer durations should be reserved, in general, for longer running jobs. And non-
checkpointable jobs should run on resources that are not likely to transition to Down ungracefully.
Finally, machines with high rates of ungraceful transition to Down should perhaps be reserved for
highly replicable jobs.
30
                              Random Scheduler   Classifier Scheduler   Improvement
Completed Jobs                      1883.2              1936.0              2.27%
Gracefully Evicted Jobs             1244.0               492.0             60.45%
Ungracefully Evicted Jobs             28.6                14.3             50.0%
Avg. Execution Time (secs.)        71209.0             48656.0             31.6%

Table 3.2: Machine Classification Scheduling Results
A second simulation demonstrates the importance of the classes. 1000 jobs were created of ran-
dom length (equally distributed between 5 minutes and 12 hours), half of which were checkpointable
with a period of six hours. Two schedulers are tested: a scheduler that maps jobs to resources
randomly, and one that prioritizes classes according to Table 3.1. The simulator iterates through
the trace, injecting an identical random job set and scheduling it with each scheduler. Job execution
considers each resource’s dynamic local load and MIPS score. When a resource that is executing a
job becomes unavailable in the trace, the job is rescheduled. If the job is checkpointable and the
transition is graceful, the job takes a fresh checkpoint and migrates. An ungraceful transition to
unavailability requires restarting from the last successful periodic checkpoint. Non-checkpointable
jobs must always restart from their beginning.
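The rescheduling rules above can be condensed into a toy simulation loop. This is an illustrative sketch, not the dissertation's simulator; it ignores load and MIPS scaling and encodes unavailability as a list of (run_secs, graceful) events:

```python
def simulate(job_len, checkpointable, ckpt_period, events):
    """Wall-clock seconds to finish a job of job_len seconds of work.
    Each event means: the job runs run_secs, then the host becomes
    unavailable (gracefully or not); the last event is assumed long
    enough for the job to finish."""
    done, wall = 0.0, 0.0
    for run_secs, graceful in events:
        if done + run_secs >= job_len:
            return wall + (job_len - done)       # finishes this interval
        wall += run_secs
        if not checkpointable:
            done = 0.0                           # full restart
        elif graceful:
            done += run_secs                     # fresh checkpoint, migrate
        else:
            # ungraceful: resume from the last periodic checkpoint
            done = ((done + run_secs) // ckpt_period) * ckpt_period
    raise ValueError("trace ended before job completed")
```

A 10-second job interrupted gracefully after 6 seconds finishes in 10 seconds of wall time if checkpointable, but 16 if it must restart from scratch.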
The classifier used months one and two to classify the machines; schedulers ran in months three
and four. Each subsequent month was simulated three times and the average values were taken
for that month. Table 3.2 reports the sum of those averages across both months. Jobs experience
more evictions, and therefore take an average of 31.6% longer, when the scheduler ignores the
resources' unavailability characteristics.
Chapter 4
Multi-State Availability Prediction
Predicting the future availability of a resource can be extremely useful for many purposes. First,
resource volatility can have a negative impact on applications executing on those resources. If the
resource on which an application is executing becomes unavailable, the application will have to
restart from the beginning of its execution on another resource. This wastes valuable cycles and
increases application makespan. Prediction can allow the scheduler to choose resources that are
least likely to become unavailable and avoid these application restarts. Second, replication can
also be used to mitigate task failure by creating a copy of an application. Since the system can
support only a finite number of replicas, prediction can identify which applications are most likely
to experience resource unavailability, so that only those jobs are replicated. This can increase the
efficiency of the system. Third, when storing
data on distributed systems, resource unavailability can be predicted to ensure that data remains
accessible even in the presence of resource unavailability. This work studies using predictions for
both scheduling and replication in later chapters. This chapter first proposes prediction strategies
in Section 4.1, then analyzes their accuracy in various configurations in Section 4.2.1, and finally
compares these predictors' accuracy with that of predictors from related work in Section 4.2.2.
4.1 Prediction Methodology
A multi-state prediction algorithm takes as input a length of time (e.g. estimated application execu-
tion time), and uses a resource's availability history to predict the probabilities of that resource
next exiting the Available state into each of the non-available states, or remaining available through-
out the interval; these probabilities sum to 100%. The proposed availability predictor outputs four
probabilities, one each for entering Figure 3.1’s User Present, CPU Threshold Exceeded and Down
states next, and one for the probability of completing the interval without leaving the Available
state (Completion). It is often useful to think of the highest computed probability for a given
prediction as its surety, because the larger that probability, the more sure the predictor is of its
forecast.
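A predictor's output can thus be represented as four probabilities plus a derived surety. A minimal sketch, with assumed names:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """Output of a multi-state predictor for one resource and one
    requested interval. The four probabilities sum to 1."""
    user_present: float
    cpu_exceeded: float
    down: float
    completion: float

    def surety(self):
        """The largest probability: how sure the predictor is of its forecast."""
        return max(self.user_present, self.cpu_exceeded,
                   self.down, self.completion)
```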
Since a prediction algorithm examines a resource’s historical availability data, the two important
aspects that differentiate predictors are (i) which portion of the resource's history the predictor
analyzes and (ii) how the predictor analyzes that section of availability history. The approaches presented here
are classified along these two dimensions; by combining when and how historical data is analyzed, a
range of availability prediction strategies is created.
In determining when the predictor should analyze, this study takes two different approaches.
Previous work [54] advocates examining a resource’s behavior on previous days during the same
interval. Therefore, the first approach investigated examines a resource’s past N days of availability
behavior during the interval being predicted. This approach is called N-Day or Day. Another
approach examines a resource’s most recent N hours of activity immediately preceding the prediction
(N-Recent).
The proposed predictors analyze the segments of resource availability history in two ways. The
first makes predictions based on the resource’s state durational behavior by calculating the percentage
of time spent in each state during the analysis period. The rationale is that the state that is occupied
the majority of the time could be the most likely next state. The second and more successful approach
considers a resource’s transitional behavior by counting the number of transitions from Available
to each of the other states. Also, for every interval of time a resource was available, the predictor
counts how many times the resource could have completed the requested job duration (e.g. a ten
hour availability interval and a two hour job means five completions). The probabilities for each
exit state (as well as the completion state) are calculated by summing each state’s transition count
and dividing the total number of transitions for each state by the total number of transitions. For
both the durational and transitional approaches, when combining different sections of availability
(e.g. when combining the N days’ probabilities together), the probabilities for each state are then
averaged.
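A sketch of the transitional approach over a single slice of history follows; the state names match Figure 3.1, while the interval encoding and function name are assumptions:

```python
def transitional_predict(history, job_secs):
    """Transitional prediction over one slice of history. `history` is an
    ordered list of (state, duration_secs) intervals. Counts exits from
    Available into each other state, plus how many times each available
    interval could have held a job of job_secs (e.g. a ten-hour interval
    and a two-hour job count as five completions). Returns probabilities
    that sum to 1."""
    counts = {"User Present": 0, "CPU Threshold Exceeded": 0,
              "Down": 0, "Completion": 0}
    for i, (state, dur) in enumerate(history):
        if state != "Available":
            continue
        counts["Completion"] += int(dur // job_secs)
        if i + 1 < len(history):
            counts[history[i + 1][0]] += 1   # exit state of this interval
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()} if total else counts
```

When several slices are combined (e.g. the same interval on each of N days), the per-slice probabilities are averaged, as described above.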
The Transitional schemes weight all transitions equally (the Equal weighting scheme); weighting
transitions according to when they occurred is also investigated. The predictors can weight transitions
according to the time of day (Time), in response to previous observations that time of day correlates with
future behavior (Chapter 3) [59]. This means that for the Time weighting scheme, the closer a
transition time is to the time of day at which the prediction occurs, the higher the weight of that
transition. In other words, if a prediction is made at 3pm, then transitions that occur closest to
3pm today and on other days have the largest weights. Lastly, Freshness weighting gives a higher
weight to transitions that occur closer to the time of the prediction and likewise, a smaller weight
to events that occurred further in the past.
Section 2.2 evaluates related work predictors for comparison. Since linear regression [18] is such a
prevalent method in many disciplines of prediction, this study includes both Sliding Window Single
and Sliding Window Multi. Because states are categorical, they must be converted to numerical
values. The following conventions are used. In the single state version of sliding window (SW
Single), the Available state is assigned to 1 and any other state as 0. In the multi-state version (SW
Multi), the Available state is assigned to 1, User Present as 2, CPU Threshold Exceeded as 3 and
Down as 5. Counter based predictors such as those used for resource availability prediction [45] are
evaluated, including Saturating Counter and History Counter. Just as in [45], a 2-bit saturating
counter is updated every hour. The Completion predictor, which always predicts the machine will
complete the interval in the Available state is used as a baseline comparison. Finally, Ren’s N-Day
multi-state availability predictor (called simply Ren) is implemented for comparison (see Section 2.2
for a full description of the Ren predictor) [55].
4.2 Predictor Accuracy Comparison
This section evaluates the predictors for accuracy, focusing on those that analyze a resource's
history with the transitional method; both the N-Day and the N-Recent hours approaches for
selecting which portion of a resource's history to examine are considered. Results
for the durational method are excluded due to its inferior performance. This study uses six months
of availability data from the 603 node Condor pool at the University of Notre Dame as its data set.
This includes two additional months of data not reported in Chapter 3. This trace was taken during
the months of January through June of 2007 and is analyzed in Section 3.2. Again, measurements
for CPU load, resource availability and user presence were taken every 15 minutes for each resource,
and the states of availability are inferred according to the proposed multi-state model.
Each predictor performed an identical set of 500,000 predictions, each on a random resource at a
random time during the six month trace. Predictor accuracy is defined as the ratio of correct
predictions to the total number of predictions. A correct prediction is one for which the machine is
predicted to exit on a certain non-available state and it does, or for which the machine is predicted
to remain available throughout the interval, and it does.
[Plot: prediction accuracy (%) vs. number of days analyzed; series: TDE and Ren at prediction lengths of 12, 25, 60, and 120 hours]
Figure 4.1: Transitional Predictor Accuracy
4.2.1 Prediction Method Analysis
Figure 4.1 depicts the accuracy of the Transitional N-Day with Equal transition weights (TDE)
predictor and the Ren N-Day predictor as a function of the number of days each analyzes, for predictions
with lengths uniformly distributed between five minutes and the number of hours indicated in the
legend. Recall that TDE examines the resource’s state transition behavior (Transitional) during
the requested prediction interval starting at the same time during each of the previous N days
(N-Day). It then calculates the state exit probabilities based on the number of transitions from
Available to each of the states separately for each day, combining the resulting probabilities for
each day. For predictions of lengths between five minutes and 12, 25, 60 and 120 hours, the graph
shows that TDE exhibits an increase in accuracy when analyzing any number of days in comparison
with Ren. For example, for predictions between 5 minutes and 25 hours, TDE’s accuracy increases
through 16 days, at which point it peaks at 78.3%. In this case, Ren peaks at 73.7% and then falls
abruptly when considering more than one day. Notice that Ren’s predictor decreases in accuracy
when analyzing more days whereas TDE can better incorporate new information into its prediction.
For all prediction lengths, TDE initially increases in accuracy, then levels off when acquiring more
[Plot: prediction accuracy (%) vs. number of hours analyzed; series: TRE, TRF, TRT]
Figure 4.2: Weighting Technique Comparison
information (considering more days) whereas Ren becomes less accurate the more days it considers.
Overall, TDE is 4.6% more accurate than the Ren predictor.
Figure 4.2 examines the Transitional Recent hours (TR) predictor configured to analyze the
most recent N hours of availability with various transition weighting (Equal, Freshness and Time
of day) schemes for predictions between five minutes and 25 hours (other lengths are not examined
due to space constraints). Again, this predictor examines the resource’s state transition behavior
(Transitional) but in contrast to TDE, does so for the last N-hours before the time that the prediction
is made (Recent). For all weighting schemes, as the predictor considers more hours, accuracy
increases dramatically at first; the gains then taper off, accuracy reaches a maximum, and finally it slowly declines.
The Freshness (TRF) weighting scheme provides the best performance, reaching the highest accuracy
of 77.3% when examining the past 48 hours of behavior. TRF weights each transition t (W (t))
according to the following formula:
W (t) = M · (Tt/L)
M is the “recentness weight” set by the user (the default value is five), Tt is the length of time
[Plot: prediction accuracy (%) vs. prediction duration (hours); series: TRF, TDE, Ren 1-day, Completion, Saturating Counter, History Counter, Sliding Window Single, Sliding Window Multi]
Figure 4.3: Prediction Accuracy by Duration
that elapsed from the beginning of the analysis interval to the transition t, and L is the total analysis
interval length. In the next section, analysis focuses on the TDE and TRF predictors due to their
high accuracy (78.3% and 77.3% respectively).
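The Freshness weighting can be sketched directly from the formula above; the representation of transitions as (elapsed-time, exit-state) pairs is an illustrative assumption:

```python
def freshness_weight(t_elapsed, analysis_length, m=5.0):
    """W(t) = M * (T_t / L): transitions later in the analysis window
    (closer to the prediction time) receive proportionally more weight.
    M defaults to 5, the default 'recentness weight' from the text."""
    return m * (t_elapsed / analysis_length)

def weighted_exit_probs(transitions, analysis_length, m=5.0):
    """transitions: list of (t_elapsed_hours, exit_state) pairs.
    Returns freshness-weighted exit-state probabilities."""
    if not transitions:
        return {}
    weights = {}
    for t, state in transitions:
        w = freshness_weight(t, analysis_length, m)
        weights[state] = weights.get(state, 0.0) + w
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}
```

Because the weights are normalized, the constant M cancels in the probabilities themselves; it matters only when freshness-weighted counts are combined with unweighted ones.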
4.2.2 Analysis of Related Work
This section compares the accuracy of the best performing predictors, TDE and TRF, to sev-
eral existing approaches from the literature, namely the Saturating and History Counter predic-
tors [45] [44] [46], the Multi-State and Single State Sliding Window predictors [18], the Ren predic-
tor [55] [54] [56], and the Completion predictor.
Figure 4.3 depicts predictor accuracy versus prediction length, for predictions up to 120 hours.
The Counter-based predictors, Completion predictor, and Sliding Window predictors perform sim-
ilarly to one another, compete well with TRF, TDE and Ren for predictions up to 20 hours, and
then sharply decline, never leveling off. In contrast, the TRF, TDE, and Ren predictors decrease
initially and then level off; the rate of decrease for Ren is somewhat larger as the prediction length
increases past 60 hours. The Ren predictor initially has lower accuracy, with accuracy decreasing
even faster than the Completion predictor for prediction durations shorter than 19 hours. During
these short intervals, the Ren predictor is up to 5.6% less accurate than the TDE predictor, and
5.2% less accurate than TRF.
Figure 4.3 demonstrates that for predictions shorter than 19 hours, TDE is the most accurate,
accounting for its 1% increase in accuracy over TRF in Section 4.2 for predictions between five
minutes and 25 hours. However, TRF becomes the most accurate predictor for predictions longer
than 42 hours, reaching an accuracy increase of 9.8% over Ren and 3.1% over TDE for predictions of
120 hours. Chapter 5’s scheduling results demonstrate that this large difference in accuracy for longer
predictions is critical. Chapter 5 focuses its prediction-based scheduling analysis on TRF because of
the improved schedules it produces. Predictors that perform better for long term predictions lead
to the best scheduling results when used with the schedulers proposed in Chapter 5, even when
scheduling shorter jobs. As the results demonstrate in Section 5.4.2, this explains why using the
TRF predictor decreases makespan and the number of job evictions when compared to using the
Ren predictor, even when using an identical prediction-based scheduler.
Chapter 5
Prediction-Based Scheduling
Resource selection — choosing the resource on which to execute a given task — is a complex
scheduling problem. Resource selection can depend on the length of the job, its ability to take a
checkpoint, the load on the system, the availability of resources, the resource's reliability, and possible
time constraints on application completion [61][58]. Additionally, the performance metric can vary
from reduced job makespan or increased job reliability, to an increased overall system efficiency
and throughput. These factors lead to a complex scheduling environment involving many different
tradeoffs. The resource selection problem is further complicated by resource volatility. Unexpected
resource unavailability can have a large impact on scheduling performance. Schedulers that are
aware of this resource volatility can make more informed decisions on job placement, avoiding less
reliable resources.
This chapter focuses on dealing with resource volatility through informed job placement, to reduce
job evictions by avoiding resource unavailability, which in turn can decrease average job makespan.
To that end, this chapter investigates scheduling jobs with the aid of resource availability predictions
such as those introduced in Chapter 4. This chapter is organized as follows. First, Section 5.1
explains the simulator setup and scheduling methodology. Section 5.2 demonstrates the inherent
tradeoff between scheduling for reliability and scheduling for performance, Section 5.3 investigates
various prediction-based scheduling heuristics and their performance across various job lengths and
checkpointabilities. Lastly, Section 5.4 compares the prediction based scheduler’s results to various
prediction-based and non-prediction-based scheduling approaches across job lengths.
5.1 Scheduling Methodology
This work investigates various scheduling techniques through simulation-based experiments that vary
the number of jobs, their characteristics, and the schedulers’ scoring techniques. The simulations in
this chapter are driven by the Notre Dame Condor pool (approximately 600 nodes) six month trace.
A certain number of jobs are simulated executing on these machines, each utilizing its own recorded
availability and load measurements. The simulation based experiments presented here create and
insert each application at a random time during the six month simulation, such that application
injection is uniformly distributed across the six months. The simulation assigns each application
a duration (i.e. a number of operations needed for completion); application duration is uniformly
distributed between two durations such as between five minutes and 25 hours (the duration is an
estimate based on the execution of the application on a resource with an average MIPS speed with
no load). This uniform distribution of job insertion times and durations allows the experiments
to test the quality of the predictor and scheduler combination with equal weight for all prediction
durations and prediction start times, without emphasizing a particular duration or start time. In
all simulations in this chapter, 25% of the applications are checkpointable; that is, they can take an
on-demand checkpoint in five minutes if the user returns or if the local CPU load rises above 50%,
leading to eviction and rescheduling. The remaining 75% non-checkpointable jobs need to start over
upon eviction.
Each machine’s MIPS score (as defined by Condor’s benchmarking tool), and the load and avail-
ability state information contained in the trace, influence the simulated running of the application.
During each simulator tick, the simulator calculates the number of operations completed by each
working resource, and updates the records for each executing application. If a resource executing
an application leaves the Available state (as per the trace), effectively evicting the application, the
executing application joins the back of the application queue for rescheduling. The number of op-
erations that remain to be completed for this evicted application depends on the checkpointability
of the application and the type of eviction (Section 3.1.3). All waiting applications (scheduled or
rescheduled) remain queued until they reach the head of the queue, at which point they are resched-
uled. During each simulator tick, the scheduler places applications off the head of the queue until
the next application cannot be scheduled (e.g. because no resources are available).
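The per-tick progress and eviction accounting can be sketched as follows; the reading of MIPS as millions of operations per second, the load-scaled effective speed, and the eviction semantics (graceful exits allow a checkpoint, ungraceful exits do not) are assumptions consistent with the text, and the five-minute checkpoint cost is omitted for brevity:

```python
def tick_progress(mips, load, tick_secs):
    """Operations completed in one tick by a working resource,
    assuming MIPS = millions of operations per second scaled by
    the idle fraction of the CPU."""
    return mips * (1.0 - load) * tick_secs * 1e6

def on_eviction(remaining_ops, total_ops, checkpointable, graceful):
    """Operations left after an eviction: a checkpointable job that
    receives a graceful eviction (user return or CPU threshold)
    resumes where it left off; otherwise the job starts over."""
    if checkpointable and graceful:
        return remaining_ops
    return total_ops
```

An evicted job's remaining-operation count would be updated with `on_eviction` before it rejoins the back of the queue.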
To facilitate resource-score-based scheduling (e.g. prediction-based scheduling as well as other
scheduling heuristics), this work defines a scheduling algorithm called the Prediction Product Score (PPS) Sched-
uler. The PPS Scheduler scores each available resource and maps the job at the head of the queue
onto the resource with the highest score (Ties are broken arbitrarily.) The resource scoring policy
(how each resource is scored) defines that scheduler’s placement policy. Various resource scoring
approaches are examined throughout the experiments presented here.
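The PPS selection step itself is simple; as a sketch (the dictionary layout of a resource record is an illustrative assumption), it amounts to a maximum over the scored available resources:

```python
def pps_select(resources, score):
    """PPS: score each available resource with the supplied scoring
    policy and return the highest scorer. Ties are broken arbitrarily
    (here, by max()'s first-wins behavior)."""
    available = [r for r in resources if r["state"] == "avail"]
    if not available:
        return None  # job stays at the head of the queue
    return max(available, key=score)
```

Each resource scoring approach examined below is just a different `score` function passed to this selector.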
Scheduling quality is analyzed according to average job makespan and evictions. Makespan
is the time from submission to completion; average makespan is calculated across all jobs in the
simulation. For evictions, the simulator counts the total number of times that jobs need to be
rescheduled because a resource running a job transitions from available to one of the unavailable
states.
5.2 Reliability Performance Relationship
This section establishes the inherent tradeoff between scheduling for reliability and for performance
for a real world environment. In the context of this work, resource reliability refers to the ability
of a resource to consistently complete applications without experiencing unavailability and is used
informally. It is not used formally as in other publications [51]. Schedulers that favor performance
consider the static capability of target machines, along with current and predicted load conditions,
to decrease makespan. Other schedulers may instead consider the past reliability and behavior of
machines to predict future availability and to increase the number of jobs that complete successfully
without interruption due to machines going down or owners reclaiming resources. Schedulers may
consider both factors, but cannot in general optimize simultaneously for both performance-based
metrics like makespan, and reliability-based metrics like the number of evictions or the number
of operations that must be re-executed due to eviction (i.e. “operations lost”) [19]. In fact, the
interplay between these two metrics is often quite complex. Choosing a reliable resource can decrease
job evictions and by virtue of that, decrease job makespan (increase performance). Along the same
lines, choosing a fast resource can lead to a shorter job execution time, reducing the likelihood of
experiencing a job eviction (increase reliability). In the real world, there is a large variety in the
reliability and performance of resources and often a scheduler cannot find both a high performance
and high reliability resource. In these cases, a scheduler may have to choose between a slow reliable
resource or a fast unreliable resource or anywhere in between. Section 5.3 explores heuristics for
dealing with this scheduling problem. This section demonstrates the tradeoff itself.
To investigate the reliability-performance tradeoff, 6000 simulated applications are executed on
these machines according to the methodology outlined in Section 5.1. Application durations are
uniformly distributed between five minutes and 25 hours. The PPS chooses from among resources
for application execution (Section 5.1) and scores each available resource according to the following
expression (the resource with the highest score is chosen for application execution):
RSi = (1 − W ) · Pi[COMPLETE] + W · (MIPSi/MIPSmax) · (1 − Li)
[Plots: average makespan (hours) vs. tradeoff weight; total job evictions vs. tradeoff weight]
Figure 5.1: As schedulers consider performance more than reliability (by increasing Tradeoff Weight W), makespan decreases, but the number of evictions increases.
• Pi[COMPLETE] is resource i’s predicted probability of completing the job interval without
failure, according to the TRF predictor (Section 4.2),
• MIPSi is the resource’s processor speed,
• MIPSmax is the highest processor speed of all resources (for normalization), and
• Li is the resource’s current processor load.
Figure 5.1 depicts the effect of varying the Tradeoff Weight and hence the relative influence of
reliability or performance in scheduling. As performance is considered more prominently, makespan
decreases but the number of evictions increases. In the middle of the plot, a tradeoff weight of 0.5 does
achieve makespan within 6.7% of the lowest makespan on the curve, while simultaneously coming
within 18.1% of the fewest number of evictions. Nevertheless, the makespan slope is uniformly
negative, and the evictions slope is uniformly positive.
5.3 Prediction Based Scheduling Techniques
This section explores in more detail a wider range of resource ranking strategies that consider resource
performance and reliability. Section 5.3.1 studies the effect of considering various combinations of
CPU speed, load, and reliability; the section also introduces the idea of scheduling checkpointable
jobs differently from non-checkpointable jobs. The experiments then describe how the best per-
forming scheduler from Section 5.3.1 behaves as the length of the interval for which the predictor
forecasts is varied (Section 5.3.2).
5.3.1 Scoring Technique Performance Comparison
In this Section, the PPS scheduler is configured with a range of resource scoring approaches, which
utilize some or all of the following factors.
• CPUSpeed(MIPS) : the resource’s Condor MIPS rating
• CurrentLoad(L) : the machine utilization information at scheduling time
• CompletionProbability(P [COMPLETE]) : The predicted probability of completing the pro-
jected job execution time without becoming unavailable. This is one minus the sum of the
probabilities of the three unavailability states
• UngracefulEvictionProbability(Pi[UNGRACE]) : The predicted probability of exiting job
execution directly to Down (with no chance for a checkpoint).
• Checkpointability : Whether the job could take an on-demand checkpoint before being evicted
from a machine (CKPT) or not (NON CKPT)
Considering whether or not the scoring system incorporates each of the first three criteria defines
eight different kinds of scoring, ranging from considering none of the criteria (Random scheduling),
to including them all. Then, if checkpointable jobs are scheduled differently from non checkpointable
jobs, many more possible approaches emerge. There is also the matter of how to incorporate each
factor into the score. Based on intuition (some combinations make more sense than others) and
background experimental results, the PPS scheduler (Section 5.1) is configured with the following
resource scoring approaches. The expression provides the formula for computing each resource’s
score, RSi
• S0 : MIPSi · (1 − Li)
• S1 : Pi[COMPLETE]
• S2 : MIPSi · Pi[COMPLETE]
• S3 : MIPSi · (1 − Li) · Pi[COMPLETE]
• S7 : CKPT: MIPSi · (1 − Li) · (1 − Pi[UNGRACE]); NON-CKPT: MIPSi · (1 − Li) · Pi[COMPLETE]
• S9 : CKPT: MIPSi · (1 − Li); NON-CKPT: MIPSi · (1 − Li) · Pi[COMPLETE]
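One reading of these scoring variants can be sketched as a single dispatch function; the resource-record layout is an illustrative assumption:

```python
def score(policy, r, job_ckpt):
    """Resource score RS_i under each scoring variant. r bundles the
    resource's MIPS, current load, and the two predicted probabilities;
    job_ckpt says whether the job being placed is checkpointable."""
    mips, load = r["mips"], r["load"]
    perf = mips * (1.0 - load)  # load-discounted speed
    if policy == "S0":
        return perf
    if policy == "S1":
        return r["p_complete"]
    if policy == "S2":
        return mips * r["p_complete"]
    if policy == "S3":
        return perf * r["p_complete"]
    if policy == "S7":
        # checkpointable jobs only fear ungraceful (direct-to-Down) exits
        rel = (1.0 - r["p_ungrace"]) if job_ckpt else r["p_complete"]
        return perf * rel
    if policy == "S9":
        # checkpointable jobs ignore reliability entirely
        rel = 1.0 if job_ckpt else r["p_complete"]
        return perf * rel
    raise ValueError(policy)
```

Passing this (with a fixed policy and job) as the scoring function to the PPS selector reproduces each scheduler studied below.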
[Plots: makespan vs. job length (hours); evictions vs. job length (hours); series: S0, S1, S3, S7, S9]
Figure 5.2: The resource scoring decision's effectiveness depends on the length of jobs. For longer jobs, schedulers should emphasize speed over predicted reliability.
S0, S1, S2, and S3 schedule checkpointable jobs the same way they schedule non-checkpointable
jobs with S0 considering speed, S1 considering reliability, and S2 and S3 considering both. However,
the fact that checkpointable jobs react differently to various types of resource unavailability suggests
that scheduling them differently could improve overall Grid performance. S7 and S9 are considered
for this reason.
This section executes simulations for the same environment and setup as in Section 5.1, but
varies the maximum job length from 12 to 120 hours. Pi[COMPLETE] (an availability prediction)
is made by TRF (Section 4.2) and used as input to the resource scoring decisions that utilize it. The
results obtained and presented in the remainder of the paper are based on simulating each scenario
once. However, the consistency and repeatability of the results has been verified through repetition.
Figure 5.2 plots average makespan vs. maximum job length, and evictions vs. maximum job
length for all of the proposed resource scoring decisions as percentage difference versus S0. Percent-
ages are used for comparison throughout this chapter. The percentage difference of one scheduler
versus a chosen baseline scheduler is plotted for each particular metric (e.g. operations lost or evic-
tions). Percentages are used to emphasize the differences in scheduling quality and to facilitate visual
comparison of performance, given the large range of y values. The text includes the magnitude of
the results (e.g. the actual number of evictions) in several places.
S1, which considers only reliability, performs significantly worse than all the others in terms
of makespan for all maximum job lengths, but also for number of evictions for longer jobs. This
demonstrates that schedulers must consider resource performance when scheduling. The figure also
shows that makespans are relatively similar for the other scheduling approaches, but that S3 does
the best job of avoiding evictions. For example, for jobs up to 84 hours, S3 obtains 5417 evictions
[Plots: makespan, evictions, operations lost, and percent of total operations lost vs. checkpointability (%); series: S0, S1, S3, S7, S9]
Figure 5.3: Scoring decision effectiveness depends on job checkpointability. Schedulers that treat checkpointable jobs differently than non-checkpointable jobs (S7 and S9) suffer more evictions, but mostly for checkpointable jobs; therefore, makespan remains low.
whereas S1, S7 and S9 obtain approximately 6450 evictions (19% more). For a shorter job length of
6 hours, S3 causes 860 evictions and S7 and S9 cause about 1070 evictions increasing the disparity
to 24%. These results demonstrate that schedulers must consider both reliability and performance
while scheduling, to simultaneously produce fewer evictions and shorter job makespan.
Figure 5.2's results are for 25% checkpointable jobs. When that percentage varies, the value of
considering reliability in the scoring function changes, as does the value of scoring checkpointable
and non-checkpointable jobs differently.
Figure 5.3 plots makespan, evictions, operations lost, and percentage of total operations lost,
as the percentage difference compared with S0. As the percentage of checkpointable jobs increases,
makespan difference increases more quickly for S1 and S3, compared to schedulers S7 and S9, which
schedule non-checkpointable jobs on more reliable resources. Figure 5.3 shows that the difference in
the number of evictions does increase more for S7 and S9; these evictions, however, are increasingly
for checkpointable jobs, so the effect on makespan is small. When a small percentage of jobs are
checkpointable, schedulers do well to consider reliability (like S3). The hybrid approaches achieve
balance between reliability and performance, at each extreme.
The remainder of this chapter considers only the case where 25% jobs are checkpointable, and
therefore primarily uses S3, which performs similarly to S7 and S9 for this checkpointability mix.
[Plots: makespan vs. job length (hours); evictions vs. job length (hours); series: interval multipliers 0.25, 0.5, 1, 2, 3, and Comp]
Figure 5.4: Schedulers can improve results by considering predictions for intervals longer than the actual job lengths. This is especially true for job lengths between approximately 48 and 72 hours.
5.3.2 Prediction Duration
In tests described so far, the scheduler asks the predictor about behavior over a prediction interval
that is intended to match application runtime. A prediction for the next N hours may not necessarily
reflect the best information for scheduling an N hour job. This section investigates the effect of
scheduling an N hour job using predictions made for the next (M x N) hours, where M is the
“interval multiplier.” M is set at 0.25, 0.5, 1, 2, 3, with 0.25 plotted as a baseline, along with
two “hybrid” multipliers. The “Comparison” (Comp or TRF-Comp) scheduler uses M=3 for jobs
less than 28 hours, and M=0.25 for longer jobs. The “Performance” (Perf or TRF-Perf) scheduler
ignores reliability on jobs less than 28 hours (instead selecting the fastest, least loaded resources);
for longer jobs, it uses M=0.25.
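One reading of the multiplier rules described above, as a sketch (the 28-hour threshold and M values come from the text; the function shape is an assumption):

```python
def prediction_interval(job_hours, scheduler="comp"):
    """Length of the interval the predictor is asked about when
    scheduling a job of job_hours. Comp uses M=3 for jobs under 28
    hours and M=0.25 for longer ones. Perf ignores reliability for
    short jobs (returning None: pick the fastest, least loaded
    resource instead); for longer jobs it uses M=0.25."""
    if scheduler == "comp":
        m = 3.0 if job_hours < 28 else 0.25
        return m * job_hours
    if scheduler == "perf":
        return None if job_hours < 28 else 0.25 * job_hours
    raise ValueError(scheduler)
```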
Figure 5.4 shows that larger multipliers perform better for jobs up to 28 hours in duration. How-
ever for longer jobs, making predictions for intervals longer than the application runtime increases
the number of evictions and the average job makespan. In fact, the longer the job, the smaller
the optimal multiplier. This observation motivated the design of the hybrid scheduler, which does
well in keeping evictions low, through 80-hour jobs, and makespan low, especially for longer jobs.
For example, for short jobs of length 6 hours, M=0.25 produces 1173 evictions, M=1 produces 878
evictions and M=3 produces 754 evictions. For 84 hour jobs, M=0.25 has 80,118 evictions, M=1
has 84,662 and M=3 has 87,433. These figures support the conclusion that the longer the job, the
shorter the optimal prediction length multiplier should be.
[Plots: average job makespan (hours) and evictions vs. number of days analyzed (Ren MTTF); average job makespan (hours) and evictions vs. prediction length weight (TRF)]
Figure 5.5: TRF and Ren MTTF show similar trends for both makespan and evictions. For any particular makespan, TRF causes fewer evictions, and for any number of evictions, TRF achieves improved makespan.
5.4 Multi-State Prediction Based Scheduling
This section compares the proposed predictor and scheduler combination (the PPS scheduler utiliz-
ing the TRF predictor, PPS-TRF) with Ren’s multi-state prediction based scheduler (described in
Section 2.3) [56], in terms of the ability to tradeoff reliability and performance. To isolate the effect
of the predictor from that of the scheduler, PPS-TRF is compared with Ren’s predictor using the
PPS scheduler.
5.4.1 Multi-State Prediction Based Scheduling
This section explores the Transitional Recent-hours Freshness-weighted (TRF) predictor using the
S3 prediction-based scheduler (Section 5.3.1) and Ren’s MTTF scheduler (see Section 2.3 for a
full description of Ren MTTF). The number of days that Ren’s predictor uses is varied, while
simultaneously varying TRF’s interval multiplier (Section 5.3.2) for comparison. Both of these
parameters allow each scheduler to trade off reliability and performance.
Figure 5.5 shows the average makespan and number of evictions obtained by TRF and Ren
MTTF as the interval multiplier varies for TRF and the number of days analyzed varies for Ren
[Plots: average job makespan and evictions vs. prediction length weight; series: Ren 1, 2, 4, 8, 12, 16, and TRF]
Figure 5.6: TRF causes fewer evictions than Ren (coupled with the PPS scheduler) for all Prediction Length Weights. TRF also produces a shorter average job makespan than all Ren predictors, except the one day predictor (which produces 45% more evictions).
MTTF. Varying each parameter allows that scheduler to trade off performance (lower makespan)
for reliability (fewer evictions). In selecting a point for Ren MTTF on one curve, however, TRF
does better in terms of the other metric. For example, for approximately 1500 evictions, TRF has
average makespan that is 27% lower. And for makespan of 13 hours, TRF has 52% fewer evictions.
5.4.2 Predictor Quality vs. Scheduling Result
To better understand how predictors affect scheduling, PPS is tested with both the TRF predictor
and with Ren N-Day [55][54] for a variety of interval multipliers (Prediction Length Weights). 6000
jobs are simulated ranging from 5 minutes to 25 hours in duration, 25% of which are checkpointable.
The Ren predictor’s N value is varied, as is the prediction length weights (i.e. multipliers, as outlined
in Section 5.3.2). The weights are varied between 1 and 7, and follow the simulation setup explained
in Section 5.1. The PPS scheduler with the S3 resource scoring approach is used for resource selection
(Section 5.3.1).
Figure 5.6 illustrates the percentage difference in makespan and evictions relative to TRF. For all multiplication
factors, TRF produces the fewest job evictions. Both predictors produce the fewest evictions with
the 7x multiplier; Ren 2-Day produces 1,340 evictions and TRF produces 1,103 (21% fewer). For the
1x multiplier, Ren 12-Day (1,534 evictions) comes closest to TRF’s 1,420 evictions (within 8%). For
makespan, Ren 1-Day beats TRF by at most 18% lower makespan. However, Ren 1-Day sacrifices
reliability and produces about 45% more evictions. Thus, TRF reduces the number of evicted jobs
and can obtain the highest job reliability. Ren 1-Day coupled with the PPS scheduler can obtain a
[Plots: makespan and evictions vs. job length (hours); top panels compare TRF-Comp, TRF-Perf, Ren MTTF-1/4/8/16, History Counter, and Sliding Window; bottom panels compare TRF-Comp, TRF-Perf, Random, CPU-Speed, Pseudo-optimal, and S0]
Figure 5.7: TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for jobs up to 100 hours long.
shorter makespan than TRF, but only with a large loss in job reliability.
5.4.3 Scheduling Performance Comparison
This section further compares the best performing scheduling approaches, TRF-Comp and TRF-
Perf (using the S3 resource scoring approach), to other scheduling methods. To more thoroughly
understand the characteristics of these schedulers in a variety of conditions, tests are performed
with a diverse set of job lengths. Results are reported for simulating 6,000 jobs, 25% of which are
checkpointable, over the six month Notre Dame trace. The jobs range from five minutes to the job
length indicated on the x-axis.
Figure 5.7’s top two graphs compare the TRF schedulers to Ren-MTTF (with a variety of days
analyzed) and to the History and Sliding Window prediction-based schedulers. The graphs plot the
percentage difference in both makespan and the number of evictions, vs. TRF-Comp. The History
and Sliding Window predictors also use the Comp scheduler with the S3 resource scoring approach.
TRF-Comp maintains comparable makespan as job length increases, peaking at roughly 11%
higher than Sliding Window, History, TRF-Perf and Ren MTTF-1 for jobs of up to 40 hours in
length. For this same length, TRF-Comp achieves 60% fewer evictions than the next most reliable
scheduler, Ren MTTF-1. For all job lengths up to 40 hours, TRF-Comp achieves at least 15% fewer
[Figure 5.8 comprises four panels plotting makespan and evictions vs. job load (number of jobs): prediction-based schedulers (top) and traditional schedulers (bottom).]
Figure 5.8: TRF-Comp produces fewer job evictions than all comparison schedulers (with the exception of the Pseudo-optimal scheduler) for loads up to 40,000 jobs.
evictions when compared with the most reliable scheduler, Ren MTTF-16; average job makespan
simultaneously decreases by 20% (27.3 hours versus 32.9 hours). TRF-Comp also decreases the
number of evictions by at least 57% compared with all other schedulers, for jobs up to 6 hours long
(355 evictions versus 557). TRF-Perf comes within 1% of the shortest makespan (Sliding Window)
for shorter lengths, and achieves the shortest makespan for jobs of 80+ hours.
The bottom two graphs in Figure 5.7 compare TRF-Comp and TRF-Perf with non-prediction-
based scheduling approaches, including a Pseudo-optimal scheduler that selects the available resource
that will execute the application in the smallest execution time, without failure, based on omniscient
future knowledge of resource availability. When all machines would become unavailable before
completing the application, the Pseudo-optimal scheduler chooses the fastest available resource in
terms of MIPS speed. TRF-Comp's average job makespan tracks that of the non-optimal schedulers,
and it produces the fewest evictions for all job lengths, by a margin of at least 110%.
Figure 5.8 investigates the effect of varying the load on the system (the number of jobs). The job
length is fixed to be uniformly distributed between five minutes and 25 hours, and the number
of jobs injected into the system over the six month trace varies from 100 to 50,000. Again, the top two
graphs compare the TRF schedulers to other prediction-based schedulers, including Ren-MTTF. The
figure demonstrates that TRF-Comp produces fewer evictions across all loads except in the high
load case of 50,000 jobs. TRF-Perf ties the lowest average job makespan across all loads, improving
on TRF-Comp by approximately 10%. The bottom two graphs compare the two TRF schedulers
with traditional scheduling approaches. TRF-Comp produces the lowest number of evictions across
all loads with the exception of the Pseudo-optimal scheduler. Lastly, TRF and the other traditional
schedulers produce comparable makespan improvements across all loads.
Chapter 6
Prediction-Based Replication
Prediction-based resource selection is an important tool for dealing with resource unavailability. In
choosing the “best” resources for executing each job, the scheduler can greatly reduce the number
of job evictions and their effect. However, encountering job evictions due to unforeseen resource
volatility is virtually inevitable. Job replication can be used to reduce the detrimental effect of
resource unavailability and benefit performance in two ways. First, replication can be used to
mitigate the effects of job eviction by creating copies of jobs, so that even when one copy of a job
is evicted, one of its replicas may still complete without encountering resource unavailability.
Since an evicted job must find another resource on which to execute, possibly restarting its
execution from the beginning (for non-checkpointable jobs), replicas that do not experience resource
unavailability can reduce the job's makespan. Second, when jobs are scheduled using imperfect
resource ranking heuristics, future resource performance cannot always be anticipated. For example,
the local load on a resource may suddenly fluctuate, causing the job currently executing on that
resource to take longer to complete than anticipated at scheduling time. Replication allows a job
to execute on multiple resources, so that if one resource completes a replica before the original
copy finishes, the job obtains a shorter makespan and an earlier completion time. In these two
ways, replication can be seen as a scheduling tool that not only deals with unpredictable resource
performance but also aids in handling volatile resource availability.
One of the main difficulties with replication is the cost associated with it. Replication has two
main costs in this context. First, because there are a limited number of total resources in any
distributed system, creating replicas of jobs will reduce the available resource pool and the replicas
may take desirable resources away from waiting non-replicated jobs. This can lead to situations in
which the desirable resources (fast, available, etc.) in a system are taken by a few jobs and their
replicas, leaving only less desirable resources for future incoming jobs. Second, in situations such as
Grid economies, creating replicas can have a real world cost where a user may have to pay per job,
per process or per cycle. These two costs mean that users must be judicious in their selection of
which jobs to replicate.
This chapter therefore sets out to test the hypothesis that the TRF availability predictor (Chap-
ter 4) can help select the right jobs to replicate, and can improve overall average job makespan,
reduce the redundant operations needed for the same improved makespan, or both. The replica-
tion techniques are tested in a range of system loads and job lengths while measuring average job
makespan and the number of extra operations performed (overhead). This chapter tests two classes
of replication techniques:
• Static techniques (Section 6.2) replicate based on the characteristics of the job being scheduled
• Prediction-Based techniques (Section 6.3) consider forecasts about the future availability and
load of resources
This chapter first investigates static replication techniques which replicate based on the charac-
teristics of the job being scheduled. These techniques are then improved by incorporating availability
predictions made on the resources chosen to execute a job in order to make better job replication
decisions. Each replication technique is analyzed in terms of a new metric, efficiency. Prediction
accuracy is analyzed for its effect on the quality of the replication decisions. Lastly, Section 6.3.2
investigates system load's effect on the performance of the proposed replication techniques.
Section 6.3.3 proposes three static load adaptive replication techniques which consider overall system
load and desired metric in determining which replication strategy to utilize (Chapter 7 goes on to
introduce an improved load adaptive replication approach).
All replication experiments in this chapter use the PPS Scheduler described in Chapter 5.1,
augmented to support replication. Upon placing each job, the scheduler uses the replication policy
to determine how many replicas to make. In these replication experiments, the scheduler makes 0 or
1 replicas. When a task or its replica completes, the system terminates other copies, freeing up the
resources executing them. Schedulers whose replication policies require an availability prediction
use the TRF predictor, unless specified otherwise.
Scheduling quality is analyzed according to average job makespan, extra operations, and replica-
tion efficiency, which is defined and discussed later. Task lengths are defined in terms of the number
of operations needed to complete them; extra operations refers to the number of additional opera-
tions that the system performs for a job, including all “lost” operations due to eviction, and any
operations performed by replicas (or initially scheduled jobs) that do not ultimately finish because
some other copy of that job finished first.
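As a small illustration of this accounting (a hypothetical helper with made-up numbers; operation counts are abstract work units):

```python
def extra_operations(ops_lost_to_evictions, ops_by_losing_copies):
    """Total operations the system performs for a job beyond its own length.

    ops_lost_to_evictions: ops discarded by evictions and later redone
    ops_by_losing_copies:  ops performed by copies (replicas or the
                           original) that did not finish first
    """
    return sum(ops_lost_to_evictions) + sum(ops_by_losing_copies)

# One eviction discarded 300 ops of progress, and the losing replica had
# performed 450 ops when the winning copy finished: 750 extra ops.
print(extra_operations([300], [450]))  # -> 750
```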
6.1 Replication Experimental Setup
Replication results are based on two traces, the six month Notre Dame Condor availability trace
of 600 nodes analyzed in Section 3.2 [59] and also a trace from the SETI@home desktop Grid
system [29].
Again, for the Condor simulations (as explained in Section 5.1), the MIPS score, the load on each
processor, and the (un)availability states (included or implied by the trace data) all influence the
simulated running of the applications and their replicas. Resources are considered available if they
are running, connected, have no user present, and have a local CPU load below 30%, as these are the
default settings for the Condor system [75]. A resource may only be assigned one task for execution
at a time. Similarly, for the SETI simulations the double precision floating point speed and the
(un)availability states (implied by the trace data) influence the execution of the applications. In the
SETI trace, a machine is considered available if it is eligible for executing a Grid application. This is
in accordance with the resource owner’s settings which may dictate that a machine is only available
if the CPU is idle and the user is not present. Since the SETI trace contains 226,208 resources over
1.5 years, 600 resources are randomly chosen and the simulation executes only over the first 6 months
of the trace, to compare directly with the results obtained from the Condor trace simulations.
Applications are simulated by inserting them at random times throughout the six months, with
durations uniformly distributed between five minutes and 25 hours (this follows the same simula-
tion/scheduling methodology as defined in Section 5.1).1 A uniform distribution demonstrates how
each replication technique performs for a wide range of job lengths, without emphasizing a particular
set of lengths. 25% of the applications are checkpointable, and can take an on-demand checkpoint
in five minutes, if the user reclaims a resource, or if the local CPU load exceeds 50%. If the resource
directly transitions to an unavailable state (for example, by becoming unreachable), no checkpoint
can be taken and the job must restart from its last periodic checkpoint.2 Non-checkpointable appli-
cations must restart from the beginning of their execution when evicted regardless of the eviction
type. In both cases, an evicted job is added to the job queue and immediately rescheduled on an
available resource.
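The eviction rules above can be condensed into a small sketch (hypothetical helper; progress is tracked in minutes, and the five-minute on-demand checkpoint cost is omitted for brevity):

```python
def restart_point(progress_min, checkpointable, graceful_eviction,
                  ckpt_interval_min=360):
    """Minutes of completed work a job retains after an eviction.

    graceful_eviction: True when the resource gives warning (user reclaim,
    or local CPU load exceeding 50%), allowing an on-demand checkpoint;
    False for an abrupt transition (e.g. the resource becomes unreachable),
    which forces a fall-back to the last periodic checkpoint (Condor's
    default interval is 6 hours = 360 minutes).
    """
    if not checkpointable:
        return 0                     # restart from the beginning
    if graceful_eviction:
        return progress_min          # on-demand checkpoint preserves all work
    return (progress_min // ckpt_interval_min) * ckpt_interval_min

print(restart_point(500, True, False))   # -> 360 (last 6-hour checkpoint)
print(restart_point(500, False, True))   # -> 0
```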
The PPS scheduler, as defined in Section 5.1, is used to schedule jobs. Again, jobs are scheduled
each round (once every 3 minutes) by removing and scheduling the job at the head of the queue
until a job can no longer be scheduled. For each job, each available resource is scored and the
resource with the highest score executes the job. Resource i's score, RS(i), is computed as follows
(scoring approach S0, as defined in Section 5.3.1):

RS(i) = m_i · (1 − l_i)

where m_i is resource i's MIPS score and l_i is the CPU load currently on resource i.

1 The duration is determined by the runtime on an unloaded resource with average speed.
2 Periodic checkpoints are taken every 6 hours (Condor's default).
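As a concrete illustration, this scoring rule can be sketched as follows (a minimal sketch; the resource names, MIPS scores, and loads are hypothetical, not values from the simulator):

```python
def resource_score(mips, cpu_load):
    """S0 scoring: favor fast resources with low current CPU load."""
    return mips * (1.0 - cpu_load)

def select_resource(resources):
    """Pick the available resource with the highest S0 score.

    `resources` maps resource name -> (MIPS score, CPU load in [0, 1]).
    """
    return max(resources, key=lambda name: resource_score(*resources[name]))

# Hypothetical pool: a slower but idle machine outscores a fast, loaded one.
pool = {"fast-busy": (3000, 0.8), "slow-idle": (1200, 0.05)}
print(select_resource(pool))  # -> slow-idle (score ~1140 vs. ~600)
```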
When a job is first scheduled, the scheduler determines whether a single replica of the task should be
created, based on the current replication policy.3 If the scheduler chooses to create a replica, an
identical copy of the job is then scheduled. This replica is not considered for further replication.
When any replica of a task completes, all other replicas of the task are located and terminated,
freeing up the resources executing those replicas.
Some replication policies use an availability prediction. For this the TRF predictor, as described
in Section 4.2.1 [60], is used to forecast resource availability. TRF takes as input the expected
duration of the job and the name of a resource, and returns a percentage from 0 to 100 to represent
the predicted probability that the resource will complete the expected duration without becoming
unavailable.
In these experiments, the total number of jobs in the system is varied using 1K, 14K, 27K, and
40K total jobs over the 6 month traces, in four separate sets of simulations. This translates to 0.001,
0.13, 0.26, and 0.39 Jobs per Resource per Day, respectively. These same values were used for the
rest of the tests described in this chapter. They are referred to as the Low, Medium Low, Medium
High, and High load cases.
6.2 Static Replication Techniques
This section explores the effect that replicating jobs based on checkpointability can have on both
extra operations and on job makespan. The total number of jobs varies from the Low to the High
load cases, as defined in Section 6.1.
Figure 6.1 includes the following replication policies:
• 1x: Replicates each job exactly once. If either the main job or the replica runs on a resource
that becomes unavailable, it is rescheduled but never re-replicated. Thus, two versions of each
job are always running.
• Non-Ckpt: Replicates only jobs that are not checkpointable.
• Ckpt: Replicates only jobs that are checkpointable.4
3 Although results have been obtained for an increased number of replicas per job, this study limits the number of replicas to one per job due to space constraints.
4 This replication policy is not based on intuition, but instead serves as a useful comparison for the Non-Ckpt policy.
[Figure 6.1 plots percent makespan improvement (left) and percent extra operations (right) vs. job load for the No-Rep, 1x, Non-Ckpt, Ckpt, and 50% Probability policies.]
Figure 6.1: Checkpointability-Based Replication: makespan (left) and extra operations (right) for a variety of replication strategies, two of which are based on the checkpointability of jobs, across four different load levels.
• 50% Probability: Replicates half of the jobs, at random.
• No-Rep: Does not create any replicas
Figure 6.1 illustrates that under low loads, increasing replicas achieves more makespan improve-
ment, a benefit that falls off for higher loads because replication forces subsequent jobs to use even
less desirable resources. Replicating only non-checkpointable jobs (Non-Ckpt) improves makespan
for all but the highest load test, and at significantly less cost than 1x replication. Moreover, repli-
cating the half of the jobs that are non-checkpointable is better than replicating half of the jobs at
random, which indicates that using checkpointability for replication policies has benefit, as expected.
6.3 Prediction-based Replication
As shown in the previous section, replicating non-checkpointable jobs yields larger rewards than
replicating checkpointable jobs due to the inherent fault tolerance of checkpointable jobs. Replicating
longer jobs also yields higher improvements, although those results are not included here.
These replication strategies can be improved by identifying and replicating the jobs that are
most likely to experience resource unavailability. For this work, a resource is selected based on
its projected near-future performance (speed and load) as described in Section 6.1, and jobs are
replicated based on the probability of the job completing on that resource. Jobs that are more likely
to experience resource unavailability are replicated to decrease job makespan. For this reason, it is
important to consider makespan improvement and overhead within the same metric. This metric
quantifies how much of makespan improvement each replication policy achieves per replica created.
Replication Efficiency is defined as:
Replication Efficiency = Makespan Improvement / Replicas Per Job
where Makespan Improvement is the percentage makespan improvement over not creating any repli-
cas. As defined, increasing makespan improvement with the same number of replicas will increase
efficiency, and increasing replicas to get the same makespan improvement will decrease efficiency.
Another way of considering efficiency is to see how much makespan improvement can be achieved
with a certain number of replicas. The best replication strategies will make replicas of the “right”
jobs, and achieve more improvement for the same cost (number of replicas).
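A worked example with hypothetical numbers: a policy that improves makespan by 12% while creating 0.5 replicas per job is more efficient than one achieving 15% improvement with a full replica per job:

```python
def replication_efficiency(makespan_improvement_pct, replicas_per_job):
    """Percentage makespan improvement (vs. no replication) per replica created."""
    return makespan_improvement_pct / replicas_per_job

print(replication_efficiency(12.0, 0.5))  # -> 24.0
print(replication_efficiency(15.0, 1.0))  # -> 15.0
```

By this metric the first policy is the better choice for a frugal user, even though the second achieves the larger raw makespan improvement.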
Replication efficiency is important in any environment that applies a cost to inserting and exe-
cuting jobs on a system. One user may wish to complete a job as soon as possible (reduce makespan)
regardless of cost, whereas a frugal user may only wish to replicate if the replica created is likely to
provide significant makespan improvement; the frugal user may choose an extremely efficient repli-
cation policy. Efficiency can also be useful to administrators in choosing which replication policy is
most suited to the goals of their system. Efficient replication strategies reduce the overhead and in-
crease the throughput of the overall system by reducing the number of wasted operations performed
by replicas that do not complete or that are evicted. This section analyzes methods for using the
availability predictions to make better replication decisions in terms of both makespan improvement
and efficiency.
6.3.1 Replication Score Threshold
This section first investigates how the prediction threshold at which the system makes a replica
affects performance. The Replication Score Threshold (RST) is the predicted completion probability,
as calculated by TRF, below which the system makes a replica, and above which it does not. As
demonstrated in this section, RST can affect both makespan improvement and replication efficiency.
For example, a system with an RST of 100 replicates all jobs scheduled on resources with a predicted
completion probability below 100%.5 Additionally, the RST-L-NC strategy chooses jobs to replicate
based on their predicted completion probability (RST), their length (longer jobs are replicated), and
whether they are checkpointable (only Non-Checkpointable jobs are replicated). A job length of
greater than ten hours was chosen based on empirical analysis of replication approaches with a
varying length parameter. Non-fixed length approaches have not been investigated and are left to
future work. These results are compared with a policy (1x) that replicates each job, and a policy
5 An RST of 100 will not replicate all jobs, but only jobs whose predicted probability of completion is below 100%. If a resource has not yet become unavailable within the period of time being traced, the predictor may output a 100% probability of completion.
[Figure 6.2 plots percent makespan improvement and replication efficiency vs. RST for the Condor (top) and SETI (bottom) traces, comparing 1x, WQR-FT, RST, and RST-L-NC.]
Figure 6.2: Replication Score Threshold's effect on replication effectiveness (Low Load).
that replicates on available resources only (WQR-FT, described in Section 2.4) [7].
To summarize,
• 1x: Creates a single replica of each job
• RST: Given an availability prediction from the TRF predictor, replicates all jobs whose pre-
dicted completion probability falls below the configured RST value
• RST-L-NC: Replicates all non-checkpointable (NC) jobs that are longer than ten hours (L),
and given an availability prediction from the TRF predictor, whose predicted completion
probability falls below the configured RST value
• WQR-FT: First schedules all incoming jobs, then uses the remaining resources for replicas
by creating up to one replica of each job until either all jobs have a single replica executing or
there are no longer any available resources
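Each of the threshold-based policies above reduces to a simple per-job predicate. A minimal sketch (hypothetical function and parameter names; `completion_prob` stands in for the 0-100 value TRF returns for the chosen resource):

```python
def should_replicate(policy, completion_prob, job_hours, checkpointable,
                     rst=100, length_threshold_hours=10):
    """Decide whether to create the single replica under the given policy."""
    if policy == "1x":
        return True                       # always replicate once
    if policy == "RST":
        return completion_prob < rst      # replicate risky placements only
    if policy == "RST-L-NC":
        return (completion_prob < rst
                and job_hours > length_threshold_hours
                and not checkpointable)   # long, non-checkpointable, risky
    return False

# A 20-hour non-checkpointable job on a shaky resource gets a replica:
print(should_replicate("RST-L-NC", 55, 20, False, rst=70))  # -> True
# The same job, if checkpointable, does not:
print(should_replicate("RST-L-NC", 55, 20, True, rst=70))   # -> False
```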
Figure 6.2 plots makespan and efficiency versus RST in the “Low” load case. The top two graphs
represent the results from executing the simulation on the Condor trace and the bottom two graphs
present the results from the SETI trace.
As RST increases, more jobs are replicated as more predicted completion probabilities fall below
the RST. In the low load case, this increases makespan improvement for both RST and RST-L-NC.
The effect is more prominent in the Condor simulations, but still evident in the SETI simulations.
1x and WQR-FT achieve the highest makespan improvement in the Condor simulations but are
matched by RST in SETI.
The efficiency graphs exhibit the opposite trend. For all RST values, both RST and RST-L-NC
produce higher efficiencies than WQR-FT and 1x. In fact, RST-L-NC produces a 39.5 (Condor)
and a 47.5 (SETI) replication efficiency versus WQR-FT’s efficiency of 21 (for both SETI and
Condor). RST-L-NC’s efficiency outperforms RST due to the incorporation of job characteristics
in choosing to replicate longer, non-checkpointable jobs. The more intelligently and selectively the
scheduler chooses which jobs to replicate, the higher the achieved efficiency. Clearly, considering
both predicted completion probability and an application’s characteristics yields the most efficient
strategy. For example in SETI, RST matches the makespan improvement of WQR-FT but creates
35% fewer replicas while RST-L-NC comes within 6% of the makespan improvement of WQR-FT
while creating 67% fewer replicas. Similarly in Condor, RST-L-NC comes within 6% of the makespan
improvement of WQR-FT but creates 76.8% fewer replicas.
6.3.2 Load Adaptive Motivation
This section examines how increasing load on the system influences the replication policies’ effec-
tiveness. Figure 6.3 plots makespan and efficiency versus RST in the “Medium High” load case.
Again, the top two graphs represent the results from the Condor simulations and the bottom two
graphs represent the results from the SETI simulations. Under Medium High load, increasing RST
actually decreases makespan improvement; creating additional replicas in higher load situations
denies other jobs access to attractive resources. RST70 (RSTi, where i is a value between 0 and 100,
replicates all jobs whose predicted probability of completion is below i%), for example, produces the
largest makespan improvement for Condor, and RST40 produces the largest makespan improvement
for SETI. RST and RST-L-NC produce much higher efficiencies than either 1x or WQR-FT across
all RST values in both the Condor and SETI simulations. At this load level, RST-L-NC produces
a 30 (Condor) and a 20 (SETI) replication efficiency versus WQR-FT’s 7 (Condor) and 8.5 (SETI),
respectively. In SETI, RST-L-NC comes within 0.5% of the makespan improvement of WQR-FT
while creating 58% as many replicas and RST improves on the makespan by 2.5% and creates 13.7%
fewer replicas. For Condor, RST-L-NC improves on the makespan of WQR-FT by 1.5% while cre-
ating 72.5% fewer replicas. Again, RST-L-NC produces higher replication efficiency than RST due
[Figure 6.3 plots percent makespan improvement and replication efficiency vs. RST for the Condor (top) and SETI (bottom) traces, comparing 1x, WQR-FT, RST, and RST-L-NC.]
Figure 6.3: Replication Score Threshold's effect on replication effectiveness (Medium High Load).
to its higher replica creation selectivity. Figure 6.3 demonstrates that under a higher load, selective
replication strategies produce larger makespan improvements. Even under low load (Figure 6.2),
these more selective and efficient replication techniques result in more improvement per replica.
Figure 6.4 further investigates the effect that varying the load in the system has on the perfor-
mance achieved by the replication techniques proposed in Section 6.3.1. Figure 6.4 plots the RST
replication technique with RST values of both 100% and 40% and compares them with not replicat-
ing (No Replication) and replicating a random 20% of jobs (20% Replication) only for the Condor
case due to space constraints (although the result holds for the SETI case). As load increases, the
makespan improvement and efficiency of all replication strategies initially increases slightly but then
falls as Med-High and High load levels are reached. Intuitively, this is because under lower loads
replicas will not interfere by taking quality resources away from other jobs. As load increases, repli-
cas can cause new jobs to wait. Another important result is comparing the proposed techniques to
blind replication. For example, the RST40-L-NC technique creates almost exactly the same number
of replicas as randomly replicating 20% of jobs across the load levels. Given the same number of
replicas, the RST technique achieves a higher makespan improvement and efficiency across all load
levels; it is choosing the “right” jobs to replicate.
[Figure 6.4 plots percent makespan improvement, efficiency, and the number of replicas created vs. job load for the Condor trace, comparing No-Rep, RST100-L-NC, RST40-L-NC, and 20% Probability.]
Figure 6.4: Replication strategy performance across a variety of system loads.
Figure 6.4 also shows that the replication strategy that provides the highest achievable makespan
or efficiency varies based on the load of the system. For example, depending on the load level, either
RST40-L-NC or RST100-L-NC produces the largest makespan improvement.
6.3.3 Load Adaptive Techniques
Since load varies unpredictably, the replication policy should adapt dynamically. Because different
users may target different metrics, the replication policy must also vary by job. This suggests three
load adaptive replication techniques that respond to varying load levels by choosing the replication
strategy that best achieves each of three desired metrics. A load measurement system counts the
number of jobs submitted to the system in the past day. This value then determines the current
load level and helps select a replication strategy. Low load is defined as fewer than 40 jobs per
day, Med-Low load as fewer than 115 jobs per day, Med-High load as fewer than 200 jobs per day
and High load as 200 or more jobs a day. RST-based strategies consider different combinations of
checkpointability and job length depending on the current system load and desired metric.
• Performance: Reduce average job makespan across loads
– Low load: 4x - Create 4 replicas of each job
– Med-Low load: RST100 - Replicate a job if the completion probability is less than 100%
[Figure 6.5 plots percent makespan improvement and efficiency vs. job load for the Condor (top) and SETI (bottom) traces, comparing No-Rep, WQR-FT, Performance-Rep, Efficiency-Rep, and Compromise-Rep.]
Figure 6.5: Load adaptive replication techniques.
– Med-High load: RST40-NC - Replicate a job if the completion probability is less than
40% and the job is non-checkpointable
– High load: No replication
• Efficiency: Increase replication efficiency across loads
– Low, Med-Low, Med-High: RST40-L-NC - Replicate a job if the completion probability
is less than 40%, the job is non-checkpointable and the length is over 10 hours
– High load: No replication
• Compromise: Compromise between average job makespan and efficiency across loads
– Low load: RST100 - Replicate a job if the completion probability is less than 100%
– Med-Low load: RST100 - Replicate a job if the completion probability is less than 100%
– Med-High load: RST40-NC - Replicate a job if the completion probability is less than
40% and the job is non-checkpointable
– High load: No replication
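Taken together, the three techniques amount to a lookup from the measured load level and the user's target metric to a replication strategy. A minimal sketch, using the load thresholds and strategy tables above (strategy names are the labels from the text):

```python
def load_level(jobs_last_24h):
    """Map the past-day job count to the load levels of Section 6.3.3."""
    if jobs_last_24h < 40:
        return "Low"
    if jobs_last_24h < 115:
        return "Med-Low"
    if jobs_last_24h < 200:
        return "Med-High"
    return "High"

POLICY = {
    "Performance": {"Low": "4x", "Med-Low": "RST100",
                    "Med-High": "RST40-NC", "High": "No-Rep"},
    "Efficiency":  {"Low": "RST40-L-NC", "Med-Low": "RST40-L-NC",
                    "Med-High": "RST40-L-NC", "High": "No-Rep"},
    "Compromise":  {"Low": "RST100", "Med-Low": "RST100",
                    "Med-High": "RST40-NC", "High": "No-Rep"},
}

def choose_strategy(metric, jobs_last_24h):
    """Select the replication strategy for the current load and metric."""
    return POLICY[metric][load_level(jobs_last_24h)]

print(choose_strategy("Performance", 30))   # -> 4x
print(choose_strategy("Efficiency", 150))   # -> RST40-L-NC
print(choose_strategy("Compromise", 250))   # -> No-Rep
```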
Figure 6.5 explores the results of the three proposed load adaptive replication policies across
varying load levels and compares their results to the closest related work replication strategy,
WQR-FT. Again, the top two graphs represent the results from the Condor simulations and the bottom
two illustrate the SETI results. For both cases, all of the load adaptive techniques produce the
largest (or extremely close to the largest) achievable value for their chosen metric at each given
load level with the exception of the high load case for SETI. The Efficiency technique produces the
largest efficiency across all load levels (except in the high load case in SETI) while simultaneously
producing the smallest makespan improvement. The Performance technique produces the opposite
trend. The Compromise technique produces an intermediate makespan improvement and efficiency
across all loads. The results demonstrate that these replication techniques can determine the replication
strategy that is best suited to achieving the desired metric at various loads.
Compared to WQR-FT, all three load adaptive techniques produce higher replication efficiencies
across all loads except for the Performance technique in the low load case and the high load case in
SETI. The Efficiency technique produces an average replication efficiency increase of 10.8 across all
loads with a maximum increase of 22 compared with WQR-FT. Similarly, the Efficiency technique
produces at most an increase of 25 in efficiency for the Condor simulations and an average increase
of 15 across all loads compared with WQR-FT. The Performance technique beats the makespan
improvement achieved by WQR-FT by an average of 1.27% across all loads in the Condor simulations
while simultaneously achieving a higher efficiency in all but the low load case. The Performance
technique also improves on the makespan improvement of WQR-FT in all but the high load case
in SETI. The Compromise technique improves or matches the efficiency and makespan achieved by
WQR-FT across all loads with the exception of the high load case in the SETI simulations.
Chapter 7
Load Adaptive Replication
The previous chapter investigated using availability predictors to decide which jobs to replicate. The
goal of this prediction-based replication approach is to replicate those jobs which are most likely
to experience resource unavailability. These jobs gain the most from the fault tolerance that
replication provides and thus produce the largest benefit to the system. As seen in Chapter 6, it is
especially important to choose the “right” jobs to replicate because of the costs associated with
replication. As seen in Section 6.3.2, the load on the system has a large effect on the performance of a
replication strategy. In particular, as load increases, jobs are competing for fewer and fewer resources.
Given this lack of available resources under higher loads, replication strategies must become even
more selective about which jobs are replicated. This is in contrast to lower load situations which
can benefit from more aggressive replication policies.
Section 6.3.3 proposed three static load adaptive replication policies. These policies measure the
load on the system in the past 24 hours and choose the replication policies appropriate for that
load out of a set of four replication policies. The major drawback to this approach is that these
replication policies are chosen statically ahead of time based on testing a range of policies with a
given resource availability trace and workload. The “best” performing policy is then chosen for that
load level range. Unfortunately, this means the load-adaptive approach is tailored to that specific
availability and workload. The four strategies are statically chosen. As shown in this chapter, this
strategy does not handle new workloads or new resource environments well. An additional issue is
that only having four load levels means a relatively high granularity of load adaption. This large
granularity is not able to account for the wide range of loads within each granularity level.
This chapter proposes a new load adaptive approach which addresses these issues and delivers
improved performance. The chapter is organized as follows. Section 7.2 investigates the effect that
varying system load has on the performance achieved by the prediction-based replication techniques
proposed in the previous chapter and motivates the need for load-adaptive replication. Section 7.3
introduces a load adaptive replication approach, Sliding Replication (SR), and tests various
configurations for SR in Sections 7.3.1, 7.3.2 and 7.3.3. Lastly, Section 7.4 tests SR against the
load adaptive approach proposed in Section 6.3.3 and then against WQR-FT using two real world
load traces.
7.1 Load Adaptive Experimental Setup
The same simulation methodology described in Section 5.1 is used for studying these load adaptive
replication policies. This methodology dictates how resources are considered available based on the
availability trace. Section 6.1’s experimental setup is used, including the Prediction Product
Score (PPS) scheduler with the following resource ranking function:
RS(i) = m_i · (1 − l_i)
where m_i is resource i’s MIPS score and l_i is the CPU load currently on resource i. When a job is first
scheduled, the scheduler determines if a single replica of the task should be created based on the
current replication policy. The Condor and SETI resource availability traces are used to dictate
when and how resources become unavailable.
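As a sketch, the ranking function above can be written directly; the function and variable names below are illustrative, not taken from the simulator:

```python
def resource_score(mips, cpu_load):
    """RS(i) = m_i * (1 - l_i): a resource's raw MIPS rating scaled by
    the fraction of its CPU not already occupied by local load."""
    return mips * (1.0 - cpu_load)

# A fast but half-loaded machine can rank below a slower idle one:
# resource_score(2000, 0.5) -> 1000.0, resource_score(1200, 0.0) -> 1200.0
```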
These experiments are driven by real world workload traces from the Grid Workload Archive (GWA) [27],
an online repository of 11 distributed system workload traces including TeraGrid, NorduGrid, LCG,
Sharcnet, Das-2, Auvergrid and Grid5000. The Grid5000 and NorduGrid
traces are used in the load-adaptive replication simulations. Grid5000 consists of between 10,000
and 25,000 system cores with over 1 million jobs. NorduGrid consists of over 25,000 system cores
with between 500,000 and 1 million total job submissions. NorduGrid therefore represents a lower
overall load level.
7.2 Prediction-based Replication
Section 6.3.1 describes a replication strategy that utilizes predictions to choose which jobs to repli-
cate [62]: those that are most likely to experience resource unavailability. Again, if a job’s
likelihood of failure exceeds some threshold, it is replicated. The Replication Score Threshold (RST)
is the predicted completion probability, as calculated by the TRF predictor (Section 4.2.1), below
which the system makes a replica, and above which it does not. As shown in the last chapter, RST
affects both makespan improvement and replication efficiency.
This section motivates the need for load aware replication strategies and investigates the effect
of varying RST. A pool of synthetic jobs is simulated by inserting each job at a random time
throughout the six month Condor trace, with application durations uniformly distributed between
five minutes and 24 hours using the simulation methodology described in Section 5.1. The number
of jobs is varied according to the “Low” (1000 jobs) and “Med-High” loads (27,000 jobs) described
in Section 6.1.¹ The goal is to determine the effect that system load and RST have on replication
performance.
Figure 7.1 plots makespan and efficiency versus RST for “Low” and “Med-High” load. Under
“Low” load, as RST increases, more jobs are replicated because more predicted completion proba-
bilities fall below the RST; this increases makespan improvement. The efficiency graphs exhibit the
opposite trend: as RST increases, each extra job replica contributes slightly less to makespan improve-
ment, leading to lower efficiency.
Under Med-High load, increasing RST actually decreases makespan improvement; creating
additional replicas in higher load situations denies other jobs access to attractive resources. RST-i,
where i is a value between 0 and 100, replicates all jobs whose predicted probability of completion
is below i%; RST70, for example, produces the largest makespan improvement for Condor.
Efficiency decreases as RST increases, but with a sharper slope than in the Low load case, indicating
that the additional replicas decrease efficiency more under the higher load.
Figure 7.1 illustrates that the replication strategy that creates the largest makespan improvement
varies with the load on the system. As load increases, excess replicas steal quality resources from non-
replicated jobs; schedulers should become more selective about which jobs to replicate. Figure 7.1
shows that varying RST can be used as a tool to determine the selectivity of the replication policy.
Higher RST values translate to less selectivity, and lower RST values mean more selectivity (fewer
jobs being replicated). This idea is used later in developing a load adaptive replication approach.
7.3 Load Adaptive Replication
Sliding Replication (SR) is a load-adaptive replication approach that uses the current load on the
system to influence its replication strategy choice. One simple load adaptive replication approach is
to directly translate load to a probability of replication. A function takes system load as input and
produces output that dictates the probability of replication. The Div-0.7 load adaptive replication
approach calculates the probability of replication (RP) with the following equation:
¹ “Med-Low” and “High” load scenarios are not included in this study for clarity.
[Figure 7.1 panels: Makespan Improv. vs. RST (Condor) and Efficiency vs. RST (Condor), with series RST: Low Load and RST: Med-High Load.]
Figure 7.1: Replication Score Threshold’s effect on replication effectiveness across various loads
[Figure 7.2 panels: Makespan Improv. vs. Job Load (Condor) and Efficiency vs. Job Load (Condor), with series Div-0.7, Thresh-0.3, and Thresh-0.4.]
Figure 7.2: Load adaptive replication with random replication probability
Figure 7.3: Sliding Replication’s load adaptive approach
RP = (1 − (JDR/0.7)) · 100
where JDR is the number of jobs per day per available resource. JDR is calculated by examining the
most recent 24 hours of behavior. The static value 0.7 represents a typical upper bound on the JDR
measurement for a high load scenario. Div-0.7 is compared with Thresh-0.3 and Thresh-0.4 which
replicate all jobs if the JDR measurement is below 0.3 or 0.4, respectively. Figure 7.2 analyzes the
performance of these three approaches in terms of makespan improvement and efficiency using the
Condor resource availability trace paired with the Grid5000 job workload trace. Importantly, the
Grid5000 workload does vary load throughout its traced time in terms of how many jobs are injected
into the system, but this does not affect the overall load the trace dictates onto the system. To
simulate a range from lower overall system loads to higher system loads, the simulation chooses a
different percentage of jobs from the trace to execute in relation to the desired load level. K% means
that the simulator keeps a random K% of jobs from the trace. In Figure 7.2, Thresh-0.4 produces
the highest makespan improvement while also producing the highest efficiency for the two higher
load scenarios. Thresh-0.3 produces the highest efficiency for the two lower load scenarios.
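These policies can be sketched as follows, under the assumption that JDR is measured externally over the trailing 24 hours (function names are illustrative):

```python
def div_replication_probability(jdr, cap=0.7):
    """Div-0.7: RP = (1 - JDR/0.7) * 100, clamped to [0, 100]. Low load
    (small JDR) yields a high replication probability; load at or above
    the 0.7 cap yields none."""
    return max(0.0, min(100.0, (1.0 - jdr / cap) * 100.0))

def thresh_replicate(jdr, threshold):
    """Thresh-0.3 / Thresh-0.4: replicate every job while the measured
    JDR is below the threshold, otherwise replicate nothing."""
    return jdr < threshold
```

Div-0.7 degrades replication smoothly with load, while the Thresh policies switch replication off entirely past their cutoff.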
However, these load adaptive replication strategies only choose how many jobs to replicate given
the current load level. These results can be further improved if the replication strategy also chooses
the best jobs to replicate. For example, for a given load, replicating the jobs that will fail will result
in higher gains than replicating jobs that will run to completion without replication. This is the
motivation behind the Sliding Replication (SR) approach. SR not only attempts to regulate how
many replicas are produced given the current load level, but given that number of replicas, it seeks
to choose the proper subset of jobs to replicate, attempting to choose those jobs that are most likely
to fail.
SR employs prediction based replication strategies that replicate all jobs whose predicted likeli-
hood of failure exceeds a certain threshold. Since a single failure threshold will replicate a similar
percentage of jobs across most loads, and since load-adaptive replication approaches seek to vary
the percentage of jobs replicated with system load, SR first uses load to determine which strategy
to use to decide whether to replicate. This choice of strategy influences both the number of replicas
and which jobs get replicated.
Figure 7.3 illustrates the Sliding Replication strategy’s methodology. Sliding Replication uses an
ordered list of replication strategies. The current load is passed to an Indexing Function to deter-
mine an index into that Ordered Replication Strategy List. The index corresponds to a particular
replication strategy which is used to determine if the job is replicated.
7.3.1 Replication Strategy List
This section investigates various ordered lists of replication strategies. As shown in Figure 7.1,
higher system loads require more selective replication strategies. The first intuitive solution is to
propose an ordered list of replication strategies called 21, in which the strategies are ordered from the
most aggressive (always replicate) to the most selective (only make a replica of a job if its completion
probability is below 40%, it is non-checkpointable, and its duration is over 10 hours). This list is
drawn from the replication strategies previously described in Chapter 6 [62]. The indexed list is as follows:
• Index 0: 1x: Creates a single replica of the job
• Index 1-7: RST: Given an availability prediction from the TRF predictor, replicates all jobs
whose predicted completion probability falls below 100% (index 1) to 40% (index 7)
• Index 8-14: RST-NC: Replicates all non-checkpointable (NC) jobs whose completion prob-
ability (as predicted by TRF) falls below 100% (index 8) to 40% (index 14)
• Index 15-21: RST-L-NC: Replicates all non-checkpointable (NC) jobs that are longer than
ten hours (L), and given an availability prediction from the TRF predictor, whose predicted
completion probability falls below 100% (index 15) to 40% (index 21)
• Index > 21: 0x: Never replicates
21 is compared with three other ordered lists:
• 21_rev: Uses the same list as 21 except in reverse order (most selective to most aggressive)
• 7: Uses the same list as 21 except only the first 7 indexes. Any index larger than 7 indicates
the job will not be replicated.
[Figure 7.4 panels: Average Job Makespan vs. Replication List and Evictions vs. Replication List, comparing the 21, 21_rev, 7, and 21_rand lists.]
Figure 7.4: Sliding Replication list comparison
• 21_rand: Has 21 indexes but uses probabilistic replication. The probability that each replication
strategy in the 21 list has of creating a replica is measured; then, for the index chosen, the strategy
randomly creates replicas with a probability equal to that of the corresponding strategy in the
21 list. This isolates choosing the “right” jobs to replicate from merely choosing the right number
of jobs to replicate at each load.
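The indexed 21 list can be condensed into a single decision function, sketched below (the thresholds step down by 10 points per index within each band of seven; all names are illustrative):

```python
def should_replicate(index, completion_prob, checkpointable, duration_hours):
    """Replication decision for one job under the '21' ordered strategy list.
    Index 0 always replicates (1x); indices 1-7 are RST thresholds stepping
    from 100% down to 40%; 8-14 (RST-NC) additionally require the job to be
    non-checkpointable; 15-21 (RST-L-NC) also require duration > 10 hours;
    any index past 21 never replicates (0x)."""
    if index <= 0:
        return True
    if index > 21:
        return False
    band = (index - 1) // 7                    # 0: RST, 1: RST-NC, 2: RST-L-NC
    threshold = 100 - 10 * ((index - 1) % 7)   # 100, 90, ..., 40
    if band >= 1 and checkpointable:
        return False
    if band == 2 and duration_hours <= 10:
        return False
    return completion_prob < threshold
```

Higher indices are generally more selective, which is what lets an indexing function translate system load into replication selectivity.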
Figure 7.4 demonstrates the performance of all four ordered replication strategy lists for average
job makespan and number of evictions using the Condor availability trace and 100% of the jobs from
the Grid5000 workload. The 21 list strategy produces the fewest job evictions and lowest average
job makespan. Beating 21_rev demonstrates that making fewer replicas under higher loads is better,
and beating 21_rand demonstrates that the list of policies is making replicas of the right jobs.
7.3.2 Replication Index Function
This section investigates varying the function used by SR to alter the condition under which indices
are chosen. SR can use both linear and exponential functions when calculating these indices. The
exponential function (Exp-X) calculates an index into the ordered list of strategies as follows:
Index = (JDR · 10)^X
X determines the strategy’s sensitivity to system load. Smaller values of X result in smaller indices
and therefore more replicas. The linear function (Lin-X) is:
Index = (JDR · 10) · X
Again, lower X values result in lower indices and more replicas.
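Both indexing functions can be sketched as follows (truncating the result to an integer index is an assumption; the rounding rule is not specified above):

```python
def exp_index(jdr, x):
    """Exp-X: Index = (JDR * 10) ** X. Larger X maps the same load to a
    deeper (more selective) position in the strategy list."""
    return int((jdr * 10) ** x)

def lin_index(jdr, x):
    """Lin-X: Index = (JDR * 10) * X."""
    return int((jdr * 10) * x)
```

For instance, at JDR = 0.4, Exp-2.2 already lands at index 21 (the most selective strategy), while Exp-1.1 stays near index 4.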
[Figure 7.5 panels: Makespan Improv. vs. Job Load and Efficiency vs. Job Load, for Exp-1.1/1.6/2.2 and Lin-0.75/1/1.25/1.5.]
Figure 7.5: Sliding Replication function comparison
Figure 7.5 compares linear indexing (Lin-X) in the top two graphs with exponential indexing
(Exp-X) in the bottom two, for the Condor resource availability trace paired with the Grid5000
job workload trace. The percentage of jobs injected from the Grid5000 workload is varied from
25% to 100%. The X parameter is also varied in both equations. For both functions, larger values
of X produce higher replication efficiencies, especially at higher loads. The exponential expression
produces similar but slightly higher replication efficiency when compared with the linear function.
7.3.3 SR Parameter Analysis
This section analyzes how the X parameter in the exponential index expression affects which index,
and therefore, which replication policies are chosen. The simulation executes the Condor trace
coupled with the Grid5000 workload. The X parameter is varied among 1.1, 1.6 and 2.2 to see which
replication policies (indices) are chosen throughout the simulation.
Figure 7.6 plots the frequency with which each value of the X parameter selects each index
and how many replicas are created at each index. Values over 21 result in no replicas and are not
shown. The only unexpected result is index 8 (RST-40) creating a larger number of replicas than
the surrounding strategies.
[Figure 7.6 panels: Frequency Count vs. Index and Replicas Created vs. Index, for Exp-1.1, Exp-1.6, and Exp-2.2.]
Figure 7.6: Sliding Replication index analysis
7.4 Sliding Replication Performance
This section compares Sliding Replication with other replication strategies. Section 7.4.1 compares
SR against the prediction-based load-adaptive replication approach Compromise-Rep, proposed in
Section 6.3.3. Section 7.4.2 then compares SR with WQR-FT using two real world workload traces.
7.4.1 Prediction-based Replication Comparison
This section compares the SR strategy with the previously proposed replication strategy Compromise-
Rep as defined in Section 6.3.3 [62]. Recall that Compromise-Rep measures the number of jobs in the
last day; if the number of jobs is less than 115, it replicates a job if its completion probability
is lower than 100%. If the number of jobs is greater than or equal to 115 but less than 200, it creates
a replica if the job’s completion probability is less than 40% and the job is non-checkpointable. Any
more than 200 jobs per day and Compromise-Rep will not create a replica. To compare these two
strategies, again the Condor availability trace coupled with the Grid5000 workload is used. Again,
to simulate a range from lower overall system loads to higher system loads, different percentages of
jobs are used from the Grid5000 trace in relation to the desired load level. K% means that a random
K% of jobs are kept from the trace.
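Compromise-Rep's static thresholds can be sketched as a simple decision function (illustrative names; the job count is measured over the trailing 24 hours):

```python
def compromise_rep(jobs_last_day, completion_prob, checkpointable):
    """Compromise-Rep: choose one of three static policies from the measured
    daily job count. Under 115 jobs/day, replicate anything not certain to
    finish; from 115 to 199, replicate only non-checkpointable jobs whose
    predicted completion probability is below 40%; at 200+, never replicate."""
    if jobs_last_day < 115:
        return completion_prob < 100
    if jobs_last_day < 200:
        return completion_prob < 40 and not checkpointable
    return False
```

The hard 115/200 cutoffs are exactly what SR's sliding index replaces with a continuous mapping from load to selectivity.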
Figure 7.7 shows the results of the comparison between the two load adaptive replication strate-
gies across a variety of loads. The left subgraph depicts makespan improvement over creating no
replicas. For makespan improvement, SR (Exp-2.2) outperforms Compromise-Rep under low loads;
under other loads the two strategies perform similarly. For replication efficiency, SR significantly
outperforms Compromise-Rep for all loads.
[Figure 7.7 panels: Makespan Improv. vs. Job Load (Condor) and Efficiency vs. Job Load (Condor), comparing Exp-2.2 and Compromise-Rep.]
Figure 7.7: Static vs. adaptive replication performance comparison
7.4.2 SR Under Various Workloads
This section analyzes SR for a variety of workloads and availability traces across various exponential
values and compares it to the WQR-FT replication strategy (defined in Section 2.4) [7]. SR is tested
using three values of X: 1.1, 1.6, and 2.2. Figure 7.8 depicts results from the Grid5000
workload under both the Condor (top two graphs) and SETI (bottom two) availability traces.
For Condor, SR produces a comparable makespan improvement to WQR-FT (2.75% on average
across the loads) while producing a 225% increase in replication efficiency on average, indicating SR
creates significantly fewer replicas to produce a similar makespan improvement. WQR-FT creates
a replica if any resource is available, an aggressive but potentially inefficient strategy that improves
makespan unless replicas leave significantly fewer quality resources for future jobs.
For SETI, SR produces slightly over 1% makespan improvement over WQR-FT for all load levels
while simultaneously increasing replication efficiency by an average of 547%. Increasing the value
of X increases the replication efficiency in both the Condor and SETI simulations while slightly
decreasing makespan improvement across all load levels.
Figure 7.9 depicts results from the NorduGrid workload. Again, the top two graphs represent
results from the Condor trace and the bottom two from SETI. SR produces a higher makespan
improvement and replication efficiency in all but the SETI lower load case.
[Figure 7.8 panels: Makespan Improv. vs. Job Load and Efficiency vs. Job Load, for Condor (top) and SETI (bottom), comparing Exp-1.1, Exp-1.6, Exp-2.2, and WQR-FT.]
Figure 7.8: Sliding Replication under the Grid5000 workload
[Figure 7.9 panels: Makespan Improv. vs. Job Load and Efficiency vs. Job Load, for Condor (top) and SETI (bottom), comparing Exp-1.1, Exp-1.6, Exp-2.2, and WQR-FT.]
Figure 7.9: Sliding replication under the NorduGrid workload
Chapter 8
Multi-Functional Device Utilization
Conventional workstations and clusters are increasingly supporting distributed computing, including
applications that require distributed file systems [9][10] and parallel processing [64][34][36]. For com-
putationally intensive workloads, schedulers can characterize and predict workstation availability,
and use the information to effectively use pools of distributed idle resources as done in previous chap-
ters [64][34][54][58]. In fact, many projects have successfully harvested idle cycles from workstations
and general purpose compute clusters.
This chapter describes efforts to leverage unconventional computing environments, defined here as
collections of devices that do not comprise general purpose workstations or machines. In particular,
the aim is to leverage special purpose instruments and devices that are often deployed in great
numbers by corporations and academic institutions. These devices have increasingly fast processors
and large amounts of memory, making them capable of supporting high performance computing.
They are often connected by high speed networks, making them capable of receiving and sending
large data intensive jobs. Importantly, and as seen in Section 8.4, these devices often have significant
amounts of idle time (95% to 98% idle). These characteristics provide potential to federate and
harvest the unused cycles of special purpose devices for compute jobs that require high performance
platforms [57].
8.1 Multi-Functional Devices
One concrete example of an unconventional environment containing highly capable special purpose
devices is a pharmaceutical company and its multi-function document processing devices (MFDs).
MFDs, workgroup printers, and coupled servers are common in businesses, corporations, and aca-
demic institutions. All three types of institutions fulfill printing and document creation needs with
fleets of such devices, which provide ever increasing speed, functionality, and print quality.
[Figure 8.1: histogram, x-axis # of devices per floor, y-axis frequency.]
Figure 8.1: MFD count per floor at a large pharmaceutical company
To support this additional functionality—including email, scanning, faxing, copying, document
character recognition, concurrency and workflow management, and more—the devices now contain
processing power, RAM, secondary storage, and even GPUs, making them as computationally ca-
pable as general purpose servers. A single MFD, for example, includes at least a 1 GHz processor,
2 GB of memory, 80 GB HDD, Gigabit Ethernet, other processing cores and daughter boards that
function as co-processors, phone and fax capability, and mechanical components. Such a device
would currently cost approximately $25,000. This configuration compares favorably to a typical
high performance workstation or even one or more cluster nodes (the components of typical Grid
and cloud computing platforms).
Large organizations deploy a significant number of such devices, in part because to be useful they
must be located in physical proximity to users. Figure 8.1 includes the number of MFDs on each
floor of a large pharmaceutical company, a customer of Xerox Corporation. Each bar represents
the number of floors of a building that contain the number of MFDs indicated on the X axis. The
company has a total of 1655 MFDs, and over 65% of the floors in its buildings have five or more
such MFDs.
As a second example, MFDs from a business division of a large company from the computer
industry (managed by Xerox Corporation) are depicted in Figure 8.2. This organization consists of
14 floors split across several buildings.¹ The organization has a total of 80 MFDs, 54% of the floors
have five or more MFDs, and 67% of the buildings have 13 or more MFDs.
¹ A typical building’s floor is 40,000 square feet.
[Figure 8.2: histogram, x-axis # of devices per floor, y-axis frequency.]
Figure 8.2: MFD count per floor at a division of a large computer company.
8.2 Unconventional Computing Environment Challenges
Unfortunately, simply recognizing a new hardware platform and plugging in existing cycle harvesting
solutions will not work. Unconventional computing environments may differ significantly from the
environments that are normally used to support Grid jobs. These differences require new approaches
and solutions. In particular, the differences include
• characteristics of the local jobs that the environments support,
• expectations that local users have for their availability and performance,
• the effect on Grid jobs when the resources are reclaimed for local use, and
• the kinds of Grid jobs the devices can support well (when they are not processing local jobs).
These differences are addressed below.
Local jobs: Desktop and workstation availability can track human work patterns; stretches
of work keep the machines busy, with short periods corresponding to breaks. In contrast, shared
printing resources (workgroup printers and MFDs) and print servers can be available for a relatively
high percentage of time, but may be frequently interrupted to process short jobs. Moreover, these
interruptions tend to arrive in bursts, as shown later in the chapter. Figure 8.3 depicts the burstiness
of job arrivals of a sample MFD, by tracking the state of the device through one day. When the
resource is printing, it is shown in the figure as Busy; when it is processing email, faxing, or scanning
a document, it is shown as Partially Available; and when it is idle, it is shown as Available. Figure 8.3
shows that MFDs in large organizations can be available for Grid use 95% or more of the time. But
device usage patterns exhibit volatility; short high priority jobs arrive intermittently.
[Figure 8.3: timeline of MFD State (Available, Partially Available, Busy) vs. Time of day, 10am-2pm.]
Figure 8.3: Trace data from four MFDs in a large corporation. The devices are Available over 98% of the time, and requests for local use are bursty and intermittent.
Local user expectations, and the effect on Grid jobs: Workstations and cluster nodes can
often suspend Grid or cloud jobs when a local job preempts them, keeping the Grid jobs in memory
in case the resource may soon become idle again. Further, in conventional environments, the Grid
job may be able to take an on-demand checkpoint, or even migrate to another server. In the MFD
environment, by contrast, local users demand fast turnaround time for their jobs, and local jobs
typically require the full capabilities of the MFD. Therefore, when a user requests service from an MFD, any Grid jobs must immediately
be halted, without the possibility of checkpointing or migrating.
Types of jobs: Partly because of these local user expectations, MFDs are better able to support
shorter Grid jobs (e.g. tens of minutes) that can be squeezed into relatively short periods of avail-
ability. These relatively short Grid jobs may be related to printing, raster (image) processing, or
other services such as optical character recognition (OCR). These jobs are processing intensive, but
they can be divided amongst the devices in a floor or building. The jobs naturally lend themselves
to being processed in a page parallel fashion (MFD jobs typically arrive as a number of separate
pages). For example, a Grid job that is squeezed into some period of availability might represent
several pages of a long OCR job from a nearby MFD. For the OCR example, a job that may have
taken five to ten minutes might then require under two minutes if cycles from five nearby MFDs
could be harvested.
A doctor’s office is another example of an environment with non-traditional HPC requirements.
Several federated MFDs could effectively support a processing intensive job such as OCR and archiv-
ing of patient medical records. During every visit, patients fill in and sign privacy notices, prior
medical history, and other identifying information. For these to be automatically correlated to elec-
tronic medical records, OCR is necessary. At the same time, doctors’ offices print and fax frequently,
making the job traffic to MFDs highly intermittent and varied. MFDs are better suited to scanning,
handling paper forms, image related workflows, and secure prescription printing at the doctor’s of-
fice. However, as outlined above, they benefit from acting in concert for processing intensive OCR
jobs whenever a subset of devices detect a period of availability. Traditional compute clusters are
not only overkill for this domain, but they also do not handle paper workflows such as form scanning
and prescription printing.
This chapter characterizes this domain and proposes a burstiness-aware scheduling technique
designed to schedule Grid or cloud jobs to make better use of cycles that are left free in this en-
vironment. The devices also occasionally need bursts of computational power, perhaps from one
another, to perform image processing, including raster processing, image corrections and enhance-
ments, and other matrix operations that could be parallelized using known techniques [23] [10].
Sharing the burden of demanding jobs would allow fewer devices to meet the peak needs of bursty
job arrivals. Using known programming models [10] also enables leveraging the devices for general
purpose distributed computations.
This chapter investigates the viability of using MFDs as a Grid for running computationally
intensive tasks under the challenging constraints described above. Section 8.3 first defines an MFD
multi-state availability model similar to the multi-state availability model presented in Section 3.1.1.
The key difference is that this new model is geared towards the availability states of unconventional
computational resources such as MFDs whereas the previous model represents the states of tradi-
tional general purpose computational resources (e.g. workstations). Next, Section 8.4 gathers an
extensive MFD availability trace from a representative environment and analyzes it to determine
how MFDs in the real world behave in terms of the states in the proposed model. Section 8.5
proposes an online and computationally efficient test for MFD “burstiness” using the trace. This
proposed burstiness metric is used to capture the dynamic and preemptive nature of jobs encoun-
tered on the devices. The burstiness metric shows how MFDs can be used as computational resources
and what scheduling improvements can be achieved by accounting for resource burstiness in making
Grid job placements. Section 8.6.2 computes statistical confidence intervals on the lengths of Grid
jobs that would suffer a selected preemption rate. This assists in selecting Grid jobs appropriate for
[Figure 8.4 diagram: states Available to Grid, Scan/Fax/Email, Printing, and Unavailable.]
Figure 8.4: MFD Availability Model: MFDs may be completely Available for Grid jobs (when they are idle), Partially Available for Grid jobs (when they are emailing, faxing, or scanning), Busy or unavailable (when they are printing), or Down.
the bursty environment. On average across all loads depicted, the proposed scheduling technique
comes within 7.6% makespan improvement of a pseudo-optimal scheduler and produces an average
of 18.3% makespan improvement over the random scheduler.
8.3 Characterizing State Transitions
This section develops an availability model with which to categorize the types of availability and
unavailability that an unconventional computational resource such as an MFD experiences. In
particular, this section is interested in how the patterns of a core set of native MFD jobs (core
or local jobs) affect the availability of CPU cycles to a second, elastic set of Grid jobs, which are
assumed to run at a lower priority. Devices oscillate between being available and unavailable
for Grid jobs, depending on the core job load they experience from local users. MFDs can occupy
four availability states based on processing their core job load. As shown, these states can have a
significant impact on Grid jobs.
Figure 8.4 identifies the four availability states suggested by MFD activity. An Available machine
is currently running with network connectivity, and is not performing any other document related
task such as printing, scanning, faxing, emailing, character recognition or image enhancement. It
may or may not be running a Grid application in this case. A device may transition to a Partially
Available state if a local user begins to use it for scanning, faxing, or email. These tasks require
some but not all of the device’s resources. When an MFD is printing a document, it typically
requires the full capability of the machine’s memory and processing power, and so a separate Busy
state is designated for MFDs that are currently processing print jobs. Finally, if an MFD fails or
becomes unreachable, it is in the Down state. These states differentiate the types of availability and
unavailability a resource will experience.
When scheduling jobs onto a Grid of MFDs, it is best to avoid devices that are likely to become
Busy before the application has finished executing. If the Busy state is entered, the Grid job must
be preempted, losing all progress. This assumption reflects the expectations of typical
MFD users, who are highly concerned with local job turnaround time. Entering the Partially
Available (e.g. scan/fax/email) state also affects Grid jobs, which will be left with fewer cycles.
However, because functionalities such as scanning, faxing and emailing require significantly fewer
computational resources compared to printing, Grid jobs need not be immediately halted; they may
continue to execute with a significantly smaller percentage of the total CPU capacity while the user’s
scan, fax, or email job completes.
Given the different types of unavailability a device may experience, applications that attempt to
utilize MFDs as computational resources should avoid the devices most likely to enter the Busy
state, because of the large number of cycles lost when an application must restart on a different
resource. Those applications should also avoid an MFD likely to enter the Partially Available state
because of the resulting slowdown, although the consequences of entering this state are much less
severe than those of entering the Busy state.
A scheduler that is aware of each MFD’s behavior and likelihood of entering these states can
make better Grid job placement decisions by attempting to avoid the Busy state altogether and
by limiting the number of times applications are slowed down by MFDs that enter the Partially
Available state. Section 8.5 introduces a metric called burstiness, which allows schedulers to exploit
the fact that transitions into Busy and Partially Available states are typically clustered together in
time.
8.4 Trace Analysis
A 153-day MFD job trace was gathered from four MFDs located at a large enterprise company
between mid-2008 and mid-2009 and then analyzed. The trace consists of a series of time-stamped MFD jobs.
Each job includes the job type (print, scan, fax or email), job name, number of pages, submit time,
completion time and the user submitting the job. This data directly implies the Busy and Partially
Available states. The trace only identifies when jobs are submitted and completed; it therefore
precludes determining exactly when the Down state was entered. As results show, MFDs are rarely
Down, and these particular MFDs were more than 99% available, by service agreement; MFDs were
Available if they were not otherwise being used. This section analyzes this trace data to characterize
MFDs according to the states of the proposed model with the goal of identifying trends that could
enhance the scheduling of Grid jobs.
8.4.1 Overall State Behavior
To determine how much of an MFD’s time is available to Grid jobs, Figure 8.3 depicts the overall
percentage of time the four MFDs spend in each availability state, along with an example of that
availability (recall that an MFD is Available whenever it is not otherwise being used). Figure 8.3a
shows that availability is the dominant state, with
98.2% of the MFD’s overall time being spent in the Available state; just 1.8% of the time is spent in
the Busy and Partially Available states. Of the time spent in the non-available states, shown in
the right sub-graph, 85.7% is Busy and the remaining 14.3% is Partially Available. Figure 8.3 also
shows an example of the typical behavior of an MFD during mid-day on a
weekday. It is mostly available but has sporadic bursts of unavailability. Thus, MFDs are highly
available resources with a large potential for use by Grid jobs.
Although there are more than 1,200 instances of availability of more than one hour, two-thirds
of the availability periods during daytime hours are 20 minutes or less. For daytime Grid jobs, it
can therefore be difficult to find an uninterrupted availability period. Even though there are many
available time periods during the day, scheduling greedily to fill most of the availability period might
lead to frequent application failures and restarts. As a result, the failure rate gets compounded—it
includes the failure rate for restarted jobs as well. Therefore, Section 8.6.2 conducts an experiment
to determine the size of Grid jobs that the MFD pool can handle well—that is, for which the pool
can keep Grid job failure rates between 5% and 10%.
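The compounding can be made concrete: if each attempt of a job fails independently with probability p, the number of attempts follows a geometric distribution with mean 1/(1 − p). A small illustrative calculation (the probabilities here are hypothetical, not measured from the trace):

```python
def expected_attempts(p_fail):
    """Expected number of times a job must be (re)started if each attempt
    independently fails with probability p_fail (geometric distribution)."""
    if not 0 <= p_fail < 1:
        raise ValueError("p_fail must be in [0, 1)")
    return 1.0 / (1.0 - p_fail)

# At a 10% failure rate a job averages ~1.11 attempts; at 50% it averages 2.
# Keeping failure rates in the 5-10% range keeps wasted work modest.
```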
8.4.2 Transitional Behavior
This section analyzes the behavior of the MFDs in terms of when and what types of transitions they
make. State transitions can affect the execution of an application because they can trigger reduced
application performance (the Partially Available state) or even require application restart (the Busy
state). Figure 8.5 shows the number of transitions to each state by the day of the week. Transitions
to Available, Busy and Partially Available states are highest on Monday, taper off slightly by the
Figure 8.5: MFD state transitions by day of week
Figure 8.6: MFD state transitions by hour of day
end of the week, and are significantly reduced on weekends.
Figure 8.6 shows the number of transitions to each state by hour of day. The numbers of
transitions to each of the three states show clear diurnal patterns, and are similar to one another.
Transitions peak at around 10am and 2pm with a trough between the two peaks corresponding to the
lunch break. Transitions are much lower overnight. Thus, MFD utilization follows strong weekly
and diurnal patterns. These patterns lend themselves to increased predictability and scheduling
potential.
8.4.3 Durational Behavior
This section analyzes the behavior of the MFDs in terms of how long they remain in each state
relative to the day of the week and the hour of the day. Longer periods of availability would allow
Figure 8.7: MFD state duration by day of week

Figure 8.8: MFD state duration by hour of day

Figure 8.9: Number of occurrences of each state duration
longer Grid jobs to complete without interruption.
Figure 8.7 depicts the average duration of each state according to the day of the week. The
average duration of an Available state is quite low during the weekdays at approximately 1.4 hours
but becomes much larger during the weekend reaching a peak of 20 hours on Saturday. Figure 8.8
investigates the average duration of each state according to the hour of the day. As expected, the
average duration of the Available state is at its lowest in the middle of the day, since this is when
users are most likely to utilize the MFDs. The duration of the Busy state shows no such trends.
These results indicate that longer jobs are best scheduled later in the day when availability periods
are typically much longer.
Figure 8.9 shows the distribution of durations spent in each state. It plots the number of
occurrences of each duration versus the length of that duration. The most significant result shown
in Figure 8.9 is the duration of the Available state: the vast majority of availability periods are
extremely short, lasting 20 minutes or less, with a long tail indicating decreasing occurrences of
longer available durations.
8.5 A Metric for Burstiness
This section proposes a burstiness metric and an algorithm with which to determine the burstiness
of a given resource. Burstiness captures temporary surges in activity that occur sporadically in
the MFD domain. MFD use tends to occur in “bursts”: it is clustered in time and lasts for short
durations, which makes bursts difficult to detect and counter effectively. Frequent bursts impede
the completion of any long Grid job and also make scheduling short jobs difficult, because shorter
jobs are purged or restarted frequently, causing resource thrashing and wasted network and processor
cycles.
Therefore, to perform high-performance distributed Grid computing on devices that are bursty
(due to preemption by higher-priority local jobs), burstiness can be used as a form of availability
prediction. The proposed burstiness metric enables the system to schedule jobs when the burstiness
of a resource is low, greatly lowering the likelihood of job eviction.
8.5.1 Burstiness Test
The following notations are used:
At any given time tnow, let T refer to a recently elapsed time window over which burstiness is
to be estimated (T is set to 2 hours). Let µ and σ refer, respectively, to the mean and standard
deviation of the time between jobs (that is, the interarrival time) in the time period T.
To quantify burstiness at time tnow, the algorithm looks back over the period T and estimates
(a) whether the most recent block of time within T (blocks are of length µ, the average interarrival
time between jobs in T) has experienced more transitions to the device’s Busy state than other
blocks of time in T; and (b) whether the spread in the most recent block is significantly different
from that of the others in T. These comparisons detect whether the most recent block has had an
abnormally high job arrival rate, with spreads just tight enough to frequently disrupt Grid tasks.
More formally, the steps are as follows:
1. Starting from tnow, go back and divide the period T into contiguous blocks of time each of
length µ, the average interarrival time. Call each time block bp where p is the index of the
block counting backwards from tnow.
2. Let the number of job arrivals in a block bp be np.
3. Compute the mean and standard deviation of the number of job arrivals per block, across all
blocks, and refer to them as µB and σB .
4. For a given block p, the spread in jobs is given by the standard deviation σp of the time
between jobs.
5. Comparison 1: Determine whether np is more than one σB above µB (meaning there are
relatively more transitions into the unavailable state in this time block).
6. Comparison 2: Determine whether σp < σ (meaning the jobs are more clustered and less
spread out than normally observed within period T ).
8.5.2 Degree of Burstiness
If the answers to both Comparison 1 and Comparison 2 are Yes, then the block bp is considered
bursty; its degree of burstiness is defined in Equation 8.3. Otherwise, the block bp does not exhibit
burstiness. Two dimensionless quantities characterize burstiness
quantitatively. The strength S of the burst is defined as
Sp = np / µB    (8.1)
where a quantity greater than one denotes that a time block p is experiencing a burst. Spread ratio
(D) is defined as
Dp = σ / σp    (8.2)
where a quantity greater than one denotes that the time block p is experiencing a tighter cluster
than the whole time history window.
The burstiness of a window p is defined as
Bp = Sp + Dp (8.3)
Other forms of this metric could weight the two quantities in Equation 8.3 differently from one another.
One drawback of this metric is that bursty events may spill over into adjacent time blocks {bp, bp−1}
and/or {bp, bp+1}. This is addressed in future work by establishing time block combination strate-
gies.
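Putting the test of Section 8.5.1 and the quantities Sp, Dp, and Bp together, one possible sketch follows. Two assumptions here are my own, not the dissertation's: timestamps are in seconds, and blocks that are empty or contain a single job fall back to the window-wide spread σ:

```python
import statistics

def burstiness(arrival_times, t_now, T=2 * 3600):
    """B_p = S_p + D_p for the most recent block of length mu before t_now.
    arrival_times: sorted job-arrival timestamps in seconds.
    Returns 0.0 when the window is too sparse to form blocks."""
    window = [t for t in arrival_times if t_now - T <= t <= t_now]
    if len(window) < 3:
        return 0.0
    gaps = [b - a for a, b in zip(window, window[1:])]
    mu = statistics.mean(gaps)        # mean interarrival time in T
    sigma = statistics.stdev(gaps)    # spread of interarrival times in T
    if mu <= 0:
        return 0.0
    # Step 1: divide T into contiguous blocks of length mu, counting back
    # from t_now; collect n_p (arrivals) and sigma_p (spread) per block.
    n_blocks = max(1, int(T // mu))
    counts, spreads = [], []
    for p in range(n_blocks):
        lo, hi = t_now - (p + 1) * mu, t_now - p * mu
        block = [t for t in window if lo < t <= hi]
        counts.append(len(block))
        block_gaps = [b - a for a, b in zip(block, block[1:])]
        spreads.append(statistics.stdev(block_gaps)
                       if len(block_gaps) > 1 else sigma)  # fallback (mine)
    mu_B = statistics.mean(counts)            # mean arrivals per block
    n_p, sigma_p = counts[0], spreads[0]      # most recent block
    S_p = n_p / mu_B if mu_B > 0 else 0.0     # strength (Eq. 8.1)
    D_p = sigma / sigma_p if sigma_p > 0 else 0.0  # spread ratio (Eq. 8.2)
    return S_p + D_p                          # B_p (Eq. 8.3)
```

A window of evenly spaced arrivals scores low, while a tight cluster of arrivals just before t_now drives both Sp and Dp, and hence Bp, well above it.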
8.5.3 Burstiness Characterization
To quantify the burstiness of the devices, Figure 8.10 computes burstiness values at 10^7 sample
points (slightly less than one sample every 2 seconds over the trace). Figure 8.10 plots the calcu-
Figure 8.10: Burstiness versus Day of Week and Hour of Day
lated burstiness values versus both the time of the day and the day of the week for the four MFDs.
The left subgraph illustrates that the device burstiness peaks at about mid-day and dramatically falls
off early and late in the day. The right graph indicates that burstiness reaches a peak on Monday,
at which point it slowly tapers off until it falls dramatically on the weekends. These results confirm
the intuition that MFDs experience maximum burstiness on weekdays around mid-day. As shown in
the next section, the burstiness of a resource closely follows its transitional behavior. Hence it is a
good indicator of volatility and can be extremely useful in scheduling to avoid preemptions brought
on by MFD transitions to unavailable states (e.g. printing).
8.6 Burstiness-Aware Scheduling Strategies
This section discusses the viability of scheduling jobs on MFDs given the volatility they exhibit.
This section proposes a new scheduling technique that leverages the burstiness metric proposed in
Section 8.5 to make better job placement decisions. These placement decisions reduce the number of
evictions/preemptions experienced by jobs and thereby reduce the average job makespan, increasing
the overall efficiency of the system. When a preemption is encountered, the executing job will halt
and restart execution elsewhere (thereby increasing makespan).
8.6.1 Experiment Setup
To investigate the scheduling techniques proposed in this section, the following sections simulate
jobs executing on the MFDs; each MFD uses its own recorded availability in accordance with the
gathered (core job) workload trace. The simulations in this chapter insert Grid jobs (the aforementioned elastic
class) at random times in the trace such that for the overall simulation, job injection is uniformly
87
distributed for the considered period of 153 days. However, jobs are only injected between 9am
and 5pm during weekdays to reflect the types of jobs that are expected to be executed on such a
system as discussed earlier. The simulation assigns each job a duration (i.e. a number of operations
needed for completion) where the duration is an estimate based on the execution of the job on an
idle device with average CPU speed (device attributes are not unlike server attributes and are based
on published design/model specifications).
Similarly to the simulation setup in Section 5.1, the CPU speed, the load on each MFD, and
the (un)availability states (determined by the trace) influence the running of the jobs. MFDs are
considered available if they are not executing a print job (are not in the Busy state). An MFD
may only be assigned one task for execution at a time. During each simulator tick, the simulator
calculates the number of operations completed by each working MFD, and updates the records for
each executing job. If a device executing a job enters the Busy state due to the arrival of a core job,
the Grid job is preempted and put back in the queue for rescheduling.
MFDs are assumed to have no load on their CPU unless the Partially Available state is entered.
While in this state, an executing job will not be preempted, but the job imposes a load of 0.5 on
the processor. Note, however, that this state can model any portion of availability. In this case,
jobs executing on an MFD in the Partially Available state will effectively take twice as long as they
would if they were run on an MFD in the Available state.
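Under these assumptions, the simulator's per-tick progress update reduces to a simple rate adjustment (a sketch with illustrative names, not PredSim's API):

```python
def ops_completed_this_tick(ops_per_tick, state):
    """Operations a Grid job completes in one simulator tick.  The job runs
    at full speed when the MFD is Available, at half speed when it is
    Partially Available (the local task imposes a load of 0.5), and makes
    no progress when the MFD is Busy (preempted) or Down."""
    if state == "available":
        return ops_per_tick
    if state == "partially_available":
        return ops_per_tick / 2  # the job effectively takes twice as long
    return 0
```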
Section 8.6.2 tests the following scheduling strategies:
• Random: Randomly selects an MFD for execution
• Upper Bound: Selects the available MFD that will execute the job in the smallest execution
time, without preemption, based on global knowledge of MFD availability (used for comparison
purposes only; such global knowledge will not usually be available). When all MFDs would
become unavailable before completing the job, the scheduler chooses the fastest available MFD.
• Burst: For each job, each MFD’s burstiness is calculated at the time the job is being scheduled.
The MFD with the smallest burstiness value is greedily selected for job execution.
• Burst+Speed: For each job, each MFD’s burstiness Bi is calculated at the time the job is
being scheduled. The algorithm greedily chooses the MFD with the highest score RSi, where
the score of MFD i is calculated according to the following expression:
Figure 8.11: Comparison of MFD schedulers as Grid job length increases
RSi = ((100 − Bi) / 100) · (CPUmax / CPUi)    (8.4)
where Bi is MFD i’s burstiness, CPUi is its CPU speed, and CPUmax is the highest CPU
speed of all resources (for normalization).
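Transcribing Equation 8.4 directly, the greedy Burst+Speed selection might be sketched as follows (the tuple format and function name are illustrative assumptions of mine):

```python
def burst_speed_select(mfds):
    """Greedily pick the MFD with the highest score RS_i per Equation 8.4.
    `mfds` is a list of (name, burstiness, cpu_speed) tuples."""
    cpu_max = max(speed for _, _, speed in mfds)  # for normalization
    def score(entry):
        _, b_i, cpu_i = entry
        return (100 - b_i) / 100 * (cpu_max / cpu_i)
    return max(mfds, key=score)[0]

# Between two equally fast MFDs, the less bursty one scores higher.
```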
Random is a scheduler with no information and represents the least sophisticated scheduling ap-
proach (a lower bound). Upper Bound is a scheduler with global knowledge and represents an upper
bound on possible scheduling results (maximum makespan improvement with the fewest number of
preemptions). Both Burst and Burst+Speed are greedy schedulers: Burst takes only MFD volatility
into account, whereas Burst+Speed considers both MFD volatility and speed.
8.6.2 Empirical Evaluation
This section investigates the performance of the proposed scheduling strategies with two types of
experiments.
Impact of (Grid) Job Length
Figure 8.11 illustrates the performance of the scheduling strategies as average job length varies. Job
length is the time required to complete the job on a machine with an average dedicated CPU. 1,000
jobs are injected randomly and uniformly over the trace while varying the average length of the jobs.
For each simulation, the four schedulers are given an identical set of jobs to schedule. In the trace
with the shortest jobs, job lengths vary from five to 30 minutes (average length of 17.5 minutes). In
the trace with the longest jobs, job lengths vary from five minutes to six hours (average length of
3.04 hours). Intermediate average lengths varied jobs between five minutes and 76, 147, 218, and
289 minutes, respectively.
89
Figure 8.11’s left subgraph plots the percentage makespan improvement over the random sched-
uler versus the average job length. Figure 8.11 demonstrates that simply utilizing the burstiness
metric (Burst) to consider the volatility of resources and avoid preemptions greatly reduces the av-
erage job makespan by 24% for the shorter average job length case and 3.6% as average job length
increases to 3 hours. Considering volatility and resource speed when scheduling with Burst+Speed
further reduces the average job makespan by 36.6% for the shorter average job length case and 7.5%
as average job length increases to three hours. Burst+Speed comes within an average of 7.6% of
Upper Bound’s makespan improvement across all job lengths; it comes within 13.7% for shorter jobs
gradually improving to within 4.3% for longer jobs. Burst+Speed produces an average makespan
improvement of 18.3% over the random scheduler. This indicates that the burstiness metric closely
captures the true volatility of the devices and allows schedulers to choose MFDs that are far less
likely to experience print preemptions, greatly increasing the efficiency of the system by lowering
average job makespan. The overall trend for all schedulers is that as the average job length increases,
the possible makespan improvement over random scheduling lessens. This is because longer jobs are
simply more likely to fail and given the volatility of the devices in general, even an ideal scheduler
such as Upper Bound cannot avoid preemptions for these longer jobs.
Figure 8.11’s right subgraph plots the number of preemptions (jobs interrupted by the MFD
encountering a print job and being forced off the MFD) versus the average job length. As average
job length increases, all scheduling strategies naturally encounter more job preemptions. Note the
direct correlation between preemptions encountered and average job makespan—fewer preemptions
reduce the average job makespan. Interestingly, Burst and Burst+Speed encounter a similar number
of preemptions through the range of average job lengths with Burst+Speed encountering slightly
more. Incorporating CPU speed into the MFD selection criteria (in Burst+Speed) greatly increases
job makespan improvement (from a 3.8% improvement for longer jobs to a 12% improvement for
shorter jobs) over only considering volatility with Burst.
Figure 8.12’s left subgraph shows the maximum job length such that it will only be preempted
k% of the time. Ten replications of the aforementioned experiments were executed, and the Box-Whisker
plots in Figure 8.12a provide 95% confidence intervals. For example, using the Burst+Speed
scheduler, the diagram indicates with 95% confidence that a job of 7.5 minutes will
only fail around k = 10% of the time. Users could read the figures directly off the graph to choose
Grid job lengths that could be run on the system, given some tolerance for application failure and a
particular statistical confidence interval. Figure 8.12’s right subgraph shows how the average failure
rate varies with job length as computed from the ten experiment replications mentioned above.
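For reference, a 95% confidence interval over such replications can be computed in the standard way; the sketch below uses the normal approximation (the dissertation presents the intervals graphically via Box-Whisker plots):

```python
import math
import statistics

def ci95(samples):
    """Mean and half-width of an approximate 95% confidence interval for
    the mean of `samples` (normal approximation, z = 1.96)."""
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width
```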
Figure 8.12: Determination of Grid job lengths with 95% confidence intervals for given failure rates

Figure 8.13: Comparison of MFD schedulers as Grid job load increases
Impact of (Grid) Job Load
Figure 8.13 investigates the effect of varying the load. This test fixes the job length between five
minutes and one hour (average Grid job length of 32.5 minutes) and varies the number of jobs in-
jected over the course of the trace from 1,000 to 20,000 jobs. Again, Figure 8.13’s left subgraph plots
average job makespan improvement as the load increases using the random scheduler as a baseline.
Overall, the makespan improvement decreases as job load increases: as MFD contention increases,
eventually all MFDs become utilized under every scheduler, preventing a scheduler from choosing
an MFD that will not experience preemption. The schedulers therefore converge on the same
average job makespan. In the low load situations, Burst produces a 19.2% makespan improvement while
Burst+Speed produces a 31.4% makespan improvement compared to the Random scheduler, coming
within 13.3% of Upper Bound’s 44.7% makespan improvement. As load increases, the makespan
improvement gap between these techniques lessens, but Burst consistently produces a makespan
improvement over Random, and in turn Burst+Speed consistently produces a makespan improvement
over Burst. On average across all loads depicted, Burst+Speed comes within 4.3% makespan
improvement of the Upper Bound and produces an average of 7.11% makespan improvement over the
random scheduler, with the gap lessening as load increases.
Figure 8.13’s right subgraph plots the number of preemptions produced by each scheduling strat-
egy versus the job load on the system. As job load increases, all scheduling strategies produce more
preemptions eventually converging on roughly the same number of preemptions in the extremely
high load case. The largest gaps come at lower loads with the gap narrowing as load increases.
Again, the figure demonstrates that the number of preemptions produced by a scheduling strategy
directly influences the average job makespan of the jobs. In all job load cases, Random produces the
largest number of preemptions followed by Burst+Speed, then Burst and finally Upper Bound.
These scheduling results demonstrate that the Burst+Speed scheduling mechanism greatly re-
duces the number of preemptions experienced by jobs compared with the random scheduler and
thereby decreases the average job makespan for a variety of job lengths and job load situations.
Chapter 9
Simulator Tool
This chapter discusses the motivation and implementation of the PredSim simulation framework.
PredSim was used to obtain all the results presented throughout this work. It is a discrete-event,
multi-purpose simulator with capabilities ranging from analyzing resource availability traces to test-
ing job scheduling and availability prediction algorithms. PredSim’s primary purpose is to allow
users to easily add or remove components such as predictors and schedulers for easy comparison.
PredSim also enables researchers to use real-world resource availability and workload traces to drive
their simulations, thereby better approximating the actual behavior of a real system. In this way,
PredSim allows for fast prototyping and development of reliability-aware and fault-tolerant prediction
and scheduling techniques driven by real-world trace scenarios.
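PredSim's API is not listed here; purely to illustrate the pluggable, discrete-event design this chapter describes, a minimal core might look like the following (all names are hypothetical, not PredSim's):

```python
import heapq

class Simulator:
    """Minimal discrete-event core with swappable components, in the spirit
    of the design described above."""
    def __init__(self, scheduler, predictor):
        self.scheduler = scheduler    # pluggable scheduling policy
        self.predictor = predictor    # pluggable availability predictor
        self.now = 0.0
        self._events = []             # (time, seq, callback) min-heap
        self._seq = 0                 # tie-breaker so callbacks never compare

    def at(self, time, callback):
        """Schedule `callback(sim)` to fire at simulated `time`."""
        heapq.heappush(self._events, (time, self._seq, callback))
        self._seq += 1

    def run(self):
        """Process events in time order, e.g. job arrivals and MFD state
        transitions driven by an availability trace."""
        while self._events:
            self.now, _, cb = heapq.heappop(self._events)
            cb(self)
```

Trace-driven operation then amounts to pre-loading the event queue with the time-stamped state transitions from a recorded availability trace.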
9.1 Simulation Background
This section presents background information on simulation based research. Section 9.1.1 discusses
the state of the art and fundamental limitations and Section 9.1.2 discusses related distributed
system simulators.
9.1.1 Simulation Based Research
Simulation based results are used in a variety of fields such as construction, manufacturing, industrial
processes, environmental sciences and computer science to drive effective resource management,
production and research. Simulators are particularly useful to researchers for the following reasons.
• Scalability: Simulators, when implemented well, can scale to incorporate many more resources
and jobs than may be feasible or practical to achieve in the real world. The number of jobs
injected into the system can exceed observed system load, to test the system under more
stressful, rarely seen circumstances, which may become more common in the future.
• Ease of Deployment: The overhead of creating, distributing and setting up a real world
system can be prohibitively costly and time consuming. Simulations have the advantage of an
extremely low cost of deployment. Furthermore, testing unproven ideas in a real world system
can cause an unnecessary slowdown or loss of service to the users of the system.
• Repeatability and Consistency: As opposed to real-world situations, in simulation-based
testing a given scenario can be run repeatedly until an optimal or close-to-optimal solution is
found. Such testing and retesting allows techniques to be fine-tuned and, furthermore, provides
the demonstrable repeatability essential to the scientific process.
This allows researchers to test competing ideas in a consistent environment with common
scenarios to determine the most effective solution(s).
• Speed of Results: Many testing scenarios can take months or years to run to completion.
Testing these scenarios in the real world means that results come extremely slowly. Further-
more, it means that the turnaround time for testing, modifying the technique and retesting
becomes extremely tedious. Simulation can allow these long scenarios to be tested in minutes
or hours, giving the researcher extremely timely feedback with which to further their work.
This allows techniques to be developed several orders of magnitude faster.
• Accessibility: Simulators can easily be made available to the research community and dis-
tributed to interested parties. This is especially important given that it is difficult to have
ready access to an actual Grid system. This ease of access and distribution can allow more
collaboration between researchers on a problem.
Although simulators have many benefits, they still have limitations and weaknesses compared
with other approaches such as real world testing.
• Compatibility: Simulators are often written for the express purposes devised by their authors.
In particular, each design is often unique, and results from one simulator may not be directly
comparable with results from another.
• Bugs: Human error in the coding can cause the simulator to produce inaccurate results.
• Artificiality: The scenarios examined in simulation can be far-fetched and not representative
of any foreseeable scenario. This may lead to results being misinterpreted.
• Applicability: Even the most thorough and exhaustive simulation cannot possibly represent
the complexities of the real world. By their very nature, simulations are simplifications of
real-world situations that not only make many assumptions about how the virtual world will
operate, but also make simplifying assumptions to facilitate their development.
In the end, nothing can completely take the place of real-world testing, but simulation-based
research has proved to be an invaluable tool.
9.1.2 Distributed System Simulators
Computer Science as a field has developed a whole range of simulation based tools for use by
researchers in all of its various subfields. For example, SimOS can simulate an entire computer
including operating system and application programs to study aspects of computer architecture
and operating systems [63]. TOSSIM is a simulation based framework in which mobile sensor
devices executing the TinyOS software system can be virtually deployed to investigate ad-hoc routing
protocols and environmental data gathering techniques [38]. The range and functionality of these
simulators in quite extensive so in this section focuses on related Grid-based simulation work.
Distributed and Grid systems have evolved to become powerful platforms for solving large scale
problems in science and other related fields. As these systems become larger and more complex, the
management of the resources they comprise becomes more important. Coupled with the difficulty
in obtaining access to a large scale Grid testbed, it becomes imperative to provide researchers with
tools for devising and analyzing algorithms before they are deployed in real world environments.
With this aim, many distributed and Grid system simulators have been developed [14] [12] [73] [71].
This section summarizes the most relevant related work.
GridSim, the most closely related simulator, is a Java-based discrete-event Grid simulation toolkit
developed at the University of Melbourne to enable the study of distributed resource monitoring
and application execution [12]. GridSim was specifically designed to incorporate unavailability of
Grid resources during runtime, and provides the functionality to utilize workload traces taken from
real world environments to create a more realistic Grid simulation. GridSim also enables the user to
specify the Grid network topology complete with background traffic. GridSim is built on top of the
SimJava discrete event infrastructure and provides the modeling of resources and jobs executing on
the discrete events defined by the underlying SimJava layer. GridSim incorporates application mod-
eling and configuration, resource configuration, job management and resource brokers and schedulers
to allow for the system to be highly configurable for the user and to enable simulations which are
reflective of their real world counterparts.
SimGrid is another event-driven simulation toolkit designed to provide a testbed for the evaluation
of distributed system scheduling algorithms [14]. It enables the researcher to study application
behavior in a heterogeneous distributed environment with the key aim of rapid prototyping and evalu-
ation of different scheduling approaches. It incorporates the ability to model resource availability be-
havior based on synthetic resource and network behavior models fine-tuned with parameter-constants
or by inputting resource availability traces. Furthermore, SimGrid supports task dependencies in
the form of Directed Acyclic Graphs (DAG) jobs. DAGSim is build on top of this infrastructure
and is used to evaluate scheduling algorithms for DAG-structured applications [28].
Bricks is a java-based discrete-event simulation system geared towards providing components
for scheduling and monitoring applications through the use of resource and network monitoring
tools [73]. It consists of several components including a ServerMonitor, NetworkMonitor, Scheduling
Unit and Network Predictor. These components work in tandem to monitor resource and network
activity and then actively make application scheduling decisions.
The MicroGrid toolkit provides a platform on which to run Globus applications, allowing the
evaluation of middleware and applications in a Grid based environment [71]. MicroGrid creates a
virtual Grid infrastructure and supports scalable clustered resources with configurable Grid resource
performance attributes. The key functionality provided is to enable users to accurately simulate the
execution of Globus applications by presenting the applications with an identical Application Pro-
gram Interface (API) such that they will behave accurately in the simulation. In this way, MicroGrid
allows applications that have been designed to use the Globus Grid middleware infrastructure to be
tested before deployment.
The most fundamental difference between the previous approaches and PredSim is that PredSim
allows a multi-state resource availability model. As noted in Section 3.1, resources may become
unavailable in a multitude of different ways and because of the different effects that these unavail-
abilities can have on executing applications, it’s important that a simulator incorporate these types
of unavailability. Alongside this functionality, PredSim supports simulating checkpointable applica-
tions including periodic and on demand checkpoints along with checkpoint migration. Also, PredSim
is purpose-built around supporting availability prediction-based scheduling approaches and as such
has built in services for monitoring and cataloging resource history to enable prediction. Addition-
ally, PredSim has a built in mode for testing a predictor’s accuracy which can be used to isolate the
accuracy of that predictor under a given scenario. PredSim also has built in mechanisms for creating
and managing task replicas to test prediction and non-prediction based replication techniques.
9.2 PredSim Capabilities and Tools
This section summarizes the primary capabilities and features of PredSim and discusses its included
tools.
9.2.1 Capabilities
PredSim provides the following core functionalities:
• Multi-state availability model compatible
• Task scheduling and prediction accuracy modes
• User specified components such as schedulers and predictors can easily be added
• Compatible with FTA and Condor resource availability trace formats
• GWA workload format capable
• Fully configurable synthetic resource availability and task creation options
• Provides statistical and resource information services for analysis
• Automatic execution of ranges of scenarios (load, resource characteristics, etc)
In these ways, PredSim creates a functional testbed on which to compare prediction, prediction-
based scheduling and replication techniques.
9.2.2 Tools
In addition to the functionalities provided by the core PredSim modules, PredSim also includes
several tools for further analysis and ease of use.
Multi-exec Tool
Often it becomes important to test a given technique in multiple environments under a series of
conditions. The core modules of PredSim allow the simulator to automatically execute a range of
conditions for a series of schedulers/predictors, such as varying load, checkpointability or replicatability
within a given trace. However, it is also important to test techniques under different resource
availability sets and workloads. Multi-exec allows the user to set PredSim to test a given technique
or set of techniques utilizing different resource availability and workload traces; within each
trace, the conditions for that simulation can then be varied. Additionally, each run can be repeated a
configurable number of times to, for example, develop confidence intervals for the results.
Combine Tool
Given a series of simulation runs, the Combine Tool allows the output files to be combined into a
single set of output files. This can be useful for aggregating results and understanding the overall
behavior of techniques in a variety of circumstances and environments. Also, sometimes results can
be inconsistent between runs. Executing many copies of a run and then combining the results can
be helpful in determining the underlying performance trends.
Availability Analysis Tool
Section 3.2 investigated the availability behavior of the Notre Dame Condor pool. This data was
created by the Availability Analysis Tool which inputs a resource availability trace and outputs a
whole range of availability analysis data. It incorporates the ability to analyze both single state and
multi-state availability traces. The analysis data it produces includes resource behavior versus day
of the week and time of day. It also analyzes aggregate resource behavior such as time spent and
transitions to each state over a trace. This tool can be useful for describing and characterizing the
multi-state availability behavior of a set of traced resources.
9.3 PredSim System Design
At its core, PredSim was designed to allow and account for the different types of availability and
unavailability that a resource may exhibit. Section 3.1 discussed these different states of availability
including the user becoming present on the machine, the local load going over a certain threshold
and the machine becoming unavailable unexpectedly (ungraceful failure). PredSim was designed
to incorporate these different types of unavailability while having the functionality to simulate a
variety of application types. PredSim supports both application checkpointing and migration in-
cluding periodic checkpointing and on-demand checkpointing. PredSim also has built in services
for supporting resource availability prediction and prediction-based scheduling and task replication
algorithms.
Figure 9.1 shows the architectural components of PredSim grouped into three categories: Input
Components, User Specified Components and PredSim Components. The following sections discuss
the design of these three component groups.
Figure 9.1: PredSim System Architecture
9.3.1 Input Components
In order to test prediction algorithms and scheduling techniques, both resource availability
information and task information must be available to the simulator. PredSim provides multiple
methods for accepting and/or generating this information.
Resource Input
As noted earlier, PredSim is designed from the ground up to support a multi-state resource
availability model. As such, its primary resource input format is that of Condor-formatted resource
availability traces. By default, the Condor distributed system logs resource availability information
including user presence information, local CPU load measurements and overall resource connectiv-
ity/availability. This information can be extracted from the system via the condor_stats command.
An analysis tool included with PredSim then analyzes the Condor-formatted trace
information and converts it into a format readable by PredSim. The PredSim format consists of
newline separated unix epoch, availability state and CPU load items. The availability state is an
integer from 1 to 5 indicating the availability state according to the multi-state model. In this way,
the input format retains all the information about resource availability, user presence and CPU load.
PredSim also accepts resource availability information stored in the Failure Trace Archive (FTA)
format [35]. The FTA is an online repository of 26 resource availability traces, all stored in a common
format. Each FTA-formatted trace consists of a series of tables output to files contained within a
folder. These text files contain information on the nodes, such as CPU speed and available memory, as well
as availability information. The key input file is event_trace.tab, which contains newline-delimited
event information corresponding to each resource's availability via unix epoch time stamped events.
Unfortunately, the FTA format does not differentiate the user-present state,
so PredSim adjusts by simply ignoring that state.
PredSim stores all input traces in memory via a set of time stamped resource states.
Each resource's availability history is represented as an array of events. Each event contains a unix
time stamp, an integer representing its availability state according to the presented multi-state
availability model and a CPU load float value. FTA traces are converted to this format by parsing
through event_trace.tab and, for each resource, recording periods of availability or unavailability
as dictated by the event information contained in the FTA trace.
Using real world resource availability traces is typically the preferred method for algorithm
testing, but often a researcher will want to design his or her own resource availability scenario
which is not present in any existing resource availability trace. For this reason, PredSim includes
a synthetic resource creation component. This component allows the user to create a set of resources
and is fully configurable by the user. It incorporates the following parameters:
• Quantity: Dictates the total number of resources
• Availability: The user can specify the exact availability characteristics of each machine
individually or by setting the “class” of each resource. The “class” dictates the probability
distributions a resource will follow with regard to its availability behavior, including:

– State Duration: Each state is given a parameter for a probability distribution. This
parameter dictates, once a state is entered, how long the resource will stay in that state.
Distributions include Weibull and Uniform.
– State Frequency: Each state is also given a parameter for a probability distribution
indicating how likely that state is to be entered. This dictates how frequently each state
is entered.
• Performance: The speed and load on each resource is also configurable by the user.
– Resource speed: Speed, specified in terms of Millions of Instructions Per Second
(MIPS), of each resource can be individually specified, or the “class” of the resource
can dictate its speed.
– CPU Load: The local load on the processor is a configurable parameter. It can be set
to follow a distribution or always remain at zero.
Task Input
The other key input component is the set of tasks the simulation system will execute. PredSim
includes two methods for this input component. The first method is accepting workload
traces in the Grid Workload Archive (GWA) format [27]. The GWA is an online repository of 11
workloads, all stored in a common format. Importantly, the GWA format indicates both when jobs
arrive and the length of each job. The length of a job is logged as the number of seconds it took to
execute. This approach has the drawback that the execution time of the job will be dependent on
the speed and local load of the machine on which it executed. For this reason, PredSim assumes that
all the machines executing jobs on the traced system are the same speed. Another key drawback
of the format is that it does not indicate whether jobs are checkpointable or replicatable. To bypass this,
PredSim allows the user to input a traced workload and then apply checkpointable or replicatable
attributes to jobs as the user sees fit. For example, the user could specify that 25% of the input
jobs are checkpointable, which would cause PredSim to randomly assign 25% of the jobs the
attribute “checkpointable”.
As with resource availability traces, sometimes a desired workload trace scenario may not exist.
In this case, PredSim allows a synthetic set of jobs to be created and executed on the system.
PredSim allows the user to adjust the following job attributes:
• Quantity: The total number of jobs injected into the system over the course of the simulation.
• Job Length: The total number of operations required to complete the job can be given as a
range of lengths. The length of each job can be chosen via a probability distribution in that
range. Job length distributions include uniform and Pareto.
• Checkpointability: Determines what percentage of jobs are checkpointable.
• Replicatability: Determines what percentage of jobs are replicatable.
• Day or Night: Whether jobs are injected only during the day or may be injected at any time.
Combining Resource and Task Inputs
Given a resource availability trace and a workload trace, PredSim must match up these two
input components, which may lead to certain difficulties. For example, the traces may be taken at
different times. To adjust for these differences, PredSim aligns the beginning of each trace
in time. Additionally, traces may have different lengths; in these cases, PredSim shortens the
longer of the two traces and throws out the excess. A third issue is slightly more complex. When a
workload is gathered from one system and then applied to another traced system, the two systems may be
different sizes. For this reason, PredSim can dynamically scale the workload to match the number
of resources in the availability trace. It does so by calculating the number of jobs per resource for
the workload; then, given the number of resources in the availability trace, it scales the number
of jobs per resource from the workload by randomly throwing out excess jobs. The last concern is
that of varying load level. Often, a researcher will want to test an approach for a variety of load
levels, but a given workload trace will only specify a narrow range of loads. PredSim can dynamically
adjust the load level of a GWA input trace by only keeping a certain percentage of its jobs. This
enables testing with a given workload type as specified by a workload trace while still facilitating
tests for a large variety of loads.
9.3.2 User Specified Components
PredSim strives to be versatile by giving the user the freedom to swap system and
algorithmic components in and out. PredSim provides the user with a library of built-in techniques for a given
component but also allows new components to be added.
Availability Predictors
PredSim defines an interface for prediction techniques in Table 9.1. The user can define a new
prediction module using this interface or use/modify existing prediction modules. The input interface
specifies the resource to predict, a future interval of time for which the prediction is to be made
and the availability history of the resource. The predictor then outputs four probabilities which
sum to 100%. These probabilities represent the likelihood of the resource entering each respective
unavailability state, or completing the interval in the available state. Single-state predictors need not
return four probabilities but can just specify the Completion and Ungraceful Eviction probabilities.
The simulator comes with a whole range of built-in multi-state and single-state resource
availability predictors and prediction tools. These built-in prediction modules can be used to develop
Input Interface
• Resource: Index of resource for which the prediction is to be made
• Current Time: The current simulation time in unix epoch format
• Prediction Length: The length of the interval to make the prediction for, in number of minutes
• Resource History: An array of event structs containing unix epoch stamped availability state history information for the resource

Output Interface
• Resource: Index of resource for which the prediction was made
• Probability Outputs:
  – Completion: probability that the machine will remain available for the duration of the interval
  – Ungraceful Eviction: probability that the machine will fail ungracefully
  – User Eviction (optional): probability that the user will return to the machine
  – CPU Eviction (optional): probability that the local CPU load on the machine will exceed the threshold
• Likely Availability Duration (optional): The predicted duration that the resource will remain available
• Predicted CPU Load (optional): The predicted average CPU load during the period of availability

Table 9.1: Prediction Interface
new prediction techniques and include:
• Prediction Time Frame: This tool dictates which points in a resource’s history to examine,
such as the previous X hours or the prediction period during the last N
days.

• Availability Analysis Tools: Once the period of resource history to analyze is chosen, the
user can then specify how those periods of history are analyzed. Analysis techniques
include durational examination and transition counting schemes.
• Transition Weighting Schemes: If transition weighting is the chosen form of historical
analysis, the transitions can be weighted according to a variety of factors such as time of day,
day of week and freshness.
The included prediction techniques can be parameterized and combined in various configurations
to provide the user with a wide range of availability predictors (such as those proposed and investi-
gated in Chapter 4 like TRF and TDE). These predictors can also be compared with built-in related
work prediction modules such as Ren, Saturating Counter, History Counter, Sliding Window Single
and Sliding Window Multi (discussed in detail in Chapter 4).
Input Interface
• Job:
  – Job number: Unique job identifier
  – Injection time: Unix time stamp of the time the job was injected into the system
  – Job length: Number of operations required to complete the job
  – Checkpointability: Boolean value indicating if the job is checkpointable
  – Replicatability: Boolean value indicating if the job is replicatable
• Resource Information Array:
  – Availability: The current availability state of the resource
  – MIPS: The speed of the resource in Millions of Instructions Per Second
  – CPU Load: The current local load on the resource’s processor

Output Interface
• Resource: Index of resource chosen to execute the job; -1 indicates the job could not be scheduled.

Table 9.2: Scheduling Interface
Scheduling Algorithms
The user must address how resources are chosen for the tasks injected into the system.
PredSim’s scheduling interface is defined in Table 9.2. The input to a scheduler is a job struct
containing information about the job, including its unique identifier, injection time, the number of
operations to complete the job and its ability to be checkpointed or replicated. A scheduler can
then select from the provided resource array based on whatever criteria it chooses
and return the selected resource via an index value.
PredSim accommodates new scheduling modules but also has a set of built-in modules, including:
• Random: Randomly choose an available resource.
• CPU-Speed: Choose the resource with the fastest CPU and lowest local CPU load.
• Pseudo-optimal: With omnipotent future knowledge, choose the resource which will com-
plete the task the earliest without resource unavailability.
• Resource Score Based Schedulers: Score all resources according to some expression, then
choose the resource with the highest score. An existing scoring method can be chosen or a
new one can be implemented.
• Prediction-Based Schedulers: Any prediction module implemented in the system can be
used by the scheduler for choosing which resource to allocate the task.
– Performance/Reliability Tradeoff Schedulers: These prediction-based schedulers
can be set on a sliding scale to determine what emphasis they place on performance,
reliability or both. Pre-defined schedulers set to known tradeoff values (e.g. Comp,
Perf, etc.) are also included.

Input Interface
• Job:
  – Job length: Number of operations required to complete the job
  – Checkpointability: Boolean value indicating if the job is checkpointable
  – Replicatability: Boolean value indicating if the job is replicatable
• Completion Probability: Predicted probability that the machine executing the job will remain available for the duration of the job
• Scheduled Resource Information:
  – MIPS: The speed of the resource in Millions of Instructions Per Second
  – CPU Load: The current local load on the resource’s processor
• System Load: Current system load measured as the number of jobs per day per available resource

Table 9.3: Replication Interface
– Prediction Length Algorithms: The length of the prediction to be made for a job of
length X can be modified by any number of included expressions.
• Ren MTTF scheduler: Ren et al. Mean Time Till Failure scheduler [55].
PredSim’s Scheduling components can be chosen by the user from the included library of schedul-
ing algorithms, created from included scheduling algorithm tools or added by the user.
Task Replication Algorithms
PredSim also incorporates functionality to support task replication. The replication algorithm
not only chooses which jobs are to be replicated, but also how many replicas of each job are to be
created. Currently, PredSim’s replication functionality is built into the job scheduling infrastructure.
When a job is scheduled, the replication component can then choose to create any number of replicas
of that job or none at all. Table 9.3 lists the input interface to the replication component. The choice
to replicate or not can be based on the job’s characteristics, the predicted probability of the job
completing without failure, the performance characteristics of the resource chosen to execute the
job or based on the current system load. The replication policy has no output interface because it
directly queues any replicas it wishes to create and has no return values.
These factors can be combined in numerous ways by the user or pre-built replication policies can
be used. The replication algorithm components include:
• Static Techniques: Replication policy is determined by some static characteristic:
– Fixed Percentage: Replicate a fixed percentage of jobs.
– Ckpt/Non: Replicate jobs based on their checkpointability (e.g. replicate all non-
checkpointable jobs).
– Static Multiple: Create a pre-defined number of replicas for each job (e.g. 3 replicas
per job).
• Prediction-based Techniques: Replicate jobs based on their host resource’s reliability:
– Replication Score Threshold (RST): Replicate jobs based on their predicted chance
of completion.
– RST and Job Characteristics: Replicate jobs based on their predicted chance of
completion and their characteristics, such as checkpointability and length.
• WQR and WQR-FT: WorkQueue with Replication and WorkQueue with Replication-Fault
Tolerant [70].
• Static Load Adaptive: Prediction-based Replication at set failure thresholds for each load
level.
• Sliding Replication: Prediction-based Load Adaptive Replication on a sliding scale (see
Chapter 7).
Tools and statistics are also provided to analyze the effectiveness and replication efficiency of each
replication policy. Additionally, load measurement services are provided for use with load adaptive
replication approaches.
Job Queuing Strategies
When a job is injected into the PredSim system, it is placed in a queue. Jobs are then taken from
the head of the queue and scheduled onto available resources. The order in which the jobs are
queued and scheduled can have a significant effect on performance. The user can specify a queuing
component based on the interface shown in Table 9.4. The queuing strategy is provided with the
job to be queued which contains meta-information about the job for scheduling purposes. A pointer
to the head of the job queue is also provided. No output interface is specified as it is assumed that
the job will be queued for execution.
PredSim includes the following queuing strategy components:
• Rear: New jobs always go to the rear of the queue.
Input Interface
• Job:
  – Job number: Unique job identifier
  – Injection time: Unix time stamp of the time the job was injected into the system
  – Job length: Number of operations required to complete the job
  – Checkpointability: Boolean value indicating if the job is checkpointable
  – Replicatability: Boolean value indicating if the job is replicatable
• Job Queue Pointer: A pointer to the head of the job queue linked list

Table 9.4: Job Queuing Interface
• Front: Jobs are placed at the front of the queue.
• Min-Min: The queue is arranged from shortest to longest job.
• Max-Max: The queue is arranged from longest to shortest job.
• Min-Min Ckpt: Checkpointable jobs are scheduled before non-checkpointable jobs with
shorter jobs being scheduled before longer jobs in each sub-queue.
• Oldest: Oldest jobs (according to initial injection time) are placed at the head of the queue.
• Newest: Youngest jobs (according to initial injection time) are placed at the head of the
queue.
These queuing strategies may be used or new queuing components can be added by the user.
The system can also be outfitted to schedule in batch mode instead of on-demand queue mode. In
batch mode, jobs are scheduled at given intervals (e.g. every 15 minutes) instead of when they are
first injected.
Job Killing Strategies
Load on a system can vary over time. When this occurs, decisions made
under previous conditions may no longer be relevant. This is particularly true for
replication decisions. As discussed in Chapters 6 and 7, the load on the system greatly affects the
performance of the current replication strategy. Particularly when load increases, if many replicas
were created under the lower-load conditions, there may be an inadequate number of available
resources left for executing incoming tasks. In these situations, it becomes beneficial to “kill”
previously created replicas to free up resources to execute new tasks. PredSim includes a replica killing
component to facilitate this adaptive approach.
9.3.3 PredSim Components
PredSim takes both the input components and the User Specified components and executes the sim-
ulation utilizing its core components. This section summarizes the functionality of each of PredSim’s
core components.
Resource Monitoring
The resource monitoring service keeps track of all the resources in the system, not only gathering
statistics about their uptime, availability and job execution performance, but also providing resource
information to other components. Resource monitoring entails keeping track of which resources are
available or unavailable and logging that information. When resources become unavailable, the job
management service must be notified so that the proper actions can be taken such as checkpointing,
migration or re-queuing. The service also provides information to predictors regarding resource
availability history for prediction analysis. Lastly, the service provides information to the scheduling
component notifying it which resources are available for job execution.
Job Management
The Job Management component handles jobs by first injecting them into the system according to
the workload trace component specified by the user. When a job is injected, it is first placed in a
queue according to the current user-supplied queuing component. Periodically these jobs are
scheduled from the queue according to the user-supplied scheduling component. The job management
component also monitors job progress by calculating how many operations have been completed
by each executing job per simulator tick while also handling resource unavailability. When a re-
source executing a job becomes unavailable, the job management component enables a checkpoint
to be taken (if the job is checkpointable and the transition to unavailability was graceful) and then
places that job back into the queue for migration/rescheduling. Completed jobs are placed into a
completed job queue, execution statistics are logged and those resources are freed for further task
execution. When the simulation ends, those jobs still executing or in the queue are removed by the
job management system and their statistics are logged.
Information Services
Within the simulator, the Information Services component provides low-level data to other components within the
system by logging various aspects of system execution. Before jobs are injected into the system, they
are stored here and then passed to the Job Management system at the proper time. Also, resource
availability trace events are stored until they are presented to the Resource Monitoring component.
Additionally, techniques which use omnipotent future knowledge (e.g. the pseudo-optimal scheduler)
can gather future resource information from this component. Information Services also tracks high-level
simulation information, such as the current simulation time, current simulation run and current
scheduler/predictor, which is made available to other components as needed.
Statistics
It is important to provide users with feedback about the operation and performance of the system.
This information can be used to compare techniques’ effectiveness and develop new techniques.
PredSim logs a significant amount of information throughout each run cataloging system events.
This simulation feedback is then made available to the user in summary form on the terminal as
well as in a series of statistical output files. PredSim logs high level information on graceful and
ungraceful evictions, completed jobs, replicas created, operations lost and checkpoints taken. More
detailed information is also logged and output to files including scheduling results versus day, hour
and job length as well as individual machine statistics for evictions, job completions and uptime.
Scheduling Framework
The Scheduling Framework makes all resource allocation and deallocation decisions. It provides the Job Management component with the set of Scheduling Components and Replication Components used to decide which resources receive each job and which jobs are replicated.
9.4 PredSim Performance
This section investigates PredSim's performance. Because it is critically important that PredSim deliver timely results, this analysis focuses on the factors that influence simulation completion time. Execution time is most affected by factors that scale up the simulation size; when comparing scheduling techniques against each other over sets of jobs, the two main scaling factors are the number of schedulers being compared and the number of jobs being scheduled. This section reports execution times with PredSim running on a dual-processor machine with 2.66 GHz dual-core Xeon 5150 processors and 8 GB of RAM, running Debian 6.0.1.
Figure 9.2 investigates the effect of varying the number of jobs executed in the simulation on simulation execution time.

[Figure 9.2: PredSim execution time versus the number of jobs executed. Two subgraphs, "Synthetic Workload" and "Grid5000 GWA Workload," plot execution time in minutes (0–50) against the number of jobs (0–4 x 10^4) for the Condor and Seti traces.]

The simulations execute a single scheduler: the PPS scheduler using
the TRF predictor discussed in Section 5.2. The number of jobs varies along the X-axis, and two availability traces drive the simulations: the Condor trace and the FTA-formatted Seti trace. The left subgraph, driven by a synthetic workload, shows that increasing the job load on the system leads to a linear increase in execution time; the Seti trace takes significantly longer at all load levels. The right subgraph plots the same experimental setup executing the Grid5000 GWA workload, which completes significantly faster than the synthetic workload because the Grid5000 trace contains shorter jobs. In both scenarios, even under extremely high load, PredSim's execution time does not exceed 45 minutes.
Figure 9.3 investigates how simulation execution time is affected by the number of schedulers being compared against each other. The simulator executes 1000 synthetic jobs using both the Condor and the FTA-formatted Seti traces. PredSim's execution time increases linearly with the number of schedulers under both availability traces, demonstrating good scalability; execution time does not exceed 22 minutes even when 15 separate scheduling algorithms are compared.
[Figure 9.3: PredSim execution time versus the number of schedulers investigated. Execution time in minutes (0–25) plotted against the number of schedulers (0–15) with the synthetic workload, for the Condor and Seti traces.]
Chapter 10
Conclusions
10.1 Summary
The volatility of resources can have serious performance consequences for executing applications. To
categorize this volatility, Section 3.1 introduces four resource availability states, which capture why
and how (not just when) resources become unavailable over time. This approach suggests predictors
and schedulers that can distinguish between graceful and ungraceful transitions to unavailability.
Section 3.2 describes four months of data from a university Condor pool, examines how its resources behave in terms of these states over time, and shows how resources can be classified according to trends in that behavior. Section 3.2.3 demonstrates by simulation that mapping jobs of different lengths onto resources in these different categories has a significant effect on whether the jobs run to completion.
Chapter 4 introduces four different methods for predicting the future multi-state availability of
a resource. These algorithms combine two methods of analyzing a resource’s behavior (Duration in
each state, and the number of Transitions to that state) with two methods to select which periods
of a resource’s history to analyze (recent Days at the same time of day, and Recent hours). Five
different weighting techniques further enhance the predictor’s accuracy: Recentness, Time of Day,
Day of Week, Recentness & Day of Week, and All. Section 4.2 analyzes the proposed prediction methods to determine the most effective algorithm, and then tests the most successful predictors against several predictors from related work. The results demonstrate that the TDE predictor is the most accurate of the proposed algorithms, and is 4.6% more accurate than the best-performing related-work predictor, the Ren predictor [54].
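As a rough illustration of the transition-counting flavor of these predictors, the probability of entering each state can be estimated from the frequency of transitions observed over a window of recent history. This is a simplified sketch under that assumption, not the dissertation's exact TDE/TRF algorithms, and it omits the weighting techniques entirely:

```python
from collections import Counter

def predict_next_state(history, window):
    """Estimate next-state probabilities for a resource by counting
    transitions into each state over the most recent `window` samples.
    `history` is a chronological list of availability-state labels."""
    recent = history[-window:]
    transitions = Counter(
        new for old, new in zip(recent, recent[1:]) if new != old
    )
    total = sum(transitions.values())
    if total == 0:               # no transitions observed in the window:
        return {recent[-1]: 1.0}  # predict the resource stays in its state
    return {state: n / total for state, n in transitions.items()}
```

The full predictors additionally weight history by recentness, time of day, and day of week, and can analyze state durations rather than transition counts.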
These availability predictors allow schedulers to consider resource availability when allocating resources. Chapter 5 shows that prediction-based schedulers that are aware of the various types of unavailability can exploit this information to make better job placement decisions. The TRF-Comp scheduler outperforms traditional schedulers and other schedulers that trade off reliability for performance, including the only other scheduler based on multi-state availability prediction; the results show reduced makespan and fewer evictions across a variety of scenarios.
However, even with reliability-aware scheduling, resource unavailability remains inevitable. Checkpointing and replication are the two primary tools Grids can use to mitigate its effects. Schedulers can replicate jobs to reduce the impact of resource unavailability and ultimately reduce makespan. However, replication consumes additional Grid cycles, which carry a direct cost within a Grid economy and an indirect cost in tying up attractive resources, especially under relatively high load. Chapter 6 describes techniques that use availability predictions to inform a scheduler's decision about whether to replicate a job. Availability predictors forecast resource behavior, allowing schedulers to consider resource reliability when making replication decisions. Section 6.3 introduces the notion of "replication efficiency," which combines the benefit and cost of replication into a single metric, and studies a variety of replication policies under various loads, using simulations based on two representative Grid resource traces to demonstrate the improved performance that the TRF predictor and the proposed replication techniques produce. A strategy that considers process checkpointability, job length, predicted resource reliability, and current system load makes the best use of additional replicated operations to reduce job makespan efficiently.
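The flavor of such an efficiency metric can be illustrated as makespan improvement bought per extra replicated operation. This is only a hedged sketch of the idea; Section 6.3 gives the study's actual definition, which may differ:

```python
def replication_efficiency(base_makespan, repl_makespan, extra_ops):
    """Illustrative benefit/cost ratio for replication: how much makespan
    reduction each additional replicated operation buys. A higher value
    means replication spent its extra cycles more productively."""
    if extra_ops == 0:
        return 0.0  # no replicas executed, so no cost and no ratio to report
    return (base_makespan - repl_makespan) / extra_ops
```

A metric of this shape lets policies that replicate aggressively be compared fairly against conservative ones: a policy that shaves little makespan while burning many extra operations scores poorly even if its absolute makespan is lower.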
Replication decisions are also affected by the current load on the system. Section 7.3 proposes Sliding Replication (SR), a load-adaptive replication technique that dynamically chooses an appropriate prediction-based replication strategy given the load on the system. Section 6.3.3 compares the proposed technique's performance to previously proposed prediction-based and non-prediction-based load-adaptive replication strategies, and shows that SR produces comparable makespan improvement while significantly reducing the number of extra operations required. These replication techniques can improve the Grid user experience not only by reducing the time it takes tasks to complete but also, through intelligent job replication, by significantly reducing the real-world cost incurred by executing replicas on the Grid.
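A load-adaptive policy of this kind might be sketched as follows. The thresholds and policy names here are purely illustrative assumptions; the actual Sliding Replication strategy and its parameters are defined in Section 7.3:

```python
def choose_replication_policy(load):
    """Slide from aggressive to conservative replication as system load
    rises (load is the fraction of resources busy, 0.0-1.0). Thresholds
    are hypothetical, not the dissertation's tuned values."""
    if load < 0.3:
        return "replicate-all"         # idle cycles are cheap: replicate freely
    elif load < 0.7:
        return "replicate-unreliable"  # replicate only jobs on risky resources
    else:
        return "no-replication"        # resources are scarce: do not replicate
```

The design intuition is that replication's cost is opportunity cost: under light load, extra copies occupy cycles that would otherwise be wasted, while under heavy load the same copies displace queued jobs.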
Chapter 8 extends these fault-tolerant computing paradigms to an unconventional distributed computing environment. Section 8.3 characterizes the availability of multi-function devices (MFDs) and demonstrates their applicability as Grid computation devices for non-data-intensive tasks. This includes developing an MFD-specific multi-state availability model, characterizing when the various types of jobs arrive, proposing a generic metric to describe their burstiness, and scheduling around that burstiness. Section 8.6 proposes a computationally efficient method for online calculation and scheduling. These burstiness-aware scheduling strategies significantly improve makespan and reduce the percentage of Grid job preemptions.
Chapter 9 introduces and describes PredSim, the experimental testbed used in this study. PredSim is an event-driven task scheduling simulation platform designed specifically for multi-state prediction, scheduling, and replication. It accepts real-world resource availability and workload traces as input, includes modes for testing prediction accuracy and task scheduling algorithms, and provides detailed statistical feedback to the user.
10.2 Future Work
This section lists future research directions grouped by category.
1. Prediction
(a) Investigate uses for multi-state availability prediction beyond job placement and replication decisions (e.g., data placement, and the frequency and timing of checkpointing).
(b) Predict time until resource unavailability (not just completion probability during a given
interval).
(c) Predict time until recovery (useful, for example, when choosing whether to suspend or migrate a job).
(d) Use patterns in uptime behavior instead of just counting transitions.
(e) Predict rare events such as machine hardware failure differently than user presence and
local CPU load, perhaps by using a separate prediction method.
(f) Account for reasons an application may have failed previously, such as recurring issues (for example, insufficient system RAM for the application to complete).
(g) Use a mixture-of-experts prediction approach, in which a pool of predictors makes predictions for each resource and the predictions from the predictor that has performed best for that particular resource are used for scheduling.
2. Scheduling
(a) Determine the criteria that make a predictor useful for scheduling. Results indicate that overall accuracy is not necessarily the best indicator of a predictor's usefulness in scheduling.
(b) Investigate prediction-based scheduling techniques that do not require the expected runtime of the application.
(c) Actively migrate executing applications that are expected to experience resource unavailability to other resources as those resources become available.
(d) Investigate waiting to schedule jobs until “better” resources become available instead of
scheduling them immediately.
(e) Investigate scheduling sets of jobs with a variety of priorities. Higher priority jobs could
be assigned to faster resources that experience less resource unavailability.
(f) Develop a non-greedy scheduling approach that attempts to match the available resource
pool to the current job pool using batch queuing.
(g) Schedule I/O-bound jobs instead of compute-bound jobs (considering data placement and availability, etc.).
3. Replication
(a) Investigate prediction-based replication policies that can create more than a single replica for each job.
(b) Investigate replication with different underlying scheduling policies; currently, schedulers place jobs after considering only resource speed and load, not, for example, reliability.
(c) Investigate replicating jobs after they have begun running, upon eviction, or when a new
resource becomes available.
4. Trace
(a) Gather new traces from other sources and test the ideas proposed in this study in those environments.
(b) Develop synthetic availability trace models whose volatility can be varied, and investigate how the previously proposed techniques behave as the volatility and heterogeneity of the synthetic resources change over time.
5. Meta-scheduling
(a) Investigate policies and techniques that could support availability prediction in a meta-scheduling environment, such as several clusters managed by a meta-scheduler.
6. Real-world Implementation
(a) Implement the prediction, scheduling and replication techniques proposed in this study
in a real distributed or Grid system and investigate their performance.
Bibliography
[1] Abu-Ghazaleh, N., and Lewis, M. Toward self organizing grids. In International Conference on High Performance Distributed Computing Hot Topics Session (2006), pp. 324–327.
[2] Acharya, A., Edjlali, G., and Saltz, J. The utility of exploiting idle workstations for parallel computation. In SIGMETRICS (1997), pp. 225–234.
[3] Amin, A., Ammar, R., and Gokhale, S. An efficient method to schedule tandem of real-time tasks in cluster computing with possible processor failures. In Symposium on Computers and Communications (2003), p. 1207.
[4] Anderson, D. BOINC: A system for public-resource computing and storage. In IEEE/ACM Workshop on Grid Computing (2004), pp. 4–10.
[5] Anderson, D., and Fedak, G. The computational and storage potential of volunteer computing. In International Symposium on Cluster Computing and the Grid (2006), pp. 73–80.
[6] Androutsellis-Theotokis, S., and Spinellis, D. A survey of peer-to-peer content distribution technologies. ACM Computing Surveys 36, 4 (2004), 335–371.
[7] Anglano, C., and Canonico, M. Fault-tolerant scheduling for bag-of-tasks grid applications. In Advances in Grid Computing – EGC 2005 (2005), pp. 630–639.
[8] Arpaci, R., Dusseau, A., Vahdat, A., Liu, L., Anderson, T., and Patterson, D. The interaction of parallel and sequential workloads on a network of workstations. In International Conference on Measurement and Modeling of Computer Systems (1995), pp. 267–278.
[9] Bolosky, W., Douceur, J., Ely, D., and Theimer, M. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In SIGMETRICS (2000), pp. 34–43.
[10] Borthakur, D. The Hadoop Distributed File System: Architecture and Design. The Apache Software Foundation, 2007.
[11] Braun, T., Siegel, H., and Beck, N. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. Journal of Parallel and Distributed Computing 61, 6 (2001), 810–837.
[12] Buyya, R., and Murshed, M. GridSim: A toolkit for the modelling and simulation of distributed resource management and scheduling for grid computing. The Journal of Concurrency and Computation: Practice and Experience (CCPE) (2002).
[13] Cardinale, Y., and Casanova, H. An evaluation of job scheduling strategies for divisible loads on grid platforms. In High Performance Computing and Simulation Conference (2006), pp. 705–712.
[14] Casanova, H. SimGrid: A toolkit for the simulation of application scheduling. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2001) (2001).
[15] Casanova, H., Zagorodnov, D., Berman, F., and Legrand, A. Heuristics for scheduling parameter sweep applications in grid environments. In HCW '00: Proceedings of the 9th Heterogeneous Computing Workshop (Washington, DC, USA, 2000), IEEE Computer Society, p. 349.
[16] Chun, B., and Vahdat, A. Workload and failure characterization on a large-scale federated testbed. Tech. Rep. IRB-TR-03-040, Intel Research Berkeley, 2003.
[17] Dail, H., Casanova, H., and Berman, F. A decoupled scheduling approach for grid application development environments. Journal of Parallel and Distributed Computing 63, 5 (2003), 505–524.
[18] Dinda, P., and O'Hallaron, D. An extensive toolkit for resource prediction in distributed systems. Tech. Rep. CMU-CS-99-138, Carnegie Mellon University, 1999.
[19] Dogan, A., and Ozguner, F. Biobjective scheduling algorithms for execution time–reliability trade-off in heterogeneous computing systems. The Computer Journal 48, 3 (2005), 300–314.
[20] Erdil, D. C., and Lewis, M. J. Grid resource scheduling with gossiping protocols. In International Conference on Peer to Peer Computing (2007), pp. 193–202.
[21] Enabling Grids for E-sciencE (EGEE). http://public.eu-egee.org/, 2008.
[22] Foster, I., and Iamnitchi, A. On death, taxes, and the convergence of peer-to-peer and grid computing. In International Workshop on Peer-To-Peer Systems (2003).
[23] Foster, I., Kesselman, C., and Tuecke, S. The anatomy of the grid: Enabling scalable virtual organizations. International Journal of Supercomputer Applications 15 (2001).
[24] Frey, J., Tannenbaum, T., Livny, M., Foster, I., and Tuecke, S. Condor-G: A computation management agent for multi-institutional grids. In International Conference on High Performance Distributed Computing (2001), pp. 55–63.
[25] Fujimoto, N., and Hagihara, K. A comparison among grid scheduling algorithms for independent coarse-grained tasks. In International Symposium on Applications and the Internet (2004), IEEE Computer Society Press, pp. 674–680.
[26] Open Science Grid. http://www.opensciencegrid.org/, 2008.
[27] Iosup, A., Li, H., Jan, M., Anoep, S., Dumitrescu, C., Wolters, L., and Epema, D. H. J. The grid workloads archive. Future Gener. Comput. Syst. 24, 7 (2008), 672–686.
[28] Jarry, A., and Casanova, H. DAGSim: A simulator for DAG scheduling algorithms. In Laboratoire de l'Informatique du Parallélisme (2000).
[29] Javadi, B., Kondo, D., Vincent, J., and Anderson, D. Mining for statistical availability models in large-scale distributed systems: An empirical study of SETI@home. In 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) (September 2009).
[30] Kang, W., and Grimshaw, A. S. Failure prediction in computational grids. In Simulation Symposium (2007), pp. 275–282.
[31] Kartik, S., and Murthy, C. Task allocation algorithms for maximizing reliability of distributed computing systems. IEEE Transactions on Computers 41, 9 (1992), 1156–1168.
[32] Kondo, D., Anderson, D., and McLeod, J. Performance evaluation of scheduling policies for volunteer computing. In International Conference on e-Science (2007), pp. 415–422.
[33] Kondo, D., Chien, A., and Casanova, H. Resource management for rapid application turnaround on enterprise desktop grids. In International Conference on High Performance Computing (2004), p. 17.
[34] Kondo, D., Fedak, G., Cappello, F., Chien, A. A., and Casanova, H. Resource availability in enterprise desktop grids. Tech. Rep. 00000994, INRIA, 2006.
[35] Kondo, D., Javadi, B., Iosup, A., and Epema, D. The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. In IEEE International Symposium on Cluster Computing and the Grid (2010), pp. 398–407.
[36] Kondo, D., Taufer, M., Brooks III, C., Casanova, H., and Chien, A. Characterizing and evaluating desktop grids: An empirical study. In IEEE International Parallel and Distributed Processing Symposium (2004), p. 26b.
[37] Lamehamedi, H., Szymanski, B., and Shentu, Z. Data replication strategies in grid environments. In Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP '02) (2002), pp. 378–383.
[38] Levis, P., Lee, N., Welsh, M., and Culler, D. TOSSIM: Accurate and scalable simulation of entire TinyOS applications. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems (SenSys '03) (New York, NY, USA, 2003), ACM, pp. 126–137.
[39] Lewis, M., and Grimshaw, A. The core Legion object model. In International Conference on High Performance Distributed Computing (1996), pp. 551–561.
[40] Li, Y., and Mascagni, M. Improving performance via computational replication on a large-scale computational grid. In CCGRID '03: Proceedings of the 3rd International Symposium on Cluster Computing and the Grid (Washington, DC, USA, 2003), IEEE Computer Society, p. 442.
[41] Litke, A., Skoutas, D., Tserpes, K., and Varvarigou, T. Efficient task replication and management for adaptive fault tolerance in mobile grid environments. Future Gener. Comput. Syst. 23, 2 (2007), 163–178.
[42] Litzkow, M., Livny, M., and Mutka, M. Condor – a hunter of idle workstations. In International Conference on Distributed Computing Systems (1988), pp. 104–111.
[43] Menasce, D. A., Saha, D., da Silva Porto, S. C., Almeida, V. A. F., and Tripathi, S. K. Static and dynamic processor scheduling disciplines in heterogeneous parallel architectures. J. Parallel Distrib. Comput. 28, 1 (1995), 1–18.
[44] Mickens, J., and Noble, B. Predicting node availability in peer-to-peer networks. In International Conference on Measurement and Modeling of Computer Systems (2005).
[45] Mickens, J., and Noble, B. Exploiting availability prediction in distributed systems. In Networked Systems Design and Implementation (2006), pp. 73–86.
[46] Mickens, J., and Noble, B. Improving distributed system performance using machine availability prediction. In International Conference on Measurement and Modeling of Computer Systems Performance Evaluation Review (2006).
[47] Nurmi, D., Brevik, J., and Wolski, R. Modeling machine availability in enterprise and wide-area distributed computing environments. In Euro-Par (2005), pp. 432–441.
[48] Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L., and Zagorodnov, D. The Eucalyptus open-source cloud-computing system. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID '09) (Washington, DC, USA, 2009), IEEE Computer Society, pp. 124–131.
[49] PlanetLab: An open platform for developing, deploying, and accessing planetary-scale services. http://www.planet-lab.org/, 2008.
[50] Pietrobon, V., and Orlando, S. Performance fault prediction models. Tech. Rep. CS-2004-3, University of Venice, 2004.
[51] Qin, X., Jiang, H., Xie, C., and Han, Z. Reliability-driven scheduling for real-time tasks with precedence constraints in heterogeneous distributed systems. In International Conference on Parallel and Distributed Computing (2000), pp. 617–623.
[52] Ramakrishnan, L., and Reed, D. A. Performability modeling for scheduling and fault tolerance strategies for scientific workflows. In HPDC '08: Proceedings of the 17th International Symposium on High Performance Distributed Computing (New York, NY, USA, 2008), ACM, pp. 23–34.
[53] Ranganathan, K., and Foster, I. Identifying dynamic replication strategies for a high performance data grid. In Proc. of the International Grid Computing Workshop (2001), pp. 75–86.
[54] Ren, X., and Eigenmann, R. Empirical studies on the behavior of resource availability in fine-grained cycle sharing systems. In International Conference on Parallel Processing (2006), pp. 3–11.
[55] Ren, X., Lee, S., Eigenmann, R., and Bagchi, S. Resource failure prediction in fine-grained cycle sharing systems. In International Conference on High Performance Distributed Computing (2006).
[56] Ren, X., Lee, S., Eigenmann, R., and Bagchi, S. Prediction of resource availability in fine-grained cycle sharing systems: Empirical evaluation. Journal of Grid Computing 5, 2 (2007), 173–195.
[57] Rood, B., Gnanasambandam, N., Lewis, M., and Sharma, N. Toward high performance computing in unconventional computing environments. In Workshop on Challenges of Large Applications in Distributed Environments (in conjunction with HPDC 2010) (2010).
[58] Rood, B., and Lewis, M. Scheduling on the grid via multi-state resource availability prediction. In International Conference on Grid Computing (2008).
[59] Rood, B., and Lewis, M. Multi-state grid resource availability characterization. In International Conference on Grid Computing (2007), pp. 42–49.
[60] Rood, B., and Lewis, M. Resource availability prediction for improved grid scheduling. In Workshop on Advances in High-Performance E-Science Middleware and Applications (in conjunction with e-Science 2008) (2008).
[61] Rood, B., and Lewis, M. Grid resource availability prediction-based scheduling and task replication. Journal of Grid Computing (2009).
[62] Rood, B., and Lewis, M. Availability prediction based scheduling strategies for grid environments. In CCGrid 2010: The 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (2010).
[63] Rosenblum, M., Herrod, S. A., Witchel, E., and Gupta, A. Complete computer system simulation: The SimOS approach. IEEE Parallel and Distributed Technology (1995).
[64] Ryu, K., and Hollingsworth, J. Unobtrusiveness and efficiency in idle cycle stealing for PC grids. In IEEE International Parallel and Distributed Processing Symposium (2004), p. 62a.
[65] Sahoo, R., Oliner, A., Rish, I., Gupta, M., Moreira, J., Ma, S., Vilalta, R., and Sivasubramaniam, A. Critical event prediction for proactive management in large-scale computer clusters. In Special Interest Group on Knowledge Discovery and Data Mining (2003), pp. 426–435.
[66] Sahoo, R. K., and Squillante, M. S. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of the 2004 International Conference on Dependable Systems and Networks (2004), pp. 772–781.
[67] Santos-Neto, E., Cirne, W., Brasileiro, F., and Lima, R. Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids. In Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing (2004), pp. 210–232.
[68] Schroeder, B., and Gibson, G. A. A large-scale study of failures in high-performance computing systems. In Proc. 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2006) (2006), pp. 249–258.
[69] Schroeder, B., and Gibson, G. A. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, 2007.
[70] Silva, D. P. D., Cirne, W., and Brasileiro, F. V. Trading cycles for information: Using replication to schedule bag-of-tasks applications on computational grids. In Proc. of Euro-Par 2003 (2003), pp. 169–180.
[71] Song, H. J., Liu, X., Jakobsen, D., Bhagwan, R., Zhang, X., Taura, K., and Chien, A. The MicroGrid: A scientific tool for modelling computational grids. In IEEE Supercomputing (SC2000) (2000).
[72] Srinivasan, S., and Jha, N. Safety and reliability-driven task allocation in distributed systems. In International Conference on Parallel and Distributed Systems (1999), pp. 238–251.
[73] Takefusa, A., Matsuoka, S., and Nakada, H. Overview of a performance evaluation system for global computing scheduling algorithms. In 8th International Symposium on High Performance Distributed Computing (HPDC8) (1999).
[74] TeraGrid. http://www.teragrid.org, 2008.
[75] University of Wisconsin-Madison. Condor Version 7.0.4 Manual, 2008.
[76] Vilalta, R., and Ma, S. Predicting rare events in temporal domains. In International Conference on Data Mining (2002), p. 474.
[77] Wang, L., Tao, J., Kunze, M., Castellanos, A. C., Kramer, D., and Karl, W. Scientific cloud computing: Early definition and experience. In 2008 10th IEEE International Conference on High Performance Computing and Communications (2008), pp. 825–830.
[78] Weiss, G., and Hirsh, H. Learning to predict rare events in categorical time-series data. In International Conference on Machine Learning (1998), pp. 83–90.
[79] Weissman, J. B. Fault tolerant computing on the grid: What are my options? Tech. Rep., University of Texas at San Antonio, 1998.
[80] Wolski, R., Spring, N., and Hayes, J. The network weather service: A distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems 15 (1999), 757–768.