

Learning Based Admission Control and Task Assignment in MapReduce

Thesis submitted in partial fulfillment of the requirements for the degree of

MS by Research
in

Computer Science and Engineering

by

Jaideep Datta Dhok
200707007

[email protected]

Search and Information Extraction Lab
International Institute of Information Technology

Hyderabad - 500 032, INDIA
June 2010


Copyright © Jaideep Dhok, 2010

All Rights Reserved


International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Learning Based Admission Control and Task Assignment in MapReduce” by Jaideep Dhok, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date Adviser: Dr. Vasudeva Varma


To my parents


Acknowledgments

I would like to thank my advisor Dr. Vasudeva Varma for his continuous and inspiring guidance during the course of my thesis, without which this thesis would not have been possible. His constant support motivated me to pursue research in MapReduce, and his insights into machine learning and its applications proved to be of tremendous help. I also owe gratitude to Dr. Prasad Pingali, whose technical guidance during the initial phases of my thesis was invaluable in shaping it. He introduced me to MapReduce and suggested that I look into the internals of MapReduce for research problems. I also thank my parents for their continuous encouragement during the course of my masters; their insistence on pursuing a research degree kept me motivated during the research work. Finally, I thank all of my friends, Akshat, Gururaj, Nihar, Nitesh, Mihir, Rahul, Sandeep, and Sarat, for their companionship, which made my stay at IIIT-H enjoyable and a lot of fun. I also thank the ever so helpful M. Babji for his tremendous enthusiasm and zeal for helping lab students.


Abstract

The MapReduce paradigm has become a popular way of expressing distributed data processing problems that need to deal with large amounts of data. MapReduce is used by a number of organizations worldwide for diverse tasks such as application log processing, user behavior analysis, processing scientific data, and web crawling and indexing. Many organizations utilize cloud computing services to acquire resources needed to process MapReduce jobs on demand. As a result of the popularity of using MapReduce in the cloud, cloud services such as Amazon Elastic MapReduce have also become available. The Apache Hadoop framework is the leading open source implementation of the MapReduce model. It also provides a distributed file system (HDFS, or the Hadoop Distributed File System). Hadoop is the most popular MapReduce framework, and is being used by more than 75 organizations worldwide. It is designed to be scalable and robust, and runs on everything from small single-node clusters to very large clusters containing several thousand compute and storage nodes.

Despite the popularity and stability of Hadoop, it presents many opportunities for researching new resource management algorithms. Admission Control, Task Assignment and Scheduling, Data Local Execution, Speculative Execution, and Replica Placement are some of the key challenges involved in resource management in Hadoop. Resource management in Hadoop is complicated by the fact that the resources available are dynamic in nature because of frequent node failures.

In this thesis, we approach two of the above problems: Admission Control and Task Assignment.

Admission Control is the problem of deciding if a new job submission should be accepted for execution on a MapReduce cluster. An admission controller, the module that handles admission control, should make sure that jobs accepted into the cluster do not overload resources. Another important requirement is that the accepted jobs should maximize the utility of the service provider, which in this case is the Hadoop cluster owner.

We propose an admission control algorithm that selects incoming jobs based on the expected utility hypothesis. Following this principle, the admission controller always chooses jobs that are expected to provide maximum utility after their successful completion. To predict whether or not a job will be successful, we observe the effects of past decisions made under similar situations and use this knowledge to predict the outcome of incoming jobs. For this purpose we use a Naive Bayes classifier that labels incoming jobs as potentially successful or potentially unsuccessful. Out of the jobs that are labeled potentially successful, we choose the job that has the maximum value of the utility function at the time of job submission. If none of the incoming jobs are labeled potentially successful, we do not admit any job into the cluster.

Utility functions are supplied by users and express the value earned by users after successful completion of their jobs as a function of the time taken to complete their requests. We consider a job submission to be successful if it does not overload resources on the cluster. Once the effects of the admission are observed, we update the classifier so that this experience is used when making the next decision.

The next problem that we approach is that of Task Assignment. Each MapReduce job is subdivided into a number of map and reduce tasks, which are executed concurrently. Task assignment is the problem of deciding which task should be allocated to a node in the MapReduce cluster. The task assignment algorithm should consider the current state of resources at the concerned node as well as the resource requirements of tasks in the scheduler's queue. Given this information, the task assignment algorithm should choose a task that maximizes resource utilization on the concerned node while still making sure that the node is not overloaded.

We use an approach similar to that of the admission controller for task assignment as well. While choosing a task, we first classify the queued tasks into two categories, good and bad. Good tasks do not overload resources at a node during their execution. From the tasks that are labeled good, we choose the task that has the maximum utility, which in this case is the priority of the task. Again, if none of the tasks are classified as good, we do not assign any task to the concerned node.

Task priority is set by the cluster administrator and can be used for policy enforcement. We thus decouple task assignment from policy enforcement. Once a task is assigned, we observe its effects; if it results in overloading of resources at the concerned node, we conclude that the task assignment decision was incorrect and update the classifier accordingly. This makes sure that the chances of assignments that result in overloading are reduced over time. Overload rules are specified by the system administrator.

We evaluate our implementation (the Learning Scheduler) on a number of different MapReduce jobs. The jobs are chosen to mimic the behavior of real-life use cases for which MapReduce is usually employed. Our results show that the scheduler is able to learn the impact of jobs on node utilization rather quickly, i.e., during the first few runs of a modest-sized job. The scheduler is also able to achieve user-specified levels of utilization on cluster nodes.


Contents

1 Introduction
  1.1 Big Data and MapReduce
    1.1.1 Big Data: The Data Explosion
    1.1.2 MapReduce: Large Scale Data Processing
  1.2 Resource Management in MapReduce
  1.3 Problem definition and scope
    1.3.1 Similarity and interdependency between the two problems
    1.3.2 Admission Control
    1.3.3 Task Assignment
    1.3.4 Guiding principles
  1.4 Organization of the thesis

2 Context: Grid Resource Management
  2.1 Brief Overview of Grid resource management
  2.2 Related Work: Admission Control
    2.2.1 Utility Functions
    2.2.2 Existing Admission Control Algorithms
  2.3 Related Work: Task Assignment
    2.3.1 Independent Task Scheduling
    2.3.2 Computational Intelligence based approaches
    2.3.3 Learning based approaches
  2.4 Existing Hadoop Schedulers
    2.4.1 Native Hadoop Scheduler
    2.4.2 LATE Scheduler
    2.4.3 FAIR Scheduler
    2.4.4 Capacity Scheduler
    2.4.5 Dynamic Priority Scheduler
  2.5 Summary

3 Learning Based Admission Control
  3.1 Service Oriented MapReduce
    3.1.1 The Model
    3.1.2 Usage examples
    3.1.3 Utility Functions: Expressing User Expectations
  3.2 Learning Based Admission Control
    3.2.1 Recap: Hadoop Architecture
    3.2.2 The Algorithm
  3.3 Evaluation and Results
    3.3.1 Simulation Settings
    3.3.2 Algorithm Correctness
    3.3.3 Comparison with Baseline Approaches
    3.3.4 Meeting Deadlines
    3.3.5 Performance with Load Cap
    3.3.6 Job Response Times
    3.3.7 Job Arrival Rates
    3.3.8 Effect of Utility Functions
  3.4 Summary

4 Task Assignment in MapReduce
  4.1 LSCHED: Learning Scheduler for Hadoop
    4.1.1 Feature Variables
    4.1.2 Utility Functions
    4.1.3 Avoiding resource starvation
    4.1.4 Benefit of doubt
    4.1.5 Using a Naive Bayes Classifier
    4.1.6 Separability of feature vector space and classifier convergence
  4.2 Evaluation and Results
    4.2.1 Implementation Details
    4.2.2 Evaluation
      4.2.2.1 Cluster Details
      4.2.2.2 Workload Description
  4.3 Results
    4.3.1 Demonstrating Learning Behavior
    4.3.2 Maintaining Desired Utilization
    4.3.3 Comparing Learning Rates
    4.3.4 Comparison with Hadoop Native Scheduler
  4.4 Summary

5 Conclusions and Future Work
  5.1 Similarity in the two approaches
  5.2 Admission Control
    5.2.1 Future directions
  5.3 Task Assignment
    5.3.1 Future Directions
  5.4 Summary

Bibliography


List of Figures

1.1 Centralized architecture of a MapReduce system. Data storage and computation are co-located on worker nodes.
1.2 A typical MapReduce work flow
2.1 CPU usage patterns of a MapReduce application (wordcount). Mean and variance of the resource usage distributions become recognizable characteristics of a particular MapReduce job.
2.2 Hadoop MapReduce Architecture
3.1 The architecture of MapReduce as a Service. Our model is based on the Hadoop open source MapReduce framework.
3.2 Utility Functions for different values of decay parameters.
3.3 Architecture of MapReduce in Hadoop.
3.4 Admission Controller.
3.5 Simulation Parameters
3.6 Achieved and expected load ratio
3.7 Comparison of Achieved Load Averages
3.8 Performance while meeting user deadlines
3.9 Achieved Load Average with load cap
3.10 Comparing mean job runtimes
3.11 Comparing runtime distribution
3.12 Effect of Job Arrival Rate (λ) on Job Acceptance
3.13 Effect of Utility Function on Job Acceptance
4.1 Task assignment using pattern classification. Evaluation of the last decision and classification for the current decision are done asynchronously.
4.2 Hadoop settings used in evaluation
4.3 Prominent Use Cases for Hadoop (percentages are approximate)
4.4 Resource usage of evaluation jobs as estimated on a scale of 10; a value of 1 indicates minimum usage
4.5 Learning behavior of the scheduler for WordCount job
4.6 Learning behavior of the scheduler for WordCount job
4.7 Achieved utilization for different user requirements
4.8 Classifier accuracy for URLGet and CPUActivity
4.9 Comparison of task assignment by Learning Scheduler and Hadoop's native scheduler


Chapter 1

Introduction

Cloud computing has emerged as one of the most interesting new paradigms in the computing community in recent times. The ability of cloud services to provide an apparently unlimited supply of computing power on demand has caught the attention of industry as well as academia. A 2008 survey by the Internet Data Consortium found that 4% of all enterprises had already implemented some form of cloud computing, and this share is expected to double by 2012 [28]. In the last year alone, more than a dozen academic conferences were organized with cloud computing as an important area of interest.

Cloud computing enables users to use computing as a utility, similar to other basic utilities such as electricity, which are accessed only on demand; users are charged only for the quantities they consume. Users do not have to own computing infrastructure in order to use it: computing infrastructure, platforms, and applications are provided to users as services. The services are on demand, meaning that they are available to users at any time and at any location. Cloud services usually offer users easy-to-use interfaces to control computing infrastructure, obviating the need for investment in trained personnel and expensive management equipment.


1.1 Big Data and MapReduce

1.1.1 Big Data: The Data Explosion

The growth in cloud computing is partly fueled by the explosion in data. Scientific experiments such as the Large Hadron Collider generate terabytes of data every day [14]; the LHC experiments alone are speculated to generate about 27 TB of data daily. Scientists need to uncover results from this deluge of data. The explosion of data is being experienced by every sector of the computing industry today. Big Internet companies such as Google, Amazon, Yahoo!, and Facebook have to deal with huge amounts of user-generated data in the form of blog posts, photographs, status messages, and audio/video files. However, there is also a large quantity of data that is indirectly generated by web sites, in the form of access log files, click-through events, and the like. Analysis of this data can uncover useful patterns about user behavior. Most of this data is generated frequently, and the data sets are stored temporarily for a fixed period and discarded after they have been processed.

According to Google, there are about 281 exabytes of data online on the web today, up from 5 exabytes in 2002. There has been a fifteen-fold increase in user-generated content since 2005 [13]. People are sharing more and more information online, including details of their personal lives and their opinions on people and products. Companies are trying hard to make use of this data in order to infer user preferences, to generate better recommendations for users (for example, the Amazon and IMDb recommendation systems), and simply to get to know their users better. However, analyzing this flood of data has become one of the greatest challenges of recent times, with traditional approaches, including relational databases and data warehouses, failing to match the scale of the data.

1.1.2 MapReduce: Large Scale Data Processing

New tools and frameworks have been developed to process data on distributed data centers. MapReduce [38], the most prominent among such paradigms, has garnered much deserved community attention. It was initially designed by Google to process web scale data on failure-prone commodity hardware. MapReduce allowed programmers to focus on application logic and automatically handled messy details such as failures, application deployment, task duplication, and aggregation of results. The model proved successful within Google, and has today become the de facto method for modeling large scale data processing problems, with Hadoop [3], an open source implementation of the original model, being the most popular framework for developing MapReduce applications. MapReduce is employed in a range of problems, including web crawling and indexing, analysis of bioinformatics data, log processing, and image processing.

The two most important components of Hadoop are MapReduce, which deals with the computational operation to be applied on data, and the Hadoop Distributed File System, or HDFS, which deals with reliable storage of the data. HDFS has been designed based on the Google File System [49], developed at Google for similar purposes. The MapReduce component is responsible for application execution, guaranteeing execution in case of machine failures, and resource management. Hadoop is designed to be scalable, and can run on small as well as very large installations [12]. Several programming frameworks, including Pig Latin [64], Hive [4], and Jaql [11], allow users to write applications in high level languages (loosely based on the SQL syntax) which compile into MapReduce jobs, which are then executed on a Hadoop cluster.

Hadoop is increasingly being used for data processing in the cloud. As listed on the Hadoop PoweredBy page [10], it is frequently deployed on the Amazon EC2 [1] cloud for cost-effective data processing. This has resulted in Amazon offering a special Elastic MapReduce service [2] which allows users to launch Hadoop clusters on demand. The popularity of Hadoop in the cloud is also demonstrated by the fact that several Linux distributions tuned especially for running Hadoop on public cloud offerings are available [7]. The scale of MapReduce applications and the diversity of use cases it has been applied to make resource management in MapReduce a very interesting area of research.

1.2 Resource Management in MapReduce

To understand the essential problems in resource management in MapReduce, it is important to first understand the MapReduce model. In MapReduce a data processing problem is solved by expressing the solution in the form of two functions, map and reduce. The idea is borrowed from the higher order functions with similar names present in most functional programming languages, such as Lisp and ML. The map function in MapReduce takes as input a set of key-value pairs [(k, v)] and a function f that performs a computation on a key-value pair. The function f operates on each of the pairs in the input and outputs a different set of key-value pairs. The output of the map function is then passed to the reduce function as input. The reduce function then applies an aggregate function r on its input and stores its output to disk. The output of reduce is also in the form of key-value pairs. At the end of reduce, the output is sorted according to the values of keys; the function for comparison of the keys is usually supplied by the user.
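To make the model concrete, the following is a minimal, single-process sketch of the map/shuffle/reduce flow just described, using the standard word-count example. It is illustrative only: the function names and the in-memory "shuffle" step are our own simplification, not the Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # f: emit (word, 1) for every word in an input line.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # r: aggregate all counts emitted for a single word.
    yield (key, sum(values))

def run_job(records):
    # Map phase: apply map_fn to every input (key, value) pair.
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # Shuffle/sort phase: group intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: apply reduce_fn to each key's group of values.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, (v for _, v in group)))
    return output

# The two (key, value) records play the role of two input splits.
print(run_job([(0, "the quick brown fox"), (1, "the quick dog")]))
```

In an actual cluster, the map invocations run in parallel on separate input splits and the framework performs the sort-and-group step between the two phases, as described below.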

Figure 1.1 Centralized architecture of a MapReduce system. Data storage and computation are co-located on worker nodes.

During the execution of a MapReduce job, the input is first divided into a set of input splits. The system then applies map functions to each of the splits in parallel, spawning one task per input split; the output of each task is stored on disk for transfer to the reduce tasks. The system starts reduce tasks once all the map tasks have been successfully completed. Task or node failures are dealt with by relaunching tasks. Data given as input to the tasks, and generated as output of the tasks, is stored in a distributed file system (HDFS, for instance) to make sure that task output survives failures.

Figure 1.2 A typical MapReduce work flow

MapReduce is good at solving data-parallel batch processing problems, and it has been optimized for these use cases. One of the essential criteria during its design has been to improve the rate at which data is processed, i.e., to maximize I/O throughput. It works best for I/O intensive applications.

The above paragraphs very briefly summarize the MapReduce model. For detailed information, we encourage the reader to see the original paper describing MapReduce [38]. The MapReduce system as implemented in Hadoop is discussed in further detail in Chapter 2.

Effective resource management is essential in MapReduce to get the most value out of available resources. A resource in our discussion means a computational resource, such as CPU, memory, disk space, or network bandwidth. We identify the following important problems in resource management in MapReduce:

Job Scheduling: Job scheduling is the problem of deciding the sequence in which a set of jobs is to be executed on a MapReduce cluster. The order of jobs could be driven by user-specific criteria, which usually include minimizing the response time of a job; response time is the time from the submission of a job to its completion. Scheduling could also be affected by the system owner's goals, which usually include optimum resource utilization and servicing as many users simultaneously as possible.

Task Assignment: Whereas job scheduling deals with high level ordering of jobs, task assignment is about choosing which task should be assigned to a given worker node. The task assignment should improve utilization of resources on the worker node, and at the same time should prevent overload. Task assignment depends on the current state of resources on the worker node, as well as the predicted outcome of assigning a task to the concerned node.


Admission Control: To prevent overload on a MapReduce cluster, and to meet the QoS requirements of already running jobs, jobs submitted to a MapReduce cluster must be selectively accepted. If, and which, jobs are accepted for execution depends on the goals of the system owner. These goals include preventing overload of computational resources on the MapReduce cluster, and maximizing the value earned after successful execution of a job in cases where users are charged for using the MapReduce cluster.

Speculative Execution: Speculative execution in MapReduce is about improving the response time of a job by relaunching slow-progressing tasks on nodes with better resource availability. In large and heterogeneous clusters speculative execution can significantly improve job response time [38]. There are two important problems involved: first, deciding which of the currently executing tasks are progressing slowly relative to other tasks, and second, deciding on a node on which a slow task must be relaunched.
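As an illustration of the first problem, one plausible family of heuristics compares per-task progress rates. The sketch below, with assumed task fields and an assumed threshold, is in the spirit of rate-based schemes such as the LATE scheduler discussed in Chapter 2; it is not a description of Hadoop's actual policy.

```python
import time

def find_stragglers(tasks, now=None, slow_factor=0.5):
    # `tasks` is a list of hypothetical records with .task_id, .progress
    # (a fraction in [0, 1]), and .start_time (seconds since the epoch).
    if not tasks:
        return []
    now = now or time.time()
    rates = {t.task_id: t.progress / max(now - t.start_time, 1e-9)
             for t in tasks}
    avg = sum(rates.values()) / len(rates)
    # Flag tasks progressing at well below the average rate; slow_factor
    # is an illustrative threshold, not a tuned value.
    return [tid for tid, rate in rates.items() if rate < slow_factor * avg]
```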

Data Local Execution: Data local execution, i.e., the idea of executing tasks as near as possible to their input data, is embraced heavily in MapReduce. The main reason is that it is much cheaper to move code to input data than to bring input data to code. Further, it avoids the latency issues that may arise if input data needs to be transferred across the network, thereby significantly reducing response time [67]. The important problems here are discovering the underlying network topology, and choosing task and replica placement in order to maximize data locality.

All of the above problems are complicated by the fact that a MapReduce cluster usually comprises commodity off-the-shelf hardware. Given the size of clusters and the quality of the hardware, failures are frequent. Thus, any resource management component must take into consideration dynamism both in the availability of resources and in the demands of the applications. Catering to user requirements such as deadlines and guarantees of execution should also be considered while designing a solution to the above problems. In Chapter 2, we discuss in detail the approaches used by Hadoop and some recent research done to address these problems.


1.3 Problem definition and scope

In this thesis, we tackle two of the problems described in the previous section: Task Assignment and Admission Control. These issues need to be solved in conjunction, as the scheduling policy has significant impact on the decision making involved in admission control and vice versa. The two problems are also similar: in admission control one must decide which job to accept from a set of candidate jobs, and in task assignment one must choose a task to assign to a given node from a set of candidate tasks.

1.3.1 Similarity and interdependency between the two problems

On the surface, admission control and task assignment may appear to be different problems; however, they are very closely related, as we explain in this section.

The main goal of the admission control module in a resource management system is to guard the system against overload by selectively accepting incoming jobs. The role of the scheduler, or the task allocator, on the other hand, is to make sure that system resources are efficiently utilized. If the admission controller accepts too many jobs, the size of the scheduler's queue increases, thereby increasing the average response time of jobs, since the resources in the cluster are multiplexed between jobs. On the other hand, if the scheduler does not utilize resources efficiently, the number of jobs waiting for resources again increases, giving the admission controller a false impression that the system is overloaded. Both scenarios are undesirable, as they lead to suboptimal system utilization. Hence, it is important to tackle both problems together, and the approach we take tries to solve them using the same algorithm.

If we look closely, both the admission controller and the scheduler have to choose a particular decision among a set of plausible decisions. The choice made should maximize an objective function. This is the basic problem studied in decision theory, where an agent has to make a choice among a range of alternatives. Each of the alternatives has a payback (reward) and a chance (probability of success) associated with it. Based on the expected utility hypothesis, the agent chooses the outcome that maximizes the expected reward, i.e., the product of the reward and the probability of achieving that reward. The agent must consider a number of factors that may affect the outcome of the decision. This usually involves past knowledge and external information about the events.

We implement a decision network using a simple binary classifier that classifies the plausible decisions into two sets: good and bad. Only the decisions labelled 'good' are considered for further evaluation; the rest are discarded. Among the 'good' decisions, we choose the one that maximizes the expected reward, i.e., the one that maximizes the product of the objective function (called the utility function from here on) and the classifier score.
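The selection rule can be stated compactly. The following sketch is a schematic of the idea, with `p_good` standing in for the classifier's score and `utility` for the user- or administrator-supplied objective function; the names are ours, not the implementation's.

```python
def choose(candidates, p_good, utility):
    # Keep only decisions the classifier labels 'good'.
    good = [c for c in candidates if p_good(c) > 0.5]
    if not good:
        return None  # decide nothing rather than risk a bad decision
    # Among good decisions, maximize expected reward: utility * score.
    return max(good, key=lambda c: utility(c) * p_good(c))
```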

In solving both problems we have used a Naive Bayes classifier. Ideally, one should use a Bayesian network to correctly model the interdependencies between the factors that influence the decision. However, constructing an accurate Bayesian network is a hard problem and may require considerable expertise, and exact inference in a generic Bayesian network is hard as well. A Naive Bayes classifier sidesteps both problems by simply assuming that all factors are independent of each other.

The classifier used in our systems is trained by an external evaluation engine which uses simple rules to evaluate the outcome of decisions once their effects have been observed. Validating decisions after their effects are known is easy and can be accomplished using simple rules. The results of the evaluation engine are fed back into the classifier, and the classifier learns from this data.
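The sketch below shows one way such an incrementally trained classifier can be realized, assuming categorical feature values for simplicity; the actual feature variables and classifier details are described in Chapter 4.

```python
from collections import defaultdict

class NaiveBayes:
    """Binary Naive Bayes over categorical features with add-one smoothing,
    updated one observation at a time by the evaluation engine."""

    def __init__(self):
        self.class_counts = defaultdict(int)    # label -> count
        self.feature_counts = defaultdict(int)  # (index, value, label) -> count

    def update(self, features, label):
        # Called once the outcome ('good' or 'bad') of a decision is known.
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.feature_counts[(i, v, label)] += 1

    def p_good(self, features):
        total = sum(self.class_counts.values())
        score = {}
        for label in ("good", "bad"):
            p = (self.class_counts[label] + 1) / (total + 2)  # smoothed prior
            for i, v in enumerate(features):
                p *= ((self.feature_counts[(i, v, label)] + 1)
                      / (self.class_counts[label] + 2))
            score[label] = p
        return score["good"] / (score["good"] + score["bad"])
```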

Our system thus forms a closed control loop [6], an integral part of a distributed autonomic system. Next, we explain both of the problems in detail and summarize how a classifier is used in each to achieve the desired goal. Both systems are covered in much greater detail in the later chapters.

1.3.2 Admission Control

Admission Control is the first problem that we attempt to solve. As discussed earlier, admission control is about deciding if and which jobs should be admitted for execution on a MapReduce cluster. An admission control algorithm receives as input a set of candidate jobs and must output the set of jobs that will be accepted for execution on the cluster. For choosing the set of jobs to admit, the algorithm must consider the utility that the cluster's owner will earn after successful completion of the jobs. The algorithm must also ensure that newly admitted jobs do not overload resources in the cluster, and thus do not adversely affect already running jobs.

The utility earned after completing a job is usually specified by a utility function, provided by the user submitting the job. A utility function expresses the utility gained by the submitter of the job as a function of the time taken to complete the request. In the cloud computing scenario the utility is typically the amount of money the user is willing to pay after successful completion of his/her request. This also introduces a new problem of price setting, where users have to estimate the correct value of the service, which in this case is a share of the resources in a MapReduce cluster. In this thesis we do not focus on price setting mechanisms; they are surveyed in more detail in other works [26, 27].

MapReduce is a practical platform for developing distributed applications, and thus cloud computing offerings such as Amazon Elastic MapReduce fall under the Platform as a Service (PaaS) paradigm. Although PaaS has its own advantages, MapReduce offered in the Software as a Service (SaaS) paradigm can prove useful to users as well as service providers. Users can reuse MapReduce components developed by others, and service providers can expose MapReduce jobs as pay-per-use services. Service providers rent computational resources from an infrastructure provider and allow users to run ready-to-use services. An effective admission control mechanism is necessary in this setting in order to maximize utility from the perspective of service providers and to ensure quality of service for users [26, 28]. Admission control has been essential in preventing overload of computational resources, thereby maintaining a guaranteed level of service.

The main contributions that we put forward related to admission control in MapReduce are:

• A method for modeling MapReduce jobs as ready-to-use services, thus effectively bringing MapReduce into the Software as a Service paradigm.

• An extension of the utility models proposed in related work, in order to adapt them to MapReduce; and


• An admission control algorithm that uses a machine learning based approach for predicting the outcome of job admissions. The algorithm trains itself according to policy rules set by the service provider.

We assume that the service provider, or the cluster owner, has complete knowledge about the type of jobs being executed on his/her cluster. Further, it is assumed that any new job submission request is for executing one of the jobs already known to the service provider. This is in contrast to other works, which discuss arbitrary job execution on a rented cluster. Although arbitrary job execution is possible in theory, implementing it in reality is difficult given the uncertainty in a program's demands, and other practical issues such as a priori resource provisioning, security requirements, etc.

The admission control algorithm we propose is explained in more detail in Chapter 3.

1.3.3 Task Assignment

Task assignment in MapReduce is an interesting problem because efficient task assignment can significantly reduce runtime or improve hardware utilization, both of which can reduce costs. Recent works on resource management in MapReduce [79] have focused on improving the performance of Hadoop with respect to the user of the application. The schedulers developed so far for Hadoop implement different policies and focus on fair division of resources among users of a Hadoop cluster. However, they do not address the inflexibility inherent in Hadoop's task assignment, which can result in overloading or underutilization of resources.

Task assignment in Hadoop is worker driven. Each worker node sends a periodic message to the master describing the current state of its resources. The master node must then choose a task from the tasks in the scheduler's queue and assign it to the worker for execution. The task assignment process has to consider the current state of resources on the worker, the predicted state of resources in the future, and the demands of the queued tasks while making an assignment decision. A task assignment algorithm should make sure that resources at the worker node in question are not overloaded.


Many organizations schedule periodic Hadoop jobs to pre-process raw information in the form of application logs, session details, and user activity in order to extract meaningful information from it [10]. The repetitive nature of these applications provides an interesting opportunity to use performance data from past runs of an application and integrate that data into resource management algorithms.

In this thesis, we present a scheduler for Hadoop that is able to maintain a user-specified level of utilization when presented with a workload of applications with diverse requirements. It thus allows the cluster administrator to focus on high level objectives such as maintaining a desired level of utilization. The scheduler frees the administrator from the responsibility of knowing the resource requirements of submitted jobs, although we still allow users to provide 'hints' to the scheduler if information about the resource requirements of jobs is indeed available. The scheduler learns the impact of different applications on system utilization rather quickly. The algorithm also allows the service provider to enforce various policies, such as fairness, deadline, or budget based allocations; users can plug in their own policies in order to prioritize jobs. Thus, the algorithm decouples policy enforcement (scheduling priorities) from task assignment.

Our scheduler uses automatically supervised pattern classifiers to learn the impact of different MapReduce applications on system utilization. We use a classifier to predict the effect of queued tasks on node utilization. The classifier makes use of dynamic and static properties of the computational resources and labels each of the candidate tasks as good or bad. We then pick the task with maximum utility from among the tasks that have been labeled good by the classifier; the utility of a task is provided by an administrator-specified utility function. We record every decision thus made by the scheduler. A supervision engine judges the decisions made by the scheduler in retrospect, validating them after observing their effects on the state of computational resources in the cluster. The validated decisions are used to update the classifier, so that experience gained from decisions validated so far can be used while making future task assignments.
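Putting the pieces together, the closed loop can be sketched as follows, reusing the hypothetical NaiveBayes class from Section 1.3.1. The feature extraction, field names, and overload test are illustrative assumptions for this sketch, not LSCHED's actual interfaces (see Chapter 4).

```python
pending = []  # (features, node_id) of decisions awaiting validation

def on_heartbeat(node_state, queued_tasks, classifier, priority):
    """Pick a task for a worker that has just reported its state."""
    def features(task):
        # Static node properties plus dynamic load plus task identity.
        return (node_state["cpu_load_bucket"],
                node_state["free_mem_bucket"],
                task.job_name)

    good = [t for t in queued_tasks if classifier.p_good(features(t)) > 0.5]
    if not good:
        return None  # leave the slot idle rather than risk overload
    chosen = max(good, key=priority)  # priority plays the role of utility
    pending.append((features(chosen), node_state["node_id"]))
    return chosen

def supervise(node_state, classifier, overloaded):
    """Evaluation engine: label past decisions for this node once their
    effects on the node's resources are observable."""
    for feats, node_id in list(pending):
        if node_id == node_state["node_id"]:
            classifier.update(feats,
                              "bad" if overloaded(node_state) else "good")
            pending.remove((feats, node_id))
```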

The scheduler design and implementation are covered in more detail in Chapter 4.


1.3.4 Guiding principles

We followed these guiding principles while attempting to solve task assignment and admission control:

Adaptability - The designed algorithms must adapt to changes in the amount of available computational resources, as well as to dynamic changes in the states of the nodes constituting the grid/cluster.

Scalability - The designed algorithms must scale with the number of nodes in a MapReduce cluster, as well as with the number of users of the system.

Meeting user requirements - The algorithms should try to meet users' as well as system administrators' QoS requirements. For users this could mean prioritizing their jobs, whereas for administrators it could mean maximizing resource utilization.

Preventing overload - Overloading of resources should be avoided. Overloading can result in device failures. It can also reduce the job processing rate if the finite amount of available resources is shared among an increasingly large number of jobs. Finally, overload results in more energy consumption, thereby increasing the operational cost for the system owner.

As discussed earlier in this section, we use utility functions to allow users and system administrators to control the algorithms according to their needs. For preventing overload and maximizing resource utilization, we use a learning based approach.

1.4 Organization of the thesis

The rest of the thesis is organized as follows.

Chapter 2 begins by providing an overview of resource management in the context of grid computing systems. We then discuss the topics related to resource management in MapReduce in detail, and compare our admission control algorithm to existing algorithms applied in the context of utility computing.


The last part of this chapter discusses available schedulers for Hadoop and differentiates our scheduler from the existing alternatives.

Chapter 3 presents the Admission Control algorithm. We first propose a model for providing MapReduce jobs as services. We then explain the proposed algorithm for admission control, and conclude by presenting the performance of the algorithm under a number of scenarios.

Chapter 4 discusses the task assignment algorithm and the learning scheduler for Hadoop. First, we propose the learning algorithm using pattern classification for task assignment. We then present the scheduler, briefly discussing the implementation issues involved. Finally, we evaluate the scheduler on a number of real-life workloads and demonstrate that it behaves as expected.

Chapter 5 summarizes our work and establishes the key lessons learned from it. We touch upon future directions for research and conclude the thesis.


Chapter 2

Context: Grid Resource Management

This chapter begins by giving an overview of grid resource management, and explains the admission control and task assignment problems in detail by mentioning the related research done in both areas. After discussing prior work on generic task assignment for a computational cluster, we move on to Hadoop-specific approaches, and discuss the available Hadoop schedulers and their task assignment policies in detail.

2.1 Brief Overview of Grid resource management

Many of the ideas embraced by the cloud computing community have their roots in grid computing research. Ideas for providing computing as a commodity, on-demand provisioning of resources and services, and composing distributed applications from web based services were all proposed in the grid community. Thus, in order to understand our work and differentiate it from previously proposed algorithms, it is necessary to understand grid resource management as well. In this section, we give a brief overview of grid resource management, and then focus on related research in admission control and task assignment.

Although various definitions of the Grid exist, we adopt the original definition proposed by Foster and Kesselman [44]. According to them, a grid is a very large scale distributed system formed of a network of heterogeneous computers spread across multiple organizations and administrative domains. A grid is usually very large and can comprise several thousand nodes. Cloud computing can be considered a special case of grid computing. The focus of grids has been on collaboration and interoperability to address computational needs beyond the capacity of any single organization. In contrast, the main drive behind cloud computing has been the economic benefits experienced by both providers and consumers of computational resources.

Grids can be broadly classified as computational grids, data grids, or service grids. It is very common for data and computation to be co-located, and in the case of MapReduce it is an essential requirement.

A computational grid provides a large sum of computational resources aggregated from several individual nodes; the total capacity available is larger than that of any individual node in the system. A computational grid could carry out computation by distributing it across millions of machines, for example the SETI@HOME [18, 19] and Folding@Home [58] projects, or it could combine several high performance clusters or supercomputers for processing, e.g., the TeraGrid project [29]. Problems that benefit from such a grid include grand challenges such as weather modeling, nuclear simulations, and scientific experiments in high energy physics.

A data grid provides safe and efficient access to vast amounts of data. Although a computational grid also offers data storage, the distinguishing factors in the case of the data grid are the facilities for data access, which include file catalog management, data retrieval services, high speed data movement, and services that enable data mining and data warehousing.

Finally, the service grid offers high level services which in turn utilize data or computational grids. Examples include collaborative services that allow users to form virtual work spaces by connecting users and applications in interactive environments, and services that acquire resources on demand on behalf of users. Service grids also offer mechanisms for negotiating and enforcing QoS, resource brokering, resource reservation, and resource monitoring.

A resource in a grid is any reusable entity that is employed to complete a user's request for computation. A resource provider is an entity that controls the resources, and a resource consumer is the entity that uses them. A resource management system (RMS) is thus a system that manages resources pooled from a number of resource providers and allocates them to complete requests initiated by resource consumers. An RMS in a grid deals with managing the pool of resources available in the grid, i.e., it schedules the available processors, network bandwidth, and disk storage. The RMS must be adaptable to the fluctuating amount of resources in the grid. It must be scalable as well as robust. Supporting quality of service (QoS) and meeting computational cost constraints are also some of the issues that need to be addressed by a resource management system. Further, the RMS must consider fault tolerance and stability, as the resources in the grid could become unavailable at any time [15].

A grid could be spread across different autonomous organizational domains. The policies adopted by each of the domains could differ, which makes resource management in grids more challenging than in conventional distributed computing systems under the control of a single organization. Further, as the organizations could be geographically distributed across countries, local legal constraints further complicate the issue. Many of these issues are also prevalent in cloud computing.

In the abstract grid RMS architecture proposed in [15], we identify the following components that could be important in a MapReduce system.

Grid resource broker - A resource broker provides mechanisms for establishing the QoS desired by users. This includes establishing service contracts and determining the price of services and computational resources. The job of the broker is to select resources and services that best match the requirements specified by users, and to delegate user requests to the chosen service providers [52].

Admission Controller - Within the realm of a single service provider, the admission controller checks whether incoming service requests should be given access to resources under the service provider's control. Incoming requests might not be accepted if resources are already overloaded or if the service provider is already overcommitted. Further, the admission controller must make sure that the requests admitted maximize the utility of the service provider [54].

Global Scheduler - The global scheduler manages resource allocation at an aggregate level. If the grid consists of multiple sites, the global scheduler might decide which site should be allocated for a given set of jobs in its queue. The global scheduler does not have access to information about the resources of individual nodes in local sites. It relies on the interfaces exposed by each of the local sites, which report aggregate resource availability for that site, and bases its decisions on this aggregate information [74, 73, 75].

Local Scheduler - A local scheduler manages resources in a local cluster within a single administrative domain, for example a local university cluster. The local scheduler has detailed information about the state of resources on each of the individual nodes in the cluster, and assigns tasks to the individual nodes. Requirements for a local scheduler include maximizing resource utilization, load balancing, and enforcing local policies for job priorities. Examples include the Sun Grid Engine [48], Condor-G [47], PBS, and Maui [24].

Grid Information Service (GIS) - A grid information service provides a directory for lookup of grid services and resources. Service providers publish the availability of services to a grid information service, and grid brokers access a GIS to identify potential services for delegating requests on behalf of users [35].

Grid Monitoring Service - A grid monitoring service maintains information about the state of resources in a grid or a local cluster. Monitoring services also have features for aggregating resource information, notification in case of changes in resource state, and, optionally, prediction of the availability of resources. Information made available by a monitoring service is used by schedulers and admission controllers. Examples of monitoring systems are Ganglia [68] and the Network Weather Service [77].

As mentioned in Chapter 1, in this thesis we attempt to solve issues in admission control and local task scheduling for MapReduce clusters. In the next sections we discuss the related work on each of these issues in detail.

2.2 Related Work: Admission Control

As discussed in Chapter 1, admission control is about selectively admitting jobs for execution on a cluster. It has been researched well in the utility computing community, where the goal is to provide computation as a utility, and it forms an integral part of such systems. Most grid systems have found admission control useful in preventing overload of resources and in maximizing utility for service providers and users alike.

2.2.1 Utility Functions

Utility functions are instruments used by the users of a system to express their expectations from a service. A utility function gives the value of the utility earned by the users of a system or service as a function of time; typically the time argument passed to the utility function is the time taken to complete the user's request. Naturally, users want their requests to be serviced as soon as possible, meaning that the utility earned by users is high if their requests are served by their expected deadline and deteriorates afterwards.

A utility function that decays with time has been proposed in earlier works: Risk Reward [50], Aggregate Utility [22], and Millennium [32]. These works use a three-stage utility model, where the utility remains constant until a deadline, and then degrades linearly until a second deadline. After the second deadline is crossed, it is assumed that the user is no longer interested in the outcome of his/her request, and hence will not be paying for its completion. The service provider may be penalized if it fails to meet its deadline guarantees in such cases.

A linear decay rate (slope of the utility curve) is specified by the user to express dissatisfaction with the passage of time. We argue that a linear decay in utility assumes that the user's expectations also decrease linearly. On the contrary, the rate of decay in utility, −du(t)/dt, actually increases with time, i.e., the users become more and more disappointed with each time unit passed after their expected deadlines. This motivates non-linear and more generic utility functions that allow users to accurately express their expectations from offered services. The utility functions we use are described in detail in Chapter 3.

2.2.2 Existing Admission Control Algorithms

In the existing literature, admission control has usually been proposed as a submodule within the scheduler. In our approach we have treated the scheduler and the admission controller independently. Furthermore, the admission controller does not demand a specific type of scheduler within the cluster, and vice versa.

Despite the need for an effective admission control algorithm, the problem has received very little attention on its own. We discuss the important algorithms relevant to our approach in this section. Other admission control algorithms mostly deal with media services and telecommunication networks, and their approaches are not directly applicable to cloud computing.

Millennium

Chun and Culler [31] presented one of the earliest works on admission control in the utility computing space. They discussed admission control algorithms for fixed-size clusters and jobs with fixed shapes. The roles of service provider and resource provider are expected to be played by the same entity. Their algorithm calculates the value of yield/RPT, where yield is the utility earned after completing a job and RPT denotes the remaining processing time of the job. Thus, they select jobs that maximize the utility earned per unit time. Jobs can be preempted to give way to more promising requests. The novelty in this work came from evaluating the system based on a user-centric metric (utility earned) rather than system-centric metrics such as system utilization. An important contribution of their study is to demonstrate the ability of market-based approaches in delivering significantly more user satisfaction compared to the conventional approaches of the time.
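A minimal sketch of this selection rule (our own illustration, not the authors' code; the job fields are assumptions):

    # Millennium-style selection: pick the job that maximizes utility
    # earned per unit of remaining processing time (yield/RPT).
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        yield_: float          # utility earned on completion
        remaining_time: float  # remaining processing time (RPT), seconds

    def select_next_job(queue):
        """Return the job with the highest yield/RPT ratio."""
        return max(queue, key=lambda j: j.yield_ / j.remaining_time)

    queue = [Job("A", yield_=100.0, remaining_time=50.0),
             Job("B", yield_=30.0, remaining_time=10.0)]
    print(select_next_job(queue).name)  # "B": 3.0 utility/s beats 2.0

Under this rule a short, cheap job can beat a long, lucrative one, which is exactly what makes preempting running jobs in favor of more promising requests attractive.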

Risk Reward

Work by Irwin, Grit and Chase [50] discusses scheduling of tasks in a market-based task service where users submit task bids. Users are expected to submit the task, the expected time to complete the task, and a value function (same as the utility functions described earlier) to the service provider. A three-phase value function with a linear decay rate is used. The service provider then runs acceptance heuristics to decide whether or not to accept a task. Once a task is accepted, tasks in the queue are scheduled using a FirstReward heuristic.

The acceptance heuristic calculates a slack value for each task and rejects tasks whose slack value falls below a user-defined threshold. Slack is the delay that can be tolerated for a given task before its expected utility drops below an acceptable level. Tasks with higher slack values are preferred, as they leave more opportunity to accept profitable tasks, i.e., tasks with higher value functions, in the future. Experiments performed by the authors suggest that admission control is critical in preventing overload, and that it also increases the yield (utility) earned by the service provider by reducing the number of tasks that end up resulting in penalties.
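A sketch of such a slack-based acceptance test (our own illustration; in particular, computing slack as the time remaining before the task's utility falls below an acceptable level, minus its expected runtime, is our assumption):

    import time

    def slack(task, now=None):
        """Delay the task can tolerate before its expected utility drops
        below an acceptable level. `latest_finish` is the time at which
        the task's value function crosses that level."""
        now = time.time() if now is None else now
        return (task["latest_finish"] - now) - task["expected_runtime"]

    def accept(task, threshold):
        """Acceptance heuristic: reject tasks whose slack is below the
        user-defined threshold."""
        return slack(task) >= threshold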

The authors expect users to submit the expected runtime of a task at the time of requesting the task service. Users may not always have an idea of what deadlines to expect from the service: first, because detailed information about the service provider's infrastructure may not be available (sometimes not even to the service provider, especially in cloud computing), and second, because performance information for tasks might not be available if the tasks are assembled from task repositories on demand. Additionally, users might tend to overestimate the performance capacity of the service and might demand impossible deadlines.

Positive Opportunity

Positive Opportunity [65] takes an exhaustive approach: it computes all possible schedules of the new job request together with the existing jobs, as well as schedules of just the existing jobs, and selects the schedule that results in the most reward (utility) after completing all the tasks. If the new job is in the schedule with the maximum potential yield, it is accepted; otherwise it is rejected. In other words, a job is accepted only if it does not decrease the profit that is expected from executing the current set of jobs. Once a job is accepted, it is run to completion, as are all jobs, and scheduled according to an independent scheduling policy.

Aggregate Utility Functions

Young et al. [22] present aggregate utility functions that allow users to control the behavior of the admission controller service. The algorithm they presented allows capturing the utility of an individual job as well as the aggregate utility of a batch of jobs. Two sets of utility functions are used: one for each individual job and another for the entire workflow of jobs. The authors approach the problem of admission control for arbitrary jobs in a computational cluster, which the service provider rents from a third-party infrastructure provider. The admission control algorithm can make decisions for each individual job as well as for the entire contract of a batch of jobs. The scheduling algorithm uses a FirstReward [65] heuristic: it selects the job that is expected to provide the least declining contribution to profit-rate, where profit-rate is the profit earned per unit time. The cluster executes only one job at a time. For accepting contracts, a strategy is chosen from the following five:

1. Oblivious - Always accept contracts.

2. Average - Accept a contract if the sum of the average load during the contract and the current load average of the cluster is less than what can be sustained by current resource availability.

3. Conservative - Instead of the average load, use a load that is 2 standard deviations above the mean (see the sketch after this list).

4. Preempt-Conservative - Conservative, with the addition that existing contracts are preempted if they would result in losses after a new contract is accepted.

5. High Value Only - Accept a contract only if it offers a utility rate above a certain threshold set by the administrator.
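As an illustration of the Average and Conservative checks (a sketch under our own assumptions; the load statistics and capacity are hypothetical inputs):

    import statistics

    def accept_average(contract_load, recent_loads, capacity):
        """Average strategy: admit if the contract load plus the current
        mean load stays within sustainable capacity."""
        return contract_load + statistics.mean(recent_loads) <= capacity

    def accept_conservative(contract_load, recent_loads, capacity):
        """Conservative strategy: pad the mean load by two standard
        deviations (needs at least two load samples)."""
        mean = statistics.mean(recent_loads)
        stdev = statistics.stdev(recent_loads)
        return contract_load + mean + 2 * stdev <= capacity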

Admission Control in Hadoop

Existing schedulers for Hadoop (FAIR [80], Capacity [5], and Dynamic Priority [8]) offer very limited admission control facilities. For example, the FAIR scheduler has a feature to suspend jobs until sufficient free resources are available. Existing schedulers focus on implementing resource sharing policies within a MapReduce cluster owned internally by an organization. The dynamic priority scheduler uses market-based approaches to control the resources shared by users in a cluster; however, it also does not address the problem of admission control. Our approach borrows from the previous work on admission control in utility computing, and provides a model more suited for cloud computing, in a scenario where MapReduce is offered as a service.


In the current Hadoop implementation, admission control is left to the scheduler, and there is no independent module for it. A drawback of this approach is that the job data structures are already allocated in the master node's memory by the time the scheduler takes a decision. Further, the job construction phase is expensive, and if the master node decides to reject a submission, all the work done during job setup is wasted. It is therefore necessary to perform admission control before accepting a job. The job submission protocol in Hadoop can be extended to include a negotiation phase, in which the admission controller could participate.

2.3 Related Work: Task Assignment

Having surveyed the existing literature on admission control, we now turn to task assignment in compute clusters.

Multiprocessor task assignment is known to be an NP-complete problem [33, 34]. As a result, task assignment in grids and distributed systems remains an overwhelmingly attempted yet unsolved problem, and a vast number of approaches have been tried. The application of heuristic-based algorithms has been popular in this field; as every distributed system presents a new opportunity to try out domain-specific heuristics, task assignment is still an interesting problem. However, some generic algorithms have been proposed to tackle task assignment for the typical cluster-of-computers scenario. Of these we only consider heuristics used in independent task scheduling, as that is the case with maximum overlap with task assignment in MapReduce. For a detailed survey of dependent task scheduling and grid scheduling in general, the reader is encouraged to read the excellent survey articles by Braun et al. [25], Buyya et al. [15], and Dong and Akl [40].

2.3.1 Independent Task Scheduling

In MapReduce, tasks of the same type (map or reduce) are assumed to be independent of each other. In other words, the output of a map task of a job does not depend on the output of any other map task of the same job; the same goes for all reduce tasks of the job. Also, at the time of task assignment, tasks of disparate jobs in the scheduler's queue are assumed to be independent. Task dependency goes against the philosophy of data-parallel computation in MapReduce.

A typical strategy in independent task assignment is to allocate tasks considering the load on resources, in order to achieve high system utilization. The heuristic-based algorithms used for independent task scheduling can be broadly classified into two categories:

1. Heuristics with performance estimates - The heuristic used for task assignment assumes that basic performance information, such as mean CPU usage and the expected time to complete the task, is available.

2. Heuristics without performance estimates - Algorithms that do not use performance estimates.

Algorithms with performance estimates

OLB, or Opportunistic Load Balancing, assigns a task to the next available machine. While allocating, the algorithm does not consider the expected runtime of the task on the chosen machine. Although the algorithm is simple, it can result in poor runtimes [21, 45, 46].

The MET (Minimum Execution Time) [25] algorithm assigns each task to the machine that is expected to minimize the execution time of that task. Thus, each task is assigned to its best machine. This can cause load imbalance among the resources, as machines with more computing capacity are likely to be the best matches for most of the tasks, resulting in more load on the more powerful machines, while machines with less computational capacity could be left underutilized.

The MCT (Minimum Completion Time) algorithm [21] assigns tasks to the machines that are expected to minimize the completion time of the tasks. The algorithm combines the heuristics used in OLB and MET, and tries to avoid the shortcomings of both.

The Min-Min algorithm computes the expected runtime of each task on all the machines in the cluster. This is done for all the tasks in the scheduler's queue. From the set of runtimes, the minimum time for each task and the corresponding machine are selected, yielding a new set containing the minimum completion time of each task. From this set, the task-machine mapping with the minimum runtime is chosen, and the chosen task is assigned to the corresponding machine. The algorithm minimizes the runtime for each task, and then picks the task with the minimum runtime among all tasks, hence the name min-min [21, 45, 46].
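A sketch of one Min-Min assignment round (our own illustration; etc[t][m] is an assumed table of the expected completion time of task t on machine m):

    def min_min_round(tasks, machines, etc):
        """One Min-Min iteration: for each task find its best machine,
        then assign the task whose best completion time is the smallest
        overall. The caller removes the task, updates etc, and repeats."""
        best = {t: min(machines, key=lambda m: etc[t][m]) for t in tasks}
        task = min(tasks, key=lambda t: etc[t][best[t]])
        return task, best[task]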

Max-Min is very similar to Min-Min: it computes the set of minimum expected completion times for all tasks, but then selects the task with the maximum expected time from this set and assigns it to the corresponding machine. The chosen task is removed from the set of candidate tasks, and the process is repeated until all tasks have been allocated. The Max-Min algorithm tries to match the tasks with longer expected times to their best machines. Thus, it tries to minimize the penalty incurred by running longer tasks concurrently with shorter tasks.

The heuristics of Min-Min and Max-Min are combined in Duplex, which runs both and chooses the better of the two solutions. Thus, it tries to perform well in situations where either Min-Min or Max-Min performs well.

Suffrage and XSuffrage: The Suffrage algorithm [61] computes the difference between the best MCT and the second-best MCT for a task; this difference is called the task's suffrage. The process is repeated for all the tasks, and from the resulting set the task with the maximum suffrage is chosen. The philosophy behind the Suffrage algorithm is that a task should suffer as little as possible from incorrect task assignment; the task that would suffer the most from an incorrect assignment is always chosen first. However, this algorithm may not work well when all tasks are expected to achieve almost identical runtimes. Further, the Suffrage algorithm does not consider data locality, i.e., the distance between where the input data is present and where the task is actually executed. The XSuffrage algorithm [59] addresses this by computing a cluster-level suffrage value for all the clusters in a grid. Compared to Suffrage, XSuffrage works well even in cases where accurate task performance information is not available.
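A sketch of the suffrage computation (our own illustration, reusing the assumed etc completion-time table from the Min-Min sketch):

    def suffrage_pick(tasks, machines, etc):
        """Pick the task that would suffer most from a wrong assignment:
        the one with the largest gap between its best and second-best
        expected completion times."""
        def suffrage(t):
            times = sorted(etc[t][m] for m in machines)
            return times[1] - times[0] if len(times) > 1 else 0.0
        task = max(tasks, key=suffrage)
        return task, min(machines, key=lambda m: etc[task][m])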

Algorithms without performance estimates

Subramani et al. [73] present a duplication-based approach in which a job is sent to K independent sites. Each of the sites then schedules the job locally and informs the global scheduler if it is able to start the job. Upon receiving a start message from one of the K sites, the global scheduler sends a cancel message to the remaining K − 1 sites to cancel the job. The reasoning behind duplicating tasks is better utilization of idle machines on large clusters, and a reduction in the expected makespan of a job.

Another algorithm, Work Queue Replication (WQR) [36], duplicates tasks that are already running on other processors by relaunching them on other machines in the cluster. The difference between the previous algorithm and WQR is that WQR actually duplicates ongoing work, whereas the previous algorithm cancels all redundant work. The WQR algorithm works well without using resource information, and also copes well with dynamic resource quality and variations in application performance. This is very similar to the idea of speculative execution in Hadoop.

2.3.2 Computational Intelligence based approaches

Nature-inspired heuristics such as Genetic Algorithms (GA), Simulated Annealing, and Swarm Intelligence have also been used in task assignment problems. While applying these so-called computational intelligence algorithms, the task assignment problem is modeled as an optimization problem, and a heuristic is applied to find a solution that maximizes (or minimizes) an objective function. Next, we discuss some of these heuristic algorithms and their application to distributed task assignment.

Genetic Algorithms

Genetic Algorithms fall under the broader class of evolutionary algorithms and are typically used to discover solutions in large search spaces. The general method used in a GA is as follows (a sketch of these steps follows the list):

1. Population Generation: A population contains a set of chromosomes (potential solutions). The initial population can be generated using the other heuristics mentioned in Section 2.3.1.


2. Chromosome evaluation: Each chromosome is evaluated based on the value of the objective function. For task assignment, an example of an objective function is the total makespan of all tasks.

3. Crossover and Mutation: During crossover, two chromosomes are selected and random substrings of theirs are exchanged. In mutation, a single chromosome is selected and one of its task-machine mappings is reassigned.

4. Evaluation: The new chromosomes generated in the previous steps are evaluated again, and a random subset of the chromosomes is preserved for the next iteration. The algorithm ends after a fixed number of generations, or if all the chromosomes converge to a single mapping.

GAs have been a very popular heuristic and have been applied to a number of Grid scheduling problems [17, 53, 72, 76].
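A compact sketch of this loop for task assignment (our own illustration under simplifying assumptions: a chromosome is a list mapping task index to machine, fitness is the makespan computed from an assumed runtime table, and for simplicity the sketch keeps the fittest chromosomes rather than a random subset):

    import random

    def makespan(chrom, runtime):
        """Load of the busiest machine; runtime[t][m] is assumed given."""
        loads = {}
        for t, m in enumerate(chrom):
            loads[m] = loads.get(m, 0.0) + runtime[t][m]
        return max(loads.values())

    def evolve(runtime, n_machines, pop_size=50, generations=200):
        n_tasks = len(runtime)
        pop = [[random.randrange(n_machines) for _ in range(n_tasks)]
               for _ in range(pop_size)]
        for _ in range(generations):
            # Crossover: exchange a random substring of two parents.
            a, b = random.sample(pop, 2)
            cut = random.randrange(n_tasks)
            child = a[:cut] + b[cut:]
            # Mutation: reassign one task to a random machine.
            child[random.randrange(n_tasks)] = random.randrange(n_machines)
            # Evaluation: keep the population size fixed, drop the worst.
            pop.append(child)
            pop.sort(key=lambda c: makespan(c, runtime))
            pop.pop()
        return pop[0]  # best task-to-machine mapping found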

Simulated Annealing

The next heuristic, Simulated Annealing (SA), is based on the restructuring of molecules observed during the controlled cooling of molten metals and alloys. In SA, a metal is first melted, and then cooled slowly until it reaches a thermal equilibrium, which is an optimal state. When applied to task-machine mapping, we define a temperature that can be calculated for each mapping; the temperature could be calculated using the objective function. If a new mapping results in a higher temperature, it is accepted with a certain probability in order to escape from local minima. The initial mapping is generated from a uniform random distribution. The mapping is then mutated, similarly to GA, and the temperature of the new mapping is evaluated. If the new temperature is lower, the new mapping replaces the old one. If it is worse, it is accepted with some probability: a uniform random number z is selected such that z ∈ [0, 1), and z is then compared to y, where

\[ y = \left( 1 + e^{\frac{old\_temp - new\_temp}{new\_temp}} \right)^{-1} \]

If z ≥ y, the new mapping is accepted; otherwise it is rejected. This completes one iteration of the algorithm. The algorithm stops if the temperature does not reduce after a certain number of iterations, or if the temperature reduces to zero, indicating an optimal solution. Examples: [25, 78].
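A sketch of this acceptance rule and loop (our own illustration; the mutate and temperature functions are assumed to come from the problem definition, e.g. temperature(mapping) = makespan of the mapping):

    import math
    import random

    def sa_accept(old_temp, new_temp):
        """Accept per the formula above: improvements always, worse
        mappings with probability 1 - y."""
        if new_temp <= old_temp:
            return True
        y = 1.0 / (1.0 + math.exp((old_temp - new_temp) / new_temp))
        return random.random() >= y   # z >= y accepts the worse mapping

    def anneal(mapping, temperature, mutate, max_stale=1000):
        temp, stale = temperature(mapping), 0
        while temp > 0 and stale < max_stale:
            candidate = mutate(mapping)
            cand_temp = temperature(candidate)
            improved = cand_temp < temp
            if sa_accept(temp, cand_temp):
                mapping, temp = candidate, cand_temp
            stale = 0 if improved else stale + 1
        return mapping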

Genetic Simulated Annealing

GA can be combined with SA to yield a hybrid approach called Genetic Simulated Annealing [16]. GSA follows the procedure of GA as described earlier; however, while selecting chromosomes, the simpler decision process of SA is used.

Swarm Intelligence

Swarm intelligence is a term used for a set of metaheuristic algorithms derived from the behavior of swarms of social animals in nature. Such algorithms include Ant Colony Optimization (ACO) [41], modeled after the foraging behavior of ants; Particle Swarm Optimization (PSO), modeled after a swarm of fish or hypothetical particles; and the Bee algorithm, modeled after the foraging behavior of honey bees.

Ant Colony Optimization (ACO), proposed by Dorigo [41], has been used in Grid scheduling [42, 66]. In this approach, each mapping is represented by an edge in a graph that starts from a start node and ends at the end node (the optimal solution). Each edge is traversed by ants (agents), which leave a trace (smell) of a certain chemical; this trace decays with time. While selecting the next edge, ants try to select the edge that has the maximum trace value. After an ant crosses an edge, the edge's trace value is reinstated. The algorithm ends when all the ants have reached the end (optimal state), or if they have converged at an intermediate state after a number of iterations.

Particle Swarm Optimization is modeled after the interactions in social organisms. Initially, a starting population is generated randomly, and a social network among the solutions is defined for exchanging information. Each individual in the population is a candidate solution. Each individual, or particle, evaluates its fitness function and exchanges this information with its neighbors. A particle also remembers its old location. A new location is chosen if one of the neighbors has achieved a better location, and all neighbors of that particle move towards the better location. This step is continued until all particles converge to the same location, or after a certain number of iterations [69].

Other nature-based approaches include an optimization algorithm modeled after the search technique used by honey bees while searching for nectar [30].

2.3.3 Learning based approaches

Motivation for using Machine Learning

Figure 2.1 CPU usage patterns of a MapReduce application (wordcount). Mean and variance of the resource usage distributions become recognizable characteristics of a particular MapReduce job.

MapReduce applications have been successfully used in processing large amounts of data. Subtasks of the same type within a job apply the exact same computation to input data. Tasks tend to be I/O bound, with resource usage a function of the size, rather than the content, of the input data. As a result, the resource usage patterns of a MapReduce job tend to be fairly predictable. For example, in Figure 2.1 we show the CPU time spent in user mode and kernel mode by Map tasks of a WordCount job. The figure shows the distribution of CPU usage for about 1700 Map tasks. As we can deduce from the figure, the resource usage of MapReduce applications follows recognizable patterns. Similar behavior is observed for other MapReduce applications and resource types. This, and the fact that the number of tasks increases with the size of the input data, present a unique opportunity for using learning-based approaches.

Stochastic Learning Automata

Stochastic Learning Automata have been used in load balancing [20, 55, 56]. Learning automata learn through rewards and penalties, which are awarded after successful and unsuccessful decisions respectively. The SLA tries to learn the best possible action in a given automaton state. Examples of actions are task migration, task cancellation, and task assignment. Every action is associated with a probability; the probability of allocating a task on an overloaded node is lower than the probability of allocation on an underloaded node. The goodness value of an action is a binary value indicating the success/failure outcome of the action. This value is transmitted to the scheduler by the host where the action has been performed. After the action's outcome is received, the probabilities are updated as below (a sketch follows the list):

1. Probabilities of successful actions are incremented with a reward update.

2. Probabilities of failed actions are decreased with a penalty update.

Initially, actions are chosen randomly. However, as time progresses and the consequences of a number of actions become known, only a few actions remain viable for each state.
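A sketch of one common update scheme, the linear reward-penalty automaton (our own illustration; the learning rates a and b are assumptions, as the surveyed work does not fix a particular update rule):

    def update(probs, chosen, success, a=0.1, b=0.1):
        """Linear reward-penalty update of the action probability vector.
        On success the chosen action gains probability mass; on failure
        it loses mass, redistributed over the other actions. The vector
        stays normalized."""
        n = len(probs)
        for i in range(n):
            if success:
                probs[i] = (probs[i] + a * (1 - probs[i]) if i == chosen
                            else probs[i] * (1 - a))
            else:
                probs[i] = (probs[i] * (1 - b) if i == chosen
                            else b / (n - 1) + probs[i] * (1 - b))
        return probs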

Decision Trees

Another popular classifier, the C4.5 decision tree, has also been applied to process scheduling in Linux [63]. Every process is associated with a feature vector containing a number of static and dynamic features. Analysis is done to determine the feature set that best minimizes the turnaround time of a process. The algorithm is evaluated on the Linux scheduler (version 2.4.20-8), and a number of classifiers are used in the evaluation; the C4.5 classifier is found to give the best results, and the k-nearest neighbor algorithm is also evaluated. The reduction in process turnaround time is due to a reduction in the number of context switches.


Bayesian Learning

Bayesian Learning has been used effectively in dealing with uncertainty in shared cluster environments [70, 71]. The authors used a Bayesian Decision Network (BDN) to handle the conditional dependence between the different factors involved in a load balancing problem. The Bayesian network consists of a number of nodes, typically one for each factor that might influence the eventual decision. Examples of such factors are machine resource information, machine load, and job properties. A set of decisions or actions is also predefined; examples of actions include task migration, task duplication, and task allocation or cancellation.

While making a decision, the decision variables are first fed with the current state of the system, and based on the conditional probabilities, the expected utility earned after each action is calculated. Utility is measured based on the value of an objective function, a typical example of which is the load distribution among nodes. An action is chosen so that the objective function is minimized. The current state of resources is also coupled with the predicted state of resources while making a decision; linear models [39] are used to predict the next state of resources. After making the decision, the state of resources and the value of the objective function are observed, and the conditional probability tables are updated accordingly. The BDN learns from one sample at a time, thus making the learning process incremental. The main motive behind using a BDN is to deal with the uncertainty involved in node resource information, node availability, and application behavior.

Dynamic Bayesian Networks [23] have been used for load balancing as well. The similarity between these approaches and ours is the use of Bayesian inference. However, whereas the authors in [71] used a BDN, we use a Naive Bayes classifier, in which all factors involved in making the decision are assumed to be conditionally independent of each other. Despite this assumption, Naive Bayes classifiers are known to work remarkably well [81], and as our results indicate, they can be applied effectively to task assignment as well. Compared to Bayesian networks, Naive Bayes classifiers are much simpler to implement. Further details about using the classifier are available in Chapter 4.


2.4 Existing Hadoop Schedulers

To better understand our approach and the limitations of current Hadoop schedulers, we now explain the key concepts involved in Hadoop scheduling.

Figure 2.2 Hadoop MapReduce Architecture

Hadoop MapReduce Architecture

Hadoop borrows much of its architecture from the original MapReduce system at Google [38]. Figure 2.2 depicts the architecture of Hadoop's MapReduce implementation. Although the architecture is centralized, Hadoop is known to scale well from small (single node) to very large (up to 4000 nodes) installations [12]. HDFS (the Hadoop Distributed File System), which is based on the Google File System [49], deals with storage, while MapReduce deals with computation.

Each MapReduce job is subdivided into a number of tasks for better granularity in task assignment. The individual tasks of a job are independent of each other, and are executed in parallel. The number of Map tasks created for a job is usually proportional to the size of the input; for very large inputs (of the order of petabytes), several hundred thousand tasks could be created [37].

2.4.1 Native Hadoop Scheduler

Heartbeat Mechanism

Scheduling in Hadoop is centralized and worker-initiated. Scheduling decisions are taken by a master node, called the JobTracker, whereas the worker nodes, called TaskTrackers, are responsible for task execution. The JobTracker maintains a queue of currently running jobs, the states of the TaskTrackers in the cluster, and the list of tasks allocated to each TaskTracker. Every TaskTracker periodically reports its state to the JobTracker via a heartbeat mechanism. The contents of a heartbeat message are:

• Progress reports of the tasks currently running on the sender TaskTracker.

• Lists of completed or failed tasks.

• State of resources - virtual memory, disk space, etc.

• A boolean flag (acceptNewTasks) indicating whether the sender TaskTracker should be assigned additional tasks. This flag is set if the number of tasks running at the TaskTracker is less than the configured limit.

Task or worker failures are dealt with by relaunching tasks. The JobTracker keeps track of the heartbeats received from the workers and uses them in task assignment. If a heartbeat is not received from a TaskTracker for a specified time interval, that TaskTracker is assumed to be dead; the JobTracker then relaunches all the tasks previously assigned to the dead TaskTracker that could not be completed. The heartbeat mechanism also provides a communication channel between the JobTracker and a TaskTracker: any task assignments are sent to the TaskTracker in the response to a heartbeat. The TaskTracker spawns each MapReduce task in a separate process, in order to isolate itself from faults due to user code in the tasks.
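A hypothetical representation of the heartbeat payload described above (the field names are our own illustration, not Hadoop's actual wire format):

    from dataclasses import dataclass, field

    @dataclass
    class Heartbeat:
        """Illustrative heartbeat contents, per the list above."""
        tracker_id: str
        task_progress: dict = field(default_factory=dict)   # task id -> progress
        completed_tasks: list = field(default_factory=list)
        failed_tasks: list = field(default_factory=list)
        free_virtual_memory: int = 0     # bytes
        free_disk_space: int = 0         # bytes
        accept_new_tasks: bool = False   # below the configured task limit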


Limiting task assignment

The administrator specifies the maximum number of Map and Reduce tasks that can simultaneously run on a TaskTracker (mapred.map.tasks.maximum and mapred.reduce.tasks.maximum in Hadoop's configuration files). If the number of tasks currently running on a TaskTracker is less than this limit, and if there is enough disk space available (a threshold that can also be configured in the same configuration files), the TaskTracker can accept new tasks. This limit must be specified before starting a Hadoop cluster. This mechanism makes some assumptions that we find objectionable:

• In order to set the limit correctly, the administrator must have detailed knowledge of the resource usage characteristics of the MapReduce applications running on the cluster. Deciding the task limit is even more difficult in cloud computing environments such as Amazon EC2, where the resources could be virtual.

• All MapReduce applications have similar resource requirements.

• The limit on the maximum number of concurrent tasks correctly describes the capacity of a machine.

Clearly, these assumptions do not hold in real-world scenarios, given the range of applications for which Hadoop is becoming popular [10]. As the above assumptions have been built into Hadoop, all the schedulers currently available with Hadoop (the default scheduler, the FAIR scheduler [80], the Capacity scheduler [5], and the Dynamic Priority scheduler [8]) suffer from this limitation.

Job Priorities

Hadoop has limited support for job priorities. Five job priorities are supported:

1. Very High

2. High

3. Normal


4. Low

5. Very Low

Job priority can be set at job launch time, or can be changed by the administrator while a job is running. Hadoop tries to schedule jobs according to an FCFS strategy, and picks the oldest jobs with the highest priority for execution.

Data Local Execution

Data locality and speculative execution are two important features of Hadoop's scheduling. Data locality is about executing tasks as close to their input data as possible. Hadoop tries to achieve rack-level data locality, in order to exploit the rack-based topology of typical data centers. A topology script gives the rack location of, and the distance between, the input split location and a node under consideration for task assignment. First, the default scheduler tries to assign a task to nodes whose distance to the input data is zero; the distance between a node and its input is zero if they both reside on the same machine. Next, nodes with distance one are considered; these are the machines within the same server rack. If no such machines are found, then Hadoop assigns a non-data-local task, i.e., a task whose input data is not on any of the machines in the same rack as the node running the task.
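A sketch of this preference order (our own illustration; the rack lookup performed by the topology script is abstracted into a rack_of mapping, and the task fields are assumptions):

    def locality_distance(node, split_host, rack_of):
        """0 = same machine, 1 = same rack, 2 = off-rack, following
        Hadoop's rack-level locality model."""
        if node == split_host:
            return 0
        if rack_of[node] == rack_of[split_host]:
            return 1
        return 2

    def pick_task(node, pending_tasks, rack_of):
        """Prefer node-local, then rack-local, then any task."""
        return min(pending_tasks,
                   key=lambda t: min(locality_distance(node, h, rack_of)
                                     for h in t["input_hosts"]))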

Speculative Execution

Speculative execution tries to rebalance load on the worker nodes and to improve response time by relaunching slow tasks on different TaskTrackers with more resources. In this mechanism, the Hadoop scheduler relaunches tasks that are progressing slowly compared to the other running tasks; slow tasks are duplicated on machines with free slots. This achieves better utilization when a job is about to end, reduces the job makespan by reducing the runtime of the slow tasks, and counters the effects of overload caused by multiple task assignments on better machines.


2.4.2 LATE Scheduler

The LATE scheduler [79] tries to improve the response time of Hadoop in multi-user environments by improving speculative execution. It relaunches the tasks expected to “finish farthest into the future”. To better accommodate different types of tasks, task progress is divided into zones. A user-defined limit is used to control the number of speculative tasks assigned to one node. The LATE scheduler achieves better response times, especially in heterogeneous cloud environments. We would like to point out that speculative execution increases the likelihood of node overload.

2.4.3 FAIR Scheduler

The FAIR scheduler by Zaharia et al. [80] has been optimized for multi-user environments where a single cluster is shared by a number of users. This kind of Hadoop usage is popular in companies that do a lot of data mining operations based on user logs. The FAIR scheduler has been designed to reduce the makespan of short jobs, which are found to be frequent in such large environments [51].

The scheduler chooses jobs from a set of pools, and tries to assign slots fairly across the pools. Task preemption is also supported in order to achieve fairness. The authors use a version of max-min fairness with a minimum slot allocation guarantee: if jobs from a pool have been over-assigned, tasks of those jobs are killed in order to free slots for jobs with less than their guaranteed allocations.

The FAIR scheduler uses delayed allocation to improve data locality on large clusters. While allocating a task, unlike the native Hadoop scheduler, which uses a best-effort philosophy, the FAIR scheduler delays the allocation of a task in the hope that a node with more data locality might ask for that task. If the job at the head of the job queue does not have a data-local task, then subsequent jobs with data-local tasks are chosen for assignment. If the head of the queue has not been getting a data-local task for a specific period of time, it is forcefully allocated. The authors calculate the expected gain in a job's response time from delay scheduling to be:

\[ E(gain) = (1 - e^{-w/t})(D - t) \]

where t is the mean time between allocation requests (heartbeats) from nodes, which arrive according to a Poisson process, D is the expected extension in the runtime of a non-data-local task compared to a data-local task, and w is the wait time for launching tasks locally.
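As a worked example under assumed values (wait threshold w = 5 s, mean heartbeat gap t = 2 s, locality penalty D = 20 s):

    import math

    w, t, D = 5.0, 2.0, 20.0   # assumed values, in seconds
    gain = (1 - math.exp(-w / t)) * (D - t)
    print(round(gain, 1))      # ~16.5 s expected response-time gain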

2.4.4 Capacity Scheduler

The Capacity Scheduler [5] is another scheduler designed for the sharing of large clusters. The scheduler defines a concept of queues, which are created by system administrators for submitting jobs. Each queue is guaranteed a fraction of the capacity (number of task slots) of the entire cluster. All jobs submitted to a given queue have access to the resources guaranteed for that queue. Within a single queue, jobs are selected based on priorities: higher-priority jobs are selected before lower-priority ones. However, lower-priority jobs are not preempted for higher-priority ones, which could result in priority inversion within a queue.

If there are multiple queues and only a subset of them have running jobs, there is a facility to allocate more than the guaranteed capacity to the subset of active queues. If and when the inactive queues also receive job submissions, their lost capacity is reclaimed; for this, tasks of jobs in the queues with excess capacity are killed.

The scheduler also supports resource-aware scheduling for memory-intensive jobs. A job can optionally indicate that it needs nodes with more memory; tasks of such a job are only allocated to nodes that have more than the requested amount of free memory.

Whenever a request for a task comes from a TaskTracker, the scheduler chooses the queue that has the most free capacity, i.e., the queue whose ratio of the number of running tasks to the number of guaranteed tasks is the lowest. Per-user task quotas are also supported: jobs are usually selected in FIFO order, but only if the quota of the job's user has not been reached; otherwise the next job in the queue is chosen.
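A sketch of this queue selection rule (our own illustration; the queue fields are assumptions):

    def pick_queue(queues):
        """Choose the queue with the lowest ratio of running tasks to
        guaranteed task slots, i.e., the most unused guaranteed capacity."""
        return min(queues, key=lambda q: q["running"] / q["guaranteed"])

    queues = [{"name": "etl", "running": 30, "guaranteed": 40},
              {"name": "adhoc", "running": 5, "guaranteed": 20}]
    print(pick_queue(queues)["name"])   # "adhoc" (0.25 < 0.75)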


2.4.5 Dynamic Priority Scheduler

The Dynamic Priority scheduler [8], based on the Tycoon [57] system, uses a market-based approach for task assignment. Each user is given an initial amount of virtual currency. While submitting a job, users can declare a certain spending rate per unit time, which indicates the amount of money the user is willing to spend in a given time. The scheduler chooses the jobs that earn it the most money, i.e., the jobs whose users are offering the highest spending rates. Users can adjust their spending rates in order to change the priority of their jobs, hence the name dynamic priority scheduler. After a task allocation, an amount equivalent to the effective spending rate is deducted from the account of the user. If the account balance of a user reaches zero, no further tasks of that user are assigned. This shifts the onus of properly prioritizing jobs onto users instead of system administrators.

The FAIR scheduler [80], the Capacity scheduler [5], and the Dynamic Priority scheduler [8] try to achieve fairness, guaranteed capacity, and adjustable priority-based scheduling, respectively. Hadoop on Demand [9] tries to use existing cluster managers for resource management in Hadoop. It should be noted that these schedulers concentrate either on policy enforcement (fairness, for example) or on improving the response time of jobs from the perspective of the users. Our work differs from these approaches in that we allow a service provider to plug in his/her own policy scheme while maintaining a specified level of utilization. Also, as discussed in Section 2.4.1, all of these schedulers allocate tasks only if there are fewer than the maximum allowed tasks (a limit set by the administrator) running on a worker machine. Our scheduler, on the other hand, assigns tasks as long as an additional task is unlikely to overload a worker machine.

2.5 Summary

In this chapter we have attempted to give the reader a brief overview of the vast, extremely interesting and still actively researched field of Grid resource management. We introduced the concepts and terms involved in grid resource management, and tried to present a thorough literature survey. We also discussed the existing schedulers for Hadoop and their pitfalls, which led to our work on admission control (Chapter 3) and task assignment (Chapter 4).


Chapter 3

Learning Based Admission Control

This chapter presents the learning-based admission control algorithm designed for MapReduce clusters. We begin by introducing the context: offering MapReduce jobs as on-demand services. After describing the model, we move on to explaining the role of utility functions and the mechanisms for establishing service contracts. Next, we present the admission control algorithm, and finally evaluate it against baseline approaches by comparing performance on a number of important parameters.

3.1 Service Oriented MapReduce

Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) are the three key paradigms that enable cloud computing. In these models, software applications, software platforms and infrastructure are provided to users in the form of on-demand services, and users are charged according to a pay-per-use model. MapReduce, which has become a popular paradigm for large-scale data processing in the Cloud, is usually associated with the PaaS paradigm, where the service provider offers a ready-to-use MapReduce cluster in which users can run their jobs. An example of such a platform is the Amazon Elastic MapReduce service [2], where users can provision Hadoop clusters on the fly and perform data-intensive computation by providing the implementations of the Map and Reduce components, processing data that is hosted in Cloud-based storage services.


Figure 3.1 The architecture of MapReduce as a Service. Our model is based on the Hadoop open source MapReduce framework.

3.1.1 The Model

We propose a model that brings MapReduce into the SaaS paradigm. In our model, service providers offer a set of MapReduce applications as Web Services. To carry out data-intensive operations, users search a repository of registered services and select a service that performs the desired operation. The service repository thus takes the form of an online marketplace where users can choose from a range of MapReduce services. A market is beneficial to users, as service providers are forced to provide better service at cheaper rates in order to overcome competition. Figure 3.1 shows the architecture.

MapReduce in the SaaS paradigm has the following benefits from the users' perspective:

• Users can choose from a wide range of available applications to perform their computations, without having to invest in the development of such applications.

• Users do not have to deal with establishing and maintaining MapReduce clusters, thus saving operational costs.

• Users can combine MapReduce services to form a data processing pipeline, where each unit in the pipeline could be offered by a different service provider, thus allowing the user to form a ‘mashup’ service.


After selecting a service, users interact with it only through the web service interfaces exposed by the service provider. This helps hide implementation details from the users, which could be beneficial to the service provider in cases where exposing application implementation details is not preferred. Messages exchanged between the user and the services are based on familiar transport mechanisms, for example XML or JSON over HTTP. Users specify the input data location in the cloud, metadata describing the computation to be performed on the input data, and contractual service demands as parameters to the web service. Output data can either be obtained directly as a response from the service or be stored in the cloud; the latter approach is useful if the user desires to perform further computation by means of another service instance.

After accepting a request, the service provider launches a MapReduce job corresponding to the request in her cluster, which is hosted entirely in the cloud. The cloud in this case can be a private cloud that emulates cloud computing on privately owned infrastructure, or it can be a public cloud computing offering such as Amazon EC2 [1], GoGrid, or the Rackspace Cloud. We assume that distinct requests are independent of each other, and thus can be completed in parallel. To increase revenue, the service provider processes multiple requests simultaneously, multiplexing job execution in the MapReduce cluster to achieve better resource utilization. The computational resources of the cluster are shared proportionately among the users; the proportion of resources allocated to a user's request depends upon the utility earned by the service provider after completing that request.

3.1.2 Usage examples

Having discussed the model, we now present a few use cases of MapReduce as a service:

Ad-hoc querying on large datasets -

Consider a scenario where an online movie rental company offers a large anonymized data set of its users' order history (of the order of a few TBs) for analysis. The company can also offer several ready-to-use operations such as data selection, filtering, joins, and pre-processing, in the form of MapReduce jobs. Users can chain several such operations to extract useful informative patterns from the order data, such as the set of genres ordered by users in a particular age group. Users do not have to own the data set, and they can reuse components developed by the service provider to extract the desired information.

On demand crawling and indexing of web information sources -

A service provider could allow users to submit a list of seed URLs to be crawled using a domain-specific crawling algorithm developed by the service provider. The crawling engine would utilize MapReduce jobs for distributing the workload across multiple nodes. The service provider could also provide initial preprocessing utilities in the form of MapReduce jobs, such as jobs for extracting images, their alt-text, and the surrounding text from crawled web pages. Users pay the provider only for the resources consumed during their crawl process, such as the network bandwidth and the disk space consumed by the crawled data.

Document Format Conversion Service -

In this example, users submit a list of documents stored in the cloud to the service provider, along with the desired output format for the documents. The service provider can then offer MapReduce jobs that convert the documents into the desired file format. Example applications include on-demand video and audio conversion, and generating thumbnails from video files.

3.1.3 Utility Functions: Expressing User Expectations

In the MapReduce as a service model, users only pay for the share of resources consumed by their computation. Besides the demand for correctness of the computation, the deadline for performing it is also an integral part of users' expectations about the quality of service. Thus, the price that the user is willing to pay for the resources and the deadline that the service provider agrees to constitute the service contract in this model. The user and the service provider must negotiate and mutually agree upon this contract. We do not address the problem of price determination in this chapter; auctioning mechanisms such as the Dutch auction or the English auction could be used effectively for judging the value of a service.

Users specify utility functions that indicate the price they are willing to pay as a function of the time taken to complete the service request. We extend the generic three-phase utility functions proposed in [22, 32, 50, 65]. In this framework, users specify a soft deadline and a hard deadline. If the request completes before the soft deadline, the user pays the complete amount he/she agreed upon before submitting the request. After the soft deadline, the utility from the perspective of the user degrades until the hard deadline, after which the user is no longer interested in the outcome of the request and is unwilling to pay for completion of the service. The decay in utility could be linear, or the rate of decay could vary with the time passed since the soft deadline. The following set of parameters captures the set of utility functions that exhibit this behavior.

Formally, utility can be expressed as a function of time:

\[
U(t) =
\begin{cases}
U_0 & \text{if } 0 < t \leq T_1 \\
U_0 - \alpha (t - T_1)^{\beta} & \text{if } T_1 < t \leq T_2 \\
U_P & \text{if } t > T_2
\end{cases}
\]

where t = 0 is the time when a service request is accepted. U0 is the initial utility that the user is willing to pay if the request is completed before the soft deadline T1, after which the utility decays until the hard deadline T2. Users can control the values of the decay parameters α and β. Finally, UP gives the utility that the user is willing to pay after the hard deadline. A negative value of UP implies a penalty to be incurred by the service provider for failing to meet the hard deadline. If UP is zero after T2, the user is no longer interested in the outcome of the service, and thus will not pay any charges to the provider; the provider is then free to cancel the request.

The values of the decay parameters α and β represent the user's interest in the outcome of the service request. A value of β = 1 gives a linear degradation in the utility if the job is not completed within the soft deadline; similarly, a value of β = 0 indicates a sharp drop in the user's interest if the soft deadline is missed. Decay functions for various values of α and β are shown in Figure 3.2.
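A direct transcription of U(t) as code (a sketch; the parameter values in the example are arbitrary):

    def utility(t, u0, t1, t2, alpha, beta, up):
        """Three-phase utility: constant until the soft deadline t1,
        decaying as u0 - alpha*(t - t1)**beta until the hard deadline
        t2, and equal to up afterwards."""
        if t <= t1:
            return u0
        if t <= t2:
            return u0 - alpha * (t - t1) ** beta
        return up

    # Example: 100-unit job, soft deadline 60 min, hard deadline 90 min,
    # accelerating decay (beta > 1), 20-unit penalty past the hard deadline.
    for t in (30, 75, 100):
        print(t, utility(t, u0=100, t1=60, t2=90, alpha=0.1, beta=2, up=-20))
    # prints 100, 77.5, and -20 respectively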

The next section describes the need for admission control algorithms in the MapReduce as a Service model, and presents our proposed algorithm.


Figure 3.2 Utility Functions for different values of decay parameters.

3.2 Learning Based Admission Control

We attempt to solve the problem of admission control for Hadoop, which is a leading open source framework for MapReduce. We briefly mention the architecture of Hadoop MapReduce, and then proceed to our algorithm. First, let us consider the need for an admission control algorithm.

In our model, a service provider processes multiple requests simultaneously by multiplexing job execution in the cluster. Resources in the cluster are shared proportionately among the requests, and these proportions are decided by the utility that the service provider expects to earn after the successful completion of a request. As a result, it becomes necessary to accept incoming jobs judiciously, so that they do not affect the performance of already running jobs. Admission control also helps to prevent the overloading of resources in the cluster. As the cluster is hosted in the cloud, its resources could be scaled on demand using auto-scaling capabilities. However, even if an auto-scaling facility is available, admission control can still prove viable, because the rate of arrival of new requests could be much higher than the rate of commissioning new nodes in the cluster.


3.2.1 Recap: Hadoop Architecture

Let us briefly recapitulate the Hadoop architecture in order to better understand our approach. Hadoop's MapReduce implementation borrows much of its architecture from the original MapReduce system at Google [38]. Figure 3.3 depicts the architecture of Hadoop's MapReduce implementation. Although the architecture is centralized, Hadoop is known to scale well from small (single node) to very large (up to 4000 nodes) installations [12].

Scheduling decisions are taken by a master node (the JobTracker), whereas the worker nodes (TaskTrackers) are responsible for task execution. The JobTracker keeps track of the heartbeat messages received periodically from the TaskTrackers and uses the information contained in them while assigning tasks. If a heartbeat is not received from a TaskTracker for a specified time interval, the TaskTracker is assumed to be dead; in such a case, the JobTracker re-launches all the incomplete tasks previously assigned to the dead TaskTracker. Task assignments are sent to a TaskTracker as a response to its heartbeat message. The TaskTracker spawns each MapReduce task in a separate process, in order to isolate itself from faults due to user code in the tasks.

For a detailed description of the Hadoop architecture, please see Chapter 2.

3.2.2 The Algorithm

The administrator specifies the maximum numbers of Map and Reduce task slots, which control the number of simultaneously running tasks on a TaskTracker. Jobs compete for task slots in the cluster, and it is the responsibility of the scheduler to allocate the slots properly, so that jobs do not suffer from starvation and receive their fair share of the resources in the cluster.

The admission controller runs at the master (JobTracker) node in the MapReduce cluster. Although user requests for services can arrive asynchronously, the algorithm considers them for admission only at fixed points in time; the time interval between two such admission points is referred to as an admission interval. Job requests that arrive during an admission interval are maintained in the queue of candidate jobs. The algorithm takes this queue as input, and admits at most one job for execution in the cluster. All other requests are rejected and are not considered for further processing. Users are notified whether their requested services have been accepted or rejected. Figure 3.4 summarizes the admission control block.

Figure 3.3 Architecture of MapReduce in Hadoop.

To decide whether, and which, request to accept, we use the Expected Utility Hypothesis from decision theory. This hypothesis states that, given a set of choices with varying payouts and likelihoods of those payouts, a rational agent always prefers the option that maximizes the agent's expected utility. Applying this principle to the problem of selecting a job to be admitted, the algorithm chooses the job that maximizes the expected utility from the perspective of the service provider. Formally,

\text{Selected job} = \arg\max_j \, U_j \times P(J = \text{Success} \mid E)

where U_j is the utility of the job as calculated from the utility function agreed upon by the user and the service provider in their service contract. While making the comparison, we consider only the utility that will be earned if the job is completed before the soft deadline specified by the user. J = Success denotes the event that admitting the job turns out to be successful according to the success criteria dictated by the service provider. The probability P(J = Success | E) is conditional on the current state of the resources in the cluster, E.

Figure 3.4 Admission Controller.

The admission controller uses knowledge accumulated from past admission control decisions to predict the outcome of admitting each candidate job. To achieve this, we compute the posterior probability P(J = Success | E) using Bayes' Theorem:

P(J = \text{Success} \mid E) = \frac{P(E \mid J = \text{Success}) \times P(J = \text{Success})}{P(E)}

The above equation forms the foundation of learning in our algorithm. The algorithm uses the results of decisions made in the past to make the current decision. This is achieved by keeping track of past decisions and of their outcomes in the form of conditional probabilities.

The denominator P(E) in the above equation is independent of the candidate jobs and can safely be ignored as a constant while comparing them. For each job in the list, we estimate the probability of future success as well as future failure. A job is rejected if the likelihood of failure is greater than that of success. If all jobs are likely to fail, none of them is admitted. In other words, we classify the candidate jobs into potentially successful and potentially unsuccessful jobs, and then select the job that provides maximum utility from the set of potentially successful jobs. Figure 3.4 summarizes this process.

We thus select the job that maximizes the following quantity:

U_j \times P(E \mid J = \text{Success}) \times P(J = \text{Success})
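
A minimal sketch of this select-or-reject step is given below. The interface and method names are ours, and the probabilities are assumed to come from the learned model described above.

    import java.util.List;

    // Illustrative selection rule: filter out jobs more likely to fail than to
    // succeed, then pick the survivor maximizing U_j * P(J = Success | E).
    // Returns null when every candidate is predicted to fail (admit nothing).
    public final class ExpectedUtilitySelector {

        public interface Candidate {
            double utilityBeforeSoftDeadline(); // U_j from the service contract
            double successProbability();        // estimate of P(J = Success | E)
            double failureProbability();        // estimate of P(J = Failure | E)
        }

        public Candidate select(List<Candidate> candidates) {
            Candidate best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Candidate c : candidates) {
                if (c.failureProbability() > c.successProbability()) {
                    continue; // potentially unsuccessful: never admitted
                }
                double score = c.utilityBeforeSoftDeadline() * c.successProbability();
                if (score > bestScore) {
                    best = c;
                    bestScore = score;
                }
            }
            return best;
        }
    }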

The state of the environment E comprises a number of factors describing the state of cluster resources, such as the cluster load, the number of pending tasks currently in the cluster, the rate at which tasks are being completed, etc. We also extend the state of resources by including in it properties of the job request, such as the size of the request, the mean run times observed in the past for similar requests, etc. The list of factors that we consider while making a decision is given below. All the factors are chosen based on their anticipated effect on the result of an admission control decision.

Used map slots - Ratio of the number of map tasks currently running to the maximum allowed number of concurrent tasks in the MapReduce cluster. This parameter quantifies the availability of resources. A value less than one indicates resource availability, and means that new requests are more likely to be completed, whereas a value greater than one indicates more contention among jobs for resources.

Used reduce slots - Ratio of the number of reduce tasks currently running to the maximum allowed number of concurrent tasks in the MapReduce cluster.

Pending maps - Number of map tasks currently waiting for slots to be allocated. This parameter quantifies the pending map workload.

Pending reduces - Number of reduce tasks currently waiting for slots to be allocated.


Finishing jobs - Number of jobs that are about to finish, i.e., jobs with very few pending tasks. If the value of this parameter is high, a newly accepted job is expected to have sufficient resources for its execution, as currently running jobs will finish soon and release resources. For our experiments we considered jobs with more than 85% of their tasks completed as finishing jobs.

Map time average - Moving average of map task runtimes. This denotes the rate at which map tasks are being completed. It is also an indication of the nature of the jobs currently running in the cluster.

Reduce time average - Same as above, but for reduce tasks.

Load - Ratio of the number of tasks waiting to be assigned a slot to the maximum number of slots.

Job maps - Number of map tasks in the candidate job. The value of this parameter depends on the size of the input data. This and the following parameters are job specific, and may be different for each candidate job.

Job reduces - Number of reduce tasks in the candidate job.

Mean map time - Mean map task runtime observed for this job in its past runs.

Mean reduce time - Same as above, but for reduce tasks of the job.
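
Taken together, the decision state can be represented as a flat record such as the following sketch. The field names are ours, but every value listed is obtainable from the JobTracker.

    // Illustrative snapshot of the environment E at an admission point,
    // combining cluster-wide state with properties of one candidate job.
    public class AdmissionFeatures {
        // Cluster state
        double usedMapSlotRatio;     // running map tasks / max concurrent map tasks
        double usedReduceSlotRatio;  // running reduce tasks / max concurrent reduce tasks
        int    pendingMaps;
        int    pendingReduces;
        int    finishingJobs;        // jobs with > 85% of their tasks complete
        double mapTimeMovingAvg;     // moving average of map task runtimes
        double reduceTimeMovingAvg;  // moving average of reduce task runtimes
        double load;                 // waiting tasks / max slots

        // Candidate-job state
        int    jobMaps;
        int    jobReduces;
        double meanMapTime;          // from past runs of the same job
        double meanReduceTime;
    }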

Given all these parameters, the quantity P(E | J = Success) thus becomes:

P(E \mid J = \text{Success}) = P(e_1, e_2, e_3, \ldots, e_n \mid J = \text{Success})

where e_1, e_2, \ldots, e_n are the factors constituting the state of the environment E.

We assume that the probabilities of these factors are conditionally independent of each other (the Naive Bayes assumption). Thus,

P(E \mid J = \text{Success}) = \prod_{j=1}^{n} P(e_j \mid J = \text{Success})

Service providers predefine the criteria for success or failure of a job. For example, a service provider could specify that any new admission that results in overloading the resources of the cluster beyond a specified threshold will be considered a failure. Success and failure rules are used to validate a decision based on its effects, so validation rules cannot be applied until data about the impact of the decision is available. The results of these validations are sent as feedback to the admission controller. Upon receiving the feedback, the algorithm updates its probabilities so that mistakes made by the algorithm, if any, are not repeated in the future.
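
The learning loop thus reduces to two operations: scoring a candidate against the current conditional probability tables, and updating those tables when feedback arrives. The sketch below illustrates this; it assumes the feature values are discretized into bins and uses add-one smoothing, both of which are implementation choices of this sketch rather than details prescribed here.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of a feedback-trained Naive Bayes model: per-feature
    // counts give P(e_i | outcome), and each validated decision updates them.
    public class SuccessModel {
        private final Map<String, long[]> counts = new HashMap<>(); // key -> {success, failure}
        private long successes = 1, failures = 1; // add-one prior on the class

        private long[] cell(int feature, int bin) {
            return counts.computeIfAbsent(feature + ":" + bin, k -> new long[] {1, 1});
        }

        // Unnormalized score proportional to P(outcome) * prod_i P(e_i | outcome).
        private double score(int[] bins, boolean success) {
            long classTotal = success ? successes : failures;
            double s = (double) classTotal / (successes + failures);
            for (int i = 0; i < bins.length; i++) {
                long[] c = cell(i, bins[i]);
                s *= (double) (success ? c[0] : c[1]) / classTotal;
            }
            return s;
        }

        // Predicted probability of success given the discretized environment.
        public double pSuccess(int[] bins) {
            double good = score(bins, true), bad = score(bins, false);
            return good / (good + bad); // the common denominator P(E) cancels out
        }

        // Feedback produced by the service provider's success/failure rules.
        public void update(int[] bins, boolean wasSuccess) {
            if (wasSuccess) successes++; else failures++;
            for (int i = 0; i < bins.length; i++) {
                cell(i, bins[i])[wasSuccess ? 0 : 1]++;
            }
        }
    }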

It is possible that an admission decision adversely affects the makespan of already running jobs. However, the decision will be considered invalid only if it violates the success or failure criteria set by the service provider. Service providers could define success-failure criteria that take the effect on the makespan of other jobs into account as well.

Our algorithm is greedy, as we choose the job that seems to provide maximum utility among the immediately available choices. It is also opportunistic, as we are willing to suffer some degradation in the performance of existing jobs if the newly admitted job offers more utility than the utility gained from those already executing jobs.

3.3 Evaluation and Results

To verify the efficacy of our algorithm, we simulated the Hadoop MapReduce architecture and compared the behavior of our algorithm against the following baseline approaches:

• Myopic - In this approach, the job with the maximum initial utility is accepted, without any other considerations.

• Random - A job is admitted randomly from the given set of candidate jobs. The set of jobs is appended with a null value to simulate job rejection.

3.3.1 Simulation Settings

In our simulation model, the properties of a job are distributions specifying the runtimes of its map and reduce tasks. To model the distribution of runtimes, we extracted and studied traces of real-world MapReduce jobs run on actual Hadoop clusters. We observed that the map runtimes of a particular job follow a Normal distribution, with the mean and standard deviation being characteristic of the job. Similarly, for reduce tasks, the runtimes of the Sort, Shuffle and Reduce phases also followed Normal distributions.

Based on these observations, a map task modeled in our simulation occupies a slot for a random amount of time drawn from a Normal distribution that is characteristic of the job. Similarly, each of the three phases of a reduce task occupies a slot in accordance with Normal distributions that are again properties of the job. Our simulation does not model task failures, as the utility is earned only after successful completion of a job request; it is thus the responsibility of the service provider to make sure that all accepted jobs are executed successfully, irrespective of individual task failures. We only use information that can be obtained through the JobTracker in Hadoop, as the JobTracker provides a unified view of the MapReduce cluster. All the parameters mentioned in Section 3.2.2 can be obtained directly from the JobTracker itself. Figure 3.5 lists the simulation parameters and the distributions used in generating simulation events.

For comparing results across different runs, we keep the pseudo-random distribution parameters constant between runs. All values reported in the results are averaged over 10 independent runs, unless otherwise specified.

3.3.2 Algorithm Correctness

To verify whether the admission controller is able to accept or reject jobs so as to maintain the overload threshold specified by the service provider, we measured the actual load average observed in a simulation run and compared it against the desired load average set by the service provider. Figure 3.6 summarizes the results of these experiments.

As we can see in the plot (Figure 3.6), the achieved load average is fairly close to the desired load average. Further, the error rate is independent of the desired load average value. The errors may arise as a result of the Naive Bayes assumption made while computing posterior probabilities.


Parameter                        Description
Job arrival distribution         Exponential
Job arrival rate (λ)             5 minutes
JobTracker heartbeat interval    3 seconds
Admission interval               3 minutes
JobTracker map slots             50
JobTracker reduce slots          20
Job map size                     Uniform Random (51, 100)
Job reduce size                  (Job map size) / 10
Simulation time                  500 minutes
Decay parameters (α and β)       α = 1, β = 1
Soft deadline (T1)               Time taken when all map tasks are executed in parallel + time taken when all reduce tasks are executed in parallel
Hard deadline (T2)               Time taken when only one task of the job is executed at a time

Figure 3.5 Simulation Parameters

3.3.3 Comparison with Baseline Approaches

Next, we compare the performance of our learning admission control algorithm with the two baselines specified at the beginning of this section.

First, we compare the mean load averages observed under our algorithm, Myopic admission, and Random admission. For this set of experiments, we kept the overload threshold at 100%. In other words, our admission controller rejected all jobs that were predicted to push the cluster load over 1.0. Figure 3.7 shows the results.

As can clearly be seen in Figure 3.7, our admission control algorithm is very effective in preventing overload. This establishes the correctness of our algorithm and supports our argument for the necessity of sophisticated admission control in MapReduce.

3.3.4 Meeting Deadlines

Figure 3.6 Achieved and expected load ratio

Algorithm          Achieved Load Average
Random             42.11
Myopic             42.09
Our algorithm       0.97

Figure 3.7 Comparison of Achieved Load Averages

The next set of experiments verifies the ability of our algorithm to meet user deadline guarantees. For this set of experiments, the values of the decay parameters α and β were both set to 1, making the decay rate linear. The soft deadline (T1) in our case is the runtime of the job if all of its tasks are executed simultaneously. The hard deadline (T2) is double the soft deadline. To compare the algorithm with the baseline approaches, we calculate the percentage of jobs that complete before the soft deadline and the percentage that complete after it. We can see in Figure 3.8 that our algorithm is able to meet user QoS requirements in most cases, whereas the baseline approaches cause job runtimes to exceed the soft deadline in most cases.


Figure 3.8 Performance while meeting user deadlines

Figure 3.9 Achieved Load Average with load cap

3.3.5 Performance with Load Cap

Next, we examine how the algorithms fare with an additional load cap enforced at the time of job admission. In this setting, a job is admitted only if the current load of the cluster is below a certain threshold. The load cap is enforced in each of the algorithms, and their performance is presented in Figure 3.9. With an additional load cap we are adding a reactive constraint, since the current load on the cluster is a result of previous job submissions. However, a load cap ensures that a job is not added in case of overload even in the naive approaches.


As we can see in Figure 3.9, the MEU (Maximum Expected Utility) algorithm fares better even with a load cap. Compared to the results in Section 3.3.3, the other two approaches achieve significantly lower load averages. The performance of the MEU algorithm, however, is almost the same despite the additional load cap, demonstrating the efficacy of a learning approach. With the load cap, the other two approaches overload the cluster by more than 15%, whereas the load average of MEU (97%) is very close to the desired utilization.

3.3.6 Job Response Times

Figure 3.10 Comparing mean job runtimes

Next, we compare the runtimes achieved by our admission control algorithm with the baseline approaches. For this set of experiments, we kept the utility of the jobs linearly proportional to job sizes. Figure 3.10 shows the mean runtimes achieved by the MEU, RAND and MYOPIC algorithms respectively. As we can see from the figure, our algorithm achieves lower runtimes than the other two approaches. This is very beneficial to end users, as their main motivation is to finish their jobs as soon as possible. The MEU algorithm achieves an improvement of 8.7% over random admission, and 7.8% over the myopic algorithm. It should be noted that this reduction in response time is achieved without having to overload resources in the cluster.

Figure 3.11 Comparing runtime distribution

Another way to compare the overall runtimes achieved by all three algorithms is to study the distribution of job runtimes. Figure 3.11 shows the histograms for MEU, RAND and MYOPIC. As we can see in the distributions, the number of jobs with short runtimes (runtime ≤ 25 minutes) under MEU is much higher than under the other two approaches. Another interesting observation is that in the case of MEU very few (≤ 1%) jobs end up running for longer durations. This shows that our algorithm consistently achieves better response times. As discussed earlier, achieving better response times is crucial for user satisfaction.


3.3.7 Job Arrival Rates

Figure 3.12 Effect of Job Arrival Rate (λ) on Job Acceptance

It is important to study the behavior of the algorithm under different levels of demand for access to the cluster. For this purpose, we vary the job arrival rate, i.e., the λ parameter of the exponential distribution. We increase the value of λ from 0.5 to 15 in steps of 0.5. A small value of λ results in a higher job submission rate, producing many candidate job submissions. The main motivation behind this experiment is to see whether the algorithm can cope with heavy request traffic, and whether it accepts more jobs in case of low demand to meet the desired utilization.

We plot the percentage of jobs accepted for each value of λ in Figure 3.12. The results are mean values calculated over 10 independent runs of the simulation. As we can see in the figure, the job acceptance rate increases steadily with increasing inter-job arrival time. This shows that with fewer submissions, the algorithm is willing to accept more jobs in order to maintain the specified level of utilization. This is important from the perspective of service providers, as they want to serve multiple simultaneous requests in order to achieve better revenues from the service.

3.3.8 Effect of Utility Functions


Figure 3.13 Effect of Utility Function on Job Acceptance

In the final experiment, we study the effect of, and the need for, utility functions. As mentioned earlier in Section 3.1.3, utility functions are instruments for users and service providers to control job admission, and the proper choice of a utility function can significantly impact admission decisions. To study the effect, we compare the behavior of the algorithm in the following two cases (both are sketched in code after the list):

1. Linear proportion: In this case, the job utility is linearly proportional to the job size. Formally, U(J) = s, where s is the size of the job, i.e., the total number of map and reduce tasks in the job.

2. Exponential proportion: In this case, the job utility is exponentially proportional to the job size. Specifically, U(J) = a^s, where a is some integer constant. During our experiment, we kept a = 2.
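
For illustration, the two utility functions can be written as follows (the class and method names are ours):

    // Sketch of the two utility functions compared in this experiment, where
    // the job size s is the total number of map and reduce tasks.
    public final class SizeUtilities {
        // Case 1: linearly proportional utility, U(J) = s.
        public static double linear(int size) {
            return size;
        }

        // Case 2: exponentially proportional utility, U(J) = a^s, with a = 2.
        // Returned as a double, since a^s overflows a long for even modest jobs.
        public static double exponential(int size) {
            return Math.pow(2, size);
        }
    }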

The resulting distributions of job sizes accepted by the algorithm are shown in Figure 3.13. We compare job size distributions because, in the absence of a utility function, our algorithm always tends to accept smaller jobs, as they are the ones least likely to overload the cluster. Hence, observing job sizes can confirm whether the choice made by the algorithm is being affected by the utility function or not.

As we can see in the figure, with linearly proportional utility the size distribution is slightly skewed towards smaller job sizes. With exponentially proportional utility, however, the distribution is almost even, with larger jobs also being accepted. Note that, from the perspective of a service provider, smaller jobs are always preferable, as they finish quickly and there is a higher likelihood of meeting their QoS requirements; for larger jobs, uncertain factors such as component failures and fluctuations in resource availability play a more important role. Thus, the algorithm behaves according to the needs of the service provider. However, as the figure shows, this behavior can be adjusted by the proper choice of utility functions.

3.4 Summary

We presented a learning based admission control algorithm specifically targeted at MapReduce clusters. The concept we exploited has its roots in decision theory, and the idea can be applied to the more generic case of web service admission control as well. Our results validated the use of online learning in making single-choice decisions. Furthermore, the algorithm we proposed fared better against users' as well as system administrators' expectations on a number of criteria when compared to the baseline approaches. We use a similar approach, with some variations, to tackle task assignment in MapReduce as well, which is the focus of the next chapter.


Chapter 4

Task Assignment in MapReduce

This chapter presents the learning scheduler for Hadoop, developed as a part of this thesis. The scheduler source code is available for download from https://code.google.com/p/learnsched/. The scheduler uses the pluggable scheduling API introduced in Hadoop versions 0.19 and later. We begin by explaining the scheduling algorithm used, then move on to describing the implementation details of the scheduler. Finally, we present scheduler evaluations under a number of test cases that demonstrate the benefits of the learning based approach.

4.1 LSCHED: Learning Scheduler for Hadoop

Having seen the scheduling mechanism of Hadoop in Chapter 2, we explain our task assignment algorithm in this section. Our algorithm runs at the JobTracker. Whenever a heartbeat from a TaskTracker is received at the JobTracker, the scheduler chooses a task from the MapReduce job that is expected to provide maximum utility after successful completion of the task. Figure 4.1 depicts the task assignment process.

First, we build a list of candidate jobs. For each job in the scheduler's queue, one candidate instance for the Map part and one (or zero, if the job does not have a reduce part) for the Reduce part is added to the list. This is done because the resource requirements of Map and Reduce tasks are usually different.


Figure 4.1 Task assignment using pattern classification. Evaluation of the last decision and classification for the current decision are done asynchronously.

We then classify the candidate jobs into two classes, good and bad, using a pattern classifier. Tasks of good jobs do not overload resources at the TaskTracker during their execution. Jobs labeled bad are not considered for task assignment. If the classifier labels all the jobs as bad, no task is assigned to the TaskTracker.

If, after classification, there are multiple jobs belonging to the good class, then we choose the task of the job that maximizes the following quantity:

E.U.(J) = U(J) \, P(\tau_J = \text{good} \mid F_1, F_2, \ldots, F_n) \qquad (4.1)

where E.U.(J) is the expected utility, U(J) is the value of the utility function associated with the MapReduce job J, τ_J denotes a task of job J, and P(τ_J = good | F_1, F_2, ..., F_n) denotes the probability that the task τ_J is good. The probability is conditional upon the feature variables F_1, F_2, ..., F_n, which are described in more detail later in this section.

Once a job is selected, we first try to schedule a task of the job whose input data is locally available on the TaskTracker. Otherwise, we choose a non-data-local task. This policy is the same as that used by the default Hadoop scheduler.

We assume that the cluster is dedicated to MapReduce processing, and that the JobTracker is aware of and responsible for every task execution in the cluster. Our scheduling algorithm is local, as we consider the state of only the concerned TaskTracker while making an assignment decision. The decision does not depend on the state of the resources of other TaskTrackers.

We track the task assignment decisions. Once a task is assigned, we observe its effect from the information contained in the subsequent heartbeat from the same TaskTracker. If, based on this information, the TaskTracker is overloaded, we conclude that the last task assignment was incorrect. The pattern classifier is then updated (trained) to avoid such assignments in the future. If, however, the TaskTracker is not overloaded, the task assignment decision is considered successful.

Users configure overload rules based on their requirements. For example, if most of the submitted jobs are known to be CPU intensive, then CPU utilization or load average could be used to decide node overload. For jobs with heavy network activity, network usage can also be included in the overload rule. In a cloud computing environment, only those resources whose usage is billed could be considered in the overload rule; for example, where conserving bandwidth is important, an overload rule could declare a task allocation incorrect if it results in more network usage than the limit set by the user.

The overload rules supervise the classifiers, but as this process is automated, the learning in our algorithm is automatically supervised. The only requirement for an overload rule is that it can correctly identify a given state of a node as overloaded or underloaded. It is important that the overload rule remains the same during the execution of the system; the rule must also be consistent for the classifiers to converge.


4.1.1 Feature Variables

During classification, the pattern classifier takes into account a number of feature variables which might affect the classification decision. The features we use are described below:

Job Features: These features describe the resource usage patterns of a job. They could be calculated by analyzing past execution traces of the job; we assume that there exists a system which can provide this information. In the absence of such a system, users can utilize these features to submit 'hints' about job performance to the classifier. Once enough data about job performance is available, user hints could be mapped to resource usage information. The job features we consider are: job mean CPU usage, job mean network usage, mean disk I/O rate, and mean memory usage. The users estimate the usages on a scale of 10: a value of 1 for a resource means minimum usage, whereas 10 corresponds to maximum usage. For a given MapReduce job, the resource usage variables of the Map part and the Reduce part are considered different.

Node Features (NF): Node features denote the state and quality of the computational resources of a node. Node Static Features change very rarely, or remain constant throughout the execution of the system. These include the number of processors, processor speed, total physical memory, total swap memory, number of disks, and the name and version of the Operating System at the TaskTracker. Node Dynamic Features include properties that vary frequently with time, such as CPU load averages, % CPU usage, I/O read/write rates, network transmit/receive rates, the number of processes running at the TaskTracker, the amount of free memory, the amount of free swap memory, and the disk space left. Processor speed could be a dynamic feature on nodes where the CPUs support dynamic frequency and voltage scaling.

4.1.2 Utility Functions

Utility functions are used for prioritizing jobs and for policy enforcement. An important role of the utility functions is to make sure that the scheduler does not always pick 'easy' tasks. If the utility of all jobs were the same, the scheduler would always pick tasks that are more likely to be labeled good, which are usually the tasks that demand fewer resources. Thus, by appropriately adjusting job utilities, it can be ensured that every job gets a chance to be selected.

It is possible that a certain job is always classified as bad regardless of the values of the feature vectors. This could happen if the resource requirements of the job are exceptionally high. However, this also indicates that the available resources are clearly inadequate to complete such a job without overloading.

Utility functions can also be used to enforce different scheduling policies. Examples of some such policies are given below, and a sketch of pluggable utility functions follows the list. One or more utility functions could be combined in order to enforce hybrid scheduling policies.

1. Map before Reduce: In MapReduce, it is necessary that all Map tasks of a job are finished before the Reduce operation begins. This can be implemented by keeping the utility of Reduce tasks zero until a sufficient number of Map tasks have completed.

2. First Come, First Serve (FCFS or FIFO): The FCFS policy can be implemented by keeping the utility of a job proportional to its age. The age of a job is zero at submission time.

3. Budget Constrained: In this policy, tasks of a job are allocated as long as the job's user has sufficient balance in his/her account. As soon as the balance reaches zero, the utility of that user's jobs becomes zero, and thus no further tasks of those jobs will be assigned to worker nodes.

4. Dedicated Capacity: In this policy a job is allowed guaranteed access to a fraction of the total resources in the cluster. Here, the utility could be proportional to the deficit between the promised fraction and the currently allocated fraction. The utility of jobs allocated more than their promised fraction is set to zero, to make sure that they are not considered during task assignment.

5. Revenue oriented utility: In this policy, the utility of a job is directly proportional to the amount the job's submitter is willing to pay for the successful completion of the job. This makes sure that the algorithm always picks tasks of users who are offering more money for the service.
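
As a sketch, such policies can be expressed as pluggable utility functions of roughly the following shape. The interface and type names are illustrative and are not necessarily the scheduler's actual API.

    // Illustrative shape of a pluggable utility function, shown with two of
    // the policies listed above.
    public interface UtilityFunction {
        double utility(JobInfo job, long nowMillis);
    }

    // FCFS/FIFO: utility proportional to job age, so older jobs win.
    class FcfsUtility implements UtilityFunction {
        @Override
        public double utility(JobInfo job, long nowMillis) {
            return Math.max(0, nowMillis - job.submissionTimeMillis());
        }
    }

    // Budget constrained: utility drops to zero once the user's balance is spent.
    class BudgetUtility implements UtilityFunction {
        @Override
        public double utility(JobInfo job, long nowMillis) {
            return job.ownerBalance() > 0 ? 1.0 : 0.0;
        }
    }

    interface JobInfo {
        long submissionTimeMillis();
        double ownerBalance();
    }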


4.1.3 Avoiding resource starvation

As described in the previous section, utility functions are also used to avoid resource starvation among jobs. Resource starvation is possible if a particular job needs to do a nontrivial amount of work. Jobs that are heavily CPU bound might overload a worker node even if, at the time of allocation, the concerned node has 100% free resources. As a result, tasks of such jobs are always labeled bad and are never allocated.

To prevent this, we maintain an assignment count for each job. After every allocation, the assignment count of the job in question is increased. While choosing a job, we choose the one with the fewest assignments. For this, the utility of the job is calculated as follows:

U(J) = a^{K - J.\text{priority} - J.\text{assignments}}

where a is a positive integer such that a > 1, and K is a large constant. We have kept a = 2 and K = 64. As the highest job priority in Hadoop corresponds to the lowest integer value, this makes sure that jobs with higher priorities always have higher utility values.

The assignment counts of jobs are reset periodically. For this, we calculate the maximum assignment count among the queued jobs. This value is deducted from the counts of all jobs, so that jobs that did not get any assignments in this cycle have the maximum priority in the next cycle. If a job is consistently not being allocated, its assignment value will continue to decrease. If this value reaches a certain lower limit set by the administrator, tasks of that job are forcefully allocated by labeling them as good. This makes sure that such jobs also get a chance to execute their tasks in the cluster. After such an allocation, the assignment count of the job is reset to zero again, or it could be incremented by a value proportional to the job's priority, so that other jobs that do not overload resources get preference in the next cycle.
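
A direct transcription of this scheme, with names of our own, could look as follows:

    // Sketch of the starvation-avoiding utility from this section:
    // U(J) = a^(K - priority - assignments), with a = 2 and K = 64 as in the
    // text. Math.pow is used because 2^63 and above overflow a signed long.
    public final class StarvationAwareUtility {
        private static final int A = 2;
        private static final int K = 64;

        // Higher Hadoop priority means a lower integer, hence a larger exponent.
        public static double utility(int priority, int assignments) {
            return Math.pow(A, K - priority - assignments);
        }

        // Periodic reset: subtract the maximum assignment count so that jobs
        // with no assignments this cycle get the highest utility next cycle.
        public static void resetCounts(int[] assignmentCounts) {
            int max = 0;
            for (int c : assignmentCounts) max = Math.max(max, c);
            for (int i = 0; i < assignmentCounts.length; i++) {
                assignmentCounts[i] -= max;
            }
        }
    }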

4.1.4 Benefit of doubt

If in certain cases the likelihoods of a job being classified as good or bad are almost equal, we label the task as good. For this, the log likelihoods of the posterior probabilities are calculated. If the bad probability is not larger than the good probability by an order of magnitude, the task is labeled good; otherwise it is labeled bad.
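
In code, this benefit-of-doubt rule is a one-line comparison in log space; a sketch:

    // Sketch of the benefit-of-doubt rule: label the task good unless the
    // "bad" posterior wins by at least an order of magnitude (i.e., by more
    // than 1 in log base 10).
    public final class BenefitOfDoubt {
        public static boolean labelGood(double pGood, double pBad) {
            double margin = Math.log10(pBad) - Math.log10(pGood);
            return margin <= 1.0; // bad must be >10x more likely to reject
        }
    }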

Next, we explain how the same algorithm can be implemented using two different pattern classifiers. In this chapter we consider only the Naive Bayes classifier and the Perceptron classifier [43]. Theoretically, any linear classifier could be used for classifying jobs. However, we discuss these two because of their ease of implementation and their ability to learn from one sample at a time (online learning). Online learning helps keep the memory used by the classifiers constant with respect to the number of feature vectors. This is essential in our case, as efficiency is an important goal for a scheduler implementation.

4.1.5 Using a Naive Bayes Classifier

If we apply Bayes' theorem to equation 4.1 from the beginning of this section, we get

E.U.(J) = \frac{U(J) \, P(F_1, F_2, \ldots, F_n \mid \tau_J = \text{good}) \, P(\tau_J = \text{good})}{P(F_1, F_2, \ldots, F_n)}

The denominator in the above equation can be treated as a constant, as its value is independent of the jobs, and thus its calculation can be skipped during comparison.

We calculate both P(τ_J = good | F_1, F_2, ..., F_n) and P(τ_J = bad | F_1, F_2, ..., F_n). The job is labeled good or bad depending on which of the two probabilities is higher. Under the Naive Bayes assumption of conditional independence we get

P(F_1, F_2, \ldots, F_n \mid \tau_J = \text{good}) = \prod_{i=1}^{n} P(F_i \mid \tau_J = \text{good})

Thus, we compute the following quantity for all the jobs and select the job that maximizes it:

E.U.^{*}(J) = U(J) \, P(\tau_J = \text{good}) \, \prod_{i=1}^{n} P(F_i \mid \tau_J = \text{good})

Once the effects of a task assignment are observed, the probabilities are updated accordingly, so that future decisions can benefit from the lessons learned from the effects of current decisions.
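
One implementation note: multiplying many small conditional probabilities quickly underflows double precision, so in practice a quantity of the form E.U.*(J) is best compared in log space, which leaves the argmax unchanged. A sketch, with names of our own:

    // Sketch: comparing E.U.*(J) values in log space to avoid floating-point
    // underflow when many conditional probabilities are multiplied together.
    // Assumes utility > 0; the log transform preserves the argmax over jobs.
    public final class TaskScore {
        public static double logScore(double utility, double pGood,
                                      double[] featureLikelihoods) {
            double s = Math.log(utility) + Math.log(pGood);
            for (double p : featureLikelihoods) {
                s += Math.log(p); // p = P(F_i | tau_J = good)
            }
            return s;
        }
    }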


Here we assume that the probabilities of all feature variables are conditionally independent of each other. This may not always be true. However, we observed that this assumption yields a much simpler implementation. Despite the assumption, Naive Bayes classifiers are known to perform well, and our results show that the assumption does not have any drastic undesired effects on the overall performance of the scheduler.

4.1.6 Separability of feature vector space and classifier convergence

Naive Bayes classifiers assume that all feature variables are conditionally independent of each other, and that their probabilities can be calculated independently. This assumption is almost always incorrect in practice. However, Naive Bayes classifiers have been known to outperform other popular classifiers, including decision trees and multilayer neural networks. Zhang [81] has discussed the unexpected efficiency of Naive Bayes classifiers in detail.

All the feature variables used in our classifier indicate either the availability or the usage of computational resources at a given node. Clearly, the greater the availability of a resource, the greater the likelihood of a task being completed without overloading that resource. For features that correspond to the usage of resources, such as the job features, the opposite is true: the higher the resource usage, the higher the likelihood that a task of that job will overload the node. Thus, we can say that for a given job, for every feature variable there exists a separating value, on one side of which a task of the job is likely to overload the node, and vice versa. The vector of all such separating values gives the hyperplane which separates the feature vectors into two classes, good and bad.

4.2 Evaluation and Results

In this section we briefly discuss the implementation, and then explain the evaluation methodology and the results of our experiments.


4.2.1 Implementation Details

We have implemented our algorithm for Hadoop version 0.20.0. Our scheduler uses the pluggable scheduling API introduced in Hadoop 0.19. The scheduler customizes the assignTasks method of the org.apache.hadoop.mapred.TaskScheduler class.

We used only the Naive Bayes classifier in our implementation; compared to the Perceptron classifier, it is better at online learning (learning from one sample at a time) and at handling categorical feature variables. We used a simple histogram for counting the probabilities of discrete features.

Node Features are obtained from the heartbeat message; we extended the heartbeat protocol used in Hadoop to include node resource properties. Job Features are passed via configuration parameters (learnsched.jobstat.map and learnsched.jobstat.reduce) while launching a job. In the absence of these parameters, the mode of the values observed for each resource is used as the respective job feature.

At any point in time, we maintain at most k decisions made by the classifier for each TaskTracker, where k is the number of tasks assigned in one heartbeat. During the evaluation we kept k = 1. Once the decisions are evaluated by the overload rule, we persist them to disk so that they can be used in re-learning, or when the desired utilization level is changed by the user. A decision made for the current heartbeat is evaluated in the next heartbeat. This allows us to control the memory used by decisions. We disregard the acceptNewTasks flag in the heartbeat message, and consider a node for task assignment in every heartbeat.

We allow users to implement their own utility functions by extending our API. Utility functions in the scheduler are pluggable and can be changed at runtime. We have implemented a constant utility function and a FIFO utility function. Users can also write their own overload rules by implementing the DecisionEvaluator interface.


4.2.2 Evaluation

4.2.2.1 Cluster Details

We used a cluster of eight nodes to evaluate our algorithm. One of the nodes was designated as the master node, which ran the HDFS and MapReduce masters (NameNode and JobTracker). The remaining seven nodes were worker nodes. All of the nodes had 4 CPUs (Intel Quad Core, 2.4 GHz), a single hard disk of 250 GB capacity, and 4 GB of RAM. The nodes were interconnected by an unmanaged gigabit Ethernet switch. All of the nodes ran Ubuntu Linux (9.04, server edition) and Sun Java 1.6.0_13. We used Hadoop version 0.20.0 for this evaluation. The important Hadoop parameters and the values used in the experiments are described in Figure 4.2. For the rest of the parameters, we used Hadoop's default values.

Hadoop Parameter         Value
Replication              3
HDFS Block size          64 MB
Speculative Execution    Enabled
Heartbeat interval       5 seconds

Figure 4.2 Hadoop settings used in evaluation

We used one-minute CPU load averages to decide the overloading of resources. Load averages summarize both CPU and I/O activity on a node. We calculated the ratio of the reported load average to the number of available processors on a node. A value of 1 for this ratio indicates 100% utilization of a node. A node was considered overloaded if the ratio crossed a user-specified limit.
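
This overload rule is straightforward to express with the standard Java management API; a sketch (assuming the platform exposes the one-minute load average, as Linux does):

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    // Sketch of the overload rule used in this evaluation: the one-minute load
    // average divided by the processor count, compared against a user limit.
    // getSystemLoadAverage() returns the one-minute average, or a negative
    // value if it is unavailable on the platform.
    public final class LoadAverageRule {
        private final double limit; // e.g. 1.0 for 100% utilization

        public LoadAverageRule(double limit) {
            this.limit = limit;
        }

        public boolean isOverloaded() {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            double loadPerCpu = os.getSystemLoadAverage()
                    / Runtime.getRuntime().availableProcessors();
            return loadPerCpu > limit;
        }
    }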

4.2.2.2 Workload Description

We evaluated our scheduler using jobs that simulate real-life workloads. In addition to the WordCount and Grep jobs used by Zaharia et al. [79], we also simulate jobs that represent typical usage scenarios of Hadoop. We collected Hadoop usage information from the Hadoop PoweredBy [10] page, which lists case studies of over 75 organizations. We categorized the usages into eight main categories: text indexing, log processing, web crawling, data mining, machine learning, reporting, data storage, and image processing. Figure 4.3 summarizes the frequency of these use cases; the percentages represented in the figure are approximate. From this information, we conclude that Hadoop is being used in a wide range of scenarios, naturally creating diversity in the resource requirements of MapReduce jobs.

Figure 4.3 Prominent Use Cases for Hadoop. (percentages are approximate)

We came up with the following set of jobs to evaluate our scheduler. We describe their functioning below:

• TextWriter: Writes randomly generated text to HDFS. The text is generated from a large collection of English words.

• WordCount: Counts word frequencies from textual data.

• WordCount with 10 ms delay: Exactly the same as WordCount, except that we add an additional sleep of 10 ms before processing every key-value pair.

• URLGet: This job mimics the behavior of the web page fetching component of a web crawler. It downloads a text file from a local server. The local server delays its response by a random amount of time (normal distribution, µ = 1.5 s, σ = 0.5 s) to simulate internet latency. The text files we generated had sizes following a normal distribution with a mean of 300 KB and a variance of 50 KB [60].


• URLToDisk: Downloads large video files (200 MB) from a local web server and saves them to disk.

• CPUActivity: Carries out a computationally expensive numerical calculation for every key-value pair in the input.

Job                     CPU   Memory   Disk   Network
TextWriter               3       5       5       3
WordCount                5       6       8       5
WordCount+10ms delay     1       6       6       4
URLGet                   3       4       6       7
URLToDisk                3       5       7       8
CPUActivity             10       5       5       3

Figure 4.4 Resource usage of evaluation jobs as estimated on a scale of 10; a value of 1 indicates minimum usage

We estimated the resource usages of each job profile thus created on a scale of 10, based on empirical observations for each computational resource. These values are shown in Figure 4.4; a value of 10 for a resource means maximum resource usage. The estimates were passed to the scheduler as Job Features, as discussed in Section 4.1.1. It should be noted that the approximate usage estimates could be replaced by actual estimates if such information is available. Our algorithm will work in both cases, provided the estimates are mapped properly to actual resource usages.

During each run of a job, the input was created afresh using the RandomTextWriter job in Hadoop, and deleted after the job was completed. The job generated 10 GB of text from words randomly chosen from a large collection of English words.

4.3 Results

4.3.1 Demonstrating Learning Behavior

We first demonstrate the learning behavior of the scheduler. In this experiment we ran the WordCount MapReduce job provided with Hadoop. The job was run several times on randomly generated text of size 70 GB, and the input was regenerated before each run. Figure 4.5 shows the average load on one of the worker nodes during this period.

Figure 4.5 Learning behavior of the scheduler for the WordCount job (worker utilization vs. time in hours)

The scheduler was asked to maintain utilization at 100%. Initially, the utilization was lower than expected during the 'learning phase' of the scheduler. After that, however, the nodes were rarely overloaded and achieved utilization very close (approx. 96%) to the desired value (100%). Another interesting observation is that the intensity of the 'peaks' in the load reduces with time, confirming that the scheduler is indeed learning.

Figure 4.6 shows the reduction in the runtime of WordCount during this period. We report only the first six runs, since after that the runtime converged to around 36 minutes. The scheduler converges in the first few runs of the job, with the greatest reduction between the first two runs. The long runtime of the first run is due to the underutilization during the learning phase of the scheduler. We would like to point out that for larger jobs, i.e., jobs with more Map tasks, the scheduler would converge even more quickly.


Figure 4.6 Learning behavior of the scheduler for the WordCount job (runtime of WordCount in minutes vs. successive runs)

4.3.2 Maintaining Desired Utilization

Next, we evaluated whether the scheduler is able to achieve a user-specified utilization. For this experiment we used the TextWriter, WordCount and Grep jobs. A constant utility function was used to make sure that all the jobs had equal priority. During each run, we first ran the TextWriter job to create input before running WordCount and Grep. We changed the desired utilization ratio (a ratio of 1 means maintaining 100% utilization, which in turn means maintaining a load average of 4 on a quad-core machine) from 1 to 4 in steps of 0.5.

Figure 4.7 shows the observed behavior of the scheduler. The figure shows the mean load averages and their variations in a single run; the values reported are averaged over ten experiments. As shown in the figure, the achieved load average is quite close to the desired load average (4 times the desired ratio). The relatively large variation in the values can be attributed to the initial underutilization during the learning phase of the scheduler. For higher values of the desired utilization ratio, the gap between the desired and achieved load averages increases. This is because it is difficult to maintain work at higher utilization, as we can assign a task only every 5 seconds (the heartbeat interval).

Figure 4.7 Achieved utilization for different user requirements (mean and standard deviation of load average vs. desired utilization ratio)

We have deliberately included the learning phase of the scheduler in our evaluations in order to compare the scheduler fairly with Hadoop's default scheduler, as the default scheduler does not involve any learning phase.

4.3.3 Comparing Learning Rates

Next, we compare the learning rates for two jobs, URLGet and CPUActivity. URLGet is a network-intensive job, whereas CPUActivity is CPU intensive. We report the accuracy of the decisions made by the classifier in Figure 4.8. We consider a decision to be accurate if it is validated by the overload rule; for example, a decision to allocate a task is accurate if the overload rule determines that the allocation did not cause any overload. We report the percentage accuracy per 250 decisions made by the classifier.

Figure 4.8 Classifier accuracy for URLGet and CPUActivity (accuracy % vs. number of decisions)

The accuracy for both jobs increases, which is expected, as the scheduler is learning about the impact of both jobs on utilization. However, the rate of increase for URLGet is much smaller compared to CPUActivity. This could be because of the delayed responses by the local servers in the case of URLGet: as responses are delayed, URLGet tends to block on network I/O for rather unpredictable durations. CPUActivity, on the other hand, tends to be more predictable, and hence the classifier accuracy for CPUActivity improves faster. Another point to note is that achieving 100% accuracy is not that important as long as the scheduler is able to maintain the utilization level specified by the user. Moreover, achieving 100% accuracy could be difficult because of the uncertainty in factors such as the utilization contributed by DataNode processes, network traffic, the age of resource information [62], and errors in estimating the resource requirements of jobs.

4.3.4 Comparison with Hadoop Native Scheduler

We now present the last set of our experiments, in which we compared our scheduler against Hadoop's native scheduler described in Chapter 2. For these experiments, we used the workload described in Section 4.2.2.2. Again, we used a constant utility function. We did not use any policy scheme (as described in Section 4.1.2), as our goal was to demonstrate that our scheduler does better task assignment than Hadoop's native scheduler. We do not compare our scheduler with other Hadoop schedulers, because the number of tasks assigned by them would be the same as that assigned by the default scheduler, for the reasons discussed in Chapter 2. Each of the jobs was run in isolation, and the input was regenerated before every new run of a job.

We set the maximum number of concurrent tasks to 5 (4 CPUs + 1 disk). This was done to make sure that each task always had access to a disk or a CPU. This setting does not apply to the learning scheduler. Note that this is larger than the default Hadoop setting, where only two tasks are allowed to run concurrently on a machine.

Job                       Learning Scheduler   Hadoop native   Runtime compared to Hadoop
TextWriter                       2.03                5                  2.5x
WordCount                        2.31                5                  2x
WordCount + 10 ms delay         10.52                5                  0.4x
URLGet                           8.35                5                  0.6x
URLToDisk                        5.09                5                  1x
CPUActivity                      3.17                5                  1.5x

Figure 4.9 Comparison of task assignment by the Learning Scheduler and Hadoop's native scheduler

Figure 4.9 shows the mean number of tasks assigned during each run by our scheduler and by Hadoop's default scheduler. The values are averages calculated over 10 experiment runs. Our scheduler allocated tasks so as to maintain a utilization of 100%. Job runtime is determined by the number of tasks assigned, as we made sure that each task would take 60 seconds to complete. As can be seen from the table, Hadoop's scheduler achieved better runtimes for the TextWriter, WordCount and CPUActivity jobs. However, it should be noted that in all these cases Hadoop's task assignment policy resulted in overload at the worker machines due to excessive task assignment. Since our scheduler assigns tasks by considering utilization, shorter runtimes could easily be achieved by increasing the utilization target given to the scheduler. For WordCount with 10 ms delay, URLGet and URLToDisk, the learning scheduler achieved significantly shorter runtimes than the comparison. This is because Hadoop allocated only a fixed number of tasks, whereas our scheduler could allocate more tasks, as each task was less demanding. The biggest improvement is seen in WordCount with 10 ms delay. For these jobs, in the case of the default Hadoop scheduler, the cluster was underutilized.

4.4 Summary

In this chapter we presented the learning scheduler for Hadoop. From our experiments it is clear that a learning based approach, combined with an additional heuristic for preventing job starvation, achieved higher system utilization compared to the native Hadoop scheduler. The approach we used builds on the very basic approach used by Bayesian Decision Networks and Stochastic Learning Automata, described in Chapter 2. However, the framework we proposed is generic enough to be extended to other problems in Hadoop's resource management, such as speculative execution and data-local execution.


Chapter 5

Conclusions and Future Work

Identifying the need for autonomic, resource-aware scheduling in MapReduce, we tackled two important problems: task assignment and admission control. In both approaches, we used Bayesian learning to continuously optimize the decision making process. The crux of the algorithm used in solving both problems was a closed control loop, created using a feedback-informed Naive Bayes classifier.

This chapter concludes the thesis by summarizing our results, establishing the key lessons learned, and touching upon some future directions of research.

5.1 Similarity in the two approaches

Although applied to two different problems in resource management, the crux of our systems is the same. We used a feedback-informed classifier that continuously learns the effect of its decisions. An external rule-based system was used to verify the decisions made by the classifier in retrospect, after the effects of the decisions were observed. In both problems, the classifier score was coupled with a priority function to achieve fairness and to grant users control over the behavior of the system. This system was used to solve both admission control and task assignment; in fact, we used the exact same classifier implementation for both problems. The only differences were in the classifier feature variables and the evaluation rules. The commonality in the two approaches also explains the similarity between the results and the experiments carried out to test the effectiveness of our systems.

A basic assumption in our approach is that the effect of a decision is observable before the next decision is taken. Although a very safe assumption when applied to MapReduce, it may not hold true for a generic grid computing service, where users generally submit a batch of requests at once and the system must decide whether or not to accept and schedule the entire batch. We would like to point out that this problem is essentially that of solving a combinatorial auction, which is known to be a difficult problem. Considering this, limiting a system to accepting one job at a time does not seem too stringent a restriction.

Besides the obvious contribution in the form of the algorithm itself, a very important lesson learned during the work of this thesis is that data driven approaches can be used effectively in distributed resource management. Learning based approaches have rarely been tried for resource management in utility/grid/cloud computing, and we hope that more researchers will be encouraged to try out other learning based approaches as well.

5.2 Admission Control

As we established in the first chapter, admission control is a very important problem for real world MapReduce clusters. An effective admission control algorithm should make the system more robust by avoiding overload, and should try to reduce response times for users. This is exactly what our algorithm achieved.

Extensive simulation showed that our algorithm maintains the expected level of load with very high accuracy. This is true not only for the ideal utilization of 100%, where the achieved average utilization is 97%, but also when the system is asked to maintain more than 100% utilization. However, this does not adversely affect response time. On the contrary, the algorithm achieves an improvement of 7-8% over the baseline approaches in mean job runtimes. The overall distribution of runtimes also shows that the algorithm uniformly reduces job runtimes across all job sizes. The increase in the job acceptance rate with decreasing job submission frequency shows that the algorithm tries to maintain utilization by accepting more jobs in periods of less user activity.


One observed downside of the approach is that it usually favors smaller jobs. However, we showed that this problem can be solved by properly adjusting the priority function. Moreover, it should be noted that accepting smaller jobs usually works to the benefit of the service provider, as these jobs yield revenue sooner, and it is easier to guarantee availability of system resources over a shorter period of time.

We also presented a new paradigm for offering and utilizing MapReduce jobs in the cloud. The model we proposed is advantageous to users as well as to potential service providers willing to offer data intensive applications as ready to use MapReduce services. We presented a concrete mechanism for expressing user demands in the form of utility functions that capture users' perceived value of the service as a function of time.
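For illustration, one plausible form of such a utility function is a value that stays flat until a deadline and then decays linearly to zero. The shape and parameter names below (maxValue, deadline, expiry) are assumptions made for this example, not the specific functions used in our experiments.

```java
// Sketch of a user utility function: the perceived value of a completed
// job as a function of elapsed time since submission.
public final class UtilityFunction {
    private final double maxValue; // value if the job finishes before the deadline
    private final double deadline; // seconds until the value starts to decay
    private final double expiry;   // seconds after which the result is worthless

    public UtilityFunction(double maxValue, double deadline, double expiry) {
        this.maxValue = maxValue;
        this.deadline = deadline;
        this.expiry = expiry;
    }

    public double valueAt(double elapsedSeconds) {
        if (elapsedSeconds <= deadline) return maxValue;
        if (elapsedSeconds >= expiry) return 0.0;
        // Linear decay between the deadline and the expiry time.
        return maxValue * (expiry - elapsedSeconds) / (expiry - deadline);
    }
}
```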

5.2.1 Future Directions

We now list some of the future areas of research in admission control using our algorithm.

Admission control for the HDFS NameNode

The approach can be applied as-is to NameNode admission control. The NameNode, i.e., the master node of HDFS, is susceptible to crashes caused by uncontrolled access. As the NameNode is the single point of failure in both MapReduce and HDFS, solving admission control for it is very important. We believe that the same algorithm could be used to solve this problem by choosing a proper set of feature variables.

Application to resource brokering in the cloud

Resource brokering, as explained in Chapter 2, is the problem of pooling resources from multiple sites and choosing to schedule a user's applications on one of these sites. The same learning based approach could be used to solve this problem. The resource broker could choose a site based on its past record of meeting SLAs given its current load, and on the price the site is charging. The past record translates to the probability tables of the scheduler, and the price charged by the site translates into the priority function. The resource broker then chooses the site that maximizes the product of the two. Here the goal is to maximize the profit of the user of the service.
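A minimal sketch of this brokering rule follows. The Site record, its fields, and the reciprocal-price priority are hypothetical; in a real broker the SLA probability would come from the classifier's learned probability tables.

```java
import java.util.Comparator;
import java.util.List;

// Sketch: choose the site maximizing P(SLA met | current load) * priority(price).
public class ResourceBroker {
    public record Site(String name, double pSlaMet, double pricePerHour) {}

    // Cheaper sites receive higher priority; the scale constant is arbitrary.
    private static double priority(Site s) {
        return 1.0 / s.pricePerHour();
    }

    public static Site choose(List<Site> sites) {
        return sites.stream()
                .max(Comparator.comparingDouble((Site s) -> s.pSlaMet() * priority(s)))
                .orElseThrow();
    }
}
```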

Application to interactive SaaS

Another very exciting problem domain is that of interactive SaaS applications such as Google Docs and Salesforce. Here, the problem is to decide which user's request should be entertained given the current load of the system. In this case, attributes such as the average time a user spends utilizing the system, the rate at which users are charged, and the pricing scheme itself could be included among the feature variables when making a classification decision. The unpredictable nature of interactive applications also makes the problem more interesting.

5.3 Task Assignment

Another key problem in MapReduce, that of task assignment, was also solved using the same learning based approach. The most important contribution of this part of the thesis is the working scheduler itself, which has been contributed back to the Hadoop community as open source software. The scheduler is very effective for repetitive data intensive jobs, and will be useful in the majority of scenarios in which Hadoop is used.

During our evaluation we tried to model the popular applications of Hadoop MapReduce. We used a workload that simulated real world use cases, and the scheduler was tested against this workload. A key property of our scheduler is continuous learning and adaptation to heterogeneous workloads. Despite the uncertainty involved in clusters, the scheduler was able to learn the impact of different applications within their first few runs. This is a strong point, especially considering that most users use Hadoop to schedule periodic runs of a fixed set of jobs; as our results show, the accuracy of the system only increases with more usage. The ability to provide job specific parameters also makes the system useful for one-shot jobs.


Once the scheduler stabilizes, we achieve much better performance than the default Hadoop scheduler. Although the underutilization in the learning phase is a limitation specific to our algorithm and not to the default scheduler, we have included it in our results. We added several ‘rules’ to mitigate the initial underutilization, one of them being the forceful allocation of tasks when the load on a node is very low.
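As an example of such a rule, the sketch below forces an assignment whenever a node is nearly idle, overriding a classifier that is still learning. The threshold value and the names are assumptions made for illustration.

```java
// Sketch of a mitigation rule for initial underutilization.
public class ForcedAllocationRule {
    private static final double LOW_LOAD_THRESHOLD = 0.2; // fraction of node capacity

    public boolean shouldAssign(boolean classifierSaysAssign, double nodeLoad) {
        if (nodeLoad < LOW_LOAD_THRESHOLD) {
            return true; // keep the node busy while the model is still learning
        }
        return classifierSaysAssign;
    }
}
```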

One limitation of the scheduler is that it is not very effective for tasks with unpredictable behavior. If such behavior is anomalous, the overall performance of the scheduler will not be affected. However, if unpredictability is the norm in a particular application, then the convergence of the classifier cannot be guaranteed. In such cases, a fixed slot based approach can be more effective. Another limitation is that, for efficiency, the classifier data structures need to be stored in main memory, which could be problematic for very large clusters. Storing the classifier data on disk is not advisable, since it would severely impact performance.

5.3.1 Future Directions

A learning based task assignment approach can be applied to solve other important problems as well. We list a couple of problems where we believe our approach could be useful.

Application to conventional task assignment systems

Our scheduling algorithm could also be used in other batch scheduling systems such as Torque and the Sun Grid Engine. In this case, the solution space would consist of all the tasks in the scheduler’s queue, and the scheduler would classify all tasks into assignable and un-assignable classes. The difference between MapReduce and these systems is that in MapReduce the solution space consists of just the set of jobs, which is relatively small (a job consists of multiple tasks, and in real world clusters a MapReduce job can contain several hundred thousand tasks). Since our algorithm runs in time linear in the number of candidate jobs, when applying it to Torque or SGE we might need to pre-compute scheduler decisions for a given load level at a node, and look up a classification decision from the pre-computed table in order to gain performance.
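The sketch below illustrates this pre-computation idea: evaluate the classifier offline for every discretized load level and task class, then answer scheduling queries with a constant-time table lookup. The bucketing scheme and names are assumptions for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

public class PrecomputedDecisions {
    private final Map<String, Boolean> table = new HashMap<>();

    // Offline phase: store the assignable / un-assignable decision for
    // every (discretized load level, task class) pair.
    public void precompute(int loadLevels, String[] taskClasses,
                           BiPredicate<Integer, String> classifier) {
        for (int load = 0; load < loadLevels; load++) {
            for (String taskClass : taskClasses) {
                table.put(load + ":" + taskClass, classifier.test(load, taskClass));
            }
        }
    }

    // Online phase: constant-time lookup instead of re-running the classifier.
    public boolean isAssignable(int loadLevel, String taskClass) {
        return table.getOrDefault(loadLevel + ":" + taskClass, false);
    }
}
```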

Power Aware Scheduling

With the total power consumed by data centers becoming a significant contributor to global power consumption, power efficient scheduling has become a very important area of research. We believe that a learning based approach could be used in a ‘green scheduler’ as well. Here the evaluation rule would be based on the global power consumed by the MapReduce cluster: if an allocation results in excess power usage, it is considered a failed allocation, and vice versa. The learning based approach could be extended to other global objectives as well; examples include allocation based on overall bandwidth consumption (useful in distributed web crawlers), or on access to a global resource such as a common service.
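A power-based evaluation rule of the kind described above could be as simple as the following sketch; the budget parameter and names are assumptions for illustration.

```java
// Sketch: label an allocation a failure if the cluster power drawn while
// the task ran exceeded a fixed budget, and feed that label back to the
// classifier as with any other evaluation rule.
public class PowerEvaluationRule {
    private final double powerBudgetWatts;

    public PowerEvaluationRule(double powerBudgetWatts) {
        this.powerBudgetWatts = powerBudgetWatts;
    }

    public String evaluate(double observedClusterPowerWatts) {
        return observedClusterPowerWatts <= powerBudgetWatts ? "SUCCESS" : "FAILURE";
    }
}
```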

In the future we would like to evaluate our scheduler in heterogeneous environments as well as in cloud computing environments such as Amazon EC2. Incorporating features that indicate node stability, and predicting component failures, is another interesting future direction.

5.4 Summary

We end this thesis with the hope that it will serve as a valuable reference document for anyone interested in the state of the art in resource management in MapReduce, or for anyone who wants to write their own scheduler. We hope that learning based approaches that utilize task statistics will be embraced by the grid scheduling community.


Related Publications

• Learning Based Opportunistic Admission Control Algorithm for MapReduce as a Service, Jaideep Dhok, Nitesh Maheshwari and Vasudeva Varma. In Proceedings of the 3rd India Software Engineering Conference, Mysore, 2010.

• Using Pattern Classification for Task Assignment in MapReduce, Jaideep Dhok and Vasudeva Varma.
