Project Report
Cloud Computing Report
December 22, 2010
Marcus Ljungblad
Navaneeth Rameshan
Wasif Malik
This report is a part of the cloud computing project.
Contents
1 Introduction
2 Proposed Method
2.1 Attempted approaches
2.1.1 Web Mode
2.1.2 Single Task Mode
2.2 Proposed method
2.2.1 Web Mode
2.2.2 Single Task Mode
3 Implementation
4 Results
4.1 Single-Task mode
4.1.1 Simulation summary
4.2 Web-Task Mode
4.2.1 Simulation Summary
4.3 Round-Robin
4.3.1 Simulation Summary
5 Conclusion
5.1 Scope for Improvement
5.1.1 Shifting jobs between workers
5.1.2 Load parameters
5.1.3 Timeline
5.2 Conclusion
List of Figures
2.1 High level flow of Scheduling algorithm
3.1 UML diagram
4.1 Response and Queued jobs over time in single-task mode
4.2 Number of active, idle, and computing workers over time for single-task mode
4.3 Response time and queued jobs over time in web-task mode
4.4 Active, idle and computing workers over time in web-task mode
4.5 Response time and queued jobs using round-robin scheduling
Chapter 1
Introduction
Distributing jobs of unknown size efficiently across a large number of machines is one of the greatest challenges in cloud computing today. The goals range from minimizing cost to minimizing the time to complete a set of jobs; two often contradictory requirements. While the fastest way to complete all jobs may be to schedule one job per machine, it is far from the most cost-efficient. In effect, one must always make a trade-off between the two.
In this report we present three algorithms: one focused on minimizing the response time of a job, one on minimizing cost, and one reference algorithm implemented using round-robin. To test the algorithms, a cloud simulator was implemented in C++. Results from simulations with varying inputs are compared against the round-robin algorithm. Finally, a set of improvements to the evaluated algorithms is proposed.
Chapter 2
Proposed Method
2.1 Attempted approaches
The following subsections describe the initial attempted approaches for scheduling.
2.1.1 Web Mode
For the web mode, we intended to distribute jobs efficiently across worker nodes, minimizing swapping costs and fitting jobs into memory in the best possible way so as to ensure good load distribution. However, since the scheduler has no information about a job's memory requirements, the only way to distribute jobs efficiently is to schedule them round-robin or randomly at first and obtain memory information from the workers. The cost of transferring a job from one worker node to another is then calculated to see whether a transfer is worthwhile for load distribution. It may not be worthwhile if the job has already executed most of its instructions. Although the worker nodes do not report the number of instructions completed, the worst-case time taken by completed jobs is used to estimate the time remaining. We discarded this method because the cost estimate deviated significantly from the actual cost. The time a job takes to complete depends largely on the number of jobs already present on the worker node, the swapping costs, and the number of instructions in the job; we believe these factors made the completion time difficult to predict.
2.1.2 Single Task Mode
For the single task mode, we intended to submit jobs in a round-robin manner and to compute the job away time (the time elapsed since a job was submitted) for all jobs that have been submitted. The job away time is computed every scheduler cycle once at least one job from a task has completed. The average completion time of the finished jobs is used to estimate the completion time of the jobs still running. If the job away time is less than the average completion time for the jobs in the task, the estimated completion time is the average itself; if the job away time is greater, the average completion time is recomputed as a weighted average. However, here again the estimated time to complete deviated significantly from the actual values, and as a result either more workers were started than necessary or, in some cases, fewer.
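The weighted-average recomputation described above can be sketched as follows. This is our reading of the discarded approach; the function name and the weight of 0.5 are illustrative assumptions, not values from the implementation:

```cpp
// Sketch of the discarded estimate: once a job's away time exceeds the
// current average completion time, fold the away time into the estimate
// as a weighted average. The weight of 0.5 is an assumed, illustrative value.
double updateEstimate(double avgCompletion, double jobAwayTime, double weight = 0.5) {
    if (jobAwayTime <= avgCompletion)
        return avgCompletion;            // the average alone remains the estimate
    return weight * avgCompletion + (1.0 - weight) * jobAwayTime;
}
```

As the text notes, estimates of this form drifted from the actual values because completion time also depends on swapping and worker load.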
2.2 Proposed method
Figure 2.1 shows the high level flow of the scheduling algorithm.
Figure 2.1: High level flow of Scheduling algorithm
2.2.1 Web Mode
In the web mode, the goal of the scheduling algorithm is to minimize the average response time. Ideally, a practical solution would distribute load evenly and start enough worker nodes to keep response time low. Jobs are submitted in a round-robin manner until at least one job completes. This is a modified version of the normal round-robin algorithm: it sends only one job per active worker in each scheduling cycle. As a result the scheduler holds back jobs, which increases the chances of jobs finishing quickly due to reduced swapping time. As jobs complete, we keep track of the worst-case completion time. Based on the current time, we also keep track of the time remaining until the next charging tick. As soon as at least one job has completed, future jobs are scheduled to worker nodes based on their current load. In the implementation, load is the number of jobs a worker node is currently working on: the lower the load, the more jobs can be sent to the worker, and vice versa. For each worker node, the worst-case execution times of the jobs to be sent are estimated from the worst-case completion time seen so far. If the worst-case execution time of the jobs at hand exceeds the time until the next charging tick, then only as many jobs are sent as are estimated to complete within the charging tick. These jobs are chosen randomly from the queue for a good distribution. Jobs that are not sent are considered spilled, and the spilled jobs for each worker are accumulated in the same cycle. Depending on the estimate of how long the spilled jobs will take, new worker nodes are started. The scheduler cannot send the spilled jobs to the newly started workers immediately, because the workers take some time to boot and cannot accept jobs during that time. The spilled jobs are therefore saved in a hash map, and the scheduler tries to send them to the designated workers at the start of each scheduling cycle. As soon as the new worker node(s) boot up, the jobs are submitted to them.
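The charging-tick check described above can be sketched as below. This is a simplified model that pessimistically assumes each job takes the worst-case time seen so far; the function names are ours, not taken from the implementation:

```cpp
#include <algorithm>

// How many of the jobs at hand can be sent to one worker such that all of
// them are estimated (at worst-case time per job) to complete before the
// next charging tick. The remainder are "spilled".
long jobsToSend(long jobsAtHand, double worstCaseJobTime, double timeToNextTick) {
    if (worstCaseJobTime <= 0.0)
        return jobsAtHand;               // no completed job yet: no estimate to apply
    long fit = static_cast<long>(timeToNextTick / worstCaseJobTime);
    return std::min(jobsAtHand, std::max(fit, 0L));
}

long spilledJobs(long jobsAtHand, double worstCaseJobTime, double timeToNextTick) {
    return jobsAtHand - jobsToSend(jobsAtHand, worstCaseJobTime, timeToNextTick);
}
```

Spilled counts accumulated over a cycle then drive the decision to start new workers.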
2.2.2 Single Task Mode
In single task mode, a scheduling algorithm similar to the one above is used, with one major difference: the decision to start new nodes depends on the allowed percentage of waste. The more money the user is willing to waste, the more nodes the scheduler will start. This results in a very quick response time, but at a significantly higher cost. Conversely, the lower the allowed waste, the stricter the scheduler is about starting new nodes: the cost is lower, but the average response time increases. To estimate the time it would take to complete all jobs at hand, the scheduler uses the same approach as the web task scheduler, i.e., it calculates the completion time by multiplying the number of jobs in the queue by the worst-case time to complete one job. Theoretically, it would have been better to consider the average completion time for each task before deciding whether new nodes should be started, but due to time constraints and complexity this approach was not implemented.
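One possible reading of this decision rule is sketched below. The report only specifies that the completion-time estimate is the queue length times the worst-case job time; the charge period and the exact waste formula are our assumptions:

```cpp
#include <algorithm>

// Pessimistic completion-time estimate: every queued job is assumed to
// take as long as the slowest job seen so far.
double estimateCompletionTime(long queuedJobs, double worstJobTime) {
    return static_cast<double>(queuedJobs) * worstJobTime;
}

// Start an extra worker only if the idle (wasted) fraction of its first
// charge period would stay within the user's allowed-waste budget.
bool shouldStartWorker(double estCompletion, double chargePeriod, double allowedWastePct) {
    double usedFraction = std::min(estCompletion, chargePeriod) / chargePeriod;
    return (1.0 - usedFraction) * 100.0 <= allowedWastePct;
}
```

Under this sketch, a 30% allowed waste admits a worker that would be busy for at least 70% of its charge period, while a stricter budget rejects it.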
Chapter 3
Implementation
The implementation of all modules was done in C++. The UML diagram in figure 3.1
shows the relationship between modules and their key attributes.
Worker: workerState (enum); execute(), startWorker(), stopWorker(), submitJobs(), getState(), getAvailableMemory(), isAcceptingJobs(), getTotalMemory(), getCostPerHour(), getInstructionsPerTime(), getTotalExecutionTime(), getTotalCPUTime(), getAverageResponseTime(), getTotalCost(), getQueuedJobs(), getJobsCompleted()
Scheduler: queuedJobs, runningJobs, completedJobs (list<Job>); workers (list<Worker>); workerStats (list<WorkerStats>); runScheduler(), submitJobs(list<Job>), notifyJobCompletion(), getSlowestJobTime(), fetchJobsFromQueueRandomly(), startWorkerNode(), runRoundRobinScheduler(), runWebScheduler(), runSingleTaskScheduler()
Job: jobid, taskid, num_instructions, mem_size (long); getJobID(), getTaskID()
TaskGen: tasks (list<Task>), jobs (list<Job>), scheduler (Scheduler); sendTask(), createTask()
Simulator: currentTime (long)
Task: taskid (long), jobrate, num_of_jobs, jobs
Relationships: the Simulator has one Scheduler and 0..n Workers; the TaskGen submits jobs/tasks to the Scheduler; each Task has 1..n Jobs.
Figure 3.1: UML diagram
The clock functionality was implemented in the Simulator class as a while loop that calls the task generator, scheduler, and worker objects in every iteration. One iteration is considered to be one millisecond by each module. However, since the scheduler only works after a configurable interval, it ignores intermediate iterations and only does work once the specified scheduling interval has elapsed. Jobs and tasks are fed to the scheduler by the task generator at a rate specified in the input file (input.conf). The number of workers to start automatically at the beginning of the simulation can be configured in workers.conf, but they still take time to start up once the simulation begins; until then, jobs are queued at the scheduler's end. The initial worker objects are created by the simulator and passed to the scheduler at the start of the simulation. A limitation of this implementation is that the scheduler has to wait for workers to start, even though an initial number of started workers is specified.
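The tick loop can be sketched as below, with only the scheduler-interval logic shown and a stub scheduler that counts its own invocations (all names here are ours, not from the codebase):

```cpp
// Minimal sketch of the simulator clock: one loop iteration is one
// millisecond, and the scheduler only acts every intervalMs ticks.
struct CountingScheduler {
    long runs = 0;
    void runScheduler() { ++runs; }   // the real scheduler would dispatch jobs here
};

long simulate(long lengthMs, long intervalMs) {
    CountingScheduler scheduler;
    for (long tick = 0; tick < lengthMs; ++tick) {
        // The task generator and each worker would also be stepped once per tick.
        if (tick % intervalMs == 0)
            scheduler.runScheduler();
    }
    return scheduler.runs;
}
```

With the configured 0.1 s scheduling interval, the scheduler acts once every 100 iterations.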
The worker node is implemented as a state machine with the following states: Initialising, Idle, Computing, Swapping, and Offline. It can accept jobs in all states except Initialising and Offline. It maintains two queues: jobs in memory and jobs on the hard drive. When swapping occurs at a node, it is conducted in a round-robin fashion. A job is started only if it fits in memory; if not, an existing job is swapped out (i.e., moved to the hard-drive queue) and the next job is retried. Moreover, the worker exposes a public API for statistical use by the scheduler.
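The admission-and-swap behaviour can be sketched as follows. `WorkerSketch` and its members are illustrative names of ours; the real worker additionally models swap costs and the full state machine:

```cpp
#include <deque>

enum class WorkerState { Initialising, Idle, Computing, Swapping, Offline };

// Jobs are accepted in every state except Initialising and Offline.
bool acceptsJobs(WorkerState s) {
    return s != WorkerState::Initialising && s != WorkerState::Offline;
}

struct Job { long id; long memSize; };

// Sketch of the two-queue memory model: a job is admitted to memory if it
// fits; otherwise resident jobs are swapped out round-robin until it does.
struct WorkerSketch {
    long totalMemory;
    long usedMemory = 0;
    std::deque<Job> inMemory;   // jobs currently resident in memory
    std::deque<Job> onDisk;     // jobs swapped out to the hard drive

    explicit WorkerSketch(long mem) : totalMemory(mem) {}

    void admit(Job j) {
        if (j.memSize > totalMemory) {       // can never fit: straight to disk
            onDisk.push_back(j);
            return;
        }
        while (usedMemory + j.memSize > totalMemory) {
            Job victim = inMemory.front();   // round-robin victim
            inMemory.pop_front();
            usedMemory -= victim.memSize;
            onDisk.push_back(victim);        // swapped out
        }
        inMemory.push_back(j);
        usedMemory += j.memSize;
    }
};
```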
Chapter 4
Results
In this chapter, three test runs are presented and evaluated. The following configuration was used for all test runs.
Table 4.1: Simulator Configuration
Scheduling interval: 0.1 s
Worker node speed: 300 instr/s
Worker node memory: 8 GB
Worker swapping cost: 5 instr/GB
Worker quantum: 0.1 s
Worker node start-up time: 120 s
Worker node notification time: 2 instr
Worker node cost: 1 Euro/hour
Allowed waste: 30%
Workers started: 2
4.1 Single-Task mode
Table 4.2 shows the input configuration for the jobs.
Figure 4.1 shows that no jobs are scheduled in the first 120 seconds, as the workers are being started during this time. Jobs are then scheduled in a round-robin fashion until a sufficiently accurate estimate of when the jobs will complete is attained. At this point one more worker is started and the spilled jobs are sent to the designated worker. However, as we will see in figure 4.2, this new worker is not well used
Table 4.2: Input Configuration
S 1 100 1 0 500 1000 1024 2048
S 2 400 10 0 5000 7500 1024 2048
S 3 100 10 0 1000 2000 1024 2048
S 4 200 1 0 7500 10000 1024 2048
Figure 4.1: Response and Queued jobs over time in single-task mode.
and only a small number of jobs are sent to it. Response time starts to increase around 6000 s, since only larger jobs remain and more swapping is carried out between them. Even though the average response time increases, no more worker nodes are started: the scheduler has already submitted all the jobs in its queue, so no further estimation is made. Starting more workers at this point would only increase the cost, and the unused time, compared to letting the jobs run on the existing machines.
Based on the first jobs completed, the scheduler estimates that one more worker is required to optimize cost. As seen in figure 4.2, however, this worker is not used very heavily while the other two are, because cost takes priority. Note that the machine is switched off before the next charging instant.
Figure 4.2: Number of active, idle, and computing workers over time for single-taskmode.
Table 4.3: Simulation summary
Number of jobs: 800
Total: 25138 s
Active: 21723 s
Unused: 3415 s
Waste: 15.7207%
Cost: 6 Euro
Job avg response time: 7430.67 s
Standard deviation: 3299.69 s
4.1.1 Simulation summary
Cost is calculated incorrectly when a node has been turned off, due to a bug in the implementation; it should be 7 Euro, since we still have to pay for the hour the third worker was online. Moreover, the average response time is considerable because the long-running jobs are swapped often.
Re-evaluating the need for more workers when response time increases, and restarting a portion of the jobs on a new worker, would be one way of minimizing the response time in this case.
4.2 Web-Task Mode
Table 4.4: Input Configuration
W 1 100 1 0 5 100 1024 2048
W 2 400 10 0 50 1000 1024 2048
W 3 100 10 0 500 10000 1024 2048
W 4 200 1 0 1000 20000 1024 2048
Figure 4.3: Response time and queued jobs over time in web-task mode
As figure 4.3 shows, initially no jobs are scheduled; as soon as the workers are up and running, all jobs that have arrived at the scheduler so far are sent to workers. Response time starts to increase due to swapping. It levels off a little between 1000 and 2000 seconds, while the remaining early jobs are small. However, there are 200 jobs with potentially long execution times (up to 20000 instructions), which causes the response time to increase. Since no new workers are started later in the execution, and jobs are not moved between workers, the response time continues to increase.
Even though response time increases towards the end of the simulation, and jobs are still being completed, worker usage is maximised for most of the execution. Note that the simulation runs for almost two hours but finishes before the next charging time unit.
Figure 4.4: Active, idle and computing workers over time in web-task mode.
4.2.1 Simulation Summary
Table 4.5: Summary of simulation
Number of jobs: 800
Total: 16138 s
Active: 13977 s
Unused: 2161 s
Waste: 15.4611%
Cost: 6 Euro
Job avg response time: 1442.15 s
Standard deviation: 1535.77 s
The standard deviation is relatively high because some jobs finish very quickly while others take a long time to complete due to increased swapping.
We should note here that minimizing swapping is one factor that should be taken into account when designing the scheduling algorithm. It is also the first jobs to complete that determine how many additional workers to start, causing the response time towards the end to increase significantly as the available workers become heavily loaded and the estimate of the remaining work starts to drift.
4.3 Round-Robin
As a comparison to the examples provided above, this example uses a simple round-robin
algorithm. It starts with two workers and does not start any additional workers. This
example was made using the same configuration as for Web-Task mode.
Figure 4.5: Response time and queued jobs using round-robin scheduling
4.3.1 Simulation Summary
Table 4.6: Simulation summary
Number of jobs: 800
Total: 23755 s
Active: 13882 s
Unused: 9873 s
Waste: 71.1209%
Cost: 8 Euro
Job avg response time: 2871.86 s
Standard deviation: 3753.55 s
As seen in graph 4.5, the response time is almost doubled, and the time to complete all jobs is also doubled. Comparing the summaries, we see that unused time is much higher with round-robin, primarily because this algorithm neither takes the allowed waste into account when distributing jobs nor starts new worker nodes.
Chapter 5
Conclusion
5.1 Scope for Improvement
5.1.1 Shifting jobs between workers
As the results show, a drawback of the proposed algorithms is that the number of workers to start is only estimated while jobs are in the scheduler queue. If, during the simulation, the response time starts to increase dramatically due to large jobs that were not part of the scheduler's initial estimate, more nodes could be started and jobs shifted to the new nodes. For example, the scheduler could be improved to take increasing response time into account. Since jobs have already been passed on to workers by the scheduler, such a change would require jobs to be cancelled at one worker and restarted at another, something that was not fully implemented in this simulator.
The increase in response time depends largely on the number of instructions needed to complete a job, and to a lesser extent on its memory footprint. Although the latter also affects response time, a worker that has been assigned too many jobs will swap even if its jobs have few instructions. A worker node with many assigned jobs conducts swapping between all its jobs in a round-robin fashion. While this avoids starvation (i.e., small jobs never being computed), it significantly increases the average response time of a small job when it competes for computing cycles with many other jobs.
5.1.2 Load parameters
Another possible improvement concerns the variables used to estimate load. In the proposed method, only the number of jobs per worker is used as a measure of load. However, if, for example, the remaining time and the memory consumption of a job (which is only known once the job has started) were added to the load estimate, a better distribution of jobs could be attained. The scheduler would then try to aggregate smaller jobs onto the same worker and place fewer large jobs per worker. This way, swapping between jobs would be minimized and the response time of small jobs would not suffer as much.
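As a sketch, such a composite load estimate might look like the following. The weights are purely illustrative assumptions on our part, not values derived from the report:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical composite load score combining job count, estimated time
// remaining, and memory pressure, with assumed illustrative weights.
struct WorkerLoad {
    int jobs;                 // jobs currently assigned to the worker
    double remainingTime;     // estimated seconds of work left
    double memUsedFraction;   // 0.0 .. 1.0
};

double loadScore(const WorkerLoad& w) {
    return 1.0 * w.jobs + 0.01 * w.remainingTime + 2.0 * w.memUsedFraction;
}

// Pick the worker with the lowest composite load to receive the next job.
std::size_t leastLoaded(const std::vector<WorkerLoad>& workers) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < workers.size(); ++i)
        if (loadScore(workers[i]) < loadScore(workers[best]))
            best = i;
    return best;
}
```

In practice the weights would have to be tuned against simulation results, since each term is measured in different units.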
5.1.3 Timeline
It would be interesting to evaluate the scheduler if jobs were not only sent at the beginning of the simulation. By, for example, adding a delay between tasks in the input configuration file, jobs arriving at different times would have to be accounted for. While the rate at which jobs are sent from the Task Generator to the Scheduler provides some of this behaviour, it primarily affects the beginning of the simulation. In a real-world scenario jobs must be able to arrive at the scheduler at any time, something a delay parameter between tasks could provide.
5.2 Conclusion
In this report, two cloud scheduling algorithms have been evaluated and compared to a round-robin algorithm; the results show an improvement over round-robin scheduling. A simulator with a number of constraints was implemented to test these algorithms. Beyond the proposed method, two other approaches were considered but discarded due to difficulties in estimating cost and time to complete.