
Invited: Performance Evaluation of Heterogeneous Multi-Queues with Job Replication

Daniel A. Menascé and Noor Bajunaid
Department of Computer Science

George Mason University
4400 University Drive, Fairfax, VA 22030, USA

[email protected] and [email protected]

Abstract—A system composed of n heterogeneous servers receives jobs to be processed. A front-end dispatcher submits d (d ≤ n) copies of each job to the queue of each of the d chosen servers. The copy that completes first causes the removal from the system of each of the other d − 1 copies, which can be executing or in the queue for a server. This paper evaluates job execution time statistics as a function of (1) the average job arrival rate from a Poisson process, (2) the distribution of job service times, (3) the processing capacity of each server, and (4) the job replication and dispatching policy. The paper derives analytical expressions for the average processing time of jobs that fail during execution and uses discrete event simulation to carry out experiments for analyzing a variety of statistics (e.g., response time, probability that jobs are cancelled while in the queue or killed while in execution, and server utilizations) related to a heterogeneous multi-server system under four different dispatching policies.

I. INTRODUCTION

As indicated in [3], it may be advantageous to submit a job to multiple server queues at the same time when there is significant server-side variability, which could be due to factors such as background load, network interrupts, garbage collection, and software failures. By using redundancy, latency can be reduced: a job is considered to be complete when it finishes ahead of all its replicas, which are cancelled if they are still waiting to start processing or are killed if they have already started to be processed. Previous theoretical queuing research on redundant queues used a simplifying independence assumption that the service times of all job replicas are drawn from identical and independent random variables (see e.g., [4], [6], [8]). However, all the replicas are identical in terms of their computing and I/O demands, and therefore one cannot assume that the service times of all replicas of a job are independently drawn from the same distribution at each server.

Our study considers that the servers in a cluster are basically heterogeneous in terms of capacity even if they start as identical. The reasons may include: (1) servers that age or fail tend to be replaced by newer and faster ones; (2) the state of secondary storage devices among the servers of a cluster diverges in different ways over time, leading to different I/O times; (3) server virtualization leads to different server capacities due to the co-existence of various virtual machines in a single physical server; and (4) the background load of OS services varies from server to server.

We also consider here that jobs may fail during their execution due to a variety of factors, including memory leaks and misconfigurations, in which case jobs have to be restarted. Jobs may fail several times before they succeed. Therefore, a job's processing time is elongated by the amount of computation wasted due to failures.

This paper examines and compares four different dispatching policies. Two of them (Random and Redundant-to-Idle-Queue) use none or very little information about the servers to make dispatching decisions. The other two (Average Queue Length and Queue Cancellation) use server statistics to estimate the response time of a job at each server before deciding where to send a job and its replicas. These two policies are more intrusive than the first two and may not be easy or practical to implement in some environments.

The contributions of this paper include: (1) A formal statement of the problem of dispatching replicas of a job to heterogeneous servers. (2) The derivation of an analytical expression for the average job processing time in the presence of job failures. (3) An extensive simulation study of the variation of the response time and other related metrics as a function of the number of job replicas for several dispatching policies.

The rest of the paper is organized as follows. Section II formally describes the problem. Section III presents the assumptions used in the paper. Section IV introduces the notation used here. Section V discusses the motivation for dispatching multiple copies of a job. The next section provides a derivation of the average number of failures and wasted time due to job restarts. Section VII describes the replication dispatching policies considered in this paper and Section VIII describes the simulation experiments and their results. Finally, Section IX provides some concluding remarks and discusses future work.

II. PROBLEM DESCRIPTION

Consider a server cluster composed of n servers and a front-end job dispatcher (heretofore referred to as dispatcher). Jobs arrive to the dispatcher, which creates d (1 ≤ d ≤ n) replicas of the job and submits them to server queues chosen according to a dispatching policy.


The set of replicas of a job is called a job group. The first job of a job group to finish causes all other d − 1 replicas to be cancelled or killed.

Figure 1 shows an example of the system we consider in this paper. There are four servers and the dispatcher is shown receiving three jobs (J1, J2, and J3) in this order. Two copies are made of each job. The two copies of job J1 are submitted to servers 1 and 3, the copies of job J2 are submitted to servers 3 and 4, and the two copies of job J3 to servers 2 and 4. If the copy of job J1 finishes in server 1 before its copy in server 3, that copy is deleted whether it is still in server 3's queue or being processed by server 3.

Fig. 1: Multiple Queues with Job Replication.
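To make the job-group semantics concrete, the following is a minimal, hypothetical sketch of the dispatcher-side bookkeeping implied by the description above: when the first replica of a job group finishes, the remaining d − 1 replicas are cancelled if still queued or killed if already executing. All names (JobGroup, is_executing, cancel, kill) are ours and purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class JobGroup:
    """All d replicas of one job; the first replica to finish wins (illustrative)."""
    job_id: int
    servers: list       # the d servers that received a replica of this job
    done: bool = False

def on_replica_completion(group, finishing_server, servers):
    """Invoked when a replica of `group` completes at `finishing_server`.
    `servers` maps a server id to a (hypothetical) server object."""
    if group.done:
        return                                   # another replica already finished
    group.done = True
    for s in group.servers:
        if s == finishing_server:
            continue
        if servers[s].is_executing(group.job_id):
            servers[s].kill(group.job_id)        # replica in service -> killed
        else:
            servers[s].cancel(group.job_id)      # replica still queued -> cancelled
```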

We consider, for the reasons below, that servers are heterogeneous. In practice, it is more common for servers in a cluster to have different effective capacities, for the following reasons:

• As servers age or fail, they are replaced by newer ones with typically different characteristics (e.g., different clock frequencies, different cache sizes, and even different hardware architectures). Thus, even a cluster of identical servers gradually morphs over time into a cluster of heterogeneous servers.

• The performance of two identical servers tends to diverge over time due to the state of their secondary storage.

• When servers are virtual servers running on physical servers, their effective capacity is affected by the load of other virtual machines running on the same physical box [10]. Thus, in this case, a server's capacity tends to change over time.

• The background load of OS services varies over time from server to server and uses resources required by the jobs.

III. ASSUMPTIONS

We consider the following assumptions:

• A1: The cancellation time of a job in the queue is negligible because it involves trivial operations in the data structure that represents the queue.

• A2: The time to kill a job (i.e., removal of a job when it is being processed) is considered to be negligible because it involves process context switching, which is typically very small compared with a job's processing time. It should be noted though that the penalty for killing a job is that a server wastes time on that job, delaying other jobs that arrived after the killed job. We assume that the dispatcher can keep track of the time spent processing by a job that is killed at any server (see also assumption A7). Therefore, the dispatcher can compute the average time a job wastes being served before it is killed by dividing the sum of the wasted times by the number of killed jobs at each server.

• A3: Jobs arrive to the dispatcher according to a Poisson process. This is a reasonable assumption when jobs are generated from a large number of independent sources.

• A4: The dispatcher knows the distribution of the number of job operations as well as the capacity of each of the n servers. This means it knows the distribution of a job's service times. However, the dispatcher does not know the actual service time of individual jobs because the actual execution time is usually data-dependent.

• A5: Servers use a FCFS discipline to serve jobs in their queues.

• A6: The dispatcher knows the status of each queue (i.e., the list of jobs in each queue) and the id of the job being served, but it does not know the execution times of the jobs in the server queues nor the execution time of the jobs being processed.

• A7: The dispatcher is notified when a job is cancelled or killed at any server. Therefore, the dispatcher can keep a running estimate of the probability that jobs are cancelled or killed at any server.

IV. NOTATION

Consider the following notation:

• n: number of servers.
• d (d ≤ n): number of replicas generated for a job.
• O: r.v. that indicates the number of operations of a job.
• Ci (i = 1, · · · , n): processing capacity of server i, in operations/sec. We assume without loss of generality that C1 ≤ C2 ≤ · · · ≤ Cn.
• Si: r.v. that indicates the processing time of jobs at server i; thus, Si = O/Ci.
• λ: average arrival rate of jobs at the dispatcher.
• λi: average arrival rate of jobs at server i.
• f(P, i): fraction of jobs received by the dispatcher that are sent to server i under replication policy P.
• Ri: average response time of jobs at server i.
• Pcancel,i: probability that jobs are cancelled at server i, computed as the ratio of the number of jobs cancelled at the server over the total number of jobs received by the server.


• Pkill,i: probability that jobs are killed at each server, computed as the ratio of the number of jobs killed at the server over the total number of jobs received by the server.
• Pcomplete,i: probability that jobs that arrive at server i complete their processing at that server, i.e., they are neither cancelled nor killed. Thus, Pcomplete,i = 1 − (Pcancel,i + Pkill,i).
• E[Wastedkill,i]: average processing (not waiting) time spent by killed jobs at server i. This value is computed by the dispatcher (see Assumption A2).
• Fi: r.v. that represents the failure instant of a job running at server i after its last start or restart at that server.
• fFi(x): pdf of Fi.
• qi: probability that a job fails while executing at server i.
• Wi: r.v. that denotes the amount of wasted computation time at server i per failure of a job at server i.
• Pi: r.v. that indicates the total processing time of a job at server i, including successful processing Si plus the total wasted time due to all failures of the job.
• Uu,i: useful utilization of server i, defined as the ratio of the time the server was busy processing requests that were not killed at the server divided by the total simulation time.
• Uw,i: wasted utilization of server i, defined as the ratio of the time the server was busy processing jobs killed at the server divided by the total simulation time.

A sketch of how the dispatcher could maintain these per-server statistics appears after this list.
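The sketch below is one hypothetical way the dispatcher could maintain the per-server statistics defined above (counts of submitted, cancelled, killed, and completed replicas, plus accumulated wasted processing time of killed jobs); the class and field names are ours, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class ServerStats:
    """Running per-server counters kept by the dispatcher (illustrative sketch)."""
    submitted: int = 0        # replicas sent to this server
    cancelled: int = 0        # replicas removed while still in the queue
    killed: int = 0           # replicas removed while being executed
    completed: int = 0        # replicas that finished first in their job group
    wasted_time: float = 0.0  # total processing time of killed replicas (Assumption A2)

    @property
    def p_cancel(self):       # Pcancel,i
        return self.cancelled / self.submitted if self.submitted else 0.0

    @property
    def p_kill(self):         # Pkill,i
        return self.killed / self.submitted if self.submitted else 0.0

    @property
    def p_complete(self):     # Pcomplete,i = 1 - (Pcancel,i + Pkill,i)
        return 1.0 - (self.p_cancel + self.p_kill)

    @property
    def mean_wasted_kill(self):  # E[Wasted_kill,i]: average wasted time per killed job
        return self.wasted_time / self.killed if self.killed else 0.0
```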

V. MOTIVATION

As a motivation for our work, we show in Fig. 2 a graph of average response time versus the number of replicas d for a case of 5 servers. The arrival rate of requests to the dispatcher is 0.072 req/sec, the number of operations per job is geometrically distributed with parameter p = 0.01, and all servers have the same nominal capacity of 10 operations/sec. All servers have Poisson failure models with different failure rates: β1 = 0.01, β2 = 0.02, β3 = 0.04, β4 = 0.05, and β5 = 0.05. In this example, the dispatcher randomly selects d out of the 5 servers to send replicas of a job. The picture shows that, for the data above, the minimum response time occurs when the number of replicas is equal to 3. In other words, dispatching more than one replica of the job reduces the response time up to a point, but when too many replicas are submitted, the response time increases due to excessive wasted time by killed jobs.

Figure 3 shows the relationship between cancelled, killed, and completed requests as they flow through the dispatcher, go through one or more servers, and leave. The overall arrival rate of requests to the dispatcher is λ requests/sec. A fraction f(P, i) of these requests is routed to server i according to the dispatching policy P.

Fig. 2: Average response time with 95% confidence intervals as a function of the number of replicas d.

So, the arrival rate of requests to server i is

\[
\lambda_i = \lambda \times f(P, i). \tag{1}
\]

Then, the rate at which requests complete from server i, i.e., the throughput Xi, is

\[
X_i = \lambda \times f(P, i) \times P_{complete,i}. \tag{2}
\]

We can then compute the useful utilization, Uu,i, of each server i as

\[
U_{u,i} = \lambda \times f(P, i) \times P_{complete,i} \times E[P_i]. \tag{3}
\]

An upper bound for the wasted utilization, Uw,i, of server i due to job killings is

\[
U_{w,i} < \lambda \times f(P, i) \times P_{kill,i} \times E[P_i] \tag{4}
\]

because the average time spent by the server on a killed job is less than E[Pi].
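As a quick numerical reading of Eqs. (1)-(4), the snippet below computes λi, the throughput Xi, the useful utilization Uu,i, and the upper bound on the wasted utilization Uw,i from assumed values of λ, f(P, i), Pcomplete,i, Pkill,i, and E[Pi]; the numbers are placeholders, not values from the paper.

```python
lam = 14.0          # overall arrival rate (jobs/sec), assumed
f_P_i = 0.05        # fraction of jobs routed to server i under policy P, assumed
p_complete_i = 0.6  # Pcomplete,i, assumed
p_kill_i = 0.35     # Pkill,i, assumed
E_P_i = 2.0         # E[Pi], average processing time including failures (sec), assumed

lam_i = lam * f_P_i                                # Eq. (1): arrival rate at server i
X_i = lam * f_P_i * p_complete_i                   # Eq. (2): throughput of server i
U_useful_i = lam * f_P_i * p_complete_i * E_P_i    # Eq. (3): useful utilization
U_wasted_bound_i = lam * f_P_i * p_kill_i * E_P_i  # Eq. (4): upper bound on wasted utilization

print(lam_i, X_i, U_useful_i, U_wasted_bound_i)
```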

As another example, Figs. 4-6 show graphs depicting Pcancel, Pkill, and Pcomplete for each of nine servers for random dispatching to d servers for d = 2, 5, and 9, respectively. The capacity of the nine servers is given by Ci = 2i operations per second. So, the capacity ranges from 2 to 18 operations per second, with server 1 being the slowest and server 9 the fastest. The arrival rate of requests is 0.02 requests/sec and the number of operations per request is distributed according to the binomial Bin(150, 0.5).

As a general observation with respect to Figures 4-6, we can see that the cancel probability is very low, but the kill probability is high and close to 1 for the slower servers and decreases for the faster servers. The completion probability increases with the speed of the servers, i.e., faster servers are more likely to complete a job.

If the actual execution times of the jobs were known by the dispatcher, it could submit just one copy of the job to the shortest queue. However, according to assumptions A4 and A6, the dispatcher does not know the actual execution times, just the service time distributions.


Fig. 3: Flow of Requests.

Fig. 4: Pcancel, Pkill and Pcomplete for d = 2 and random dispatch.

Fig. 5: Pcancel, Pkill and Pcomplete for d = 5 and random dispatch.

Fig. 6: Pcancel, Pkill and Pcomplete for d = 9 and random dispatch.

Therefore, submitting more than one copy gives a job a better chance of joining a queue where the job will be served with a lower response time. It would then appear that n copies should be made of each arriving job and that they should be submitted to all servers. However, according to assumption A2, jobs in a queue are delayed by the partial execution time of jobs that are killed. So, the more copies are generated, the more we increase the response time in each queue. Therefore, there is an optimal balance between these two effects (i.e., increasing the chance of joining a queue with the smallest waiting time and joining a queue with significant delays due to jobs that will end up being killed).

VI. FAILURES DURING JOB PROCESSING

It is not uncommon for a job's execution to stall because of a variety of reasons such as hardware faults, memory leaks, or misconfigurations. In fact, Hadoop, an open source implementation of Google's MapReduce, creates a copy of a task on another node if the original one is performing poorly. This is called speculative execution and it has been noted to improve job response times by 44% [2], [10]. Therefore, we assume that jobs can fail during execution and that they are restarted on the same node without any significant delay. We do not consider hardware failures here, just software-related failures.

Consider that a job executes k operations. So, its execution time without failures at server i is S = k/Ci. The probability qi(S) that the job fails before completing its execution, conditioned on the value of S, is the probability that the failure instant occurs before the end of the execution. So,

\[
q_i(S) = \int_0^{S^-} f_{F_i}(x)\, dx. \tag{5}
\]

The probability of exactly j failures experienced by the job with execution time S executing at server i before it succeeds in completing its execution is q_i(S)^j (1 − q_i(S)), assuming independence among failures. For example, if the failure model for node i is Poisson, i.e., Fi is exponentially distributed with parameter βi, then q_i(S) = 1 − e^{−β_i S}.

The average number of failures NFi(S) before a job is able to complete at server i is

\[
NF_i(S) = \sum_{j=0}^{\infty} j \, q_i(S)^j \left(1 - q_i(S)\right) = \frac{q_i(S)}{1 - q_i(S)}. \tag{6}
\]

The above expressions are conditioned on the service time being equal to S. We can uncondition the expression for NFi as follows:

\[
NF_i = \sum_{k=1}^{\infty} \frac{q_i(k/C_i)}{1 - q_i(k/C_i)} \Pr[S_i = k/C_i]
     = \sum_{k=1}^{\infty} \frac{q_i(k/C_i)}{1 - q_i(k/C_i)} \Pr[O = k] \tag{7}
\]

because Pr[Si = k/Ci] = Pr[O = k].

Assume as an example that the r.v. O is geometrically distributed with parameter p. Then Pr[O = k] = (1 − p)^{k−1} p. Also assume as above, for the sake of example, that the failure model is Poisson. Then, q_i(k/C_i) = 1 − e^{−β_i k/C_i} and

\[
NF_i = \sum_{k=1}^{\infty} \frac{1 - e^{-\beta_i k/C_i}}{e^{-\beta_i k/C_i}} (1-p)^{k-1} p
     = \frac{p}{e^{-\beta_i/C_i} - 1 + p} - 1 \tag{8}
\]

if (1 − p)/e^{−β_i/C_i} < 1. Note that Eq. (8) yields NFi = 0, as expected, when there are no failures, i.e., βi = 0.
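The closed form in Eq. (8) can be checked numerically against a truncated version of the sum in Eq. (7) specialized to the geometric/Poisson example. The sketch below does that; the parameter values loosely follow the motivation example of Section V (Ci = 10 ops/sec, p = 0.01, βi = 0.02) and the truncation point is arbitrary.

```python
import math

def nf_closed_form(beta_i, p, C_i):
    """Eq. (8): average number of failures for geometric O and Poisson failures."""
    a = math.exp(-beta_i / C_i)
    assert (1 - p) / a < 1, "series converges only if (1-p)/e^{-beta_i/C_i} < 1"
    return p / (a - 1 + p) - 1

def nf_truncated_sum(beta_i, p, C_i, k_max=10_000):
    """Eq. (7) specialized to the same example, truncated at k_max terms."""
    total = 0.0
    for k in range(1, k_max + 1):
        q = 1.0 - math.exp(-beta_i * k / C_i)   # q_i(k/C_i)
        pr_O = (1 - p) ** (k - 1) * p           # Pr[O = k] for geometric O
        total += q / (1 - q) * pr_O
    return total

beta_i, p, C_i = 0.02, 0.01, 10.0               # assumed example values
print(nf_closed_form(beta_i, p, C_i), nf_truncated_sum(beta_i, p, C_i))
```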

The average value, E[Wi | Si = S], of Wi given that Si = S is the average value of Fi given that Fi < S because, when a failure occurs before a job completes its S seconds of computation, the computation up to the failure instant is wasted. The expression for E[Wi | Si = S] follows from the definition of conditional probability:

\[
E[W_i \mid S_i = S] = E[F_i \mid F_i < S]
= \frac{\int_{x=0}^{S^-} x\, f_{F_i}(x)\, dx}{\int_{x=0}^{S^-} f_{F_i}(x)\, dx}. \tag{9}
\]

We can then uncondition Equation (9) as follows:

\[
E[W_i] = \sum_{k=1}^{\infty} E[W_i \mid S_i = k/C_i]\, \Pr[O = k]. \tag{10}
\]

For example, consider as before that Fi is exponentially distributed and O is geometrically distributed. Then, E[Wi | Si = S] can be computed as

\[
E[W_i \mid S_i = S] = \frac{\int_{x=0}^{S^-} x\, \beta_i e^{-\beta_i x}\, dx}{\int_{x=0}^{S^-} \beta_i e^{-\beta_i x}\, dx} \tag{11}
\]

\[
= \frac{\frac{1}{\beta_i} - \left(\frac{1}{\beta_i} + S\right) e^{-\beta_i S}}{1 - e^{-\beta_i S}}. \tag{12}
\]

Unconditioning on S = k/Ci gives us

\[
E[W_i] = \sum_{k=1}^{\infty} \frac{\frac{1}{\beta_i} - \left(\frac{1}{\beta_i} + \frac{k}{C_i}\right) e^{-\beta_i k/C_i}}{1 - e^{-\beta_i k/C_i}} \, (1-p)^{k-1} p. \tag{13}
\]

The infinite summation in Equation (13) can be computed algorithmically (see Algorithm 1) by observing that, as k → ∞, e^{−β_i k/C_i} → 0 and, using L'Hospital's Rule, k e^{−β_i k/C_i} → 0. Note also that e^{−β_i k/C_i} → 0 much faster than (1 − p)^{k−1}. Thus, for sufficiently large values of k (say k = k*), the term inside the summation becomes ≈ (1/βi)(1 − p)^{k−1} p, and the summation from k = k* to ∞ converges to (1 − p)^{k*−1}/βi. Therefore, the summation in Equation (13) can be computed by adding the k-th term starting at k = 1 until the k-th term is close enough (within a tolerance) to (1/βi)(1 − p)^{k−1} p (lines 3-6 of the algorithm). We then need to add (1 − p)^{k*−1}/βi to the already accumulated values (line 7 of the algorithm).

Algorithm 1: Algorithmic Computation of Eq. (13)
Input: βi, p, Ci
Output: E[Wi]

1: Let T(k) = [ (1/βi − (1/βi + k/Ci) e^{−βi k/Ci}) / (1 − e^{−βi k/Ci}) ] · (1 − p)^{k−1} p
2: k ← 1; SUM ← 0
3: repeat
4:   SUM ← SUM + T(k)
5:   k ← k + 1
6: until T(k) ≈ (1/βi)(1 − p)^{k−1} p
7: E[Wi] ← SUM + (1 − p)^{k−1}/βi
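A direct transcription of Algorithm 1 into code, under the same geometric/Poisson assumptions, might look like the sketch below; the tolerance and the parameter values (again loosely following the motivation example) are placeholders.

```python
import math

def expected_wasted_time(beta_i, p, C_i, tol=1e-12):
    """Algorithm 1: sum the series in Eq. (13) until the k-th term is within tol
    of its asymptote (1/beta_i)(1-p)^(k-1) p, then add the remaining tail."""
    def term(k):
        s = k / C_i
        e = math.exp(-beta_i * s)
        # E[W_i | S_i = k/C_i] from Eq. (12), weighted by Pr[O = k] (geometric O)
        return ((1.0 / beta_i - (1.0 / beta_i + s) * e) / (1.0 - e)) * (1 - p) ** (k - 1) * p

    k, total = 1, 0.0
    while True:
        total += term(k)                                          # line 4: SUM <- SUM + T(k)
        k += 1                                                    # line 5: k <- k + 1
        if abs(term(k) - (1 - p) ** (k - 1) * p / beta_i) < tol:  # line 6: stopping test
            break
    return total + (1 - p) ** (k - 1) / beta_i                    # line 7: add the series tail

print(expected_wasted_time(beta_i=0.02, p=0.01, C_i=10.0))        # assumed values
```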

We can now combine Eqs. (8) and (13) to compute the average job processing time E[Pi] at server i, considering job failures, as

\[
E[P_i] = E[S_i] + NF_i \times E[W_i]. \tag{14}
\]
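Combining the pieces, E[Pi] from Eq. (14) can be computed as below for the geometric/Poisson example, using the two sketch functions given earlier (nf_closed_form for Eq. (8) and expected_wasted_time for Algorithm 1); E[Si] = E[O]/Ci = 1/(p Ci) because O is geometric with parameter p. Parameter values are illustrative only.

```python
def expected_processing_time(beta_i, p, C_i):
    """Eq. (14): E[P_i] = E[S_i] + NF_i * E[W_i], geometric O and Poisson failures."""
    E_S_i = 1.0 / (p * C_i)                        # E[O]/C_i with E[O] = 1/p
    nf_i = nf_closed_form(beta_i, p, C_i)          # Eq. (8), sketched above
    E_W_i = expected_wasted_time(beta_i, p, C_i)   # Eq. (13) via Algorithm 1, sketched above
    return E_S_i + nf_i * E_W_i

print(expected_processing_time(beta_i=0.02, p=0.01, C_i=10.0))
```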


VII. REPLICATION POLICIES

A replication policy determines how many copies of a job should be generated and to which servers they should be submitted. We consider the following policies for selecting queue placement for job J. Some of these policies are oblivious to the state of the servers while others use some information about the state of the servers. Clearly, the policies that use more information from the servers are expected to perform better. However, in practice, it may not be feasible for the servers to share significant state information with the dispatcher, especially when the number of servers is very large, as is the case with many warehouse-scale data centers [1].

• AverageQLength (AQL): Compute for each server i an estimate Ri of the response time of an arriving job J at server i as

\[
R_i \approx E[P_i] + QL_i \cdot E[P_i] \tag{15}
\]

where E[Pi] is the average processing time of jobs at server i computed using Eq. (14) and QLi is the queue length at server i, including the job being served. QLi is known to the dispatcher according to Assumption A6. Note that Ri is a conservative estimate because it is possible that one or more of the jobs found in the queue by job J will be cancelled or killed. The AQL policy then selects the d servers with the smallest values of Ri to submit job J (see the selection sketch after this list).

• QCancellation (QC): As indicated in Assumption A7, the dispatcher knows when jobs are cancelled or killed at any server. Therefore, the dispatcher can keep a running estimate of the values of Pcomplete,i and Pkill,i for each server by dividing, respectively, the number of jobs completed at the server by the number of jobs submitted to the server and the number of jobs killed at the server by the number of jobs submitted to the server. Thus, this policy is similar to the AverageQLength policy but takes into account the estimated probability that some of the jobs are killed or cancelled. Therefore, the dispatcher computes for each server i the following estimate for the average response time Ri:

\[
R_i \approx E[P_i] + QL_i \left[ E[P_i] \cdot P_{complete,i} + E[Wasted_{kill,i}] \cdot P_{kill,i} \right] \tag{16}
\]

and submits a replica of job J to the d servers with the smallest values of Ri.

• Redundant-to-Idle-Queue (RIQ): This dispatching policy was introduced in [3]. Under this policy, the dispatcher randomly chooses d servers and queries them to check if they are idle or busy. If all d servers are busy, job J joins the queue of one of the d servers at random. However, if 0 < i ≤ d queried servers are idle, job J is submitted to all i idle servers.

• Random: This policy randomly selects d different queues to submit job J. This policy, also referred to as Redundant-d in [3], is oblivious to the state of the queues. Note that the RIQ and the Random policies are identical when d = 1. Also, at light loads, RIQ and Random tend to behave similarly because there is a very high probability that all chosen servers are idle.
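One concrete reading of the AQL and QC selection steps is sketched below: the dispatcher estimates Ri per Eq. (15) or Eq. (16) from its per-server statistics (see the ServerStats sketch in Section IV) and picks the d servers with the smallest estimates. This is our illustration only, not the authors' simulator code; E[Pi] would come from Eq. (14) and QLi from Assumption A6.

```python
import heapq

def r_estimate_aql(E_P_i, QL_i):
    """Eq. (15): R_i ~ E[P_i] + QL_i * E[P_i]."""
    return E_P_i + QL_i * E_P_i

def r_estimate_qc(E_P_i, QL_i, p_complete_i, p_kill_i, E_wasted_kill_i):
    """Eq. (16): R_i ~ E[P_i] + QL_i*(E[P_i]*Pcomplete,i + E[Wasted_kill,i]*Pkill,i)."""
    return E_P_i + QL_i * (E_P_i * p_complete_i + E_wasted_kill_i * p_kill_i)

def pick_d_servers(estimates, d):
    """Indices of the d servers with the smallest estimated response time R_i."""
    return heapq.nsmallest(d, range(len(estimates)), key=lambda i: estimates[i])

# Illustrative use with made-up per-server values:
E_P = [1.5, 1.0, 0.8, 0.6]     # E[P_i] per server (Eq. 14), assumed
QL = [3, 2, 4, 5]              # current queue lengths (Assumption A6), assumed
r_aql = [r_estimate_aql(E_P[i], QL[i]) for i in range(len(E_P))]
print(pick_d_servers(r_aql, d=2))   # the d servers chosen under AQL
```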

VIII. SIMULATION STUDY RESULTS

This section describes the simulator and the setup for the simulation experiments. Figure 7 depicts the architecture of the simulator and its main components.

[Figure 7 shows the simulator's main components: a Job Generator (Poisson arrival-time generator and geometric number-of-operations generator), the Dispatcher (random selector plus AQL and QC response time estimators), the Servers (a waiting-jobs queue and a service center), and a Statistics Generator (response time, wasted time, Pkill, Pcomplete, Pcancel).]

Fig. 7: Block Diagram of the Simulator.

The simulator was written in Java and considers jobs that arrive to the dispatcher from a Poisson process with λ = 7 and 14 jobs/sec. The number of operations per job is distributed according to a geometric distribution with parameter p = 0.01 in our experiments. Other distributions could be used though. Thus, Pr[O = k] = (1 − p)^{k−1} p = 0.99^{k−1} × 0.01 for k = 1, · · ·. We used Poisson failures throughout our simulation studies. All of our simulation experiments consider 100 servers (i.e., n = 100) clustered into four groups: cluster 1 (servers 1-25), cluster 2 (servers 26-50), cluster 3 (servers 51-75), and cluster 4 (servers 76-100). Remember that lower numbered servers have lower nominal capacity. Table I shows the profiles of the four clusters. The table shows the average, standard deviation, and coefficient of variation (COV) of the server capacities in each cluster, as well as these statistics for the failure rates. The COVs for the server capacity are smaller than for the failure rates.

TABLE I: Profile of the four server clusters.

Cluster id | Range  | Capacity (ops/sec): mean | stdev | COV  | Failure rate (failures/sec): mean | stdev | COV
1          | 1-25   |  67.0                    |  9.7  | 0.14 | 0.014                             | 0.008 | 0.57
2          | 26-50  |  99.6                    |  7.7  | 0.08 | 0.032                             | 0.014 | 0.43
3          | 51-75  | 135.5                    | 14.2  | 0.10 | 0.040                             | 0.018 | 0.45
4          | 76-100 | 175.8                    | 11.6  | 0.07 | 0.054                             | 0.023 | 0.42

Our simulator computes the following metrics for each server and computes averages per cluster: average response time Ri, job cancellation probability Pcancel,i, job killing probability Pkill,i, job completion probability Pcomplete,i, average useful utilization Uu,i, and average wasted utilization Uw,i. Note that the simulator learns, over time, estimates of Pcomplete,i and Pkill,i for each server.

We carried out two types of simulation experiments: (a) average experiments, which study the average values of the metrics above and their corresponding 95% confidence intervals obtained from 30 simulation runs with 3,000 job completions in each, and (b) time-progression experiments, which study the evolution of the just-described metrics over time. Results are plotted for every 50 job completions. Both types of studies show results for dispatching policies AQL, QC, RIQ, and Random for different values of the replication factor d. We used the same set of seeds (e.g., for job arrivals, job failures, number of operations per job) for each simulation run. That is, we used 30 seeds throughout our runs. Each seed is used to run simulations to cover all the combinations of the dispatching policies and the value of the replication factor d. Thus, we are comparing the policies and d values under the same sequence of jobs and under the same failures.
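The common-random-numbers setup described above could be organized as in the hypothetical sketch below: each of the 30 seeds deterministically derives separate streams for arrivals, operation counts, and failures, and the same streams are replayed for every (policy, d) combination. The run_simulation call is a placeholder, not the authors' Java simulator.

```python
import random

POLICIES = ["Random", "RIQ", "AQL", "QC"]
D_VALUES = [1, 2, 3, 4, 5, 10, 15]

def make_streams(seed):
    """Independent generators for arrivals, operation counts, and failures,
    all derived deterministically from one run seed (illustrative)."""
    return {
        "arrivals":   random.Random(seed),
        "operations": random.Random(seed + 1_000),
        "failures":   random.Random(seed + 2_000),
    }

for seed in range(30):                      # 30 seeds -> 30 runs per configuration
    for policy in POLICIES:
        for d in D_VALUES:
            streams = make_streams(seed)    # same job/failure sequence for every (policy, d)
            # run_simulation(policy, d, streams)   # placeholder for the simulator itself
```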

A. Average Simulation Results

The results shown in this section depict a variety of averages and 95% confidence intervals for the metrics described at the top of this section. We prepared three sets of graphs to show the average results after the simulation converged. For convergence, we let 3,000 jobs complete in the simulation and then collect statistics from the completion of the next 3,000 jobs. Figure 8 shows six different charts numbered 1 through 6 from left to right and then top to bottom.

The response time charts are normalized by the number of operations for each job and then averaged for all the jobs completed at the system or at a server cluster.

• Fig. 8 (1): depicts the variation of the average response time from all the jobs at the system vs. the replication factor d when the arrival rate λ = 7 jobs/sec, for the four dispatching policies. The AQL and QC dispatching policies produce better response times for smaller values of d, while the Random and RIQ policies yield better results as we increase d. The best results overall are obtained for d = 1 and policies AQL and QC, which are statistically equivalent at the 95% confidence level. It should be noted however that these policies require detailed knowledge of the operation of the servers.

• Fig. 8 (2): this figure is similar to Fig. 8 (1) but considers a job arrival rate twice as big (i.e., λ = 14 jobs/sec). These two charts are similar, but the response times are higher overall for all policies, as expected. It is interesting to note that QC is better than AQL for d > 10 because the queue prediction of QC is more accurate.

• Fig. 8 (3): depicts the variation of the average response time for jobs completed at each cluster vs. the replication factor d when λ = 14 jobs/sec and for the Random dispatching policy. Under this policy, all the servers have a chance to receive an arriving job, and when d = 1 and d = 2, servers in clusters 1 and 2 have a better chance of completing their jobs. As expected, the average response time at cluster i is better than the average response time at cluster j ∀ i > j because servers in cluster i are faster, on average, than servers in cluster j. It can also be seen that, for all four clusters, the average response time tends to improve as we increase d to 4 or 5, and then it worsens for larger values of d. Increasing d to 4 or 5 increases the chance of arriving jobs being dispatched and completed at faster servers. However, for larger values of d, the contention at servers increases and thus jobs will spend more time waiting for other jobs that most probably will be killed later.

• Fig. 8 (4): this figure is similar to Fig. 8 (3) except that it uses the RIQ policy. For smaller values of d, the results from the RIQ policy follow the same pattern as the Random policy because these policies behave similarly for small values of d. However, the RIQ policy tends to give better results for larger d values. This is because the RIQ policy does not increase contention as much as the Random policy, since it dispatches jobs to idle servers only, or to one server only in case no idle server is selected initially.

• Fig. 8 (5): this figure is also similar to Fig. 8 (3) except that it uses the AQL policy. Under this policy, an arriving job will be dispatched to the d servers with the shortest estimated response times. For d = 1, we see results for cluster 4 only, which means that all the jobs were dispatched to cluster 4 servers because no cancellation or killing will occur for d = 1. For larger values of d, we see other clusters completing some jobs. This happens because as we increase d, we increase the contention among cluster 4 servers and therefore servers in other clusters have a better chance of receiving and completing some of the jobs. All clusters experience longer response times for larger values of d. Under AQL, only 5 out of 30 simulation runs showed a response time for cluster 1 and d = 4, and 27 runs showed a response time when d = 5. In addition, only 2 out of 30 runs showed a result from cluster 2 for d = 2.

• Fig. 8 (6): this figure is also similar to Fig. 8 (3) but uses the QC dispatching policy. The results show the same pattern as the AQL results, except that we can see results for cluster 1 only for d ≥ 4. In fact, out of 30 simulation runs, only 2 showed a response time from cluster 1 when d = 4, and 23 runs showed a response time for cluster 1 when d = 5.

In general, the charts in Fig. 8 show that the Random and RIQ policies give better response time with larger values of d, while AQL and QC give better response time for smaller values of d. In fact, under the AQL and QC dispatching policies, redundancy does not improve response time. However, these policies will always use the same set of servers. Random and RIQ require no (in the case of Random) or little (in the case of RIQ) knowledge of the status of the servers. AQL and QC are more intrusive and require the dispatcher and servers to share a lot more knowledge.

The second set of charts shows Pcomplete,i and Pkill,i for all four policies, d = 1, 2, 3, 4, 5, 10, and 15, and λ = 14 requests/sec.

Figure 9 shows eight different graphs numbered 1 through 8 from left to right and then top to bottom as in the previous subsection. We describe now each of the eight graphs and the observations we derive from them.

• Fig. 9 (1): depicts the variation of Pcomplete,i vs. d when the arrival rate λ = 14 jobs/sec under the Random dispatching policy. It is expected that Pcomplete,i decreases as d increases because more servers receive replicas of the same job. However, slower servers experience a faster decrease in Pcomplete,i, while the decrease in cluster 4 is almost linear. This is because, as we increase d, the competition for job completion tends to be between cluster 4 servers. So, most of the completions will still be seen from servers at this cluster.

• Fig. 9 (2): depicts the variation of Pkill,i vs. d when the arrival rate λ = 14 jobs/sec under the Random dispatching policy. We notice that this probability initially increases with d and then starts to decrease. The value of d at which this happens is not the same for each cluster. For example, this happens for d = 5 for cluster 1, for d = 10 for cluster 2, and for d = 15 for cluster 3. The figure does not show results for d > 15, but Pkill,i shows signs of decreasing for d = 15 for cluster 4. The explanation for this behavior is that, as d increases, the Random policy will distribute jobs into more clusters in a uniform way. Therefore, more jobs tend to be killed by servers at higher capacity clusters and Pkill,i increases. However, as d goes above a threshold, the queue size increases at all servers and higher capacity servers will take longer to complete jobs and therefore will provide more chance for jobs not to be killed at other servers.

• Fig. 9 (3): this chart is similar to Fig. 9 (1) but uses the RIQ dispatching policy. The chart follows the same pattern as Fig. 9 (1), but with better overall completion probability. This happens because jobs will be dispatched to an idle server most of the time, and when a job is dispatched to one server, this server is guaranteed to complete the job.

• Fig. 9 (4): this chart is similar to Fig. 9 (2) but uses the RIQ dispatching policy. Under this policy, Pkill,i always increases with d. For clusters 1 and 2, Pkill,i becomes very high for d = 10 and d = 15. As d increases, the servers that are idle will most probably be idle due to constant killing and cancelling and not due to not receiving jobs. So, when the same set of servers is found idle and selected by RIQ, d − 1 of these will kill their job, thus increasing the overall Pkill,i.

• Fig. 9 (5): this chart is similar to Fig. 9 (1) but uses the AQL dispatching policy. The high value of Pcomplete,i for cluster 1 servers under lower d values is due to not receiving any jobs and thus not changing their initial value of Pcomplete,i, which is 1. The decrease in Pcomplete,i for clusters 3 and 4 is mostly attributed to competition within the servers at these clusters.

• Fig. 9 (6): this chart is similar to Fig. 9 (2) but uses the AQL dispatching policy. The killing probability increases then decreases with d for all clusters. This is because servers start with Pkill,i = 0, and if they do not receive any jobs, they will keep this probability. For larger values of d, faster servers tend to have longer queues, preventing them from receiving more jobs until their queues become shorter. So, many servers on these clusters will not receive any jobs for larger values of d, and thus their Pkill,i will be 0, which will lower the overall average Pkill,i from all the servers in the cluster.

• Fig. 9 (7): this chart is similar to Fig. 9 (1) but uses the QC dispatching policy and follows the same pattern as Fig. 9 (5) because the AQL and QC policies work under the same principle of estimating the queue size.

• Fig. 9 (8): this chart is similar to Fig. 9 (2) but uses the QC dispatching policy and follows the same pattern as Fig. 9 (6) for the reasons explained above.


Fig. 8: Results for Average Response Times.

The third set of results shows four graphs for the useful and wasted utilization per cluster under the different policies for d = 1, 2, 3, 4, 5, 10, and 15 and λ = 14. Figure 10 shows four different graphs (one per cluster) numbered 1 through 4 from left to right and then top to bottom as in the previous subsection. Each graph displays four stacked bars (one per policy) for each value of d. The bottom part of each stacked bar represents useful utilization and the top part represents wasted utilization. We describe now each of the four graphs and the observations we derive from them.

• Fig. 10 (1): depicts the variation of average server utilization for cluster 1 servers vs. d for various policies. Under Random and RIQ dispatching, cluster 1 servers will always be utilized; however, it is mostly a wasted utilization where all the work is wasted because of job killing. Under the AQL and QC policies, the servers will only be used for larger values of d and it is mostly a wasted utilization.

• Fig. 10 (2): depicts the variation of average server utilization for cluster 2 servers vs. d for various policies. Similarly to cluster 1, the servers will be used under the Random and RIQ policies for all values of d, but as d increases, most of the work will be wasted. Under the AQL and QC policies, the servers will be utilized for d ≤ 4. Again, most of this is wasted work.


Fig. 9: Results for Average Pcomplete,i and Pkill,i.


• Fig. 10 (3): depicts the variation of average server utilization for cluster 3 servers vs. d for various policies. Under the Random and RIQ policies, the useful utilization decreases and the wasted utilization increases as d increases. However, under the AQL and QC policies, the useful utilization is roughly the same for d ≤ 4, but the wasted utilization increases.

• Fig. 10 (4): depicts the variation of average server utilization for cluster 4 servers vs. d for various policies. The servers in cluster 4 have the best useful utilization among all clusters under all policies. This is expected as they have a better chance of completing the jobs due to their higher capacities. They even have better useful utilization under Random and RIQ dispatching for larger values of d because under these policies they compete with servers of lower capacities, with less probability of killing. Under AQL and QC, the competition will mostly be between servers of high capacity, which increases the wasted work in this cluster.

B. Time-progression Results

We show here a variety of graphs for a typical simulation run with 3,000 job completions. Results are shown for d = 10, λ = 14 requests/sec, and the AQL and QC policies, and are displayed for every 50 job completions. Because server selection under AQL and QC depends on the estimated response time, in addition to Pcomplete,i and Pkill,i, we show how these values converge with time.

Figure 11 shows four different graphs numbered 1 through 4 from left to right and then top to bottom. We describe now each graph and the observations we derive from them.

• Fig. 11 (1): depicts the variation of the response time, normalized by the number of operations per job and averaged for all jobs completed at each cluster, vs. the number of jobs completed. The results are shown for all the clusters and under the AQL dispatching policy. It can be seen that the response time converges quickly.

• Fig. 11 (2): depicts similar results as those of Fig. 11 (1) but for the QC policy. The results take slightly longer to converge because the response time estimate depends on the dispatcher obtaining a stable estimate of Pcomplete,i and Pkill,i, shown in the next two charts.

• Fig. 11 (3): depicts the variation of Pcomplete,i averaged for all jobs completed at each cluster vs. the number of jobs completed. It can be seen that Pcomplete,i for cluster 1 starts high until this cluster starts receiving jobs due to long queues in other clusters. However, most of the jobs are killed, and thus Pcomplete,i decreases. The other clusters decrease their completion probability before the completion of 50 jobs. Since these results are for d = 10, we expect significant job killing and wasted work.

• Fig. 11 (4): depicts the variation of Pkill,i averaged for all jobs completed at each cluster vs. the number of jobs completed. The changes in this chart happen in the opposite direction from the changes in Fig. 11 (3). Cluster 1 has a slow convergence, mainly due to a low dispatch rate to servers in this cluster.

IX. CONCLUSIONS

This paper investigated the advantages of submitting multiple replicas of a job to the servers of a heterogeneous cluster in order to reduce the average response time. A dispatcher receives jobs but does not know which server would be able to process the job faster. So, submitting an arriving job to multiple servers increases the chance that the job will complete faster. Once a job completes, all its replicas, including the ones being processed, are removed from the other servers. If the number of replicas is too high, the number of jobs to kill will increase and the partially completed jobs delay other jobs waiting to use that server. Therefore, there is an optimal value of the number of replicas.

We considered here the possibility of jobs failing during processing and provided an analytic expression for the average time wasted due to failures. Extensive simulation experiments were carried out to compare four dispatching policies under various numbers of replicas. Two of the policies (Random and RIQ) require the dispatcher to know no or very little information about the server operation. For these policies, the average response time decreases with the number of replicas and increases again as the number of replicas increases further. The AQL and QC policies are more intrusive and require significantly more coordination between the dispatcher and each server. These policies do not show any benefit from using replication and are the best overall, as long as one is willing to tolerate the strong coupling between the dispatcher and the servers. The explanation is that AQL and QC rely on estimating the response time at each server. If the estimate were 100% precise, the best policy would be to send a single copy of the job to the server where that job would experience the smallest response time.

In the future, we intend to investigate policies that take cost into account. Cost-aware policies are especially important in IaaS clouds since the longer it takes to process a workload, the higher the cost. In this paper, we considered that all jobs have the same distribution of the number of operations. As future work, we intend to consider the multiclass case where jobs of different classes have different distributions of the number of operations. Additionally, we intend to investigate the specific overhead of undoing actions taken by killed jobs. This is important for stateful jobs. Finally, we plan to investigate the error between the response time estimates computed by the AQL and QC policies and the actual response times.


Fig. 10: Results for Average Uu,i and Uw,i.

Fig. 11: Time-progression Results.


ACKNOWLEDGEMENTS

This work was partially supported by the AFOSR grant FA9550-16-1-0030.

REFERENCES

[1] L. A. Barroso, J. Clidaras, and U. Holzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2nd edition, Morgan & Claypool Publishers, Synthesis Lectures on Computer Architecture, 2013, ISBN: 9781627050098.

[2] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, 51 (1), pp. 107-113, 2008.

[3] K. Gardner, M. Harchol-Balter, and A. Scheller-Wolf, A Better Model for Job Redundancy: Decoupling Server Slowdown and Job Size, IEEE Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2016), London, UK, September 2016.

[4] K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, E. Hyytia, and A. Scheller-Wolf, Reducing latency via redundant requests: Exact analysis, Proc. ACM SIGMETRICS, June 2015.

[5] M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action, Cambridge University Press, 2013.

[6] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran, Codes can reduce queueing delay in data centers, 2012 IEEE Intl. Symp. Information Theory (ISIT), pp. 2766-2770, 2012.

[7] L. Kleinrock, Queueing Systems, Volume I: Theory, John Wiley & Sons, 1975.

[8] G. Koole and R. Righter, Resource allocation in grid computing, Journal of Scheduling, 11:163-173, 2009.

[9] V. J. Rayward-Smith, I. H. Osman, C. R. Reeves, and G. D. Smith, eds., Modern Heuristic Search Methods, John Wiley, Hoboken, NJ, 1996.

[10] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, Improving MapReduce performance in heterogeneous environments, In OSDI, vol. 8, no. 4, 2008.