


Scheduling Jobs across Geo-Distributed Datacenters with Max-Min Fairness

Li Chen, Shuhao Liu, Baochun Li, Fellow, IEEE, and Bo Li, Fellow, IEEE

Abstract—It has become routine for large volumes of data to be generated, stored, and processed across geographically distributed datacenters. To run a single data analytic job on such geo-distributed data, recent research proposed to distribute its tasks across datacenters, considering both data locality and network bandwidth across datacenters. Yet, it remains an open problem in the more general case, where multiple analytic jobs need to fairly share the resources at these geo-distributed datacenters. In this paper, we focus on the problem of assigning tasks belonging to multiple jobs across datacenters, with the specific objective of achieving max-min fairness across jobs sharing these datacenters, in terms of their job completion times. We formulate this problem as a lexicographical minimization problem, which is challenging to solve in practice due to its inherent multi-objective and discrete nature. To address these challenges, we iteratively solve its single-objective subproblems, which can be transformed to equivalent linear programming (LP) problems to be efficiently solved, thanks to their favorable properties. As a highlight of this paper, we have designed and implemented our proposed solution as a fair job scheduler based on Apache Spark, a modern data processing framework. With extensive evaluations of our real-world implementation on Amazon EC2 and large-scale simulations, we have shown convincing evidence that max-min fairness has been achieved and the worst job completion time has been significantly improved using our new job scheduler.

Index Terms—Geo-distributed Datacenter Networks, Wide-Area Big Data Analytics, Scheduling, Fairness


1 INTRODUCTION

It is increasingly common for large volumes of data to be generated and processed in a geographically distributed fashion, across multiple datacenters around the world. Popular data analytic frameworks, such as MapReduce [1] and Spark [2], are extensively employed to process such large volumes of data efficiently. A data analytic job typically proceeds in consecutive computation stages, each of which consists of a number of computation tasks that are executed in parallel. To start a new computation stage, intermediate data from the preceding stage needs to be fetched, which may initiate multiple network flows.

When input data is located across multiple datacenters, a naive approach is to gather all the data to be processed locally within a single datacenter. Naturally, transferring huge amounts of data across datacenters may be slow and inefficient, since bandwidth on inter-datacenter network links is limited [3]. Existing research (e.g., [4], [5]) has shown that better performance can be achieved if tasks in an analytic job can be distributed across datacenters, and located closer to the data to be processed. In this case, designing the best possible task assignment strategy to assign tasks to datacenters is

• Li Chen, Shuhao Liu and Baochun Li are with the Department of Electrical and Computer Engineering, University of Toronto, Canada. E-mail: {lchen,shuhao,bli}@ece.utoronto.ca. Bo Li is with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China. E-mail: {bli}@cse.ust.hk

• The research is supported in part by the NSERC Collaborative Research and Development Grant, by grants from RGC GRF under the contracts 16211715 and 16206417, and by a grant from RGC CRF under the contract C7036-15G.

important, since different strategies lead to different flow patterns across datacenters, and ultimately, different job completion times.

When designing optimal task assignment strategies, however, existing works in the literature [4], [5] only considered a single data analytic job. The problem of assigning tasks belonging to multiple jobs across datacenters remains open. Given the limited amount of resources at each datacenter, multiple jobs are inherently competing for resources with each other. It is, therefore, important to maintain fairness when allocating such a shared pool of resources, which cannot be achieved if tasks from one job are assigned without considering the other jobs.

In this paper, we propose a new task assignment strategy that is designed to achieve max-min fairness across multiple jobs with respect to their performance, as they compete for the limited pool of shared resources across multiple geo-distributed datacenters. To be more specific, we wish to minimize the job completion times across all concurrent jobs, while maintaining max-min fairness. Such a problem can be formally formulated as a lexicographical minimization problem, which has unique challenges that make it difficult to solve with its multiple objectives. The task assignment problem is essentially an integer optimization problem, which in general is NP-hard [6].

To address these challenges, we first consider the subproblem of minimizing the worst (longest) job completion time among all the concurrent jobs, which turns out to have a totally unimodular coefficient matrix for its linear constraints, based on an in-depth investigation of the problem structure. Such a nice property guarantees that the extreme points of the feasible solution polyhedron are integral. Moreover, with several steps of non-trivial transformations, we show that the optimal solution to the original problem can be obtained by solving an equivalent problem with a separable convex objective. With these structures identified, we can then apply the λ-technique and linear relaxation to obtain a linear programming (LP) problem, which is guaranteed to have the same solution as the original problem. As a result, any LP solver can be used for minimizing the completion time of each job, and for efficiently computing the overall assignment decisions that achieve the optimal completion times with max-min fairness.

To demonstrate the practicality of our proposed solution, we have designed and implemented a new job scheduler to assign tasks from multiple jobs to geo-distributed datacenters, in the context of Apache Spark. Our experimental results on multiple Amazon EC2 datacenters have shown that our new scheduler is effective in optimizing job completion times and achieving max-min fairness.

Highlights of our original contributions are as follows. First, as motivated by our example in Sec. 2, we focus on jointly assigning tasks from multiple data parallel jobs across multiple datacenters, which considers the interplay between these jobs when sharing a limited pool of datacenter resources. Second, our problem, formulated as a lexicographical minimization problem in Sec. 3, has both a discrete and a multi-objective nature, which makes it challenging to solve. Fortunately, with a careful investigation of its structure, we are able to identify the favorable properties of totally unimodular constraints and a separable convex objective, thus transforming it into an equivalent LP problem that can be efficiently and elegantly solved (Sec. 4 and Sec. 5). Finally, to show its effectiveness and practicality, we have completed both large-scale simulations and a real-world implementation of our proposed solution within the Spark job scheduling framework, and conducted extensive evaluations across multiple datacenters in Amazon EC2 (Sec. 6).

2 BACKGROUND AND MOTIVATION

In this section, we first present an overview of the execution of a data analytic job whose input data is stored across geographically distributed datacenters. We then consider the sharing of resources in these datacenters across multiple concurrent jobs, and provide a motivating example to illustrate the need for fair job scheduling.

2.1 Data Analytic Jobs in the Wide Area

It is typical for a data analytic job to contain tens or hundreds of tasks, supported by a data parallel framework, such as MapReduce and Spark. These tasks are parallel to or dependent upon each other, and network flows are generated between dependent tasks, since tasks in the subsequent stage need to fetch intermediate data from

tasks in the current stage. For example, in a MapReduce job, a set of map tasks are first launched to read input data partitions and generate intermediate results; then reduce tasks would fetch such intermediate data from map tasks for further processing, which involves transferring data over the network.

In the case of running tasks in a data analytic job across multiple datacenters, data may be transferred over inter-datacenter links, which may become bottlenecks due to their limited bandwidth availability. Our design objective in this paper is to compute the best way to assign tasks belonging to multiple jobs to geo-distributed datacenters, so that all jobs can achieve their best possible performance with respect to their completion times, without harming the performance of others. This implies that max-min fairness needs to be achieved across jobs sharing the datacenters, in terms of their job completion times.

For a better intuition of our problem, we use Fig. 1 to show an example with two data analytic jobs sharing three geo-distributed datacenters. For job A, both of its tasks, tA1 and tA2, require 100 MB of data from input dataset A1 stored in DC1, and 200 MB of data from A2 located at DC3. For job B, the amounts of data to be read by task tB1 from dataset B1 in DC2 and from B2 in DC3 are both 200 MB, while task tB2 needs to read 200 MB of data from B1 and 300 MB from B2. These tasks are to be assigned to the available computing slots in the three datacenters, which have two, two, and one slots, respectively. The available bandwidth of each inter-datacenter link is illustrated in the figure, in units of MB/s.

For each job, a different assignment of its tasks will lead to flows of different sizes traversing different links, thus resulting in different job completion times. Moreover, both jobs must share and compete for the same pool of computing resources across these datacenters. For example, since DC3 only has one available computing slot, if we assign a task from one job to it, tasks from the other job cannot be assigned there. In order to achieve the best possible performance for both jobs, we need to consider the placement of their tasks jointly, rather than independently.

2.2 An Intuition on Task Assignment

We have illustrated two ways of assigning tasks to datacenters in Fig. 2 and Fig. 3. Intuitively, DC3 is a favorable location for tasks from both jobs, since they all have part of their input data stored in this datacenter, and the links DC1-DC3 and DC2-DC3 both have high bandwidth. For simplicity, we assume that all the tasks are identical, with the same execution time, thus the job completion time is determined by the network transfer time. If the scheduler tries to optimize the task assignment of these jobs independently, the result is shown in Fig. 2. To optimize the assignment of job A, task tA2 would be assigned to the only available computing slot in DC3, and


Fig. 1: An example of scheduling multiple jobs fairly across geo-distributed datacenters. (Two jobs, A with tasks tA1 and tA2, and B with tasks tB1 and tB2, read input datasets A1, A2, B1 and B2 stored across DC1, DC2 and DC3; the available computing slots and inter-datacenter bandwidths are shown.)

Fig. 2: A task assignment that favors the performance of job A at the cost of job B.

Fig. 3: The optimal assignment for both job A and job B.

tA1 would be placed in DC2, which results in a network transfer time of max{100/80, 200/160, 100/150} = 1.25 seconds. Then, if the scheduler continues to optimize the assignment of job B, DC1 and DC2 would be selected for tasks tB1 and tB2, respectively, resulting in a transfer time of max{200/80, 200/100, 300/160} = 2.5 seconds for job B.

However, this placement is not optimal when considering the performance of these jobs jointly. Instead, we show the optimal assignment for both jobs satisfying max-min fairness in Fig. 3. With this assignment, task tB2 of job B would occupy the computing slot in DC3, which avoids the transfer of 300 MB of data from dataset B2. Task tB1 is assigned to DC2 rather than DC1, which takes advantage of the high bandwidth of the DC3-DC2 link. The resulting flow patterns are illustrated in the figure. In this assignment, the network transfer times of jobs A and B are max{200/100, 100/80, 200/160} = 2 seconds and max{200/160, 200/120} = 5/3 seconds, respectively. Compared with the independent assignment in Fig. 2, where the worst performance is 2.5 seconds, this assignment results in a worst network transfer time of 2 seconds (job A), which is optimal if we wish to minimize the worst job completion time, and is fair in terms of the performance achieved by both jobs.

We are now ready to formally construct a mathematical model to study the problem of optimizing task assignment such that max-min fairness is achieved across multiple jobs.

3 MODEL AND FORMULATION

We consider a set of data parallel jobs $\mathcal{K} = \{1, 2, \cdots, K\}$ submitted to the scheduler for task assignment. The input data of these jobs is distributed across a set of geo-distributed datacenters, represented by $\mathcal{D} = \{1, 2, \cdots, J\}$. Each job $k \in \mathcal{K}$ has a set of parallel tasks $\mathcal{T}_k = \{1, 2, \cdots, n_k\}$ to be launched on available computing slots in these datacenters. We use $a_j$ to denote the capacity of available computing slots in datacenter $j \in \mathcal{D}$.

For each task $i \in \mathcal{T}_k$ of job $k$, the time it takes to complete consists of both the network transfer time, denoted by $c^k_{i,j}$, to fetch the input data if the task is assigned to datacenter $j$, and the execution time, represented by $e^k_{i,j}$. The network transfer time is determined by both the amount of data to be read and the bandwidth on the links the data traverses. Let $S^k_i$ denote the set of datacenters where the input data of task $i$ from job $k$ is stored, called the source datacenters of this task for convenience. The task needs to read input data from each of its source datacenters $s \in S^k_i$, the amount of which is represented by $d^{k,s}_i$. Let $b_{s,j}$ represent the bandwidth of the link from datacenter $s$ to datacenter $j$ ($s \neq j$).¹ Hence, the network transfer time of task $i \in \mathcal{T}_k$, if assigned to datacenter $j$, is expressed as follows:

$$c^k_{i,j} = \begin{cases} 0, & \text{when } S^k_i = \{j\}; \\ \max_{s \in S^k_i, s \neq j} d^{k,s}_i / b_{s,j}, & \text{otherwise.} \end{cases} \qquad (1)$$

This indicates that when all the input data of task $i$ is stored in the same datacenter $j$, i.e., $S^k_i = \{j\}$, no traffic is generated across datacenters, and thus the network transfer time is 0. Otherwise, task $i$ fetches data from each of its remote source datacenters $s \in S^k_i, s \neq j$, with a completion time of $d^{k,s}_i / b_{s,j}$. As the network transfer completes only when input data from all the source datacenters has been fetched, the transfer time $c^k_{i,j}$ is the maximum of $d^{k,s}_i / b_{s,j}$ over all the remote source datacenters.
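As a concrete illustration, the following Python sketch implements Eq. (1) and reproduces the transfer times discussed in Sec. 2.2. It is not part of the paper's implementation; the directional bandwidth values are inferred from the Fig. 1 example and its arithmetic, and execution times are ignored.

```python
# A minimal sketch of Eq. (1) on the two-job example of Fig. 1
# (sizes in MB, bandwidths in MB/s, keyed by (source DC, destination DC)).
bandwidth = {
    ("DC1", "DC2"): 80,  ("DC1", "DC3"): 150,
    ("DC2", "DC1"): 80,  ("DC2", "DC3"): 120,
    ("DC3", "DC1"): 100, ("DC3", "DC2"): 160,
}

def transfer_time(input_data, dst):
    """input_data maps each source datacenter to the MB the task reads from it."""
    remote = {s: size for s, size in input_data.items() if s != dst}
    if not remote:            # all input stored in the destination datacenter
        return 0.0
    return max(size / bandwidth[(s, dst)] for s, size in remote.items())

# Fig. 2 (job-A-first assignment): tA1 -> DC2, tA2 -> DC3, tB1 -> DC1, tB2 -> DC2
job_A = max(transfer_time({"DC1": 100, "DC3": 200}, "DC2"),
            transfer_time({"DC1": 100, "DC3": 200}, "DC3"))      # 1.25 s
job_B = max(transfer_time({"DC2": 200, "DC3": 200}, "DC1"),
            transfer_time({"DC2": 200, "DC3": 300}, "DC2"))      # 2.5 s

# Fig. 3 (max-min fair assignment): tA1 -> DC1, tA2 -> DC2, tB1 -> DC2, tB2 -> DC3
job_A_fair = max(transfer_time({"DC1": 100, "DC3": 200}, "DC1"),
                 transfer_time({"DC1": 100, "DC3": 200}, "DC2"))  # 2.0 s
job_B_fair = max(transfer_time({"DC2": 200, "DC3": 200}, "DC2"),
                 transfer_time({"DC2": 200, "DC3": 300}, "DC3"))  # about 1.67 s
print(job_A, job_B, job_A_fair, job_B_fair)
```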

The assignment of a task is represented with a binary variable $x^k_{i,j}$, indicating whether the $i$-th task of job $k$ is assigned to datacenter $j$. A job $k$ completes when its slowest task finishes; thus the job completion time of $k$, represented by $\tau_k$, is determined by the maximum completion time among all of its tasks, expressed as follows:

$$\tau_k = \max_{i \in \mathcal{T}_k, j \in \mathcal{D}} x^k_{i,j} (c^k_{i,j} + e^k_{i,j}) \qquad (2)$$

As the computing slots in all the datacenters are shared by tasks from multiple jobs, we would like to obtain an optimal task assignment without exceeding the resource capacities. To be more specific, our scheduler decides the assignment of all the tasks, aiming to optimize the worst performance achieved among all the jobs with respect to their job completion times, and then to optimize the next worst performance without impacting the previous one, and so on. This is repeated until the completion times of all the jobs have been optimized. Such an objective can be rigorously formulated as a lexicographical minimization problem, with the following definitions as its basis.

1. On popular cloud platforms (e.g., Amazon EC2 and Google Cloud), inter-datacenter wide-area networks are provided as a shared service, where user-generated flows compete with millions of other flows. As a result, each inter-datacenter TCP flow gets a fair share of the link capacity. Our measurements with iperf3 on EC2 verify this assumption.

Definition 1: Let $\langle\boldsymbol{v}\rangle_k$ denote the $k$-th ($1 \le k \le K$) largest element of $\boldsymbol{v} \in \mathbb{Z}^K$, implying $\langle\boldsymbol{v}\rangle_1 \ge \langle\boldsymbol{v}\rangle_2 \ge \cdots \ge \langle\boldsymbol{v}\rangle_K$. Intuitively, $\langle\boldsymbol{v}\rangle = (\langle\boldsymbol{v}\rangle_1, \langle\boldsymbol{v}\rangle_2, \cdots, \langle\boldsymbol{v}\rangle_K)$ represents the non-increasingly sorted version of $\boldsymbol{v}$.

Definition 2: For any $\boldsymbol{\alpha} \in \mathbb{Z}^K$ and $\boldsymbol{\beta} \in \mathbb{Z}^K$, if $\langle\boldsymbol{\alpha}\rangle_1 < \langle\boldsymbol{\beta}\rangle_1$, or $\exists k \in \{2, 3, \cdots, K\}$ such that $\langle\boldsymbol{\alpha}\rangle_k < \langle\boldsymbol{\beta}\rangle_k$ and $\langle\boldsymbol{\alpha}\rangle_i = \langle\boldsymbol{\beta}\rangle_i, \forall i \in \{1, \cdots, k-1\}$, then $\boldsymbol{\alpha}$ is lexicographically smaller than $\boldsymbol{\beta}$, represented as $\boldsymbol{\alpha} \prec \boldsymbol{\beta}$. Similarly, if $\langle\boldsymbol{\alpha}\rangle_k = \langle\boldsymbol{\beta}\rangle_k, \forall k \in \{1, 2, \cdots, K\}$, or $\boldsymbol{\alpha} \prec \boldsymbol{\beta}$, then $\boldsymbol{\alpha}$ is lexicographically no greater than $\boldsymbol{\beta}$, represented as $\boldsymbol{\alpha} \preceq \boldsymbol{\beta}$.

Definition 3: $\operatorname{lexmin}_{\boldsymbol{x}} \boldsymbol{f}(\boldsymbol{x})$ represents the lexicographical minimization of the vector $\boldsymbol{f} \in \mathbb{R}^N$, which consists of $N$ objective functions of $\boldsymbol{x}$. In particular, the optimal solution $\boldsymbol{x}^* \in \mathbb{R}^K$ achieves the optimal $\boldsymbol{f}^*$, in the sense that $\boldsymbol{f}^* = \boldsymbol{f}(\boldsymbol{x}^*) \preceq \boldsymbol{f}(\boldsymbol{x}), \forall \boldsymbol{x} \in \mathbb{R}^K$.
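The following small Python snippet, an illustration rather than part of the paper, makes Definitions 1 and 2 concrete: two vectors are compared element-wise after each is sorted in non-increasing order. The sample values are the task completion time vectors used later in the Fig. 4 example.

```python
def lex_key(v):
    return sorted(v, reverse=True)          # <v>_1 >= <v>_2 >= ... >= <v>_K

def lex_smaller(alpha, beta):
    """True iff alpha is lexicographically smaller than beta (Definition 2)."""
    return lex_key(alpha) < lex_key(beta)   # Python compares lists element-wise

def lex_no_greater(alpha, beta):
    return lex_key(alpha) <= lex_key(beta)

# (4, 3, 2.5, 2) is lexicographically smaller than (4, 3.5, 2, 2):
# the largest elements tie at 4, then 3 < 3.5 decides.
print(lex_smaller([4, 3, 2.5, 2], [4, 3.5, 2, 2]))   # True
```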

With these definitions, we are now ready to formulate our optimal task assignment problem among sharing jobs as follows:

$$\operatorname{lexmin}_{\boldsymbol{x}} \ \boldsymbol{f} = (\tau_1, \tau_2, \cdots, \tau_K) \qquad (3)$$
$$\text{s.t.} \quad \tau_k = \max_{i \in \mathcal{T}_k, j \in \mathcal{D}} x^k_{i,j}(c^k_{i,j} + e^k_{i,j}), \quad \forall k \in \mathcal{K}, \qquad (4)$$
$$\sum_{k \in \mathcal{K}} \sum_{i \in \mathcal{T}_k} x^k_{i,j} \le a_j, \quad \forall j \in \mathcal{D}, \qquad (5)$$
$$\sum_{j \in \mathcal{D}} x^k_{i,j} = 1, \quad \forall i \in \mathcal{T}_k, \forall k \in \mathcal{K}, \qquad (6)$$
$$x^k_{i,j} \in \{0, 1\}, \quad \forall i \in \mathcal{T}_k, \forall j \in \mathcal{D}, \forall k \in \mathcal{K}, \qquad (7)$$

where constraint (4) represents the completion time of each job $k$ as aforementioned. Constraint (5) indicates that the total number of tasks assigned to datacenter $j$ does not exceed its capacity $a_j$, which is the total number of its available computing slots. Constraint (6) implies that each task should be assigned to a single datacenter.

The objective is a vector $\boldsymbol{f} \in \mathbb{R}^K$ with $K$ elements, each standing for the completion time of a particular job $k \in \mathcal{K}$. According to the previous definitions, the optimal $\boldsymbol{f}^*$ is lexicographically no greater than any $\boldsymbol{f}$ obtained with a feasible assignment, which means that when sorting them in non-increasing order, if their elements satisfy $\langle\boldsymbol{f}^*\rangle_{k'} = \langle\boldsymbol{f}\rangle_{k'}, \forall k' < k$ and $\langle\boldsymbol{f}^*\rangle_k \neq \langle\boldsymbol{f}\rangle_k$, then we have $\langle\boldsymbol{f}^*\rangle_k < \langle\boldsymbol{f}\rangle_k$. This implies that the largest element of $\boldsymbol{f}^*$, i.e., the slowest completion time, is the minimum among all $\boldsymbol{f}$. Then, among all $\boldsymbol{f}$ with the same worst completion time, the second worst completion time in $\boldsymbol{f}^*$ is the minimum, and so on. In this way, solving this problem results in an optimal assignment vector $\boldsymbol{x}^*$, with which all the job completion times are minimized.
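For a small instance, Problem (3) can be checked by brute force. The sketch below (an illustration only, with execution times taken as zero) enumerates all capacity-feasible assignments for the Fig. 1 example, using the transfer times obtained from Eq. (1), and keeps the lexicographic minimum of the sorted completion-time vector.

```python
# Brute-force lexicographic minimization of Problem (3) for the Fig. 1 example.
from itertools import product

c = [
    [2.0, 1.25, 100/150],    # tA1: transfer time if placed in DC1, DC2, DC3
    [2.0, 1.25, 100/150],    # tA2
    [2.5, 1.25, 200/120],    # tB1
    [3.0, 300/160, 200/120], # tB2
]
jobs = [0, 0, 1, 1]          # task -> job
slots = [2, 2, 1]            # capacities a_j

best_key, best = None, None
for assign in product(range(3), repeat=4):              # one datacenter per task
    if any(assign.count(j) > slots[j] for j in range(3)):
        continue                                         # violates constraint (5)
    tau = [max(c[t][assign[t]] for t in range(4) if jobs[t] == k) for k in (0, 1)]
    key = sorted(tau, reverse=True)                      # <f>_1 >= <f>_2
    if best_key is None or key < best_key:
        best_key, best = key, (assign, tau)
print(best)   # job completion times of 2 s and 5/3 s, as in Fig. 3
```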

4 OPTIMIZING THE WORST COMPLETION TIME AMONG CONCURRENT JOBS

Problem (3) is a vector optimization with multiple objectives. In this section, we consider the single-objective subproblem of optimizing the worst job performance as follows:

$$\min_{\boldsymbol{x}} \ \max_{k \in \mathcal{K}} \ \tau_k \qquad (8)$$
$$\text{s.t. Constraints (4), (5), (6) and (7),}$$

which is a primary step towards solving the original problem, to be elaborated in the next section.

Substituting the completion time $\tau_k$ in the objective with the expression in constraint (4), we obtain the following problem with the non-linear constraint (4) eliminated:

$$\min_{\boldsymbol{x}} \ \max_{k \in \mathcal{K}} \left( \max_{i \in \mathcal{T}_k, j \in \mathcal{D}} x^k_{i,j}(c^k_{i,j} + e^k_{i,j}) \right) \qquad (9)$$
$$\text{s.t. Constraints (5), (6) and (7).}$$

Though this problem is an integer programming problem, we will show that it can be transformed into an equivalent linear programming (LP) problem after an in-depth investigation of its structure. As a result of such a transformation, it can be solved efficiently to obtain the optimal schedule vector $\boldsymbol{x}$. Our transformation takes advantage of its features of a separable convex objective and totally unimodular linear constraints, and it involves three major steps, elaborated in the following subsections.

4.1 Separable Convex Objective

In the first step, we will show that the optimal solution to Problem (9) can be obtained by solving a problem with a separable convex objective function, which is represented as a summation of convex functions, each with respect to a single variable $x^k_{i,j}$.

We first show that the optimal solution of Problem (9) can be obtained by solving the following problem:

$$\operatorname{lexmin}_{\boldsymbol{x}} \ \boldsymbol{g} = (\phi(x^1_{1,1}), \cdots, \phi(x^k_{i,j}), \cdots, \phi(x^K_{n_K,J}))$$
$$\text{s.t. Constraints (5), (6) and (7),}$$

where $\phi(x^k_{i,j}) = x^k_{i,j}(c^k_{i,j} + e^k_{i,j}), \forall i \in \mathcal{T}_k, \forall j \in \mathcal{D}, \forall k \in \mathcal{K}$, and $\boldsymbol{g}$ is a vector of dimension $M = |\boldsymbol{g}| = J \sum_{k=1}^{K} n_k$. For this problem, the objective includes minimizing the maximum element in $\boldsymbol{g}$, which is the worst completion time across all the jobs. Therefore, the optimal assignment variables $\boldsymbol{x}^*$ that give $\boldsymbol{g}^*$ are also an optimal solution for Problem (9).

Let $\varphi(\boldsymbol{g})$ define a function of $\boldsymbol{g}$:

$$\varphi(\boldsymbol{g}) = \sum_{m=1}^{|\boldsymbol{g}|} |\boldsymbol{g}|^{g_m} = \sum_{m=1}^{M} M^{g_m},$$

where $g_m$ is the $m$-th element of $\boldsymbol{g}$.

Lemma 1: $\varphi(\cdot)$ preserves the order of lexicographically no greater ($\preceq$), i.e., $\boldsymbol{g}(\boldsymbol{x}^*) \preceq \boldsymbol{g}(\boldsymbol{x}) \iff \varphi(\boldsymbol{g}(\boldsymbol{x}^*)) \le \varphi(\boldsymbol{g}(\boldsymbol{x}))$.


Proof: We first consider $\boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{Z}^K$ that satisfy $\boldsymbol{\alpha} \prec \boldsymbol{\beta}$. If we use the integer $\bar{k}$ ($1 \le \bar{k} \le K$) to denote the position of the first non-zero element of $\langle\boldsymbol{\alpha}\rangle - \langle\boldsymbol{\beta}\rangle$, we have $\langle\boldsymbol{\alpha}\rangle_k = \langle\boldsymbol{\beta}\rangle_k, \forall 1 \le k < \bar{k}$ and $\langle\boldsymbol{\alpha}\rangle_{\bar{k}} < \langle\boldsymbol{\beta}\rangle_{\bar{k}}$. Assume $\langle\boldsymbol{\alpha}\rangle_{\bar{k}} = m$; then $\langle\boldsymbol{\beta}\rangle_{\bar{k}} \ge m + 1$.

$$\varphi(\boldsymbol{\alpha}) = \sum_{k=1}^{K} K^{\langle\boldsymbol{\alpha}\rangle_k} = \sum_{k=1}^{\bar{k}-1} K^{\langle\boldsymbol{\alpha}\rangle_k} + K^{\langle\boldsymbol{\alpha}\rangle_{\bar{k}}} + \sum_{k=\bar{k}+1}^{K} K^{\langle\boldsymbol{\alpha}\rangle_k}$$
$$\le \sum_{k=1}^{\bar{k}-1} K^{\langle\boldsymbol{\alpha}\rangle_k} + K^{\langle\boldsymbol{\alpha}\rangle_{\bar{k}}} + (K - \bar{k}) K^{\langle\boldsymbol{\alpha}\rangle_{\bar{k}}} = \sum_{k=1}^{\bar{k}-1} K^{\langle\boldsymbol{\alpha}\rangle_k} + (K + 1 - \bar{k}) \cdot K^{\langle\boldsymbol{\alpha}\rangle_{\bar{k}}} < \sum_{k=1}^{\bar{k}-1} K^{\langle\boldsymbol{\alpha}\rangle_k} + K \cdot K^m,$$

where the first inequality holds since $\langle\boldsymbol{\alpha}\rangle_{\bar{k}} \ge \langle\boldsymbol{\alpha}\rangle_k, \forall \bar{k}+1 \le k \le K$.

$$\varphi(\boldsymbol{\beta}) = \sum_{k=1}^{K} K^{\langle\boldsymbol{\beta}\rangle_k} = \sum_{k=1}^{\bar{k}-1} K^{\langle\boldsymbol{\beta}\rangle_k} + K^{\langle\boldsymbol{\beta}\rangle_{\bar{k}}} + \sum_{k=\bar{k}+1}^{K} K^{\langle\boldsymbol{\beta}\rangle_k}$$
$$> \sum_{k=1}^{\bar{k}-1} K^{\langle\boldsymbol{\beta}\rangle_k} + K^{\langle\boldsymbol{\beta}\rangle_{\bar{k}}} + (K - \bar{k}) \cdot 0 \ge \sum_{k=1}^{\bar{k}-1} K^{\langle\boldsymbol{\beta}\rangle_k} + K \cdot K^m.$$

Given that $\sum_{k=1}^{\bar{k}-1} K^{\langle\boldsymbol{\alpha}\rangle_k} = \sum_{k=1}^{\bar{k}-1} K^{\langle\boldsymbol{\beta}\rangle_k}$, we have proved that $\varphi(\boldsymbol{\alpha}) < \varphi(\boldsymbol{\beta})$.

If $\langle\boldsymbol{\alpha}\rangle = \langle\boldsymbol{\beta}\rangle$, which means that $\langle\boldsymbol{\alpha}\rangle_k = \langle\boldsymbol{\beta}\rangle_k, \forall 1 \le k \le K$, it is trivially true that $\varphi(\boldsymbol{\alpha}) = \sum_{k=1}^{K} K^{\langle\boldsymbol{\alpha}\rangle_k} = \sum_{k=1}^{K} K^{\langle\boldsymbol{\beta}\rangle_k} = \varphi(\boldsymbol{\beta})$. Thus, we have proved $\boldsymbol{\alpha} \preceq \boldsymbol{\beta} \implies \varphi(\boldsymbol{\alpha}) \le \varphi(\boldsymbol{\beta})$.

We further prove $\varphi(\boldsymbol{\alpha}) \le \varphi(\boldsymbol{\beta}) \implies \boldsymbol{\alpha} \preceq \boldsymbol{\beta}$ by proving its contrapositive: $\neg(\boldsymbol{\alpha} \preceq \boldsymbol{\beta}) \implies \varphi(\boldsymbol{\alpha}) > \varphi(\boldsymbol{\beta})$. Here, $\neg(\boldsymbol{\alpha} \preceq \boldsymbol{\beta})$ implies $\langle\boldsymbol{\alpha}\rangle \neq \langle\boldsymbol{\beta}\rangle$ and that the first non-zero element of $\langle\boldsymbol{\alpha}\rangle - \langle\boldsymbol{\beta}\rangle$ is positive, which further indicates that the first non-zero element of $\langle\boldsymbol{\beta}\rangle - \langle\boldsymbol{\alpha}\rangle$ is negative, i.e., $\boldsymbol{\beta} \prec \boldsymbol{\alpha}$. Thus, the contrapositive is equivalent to $\boldsymbol{\beta} \prec \boldsymbol{\alpha} \implies \varphi(\boldsymbol{\beta}) < \varphi(\boldsymbol{\alpha})$, which has already been proved above with the notations of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ exchanged.

With $\boldsymbol{\alpha} \preceq \boldsymbol{\beta} \iff \varphi(\boldsymbol{\alpha}) \le \varphi(\boldsymbol{\beta})$ holding for any $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ of the same dimension, we complete the proof by letting $\boldsymbol{\alpha} = \boldsymbol{g}(\boldsymbol{x}^*)$ and $\boldsymbol{\beta} = \boldsymbol{g}(\boldsymbol{x})$.

Based on Lemma 1, we have

$$\operatorname{lexmin}_{\boldsymbol{x}} \ \boldsymbol{g} \iff \min_{\boldsymbol{x}} \ \varphi(\boldsymbol{g}) = \sum_{k \in \mathcal{K}} \sum_{i \in \mathcal{T}_k} \sum_{j \in \mathcal{D}} M^{\phi(x^k_{i,j})},$$

where the objective function $\varphi(\boldsymbol{g})$ is a summation of the terms $M^{\phi(x^k_{i,j})}$, each of which is a convex function of the single variable $x^k_{i,j}$.

Therefore, solving Problem (9) is equivalent to solving the following problem with a separable convex objective:

$$\min_{\boldsymbol{x}} \ \sum_{k \in \mathcal{K}} \sum_{i \in \mathcal{T}_k} \sum_{j \in \mathcal{D}} M^{\phi(x^k_{i,j})} \qquad (10)$$
$$\text{s.t. Constraints (5), (6) and (7).}$$
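As a sanity check, the snippet below (an illustration over small integer-valued vectors, in line with the $\mathbb{Z}^K$ setting of the definitions; it is not part of the paper) verifies Lemma 1 exhaustively on a small domain: ordering vectors by $\varphi$ coincides with the lexicographic order on their non-increasingly sorted versions.

```python
# Exhaustive check of Lemma 1 on length-3 integer vectors with entries in 0..4.
from itertools import product

def phi(g):
    M = len(g)                               # base M = |g|
    return sum(M ** gm for gm in g)

def lex_no_greater(a, b):
    return sorted(a, reverse=True) <= sorted(b, reverse=True)

vectors = list(product(range(5), repeat=3))
assert all(lex_no_greater(a, b) == (phi(a) <= phi(b))
           for a in vectors for b in vectors)
print("phi preserves the lexicographic order on this small domain")
```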

4.2 Totally Unimodular Linear Constraints

In the second step, we investigate the coefficient matrix of the linear constraints (5) and (6). An $m \times n$ matrix is totally unimodular [6] if it satisfies two conditions: 1) every element belongs to $\{-1, 0, 1\}$; 2) any row subset $R \subset \{1, 2, \cdots, m\}$ can be divided into two disjoint sets, $R_1$ and $R_2$, such that $|\sum_{i \in R_1} a_{ij} - \sum_{i \in R_2} a_{ij}| \le 1, \forall j \in \{1, 2, \cdots, n\}$.

Lemma 2: The coefficients of constraints (5) and (6) form a totally unimodular matrix.

Proof: Let $A_{m \times n}$ denote the coefficient matrix of all the linear constraints (5) and (6), where $m = J + \sum_{k=1}^{K} n_k$ is the total number of constraints, and $n = J \sum_{k=1}^{K} n_k$ is the dimension of the variable $\boldsymbol{x}$.

It is obvious that any element of $A_{m \times n}$ is either 0 or 1, satisfying the first condition. For any row subset $R \subset \{1, 2, \cdots, m\}$, we can select all the elements of $R$ that belong to $\{1, 2, \cdots, J\}$ to form the set $R_1$. As such, $R$ is divided into two disjoint sets, $R_1$ and $R_2 = R - R_1$. It is easy to check that for the coefficient matrix of constraint (5), the summation of all its rows, i.e., rows $\{1, 2, \cdots, J\}$, is a $1 \times n$ vector with all elements equal to 1. Similarly, for the coefficient matrix of constraint (6), the summation of all its rows, i.e., rows $\{J+1, J+2, \cdots, J + \sum_{k=1}^{K} n_k\}$, is also a $1 \times n$ vector whose elements are all 1. Hence, we can easily derive that $\sum_{i \in R_1} a_{ij} \le 1$ and $\sum_{i \in R_2} a_{ij} \le 1, \forall j \in \{1, 2, \cdots, n\}$. Eventually, we have $|\sum_{i \in R_1} a_{ij} - \sum_{i \in R_2} a_{ij}| \le 1, \forall j \in \{1, 2, \cdots, n\}$, and the second condition is satisfied.

In summary, both conditions for total unimodularity are satisfied, thus the coefficient matrix $A_{m \times n}$ is totally unimodular.
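To make the structure used in this proof concrete, the sketch below (with assumed toy sizes, not the paper's code) builds the coefficient matrix $A$ of constraints (5) and (6) for a tiny instance: every element is 0 or 1, and each column has exactly one 1 among the capacity rows and one 1 among the per-task rows, which is the partition exploited above.

```python
# Build A for an assumed toy instance: J = 2 datacenters, jobs with 2 and 1 tasks.
import numpy as np

J = 2
tasks = [(k, i) for k, n_k in enumerate([2, 1]) for i in range(n_k)]
n = J * len(tasks)                     # one variable x^k_{i,j} per (task, datacenter)

def col(task_idx, j):
    return task_idx * J + j            # column index of x^k_{i,j}

A = np.zeros((J + len(tasks), n), dtype=int)
for j in range(J):                     # capacity rows, one per datacenter: constraint (5)
    for t in range(len(tasks)):
        A[j, col(t, j)] = 1
for t in range(len(tasks)):            # assignment rows, one per task: constraint (6)
    for j in range(J):
        A[J + t, col(t, j)] = 1

print(A)
# Each column sums to 2: one 1 from the capacity rows (R1) and one from the task rows (R2).
assert A.sum(axis=0).tolist() == [2] * n
```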

4.3 Structure-Inspired Equivalent LP Transformation

In the final step, exploiting the problem structure of totally unimodular constraints and a separable convex objective, we can use the λ-representation technique [7] to transform Problem (10) into a linear programming problem that has the same optimal solution.

For a single integer variable $y \in \mathcal{Y} = \{0, 1, \cdots, Y\}$, a convex function $h: \mathcal{Y} \to \mathbb{R}$ can be linearized with the λ-representation as follows:

$$h(y) = \sum_{s \in \mathcal{Y}} h(s) \lambda_s, \quad y = \sum_{s \in \mathcal{Y}} s \lambda_s,$$
$$\sum_{s \in \mathcal{Y}} \lambda_s = 1, \quad \lambda_s \in \mathbb{R}_+, \ \forall s \in \mathcal{Y}.$$

In our problem, we apply the λ-representation technique to each convex function $h^k_{i,j}(x^k_{i,j}) = M^{\phi(x^k_{i,j})}: \{0, 1\} \to \mathbb{R}$ as follows:

$$h^k_{i,j}(x^k_{i,j}) = \sum_{s \in \{0,1\}} M^{s(c^k_{i,j}+e^k_{i,j})} \lambda^{k,s}_{i,j} = \lambda^{k,0}_{i,j} + M^{c^k_{i,j}+e^k_{i,j}} \lambda^{k,1}_{i,j},$$

which removes the variable $x^k_{i,j}$ by sampling at each of its possible values $s \in \{0, 1\}$, weighted by the newly introduced variables $\lambda^{k,s}_{i,j} \in \mathbb{R}_+, \forall s \in \{0, 1\}$ that satisfy

$$x^k_{i,j} = \sum_{s \in \{0,1\}} s \lambda^{k,s}_{i,j} = \lambda^{k,1}_{i,j},$$
$$\sum_{s \in \{0,1\}} \lambda^{k,s}_{i,j} = \lambda^{k,0}_{i,j} + \lambda^{k,1}_{i,j} = 1.$$

Further, with linear relaxation of the integer constraints (7), we obtain the following linear programming problem:

$$\min_{\boldsymbol{x},\boldsymbol{\lambda}} \ \sum_{k \in \mathcal{K}} \sum_{i \in \mathcal{T}_k} \sum_{j \in \mathcal{D}} \left( \lambda^{k,0}_{i,j} + M^{c^k_{i,j}+e^k_{i,j}} \lambda^{k,1}_{i,j} \right) \qquad (11)$$
$$\text{s.t.} \quad x^k_{i,j} = \lambda^{k,1}_{i,j}, \quad \forall k \in \mathcal{K}, i \in \mathcal{T}_k, j \in \mathcal{D},$$
$$\lambda^{k,0}_{i,j} + \lambda^{k,1}_{i,j} = 1, \quad \forall k \in \mathcal{K}, i \in \mathcal{T}_k, j \in \mathcal{D},$$
$$\lambda^{k,0}_{i,j}, \lambda^{k,1}_{i,j}, x^k_{i,j} \in \mathbb{R}_+, \quad \forall k \in \mathcal{K}, i \in \mathcal{T}_k, j \in \mathcal{D},$$
$$\text{Constraints (5) and (6).}$$

Theorem 1: An optimal solution to Problem (11) is an optimal solution to Problem (8).

Proof: The property of total unimodularity ensures that an optimal solution to the relaxed LP problem (11) has integer values of $x^k_{i,j}$, which is therefore an optimal solution to Problem (10), and thus an optimal solution to Problem (9), as demonstrated in Sec. 4.1. Moreover, Problems (8) and (9) are equivalent forms, completing the proof.

Therefore, the optimal assignment that minimizes the worst completion time among all the jobs can be obtained by solving Problem (11) with efficient LP solvers, such as MOSEK [8].
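As an illustration, the sketch below solves Problem (11) for the two-job example of Figs. 1-3 with SciPy's LP solver instead of the MOSEK or Breeze solvers used in the paper; it is not the authors' implementation. Execution times are taken as zero, and the $\lambda$ variables are substituted out ($\lambda^{k,1}_{i,j} = x^k_{i,j}$, $\lambda^{k,0}_{i,j} = 1 - x^k_{i,j}$), so each $x^k_{i,j}$ simply carries the objective coefficient $M^{c^k_{i,j}+e^k_{i,j}}$; the constant part of the objective can be dropped because constraint (6) fixes $\sum_j x^k_{i,j} = 1$.

```python
# Solve the (reduced) LP of Problem (11) for the Fig. 1 example with scipy.
import numpy as np
from scipy.optimize import linprog

slots = [2, 2, 1]                       # capacities a_j for DC1, DC2, DC3
c = np.array([                          # c^k_{i,j} from Eq. (1), e^k_{i,j} = 0
    [2.0, 1.25, 100/150],               # tA1
    [2.0, 1.25, 100/150],               # tA2
    [2.5, 1.25, 200/120],               # tB1
    [3.0, 300/160, 200/120],            # tB2
])
n_tasks, J = c.shape
M = n_tasks * J                         # M = J * sum_k n_k

cost = (float(M) ** c).flatten()        # objective coefficients M**(c+e) for each x
A_ub = np.zeros((J, n_tasks * J))       # capacity constraints (5)
for j in range(J):
    A_ub[j, j::J] = 1
A_eq = np.zeros((n_tasks, n_tasks * J)) # assignment constraints (6)
for t in range(n_tasks):
    A_eq[t, t*J:(t+1)*J] = 1

# A simplex method returns a vertex, which is integral by total unimodularity.
res = linprog(cost, A_ub=A_ub, b_ub=slots, A_eq=A_eq, b_eq=[1]*n_tasks,
              bounds=(0, 1), method="highs-ds")
x = res.x.reshape(n_tasks, J).round().astype(int)
print(x)
print("worst transfer time:", (c * x).max())
```

On this instance the LP recovers an assignment equivalent to Fig. 3 (the two identical job-A tasks may be swapped), with a worst transfer time of 2 seconds.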

5 ITERATIVELY OPTIMIZING WORST COMPLETION TIMES TO ACHIEVE MAX-MIN FAIRNESS

With the subproblem of minimizing the worst completion time efficiently solved as the LP problem (11), we continue to solve our original multi-objective problem (3) by repeatedly minimizing the next worst completion time.

After solving the subproblem, it is known that the optimal worst completion time is achieved by job $k^*$, whose slowest task $i^*$ is assigned to datacenter $j^*$. We then fix the computed assignment of the slowest task of job $k^*$, which means that the corresponding schedule variable $x^{k^*}_{i^*,j^*}$ is removed from the variable set $\boldsymbol{x}$ for the next round. Also, since task $i^*$ is to be assigned to datacenter $j^*$, all the other assignment variables associated with it should be fixed to zero and removed from $\boldsymbol{x}$: $x^{k^*}_{i^*,j} = 0, \forall j \neq j^*, j \in \mathcal{D}$.

As we have fixed a part of the assignment, the resource capacities should be updated in the problem constraints for the next round. For example, if $x^{k^*}_{i^*,j^*} = 1$, which means that the $i^*$-th task of job $k^*$ is assigned to datacenter $j^*$, then $x^{k^*}_{i^*,j^*}$ is no longer a variable in the next round, and the resource capacity constraint should be updated as $\sum_{k \in \mathcal{K}} \sum_{i \in \mathcal{T}_k, (k,i) \neq (k^*,i^*)} x^k_{i,j^*} \le a_{j^*} - 1$.

Algorithm 1: Performance-Optimal Task Assignment among Jobs with Max-Min Fairness.

Input: input data sizes $d^{k,s}_i$ and link bandwidths $b_{s,j}$ to obtain the network transfer times $c^k_{i,j}$ (by Eq. (1)); execution times $e^k_{i,j}$; datacenter resource capacities $a_j$.
Output: task assignment $x^k_{i,j}, \forall k \in \mathcal{K}, \forall i \in \mathcal{T}_k, \forall j \in \mathcal{D}$.

1: Initialize $\mathcal{K}' = \mathcal{K}$;
2: while $\mathcal{K}' \neq \emptyset$ do
3:   Solve the LP Problem (11) to obtain the solution $\boldsymbol{x}$;
4:   Obtain $x^{k^*}_{i^*,j^*} = \arg\max_{x^k_{i,j} \in \boldsymbol{x}} \phi(x^k_{i,j})$;
5:   Fix $x^{k^*}_{i^*,j}, \forall j \in \mathcal{D}$, and remove them from the variable set $\boldsymbol{x}$;
6:   Update the corresponding resource capacities in Constraints (5);
7:   Set $\phi(x^{k^*}_{i,j}) = x^{k^*}_{i,j}(c^{k^*}_{i^*,j^*} + e^{k^*}_{i^*,j^*}), \forall i \in \mathcal{T}_{k^*}, \forall j \in \mathcal{D}$;
8:   Remove $k^*$ from $\mathcal{K}'$;
9: end while

Fig. 4: Two possible assignments of tasks from two jobs. (In assignment (a), the four tasks complete in 4 s, 3 s, 2.5 s and 2 s; in assignment (b), they complete in 4 s, 3.5 s, 2 s and 2 s.)

Moreover, the completion time of job $k^*$ is obtained as $\phi(x^{k^*}_{i^*,j^*})$, the completion time of its slowest task $i^*$ assigned to datacenter $j^*$, yet the assignment of the other tasks has not been fixed; these ($x^{k^*}_{i,j}, \forall (i,j) \neq (i^*,j^*)$) remain variables of the problem in the next round. We set the associated completion times of these variables as $x^{k^*}_{i,j}(c^{k^*}_{i^*,j^*} + e^{k^*}_{i^*,j^*})$. The rationale is that no matter how fast the other tasks of $k^*$ complete, the completion time of job $k^*$ is determined by its slowest task $i^*$. This ensures that the completion time optimized in the next round is achieved by another job, rather than by a task of job $k^*$ (other than its slowest). This is better illustrated with the example in Fig. 4.

In this example, after the calculation in the first round, task tA1 assigned to DC2 achieves the worst completion time of 4 s among the two sharing jobs. In the second round, if we do not change the associated completion time of the other task tA2, assignment (a) would be calculated as the optimal solution, since it achieves the completion times of (4, 3, 2.5, 2) for the four tasks, which is lexicographically smaller than the (4, 3.5, 2, 2) achieved by assignment (b). However, assignment (a) minimizes the next worst completion time for another task of the same job A, which does not match our original objective of optimizing the worst completion time of job B. Instead, if we set the associated completion times of all of job A's tasks to 4, the objective achieved by assignment (b) becomes (4, 4, 2, 2), better than the (4, 4, 2.5, 2) given by assignment (a). In this way, the optimization correctly chooses assignment (b) to achieve the optimal completion time of job B. Note that although the completion time of task tA2 in assignment (b) is longer than in assignment (a), the completion time of job A remains the same, which is 4 s.

Fig. 5: The implementation of our task assignment algorithm in Spark. (The DAG scheduler submits TaskSets into the scheduling pool; when a timer expires or enough jobs are pending, Algorithm 1 computes the assignment, which the task scheduler enforces through resource offers from the cluster manager.)

As a result, the subproblem in the next round is solved over a reduced set of variables with updated constraints and objectives, so that the next worst job completion time is optimized without impacting the worst job performance obtained in this round. This procedure is repeated until the completion times of all the jobs have been optimized and max-min fairness has been achieved, as summarized in Algorithm 1.
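A condensed sketch of this iterative procedure is given below. It is a simplified illustration, not the authors' Spark code: it reuses the reduced LP of Problem (11) from the earlier sketch, takes the argmax in step 4 over jobs still in $\mathcal{K}'$ (our reading of steps 4 and 8), and assumes `job_of` maps each task to its job.

```python
# A simplified sketch of Algorithm 1: repeatedly solve the LP of Problem (11),
# fix the slowest task of the job attaining the worst completion time, update
# the capacities, and freeze that job's objective coefficients.
import numpy as np
from scipy.optimize import linprog

def assign_max_min_fair(c, job_of, slots):
    """c[t][j] = c^k_{i,j} + e^k_{i,j}; job_of[t] = job of task t; slots = a_j."""
    c = np.asarray(c, dtype=float)
    n_tasks, J = c.shape
    M = n_tasks * J
    cost = float(M) ** c                          # per-variable objective coefficients
    cap = list(slots)
    fixed, last_solution = {}, {}
    remaining_jobs = set(job_of)
    while remaining_jobs:
        free = [t for t in range(n_tasks) if t not in fixed]
        A_ub = np.zeros((J, len(free) * J))       # constraints (5)
        A_eq = np.zeros((len(free), len(free) * J))  # constraints (6)
        for r, t in enumerate(free):
            A_eq[r, r * J:(r + 1) * J] = 1
            for j in range(J):
                A_ub[j, r * J + j] = 1
        obj = np.concatenate([cost[t] for t in free])
        res = linprog(obj, A_ub=A_ub, b_ub=cap, A_eq=A_eq,
                      b_eq=[1] * len(free), bounds=(0, 1), method="highs-ds")
        x = res.x.reshape(len(free), J).round().astype(int)
        last_solution = {t: int(np.argmax(x[r])) for r, t in enumerate(free)}
        # slowest assigned task among jobs not yet finalized, and its job k*
        times = np.where(x == 1, c[free, :], 0.0)
        for r, t in enumerate(free):
            if job_of[t] not in remaining_jobs:
                times[r, :] = -np.inf
        r_star, j_star = np.unravel_index(times.argmax(), times.shape)
        t_star, tau = free[r_star], times[r_star, j_star]
        k_star = job_of[t_star]
        fixed[t_star] = int(j_star)               # step 5: fix the slowest task
        cap[j_star] -= 1                          # step 6: update capacity
        for t in range(n_tasks):                  # step 7: freeze job k*'s costs
            if job_of[t] == k_star and t not in fixed:
                cost[t, :] = float(M) ** tau
        remaining_jobs.discard(k_star)            # step 8
    # tasks never explicitly fixed keep their placement from the final LP round
    for t, j in last_solution.items():
        fixed.setdefault(t, j)
    return fixed
```

For example, calling `assign_max_min_fair(c, job_of=[0, 0, 1, 1], slots=[2, 2, 1])` with the transfer-time matrix from the earlier sketches yields an assignment equivalent to Fig. 3, with job completion times of 2 s and 5/3 s.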

6 PERFORMANCE EVALUATION

Having proved the theoretical optimality and efficiency of our scheduling solution, we proceed to implement it in Apache Spark, and demonstrate its effectiveness in optimizing job completion times in real-world experiments. Moreover, we present an extensive array of large-scale simulation results to demonstrate the effectiveness of our algorithm in achieving max-min fair job performance.

6.1 Real-World Implementation and Experiments

6.1.1 Design and Implementation

In Apache Spark [2], a job can be represented by a Directed Acyclic Graph (DAG), where each node represents a task and each directed edge indicates a precedence constraint. In general, the problem of assigning all the tasks in a DAG to a number of worker nodes, with the objective of minimizing the job completion time, is known to be NP-Complete [9]. As a practical and efficient design, Spark schedules tasks stage by stage, which is handled by the DAG scheduler. When a job is submitted, it is transformed into a DAG of tasks, categorized into a set of stages. The DAG scheduler will then submit the tasks within each stage, called a TaskSet, to the task scheduler whenever the stage is ready to be scheduled, implying that all its parent stages have completed.

Fortunately, the task scheduler in Spark has access to most of the information that our algorithm needs as its input. As a result, we have implemented our new task assignment algorithm as an extension to Spark's TaskScheduler module. The design of our implementation is illustrated in Fig. 5. In our implementation, as soon as a TaskSet in a job has been submitted by the DAG scheduler to the task scheduler, it will be immediately queued in the scheduling pool. With a number of concurrent jobs waiting to be scheduled, our algorithm will be triggered when a preset timer expires or when the number of pending jobs in the pool exceeds a certain threshold.

To optimize the assignment of tasks, our algorithm requires knowledge of the size and location of the output data from each map task, represented by $d^{k,s}_i$ in our formulation. Such knowledge can be obtained from the MapOutputTracker, and is further saved in the TaskSet of reduce tasks. The available bandwidth ($b_{s,j}$) between each pair of datacenters ($s$ to $j$) can be measured with the iperf2 utility. The task execution time $e^k_{i,j}$ can be obtained from historical data, which is the common practice for recurrent jobs. (We can also utilize advanced prediction techniques such as [10] to estimate the task execution time in our future work.) In addition, information about the amount of available resources, corresponding to $a_j$ in our formulation, can be obtained from the cluster manager. Once all the input required by our algorithm is ready, an optimal assignment will be computed for all the tasks in the scheduling pool, through iteratively formulating and solving updated versions of linear programming problems, using the LP solver in the Breeze optimization library [11].

After the assignment has been computed, it will be recorded in the corresponding TaskSet, overriding the original task assignment preferences. When the tasks are finally submitted for execution, these assignment preferences will be satisfied in a greedy manner. Since the scheduling decisions will satisfy resource constraints by considering Resource Offers, each task in the TaskSet can be launched in any available computing slot in the assigned datacenter.

6.1.2 Experimental Setup

We are now ready to evaluate our real-world implementation with an extensive set of experiments deployed across 6 datacenters in Amazon EC2, located in a geographically distributed fashion across different continents. The available bandwidth between each pair of datacenters, measured with the iperf2 utility and averaged over 5 measurements over a period of 20 minutes, is shown in Table 1. Compared to the intra-datacenter network, where the available bandwidth is around 1 Gbps, bandwidth on inter-datacenter links is much more limited: almost all inter-continental links have less than 100 Mbps of available bandwidth. This


TABLE 1: Available bandwidth across geo-distributed datacenters (Mbps).

            Virginia  Oregon  Ireland  Sing  Sydney    SP
Virginia        1000     169      154    52      53   104
Oregon             -    1000       71    69      77    68
Ireland            -       -     1000    49      40    65
Singapore          -       -        -  1000      58    35
Sydney             -       -        -     -    1000    38
San Paulo          -       -        -     -       -  1000

("Sing" is short for "Singapore", and "SP" is short for "San Paulo".)

confirms the observation that transferring large volumes of data across datacenters is likely to be time-consuming.

In our experiments, we have used a total of 12 on-demand Virtual Machine (VM) instances as Spark workers in our Spark cluster, located across 6 datacenters. Two special VM instances in Virginia (us-east-1) have been used as the Spark master node and the Hadoop Distributed File System (HDFS) [12] NameNode, respectively. All instances are of type m3.large, each with 2 vCPUs, 7.5 GB of memory, and a 32 GB Solid-State Drive. On each instance, we run Ubuntu Server 14.04 LTS 64-bit (HVM), with Java 1.8 and Scala 2.11.8 installed. Hadoop 2.6.4 is installed to provide HDFS support for Spark. Our own implementation of the task assignment algorithm is based on Spark 1.6.1, a recent release as of July 2016. Our Spark cluster runs in standalone mode, with all configurations left as default. No external resource manager (e.g., YARN) or database system is activated.

In order to illustrate the efficiency of our task assignment algorithm, we use the legacy Sort application as the benchmark workload. We choose this workload because it is simple yet serves as a primitive. As one of the simplest MapReduce applications, Sort has only one map stage and one reduce stage. However, its sortByKey() operation is a basic building block for many complex data analytics applications, especially in Spark SQL. It triggers an all-to-all shuffle, which introduces heavier cross-node traffic than other reduce operations such as reduceByKey().

In our experiments, we have implemented a Sort application with multiple jobs, and submitted it to Spark for execution. The jobs are submitted to Spark in parallel threads, triggering concurrent jobs to share the resources in the cluster. Therefore, the task assignment decisions for these concurrent jobs will be made and enforced by our implementation in the TaskScheduler. To evaluate the performance under different workloads, we run the workload with 3, 4 and 5 concurrent jobs as separate experiments.

For each Sort job in our benchmark application, the default parallelism is set to 3. In other words, the job will trigger 3 reduce tasks to sort the input dataset, which has 3 partitions distributed on 3 randomly chosen worker nodes. The input dataset is prepared as a step in the map tasks. Each partition of the generated dataset is 100 MB in size, containing 10,000 key-value pairs. Then, at the start of the reduce tasks, these key-value pairs will be shuffled over the network. Since each datacenter has only two workers, a fraction of the shuffled traffic will be sent over inter-datacenter network links, which are likely to be the performance bottleneck. Our task assignment algorithm is specifically designed to mitigate the negative effects of such bottlenecks.

6.1.3 Experimental Results

We conducted three groups of experiments, with 3, 4, and 5 concurrent jobs, respectively. In the first two groups, each job has three tasks, while in the third group, the total number of tasks of the five jobs is set to the total number of available computing slots, which is 12 in our Spark cluster. In each experiment, the job completion times achieved with our optimal task assignment algorithm are compared with those achieved with the default scheduling in Spark, used as the baseline in our comparison study.

The results of our experiments are presented in Fig. 6 and Fig. 7, showing the worst completion times and the second worst completion times among concurrent jobs, respectively. With respect to the worst completion time, it is easy to see that our algorithm always performs better than the baseline in Spark, with a performance improvement of up to 66%, as shown in Fig. 6. With respect to the second worst completion time, our algorithm does not theoretically guarantee that it is smaller than that achieved with any unfair placement. Even without such a guarantee, our algorithm always shows better performance than the baseline in Spark. These experiments have shown convincing evidence that our algorithm, which optimizes for max-min fairness, is effective in minimizing the worst completion time and achieving the best possible performance for all concurrent jobs.

To offer a more in-depth examination and show why such a performance improvement can be achieved, we consider the 7-th run in the second group of our experiments, with the sharing relationship and bandwidth shown in Fig. 8. The six circles represent the six datacenters used in our experiment. A1 is located in the Virginia datacenter, representing the data required by tasks from job A. Each task in job A to be scheduled needs to read a fraction of data from all three datasets (A1, A2 and A3). As each datacenter has two available computing slots, the assignment of the 12 tasks from the four jobs becomes a one-to-one mapping. In such a resource-limited scenario, the assignments of the tasks are tightly coupled with one another. It becomes more difficult for the default strategy in Spark to find a good assignment without optimizing across all the tasks. Moreover, the wide range of available bandwidth between datacenters is not taken into consideration by the default scheduling in Spark, which also explains the performance improvement of our strategy over the baseline.

Furthermore, to evaluate the practicality of our algorithm, we have recorded the time it takes to calculate optimal solutions. Fig. 9 illustrates the computation times, each averaged over 10 runs, with the number of variables varying from 12 to 120. The linear program in our algorithm is efficient, as it takes about 1 second


Fig. 6: The worst job completion time in a set of concurrent jobs (sorted runs for 3, 4 and 5 jobs; Fair vs. Baseline).

Fig. 7: The second-worst job completion time in a set of concurrent jobs (sorted runs for 3, 4 and 5 jobs; Fair vs. Baseline).

Fig. 8: The location of input data for four jobs across six datacenters.

Fig. 9: The computation times of our algorithm at different scales.

to obtain the solution for 48 variables. The computation time is less than 3 seconds for 120 variables, which is acceptable compared with the transfer times across datacenters, which can be tens or hundreds of seconds. In our experiments, the algorithm runs in a VM with 2 vCPUs; we envision that with more powerful servers for the scheduler in a production environment, the running time would be even shorter.

6.2 Large-Scale Simulations

We further present the results and analysis of our extensive simulations of our scheduling algorithm, to evaluate its effectiveness and performance at a finer granularity.

To allow simulations at a large scale, we implemented our simulator in Java, using the efficient CPLEX optimizer [13] to solve our LP problems. We conducted an extensive set of simulations, with 50 and 100 concurrent jobs, respectively. The setting of the simulated inter-datacenter network is consistent with our real-world experiment setting, where 6 geographically distributed datacenters are interconnected by links with the bandwidth capacities shown in Table 1. For simplicity, each job has 10 tasks and each task has three partitions of input data, which are randomly distributed across all the datacenters. The size of the input data is uniformly distributed in the range of [50, 600] MB. The total resource capacity, i.e., the total number of available computing slots across all the datacenters, is set to 1, 1.1, 1.5, 2.5, 5 and 10 times the total number of tasks (represented as 1X, 1.1X, 1.5X, 2.5X, 5X and 10X), respectively, in each group of simulations, to represent different degrees of resource competition.
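A sketch of how such a workload can be generated is given below. It is an illustration only, not the authors' Java simulator, and it assumes the uniform [50, 600] MB size applies to each input partition, which is one possible reading of the setup.

```python
# Illustrative generator for the simulated workload described above.
import random

J, n_jobs, n_tasks_per_job, n_partitions = 6, 50, 10, 3
random.seed(42)

workload = [
    [
        {dc: random.uniform(50, 600)             # source DC -> input size (MB)
         for dc in random.sample(range(J), n_partitions)}
        for _ in range(n_tasks_per_job)
    ]
    for _ in range(n_jobs)
]

total_tasks = n_jobs * n_tasks_per_job
budgets = {f"{m}X": int(m * total_tasks) for m in (1, 1.1, 1.5, 2.5, 5, 10)}
print(budgets)   # total computing slots for each competition-degree setting
```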

We compare our proposed algorithm, referred to as OPTIMAL, with two baselines, LOCAL and CENTRAL. LOCAL refers to the default Spark scheduling algorithm applied in the wide-area scenario, which tries to allocate tasks to the datacenters where their input data are stored. If such data locality cannot be satisfied due to a lack of sufficient resources, LOCAL will randomly select a datacenter with available resources for task assignment. CENTRAL refers to the algorithm that attempts to assign all the tasks of a job to one or a few main


Fig. 10: The CDFs of job completion times for 50 concurrent jobs, achieved with three algorithms (OPTIMAL, LOCAL, CENTRAL), with different resource availability: (a) Even and Loose (2X); (b) Even and Tight (1.1X); (c) Skew and Loose (2X); (d) Skew and Tight (1.1X).

Fig. 11: The CDFs of job completion times for 100 concurrent jobs, achieved with three algorithms (OPTIMAL, LOCAL, CENTRAL), with different resource availability: (a) Even and Loose (2X); (b) Even and Tight (1.1X); (c) Skew and Loose (2X); (d) Skew and Tight (1.1X).

CENTRAL represents the most naive way of running data analytics in the wide area, which aggregates all the input data to a central datacenter and thus runs all the tasks within a single cluster.
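As a rough illustration of the locality-first behavior described for LOCAL, the following Python sketch (with hypothetical names such as local_assign) captures the fallback to a random datacenter when the preferred one has no free slots; it is our own paraphrase of the baseline, not code from Spark or from our implementation.

import random

def local_assign(task_to_input_dc, free_slots, rng=None):
    # task_to_input_dc: task id -> datacenter holding its input data
    # free_slots: datacenter -> number of remaining computing slots
    # Assumes the total number of free slots is at least the number of tasks.
    rng = rng or random.Random(0)
    assignment = {}
    for task, preferred in task_to_input_dc.items():
        if free_slots.get(preferred, 0) > 0:
            chosen = preferred               # data locality is satisfied
        else:
            candidates = [dc for dc, n in free_slots.items() if n > 0]
            chosen = rng.choice(candidates)  # fall back to a random datacenter
        free_slots[chosen] -= 1
        assignment[task] = chosen
    return assignment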

Our primary evaluation metrics are the completion times of concurrent jobs and the improvement ratio of the worst performance among concurrent jobs. To be more specific, for a job whose completion time is opt with our algorithm and baseline with a baseline algorithm, the improvement ratio is calculated as (baseline − opt)/baseline × 100%, of which the maximum is 100%.
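For example, if the worst job completion time is 100 seconds under a baseline and 60 seconds under our algorithm, the improvement ratio is (100 − 60)/100 × 100% = 40%; the ratio approaches its maximum of 100% only as the optimized completion time approaches zero. (These numbers are purely illustrative.)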

Impact of Resource Availability. We first present a group of results with different settings of resource availability. Fig. 10 and Fig. 11 illustrate the empirical CDFs of job completion times across all the 50 and 100 concurrent jobs, respectively. Each subfigure presents the CDFs of job completion times achieved with the three algorithms under comparison (OPTIMAL, LOCAL and CENTRAL), given different settings of resource distribution and competition degree. It is easily observed in all the figures that our algorithm always outperforms the baselines with respect to improving the worst job completion time.

Competition Degree: Tight vs. Loose. Fig. 10a and Fig. 10b illustrate the completion times of 50 concurrent jobs when all the available computing slots are evenly distributed across all the datacenters. The degrees of resource competition are set as 2X (loose) and 1.1X (tight) for Fig. 10a and Fig. 10b, representing that the total numbers of available slots are 2 and 1.1 times the total number of tasks, respectively. As we observe, when the resource competition becomes more intense, i.e., when the resource budget is tight, the difference between our algorithm and the baselines becomes more obvious.

This implies that our algorithm brings more benefit in resolving competition with max-min fairness when the resource budget is tight. In a similar vein, when the available computing slots are skewed in distribution, our OPTIMAL also achieves shorter job completion times than the baselines, and the advantage shows an increasing trend with a tighter budget of resources, as demonstrated by Fig. 10c and Fig. 10d. A similar analysis applies to Fig. 11, where the number of concurrent jobs is increased to 100.

Resource Distribution: Skew vs. Even. Given a fixed degree of resource competition, our algorithm shows more of an advantage in the case of an even resource distribution, compared with a skewed one. This can be easily demonstrated by comparing Fig. 10a and Fig. 10c for the loose case, and comparing Fig. 10b and Fig. 10d for the tight case. Fig. 11 shows a similar pattern. The explanation for such a pattern is that when the resources are evenly distributed, our algorithm has a larger space for optimizing the task placement, and thus exhibits more performance improvement.

Average Improvement Ratio. Apart from the detailed case study of Fig. 10 and Fig. 11, we further present performance improvement ratios achieved with 6 degrees of resource competition (1X, 1.1X, 1.5X, 2.5X, 5X and 10X), for 50 and 100 jobs, respectively, where the available computing slots are randomly distributed across the datacenters. Specifically, the performance improvement ratio is the percentage of reduction in the worst completion time among all the concurrent jobs achieved by our OPTIMAL, compared with the baseline LOCAL. Fig. 12 and Fig. 13 show the empirical CDFs of the improvement ratio over 20 runs, for 50 and 100 concurrent jobs, respectively.


Fig. 12: CDFs of improvement ratios for 50 jobs over 20 runs (OPTIMAL vs. LOCAL).

Fig. 13: CDFs of improvement ratios for 100 jobs over 20 runs (OPTIMAL vs. LOCAL).

Fig. 14: Average improvement ratios with different resource availability (1X to 10X) for 50 and 100 jobs.

Fig. 15: The CDFs of job completion times for 100 concurrent jobs, achieved in the first round and the final round of OPTIMAL: (a) Even and Loose (2X); (b) Even and Tight (1.1X); (c) Skew and Loose (2X); (d) Skew and Tight (1.1X).

Moreover, the average performance improvement ratios are presented in Fig. 14 for an intuitive comparison.

Clearly, our algorithm achieves a performance improvement of 27% to 47% over LOCAL. With a larger degree of resource competition (i.e., with more available computing slots), the improvement ratio becomes smaller. The reason is that our algorithm always seeks the optimal solution in minimizing the worst job completion time, regardless of the degree of competition. However, the performance of LOCAL is affected by the resource availability. With more resources, LOCAL can achieve data locality for more tasks, which avoids stragglers and improves job completion times. Therefore, the advantage of our algorithm becomes less obvious for the 2.5X, 5X and 10X cases. The higher standard deviation in these cases can be explained by the randomness in placing tasks that cannot achieve data locality in LOCAL: though the total number of slots is larger, some tasks may still fail to achieve data locality, as the resources are skewed in distribution.

Improvement Over Rounds with OPTIMAL. We now focus on the behavior of our algorithm by investigating the calculated task scheduling decisions over rounds. In particular, we calculate the job completion times if scheduled with the temporary decision calculated in the first round and with the final decision after the final round, respectively. We find that the job completion times achieved in these two rounds are the same in most cases. For the remaining cases, the improvement achieved in the final round over the first round is marginal. We present the CDFs of completion times for 100 concurrent jobs achieved in the first round and the final round of our algorithm, respectively, in Fig. 15.

Fig. 16: The computation times of our algorithm in large-scale simulations, with different scales (x-axis: number of variables, from 120 to 12,000; y-axis: algorithm running time in seconds).

The settings of the four subfigures are the same as in Fig. 11. As clearly illustrated, the performance in the final round shows no difference from the first round in Fig. 15c, and is slightly better in Fig. 15a, 15b and 15d. The improvement is due to the iterative optimization of our algorithm, which improves the performance of some jobs without impacting others, as illustrated in Sec. 5. These figures imply that the optimization in the first round is the major contributor to the performance improvement of our algorithm; the room for further improvement over later rounds depends on the specific case and is not large.
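To illustrate the round structure discussed above, the following Python sketch outlines one plausible way the iterations could be organized; solve_min_worst_jct is a hypothetical callback standing in for the LP subproblem of Sec. 5, and the bookkeeping here is our simplification rather than the exact algorithm.

def lexicographic_schedule(jobs, solve_min_worst_jct):
    # fixed: job -> (completion time, task assignment) decided in earlier rounds
    # unfixed: jobs whose completion times are still being optimized
    fixed = {}
    unfixed = set(jobs)
    while unfixed:
        # One round: minimize the worst completion time among unfixed jobs,
        # subject to the decisions already fixed for the other jobs.
        # The callback is assumed to return at least one bottleneck job.
        worst_jct, bottleneck_jobs, assignment = solve_min_worst_jct(unfixed, fixed)
        # Freeze the jobs that attain the optimum; later rounds can only
        # improve the remaining jobs without hurting the frozen ones.
        for job in bottleneck_jobs:
            fixed[job] = (worst_jct, assignment[job])
            unfixed.discard(job)
    return fixed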

Scalability. Furthermore, as in our real-world experiment, we have measured the running times at different problem scales in our simulations to evaluate the scalability of our algorithm. The measured computation time at each problem scale is averaged over 10 runs.


Fig. 16 illustrates the computation times when the number of variables ranges from 120 to 12,000. As the CPLEX solver used in the simulations is more efficient, solving the problem with 120 variables takes less than 0.5 seconds, which is much faster than the Breeze solver used in our experiment. Even when the number of variables grows beyond ten thousand, the computation time remains less than 2.5 seconds.

7 DISCUSSION

Adaptation to Network Heterogeneity. Distributed data analytics depends heavily on the reliability and the performance of the underlying networks, due mainly to its abundant demand for data exchange among tasks. However, the networks, being best-effort and shared among many, can hardly deliver consistent or predictable Quality of Service, especially in wide-area networks. As a result, data analytics jobs usually suffer from performance degradation due to spatial and temporal network heterogeneity: different links can offer different capacities at the same time, while the available capacity of a link can vary over time. It remains an open problem to deal with these two types of heterogeneity in data analytics systems adaptively and in real time. Solving this problem requires identifying the heterogeneity effectively and adjusting the job scheduling with awareness of the application-level workload. Our work offers a preliminary solution that accounts for spatial network heterogeneity. To be adaptive to temporal heterogeneity, we can monitor the bandwidth and re-run our algorithm periodically, or whenever the bandwidth variation exceeds a certain threshold.
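A minimal sketch of this adaptation is given below, assuming two hypothetical interfaces: measure_bandwidth(), which returns per-link bandwidth estimates, and reschedule(), which re-runs the task assignment algorithm. The 60-second period and the 20% threshold are placeholders, not values used in our evaluation.

import time

def adapt_to_bandwidth(measure_bandwidth, reschedule,
                       period_s=60, threshold=0.2):
    # Check the bandwidth periodically, and re-run the scheduler whenever
    # any link's bandwidth deviates from the last snapshot by more than
    # the given relative threshold.
    last = measure_bandwidth()            # {link: estimated bandwidth}
    reschedule(last)
    while True:
        time.sleep(period_s)
        current = measure_bandwidth()
        drifted = any(
            last.get(link, 0) > 0 and
            abs(bw - last[link]) / last[link] > threshold
            for link, bw in current.items()
        )
        if drifted:
            reschedule(current)
        last = current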

Economic-Based Resource Allocation. It would be interesting to investigate the problem of task assignment for data analytic jobs among geo-distributed datacenters from a game-theoretic perspective. For example, in the framework of auctions, each job can be viewed as a buyer bidding for multiple goods, in the form of computing slots, from multiple datacenters. Different from conventional resource allocation, where the utility of a user is determined by the total amount of resources received, allocating resources to data analytic jobs faces the challenge that the job performance is determined by the slowest task, which means that the utility is not a simple summation of resources. Such a non-additive nature may significantly change the problem structure and introduce complexities in identifying the optimal solution. It is an interesting direction to be explored in our future work.
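The following toy Python comparison illustrates why this non-additivity matters; the utility functions and numbers are purely illustrative and are not part of any formulation in this paper.

def additive_utility(resources):
    # Conventional model: utility grows with the total resources received.
    return sum(resources)

def job_utility(per_task_rates):
    # A data analytic job finishes only when its slowest task does,
    # so its utility tracks the minimum per-task rate, not the sum.
    return min(per_task_rates)

balanced = [2.0, 2.0, 2.0]   # additive utility 6.0, job utility 2.0
skewed = [5.0, 0.5, 0.5]     # additive utility 6.0, job utility 0.5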

8 RELATED WORK

With increasingly large volumes of data generated globally and stored in geo-distributed datacenters, deploying data analytic jobs across multiple datacenters has received an increasing amount of research attention. Based on their objectives, existing efforts can be roughly divided into two categories: reducing the amount of

inter-datacenter network traffic to save operating costs, and reducing the job completion time to improve application performance.

Vulimiri et al. [3], [14] took the initiative to reduce the amount of data to be moved across datacenters when running geo-distributed data analytic jobs. To reduce the bandwidth cost, they formulated an integer programming problem to optimize the query execution plan and the data replication strategy. They also took advantage of abundant storage resources to aggressively cache the results of queries, to be leveraged by subsequent queries to reduce data transfers. Pixida [15] proposed to divide the DAG of a job into several parts, each to be executed in a datacenter, with the objective of minimizing the total amount of traffic among these divided parts. Despite reducing traffic across datacenters, these solutions do not necessarily shorten job completion times, as bandwidth availability varies across different links and over time.

As a representative work in the second category, Iridium [4] proposed an online heuristic to place both data and tasks across datacenters. Unfortunately, it assumes that the wide-area network interconnecting datacenters is free of congestion, which is far from realistic. Flutter [5] removed this unrealistic assumption, formulated a lexicographical minimization problem of task assignment for a single stage of one job, and obtained its optimal solution. However, all of these existing works focused on assigning tasks in a single job, without considering the inherent competition for resources among concurrent jobs. Despite using a similar theoretical foundation to [5], our problem considers multiple jobs, and is therefore remarkably different and more challenging.

Accounting for the scenario of multiple jobs sharing geo-distributed datacenters, Hung et al. [16] proposed a greedy scheduling heuristic to make job scheduling decisions across geo-distributed datacenters, with the objective of reducing the average job completion time. However, it assumes that the task assignment is pre-determined, and the scheduling decision is the execution order of all the assigned tasks in each datacenter. Therefore, despite sharing a similar context of multiple jobs sharing the same pool of computing resources in geo-distributed datacenters, this work is orthogonal to ours, which aims to determine the best possible placement for the tasks of all the competing jobs with fairness taken into consideration. As an extension of our previous conference paper [17], we have added the results and analysis of an extensive set of large-scale simulations, to comprehensively explore the behavior of our algorithm.

There are plenty of existing efforts [18]–[20] related to task assignment and job scheduling in big data analytic frameworks. To reduce job completion times, they proposed to improve data locality and fairness [18], [19], and to mitigate the negative impact of tasks that progress slowly, called stragglers [20]. However, they are all designed for frameworks deployed in a single datacenter, and do not work effectively across multiple datacenters.

9 CONCLUDING REMARKS

In this paper, we have conducted a theoretical study of the task assignment problem among competing data analytic jobs, whose input data are distributed across geo-distributed datacenters. With tasks from multiple jobs competing for the computing slots in each datacenter, we have designed and implemented a new optimal scheduler to assign tasks across these datacenters, in order to better satisfy job requirements with max-min fairness achieved across their job completion times. To achieve this objective, we first formulated a lexicographical minimization problem to optimize all the job completion times, which is challenging due to the inherent complexity of both multi-objective and discrete optimization. To address these challenges, we started from the single-objective subproblem and transformed it into an equivalent linear programming (LP) problem that can be efficiently solved in practice, based on an in-depth investigation of the problem structure. An algorithm is further designed to repeatedly solve an updated version of the LP subproblems, which eventually optimizes the performance of all jobs with max-min fairness achieved. Last but not least, we have implemented our performance-optimal scheduler in the popular Spark framework, and demonstrated convincing evidence of the effectiveness of our new algorithm using both real-world experiments and large-scale simulations.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," in Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.
[3] A. Vulimiri, C. Curino, P. Godfrey, T. Jungblut, J. Padhye, and G. Varghese, "Global Analytics in the Face of Bandwidth and Regulatory Constraints," in Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2015.
[4] Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica, "Low Latency Geo-Distributed Data Analytics," in Proc. ACM SIGCOMM, 2015.
[5] Z. Hu, B. Li, and J. Luo, "Flutter: Scheduling Tasks Closer to Data Across Geo-Distributed Datacenters," in Proc. IEEE INFOCOM, 2016.
[6] B. Korte and J. Vygen, Combinatorial Optimization: Theory and Algorithms, 3rd ed., ser. Algorithms and Combinatorics. Springer, 2006, vol. 21, ch. 5, p. 104.
[7] R. Meyer, "A Class of Nonlinear Integer Programs Solvable by a Single Linear Program," SIAM Journal on Control and Optimization, vol. 15, no. 6, pp. 935–946, 1977.
[8] E. Andersen and K. Andersen, "The MOSEK Interior Point Optimizer for Linear Programming: An Implementation of the Homogeneous Algorithm," in High Performance Optimization. Springer, 2000, pp. 197–232.
[9] Y. Kwok and I. Ahmad, "Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors," ACM Computing Surveys (CSUR), vol. 31, no. 4, pp. 406–471, 1999.
[10] N. J. Yadwadkar, B. Hariharan, J. E. Gonzalez, B. Smith, and R. H. Katz, "Selecting the Best VM across Multiple Public Clouds: A Data-Driven Performance Modeling Approach," in Proc. ACM Symposium on Cloud Computing (SoCC), 2017.
[11] Breeze: A Numerical Processing Library for Scala. [Online]. Available: http://www.scalanlp.org
[12] Hadoop. [Online]. Available: https://hadoop.apache.org/
[13] CPLEX Optimizer: High-Performance Mathematical Programming Solver. [Online]. Available: https://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/
[14] A. Vulimiri, C. Curino, P. Godfrey, K. Karanasos, and G. Varghese, "WANalytics: Analytics for a Geo-Distributed Data-Intensive World," in Proc. Conference on Innovative Data Systems Research (CIDR), 2015.
[15] K. Kloudas, M. Mamede, N. Preguica, and R. Rodrigues, "Pixida: Optimizing Data Parallel Jobs in Wide-Area Data Analytics," Proc. VLDB Endowment, vol. 9, no. 2, pp. 72–83, 2015.
[16] C. Hung, L. Golubchik, and M. Yu, "Scheduling Jobs Across Geo-Distributed Datacenters," in Proc. ACM Symposium on Cloud Computing (SoCC), 2015.
[17] L. Chen, S. Liu, B. Li, and B. Li, "Scheduling Jobs across Geo-Distributed Datacenters with Max-Min Fairness," in Proc. IEEE INFOCOM, 2017.
[18] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," in Proc. ACM European Conference on Computer Systems, 2010, pp. 265–278.
[19] B. Hindman, A. Konwinski, M. Zaharia et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," in Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2011.
[20] X. Ren, G. Ananthanarayanan, A. Wierman, and M. Yu, "Hopper: Decentralized Speculation-aware Cluster Scheduling at Scale," in Proc. ACM SIGCOMM, 2015.

Li Chen is currently pursuing her Ph.D. degree at the Department of Electrical and Computer Engineering, University of Toronto, where she received her M.A.Sc. degree in 2014. She received her B.Engr. degree from the Department of Computer Science and Technology, Huazhong University of Science and Technology, China, in 2012. Her research interests include big data analytics systems, cloud computing, datacenter networking and resource allocation.

Shuhao Liu is currently a Ph.D. student in the Department of Electrical and Computer Engineering, University of Toronto. He received his B.Eng. degree from Tsinghua University in 2012. His current research interests include software-defined networking and big data analytics.

Baochun Li received his Ph.D. degree from the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, in 2000. Since then, he has been with the Department of Electrical and Computer Engineering at the University of Toronto, where he is currently a Professor. He has held the Bell Canada Endowed Chair in Computer Engineering since August 2005. His research interests include large-scale distributed systems, cloud computing, peer-to-peer networks, applications of network coding, and wireless networks. He is a member of the ACM and a Fellow of the IEEE.


Bo Li is a Professor in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. He holds the Cheung Kong Chair Professorship at Shanghai Jiao Tong University. Prior to that, he was with the IBM Networking System Division, Research Triangle Park, North Carolina. He was an adjunct researcher at Microsoft Research Asia (MSRA) and a visiting scientist at the Microsoft Advanced Technology Center (ATC). He has been a technical advisor for ChinaCache Corp. (NASDAQ: CCIH) since 2007. He is an adjunct professor at Huazhong University of Science and Technology, Wuhan, China. His recent research interests include large-scale content distribution in the Internet, peer-to-peer media streaming, the Internet topology, cloud computing, green computing and communications. He is a Fellow of the IEEE for "contribution to content distributions via the Internet". He received the Young Investigator Award from the National Natural Science Foundation of China (NSFC) in 2004. He served as a Distinguished Lecturer for the IEEE Communications Society (2006–2007). He was a co-recipient of three Best Paper Awards from IEEE, and of the Best System Track Paper award at ACM Multimedia (2009).