Energy-aware Hierarchical Scheduling of Applications in Large Scale Data Centers
Gaojin Wen1, Jue Hong1, Chengzhong Xu2, Pavan Balaji3, Shengzhong Feng1, Pingchuang Jiang1
1Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, 518055 Shenzhen, China. Email: {gj.wen, sz.feng, jue.hong, pc.jiang}@siat.ac.cn
2Department of Electrical and Computer Engineering, Wayne State University. Email: [email protected]
3Mathematics and Computer Science Division, Argonne National Laboratory. Email: [email protected]
Abstract: With the rapid advance of cloud computing, large-scale data centers play a key role in cloud computing. The energy consumption of such distributed systems has become a prominent problem and has received much attention. Among existing energy-saving methods, application scheduling can reduce energy consumption by placing and consolidating applications so as to decrease the number of running servers. However, most application scheduling approaches do not consider the energy cost of network devices, which is also a large portion of the power consumption of large data centers.
In this paper we propose a Hierarchical Scheduling Algorithm for applications, namely HSA, to minimize the energy consumption of both servers and network devices. In HSA, a Dynamic Maximum Node Sorting (DMNS) method is developed to optimize the application placement on servers connected to a common switch. Hierarchical crossing-switch adjustment is applied to further reduce the number of running servers. As a result, both the number of running servers and the amount of data transfer can be greatly reduced. The time complexity of HSA is Θ(n · log(log n)), where n is the total number of servers in the data center. Its stability is verified via simulations. Experiments show that HSA outperforms existing algorithms.
I. INTRODUCTION
Modern high-end computing systems, including scientific computing centers, cloud computing platforms, and enterprise computing systems, use tens of thousands of nodes with hundreds of thousands of cores. The upcoming 10-20 petaflop machines, such as the Livermore Sequoia and Argonne Mira supercomputers in the U.S. and the Kei supercomputer in Japan, will further increase this number to millions of cores. With the upcoming exascale platforms, this number is expected to grow further to millions of nodes with close to a billion cores. Such massive-scale systems consume a large amount of energy for operation and cooling. For example, the 2.98 petaflop Dawning Nebula system requires close to 2.55 MW of power to operate. This trend of increasing power consumption will continue if there is no investment in energy- and power-awareness research for such systems.
As an effective energy-conserving method, application scheduling has become widely used. By consolidating applications on as few servers as possible and putting idle servers to sleep or powering them off, energy consumption can be significantly reduced [1], [2], [3], [4]. However, few existing application scheduling approaches have considered optimizing the power consumption of the network infrastructure. In a typical commodity large-scale system, the network infrastructure is built to provide high bisection bandwidth for applications that span the entire system. However, since the system is shared by diverse applications, each utilizing only a part of it, many of these network links are unused. A network usage record from the Fusion cluster at the Argonne National Laboratory of the U.S., spanning around 18 months from November 2009 to May 2011, testifies to this phenomenon (Fig. 1). As shown by this record, the network utilization can vary significantly depending on the forwarding policy (in this case, ranging from an average of 54.7% to an average of 90.3%). This implies that, by considering the characteristics of the network infrastructure in application scheduling, more energy conservation can be achieved.
In this paper we focus on analyzing application scheduling and its impact on the energy usage of the network infrastructure. We propose a Hierarchical Scheduling Algorithm for applications, namely HSA, to minimize the energy consumption of both servers and network devices. We first present a formal description of the application scheduling problem considering network devices, and define a series of metrics for evaluating algorithm performance. Then, we employ dedicated subset-by-subset and level-by-level scheduling methods to design our HSA algorithm. The proposed HSA algorithm has a time complexity of Θ(n · log(log n)), where n is the total number of servers in the data center. Through simulation and experiments on a large-scale data center, the efficiency and stability of our algorithm are verified.
The remainder of this paper is organized as follows. Related work is presented in Section II. The problem statement and evaluation metrics are formulated in Section III. The main framework of our method is described in Section IV. The scheduling method for a node subset is investigated in Section V. The overall hierarchical scheduling method is proposed and analyzed in Section VI. We validate it through simulation of a large-scale data center in Section VII. Conclusions and discussion of future work are presented in Section VIII.
2011 International Conference on Cloud and Service Computing
978-1-4577-1637-9/11/$26.00 Β©2011 IEEE
Fig. 1: Fusion network usage (percentage per day) under different forwarding policies. (a) Minimizing energy usage, causing as many links as possible to be shared; the average network utilization is 54.7%. (b) Maximizing performance, assuming that the communication pattern of the applications matches the setup perfectly; the average network utilization is 90.3%. (c) Using the default linear forwarding table policy of OpenSM, the subnet management infrastructure used by InfiniBand; the average network utilization is 83.3%.
II. RELATED WORK
Workload placement on servers in clustered environments has traditionally been driven by performance objectives. All servers are intended to be fully operational in order to provide the maximum available resources to users; the technique of load balancing is used to spread the workload across as many of them as possible. This mode of operation causes low utilization of the cluster resources and consequently consumes a high amount of power. Research has pointed out that using techniques of load-skew scheduling and dynamically switching idle servers to power-save mode can significantly reduce energy consumption [11], [1], [2].
Load-skew scheduling has been studied extensively in the context of the online bin-packing problem [3], [5], [6], [7]. Each server is treated as a bin with dimensions corresponding to available resources such as CPU, memory, and network bandwidth. Tasks are treated as objects with their resource requirements as dimensions. A scheduling algorithm determines the minimum number of servers needed to run these tasks concurrently. Live task migration among servers after scheduling can help to achieve a higher resource utilization rate. It is also important for energy-aware application consolidation, which is another popular technique in green computing [8].
Migration cost is essential to this discussion. Different methods for estimating the cost have been proposed and studied. Gambosi et al. [6] and Sanders et al. [7] assumed the total number and dimension of the migrated tasks to be the cost in the framework of the semi-online bin-packing problem. Lefevre and Orgerie measured the energy cost of migrating virtual machines between two different servers [9]. Verma et al. provided a systematic approach to migration-cost-aware virtualized system optimization [4], [12]. In this paper, we extend this work by considering a more comprehensive and realistic cost model. We include variations in migration cost due to the hierarchical network topology in the data center. Our proposed energy-aware hierarchical scheduling algorithm addresses such variations efficiently.
There has also been much prior work on other models of scheduling, including QoS-aware job scheduling [13], [14], [15]. The ability of such scheduling techniques to delay or slow down jobs (as long as the QoS requirement is satisfied) provides even more opportunity for energy savings. Our proposed approach is complementary to these techniques and can be used together with them.
Application scheduling can be formulated as a bin-packing problem (BPP), in which a set of items of various sizes must be packed using a minimum number of identical bins. BPP is one of the oldest and most well-studied problems in computer science. Numerous methods have been developed for solving BPP, including Best Fit (BF), First Fit (FF), Best Fit Descending (BFD), and First Fit Descending (FFD). An excellent survey of research on BPP is available in [17]. Let N_opt be the minimum number of bins required to pack the input set of items. A packing generated by either FF or BF uses no more than (17/10)·N_opt + 2 bins, and one generated by either FFD or BFD uses no more than (11/9)·N_opt + 4 bins [17].
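As an illustration of these packing heuristics, the following sketch (ours, not from the paper) implements First Fit Descending with the paper's node capacity of 100; the item sizes are hypothetical.

```python
# Illustrative First Fit Descending (FFD) for one-dimensional bin packing.
# Bin capacity 100 matches the node capacity assumed later in the paper.

def first_fit_descending(items, capacity=100):
    """Pack items into bins; each bin is a list of item sizes."""
    bins = []
    for item in sorted(items, reverse=True):  # largest item first
        for b in bins:
            if sum(b) + item <= capacity:     # first bin that fits
                b.append(item)
                break
        else:
            bins.append([item])               # open a new bin
    return bins

if __name__ == "__main__":
    # hypothetical item sizes; total is 324, so at least 4 bins are needed
    packed = first_fit_descending([70, 60, 50, 33, 33, 33, 20, 15, 10])
    print(len(packed), packed)
```

Here FFD reaches the lower bound of ceil(324/100) = 4 bins; in general it is only guaranteed to stay within the (11/9)·N_opt + 4 bound quoted above.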
III. PROBLEM STATEMENT AND METRICS
In this section, we first state the problem and then propose evaluation criteria for the scheduling algorithm.
A. Problem Description
Consider a finite sequence of nodes Node = (node_1, node_2, ..., node_n) and a finite sequence (or list) of applications A = (a_1, a_2, ..., a_m), where a_i in A corresponds to the amount of resources (memory, CPU, I/O) used by the i-th application. We assume that 0 < a_i <= 100 for all i and that the maximal capacity of each node is 100.
Compute nodes in a data center are often connected through high-speed switches, as shown in Fig. 2. The distance and the energy cost of transferring data between two nodes differ: transferring data between two nodes connected directly to the same switch tends to use less energy than transferring between nodes connected indirectly [10].
Fig. 2: Network in the data center
The objective of application scheduling is to schedule the applications in A onto a subset of the nodes in Node, so as to minimize both the number of nodes used and the total size of data transfer incurred during the execution of these applications.
B. State descriptor
Let an integer vector S^t = (s^t_1, s^t_2, ..., s^t_m) denote the location of the applications, where s^t_i means that application a_i is located at node node_{s^t_i} at time t. S^0 represents the initial state of the applications. We use an integer vector s = (s_1, s_2, ..., s_m) as a state descriptor, where 1 <= s_i <= n; s_i indicates that the i-th application will transfer from its source node node_{s^0_i} to the s_i-th node.
C. Objective function
Our objective is to find a scheduling method that minimizes the number of nodes used and the data transferred. Two evaluation metrics are considered: the node usage metric and the data transfer metric.

1) Node usage metric: denoted NC_cost, it is defined as

NC_cost(s) = ||s||,

where ||.|| counts the number of nodes used by the applications A according to s.

2) Data transfer metric: The cost of transferring data from one node to another varies due to differences in network connection and distance between nodes. Assume a transfer cost weight matrix C = {c_{i,j}}, 0 <= i, j <= n, where c_{i,j} is the weight for data transfer from the i-th node to the j-th node. The data transfer metric, denoted DT_cost, is defined by

DT_cost(s) = sum_{i=1}^{m} a_i * f(s_i - s^0_i) * c_{s_i, s^0_i},

where s^0_i is the index of the node on which application a_i initially resides, and

f(x) = 1 if x != 0,  f(x) = 0 if x = 0.   (1)
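The two metrics can be sketched as follows; this is a minimal illustration with our own variable names, and the sizes and cost matrix are made-up examples.

```python
# Illustrative sketch of the Section III metrics (names are ours):
# node_cost counts distinct nodes used by a state descriptor s, and
# dt_cost sums a_i * c[s_i][s0_i] over every application that moves.

def node_cost(s):
    """Number of distinct nodes holding applications under state s."""
    return len(set(s))

def dt_cost(sizes, s0, s, c):
    """sizes[i]: resource amount of application i; s0/s: old/new node
    index of each application; c: transfer cost weight matrix."""
    total = 0.0
    for a, src, dst in zip(sizes, s0, s):
        if dst != src:            # f(x) = 1 exactly when the app moves
            total += a * c[dst][src]
    return total

sizes = [30, 50, 20]
s0 = [0, 1, 2]                    # initial placement, one app per node
s  = [0, 0, 0]                    # consolidate everything onto node 0
c  = [[0, 1, 2],
      [1, 0, 1],
      [2, 1, 0]]
print(node_cost(s), dt_cost(sizes, s0, s, c))  # prints: 1 90.0
```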
IV. ALGORITHM OVERVIEW
In this section, we first introduce the concepts of node subset and node level, and then present the main framework of our hierarchical scheduling method. It is based on two scheduling strategies: scheduling within a node subset and scheduling at a node level.
A. Node subset and node level
We define a node subset to be a set of nodes in which the cost of data transfer between any two nodes is assumed to be equal, such as the nodes connected to the same switch. Two examples of such node subsets are shown in Fig. 3.
Fig. 3: Two examples of node subset.
A number, called its level mark, is assigned to each subset according to the average cost of data transfer among its nodes. For example, the level marks of the subsets in Fig. 3 are different. A node level is composed of all subsets with the same level mark. Fig. 4 gives a description of node levels.
Fig. 4: Examples of node level.
B. Scheduling in node subset
A node subset provides the advantage that nodes within it can be scheduled without considering the cost of data transfer. Recall that the scheduling target is to minimize the number of nodes used and the data transfer. Clearly, transferring smaller applications is preferable to transferring larger ones. Therefore, nodes in a node subset holding larger or more applications should be kept static and used to receive applications transferred from other nodes. Section V describes the details of the scheduling method within a node subset.
C. Hierarchical Scheduling
After being processed by scheduling within node subsets, all nodes can be divided into two categories, FullNode and NotFullNode, according to a preset threshold. FullNode contains the nodes in which the total size of applications is above the threshold. NotFullNode is the set of the remaining nodes, which are neither empty nor full and can receive more applications. All nodes in NotFullNode are then divided into node subsets, and application scheduling is performed again for each subset. This hierarchical scheduling process is executed repeatedly. The details are presented in Section VI.
V. SCHEDULING IN NODE SUBSET
For convenience, we use Node_f to denote the nodes in a node subset that are fixed to keep their applications. The remaining nodes in the subset form the transfer node set Node_t.
A. The K-th Max Node Sorting Method (KMNS)
We sort the node sequence in ascending order of the total size of applications on each node. The first K nodes (those with the lightest loads) form Node_t; the remaining nodes form Node_f.
We then transfer the applications in Node_t into Node_f by the Best Fit Descending (BFD) method, since BFD performs better than other scheduling methods. During this transfer, all applications in Node_t are sorted in descending order and are transferred into Node_f, from the largest application to the smallest, by the Best Fit method. If an application in Node_t cannot find a holding node in Node_f, a new empty node is added to Node_f to hold it.
We then determine the matching between the newly added nodes in Node_f and the input node sequence: each newly added node is matched to the node in the input sequence from which most of its applications were transferred. Finally, we calculate the nodes used and the data transferred. In general, the scheduling method for a node subset is as follows:
Algorithm 1 The K-th Max Node Sorting Algorithm (KMNS)
Input: one node subset Node with N nodes.
Output: a node sequence with the least nodes used and the least data transfer.
Main:
(1) Sort Node in ascending order of the total size of applications on each node, generating Node'.
(2) Node_t = Node'_{i=1,...,K}, Node_f = Node'_{i=K+1,...,N}.
(3) Transfer the applications from Node_t into Node_f with the Best Fit Descending method.
(4) Find the matching between Node_f and Node.
(5) Calculate NC^K_cost and DT^K_cost.
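A minimal sketch of Algorithm 1, under our simplifying assumption that all transfer costs inside one subset are equal, so the data transferred is simply the total size of the moved applications; function and variable names are ours.

```python
# Sketch of KMNS for one node subset: the K lightest nodes become the
# transfer set, and their applications are re-packed onto the fixed
# nodes by Best Fit Descending.

def kmns(nodes, k, capacity=100):
    """nodes: list of lists of application sizes.
    Returns (nodes_used, data_transferred, resulting placement)."""
    order = sorted(range(len(nodes)), key=lambda i: sum(nodes[i]))
    transfer = [nodes[i][:] for i in order[:k]]       # K lightest nodes
    fixed = [nodes[i][:] for i in order[k:]]          # kept static
    moved = 0
    apps = sorted((a for n in transfer for a in n), reverse=True)
    for a in apps:                                    # Best Fit Descending
        best = None
        for n in fixed:
            free = capacity - sum(n)
            if a <= free and (best is None or free < capacity - sum(best)):
                best = n                              # tightest fit so far
        if best is None:
            best = []                                 # open an empty node
            fixed.append(best)
        best.append(a)
        moved += a
    used = sum(1 for n in fixed if n)
    return used, moved, fixed
```

For example, with four nodes holding 80, 10, 60, and 20 units and K = 2, the two lightest nodes are emptied onto the two heaviest, using 2 nodes and moving 30 units of data.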
Property 1: If K = N, there is no need to sort the input node sequence, and the K-th Max Node Sorting method reduces to the Best Fit Descending method (BFD).

Property 2: NC^K_cost >= N_opt, where the optimal node number N_opt can be estimated by

N_opt = ceil(TA / ACN),   (2)

where TA is the total size of the applications in the node subset and ACN is the average capacity of the nodes. Eq. (2) means that at least N_opt nodes are needed to hold all the applications in the node subset.
Lemma 1: Suppose the total size of the applications is T, the number of nodes used by KMNS for the input node sequence is NC^K_cost, the total number of nodes is N, and the average application load per node is AS = T/N. Then DT^K_cost for KMNS is no more than DTE(K) = T - NC^K_cost * AS, if K <= N - NC^K_cost.

Proof: If K <= N - NC^K_cost, then NC^K_cost of the most heavily loaded nodes are used to contain all the applications, so the total size of the fixed applications is no less than NC^K_cost * AS; hence the data transferred is no more than T - NC^K_cost * AS.

Lemma 1 gives an upper bound on the amount of data transfer for a good scheduling method.
B. Performance of KMNS
To study the performance of the KMNS method, we willuse the following three metric:
1) Ratio of Node Used (RoNU):
π πππ(πΎ) = πππΎπππ π‘/π
It is clear that πππΎπππ π‘ is less than N and bigger than ππππ‘.
Consequently, ππππ‘
π <= π πππ(πΎ) <= 1.2) Ratio of Data Transfer (RoDT):
π ππ·π (πΎ) = π·ππΎπππ π‘/π
For the data π·ππΎπππ π‘ that need to be transferred, the best case
is that there is no data, and the worst case is that the data sizeis πβ1
π β π ββ π , (π ββ +β), which is very large. As aresult, 0 <= π ππ·π (πΎ) < 1.
3) Ratio of Data Transfer Estimation (RoDT):
π ππ·ππΈ(πΎ) = π·ππΈ(πΎ)/π = 1βπ πππ(πΎ)
From Lemma 1. we know
π ππ·ππΈ(πΎ) >= π ππ·π (πΎ), ππ πΎ <= π βπππΎπππ π‘
We study the performance of KMNS with different numbers of nodes in the input node sequence, as shown in Fig. 5 and Fig. 7. RoNU(K) decreases with respect to K and is no less than N_opt/N, while RoDT(K) increases with respect to K. Traversing all values of K can therefore lead to the best result.
Fig. 5: Performance of KMNS with 32 nodes as input (N = 32, N_opt = 20). The curves show RoNU(K), RoDT(K), RoDTE(K), and N_opt/N.
Algorithm 2 Dynamic Max Node Sorting Algorithm (DMNS)
Input: node sequence A = {A_i}_{i=1}^{N}.
Output: node sequence C with the least nodes used NC_cost and the least data transfer DT_cost.
Initialization: NC_cost = N, DT_cost = +infinity, C = null.
Main loop:
Process all the applications in A using BFD and calculate NC^BFD_cost.
for K = 0 to N do
  Compute NC^K_cost, DT^K_cost, and the resulting node sequence B by the KMNS algorithm.
  if NC_cost > NC^K_cost then
    NC_cost = NC^K_cost, DT_cost = DT^K_cost, C = B.
  else if NC_cost == NC^K_cost then
    if DT_cost > DT^K_cost then
      DT_cost = DT^K_cost, C = B.
    end if
  end if
end for
C. Dynamic Max Node Sorting Method (DMNS)
Aiming at the least nodes used and minimal data transfer, we propose the Dynamic Max Node Sorting method in Algorithm 2.

From Property 1, we know that BFD is a special case of DMNS, so NC_cost <= NC^BFD_cost. From [17], we know BFD(L) <= (11/9)·N_opt + 4. The upper bound for DMNS follows in the lemma below.

Lemma 2:

DMNS(L) <= (11/9)·N_opt + 4.
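The DMNS loop can be sketched on top of the KMNS idea as follows; this is our simplified helper, not the paper's implementation: try every K, keep the placement with the fewest nodes used, and break ties by the least data transferred.

```python
# Sketch of DMNS (Algorithm 2): sweep K over 0..N, run KMNS for each K,
# and keep the best (fewest nodes, then least data moved) placement.

def kmns(nodes, k, capacity=100):
    """Simplified KMNS: pack the K lightest nodes' apps onto the rest
    by Best Fit Descending; returns (used, moved, placement)."""
    order = sorted(range(len(nodes)), key=lambda i: sum(nodes[i]))
    fixed = [nodes[i][:] for i in order[k:]]
    apps = sorted((a for i in order[:k] for a in nodes[i]), reverse=True)
    moved = 0
    for a in apps:
        fits = [n for n in fixed if sum(n) + a <= capacity]
        best = min(fits, key=lambda n: capacity - sum(n)) if fits else None
        if best is None:
            best = []                       # open an empty node
            fixed.append(best)
        best.append(a)
        moved += a
    return sum(1 for n in fixed if n), moved, fixed

def dmns(nodes, capacity=100):
    best = None
    for k in range(len(nodes) + 1):
        used, moved, placement = kmns(nodes, k, capacity)
        if best is None or (used, moved) < (best[0], best[1]):
            best = (used, moved, placement)
    return best
```

On the hypothetical input [[80], [10], [60], [20], [30]] the sweep settles on K = 3, consolidating onto 2 nodes while moving only 60 units, whereas K = N (plain BFD) also uses 2 nodes but moves all 200 units.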
VI. HIERARCHY SCHEDULING
The whole procedure results in a hierarchical scheduling process, as illustrated in Algorithm 3.
A. Complexity
Overall, the hierarchical scheduling algorithm processes the nodes of the same level subset by subset, and the levels one by one. Assume there are n nodes in total and 16 ports on each switch. The largest level of the hierarchy is log_16 n, and there are n/(16·m) node subsets at level m. Assuming the time complexity of Algorithm 1 is Θ(1), the complexity of the whole hierarchical algorithm is

sum_{m=1}^{log_16 n} n/(16·m) = (n/16) · H_{log_16 n} = Θ(n · log(log n)),

since the harmonic number H_k = Θ(log k).

B. Stability
The hierarchical scheduling algorithm processes the nodes subset by subset and the node subsets level by level. If the applications on some node change, the placement of applications in the corresponding subset will be refined, which generates a new NotFullNode set. The subsequent hierarchical scheduling from level 2 to level n will then be influenced, leading to a new scheduling result. The stability performance of the proposed hierarchical
Algorithm 3 Hierarchy Scheduling of Applications (HSA)
Input: node sequence A = {A_i}_{i=1}^{n}, data transfer cost matrix C_T;
Output: node sequence C with the least nodes used and the least data transfer.
Initialization: current level l = 1, threshold = 0.95, C = null.
Main loop:
while A != null do
  Divide A into subsets {B_i}_{i=1}^{nb} according to the transfer cost matrix C_T.
  for i = 1 : nb do
    Process subset B_i with Algorithm 2.
    Calculate NotFullNode and FullNode according to the threshold.
    Remove FullNode from A, C = C ∪ FullNode.
  end for
  Refine current level: l = l + 1.
end while
scheduling method can be evaluated by the following Ratio of Local Data Transfer (RoLDT) metric:

RoLDT = Ns · L / T + DT_{2-n} / T,

where T is the total size of the applications in the input node sequence, L is the maximum capacity of a node, Ns is the number of nodes per switch, and DT_{2-n} is the total data transfer from the second level to the last level. Different numbers of nodes per switch lead to different levels of stability: when the number of nodes per switch Ns increases, DT_{2-n} decreases. A good balance between Ns and DT_{2-n} results in a small RoLDT. The smaller RoLDT is, the more stable the hierarchical scheduling algorithm is, and the less the scheduling result fluctuates due to changes in the applications on the nodes of a node subset.
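The HSA loop of Algorithm 3 can be sketched roughly as follows, under our simplifications: subsets are fixed groups of `ports` nodes, within-subset consolidation uses a simple first-fit onto the heaviest nodes (standing in for Algorithm 2), and a node is frozen as full once its load reaches the threshold. All names here are ours.

```python
# Rough sketch of the HSA loop: consolidate within each subset, freeze
# full nodes, then re-group the remaining nodes at the next level.

def consolidate(subset, capacity=100):
    """Move applications from the lightest nodes onto the heaviest ones."""
    subset.sort(key=sum, reverse=True)
    for i in range(len(subset) - 1, 0, -1):
        for a in list(subset[i]):
            for dst in subset[:i]:            # first heavier node that fits
                if sum(dst) + a <= capacity:
                    dst.append(a)
                    subset[i].remove(a)
                    break
    return subset

def hsa(nodes, ports=16, capacity=100, threshold=0.95):
    full, active = [], [n for n in nodes if n]
    while active:
        next_active = []
        for i in range(0, len(active), ports):        # one subset per switch
            for node in consolidate(active[i:i + ports], capacity):
                if sum(node) >= threshold * capacity:
                    full.append(node)                 # frozen, leaves A
                elif node:
                    next_active.append(node)
        if len(next_active) == len(active):           # no progress: stop
            full.extend(next_active)
            break
        active = next_active                          # next (coarser) level
    return full
```

With hypothetical loads [50, 45, 30, 20, 10, 5] and 2-port switches, the sketch consolidates six nodes down to two across the levels.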
VII. EXPERIMENTS AND DISCUSSIONS
The proposed scheduling algorithms have been implemented in C++. We tested them on a PC with a 2.8 GHz Pentium IV CPU and 2 GB of memory. The data transfer weight matrix C (shown in Fig. 6) is generated according to the distance and network connection among the nodes. Two kinds of input node sequences, with sizes between 0 and 100, are generated randomly with a uniform distribution and with a normal distribution (zero mean and a standard deviation of 0.3 are used). Then applications satisfying the size constraints of the pre-generated input nodes are generated randomly with a uniform distribution.
A. Performance of KMNS
Input node sequences with 32, 64, 128, and 256 nodes are generated to test the performance of KMNS. The results are shown in Fig. 5 and Fig. 7. RoNU(K) decreases with respect to K and is no less than N_opt/N, while RoDT(K) increases with respect to K.
Fig. 6: Transfer cost weight matrix.
B. Comparison of DMNS
To evaluate DMNS, we also implemented two other scheduling methods: Dynamic Max Application Sorting (DMAS) and Best Fit Descending (BFD). In DMAS, the input node sequence is sorted according to the size of the largest application on each node. Table I shows the scheduling results of DMNS, DMAS, and BFD for input node sequences with a normal distribution; Table II shows the results for a uniform distribution. All three methods achieve the least nodes used. DMNS is better than DMAS and BFD in that it achieves the least nodes used together with the minimal data transfer, as shown in Fig. 8.
TABLE I: Comparison of DMNS, DMAS, and BFD with input node sequences under a normal distribution.

N     Metric     DMNS      DMAS      BFD
32    NC_cost    16.58     16.58     16.58
      DT_cost    599.77    659.01    1073.85
64    NC_cost    32.26     32.26     32.26
      DT_cost    1194.32   1318.99   2176.14
128   NC_cost    63.82     63.82     63.82
      DT_cost    2427.37   2712.19   4438.20
256   NC_cost    127.23    127.23    127.23
      DT_cost    4853.39   5366.64   8887.95

TABLE II: Comparison of DMNS, DMAS, and BFD with input node sequences under a uniform distribution.

N     Metric     DMNS      DMAS      BFD
32    NC_cost    16.16     16.16     16.16
      DT_cost    394.46    440.73    932.07
64    NC_cost    31.72     31.72     31.72
      DT_cost    806.39    887.54    1909.15
128   NC_cost    63.64     63.64     63.64
      DT_cost    1607.06   1800.07   3929.89
256   NC_cost    127.27    127.27    127.27
      DT_cost    3237.21   3616.10   7943.63
C. Performance of HSA
1) Minimal data transfer: We implemented the hierarchical scheduling algorithm (HSA) based on DMNS, DMAS, and BFD. 4096 nodes generated with a normal distribution and with a uniform distribution are used as input node sequences, tested with 32-, 64-, 128-, and 256-port switches. We find that all three methods reach the least
Fig. 7: Performance of KMNS. (a) With 64 nodes as input (N = 64, N_opt = 35). (b) With 128 nodes as input (N = 128, N_opt = 61). (c) With 256 nodes as input (N = 256, N_opt = 132). The curves show RoNU(K), RoDT(K), RoDTE(K), and N_opt/N.
nodes used. The differences in data transfer are compared in Fig. 9. Overall, DMNS costs the least data transfer among the three methods. The data transfer cost at the first level is much larger than that of the remaining levels. The data transfer for a normal distribution input is larger than for a uniform distribution input, while the data transfer of levels 2 to n is almost the same for both distributions when compared with that of the first level. HSA based on DMNS shows the best performance among the three methods, costing the least data transfer when the number of nodes per switch is 32 or 64. All three methods cost almost the same data transfer for levels 2 to n when the number of nodes per switch is 128 or 256. For each scheduling method, the larger the number of nodes per switch, the less data transfer of levels 2 to n
Fig. 8: Comparison of the hierarchical scheduling algorithm. The performance of DMNS, DMAS, and BFD is tested with uniform distribution (UD) and normal distribution (ND) input.
between the node subsets is. Overall, the proposed hierarchical scheduling algorithm is a stable scheduling method.
Fig. 9: Comparison of the performance of the hierarchical scheduling algorithm (HSA) based on DMNS, DMAS, and BFD. For each of 32, 64, 128, and 256 nodes per switch, the data transfer of level 1 and of levels 2-n is shown for UD and ND input. HSA based on DMNS shows the best overall performance.
2) Comparison of stability: We compare the stability of hierarchical scheduling based on DMNS, DMAS, and BFD in Fig. 10. Different numbers of nodes per switch lead to different stability. With 64 nodes per switch, all three methods reach better stability than with 32, 128, or 256 nodes per switch. Overall, HSA based on DMNS shows better stability than HSA based on DMAS and BFD; with 64 nodes per switch, HSA based on DMNS performs best.
Fig. 10: Comparison of the stability (RoLDT) of hierarchical scheduling based on DMNS, DMAS, and BFD, with UD and ND input and 32, 64, 128, and 256 nodes per switch. With 64 nodes per switch, HSA based on DMNS shows the best performance overall.
VIII. CONCLUSION AND FUTURE WORK
In this paper we addressed the application scheduling problem in large-scale data centers. Aiming at the fewest servers used and minimal data transfer, a Dynamic Maximum Node Sorting (DMNS) scheduling method is proposed to optimize application scheduling for nodes connected to a common switch. Based on DMNS, a stable hierarchical scheduling algorithm (HSA) with time complexity Θ(n · log(log n)) is developed. The stability of HSA stems from the idea of scheduling applications subset by subset and level by level. The performance of HSA is compared with similar traditional algorithms via simulation. Results show that HSA outperforms the others in terms of stability and energy consumption.
Future work includes considering more energy-aware factors in our scheduling algorithm. For instance, the cooling conditions of each node differ, and applications are preferably scheduled onto nodes with lower temperature; this leads toward a temperature-aware scheduling algorithm for applications in large-scale data centers.
ACKNOWLEDGMENT
This work is supported by the National High-Tech Research and Development Plan of China (863 Program) under Grant No. 2009AA01A129-2; the Science and Technology Projects of Guangdong Province under Grants No. 2010A090100028 and 201120510102; the National Natural Science Foundation of China under Grant No. 60903116; the Knowledge Innovation Project of the Chinese Academy of Sciences under Grant No. KGCX2-YW-131; and the Science and Technology Projects of Shenzhen under Grants No. JC200903170443A, ZD201006100023A, and ZYC201006130310A.
REFERENCES
[1] Anne-Cecile Orgerie, Laurent Lefevre, Jean-Patrick Gelas. Demystifying energy consumption in Grids and Clouds. In Proceedings of the International Conference on Green Computing (GREENCOMP), pp. 335-342, 2010.
[2] Jeffrey S. Chase, Darrell C. Anderson, Prachi N. Thakar, Amin M. Vahdat, and Ronald P. Doyle. Managing energy and server resources in hosting centers. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (SOSP '01), ACM, New York, NY, USA, 103-116, 2001.
[3] Andrew Yao. New algorithms for bin packing. Journal of the ACM, 27(2), 207-227 (1980).
[4] Akshat Verma, Puneet Ahuja, Anindya Neogi. Power-aware dynamic placement of HPC applications. In Proceedings of the 22nd Annual International Conference on Supercomputing (ICS '08), 175-184, 2008.
[5] David S. Johnson. Near-Optimal Bin Packing Algorithms. PhD thesis, MIT, Cambridge, Massachusetts, June 1973.
[6] G. Gambosi, A. Postiglione, and M. Talamo. Algorithms for the relaxed online bin-packing model. SIAM Journal on Computing, 30(5), 1532-1551 (2000).
[7] Peter Sanders, Naveen Sivadasan, and Martin Skutella. Online scheduling with bounded migration. Springer, 2004.
[8] Shekhar Srikantaiah, Aman Kansal, and Feng Zhao. Energy aware consolidation for cloud computing. In Proceedings of the 2008 Conference on Power Aware Computing and Systems (HotPower '08), USENIX Association, Berkeley, CA, USA, 2008.
[9] Laurent Lefevre, Anne-Cecile Orgerie. Designing and evaluating an energy efficient cloud. The Journal of Supercomputing, 51(3), 352-373, March 2010.
[10] J. Baliga, R. W. A. Ayre, K. Hinton, R. S. Tucker. Green cloud computing: balancing energy in processing, storage, and transport. Proceedings of the IEEE, 99(1), 149-167 (2011).
[11] Andrew J. Younge, Gregor von Laszewski, Lizhe Wang, Sonia Lopez-Alarcon, and Warren Carithers. Efficient resource management for Cloud computing environments. In Proceedings of the International Conference on Green Computing (GREENCOMP '10), IEEE Computer Society, Washington, DC, USA, 357-364, 2010.
[12] Akshat Verma, Puneet Ahuja, Anindya Neogi. pMapper: power and migration cost aware application placement in virtualized systems. In Middleware '08: Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, pp. 243-264, 2008.
[13] M. Islam, P. Balaji, P. Sadayappan, and D. K. Panda. QoPS: a QoS based scheme for parallel job scheduling. Springer LNCS, 2862, 252-268 (2003).
[14] M. Islam, P. Balaji, G. Sabin, and P. Sadayappan. Analyzing and minimizing the impact of opportunity cost in QoS-aware job scheduling. In the IEEE International Conference on Parallel Processing (ICPP '07), 42.
[15] M. Islam, P. Balaji, P. Sadayappan, and D. K. Panda. Towards provision of quality of service guarantees in job scheduling. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster '04), 245-254.
[16] N. Desai, P. Balaji, P. Sadayappan, and M. Islam. Are non-blocking networks really needed for high-end-computing workloads? IEEE International Conference on Cluster Computing (Cluster '08), 152-159.
[17] D. S. Johnson, A. Demers, J. D. Ullman, M. R. Garey, and R. L. Graham. Worst-case performance bounds for simple one-dimensional packing algorithms. SIAM Journal on Computing, 3(4), 299-325 (1974).