
Source: net.pku.edu.cn/~cuibin/Papers/2015TKDE.pdf




Heterogeneous Environment Aware Streaming Graph Partitioning

Ning Xu, Bin Cui, Lei Chen, Zi Huang and Yingxia Shao

Abstract—With the increasing availability of graph data and the widely adopted cloud computing paradigm, graph partitioning has become an efficient pre-processing technique to balance the computing workload and cope with the large scale of input data. Since the cost of partitioning the entire graph is strictly prohibitive, there have been some recent tentative works on streaming graph partitioning, which runs faster, is easily parallelized, and can be incrementally updated. Most of the existing works on streaming partitioning assume that worker nodes within a cluster are homogeneous in nature. Unfortunately, this assumption does not always hold. Experiments show that these homogeneous algorithms suffer significant performance degradation when running in a heterogeneous environment. In this paper, we propose a novel adaptive streaming graph partitioning approach to cope with heterogeneous environments. We first formally model the heterogeneous computing environment with consideration of the unbalance of the computing ability (e.g., the CPU frequency) and communication ability (e.g., the network bandwidth) of each node. Based on this model, we propose a new graph partitioning objective function that aims to minimize the total execution time of the graph-processing job. We then explore some simple yet effective streaming algorithms for this objective function that can achieve balanced and efficient partitioning results. Extensive experiments are conducted on a moderate-sized computing cluster with real-world web and social network graphs. The results demonstrate that the proposed approach achieves significant improvement compared with the state-of-the-art solutions.

Index Terms—Graph Partitioning, Streaming Algorithms, Heterogeneous Environment, BSP Model


1 INTRODUCTION

In recent years, the scale of graph data has become large and continues to increase fast [11], [9]. The unprecedented proliferation of graph data requires efficient processing mechanisms and methods to handle different workloads.

Traditional MapReduce frameworks have proven to be inefficient for iterative computing workloads on graphs [12]. Thus, many graph-based parallelism frameworks have recently been proposed; for example, Pregel and GraphLab [18], [16] exploit different parallel graph processing methods. As a representative of the recently emerging parallel graph computing systems, Pregel is based on the BSP (bulk-synchronous parallel) model [30] and adopts a vertex-centric model in which each vertex executes a user-defined function in a sequence of super-steps. By default, Pregel uses a hash function to partition vertexes, which is inefficient and causes large amounts of network traffic between machines. An efficient graph partitioning that minimizes the application's overall runtime is desired. However, the balanced graph partitioning problem is challenging and known to be NP-complete [17].

• Ning Xu, Bin Cui and Yingxia Shao are with the Key Lab of High Confidence Software Technologies (MOE), School of EECS, Peking University, China. E-mail: {ning.xu, bin.cui, simon0227}@pku.edu.cn

• Lei Chen is with the Hong Kong University of Science and Technology, Hong Kong, China. E-mail: [email protected]

• Zi Huang is with the School of ITEE, The University of Queensland, Australia. E-mail: [email protected]

Most of the state-of-the-art partitioning works on Pregel-like systems are based on k-balanced graph partitioning [5], which aims to minimize the total communication cost between machines and balance the number of vertexes on each part. The k-balanced graph partitioning problem has been proven to be NP-hard [5], and several effective algorithms have been proposed. However, as graphs become larger, approximation algorithms and even the widely used multi-level heuristic algorithms suffer a significant increase in partitioning time. As shown by an experiment in [29], the multi-level approach requires more than 8.5 hours to partition a graph from Twitter, which is sometimes longer than the time spent on processing the workload itself.

Another striking factor affecting parallel graph computing performance is the heterogeneous computing environment. Most of the current graph partitioning strategies assume that the nodes in a cluster are homogeneous in nature. Unfortunately, these homogeneity assumptions do not always hold, especially when the graph system is set up in a public cloud environment [10], [33] (e.g., Amazon Elastic Compute Cloud, EC2 [1]) or an inter-company private data center [22]. For example, [10] measured the network bandwidth of 128 EC2 instances and identified significant network bandwidth unevenness: some pair-wise bandwidths reach more than 500 MB/s, while the lowest is only 37.5 MB/s. In a private cloud, there usually exist multiple generations of hardware, with various computing and communication abilities as well.

In fact, when the cluster is mixed with heterogeneous hardware, classic k-balanced algorithms use the same strategy as in a homogeneous cluster. All nodes are assigned the same workload, so faster nodes are often left waiting for slower ones, and the facilities cannot be fully utilized. Figure 1 shows an example of a 10-node private cloud. We use the state-of-the-art multi-level graph partitioning tool METIS [14] to partition a graph and run a PageRank workload on it. Figure 1(a) shows the job execution time in the original cluster and Figure 1(b) shows the result after 5 nodes were upgraded with faster CPUs and network. As we can see, there is little improvement in the running time of the job (the red lines, i.e., from 277s to 259s), although half of the hardware was upgraded.

(a) Original Cluster    (b) Heterogeneous Cluster

Fig. 1: Job Execution Time on a Heterogeneous Cluster

Though [10] proposed a multi-level graph partitioning framework considering the network bandwidth differences in a cloud environment, this work did not take computing heterogeneity into consideration. In addition, as mentioned above, such a multi-level approach is not suitable for large-scale graphs on account of its huge partitioning time cost.

A recently emerging graph partitioning approach is streaming partitioning, in which the input graph is processed without knowledge of the whole graph data. Streaming graph partitioning uses lightweight heuristic algorithms and provides faster speed, comparable partitioning performance, and incremental updating [27], [29].

In this paper, to alleviate the graph partitioning challenges encountered in heterogeneous environments, we propose a novel heterogeneous environment aware graph partitioning approach. We formally model the heterogeneous environment, considering the unbalance of both the computing ability and the communication ability of nodes in the distributed cluster. Specifically, instead of k-balanced graph partitioning, we propose a new graph partitioning objective function that aims to minimize the running time of the graph-processing application. Based on this objective function, we further design several novel streaming graph partitioning heuristics that can fully utilize the heterogeneous environment.

To evaluate our proposed heterogeneous aware streaming algorithms, we have built a prototype system, named HeAPS, following Google's Pregel paradigm [18], and implemented our new streaming graph partitioning heuristics along with other state-of-the-art algorithms. We evaluate our graph partitioning approach in HeAPS using real-world social network graph datasets and several representative workloads on two test clusters with 26 and 28 nodes, respectively. In addition, we investigate several heterogeneous environments which are common circumstances in the cloud. The evaluation confirms that our heterogeneous environment aware graph partitioning approach can significantly improve job execution time with a balanced workload among cluster nodes.

The remainder of this paper is organized as follows. In Section 2 we give an overview of related research in this direction. Section 3 discusses the background and problem definition. In Section 4, we present the heterogeneous environment modeling and our graph partitioning approach. Section 5 presents the experimental study. Finally, we conclude this work in Section 6.

2 RELATED WORK

Large-Scale Graph Computing: To meet the current prohibitive requirements of processing large-scale graphs, many distributed methods and frameworks have been proposed and become appealing. Pregel [18] and GraphLab [16] both use a vertex-centric computing model and run a user-defined program at each worker node in parallel. Giraph [2] is an open source project which adopts Pregel's programming model and adapts it for HDFS. In these parallel graph processing systems, it is important to partition a large graph into several balanced sub-graphs, so that parallel workers can process them coordinately. However, most of the current systems simply choose a hash-based method.

Graph Partition: Graph partitioning is a combinatorial optimization problem which has been studied for decades and widely used in many fields, such as parallel subgraph listing [24]. The widely used objective function, k-balanced graph partitioning, aims to minimize the number of edges cut between partitions while balancing the number of vertexes. Though the k-balanced graph partitioning problem is NP-complete [17], several solutions have been proposed to tackle this challenge.

Andreev et al. [5] presented a bicriteria approximation algorithm which guarantees polynomial running time with an approximation ratio of O(log n). Another solution was proposed by Even et al. Besides approximation solutions, Karypis et al. [15] proposed a parallel multi-level graph partitioning algorithm that minimizes the bisection on each level. There are heuristic implementations like METIS [14], and the parallel and multiple-constraint versions of METIS [23], which are widely used in many existing systems. Pellegrini and Roman proposed Scotch [21], [20], which takes network topology into account. Although they cannot provide a precise performance guarantee, these heuristics are quite effective. More heuristic approaches are summarized in [3].

Streaming Partitioning Algorithms: The methods mentioned above are offline and require expensive processing time. Recently, Stanton and Kliot [27] proposed a series of online streaming partitioning methods using heuristics. Fennel [29] extended this work by proposing a streaming partitioning framework which combines other heuristic methods. Tsourakakis [28] used higher-length walks to improve the quality of the graph partition. Nishimura and Ugander [19] further proposed Restreaming LDG and Restreaming Fennel, which generate an initial graph partitioning using the last streaming partitioning result. LogGP [32] used a hypergraph to optimize the initial partitioning result. Although there is no mathematical proof, experiments show that these one-pass streaming partitioning algorithms have performance comparable to multi-level ones with a short partitioning time. Besides, they adapt to dynamic graphs, where offline methods become inefficient due to the expensive computational cost of repartitioning the graph.

Heterogeneous Environment Computing: However, the aforementioned implementations cannot fully utilize the cluster information to guide the partitioning strategies. Recently, there have been some works focusing on processing jobs with consideration of the heterogeneous environment. [33] studied heterogeneity in the MapReduce framework to improve applications on Hadoop but did not consider heterogeneity for graph partitioning. Closer to our setting, [10] proposed a novel graph partitioning framework considering the network bandwidth differences in a cloud environment using multi-level algorithms. However, this work is based on the tree-topology network structure in the cloud, which lacks generality. Besides, it is based on a multi-level algorithm that is not fast enough for current large-scale graph scenarios.

3 BACKGROUND AND PROBLEM DEFINITION

In this section, we first discuss the properties of heterogeneous environments and their influence on current parallel graph processing systems. We then analyze the impact of the workload on the performance of graph query processing. Finally, we formally define the partitioning problem that we study in this paper.

3.1 Heterogeneous Computing Environment

Heterogeneous environments are common in current clouds and are caused by the following three factors.

1. Natural heterogeneity: Applications running on a public cloud usually endure uncontrollable performance variations even though the hardware is the same.

This is caused by the network topology and individual differences [10], [33]. Whether virtual machines run on the same or on different physical hosts may cause network heterogeneity, and virtual machines running on different physical nodes may sit on CPUs with different speeds or workloads, which causes computing heterogeneity. The same phenomenon can be observed in a private cloud as well. Figure 2 shows the bandwidth and computing ability differences of a data center which consists of 26 physical nodes. As we can see from Figure 2(a), the network bandwidth varies significantly. The average bandwidth is 105 MB/s for each pair of the 26 instances. The highest speed reaches 214 MB/s while the lowest is only 10.4 MB/s. Figure 2(b) shows the running time of calculating 1 million digits of PI on a single processor. Similar to network bandwidth, there is heterogeneity in computing ability.

2. Hardware heterogeneity: Usually, in a private cloud, organizations own multiple generations of hardware. The new hardware may be equipped with a faster network adapter and CPU than the last generation, which results in computing and communication heterogeneity [33].

3. Virtualization: In fact, many cloud systems use virtual machines to make full use of resources. The bandwidth between a pair of nodes might depend on how the instances are allocated. When two instances are allocated to the same physical node, data can be transferred between them at high speed, whereas when two instances are allocated to different nodes, or even to nodes across different routers, the data transfer between them is much slower. Thus, virtualization affects computing and communication abilities [31].

(a) Communication Ability Unevenness    (b) Computing Ability Unevenness

Fig. 2: Unevenness of communication and computing ability in a Private Cluster

A heterogeneous environment generally has two characteristics. First, the network bandwidth between different pairs of nodes may vary significantly: transferring the same amount of data between two different node pairs can cost very different amounts of time. Second, the computing capacities of nodes also differ radically.

Fig. 3: Heterogeneity Impact on Running Time

In fact, a heterogeneous cluster cannot be fully utilized if we use a traditional graph partitioning algorithm. We have shown one example in the Introduction. Here we give another example to reveal that traditional graph partitioning algorithms suffer in a heterogeneous environment. Figure 3 illustrates the performance of the PageRank algorithm running in the above-mentioned cluster on a METIS-partitioned graph. The light color bars show the running time of each node in the cluster with 26 nodes. The total execution time is determined by the slowest node. Obviously, compared with the others, nodes 13, 14 and 15 are the bottleneck of the cluster and severely increase the execution time (the upper red line). We then remove these slower nodes and rerun the same workload on 23 nodes. As shown by the dark color bars, the running time (the lower red line) in this experiment outperforms the previous one even with fewer nodes.

This experiment indicates that traditional homogeneous graph partitioning is unaware of the performance of the physical nodes. Assigning the same workload to all nodes causes the weaker nodes to become the bottleneck of the system.

3.2 Workload Analysis

Besides the heterogeneous environment, graph partitioning should consider the workload, an issue that has not received enough attention before. The query workload affects the time ratio between computation and communication.

Figure 4(a) shows an experiment on two workloads, Statistical Inference [6] and Two-hop Friend List. The Statistical Inference workload usually contains complicated SVM computation, while Two-hop Friend List needs to transfer huge intermediate results between partitions. We record the average percentage of time used for computing the UDF function and sending/receiving data when executing them in HeAPS. As we can see, the running time of Statistical Inference is dominated by computing, while Two-hop Friend List is dominated by I/O data transmission.

To better understand the shift between communication cost and computing cost in a parallel graph processing system, we simulate a series of tasks. We rewrite the UDF function of PageRank [7] by adding additional iterations of summing the PR values from the neighboring nodes. We use different iteration counts to vary the computing-to-communication ratio: a larger iteration count means more computing, and the communication becomes less important. The original PageRank workload can be considered as an iteration count of one. Figure 4(b) shows that when the number of iterations is less than 20, the job execution time is dominated by the communication part, because the running time stays the same as the original PageRank. When the number of iterations is larger than 20, the execution time is proportional to the number of iterations, and the computing part dominates the overall time.
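The compute-intensity knob described above can be sketched as follows: a PageRank-style vertex UDF whose computing/communication ratio is tuned by repeating the neighbor-sum a configurable number of times. The function name and the Pregel-style interface are our own illustration, not HeAPS's actual API.

```python
# Sketch of a PageRank vertex UDF with a tunable amount of redundant
# computation (the `extra_iters` knob from the experiment above).
# Interface and names are hypothetical, not from the paper's system.
DAMPING = 0.85

def pagerank_udf(vertex_value, messages, num_vertices, extra_iters=1):
    """One super-step of PageRank; extra_iters > 1 repeats the neighbor
    sum, adding CPU cost without changing the message volume."""
    total = 0.0
    for _ in range(extra_iters):   # busy-work loop varying computing cost
        total = sum(messages)      # same result on every repetition
    new_rank = (1.0 - DAMPING) / num_vertices + DAMPING * total
    # In a BSP system the new rank would now be sent to all out-neighbors,
    # so the communication cost stays constant as extra_iters grows.
    return new_rank
```

Because the message volume is untouched, increasing `extra_iters` shifts the workload from communication-bound to computation-bound, matching the shift observed in Figure 4(b).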

Thus, to balance the execution time of each worker, we should take the workload into consideration. For a computing-intensive workload whose running time is dominated by computing, we should balance the partitions based on the computing cost, while for a communication-intensive workload the opposite holds. Traditional graph partitioning frameworks lack this analysis of the workload, so it is hard for them to balance the running time of each worker.

(a) Workload Analysis    (b) Execution Time Shift

Fig. 4: Workload Effects on Execution Time

3.3 Problem Definition

In this paper, we focus on alleviating the graph partitioning challenges in heterogeneous environments for parallel graph processing systems.

We first formally describe the general graph partitioning problem. We use G = (V, E) to represent the Data Graph to be partitioned. V is the set of vertexes, and E is the set of edges in the graph. The graph may be either directed or undirected. Let P_k = {V_1, ..., V_k} be a set of k subsets of V. P_k is said to be a partition of G if: V_i ≠ ∅, V_i ∩ V_j = ∅ for i, j = 1, ..., k, i ≠ j, and ∪_i V_i = V. In this paper, we call the elements V_i of P_k the parts of the partition. The number k is called the cardinality of the partition. The graph partitioning problem is a combinatorial optimization problem of finding an optimal partition P_k based on an objective function. Here we give a formal definition:

The graph partitioning problem can be defined as a triplet (S, p, f) such that: S is the discrete set of all the partitions of G; p is a predicate on S which creates a subset of S called the admissible solutions set Sp; f is the graph partitioning objective function. The graph partitioning problem aims to find a partition P* such that P* ∈ Sp and P* minimizes f(P):

f(P*) = min_{P ∈ Sp} f(P)    (1)
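The partition conditions above (non-empty parts, pairwise disjointness, full coverage of V) can be checked mechanically. A minimal sketch, with a function name of our own choosing:

```python
# Checks that parts = [V_1, ..., V_k] is a valid partition of V
# as defined above: each V_i non-empty, pairwise disjoint, covering V.
def is_valid_partition(V, parts):
    if any(len(p) == 0 for p in parts):   # V_i != empty set
        return False
    union = set()
    for p in parts:
        if union & set(p):                # V_i and V_j must not overlap
            return False
        union |= set(p)
    return union == set(V)                # the union of all V_i equals V
```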

In particular, if the vertexes of the graph arrive in some order, each with the set of its neighbors, and we partition the graph based on this vertex stream, the problem is called Streaming Graph Partitioning. Let P_k^t = {V_1^t, ..., V_k^t} be the partitioning result at time t, where V_i^t is the set of vertexes in partition i at time t. A streaming graph partitioning algorithm is sequentially presented a vertex v and its neighbors N(v), and it must assign v to a partition i utilizing only the information contained in the current partitioning P^t. Once a vertex is placed, it is not removed.
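The streaming model just described can be sketched with a one-pass loop; the placement rule below is the linear deterministic greedy heuristic in the style of Stanton and Kliot [27] (place v where it has the most neighbors, damped by a capacity penalty). It illustrates the model, not our heterogeneous objective; the names are ours.

```python
# One-pass streaming partitioning: each vertex arrives with its neighbor
# list, is assigned using only the current partitioning, and never moves.
def stream_partition(stream, k, capacity):
    parts = [set() for _ in range(k)]
    for v, neighbors in stream:          # vertex v arrives with N(v)
        def score(i):
            # neighbors already placed in part i, damped by how full it is
            return len(parts[i] & set(neighbors)) * (1.0 - len(parts[i]) / capacity)
        i = max(range(k), key=score)
        parts[i].add(v)                  # once placed, v is never removed
    return parts
```

A usage example: feeding the stream `[(1, [2, 3]), (2, [1]), (3, [1, 2])]` with k = 2 keeps the mutually connected vertexes together, since each new vertex scores highest on the part holding its neighbors.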

Most of the recent works use k-balanced graph partitioning for parallel graph processing systems [29], [27]. The k-balanced graph partitioning problem aims to minimize the edge cuts between partitions while keeping a similar number of vertexes in each part. It balances the computing cost of each worker and minimizes the total communication cost. However, in parallel graph processing systems, especially in heterogeneous environments, balancing the number of vertexes and minimizing the communication cost cannot minimize the job running time. Section 3.2 and the experiment in [25] reveal that the running time of a worker is determined by both the computation and communication jobs, and k-balanced graph partitioning cannot balance the running time of each node.

Apart from the k-balanced objective function, in this paper we propose a more appropriate graph partitioning objective function that takes the computing environment's heterogeneity into consideration. Besides, our objective function analyzes the workload's communication and computing ratio to accurately balance the execution time of each node.

Based on this objective function, we then explore several heterogeneous aware partitioning algorithms based on the streaming model, with fast speed and high quality. We use the streaming partitioning model for heterogeneous aware partitioning. On one hand, the existing approximation methods and multi-level methods do not scale to big data, since they need the full graph information and require a long running time. On the other hand, the streaming model is highly efficient without full access to all the graph data, and provides faster speed, comparable partitioning performance and incremental updating [27], [29].

4 HETEROGENEOUS AWARE STREAMING GRAPH PARTITIONING APPROACH

In this section, we first model the heterogeneous environment and introduce the notations used in the paper. We next present the new heterogeneous aware graph partitioning problem definition and some streaming heuristics based on the new objective function.

4.1 Heterogeneous Environment Modeling

In order to better partition the graph based on the heterogeneity of the physical computing environment, we first formally model the heterogeneous environment.

We introduce the Physical Graph to model the heterogeneous physical environment. The physical graph is represented as a weighted undirected graph PG = (PV, PE). PV is the set of graph vertexes; each vertex corresponds to one physical worker node in the system. We use PV_i to denote physical node i in the cluster. PE is the set of edges in the graph; each edge represents a communication link between a pair of physical nodes. Note that in practice each physical node could process several partitions of vertexes of the data graph; without loss of generality, we assume that each physical node processes only one partition. Given the partition P_k = {V_1, ..., V_k}, we use V_i to denote the vertexes assigned to PV_i and Edge(V_i, V_j) to denote the edges between V_i and V_j. In addition, supposing each physical node has a memory capacity of M_i, we make sure that M_i is larger than the data assigned to PV_i.
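A minimal sketch of the physical graph PG = (PV, PE) defined above, annotated with the two heterogeneity metrics introduced next (computing ability C_i per node, communication ability L(i, j) per node pair). The class and field names are ours, not from the paper's system.

```python
# Physical graph: one vertex per worker node, one weighted edge per link.
class PhysicalGraph:
    def __init__(self, num_nodes):
        self.C = [1.0] * num_nodes                   # computing ability C_i
        # L[i][j] holds the link ability L(i, j); the diagonal stays 0,
        # since a node's access to its own data is not charged.
        self.L = [[0.0] * num_nodes for _ in range(num_nodes)]

    def set_link(self, i, j, ability):
        # communication ability is symmetric: L(i, j) = L(j, i)
        self.L[i][j] = ability
        self.L[j][i] = ability
```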

We next reveal how to quantify the heterogeneity of the distributed environment, i.e., in terms of data processing speed and network bandwidth. Computing-intensive applications highly depend on the data processing speed, while communication-intensive applications care about the network bandwidth. Therefore, heterogeneity measurements of the environment should consider both data processing speed and network bandwidth. We use two heterogeneity metrics: the computing ability of each physical node and the communication ability between each pair of nodes. We define and collect these two metrics as follows.

Computing Ability: This is measured by the amount of computation that can be performed in a time unit on a certain physical node. In this paper, we use a floating-point operation as the computation unit and use C_i to denote the computing ability of PV_i. It can be determined by running a benchmark that randomly generates two floating-point numbers and computes their product 1,000,000 times on PV_i. We record the response time and calculate the time used for one operation, denoted as TcTime_i. In fact, we use TcTime_i as the measured value in the experiment. However, TcTime_i is a small floating-point number, so for readability we use C_i to denote the normalized TcTime_i. If PV_m is the node with the maximum running time, TcTime_max, then we can get C_i using the following formula.

C_i = TcTime_max / TcTime_i    (2)

Communication Ability: The communication ability between a pair of physical nodes (i, j) is measured as the amount of data that can go through their link in a time unit; in this paper, we use a 64-bit datum as the communication unit. We use L(i, j) to denote the communication ability between the physical nodes PV_i and PV_j. We assume that the communication ability in either direction across a given physical pair is the same, so L(i, j) = L(j, i). We let L(i, i) = 0, meaning that we omit the cost for any physical node to retrieve its own information. We measure the communication ability by sending a data chunk from node i to node j and recording the transmission time, denoted TsTime(i, j). We assume full-duplex communication in our experiments. Similar to the computing ability, letting TsTime_max be the largest time among all node pairs, L(i, j) can be normalized by:

L(i, j) = TsTime_max / TsTime(i, j)    (3)
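As a concrete sketch of how the two normalized metrics could be collected, assuming dictionary-based bookkeeping (the function and variable names here are illustrative, not taken from the paper's system):

```python
def measure_computing_ability(node_flop_time):
    """Normalize per-node floating-point times into C_i (formula (2)).

    node_flop_time: dict node_id -> measured TcTime_i for one operation.
    The slowest node gets C_i = 1; faster nodes get proportionally more.
    """
    tc_max = max(node_flop_time.values())
    return {i: tc_max / t for i, t in node_flop_time.items()}


def measure_communication_ability(pair_transfer_time):
    """Normalize pairwise transfer times into L(i, j) (formula (3)).

    pair_transfer_time: dict (i, j) -> measured TsTime(i, j) in one direction.
    The result is symmetric (L(i, j) = L(j, i)) with L(i, i) = 0.
    """
    ts_max = max(pair_transfer_time.values())
    L = {}
    for (i, j), t in pair_transfer_time.items():
        L[(i, j)] = L[(j, i)] = ts_max / t
    for n in {n for pair in pair_transfer_time for n in pair}:
        L[(n, n)] = 0.0
    return L
```

With these conventions the slowest node and the slowest link both normalize to 1, so Ci and L(i, j) are dimensionless speedup factors.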

4.2 Objective Function

After modeling the heterogeneous environment, we now analyze the partitioning objective function. In the BSP model, a job is divided into a sequence of super-steps. In each super-step, as shown in Figure 5, each physical node computes a user-defined function against all vertexes assigned to it and sends/receives messages to/from its neighbors. The physical nodes process in parallel, and a synchronization step in each super-step ensures that all nodes have finished their workload. Thus, the running time of each super-step is determined by the slowest node.

Fig. 5: Abstraction of the BSP Model

Here we use τ^n to denote the running time of the n-th super-step, and τ_i^n to denote the time that PVi spends at the n-th super-step. τ^n equals the maximum value of τ_i^n. We use JobTime to denote the total running time of the graph processing job, which is the sum of the running times of the slowest node in each super-step, as formally described in Formula 4.

JobTime = Σ_n τ^n = Σ_n Max_i(τ_i^n)    (4)

For simplicity, we assume that every vertex in the Data Graph is active and sends messages to all its neighbors. A discussion of other models is beyond the scope of the current paper. Under this assumption, for node PVi, the running time of each super-step is stable, denoted as τ_i. If we have t super-steps in total, the execution time of the job equals:

JobTime = t × Max_i(τ_i)    (5)

In fact, graph partitioning aims to minimize JobTime. Naturally, our objective function can be expressed as Formula 6.

Min(JobTime) = Min(Max_i(τ_i))    (6)

As mentioned in Section 3, τ_i is determined by both communication time and computing time. That is:

τ_i = fs(tcomp_i, tcomm_i)    (7)

The function fs is determined by the system implementation. If the system adopts a blocking I/O mode in which CPU and I/O operate serially, then

τ_i = tcomp_i + tcomm_i    (8)

If a system processes I/O operations in parallel with computation, then

τ_i = Max(tcomp_i, tcomm_i)    (9)
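A minimal sketch of the two system models (formulas (8) and (9)) and the resulting job time (formula (5)); the names are illustrative:

```python
def step_time_blocking(t_comp, t_comm):
    """Blocking I/O model, formula (8): CPU and I/O run serially."""
    return t_comp + t_comm


def step_time_overlapped(t_comp, t_comm):
    """Overlapped I/O model, formula (9): CPU and I/O run in parallel."""
    return max(t_comp, t_comm)


def job_time(per_node_step_times, num_supersteps):
    """Formula (5): each super-step is gated by the slowest node."""
    return num_supersteps * max(per_node_step_times)
```

The choice between the two step-time models only changes fs; the outer Min-Max structure of the objective stays the same.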

The computing time tcomp_i is determined by the computing workload on PVi and the computing ability of PVi. The communication time tcomm_i is determined by the communication workload on PVi and the communication ability of PVi.

Now we can define our new objective function for the graph partitioning problem based on minimizing the job execution time, which we call time-minimized graph partitioning. Given a data graph G = (V, E) and a physical graph PG = (PV, PE), let

f(P) = Max_{i∈{1,...,k}}(fs(tcomp_i, tcomm_i))    (10)

Sp = {P ∈ S and |P| = |PV| = k}    (11)

Time-minimized graph partitioning aims to find the partition P* ∈ Sp that minimizes f:

f(P*) = Min_{P∈Sp} f(P)    (12)

In the BSP model, our time-minimized objective function is more appropriate than the k-balanced one. As mentioned in Section 3.3, k-balanced graph partitioning aims to balance the computing cost of each worker and minimize the total communication cost. It may still suffer communication skew across different nodes, which increases the job running time. Unlike k-balanced graph partitioning with that disadvantage, our new objective function focuses on minimizing the total running time of the application.

4.3 Streaming Heuristics

NP-hardness: Different from the objective function of the traditional graph partitioning problem, time-minimized graph partitioning is a Min-Max combinatorial optimization problem. We now prove that the time-minimized graph partitioning problem is NP-hard via another Min-Max combinatorial optimization problem, Minimum Makespan Scheduling (MMS): given processing times for n jobs, p1, p2, ..., pn, and an integer m, find an assignment of the jobs to m identical machines so that the completion time is minimized. [8] proved that the MMS problem is NP-hard when m ≥ 2.

Theorem 1: The time-minimized graph partitioning problem is NP-hard when the partition cardinality k ≥ 2.

Proof: Suppose that the time-minimized graph partitioning problem can be solved in polynomial time by some solution S. We use S to partition a graph with |V| = n into m parts. Let fs(tcomp_i, tcomm_i) = tcomp_i, and let the computing times of the n vertexes equal the n job processing times of the Minimum Makespan Scheduling problem. The partition result of the time-minimized graph partitioning problem is then the result of the Minimum Makespan Scheduling problem with n jobs and m machines. Thus, we would obtain a polynomial-time solution S for the Minimum Makespan Scheduling problem.

To solve time-minimized graph partitioning, as mentioned above, we prefer a one-pass streaming approach over other approaches for its fast processing speed and acceptable results [27], [29]. We use a streaming loader to read vertexes in the Data Graph and serialize them in a certain order. The order of vertexes generated by the streaming loader can vary in several ways, e.g., random, BFS/DFS from a start vertex, or adversarial. [27] showed that the BFS and DFS orders produce nearly the same result as the random order, while the random order avoids adversarial orderings. Thus, in this work, we use the random order as the streaming loader's output order.

The loader then sends vertexes, along with the sets of their neighbors, to the partition program, which executes streaming graph partitioning as shown in Algorithm 1. The heuristic function determines the assignment of each incoming vertex based on the current partition state, the vertex information, and the Physical Graph.
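The loop of Algorithm 1 can be sketched as follows; this is a hypothetical skeleton, not the HeAPS implementation, with the heuristic passed in as a callback:

```python
def stream_partition(stream, k, heuristic, physical_graph):
    """One-pass streaming partitioning (Algorithm 1, sketched).

    stream yields (vertex, neighbors) pairs in loader order; heuristic
    picks a partition index from the current partition state, the
    adjacency seen so far, the vertex, and the physical graph.
    """
    parts = [set() for _ in range(k)]  # P_k = {V_1, ..., V_k}
    adjacency = {}                     # neighbor sets seen so far
    for v, neighbors in stream:
        adjacency[v] = set(neighbors)
        index = heuristic(parts, adjacency, v, physical_graph)
        parts[index].add(v)            # insert v into V_index
    return parts
```

Any of the heuristics of this section can be plugged in as the callback; a trivial example is a heuristic that always picks the currently smallest part.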

In fact, the heuristic function determines the quality of the partitioning. Before we present the heuristics, we

Algorithm 1 Framework of One-pass Stream Partitioning

1: Input: Data Graph G, Physical Graph GP
2: Let Pk = {V1, V2, ..., Vk} be the current partition result sets
3: S = streamingloader(G)
4: for each vertex v in S do
5:     index = HeuristicFunc(Pk, v, GP)
6:     Insert vertex v into V_index
7: end for
8: Output: Partitioning result Pk

first discuss how to estimate tcomp_i and tcomm_i in our stream model. In the BSP model, the computing workload is caused by the UDF that vertexes execute in each super-step. The communication workload is caused by sending/receiving messages to/from other vertexes in order to perform the distributed computation. Generally, a vertex generates local messages to exchange with vertexes in the same partition and remote messages to exchange with vertexes in other partitions. In fact, producing and processing local messages is far faster than for non-local messages, so we ignore the time cost of dealing with local messages. In the streaming model, we use Pk(v) = {V1(v), V2(v), ..., Vk(v)} to denote the partition set, and Vi(v) to denote the set of vertexes assigned to PVi when vertex v is at the head of the stream. cost_comp_i(v) and cost_comm_i(v) denote the estimated computing time and communication time of Vi(v) when vertex v arrives, respectively. Let w(vertex) be the average computing workload per vertex and w(edge) the average communication workload per edge. Then cost_comp_i(v) and cost_comm_i(v) can be calculated by:

cost_comp_i(v) = w(vertex) × |Vi(v)| / Ci    (13)

cost_comm_i(v) = Σ_{j=1, j≠i}^{k} |Edge(Vi(v), Vj(v))| × w(edge) / L(i, j)    (14)

The parameters w(edge) and w(vertex) are workload-dependent; we discuss them further in Section 4.4. When vertex v is assigned to a partition, we calculate the incremental computing and communication cost generated by v. We use Δcost_comp_i(v) and Δcost_comm_i(v) to denote the incremental computing time cost and communication time cost, respectively. Similarly, they can be calculated by:

Δcost_comp_i(v) = w(vertex) / Ci    (15)


Δcost_comm_i(v) = Σ_{j=1, j≠i}^{k} |Edge(v, Vj(v))| × w(edge) / L(i, j)    (16)

Finally, when all the vertexes in the stream have been assigned, tcomp_i and tcomm_i can be estimated by cost_comp_i(vn) and cost_comm_i(vn), where vn is the last vertex in the stream.
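Formulas (13), (14) and (16) can be sketched as simple cost estimators; the dict-based bookkeeping and the helper names here are illustrative assumptions:

```python
def comp_cost(part_size, w_vertex, C_i):
    """Formula (13): estimated computing time of partition i."""
    return w_vertex * part_size / C_i


def comm_cost(cross_edges, w_edge, L, i):
    """Formula (14): estimated communication time of partition i.

    cross_edges: dict j -> |Edge(V_i, V_j)| for each j != i.
    L: dict (i, j) -> normalized link ability from formula (3).
    """
    return sum(n * w_edge / L[(i, j)] for j, n in cross_edges.items())


def delta_comm_cost(v_neighbors_in, w_edge, L, i):
    """Formula (16): incremental communication cost if v joins partition i.

    v_neighbors_in: dict j -> |Edge(v, V_j)| for each j != i.
    """
    return sum(n * w_edge / L[(i, j)] for j, n in v_neighbors_in.items())
```

Because the denominators Ci and L(i, j) are the normalized abilities, slower hardware inflates the estimated cost of assigning work there, which is exactly what the heuristics below exploit.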

Having analyzed these quantities, we now formally define the heuristics based on our time-minimized objective function. In our system, we use a blocking I/O model in which CPU and I/O operate serially; in this model, τ_i = tcomp_i + tcomm_i. Notice that cost_comp_i(v) and cost_comm_i(v) are both estimated running times, so we can directly sum the two time values.

1) Min-Workload (MW) - The most straightforward heuristic is to assign v to the partition with the minimal workload, breaking ties randomly. Min-Workload focuses on workload balance but gives little weight to the additional cost incurred by each vertex. As a result, it may be balanced yet incur a huge communication cost.

index = argmin_i(cost_comp_i(v) + cost_comm_i(v))    (17)

2) Min-Increased (MI) - Assign v to the partition where the incremental workload is minimized. This heuristic explores the idea that each vertex should add the smallest possible cost. In contrast to Min-Workload, Min-Increased minimizes the incremental workload of each vertex while ignoring balance.

index = argmin_i(Δcost_comp_i(v) + Δcost_comm_i(v))    (18)

3) Balanced Min-Increased (BMI) - Assign v to the partition where the incremental workload is minimized, and refine the score by a penalty based on the total workload of that partition. This heuristic tends to find a partition that not only has a lower current cost but, more importantly, incurs a smaller additional cost for the vertex. We use a penalty function to balance the workload. [13] uses Penalty(x) = x to solve a similar problem; however, this simple function leans heavily toward balancing the workload. In this paper, we extend it to a family of functions, Penalty(x) = x^λ. In particular, when λ = 0, Penalty(x) is constant, so Balanced Min-Increased degenerates to Min-Increased; as λ grows, Balanced Min-Increased moves closer to Min-Workload. The choice of λ is discussed in Section 5.2.

index = argmin_i(Δcost_comp_i(v) + Δcost_comm_i(v) + Penalty(cost_comp_i(v) + cost_comm_i(v)))    (19)

4) Combined (CB) - Divide the vertexes into high-degree and low-degree sets, then assign low-degree vertexes using Min-Workload and high-degree vertexes using Balanced Min-Increased. The choice of the division threshold is discussed in Section 5.2.

5) Computing Balanced Hashing (CPH) - Assign v to partition i with probability based on the computing ability of PVi, so that the number of vertexes in PVi is proportional to Ci. This heuristic balances the workload based on computing ability.

P(i) = Ci / Σ_{j=1..k} Cj    (20)

6) Communication Balanced Hashing (CMH) - Assign v to partition i with probability based on the communication ability of PVi. It is similar to Computing Balanced Hashing, using communication ability as the metric.

P(i) = Σ_{j≠i} L(i, j) / Σ_{p≠q} L(p, q)    (21)
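Under the blocking I/O model used here, the first three heuristics reduce to simple argmin selections. A sketch, assuming per-partition cost arrays maintained as above and λ = 0.5 (the best setting reported in Section 5.2):

```python
def penalty(x, lam=0.5):
    """Penalty family Penalty(x) = x**lam from the BMI heuristic."""
    return x ** lam


def pick_mw(cost_comp, cost_comm):
    """Min-Workload, formula (17): pick the least-loaded partition."""
    k = len(cost_comp)
    return min(range(k), key=lambda i: cost_comp[i] + cost_comm[i])


def pick_mi(d_comp, d_comm):
    """Min-Increased, formula (18): pick the smallest incremental cost."""
    k = len(d_comp)
    return min(range(k), key=lambda i: d_comp[i] + d_comm[i])


def pick_bmi(cost_comp, cost_comm, d_comp, d_comm, lam=0.5):
    """Balanced Min-Increased, formula (19): incremental cost plus load penalty."""
    k = len(cost_comp)
    return min(range(k), key=lambda i: d_comp[i] + d_comm[i]
               + penalty(cost_comp[i] + cost_comm[i], lam))
```

Note that `min` with a `key` breaks ties by taking the first index; a faithful MW implementation would break ties randomly as the text specifies.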

In fact, we use rules 5 and 6 as baselines that assign workload based solely on computing ability and communication ability, respectively, and we compare our other heuristics against Computing Balanced Hashing and Communication Balanced Hashing. Besides, we make a comparison with other state-of-the-art streaming and multi-level algorithms. The graph partitioning algorithms are summarized in Table 1.

Algorithm | Abbr. | Type | Heterogeneity Aware
Hashing | H | Hashing | No
Computing Balanced Hashing | CPH | Hashing | Computing
Communication Balanced Hashing | CMH | Hashing | Communication
Bandwidth aware METIS | BAM | Multi-Level | Communication
METIS (Non-Uniform Target Partition Weights) | M | Multi-Level | Computing
Linear Deterministic Greedy | LDG | Streaming | No
Min-Workload | MW | Streaming | Yes
Min-Increased | MI | Streaming | Yes
Balanced Min-Increased | BMI | Streaming | Yes
Combined | CB | Streaming | Yes

TABLE 1: Graph Partitioning Algorithms

It would be valuable to analyze each heuristic with a mathematical lower bound. Unfortunately, with a random stream order, finding the optimal lower bound for a one-pass streaming partitioning algorithm is NP-hard [27], and [26] proved that no algorithm can obtain an o(n) approximation under a random or adversarial stream ordering. Thus, in this paper, we propose several heuristics and report their performance in the later experimental study.

4.4 Workload Based Parameters

In this subsection, we discuss how to calculate w(edge) and w(vertex) based on the workload. w(vertex) is the amount of computing work for one vertex; w(edge) is the amount of communication work for one edge. We introduce these two parameters to normalize the computing cost and communication cost, respectively, and they are used only by our heuristics. With this normalization we can derive the total cost by adding up the computing cost and communication cost in a fair and unbiased way.

To compute w(edge) and w(vertex), we introduce the following workload-specific metrics.
• Asymptotic Time Complexity is an estimation of the volume of computing work for one vertex. Suppose the average number of neighbors of a vertex in the data graph is n; then this metric can be denoted as O(F(n)).
• Asymptotic Communication Complexity is an estimation of the volume of communication work for one vertex. We use O(T(n)) to denote this metric, where n is the average number of neighbors of a vertex in the data graph.

O(F(n)) uses the same unit as the computing ability Ci, and O(T(n)) uses the same unit as the communication ability L(i, j). These two metrics serve as prior knowledge.

Taking the PageRank workload as an example, we use a floating-point operation as the computing unit and 64 bits as the communication unit. The computational complexity for a vertex with n neighbors is F(n) = O(n), and the communication complexity is T(n) = 1, where n is the number of neighbors of the vertex. These values, F(n) and T(n), can easily be calculated from the stream information. Then we get:

w(vertex) = n,   w(edge) = 1    (22)

Using these two parameters, we can estimate the processing cost of one vertex more accurately.
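For instance, under the PageRank parameters of formula (22), a rough per-vertex time estimate on node i could be computed as follows; the helper and its arguments are illustrative only:

```python
def vertex_time_estimate(n_neighbors, remote_edges, C_i, L_avg):
    """Sketch of a per-vertex time estimate under PageRank parameters.

    w(vertex) = n_neighbors floating-point units and w(edge) = 1 message
    unit per formula (22); remote_edges of the edges cross partitions.
    C_i and L_avg are normalized abilities from formulas (2) and (3).
    """
    w_vertex, w_edge = n_neighbors, 1        # formula (22)
    t_comp = w_vertex / C_i                  # per formula (13), one vertex
    t_comm = remote_edges * w_edge / L_avg   # per formula (14), one vertex
    return t_comp + t_comm                   # blocking I/O model, formula (8)
```

This makes explicit why a high-degree vertex on a slow node (small C_i) dominates the estimate, which the heterogeneity-aware heuristics are designed to avoid.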

5 EVALUATION

To evaluate the performance of our heterogeneity-aware graph partitioning approach, we developed a prototype system, the Heterogeneous environment Aware graph Processing System (HeAPS). It follows Google's Pregel design [18] and adopts a blocking I/O model, in which CPU and I/O operate serially. We implemented the streaming graph partitioning algorithms discussed above, as well as other state-of-the-art graph partitioning algorithms for comparison.

We conducted rigorous experiments to evaluate the heterogeneous partitioning approach proposed in this paper. We first present the experimental setup, which includes the dataset description, the heterogeneous environment design, and the chosen evaluation metrics. Then we report the partitioning performance under different environment, dataset, and workload settings. Finally, we discuss the effect of the parameters used in our heuristic rules.

5.1 Experimental Setup

Datasets: We used five real-world graphs: Live-Journal, Amazon0312, Twitter, Web-Stanford and Web-Google, and two synthetic graphs. One synthetic graph is generated following the Erdos-Renyi random graph model (Synthetic), and the other is a random graph containing several different Erdos-Renyi components (C-Synthetic). All experiment datasets are shown in Table 2. The real-world datasets are publicly available on the web [4]. We transformed them into undirected graphs by adding reciprocal edges and eliminating self-loops from the original release.

Dataset | Vertices | Edges | Type
Amazon0312 | 400,727 | 2,349,869 | Web
Wiki-Talk | 2,388,953 | 4,656,682 | Web
Web-Google | 875,713 | 8,644,106 | Web
Live-Journal | 4,843,953 | 42,845,684 | Social
Twitter | 41,652,230 | 1,468,365,182 | Social
Synthetic | 5,000,000 | 100,000,000 | Synthetic
C-Synthetic | 50,000,000 | 2,000,000,000 | Synthetic

TABLE 2: Graph Dataset Statistics

Heterogeneous Clusters: We conducted our experiments on two representative clusters and tuned some settings to reflect a rich spectrum of heterogeneous environments.

The first cluster (ClusterA) is a homogeneous one with 28 nodes. Each node has an AMD Opteron 4180 CPU, 48 GB memory, a 1000 Mbit network adapter and a 10 TB disk RAID. The CPUs have selectable frequencies of 0.8 GHz, 1.8 GHz, and 2.6 GHz. All nodes are connected by 1 Gbps routers. We manually changed the hardware configuration to simulate hardware heterogeneity. This reflects scenarios in which organizations upgrade their hardware, or some machines become the bottleneck of the cluster.

For this cluster, we simulated the imbalance of communication and computing ability as follows.
• Communication ability (network bandwidth): We simulated different communication abilities with different network bandwidths. Inspired by virtual machine design, we simulated network bandwidth using a wrapped virtual network adapter (VNA). The VNA links the real network adapter and a HeAPS client, so that clients connect to each other via VNAs that can be set to different bandwidths. Thus, we can simulate the network bandwidth in a controlled manner.
• Computing ability: To simulate different CPU hardware, we set the CPUs to different frequencies. In our experiments, we used three frequency levels: 2.6 GHz, 1.8 GHz and 0.8 GHz.

The other cluster (ClusterB) is the heterogeneous cluster shown in Figure 2. We use this cluster to represent a naturally heterogeneous environment.

Heterogeneous Topology: We used seven simulated cluster topologies (on ClusterA) and one real topology (ClusterB) in our experiments. The topologies and their symbols are listed below.


Topo. | CPU (Number) | Network (Number)
T0 | 1.8GHz (28) | 500Mbps (28)
T1 | 1.8GHz (28) | 500Mbps (14), 1Gbps (14)
T2 | 1.8GHz (28) | 100Mbps (5), 500Mbps (23)
T3 | 1.8GHz (14), 2.6GHz (14) | 500Mbps (28)
T4 | 0.8GHz (5), 1.8GHz (23) | 500Mbps (28)
T5 | 1.8GHz (14), 2.6GHz (14) | 500Mbps (14), 1Gbps (14)
T6 | 0.8GHz (5), 1.8GHz (23) | 100Mbps (5), 500Mbps (23)

TABLE 3: Topology Statistics

As shown in Table 3, T0 is treated as the homogeneous environment.
T1: All CPU frequencies are set to 1.8 GHz. Half of the VNAs are tuned to 500 Mbps, and the other half to 1000 Mbps. This simulates an improvement of network performance, for example upgrading the network adapter bandwidth or the router's bandwidth.
T2: All CPU frequencies are set to 1.8 GHz. Five of the VNAs are set to 100 Mbps, and the others to 500 Mbps. This simulates some of the nodes becoming a network bottleneck.
T3: Half of the CPU frequencies are set to 2.6 GHz, and the others to 1.8 GHz. All VNAs are at 500 Mbps. This simulates a CPU upgrade.
T4: Five of the CPU frequencies are set to 0.8 GHz, and the others to 1.8 GHz. All VNAs are at 500 Mbps. This simulates some machines having lower computing ability than others; we would like to test whether a computing bottleneck affects the performance of graph partitioning algorithms.
T5: Half of the machines use 2.6 GHz CPUs with VNAs at 1000 Mbps; the others use 1.8 GHz CPUs with VNAs at 500 Mbps. This simulates upgrading hardware with a more powerful CPU and network.
T6: Five of the machines use 0.8 GHz CPUs with VNAs at 100 Mbps; the others use 1.8 GHz CPUs with VNAs at 500 Mbps. This simulates a cluster in which a few machines with lower performance become the bottleneck.
Treal: We used the original network configuration and CPU frequencies shown in Figure 2.

For each topology, we measured the metrics mentioned in Section 4.1 and generated the Physical Graph to model the heterogeneous environment.

Evaluation Metrics: To systematically evaluate partitioning and computing performance in heterogeneous environments, we chose a set of evaluation metrics to measure the performance of each algorithm.

1) Edge cut ratio (ECR) indicates the percentage of cut edges between partitions in a graph, defined as ECR = ec/|E|, where ec denotes the number of cut edges between partitions and |E| denotes the total number of edges in the graph. In fact, in a heterogeneous environment ECR cannot indicate the application's running time; to enable a better evaluation, we introduce the following additional metrics.
2) Job execution time (JET) measures the elapsed time from submitting the graph processing job to HeAPS until completion of the job. We only recorded the super-step execution time, excluding the loading time.
3) Standard deviation of running time (SDT). We measured the sum of the running times of each super-step for each machine, then calculated the standard deviation to evaluate whether the partitioning result is balanced.
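The first and third metrics are straightforward to compute. A sketch follows, with ECR per its definition above and SDT taken as the population standard deviation, which is an assumption since the paper does not specify the variant:

```python
import statistics


def edge_cut_ratio(edges, assignment):
    """ECR = ec / |E|: fraction of edges whose endpoints lie in different partitions."""
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    return cut / len(edges)


def sdt(per_machine_times):
    """SDT: standard deviation of per-machine total super-step running time."""
    return statistics.pstdev(per_machine_times)
```

A perfectly balanced run gives SDT = 0, and a lower ECR means less cross-partition message traffic, though as noted above only JET captures the heterogeneity effects directly.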

Comparative Methods: In this experiment, we selected several state-of-the-art partitioning methods as baselines to demonstrate the advantages of our proposed method. Table 1 lists the partitioning algorithms used in the following experiments. The homogeneous algorithms chosen here include Hashing and the best streaming heuristic, the Linear Deterministic Greedy approach [27]. Besides, we also compared our streaming algorithms with the widely used multi-level approach METIS [14] and Heterogeneous Bandwidth aware METIS [10]. For METIS, we enabled the non-uniform target partition weights option; the weight of each partition used in METIS is generated based on the computing ability of PVi.

5.2 Partitioning Performance

We now present the partitioning performance of the different graph partitioning algorithms in different environments. We ran the experiments on all the graph datasets shown in Table 2 with several workloads. Due to space limits, we mainly present the results on some of the datasets; the results on the other datasets show similar performance. All experiments were conducted five times, and average results are reported.

We partitioned the input graph into 28 partitions for ClusterA and 26 partitions for ClusterB. The topologies can be divided into four types of environments: homogeneous, I/O heterogeneous, CPU heterogeneous, and I/O & CPU heterogeneous.

Homogeneous Environment: As shown in Figure 6(a), we tested these algorithms on the Live-Journal dataset in the homogeneous environment T0, running a 10-iteration PageRank workload. On ECR, our heuristics are close to the best competitor, the homogeneous streaming graph partitioning algorithm LDG. On job execution time, shown in Figure 6(b), the heterogeneity-aware streaming approaches Balanced Min-Increased and Combined outperform LDG. The reduction comes from the workload-balancing effect of our new objective function on each node. Multi-level algorithms outperform the streaming ones thanks to a low edge cut ratio. As we can see, some of our approaches have a higher ECR yet a smaller job execution time. The reason is that the job execution time is determined by the slowest node in the system. Traditional graph partitioning

Fig. 6: Partitioning Performance in Homogeneous Environment on Live-Journal dataset (panels: (a) Edge Cut Ratio, (b) Execution Time, (c) Standard Deviation, (d) Maximum JET Divided by the Average JET)

methods, e.g., METIS and LDG, aim to minimize the edge cut ratio, while our proposed objective function aims to minimize the execution time. The standard deviation of JET and the maximum JET divided by the average JET, shown in Figures 6(c) and (d), show that the Balanced Min-Increased, Combined and Min-Workload algorithms obtain more balanced running results than the other algorithms.

I/O Heterogeneous Environment: We evaluated these algorithms on topologies T1 and T2. In Figure 7, we show the job execution time in topologies T1, T2 and T0 on the Twitter dataset. Our heterogeneous heuristics significantly outperform the homogeneous streaming algorithms.

Comparing the execution time in topology T1 with that in the original homogeneous environment (T0), the Combined, Min-Workload and Min-Increased heuristics reduce the execution time by about 27%, roughly 7 times better than LDG (3.9%) and METIS (3%). Bandwidth aware METIS reduces the execution time by 10% as well. Our best heuristics are close to Bandwidth aware METIS while achieving 48% better time efficiency than LDG. This is because when some nodes have faster communication speed, our approach can adaptively assign more communication workload to those nodes and reduce the burden on the slower ones, so the total running time is reduced significantly. The short execution time and low variance demonstrate the superiority of the proposed methods.

In topology T2, compared to T0, there is a communication-ability drop on 5 nodes. The homogeneous algorithms suffer a significant performance drop compared with T0, ranging from 95% to 125%, while the heterogeneous algorithms only increase the running time by about 20%. The Balanced Min-Increased and Combined heuristics outperform METIS and LDG by about 45% and 30%, respectively. This is because, under the homogeneous algorithms, the 5 nodes with lower network bandwidth run the same communication workload as the other nodes, so the running time on the slower nodes increases remarkably. In contrast, the heterogeneous algorithms balance the workload based on communication ability, so the nodes with lower performance are assigned less workload and can finish it faster.

Fig. 7: Partition Performance in I/O Heterogeneous Environment (Job Execution Time in T1, T0 and T2)

Fig. 8: Partitioning Performance in CPU Heterogeneous Environment (Job Execution Time in T3, T0 and T4)

CPU Heterogeneous Environment: Topologies T3 and T4 belong to this type. Figure 8 shows the JET for T3, T4 and T0 on the Synthetic dataset, running a 10-iteration PageRank workload. In topology T3, with a CPU-ability upgrade, Bandwidth aware METIS gains a running-time improvement of only 3%, since it is not aware of computing-ability changes. In contrast, almost all of our heuristics achieve over 40% improvement in JET compared to T0. Besides, the Combined heuristic outperforms Bandwidth aware METIS, which also takes longer to partition the graph. In topology T4, both the Combined and Balanced Min-Increased heuristics outperform Bandwidth aware METIS due to a huge running-time increase for the latter. METIS uses about 49% more time to run the workload compared to the time used in T0, and the other homogeneous algorithms generally suffer about 30% degradation in JET as well.

I/O & CPU Heterogeneous Environment: We ran this experiment in topologies T5 and T6 on the Live-Journal dataset with a 10-iteration PageRank workload, which considers both I/O and CPU heterogeneity. Results in Figure 9 demonstrate that our best heuristics, Combined and Balanced Min-Increased, take less time than Bandwidth aware METIS in both topologies T5 and T6, with a significant improvement over LDG/METIS as well. Figures 9(b) and (c) show the standard deviation across physical nodes, which indicates the imbalance of workload. As we can observe, the homogeneous approaches and Bandwidth aware METIS are highly imbalanced. This is because, in such a CPU & I/O heterogeneous environment, homogeneous approaches still partition the workload evenly across nodes, and Bandwidth aware METIS can only account for I/O heterogeneity. In fact, as shown in Figure 9(a), the unbalanced workload significantly affects the overall running time. In contrast, our approaches better exploit the high-performance nodes by assigning more vertexes to them, making the workload of each partition more balanced. Figures 9(d) and (e) further show the maximum JET divided by the average JET in topologies T5 and T6, which confirms that our approaches balance the JET across nodes.

Fig. 9: Partitioning Performance in CPU & I/O Heterogeneous Environment (panels: (a) Job Execution Time in T5 and T6, (b) Standard Deviation of T5, (c) Standard Deviation of T6, (d) Maximum JET Divided by the Average JET of T5, (e) Maximum JET Divided by the Average JET of T6)

Natural Heterogeneity: We further evaluated the performance of our proposed heterogeneous partitioning approach on the real cluster Treal with 26 nodes. We measured the physical graph on this cluster and ran a 10-iteration PageRank workload on the Live-Journal dataset. As shown in Figure 10, the best algorithm is Combined, followed by Balanced Min-Increased and Bandwidth aware METIS. LDG takes about twice as long as Combined. This experiment confirms that our approach is well suited to naturally heterogeneous environments, which are especially common in modern cloud infrastructures.

From the above three experiments, we can conclude that our proposed partitioning algorithm is robust and efficient across different situations and brings impressive benefits when the computing environment changes from homogeneous to heterogeneous. Besides, the Combined and Balanced Min-Increased heuristics show significant improvement over CPH in the CPU heterogeneous environment and over CMH in the I/O heterogeneous environment. This indicates that our heuristics not only balance the workload but also reduce the communication cost so as to minimize the total running time of the processing job.

Fig. 10: Job Execution time in Treal

Other Graph Datasets and Workloads: Besides the Live-Journal, Twitter and Synthetic datasets, we also conducted a series of experiments on the Amazon0312, Wiki-Talk, Web-Google and C-Synthetic datasets. We partitioned each dataset into 26 partitions on topology Treal. We ran a communication-intensive workload, Two-hop Friend List, on Amazon0312 and Web-Google, and a computing-intensive workload, statistical inference, on Wiki-Talk and C-Synthetic. Here we compare our heuristics with the results of METIS, LDG and Fennel [29]; the parameter γ used in Fennel is 3/2. We summarize the execution times in Figure 11 and Table 4.

From Figure 11(a), we observe that the differencesof running time on Wiki-Talk are not significant. It in-dicates that there is no obvious computing bottleneckin Treal. However, except Min-Increased, our heuristicsstill outperform METIS, LDG and even Bandwidthaware METIS. The result of Fennel is slower than LDGon Wiki-Talk, while it is faster than LDG for otherdatasets. For the communication-intensive workloadrunning on Amazon0312 in Figure 11(b), Bandwidthaware METIS shows its advantages. While our bestapproaches can still execute faster than LDG, Fenneland METIS. While running communication-intensiveworkload on Web-Google, CB is almost the sameas Bandwidth aware METIS. For the largest graphused in our experiment, C-Synthetic, our heuristicCB and Bandwidth aware METIS show much bet-ter performance than others. Similar to the resultsdemonstrated in Live-Journal dataset, results here also


reveal the generality of the proposed approach and its wide applicability for partitioning and data arrangement.

Fig. 11: Execution Performance on Different Graphs and Workloads (bar charts of Job Execution Time (s) for CB, MI, MW, BMI, BAM, M, LDG, FE): (a) Wiki-Talk, (b) Amazon0312, (c) Web-Google, (d) C-Synthetic

TABLE 4: Execution Time on Different Datasets (secs)

Graph    CB      MI       MW       BMI      BAM     M       LDG     FE
Wiki     212.0   250.4    200.3    210.0    240.4   226.7   275.6   255.1
Ama.     46.3    137.4    91.0     52.3     41.1    68.5    103.5   97.2
Google   149.2   439.68   285.74   162.59   147.4   232.9   414.0   323.5
C-Syn.   19083   27505    24130    18690    25681   26386   30041   28241

Parameters in Heuristics: We further investigated the parameters used in the Balanced Min-Increased and Combined heuristics.

1. Penalty Function for Balanced Min-Increased: We used a family of functions, Penalty(x) = x^λ, as the penalty function, and conducted a series of experiments to investigate the selection of λ. As shown in Figure 12(b), the best running time is achieved when λ is around 0.5. This is a reasonable factor: when λ ≥ 0.75, Balanced Min-Increased may focus too much on balance and enlarge the total cost, while when λ ≤ 0.25, Balanced Min-Increased degenerates to Min-Increased, which is unbalanced.
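The effect of λ can be illustrated with a small sketch. We assume here, purely for illustration, a multiplicative combination of the increased cost with the load penalty; the names `pick_partition`, `increased_costs` and `loads` are ours, and the actual scoring form in the heuristic may differ.

```python
def penalty(x, lam=0.5):
    """Penalty(x) = x**lam: grows with partition load; lam tunes the balance emphasis."""
    return x ** lam

def pick_partition(increased_costs, loads, lam=0.5):
    """Choose the partition minimizing increased cost weighted by a load penalty.

    increased_costs[i]: estimated cost increase of placing the vertex on partition i
    loads[i]: current (normalized) workload of partition i
    """
    scores = [c * penalty(1.0 + l, lam) for c, l in zip(increased_costs, loads)]
    return scores.index(min(scores))
```

With λ = 0 the penalty is constant and the rule reduces to pure Min-Increased; larger λ pushes vertices toward lightly loaded partitions even at a higher communication cost, mirroring the trade-off described above.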

2. Degree Threshold for Combined: To explore the selection of the threshold, we sorted the vertexes in the data graph in non-increasing order of degree and chose the high-degree threshold based on the percentage of vertexes in the sorted list. Figure 12(a) shows the running time of the job when different thresholds are chosen under several experiment settings. The best result assigns 30% of the vertexes using the Min-Workload heuristic and the other 70% using Balanced Min-Increased. This is reasonable because we prefer to assign more vertexes with the Min-Increased heuristic to reduce the total cost; however, Min-Increased cannot guarantee high balance, so we use the Min-Workload heuristic to refine the result. This experiment indicates that routing the 30% lowest-degree vertexes through Min-Workload balances the workload well.
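The degree-based routing of the Combined heuristic can be sketched as follows. The helpers `balanced_min_increased` and `min_workload` stand in for the paper's two heuristics and are assumed placeholders; the threshold derivation from a degree percentile follows the description above.

```python
def degree_threshold(degrees, high_fraction=0.7):
    """Degree value separating the top `high_fraction` of vertices by degree."""
    ranked = sorted(degrees.values(), reverse=True)
    cut = int(len(ranked) * high_fraction)
    return ranked[cut - 1] if cut > 0 else float("inf")

def combined_assign(stream, degrees, balanced_min_increased, min_workload,
                    high_fraction=0.7):
    """Route each arriving vertex to a heuristic based on its degree.

    stream: vertex ids in arrival order
    degrees: dict mapping vertex -> degree (known or estimated in advance)
    """
    thr = degree_threshold(degrees, high_fraction)
    assignment = {}
    for v in stream:
        if degrees[v] >= thr:              # high degree: reduce communication cost
            assignment[v] = balanced_min_increased(v)
        else:                              # low degree: balance the workload
            assignment[v] = min_workload(v)
    return assignment
```

High-degree vertices dominate the cut cost, so they are placed cost-sensitively, while the numerous low-degree vertices serve as cheap "filler" to even out partition loads.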

Fig. 12: Heuristics Parameters Tuning. (a) Vertex Degree Threshold Tuning: Job Execution Time (s) vs. threshold percentage (0-100%). (b) λ Tuning: Job Execution Time (s) vs. λ (0.00-2.00). Series in both panels: Treal #PageRank @LiveJournal, Treal #Two-hop @Amazon, T1 #PageRank @LiveJournal, Treal #statistical inference @Wiki-Talk.

6 CONCLUSION

In this paper, we systematically investigated and modelled the characteristics of heterogeneous computing environments in current parallel graph processing systems. We then proposed a novel graph partitioning objective function to tackle the challenges of complex heterogeneous computing environments. Based on this objective, we designed several streaming-based heuristics which significantly improve partition quality and reduce job execution time. Our prototype and experimental results verify the superiority of the new approaches.

There are several promising directions for future work. First, a more thorough theoretical analysis of the heterogeneous partitioning scenario would be valuable. Second, a new ordering of the streaming process is a potential way to improve partition quality. Finally, we can dive into real workloads and extract patterns to better guide and improve parallel graph processing systems.

7 ACKNOWLEDGEMENT

The research is supported by NSFC under Grant No. 61272155. Lei Chen is supported by the Hong Kong RGC Project N HKUST637/13, 973 Program of China under Grant 2014CB340303, Microsoft Research Asia Gift Grant and Google Faculty Award 2013. Zi Huang is supported by ARC FT130101530.

REFERENCES

[1] Amazon EC2. http://aws.amazon.com/cn/ec2/.
[2] Apache Giraph. https://github.com/apache/giraph/.
[3] Graph Archive Dataset. http://staffweb.cms.gre.ac.uk/∼wc06/partition/.
[4] Snap Dataset. http://snap.stanford.edu/data/index.html.
[5] Konstantin Andreev and Harald Racke. Balanced graph partitioning. Theor. Comp. Sys., 39(6):929–939, November 2006.
[6] Gerard Biau and Kevin Bleakley. Statistical inference on graphs. Statistics & Decisions, 24(2):209–232, 2006.


[7] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.
[8] J. Bruno, E. G. Coffman, Jr., and R. Sethi. Scheduling independent tasks to reduce mean finishing time. Commun. ACM, pages 382–387, 1974.
[9] Jinchuan Chen, Yueguo Chen, Xiaoyong Du, Cuiping Li, Jiaheng Lu, Suyun Zhao, and Xuan Zhou. Big data challenge: a data management perspective. Frontiers of Computer Science, 7(2):157–164, 2013.
[10] Rishan Chen, Mao Yang, Xuetian Weng, Byron Choi, Bingsheng He, and Xiaoming Li. Improving large graph processing on partitioned graphs in the cloud. In Proc. of SOCC, pages 1–13, 2012.
[11] Bin Cui, Hong Mei, and Beng Chin Ooi. Big data: the driver for innovation in databases. National Science Review, 1(1):27–30, 2014.
[12] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
[13] Oscar H. Ibarra and Chul E. Kim. Heuristic algorithms for scheduling independent tasks on nonidentical processors. Journal of the ACM (JACM), 24(2):280–289, 1977.
[14] George Karypis and Vipin Kumar. Multilevel graph partitioning schemes. In Proc. 24th Intern. Conf. Par. Proc., III, pages 113–122, 1995.
[15] George Karypis and Vipin Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. In Supercomputing, 1996.
[16] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc. of VLDB Endow., 5(8):716–727, April 2012.
[17] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete problems. In Proc. of STOC, pages 47–63, 1974.
[18] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proc. of SIGMOD, pages 135–146, 2010.
[19] Joel Nishimura and Johan Ugander. Restreaming graph partitioning: Simple versatile algorithms for advanced balancing. In Proc. of KDD, pages 1106–1114, 2013.
[20] Francois Pellegrini and Jean Roman. Experimental analysis of the dual recursive bipartitioning algorithm for static mapping. TR 1038-96, LaBRI, URA CNRS 1304, Univ. Bordeaux I, 1996.
[21] Francois Pellegrini and Jean Roman. Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In High-Performance Computing and Networking, pages 493–498. Springer, 1996.
[22] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic: measurements & analysis. In Proc. of IMC, pages 202–208, 2009.
[23] Kirk Schloegel, George Karypis, and Vipin Kumar. Parallel static and dynamic multi-constraint graph partitioning. Concurrency and Computation: Practice and Experience, 14(3):219–240, 2002.
[24] Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, and Ning Xu. Parallel subgraph listing in a large-scale graph. In Proc. of SIGMOD, 2014.
[25] Yingxia Shao, Junjie Yao, Bin Cui, and Lin Ma. PAGE: A partition aware graph computation engine. In Proc. of CIKM, 2013.
[26] Isabelle Stanton. Streaming balanced graph partitioning algorithms for random graphs. Technical report, UC Berkeley, 2012.
[27] Isabelle Stanton and Gabriel Kliot. Streaming graph partitioning for large distributed graphs. In Proc. of KDD, pages 1222–1230, 2012.
[28] Charalampos E. Tsourakakis. Streaming graph partitioning in the planted partition model. arXiv preprint arXiv:1406.7570, 2014.
[29] Charalampos E. Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. Fennel: Streaming graph partitioning for massive scale graphs. Technical report, Microsoft, 2012.
[30] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, August 1990.
[31] Guohui Wang and T. S. Eugene Ng. The impact of virtualization on network performance of Amazon EC2 data center. In Proc. of INFOCOM, pages 1–9, 2010.
[32] Ning Xu, Lei Chen, and Bin Cui. LogGP: A log-based dynamic graph partitioning method. In Proc. of VLDB, 2014.
[33] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In Proc. of OSDI, pages 29–42, 2008.

Ning Xu is a fourth-year PhD candidate in the School of EECS, Peking University. His research interests include large-scale graph partitioning and parallel computing frameworks.

Bin Cui is a Professor in the School of EECS at Peking University. His research interests include database performance issues, query and index techniques, multimedia databases, Web data management, and data mining. He has served on the Technical Program Committees of various international conferences including SIGMOD, VLDB and ICDE, and as Vice PC Chair of ICDE 2011, Demo Co-Chair for ICDE 2014, and Area Chair of VLDB 2014.

Lei Chen received his BS degree in Computer Science from Tianjin University, an MA degree in Computer Science from the Asian Institute of Technology, and a PhD degree in Computer Science from the University of Waterloo. His research interests include crowdsourcing on social networks, uncertain and probabilistic databases, Web data management, and multimedia and time series databases.

Zi Huang is an ARC Future Fellow in the School of ITEE, The University of Queensland. She received her BSc degree from the Department of Computer Science, Tsinghua University, China, and her PhD in Computer Science from the School of ITEE, The University of Queensland. Dr. Huang's research interests mainly include multimedia indexing and search, social data analysis, and knowledge discovery.

Yingxia Shao is a third-year PhD candidate in the School of EECS, Peking University. Topics he has been working on include large-scale graph analysis, parallel computing frameworks and scalable data processing.