
SAE: Toward Efficient Cloud Data Analysis Service for Large-Scale Social Networks

Yu Zhang, Xiaofei Liao, Member, IEEE, Hai Jin, Senior Member, IEEE, and Guang Tan, Member, IEEE

Abstract—Social network analysis is used to extract features of human communities and proves to be very instrumental in a variety of scientific domains. The dataset of a social network is often so large that a cloud data analysis service, in which the computation is performed on a parallel platform in the cloud, becomes a good choice for researchers not experienced in parallel programming. In the cloud, a primary challenge to efficient data analysis is the computation and communication skew (i.e., load imbalance) among computers caused by humanity's group behavior (e.g., bandwagon effect). Traditional load balancing techniques either require significant effort to re-balance loads on the nodes, or cannot well cope with stragglers. In this paper, we propose a general straggler-aware execution approach, SAE, to support the analysis service in the cloud. It offers a novel computational decomposition method that factors straggling feature extraction processes into more fine-grained sub-processes, which are then distributed over clusters of computers for parallel execution. Experimental results show that SAE can speed up the analysis by up to 1.77 times compared with state-of-the-art solutions.

Index Terms—Cloud service, Social network analysis, Computational skew, Communication skew, Computation decomposition


1 INTRODUCTION

SOCIAL network analysis is used to extract features, such as neighbors and ranking scores, from social network datasets, which help understand human societies. With the emergence and rapid development of social applications and models, such as disease modeling, marketing, recommender systems, search engines and propagation of influence in social networks, social network analysis is becoming an increasingly important service in the cloud. For example, k-NN [1], [2] is employed in proximity search, statistical classification, recommendation systems, internet marketing and so on. Another example is k-means [3], which is widely used in market segmentation, decision support and so on. Other algorithms include connected components [4], [5], Katz metric [6], [7], adsorption [8], PageRank [7], [9], [10], [11], [12], SSSP [13] and so on. These algorithms often need to repeat the same process round by round until the computation satisfies a convergence or stopping condition. In order to accelerate the execution, the data objects are distributed over clusters to achieve parallelism.

However, because of humanity's group behavior [14], [15], [16], [17], the key routine of social

• Yu Zhang, Xiaofei Liao and Hai Jin are with the Service Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China. E-mail: zhang [email protected]; {hjin, xfliao}@hust.edu.cn

• Guang Tan is with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China. E-mail: [email protected]

network analysis, namely the feature extraction process (FEP), suffers from serious computational and communication skew in the cloud. Specifically, some FEPs need much more computation and communication in each iteration than others. Take the widely used data set of the Twitter web graph [18] as an example: less than one percent of the vertices are adjacent to nearly half of all edges. It means that tasks hosting this small fraction of vertices may require many times more computation and communication than an average task does. Moreover, the data dependency graph of FEPs may be known only at execution time and may change dynamically. This not only makes it hard to evaluate each task's load, but also leaves some computers underutilized after the convergence of most features in early iterations. In the PageRank algorithm running on a Twitter web graph, for example, the majority of the vertices require only a single update to get their ranking scores, while about 20% of the vertices require more than 10 updates to converge. This implies that many computers may become idle in a few iterations, while others are left as stragglers burdened with heavy workloads.

Current load balancing solutions try to mitigate load skew either at the task level or at the worker level. At the task level, these solutions partition the data set according to profiled load cost [19], or use PowerGraph [20] for static graphs, which partitions the edges of each vertex to achieve balance among tasks. The former method is quite expensive, as it has to periodically profile the load cost of each data object. PowerGraph [20] can only statically partition computation for graphs with fixed dependencies and therefore cannot adaptively redistribute sub-processes over nodes to

IEEE Transactions on Cloud Computing (Volume: PP, Issue: 99), 23 March 2015.


Fig. 1. Skewed computation load distribution for a Twitter follower graph. (a) Degree distribution of the Twitter graph (ratio of nodes vs. degree). (b) Update counts distribution for PageRank (ratio of nodes vs. number of updates).

maximize the utilization of computation resources. At the worker level, the state-of-the-art solutions, namely persistence-based load balancers (PLB) [21] and retentive work stealing (RWS) [21], can dynamically balance load via task redistribution/stealing according to the load profiled from previous iterations. However, they cannot support the computational decomposition of straggling FEPs. Their task partitioning mainly considers evenness of data size, so the corresponding tasks may not be balanced in load. This may cause serious computational and communication skew during the execution of a program.

In practice, we observe that a straggling FEP is largely decomposable, because each feature is an aggregated result from individual data objects. As such, it can be factored into several sub-processes which perform calculation on the data objects in parallel. Based on this observation, we propose a general straggler-aware computational partition and distribution approach, named SAE, for social network analysis. It not only parallelizes the major part of straggling FEPs to accelerate the convergence of feature calculation, but also effectively uses the idle time of computers when available. Meanwhile, the remaining non-decomposable part of a straggling FEP is negligible, which minimizes the straggling effect.

We have implemented a programming model and a runtime system for SAE. Experimental results show that it can speed up social network analysis by a factor of 1.77 compared with PowerGraph. Besides, it also produces a speedup of 2.23 against PUC [19], which is a state-of-the-art task-level load balancing scheme. SAE achieves speedups of 2.69 and 2.38 against PLB [21] and RWS [21], respectively, which are two state-of-the-art worker-level load balancing schemes.

In summary, we make the following three contributions:

1) A general approach to supporting efficient social network analysis, using the fact that FEPs are largely decomposable. The approach includes a method to identify straggling FEPs, and techniques to factor an FEP into sub-processes and to adaptively distribute these sub-processes over computers.

2) A programming model along with an implementation of the runtime system, which efficiently supports such an approach.

3) Extensive experimental results showing the advantages of SAE over existing solutions.

The remainder of the paper is organized as follows: Section 2 provides our motivation. The details of our approach are presented in Section 3. Section 4 describes the implementation details of SAE, followed by an experimental evaluation in Section 5. Section 6 gives a brief review of related work. Finally, we conclude the paper in Section 7.

2 MOTIVATION

Social network analysis is used to analyze the behavior of human communities. However, because of humans' group behavior, some FEPs may need large amounts of computation and communication in each iteration, and may take many more iterations to converge than others. This may generate stragglers which slow down the analysis process.

Consider a Twitter follower graph [18], [22] containing 41.7 million vertices and 1.47 billion edges. Fig. 1(a) shows the degree distribution of the graph. It can be seen that less than one percent of the vertices in the graph are adjacent to nearly half of the edges. Also, most vertices have less than ten neighbors, and most of them choose to follow the same vertices because of the collective behavior of humanity. This implies that tasks hosting this small percentage of vertices will assume enormous computation load while others are left underloaded. With persistence-based load balancers [21] and retentive work stealing [21], which cannot support the computational decomposition of straggling FEPs, high load imbalance will be generated for the analysis program.

For some applications, the data dependency graph of FEPs may be known only at run time and may change dynamically. This not only makes it hard to evaluate each task's load, but also leaves some computers underutilized after the convergence of most


features in early iterations. Fig. 1(b) shows the distribution of update counts after a dynamic PageRank algorithm converges on the Twitter follower graph. From this figure, we can observe that the majority of the vertices require only a single update, while about 20% of the vertices require more than 10 updates to reach convergence. Again, this means that some computers will undertake much more workload than others. To tackle this challenge, the task-level load balancing approach based on profiled load cost [19] has to pay a significant overhead to periodically profile the load cost of each data object and to divide the whole data set in iterations. For PowerGraph [20], which statically factors computation tasks according to the edge distribution of the graph, this challenge also renders static task planning ineffective and leaves some computers underutilized, especially when most features have converged in early iterations.

3 OUR APPROACH

This section describes the main idea and details of our approach, followed by a quantitative performance analysis. The notation used is summarized in Table 1.

3.1 Main idea

Usually, a major part, called the decomposable part, of the computation task in an FEP is decomposable, because each feature can be seen as a total contribution from several data objects. Thus, each FEP can be factored into several sub-processes which calculate the value of each data object separately. This allows us to design a general approach that avoids the impact of load skew via a straggler-aware execution method. The remaining non-decomposable part of a straggling FEP is negligible and has little impact on the overall performance.

3.2 Computation decomposition and distribution

3.2.1 Decomposition scheme

Social network data analysis typically runs in iterations. The calculation of the ith feature in the kth iteration can be expressed as:

x_i(k) = F_i(L_i(k − 1)),  i = 1, . . . , n,    (1)

where F_i() is a user-given function to calculate the value of x_i(k) and n is the number of all features. To parallelize the decomposable part of a straggling FEP and speed up the convergence of the feature calculation, the straggling FEP can be factored into several sub-processes.
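As an illustration only, the iterative scheme of Equation (1) can be sketched in Python as below; the update function F and the dependency sets deps are application-supplied placeholders, and the threshold-based stopping test is an assumption for this sketch, not part of SAE's interface.

    # Minimal sketch of the iterative model in Equation (1).
    # F(i, values) computes x_i(k) from the values of the features it depends on;
    # deps[i] lists those features (the set L_i). Both are hypothetical inputs.
    def iterate(x, deps, F, max_iters=100, eps=1e-6):
        """x: dict feature_id -> value; deps: dict feature_id -> list of feature_ids."""
        for _ in range(max_iters):
            new_x = {i: F(i, [x[j] for j in deps[i]]) for i in x}
            # simple convergence/stopping condition assumed for illustration
            if max(abs(new_x[i] - x[i]) for i in x) < eps:
                return new_x
            x = new_x
        return x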

To ensure the correctness of this decomposition, the decomposable part should satisfy the accumulative property and the independence property. The former property means that the results of this part can be calculated based on the results of its sub-processes; the

Fig. 2. Illustration of a straggling feature process: Extra() sub-processes run on virtual workers, and their results are gathered by Acc() and G_i() on a virtual master.

latter property means that the sub-processes are independent of each other and can be executed in any order. To ensure these two properties, we factor the calculation of feature x_i(k) as follows:

x_i(k) = G_i(Y_{i_1}(k − 1), . . . , Y_{i_t}(k − 1)),
Y_{i_h}(k − 1) = A_{i_h,m},  h = 1, 2, . . . , t,
A_{i_h,l} = Acc(A_{i_h,l−1}, S_{i_h,l}(k − 1)),
S_{i_h,l}(k − 1) = Extra(L_{i_l}(k − 1)),
∪_{l=1}^{m} L_{i_l}(k − 1) = L_i(k − 1),    (2)

where A_{i_h,0} = η is a constant value and l = 1, 2, . . . , m. The function Extra() is used to extract the local attributes of the subset L_{i_l}(k−1), Acc() is used to gather and accumulate these local attributes, and G_i() is used to calculate the value of x_i(k) for the kth iteration according to the accumulated attributes.

Clearly, this decomposition ensures the two required properties, because the result of each decomposable part, such as Y_{i_h}(k−1), can be calculated based only on the results of its sub-processes, such as S_{i_h,l}(k−1), and the sub-processes are independent of each other as well. Note that Equation 2 is only one of the possible ways of decomposition. More specifically, the analysis process mainly runs in the following way.

First, all decomposable parts of a straggling feature's calculation are evenly factored into several sub-processes, and each sub-process, namely Extra(), only processes the values in the subset L_{i_l}(k−1), where the values of the set L_{i_l}(k−1) are broadcast to each worker at the beginning of each iteration. To achieve this goal, the set L_i(k−1) needs to be partitioned into equal-sized subsets and evenly distributed over the cluster.

When all assigned sub-processes for the feature are finished, a worker only needs to accumulate and process the attribute results of these subsets to obtain the final value of this feature. In this way, the load of a straggling FEP's decomposable part is evenly distributed over the cluster.

However, for non-straggling features, the extraction process's input set L_i(k−1) may contain few values, whose number is determined by the dependency graph of features. Thus, in practice, the input values of sub-processes are processed in units of blocks. Specifically, each task processes a block of values, and a block may contain the input values of different sub-processes, where the block partition is similar to a traditional partition of the data set. In reality, the processing of each straggling feature is done as if it


TABLE 1
Notation used in the paper.

I             Raw data object set I = {D_1, D_2, . . . , D_n} (each data object owns a feature).
x_i(k)        Feature value of the ith data object D_i at the kth iteration.
L(k)          The feature set at the kth iteration, L(k) = {x_1(k), x_2(k), . . . , x_n(k)}.
L_i(k)        A subset of L(k) containing all the values needed by the calculation of x_i(k + 1).
L_{i_l}(k)    The lth subset of the set L_i(k).
Y_{i_h}(k)    The hth attribute of the set L_i(k), e.g., the maximum value, minimum value, count, or sum of values. It is the result of a decomposable part of the calculation of x_i(k + 1); in other words, the code that calculates Y_{i_h}(k) is a decomposable part of the process for x_i(k + 1).
S_{i_h,l}(k)  The hth attribute of the set L_{i_l}(k).
T_origin(O)   Processing time of feature O in an iteration without partition and distribution of its decomposable parts.
T_real(O)     Processing time of feature O in an iteration after partition and distribution of its decomposable parts.
α             Constant time to accumulate the result values of two partitioned blocks for features.
B(O)          Total number of partitioned blocks for feature O.
T_b           Processing time needed before the redistribution of blocks.
T_a           Processing time needed after the redistribution of blocks.
C             Extra time caused by redistribution.
T_s           Processing time spared after the redistribution of blocks.
N             Number of redistributed blocks.

were executed over a virtual cluster of nodes, where its blocks are handled by virtual workers. Fig. 2 illustrates the execution of a straggling FEP.
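To make the decomposition concrete, the following minimal Python sketch instantiates Equation (2) for a single feature whose attribute is a plain sum; the bodies of extra(), acc() and g() are illustrative assumptions rather than SAE's actual implementations, and in SAE each extra() call would run as a sub-process on a (possibly remote) worker.

    # Sketch of the decomposition in Equation (2) for one straggling feature,
    # using a single summed attribute. The function bodies are placeholders.
    def extra(block):                 # Extra(): local attribute of one subset L_il(k-1)
        return sum(block)

    def acc(partial, block_attr):     # Acc(): fold one block attribute into the running value
        return partial + block_attr

    def g(accumulated):               # G_i(): turn the accumulated attribute into x_i(k)
        return accumulated

    def compute_feature(blocks, eta=0.0):
        """blocks: the subsets L_il(k-1); each extra() is an independent sub-process."""
        attrs = [extra(b) for b in blocks]   # may run in parallel, in any order
        a = eta                              # A_ih0 = eta
        for s in attrs:
            a = acc(a, s)
        return g(a)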

3.2.2 Identification and decomposition of straggling FEPs

Because the data dependency graph of features for many social network analysis algorithms is generated and changed at run time, the number of values needed for a feature's processing is unknown at the first iteration. Also, in the beginning, we do not know which FEP will be the straggling one. Thus we do not factor computation for any FEP at this moment. Fortunately, an FEP often needs to repeat the same process for several iterations until its value converges, and the change of a feature's dependencies between successive iterations may be minor although the data dependency graph varies with time. Thus the number of values needed by each feature at the processing iteration can be easily obtained, which also helps identify each straggling feature for subsequent iterations.

Based on this observation, we can periodically profile load cost in previous iterations, and then use this information to identify straggling features and to guide the decomposition and distribution of each straggling feature for subsequent iterations. This way, it not only reduces the runtime overhead caused by load cost profiling, block partitioning and distribution, but also has little impact on the load evaluation and distribution for these iterations, because the profiled load cost from the previous iteration may remain valid for the current iteration.

Now, we discuss how to determine a straggling feature and parallelize its computation. To speed up the convergence, it may seem that partitioning it into more blocks always results in better performance. However, this is not the case, because of the increased computation for aggregating the blocks' attributes and the increased communication cost. Therefore, we need to determine a proper number of blocks.

Assume all the workers are idle. Then the processing time of a feature's decomposable part is T_origin(O)/B(O) after its data set is divided into B(O) blocks. Obviously, both the time to accumulate local attributes and the cost caused by the decomposition and distribution of blocks are determined by the number of blocks and can be approximately evaluated as α · B(O). Thus we have

T_real(O) = T_origin(O)/B(O) + α × B(O),    (3)

where T_origin(O) and α can be obtained from the previous execution. To minimize T_real(O), the suitable value of B(O) is B_max(O) = (T_origin(O)/α)^{1/2}.

In practice, we identify a feature O as a straggling one if T_origin(O)/B_max(O) ≥ S_b, where S_b is a constant specified by the user to represent the size of a block. Whenever a straggling feature O is identified according to the information gained from the previous iteration, its value set is divided into B_max(O) equally-sized blocks in advance. For other, non-straggling features, the input values of their decomposable parts are inserted into equally-sized blocks, where the size of a block is S_b.
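A minimal sketch of this rule, assuming T_origin(O), α and the user threshold S_b have been obtained from profiling as described above; the function names are hypothetical:

    import math

    # Block-count rule derived from Equation (3): B_max(O) = sqrt(T_origin(O) / alpha),
    # plus the straggler test T_origin(O) / B_max(O) >= S_b.
    def plan_blocks(t_origin, alpha, s_b):
        b_max = max(1, int(math.sqrt(t_origin / alpha)))
        is_straggler = (t_origin / b_max) >= s_b
        # straggling features get B_max equally-sized blocks in advance;
        # non-straggling features keep their values in blocks of size S_b
        return (b_max if is_straggler else 1), is_straggler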

3.2.3 Adaptive computation distribution

The above computational decomposition only attempts to create more chances for straggling FEPs to be processed with roughly balanced load among tasks. After the decomposition, some workers may still be more heavily loaded than others. For example, as described in Section 2, most FEPs have converged during the first several iterations; the workers assigned these converged FEPs may then become idle. Consequently, the system needs to decide whether it should


Algorithm 1 Redistribution algorithm
/* Executed on the master. */
procedure REDISTRIBUTION(WSet, Fidle)
    Wd.Load ← 0
    /* Calculate all workers' mean load. WSet contains the profiled remaining load of all workers. */
    LAverage ← GetMeanLoad(WSet)
    while (Fidle ∧ Wd.Load ≤ LAverage) ∨ Lc ≥ ε do
        if Fidle then                          /* Triggered by an idle worker. */
            Wd ← GetIdleWorker()               /* Get the idle worker. */
        else
            Wd ← GetFastestWorker(WSet)        /* Get the worker with the least load. */
        end if
        Ws ← GetSlowestWorker(WSet)            /* Get the slowest worker. */
        L'c ← LAverage − Wd.Load
        L''c ← Ws.Load − LAverage
        Lc ← Min(L'c, L''c)                    /* Get the minimum value. */
        Migration(Ws, Wd, Lc)
        Ws.Load ← Ws.Load − Lc
        Wd.Load ← Wd.Load + Lc
    end while
end procedure

redistribute blocks among workers, based on the previous load distribution, to balance the load cost of all workers and to accelerate the convergence of the remaining FEPs.

Now, we show the details of deciding when to redistribute blocks according to a given cluster's condition. In reality, whenever the periodically profiled remaining load of the workers is received, or a worker becomes idle, the master determines whether a block redistribution is needed. It redistributes blocks only when

T_a < T_b − C.    (4)

The intuition is as follows. If the system decides to redistribute blocks, the gained benefit should exceed the incurred cost. In other words, the redistribution makes sense only if the spared processing time T_s is greater than the overhead C, where the expected spared processing time T_s is T_b minus the new expected processing time T_a, after paying the redistribution overhead C. The calculation details are given as follows.

The processing times T_b and T_a can be approximately evaluated via the straggling worker before and after block redistribution at the previous iteration, respectively. Specifically, T_b can be directly evaluated by the finish time of the slowest worker at the previous iteration. The approximate value of T_a is the average completion time of all workers at the previous iteration.

Algorithm 2 Migration algorithm
/* Executed on the worker. It is employed to migrate Lc of load from worker Ws to Wd. */
procedure MIGRATION(Ws, Wd, Lc)
    NeedDecomposition ← False
    Lt ← 0
    while Lt < Lc do
        /* Select an unprocessed feature in this iteration with load less than Lc − Lt. */
        Ft ← Ws.SelectFeature(Lc − Lt)
        if Ft = ∅ then                         /* No more suitable features. */
            NeedDecomposition ← True
        end if
        if NeedDecomposition = False then
            /* Insert the values it needs to process into a suitable block. */
            Bset.InsertNonStraggling(Ft)
            Lt ← Lt + Ft.Load
        else
            Ft ← SelectFeature(Max)            /* Randomly get a straggling feature. */
            Bmax(Ft) ← (Torigin(Ft)/α)^(1/2)
            Lm ← Min(Lc − Lt, Ft.LoadRemaining)
            N(Ft) ← ⌊Lm / Torigin(Ft)⌋ × Bmax(Ft)
            /* Insert N(Ft) of Ft's unprocessed blocks into Bset. */
            Bset.InsertStraggling(Ft, N(Ft))
            Lt ← Lt + Lm
        end if
    end while
    Transform(Wd, Bset)                        /* Transfer all blocks contained in Bset. */
end procedure

Thus, T_a can be obtained as follows:

T_a = (Σ_{i=1}^{N_p} L_i) / N_p,    (5)

where N_p is the total number of workers, L_i is the load of worker i, and the unit of L_i is the number of data objects. Because the redistribution time C is mainly determined by the number of redistributed blocks, we can approximately evaluate the redistribution time C as follows:

C = θ_1 + θ_2 × N,    (6)

where the constants θ_1 and θ_2 can be obtained from the block redistribution of the previous iteration.
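Putting Equations (4)-(6) together, the redistribution decision can be sketched as follows; the profiled per-worker loads, θ_1, θ_2 and the block count are assumed to come from the previous iteration, as described in the text.

    # Sketch of the redistribution test T_a < T_b - C (Equations 4-6).
    # loads: profiled remaining load per worker; t_b: finish time of the
    # slowest worker in the previous iteration; theta1/theta2: fitted constants.
    def should_redistribute(loads, t_b, n_blocks, theta1, theta2):
        t_a = sum(loads) / len(loads)      # Equation (5): average remaining load
        c = theta1 + theta2 * n_blocks     # Equation (6): redistribution cost
        return t_a < t_b - c               # Equation (4)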

The process of block redistribution is given in Algorithm 1. It incrementally redistributes blocks based on the block distribution of the previous iteration. It always migrates load from the slowest worker to the idle worker or the fastest worker by directly migrating blocks, until the idle worker's load is almost equal to LAverage or the value of Lc is less than ε (a constant value given by the user). In reality, the value


of Lc being less than ε in Algorithm 1 means that the load of all workers is almost the same. Because the accumulation of blocks' attributes causes communication cost, the migration algorithm (see Algorithm 2) always prefers to migrate non-straggling features and only distributes straggling features' blocks over workers when there is no other choice. Because the load of a straggling worker may be several times the average, the redistribution algorithm (see Algorithm 1) may need several rounds to migrate its load to a set of idle or fast workers in an asynchronous way.
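The migration preference of Algorithm 2 (whole non-straggling features first, then blocks of a straggling feature) can be roughly sketched as below; the feature records and their field names are hypothetical, and the block-count computation is simplified relative to Algorithm 2.

    # Rough sketch of the preference in Algorithm 2: move whole non-straggling
    # features while they fit, then split off blocks of one straggling feature.
    def pick_migration(features, l_c):
        moved, remaining = [], l_c
        for f in sorted((f for f in features if not f["straggling"]),
                        key=lambda f: f["load"]):
            if f["load"] <= remaining:
                moved.append(("feature", f["id"]))
                remaining -= f["load"]
        if remaining > 0:
            stragglers = [f for f in features if f["straggling"] and f["blocks_left"] > 0]
            if stragglers:
                f = stragglers[0]
                n = min(f["blocks_left"], max(1, int(remaining / f["load_per_block"])))
                moved.append(("blocks", f["id"], n))
        return moved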

3.3 Performance analysis

In this section we analyze SAE's performance and discuss the application scenarios in which it performs best.

Assume the number of data objects for a straggling feature is d, this feature is partitioned into b blocks, the decomposable part ratio of the straggling FEP is DPR, the average execution time of an iteration is E, the runtime overhead of SAE is O, and the computational imbalance degree for the algorithm without SAE is λ_L, where

λ_L = L_max / L_avg − 1,    (7)

L_max is the maximum computation time needed by any worker, and L_avg is the mean computation time needed by all workers. Then the execution time of an iteration without employing SAE is

E_p = E × (1 + λ_L).    (8)

With SAE, the decomposable parts of straggling features can be evenly distributed over workers. Assume the computational skew of the decomposable parts is λ_o after the work distribution; then the execution time of the decomposable parts is

E_d = E × DPR × (1 + λ_o),    (9)

where λ_o can be approximately evaluated as

λ_o = N_All / N_Work − 1,    (10)

N_All and N_Work are the total number of workers and the number of workers that are processing data objects, respectively. Because the decomposable parts of straggling FEPs can be evenly distributed over more idle workers with SAE, N_Work with SAE is much larger than without SAE, and can be very close or equal to N_All. Thus λ_o is always much less than λ_L and is almost equal to 0 most of the time. Note that, as described in Equation 3, there may not be enough blocks to distribute, considering the cost of computational decomposition and distribution themselves, so λ_o may not be zero when most of the features have converged.

TABLE 2
Programming interfaces

Extra(FeatureSet Oset)
    Employed to extract the local attributes of the features in the set Oset.
Barrier(FeatureSet Oset)
    Checks whether all attributes needed by the features contained in Oset are available. Oset can also contain a single feature.
Get(Feature O)
    Gathers the results of all distributed sub-processes for the decomposable parts of feature O's calculation.
Acc(FeatureSet Oset)
    Employed to gather and accumulate the local attributes calculated by the distributed sub-processes, namely Extra(), for the features contained in Oset. It can be executed only when all attributes needed by the features contained in Oset are available.
Diffuse(FeatureSet Dset, FeatureSet Oset)
    Spreads the values of the features contained in Oset to all workers processing the user-specified features contained in Dset for the next iteration. In reality, it only notifies the runtime of where the values need to be transferred; the actual transfer is done by the runtime at the end of each iteration, together with other transfers.
Termination()
    Used to check whether the termination condition is met after each iteration.

Thus, with SAE, the maximum execution time of an iteration is

E_o = E × DPR × (1 + λ_o) + E × (1 − DPR) × (1 + λ_L) + O,    (11)

where

DPR = (α_1 × d) / (α_1 × d + α_2 × b),    (12)

and α_1 is the time to calculate a data object's local contribution and α_2 is the time to accumulate a block of data objects' local contributions. Therefore, the speedup of our approach is

S = E_p / E_o = (1 + λ_L) / (1 + DPR × λ_o + (1 − DPR) × λ_L + O/E).    (13)

From this analysis, we can observe that the performance of an algorithm with a low decomposable part ratio DPR or a low imbalance degree λ_L may not improve much. Thus our approach is more suitable for applications with a high load imbalance degree and a high decomposable part ratio.
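For reference, the cost model of Equations (8), (11) and (13) can be evaluated with a few lines of Python; all inputs are assumed to be measured or estimated offline (E, λ_L, λ_o, DPR and the overhead O), so this is only a sketch of the analysis, not part of the system.

    # Sketch of the performance model in Equations (8), (11) and (13).
    def predicted_speedup(E, lambda_L, lambda_o, dpr, overhead):
        e_p = E * (1 + lambda_L)                                   # Eq. (8): without SAE
        e_o = (E * dpr * (1 + lambda_o)
               + E * (1 - dpr) * (1 + lambda_L)
               + overhead)                                          # Eq. (11): with SAE
        return e_p / e_o                                            # Eq. (13): speedup S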

4 IMPLEMENTATION

4.1 Programming model

In reality, the computation decomposition approach proposed in Section 3.2.1 for the processing of a feature can be abstracted as follows: 1) receive the values of related features and calculate the needed local attributes for this feature according to the received values; 2) when all the local attributes needed by this feature are calculated and available, gather and accumulate these attributes for this feature; 3)


Fig. 3. Execution overview of a feature in an iteration: sub-processes are adaptively distributed over the cluster; each sub-process calculates the attributes of its specified block; when all needed attributes are available, they are gathered and accumulated for the feature; the new value of the feature is then calculated from the accumulated attributes and diffused for the next iteration.

calculate the new value of this feature based on the above accumulated attributes, then diffuse the new value of this feature to related features for the next iteration. The whole execution process of a feature is depicted in Fig. 3.

To support such a data-centric programming model, several interfaces are provided. The interfaces that application developers are required to instantiate are summarized in Table 2. The decomposable part is factored into several sub-processes, which are abstracted by Extra(). These sub-processes are distributed and executed over workers by the runtime. The Barrier() contained in the system code SysBarrier() (described in Fig. 4) is used to determine whether all needed attributes of the specified features are available. Barrier() is global by default and can be overwritten by the programmer. When all the attribute values needed by a set of features are available, the Acc() contained in SysBarrier() is triggered and executed on a worker to accumulate the results of the related Extra() calls for this set of features. It then calculates and outputs the new values of these features for the next iteration. In reality, Acc() is the non-decomposable part of the FEP.
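A rough sketch of these interfaces as an abstract Python class is given below; the signatures paraphrase Table 2, while the real SAE runtime (built on Piccolo) defines them differently, so this is purely illustrative.

    from abc import ABC, abstractmethod

    # Illustrative rendering of the Table 2 interfaces; names follow the table,
    # the attributes_ready() helper is an assumption made for the default barrier.
    class FeatureProcess(ABC):
        @abstractmethod
        def extra(self, oset):            # extract local attributes of a block of features
            ...

        @abstractmethod
        def acc(self, oset):              # accumulate attributes once all are available
            ...

        def barrier(self, oset):          # global by default; may be overridden
            return all(self.attributes_ready(o) for o in oset)

        @abstractmethod
        def attributes_ready(self, o):    # hypothetical helper used by the default barrier
            ...

        @abstractmethod
        def diffuse(self, dset, oset):    # notify the runtime which values go where
            ...

        @abstractmethod
        def termination(self):            # convergence test after each iteration
            ...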

In the following, we take the PageRank algorithm as an example to show how to use the above programming model to meet the challenges of straggling features. PageRank is a popular algorithm proposed for ranking web pages. Formally, the web linkage graph is a graph whose node set V is the set of web pages, and there is an edge from node i to node j if there is a hyperlink from page i to page j. Let W be the column-normalized matrix that represents the web linkage graph. That is, W(j, i) = 1/deg(i) (where deg(i) is the outdegree of node i) if there is a link from i to j; otherwise, W(j, i) = 0. Thus, the PageRank vector R, with each entry indicating a page's ranking score, can be expressed in the above programming model as Algorithm 3.

As described in Algorithm 3, the decomposable part is parallelized into several Extra() calls, each of which processes a block of other pages' ranking scores. To tackle the challenge posed by a page linked with numerous other pages, the computation and communication needed by its extraction process are distributed over workers by evenly distributing all its Extra() calls over

Algorithm 3 PageRank algorithm with SAE model
procedure EXTRA(FeatureSet Oset)
    for each R(k, j) ∈ Oset do
        V(j).sum ← V(j).sum + R(k, j)
    end for
end procedure

procedure ACC(R(j))
    ValueSet ← Get(R(j))
    R(j).sum ← 0
    for each V(j) ∈ ValueSet do
        R(j).sum ← R(j).sum + V(j).sum
    end for
    R(j) ← (1 − d) + d × R(j).sum
    ⟨links⟩ ← look up outlinks of j
    for each link ⟨j, i⟩ ∈ ⟨links⟩ do
        R(j, i) ← R(j)/deg(j)
        Diffuse(i, R(j, i))
    end for
end procedure

/* System code automatically triggered by the runtime after the finish of each Extra(). */
SysBarrier(FeatureSet Oset) {
    /* If all needed values are available. */
    if (Barrier(Oset))
        Acc(Oset)
}

Fig. 4. The pseudocode of SysBarrier.

the cluster. After all specified blocks are processed, the user-defined function Acc() is employed to accumulate the attribute results (the sums of other pages' ranking scores in this algorithm) of these Extra() calls and then to calculate the page's new ranking score according to the accumulated attributes. In this way, the negative effect caused by straggling FEPs can be alleviated. Finally, the new ranking score is diffused for the next iteration to process.
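For readers who prefer plain code to pseudocode, the Extra()/Acc() pair of Algorithm 3 can be sketched in Python as follows, with Get() and Diffuse() reduced to local stand-ins; the damping factor d and the outlink/degree maps are assumed inputs, not part of SAE's API.

    # Plain-Python rendering of Algorithm 3's Extra/Acc pair for PageRank.
    def extra(block_of_scores):
        # each entry is a contribution R(k, j) sent to page j; sum them locally
        return sum(block_of_scores)

    def acc(partial_sums, j, outlinks, deg, d=0.85):
        total = sum(partial_sums)          # stands in for Get(): results of all Extra() calls
        r_j = (1 - d) + d * total          # new ranking score of page j
        # stands in for Diffuse(): contribution of j to each outgoing neighbour
        return r_j, {i: r_j / deg[j] for i in outlinks[j]}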

4.2 SAE

To efficiently support the distribution and execution of sub-processes, a system, namely SAE, has been realized. It is implemented on the Piccolo [23] programming model. The architecture of SAE is presented in Fig. 5. It contains a master and multiple workers. The master monitors the status of the workers and detects the termination condition for applications. Each worker receives messages, triggers the related Extra() operations to process these messages, and calculates new values for features as well. In order to reduce communication cost, SAE also aggregates messages that are sent to the same node.

Each worker loads a subset of data objects into memory for processing. All data objects on a worker are maintained in a local in-memory key-value store, namely the state table. Each table entry corresponds to


Fig. 5. The architecture of SAE: a master plus multiple workers; each worker receives messages, distributes them to Extra() operations, tracks availability in a synchronization table, and runs SysBarrier (Barrier, Get, Acc), calculation and Diffuse, with computation distributed across workers.

a data object indexed by its key and contains three fields: the first field stores the key value j of the data object, the second its value, and the third the index of its feature recorded in the feature table described below.

To store the values of features, a feature table is also needed, which is indexed by the keys of features. Each entry of this table contains four fields: the first field stores the key value j of a feature, the second its iteration number, the third its value in the current iteration, and the fourth its attribute list.
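A structural sketch of the two tables, with field names taken from the description above; the actual store is a Piccolo key-value table, so the Python classes below are only an illustration.

    from dataclasses import dataclass, field

    @dataclass
    class StateEntry:            # state table: one row per data object
        key: int
        value: object
        feature_index: int       # index of the owning feature in the feature table

    @dataclass
    class FeatureEntry:          # feature table: one row per feature
        key: int
        iteration: int
        value: object
        attributes: list = field(default_factory=list)   # attribute list

    state_table = {}    # key -> StateEntry (per worker, in memory)
    feature_table = {}  # key -> FeatureEntry (per worker, in memory)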

At the first iteration, SAE only divides all data objects into equally-sized partitions. It can then get the load of each FEP from the finished iteration. With this information, in the subsequent iterations, each worker can identify straggling features and partition their related value sets into a proper number of blocks according to the ability of each worker. In this way, it can create more chances for the straggling FEPs to be executed and achieve rough load balance among tasks. At the same time, the master decides whether blocks need to be redistributed according to the expected benefit and the related cost, after receiving the profiled remaining load of each worker, or when some workers become idle. Note that the remaining load of each worker can be easily obtained in an approximate way by scanning the number of unprocessed blocks and the number of values in these blocks. The new iteration proceeds asynchronously without waiting for the block redistribution to finish, because only unprocessed blocks are migrated.

When a diffused message is received by a worker, it triggers an Extra() operation that processes the block of values contained in this message. After the completion of each Extra(), the worker sends its results to the worker w on whose feature table the feature's original information is recorded. After receiving this message, worker w records the availability of this block in its synchronization table and stores the results; these records will be used by Barrier() in SysBarrier() to determine whether all needed attributes are available for the related features.

Then SysBarrier() is triggered on this worker. When all needed attributes are available for a specified feature, the related Acc() contained in SysBarrier() is

triggered and used to accumulate all calculated results of the distributed decomposable parts for this feature. Then Acc() calculates the new value of this feature for the next iteration. After the end of this iteration, the feature's new value is diffused to the specified other features for the next iteration to process. At the same time, to eliminate the communication skew that occurs at the value diffusion stage, these new values are diffused in a hierarchical way. In this way, the communication cost is also evenly distributed over the cluster at the value diffusion stage.

5 EXPERIMENTAL EVALUATION

Platform and benchmarks. The hardware platform used in our experiments is a cluster with 256 cores residing on 16 nodes, which are interconnected by 2-Gigabit Ethernet. Each node has two octuple-core Intel Xeon E5-2670 CPUs at 2.60 GHz and 64 GB of memory, running a Linux operating system with kernel version 2.6.32. A maximum of 16 workers are spawned on each node to run the applications. Data communication is performed using OpenMPI version 1.6.3. The program is compiled with cmake version 2.6.4, gcc version 4.7.2, python version 2.7.6 and protobuf version 2.4.1. In order to evaluate our approach against current solutions, four benchmarks are implemented:

1) Adsorption [8]. It is a graph-based label propagation algorithm, which provides personalized recommendation for contents and is often employed in recommendation systems. Its label propagation proceeds to label all of the nodes based on the graph structure, ultimately producing a probability distribution over labels for each node.

2) PageRank [7], [9], [10], [11]. It is a popular link analysis algorithm that assigns a numerical weighting to each element, aiming to measure its relative importance within the set. The algorithm may be applied to applications, such as search engines, that need to rank a collection of entities with references.

3) Katz metric [6], [7]. It is an often-used link prediction approach that measures the proximity between two nodes in a graph; it is computed as the sum over the collection of paths between the two nodes, exponentially damped by path length with a damping factor β. With this algorithm, we can predict links in a social network, understand the mechanisms by which the social network evolves, and so on.

4) Connected components (CCom) [4], [5]. It is often employed to find connected components in a graph by letting each node propagate its component ID to its neighbours. In social networks, it is often used in graph segmentation,


TABLE 3
Data sets summary

Data set        Scale
Twitter graph   # nodes 41.7 million, # edges 1.47 billion
TSS             # vehicles 1 billion

structure analysis and measurement of graphs, and so on.

The data sets used for these algorithms are described in Table 3. We mainly use two data sets: a Twitter graph and a transportation simulation set (TSS). The former data set was downloaded from the website [18]. The TSS is the snapshot of the 300th iteration of a transportation simulation with 1 billion vehicles, where vehicles are evenly distributed over the simulated space at the beginning. Moreover, for graph-based social network analysis algorithms, two vertices are considered connected with each other if the distance between them is less than a given threshold r.

Performance metrics. The performance evaluation mainly uses the following metrics.

1) Computational imbalance degree λ_L: we take the computational imbalance degree

λ_L = L_max / L_avg − 1    (14)

to evaluate the computational skew, where L_max is the maximum computation volume needed by any worker and L_avg is the mean computation volume needed by all workers.

2) Communication imbalance degree λ_C: we take the communication imbalance degree

λ_C = C_max / C_avg − 1    (15)

to evaluate the communication skew, where C_max is the maximum communication volume on any worker and C_avg is the mean communication volume over all workers.

3) Decomposable part ratio DPR: the decomposable part ratio shows how much computation can be factored for the extraction of a feature, and is calculated as

DPR = C_d / C_a,    (16)

where C_d is the execution time of the decomposable part of the extraction of a feature, and C_a is the total execution time of the extraction of a feature.
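These metrics are straightforward to compute from per-worker measurements; a minimal sketch, assuming the per-worker computation or communication volumes and the timing split are given as inputs:

    # Sketch of the evaluation metrics in Equations (14)-(16).
    def imbalance(volumes):
        # lambda_L (computation volumes) or lambda_C (communication volumes)
        return max(volumes) / (sum(volumes) / len(volumes)) - 1

    def decomposable_part_ratio(t_decomposable, t_total):
        return t_decomposable / t_total   # DPR, Equation (16)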

Compared schemes. The performance of our approach is mainly compared with the following four systems, which leverage different state-of-the-art schemes. Note that the main difference between these four systems and SAE lies in the load balancing methods.

1) PUC [19], which partitions the data sets according to profiled load cost;

2) PowerGraph [20], which partitions the edges of each vertex of the graph;

3) PLB [21], which employs a hierarchical persistence-based rebalancing algorithm and performs greedy localized incremental rebalancing based on the previous task distribution. It greedily makes a local decision at each node about which child to assign work to, based on the average child load (the total load of that subtree divided by the subtree size);

4) RWS [21], which is an active-message-based hierarchical retentive work stealing algorithm. It employs split task queues, and a portion of the tasks in a queue can be randomly stolen by a thief.

PUC and PowerGraph represent the approaches that focus on task-level load balancing. They try to balance load among tasks and then distribute the tasks over workers. Note that PowerGraph can only be employed for social network analysis with a static graph as its data set. Consequently, we employ PowerGraph to balance tasks for benchmarks whose data set is a graph; otherwise, PUC is used. PLB and RWS represent the approaches that focus on worker-level load balancing. In these two schemes, the input data is divided into a fixed set of tasks at the beginning and is then redistributed in response to load changes, aiming to balance load among workers.

5.1 Skew of different approaches

To show the inefficiency of current solutions, we first evaluated the computational and communication skew of the different solutions over the benchmarks.

Fig. 6 presents the computational skew of the benchmarks employing different solutions over the Twitter graph and TSS, respectively. As shown in Fig. 6(a), PowerGraph faces significant computational skew, and its computational imbalance degree is even up to 1.43 for the Adsorption algorithm. This is because, for the Twitter graph, most features have converged in the first several iterations, inducing much idle time in the cluster for PowerGraph.

Although PLB and RWS try to remedy computational skew by dynamically adjusting the task distribution over the cluster, their computational imbalance degree is still high because of the computational skew among tasks. For the CCom algorithm, the computational imbalance degree of PLB is even up to 2.85 and 3.62 for the Twitter graph and TSS, respectively. On this algorithm, the computational imbalance degrees of RWS are also up to 2.43 and 2.24 for the Twitter graph and TSS, respectively.

To demonstrate the load imbalance in different schemes, we also evaluated the convergence conditions of the Adsorption algorithm with different schemes over the Twitter graph. Fig. 7(a) depicts the distribution of the number of iterations required by the social analytic features of the Adsorption algorithm with


Fig. 6. The computational skew of different schemes tested on the Twitter graph and TSS, respectively. (a) Twitter graph. (b) TSS.

Fig. 7. The convergence conditions of the Adsorption algorithm with different schemes over the Twitter graph. (a) The distribution of the number of iterations required for the social analytic features. (b) The number of idle cores after each iteration.

Fig. 8. The communication skew of different schemes tested on the Twitter graph and TSS, respectively. (a) Twitter graph. (b) TSS.

Fig. 9. Decomposable part ratio of benchmarks tested on different data sets.

different schemes, while the number of idle cores with increasing iterations of the Adsorption algorithm for the different schemes is shown in Fig. 7(b). From these two figures, we can find that only a small portion of features need several iterations to converge, and they may cause many cores to be idle at execution time.

From Fig. 6(b), we can find that PUC can achieve a low computational imbalance degree by partitioning the data set according to the profiled load cost; its computational imbalance degree is as low as 0.33 for the Katz metric algorithm. Yet, as discussed below, this low computational imbalance degree comes at a high runtime overhead.

The communication imbalance degree of the different solutions is shown in Fig. 8. From Fig. 6 and Fig. 8, we can find that the high computational skew exists along with significant communication skew, because the data sets that need to be processed by a straggling FEP are much larger than others. For example, when the computational imbalance degree of PLB is 2.85 for the CCom algorithm executed on the Twitter graph, its communication imbalance degree is also up to 2.91.

5.2 Decomposable part ratio

In this part, we show how much computation can be factored for the processing of a straggling feature. From Fig. 9, we can observe that the decomposable part ratios of these benchmarks are all higher than 76.3%. The decomposable part ratio of PageRank even reaches 86.7% on the TSS data set. It means that most of the computation can be partitioned and distributed over workers to redress the load imbalance caused by straggling FEPs. It also means that the negative effect caused by the very small non-decomposable computational part is negligible. Meanwhile, we find that the decomposable ratios of the algorithms for the TSS and Twitter datasets are very close to each other. This is because the decomposable part mainly depends on the algorithm rather than on the data sets.

5.3 Runtime overhead

Next, the breakdown of the execution time of the benchmarks with our approach is evaluated to show the runtime overhead of our approach. As described in Fig. 10, we can observe that the maximum time


Fig. 10. Execution time breakdown (runtime wasted time, communication time, data processing time) of different benchmarks executed over SAE. (a) Twitter graph. (b) TSS.

Fig. 11. The relative computational overhead against SAE tested on the Twitter graph and TSS, respectively. (a) Twitter graph. (b) TSS.

Fig. 12. The relative communication overhead against SAE tested on the Twitter graph and TSS. (a) Twitter graph. (b) TSS.

wasted by our approach only occupies 21.0% of the total execution time. For the Katz metric algorithm, the time wasted by our approach occupies only 12.1% of an iteration's total execution time. Moreover, the time to process features and the time to communicate intermediate results between iterations are up to 72.7% and 15.2%, respectively.

The extra computational and communication overheads of our approach and of the current solutions are presented in Fig. 11 and Fig. 12, respectively. Note that the results in these two figures are the computational and communication overheads relative to the corresponding overhead of SAE. We can observe that the computational and communication costs of our approach are higher than those of PLB and RWS; its runtime overhead mainly includes the time for straggling FEP identification and decomposition, as well as its computation redistribution. Taking the Twitter graph as an example, for the PageRank algorithm the computational overhead of SAE is 1.92 times more than that of PLB and 1.73 times more than that of RWS, respectively. Yet, as shown in the following part, this overhead is negligible against the benefits gained from the skew resistance for straggling FEPs. Obviously, it is a trade-off between the incurred overhead and the imbalance degree of social network analysis.

From Fig. 11 and Fig. 12, we can also observe that the runtime overhead of PowerGraph is almost the same as that of SAE. However, the computational overhead and communication overhead of PUC are 8.4 and 5.5 times more than those of SAE, respectively. This is because PUC has to pay much extra overhead to ensure a low computational and communication imbalance degree by profiling load cost and partitioning the whole data set.

Because the load of each worker is sent to the master for the redistribution decision, the master's memory overhead increases with more workers. In our case, its memory overhead is about 124.8 MB for 1024 workers per node (262,144 workers in total). As such, memory may become a bottleneck for worker counts on the order of millions under our current configuration.

5.4 Performance

From Fig. 6(a) and Fig. 6(b), we can observe that SAE can guarantee a low computational imbalance degree and also a low communication imbalance degree via straggler-aware execution. It also demonstrates that the load imbalance caused by the remaining computational part of straggling FEPs is small. Taking the Adsorption algorithm as an example, its computational imbalance degrees are only 0.12 and 0.18 for the Twitter graph and TSS, respectively. Its communication imbalance degrees are also only 0.33 and 0.36 for the Twitter graph and TSS, respectively.

Because of such low computational and communication imbalance degrees, our approach can achieve better performance than current solutions. The speedups of the current solutions against the performance of PLB on the above benchmarks are evaluated. The results are described in Fig. 13(a) and



Fig. 13. The speedup of different approaches against the performance of PLB, tested on the Twitter graph and TSS, respectively. (a) Twitter graph: PowerGraph, PLB, RWS, and SAE; (b) TSS: PUC, PLB, RWS, and SAE. The x-axis lists the benchmarks (Adsorption, PageRank, Katz metric, CCom); the y-axis is the speedup.

Fig. 14. The scalability of the CCom algorithm executed on SAE for different data sets. The x-axis is the number of cores (16, 32, 64, 128, 256); the y-axis is the speedup; the two curves correspond to the Twitter graph and TSS.

In these figures, the execution times of Adsorption, PageRank, Katz metric and CCom using PLB over the Twitter graph are 279.2, 217.5, 174.9 and 157.1 seconds, respectively, and the execution times of PLB on the data set TSS are 5455.0, 4071.3, 3294.2 and 3180.8 seconds, respectively. As shown in these two figures, SAE always achieves the best performance because it attains a low computational and communication imbalance degree with relatively low runtime overhead. The performance of SAE is up to 2.53 times and 2.69 times that of PLB for the Katz metric algorithm on the Twitter graph and for the CCom algorithm on TSS, respectively.
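As a quick sanity check on these numbers, the sketch below converts the reported PLB baselines and speedups into the implied SAE execution times (the variable names are ours, for illustration only):

    # Implied SAE times from the reported PLB baselines and speedups.
    # speedup = t_plb / t_sae  =>  t_sae = t_plb / speedup
    plb_seconds = {
        ("Katz metric", "Twitter"): 174.9,
        ("CCom", "TSS"): 3180.8,
    }
    reported_speedup = {
        ("Katz metric", "Twitter"): 2.53,
        ("CCom", "TSS"): 2.69,
    }
    for key, t_plb in plb_seconds.items():
        t_sae = t_plb / reported_speedup[key]
        print(f"{key}: PLB {t_plb:.1f}s -> SAE ~{t_sae:.1f}s")
    # Katz metric on Twitter: ~69.1 s; CCom on TSS: ~1182.5 s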

In contrast, for the Katz metric algorithm on the Twitter graph, PowerGraph achieves only a 1.55 times speedup against PLB because it cannot efficiently redistribute load to exploit the idle time of workers, while PUC achieves only a 1.21 times speedup against PLB for the CCom algorithm on TSS because of the high runtime overhead it pays to balance the load. From the figure, SAE achieves a speedup of up to 1.77 times against PowerGraph on the Adsorption algorithm and up to 2.23 times against PUC on the CCom algorithm.

Finally, the scalability of SAE is evaluated as well. Fig. 14 shows the speedup of the CCom algorithm executed on SAE with different numbers of cores. SAE scales well because it keeps the computational and communication skew very low, so the algorithms executed on SAE can efficiently exploit the underlying cluster.

6 RELATED WORKS

Load balancing is an important problem for the parallel execution of social network analysis in the cloud. Current solutions for this problem focus either on task-level load balancing or on worker-level load balancing.

Task-level load balancing. SkewReduce [19] is a state-of-the-art solution for reducing load imbalance among tasks, motivated by the observation that in some scientific applications, different partitions of the data object set take vastly different amounts of time to run even if they are of equal size. It employs a user-defined cost function to guide the division of the data object set into equally loaded, rather than equally sized, partitions. However, in order to ensure low load imbalance for social network analysis, it has to pay significant overhead to periodically profile the load cost of each data object and to re-divide the whole data set across iterations. To evaluate load cost accurately, Pearce et al. [24] proposed an element-aware load model based on application elements and their interactions.
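As an illustration of equally loaded rather than equally sized partitioning, the following sketch divides an ordered data set into contiguous partitions of roughly equal total cost under a user-supplied cost function; the function names and the greedy splitting rule are ours and are much simpler than SkewReduce's actual optimizer:

    def split_by_cost(objects, cost, num_partitions):
        # Cut the ordered data set into contiguous partitions whose total
        # cost is roughly equal, instead of partitions of equal size.
        total = sum(cost(o) for o in objects)
        target = total / num_partitions
        partitions, current, acc = [], [], 0.0
        for o in objects:
            current.append(o)
            acc += cost(o)
            if acc >= target and len(partitions) < num_partitions - 1:
                partitions.append(current)
                current, acc = [], 0.0
        partitions.append(current)
        return partitions

    # Example: cost proportional to vertex degree; the single high-degree
    # vertex ends up alone in its partition.
    objs = [("v1", 900), ("v2", 30), ("v3", 25), ("v4", 20), ("v5", 15)]
    print(split_by_cost(objs, cost=lambda o: o[1], num_partitions=2))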

PowerGraph [20] considers the load imbalance problem of distributed graph applications with a focus on partitioning and processing the edges of each vertex over the machines. Yet, it can only statically partition computation for graphs with fixed dependencies and cannot adaptively redistribute sub-processes over the computer cluster to exploit idle time during the execution of social network analysis.

Worker-level load balancing. Persistence-based load balancers [21], [25], [26] and retentive work stealing [21] represent the approaches that balance loads among workers for iterative applications. Persistence-based load balancers redistribute the work to be performed in a given iteration based on performance measured in previous iterations. Retentive work stealing targets applications with significant load imbalance within individual phases, or with workloads that cannot be easily profiled; it records the task information of previous iterations so that work stealing achieves higher efficiency. Yet, both solutions only distribute or steal whole tasks and cannot support the computation decomposition of straggling FEPs. As a result, social network analysis with these two solutions may still suffer significant computational and communication skew caused by the high load imbalance among the initial tasks.
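For illustration, a minimal sketch of the persistence-based idea follows, assuming per-task times measured in the previous iteration are available; the greedy reassignment is our simplification of what such balancers do:

    import heapq

    def persistence_rebalance(task_times, num_workers):
        # Reassign whole tasks for the next iteration using last iteration's
        # measured times (the persistence assumption: a task's cost changes
        # slowly). Greedy simplification of persistence-based load balancing.
        heap = [(0.0, w) for w in range(num_workers)]  # (assigned load, worker)
        heapq.heapify(heap)
        assignment = {}
        for task, t in sorted(task_times.items(), key=lambda kv: kv[1], reverse=True):
            load, w = heapq.heappop(heap)
            assignment[task] = w
            heapq.heappush(heap, (load + t, w))
        return assignment

    # Measured times from the previous iteration (seconds), three workers.
    print(persistence_rebalance({"t0": 8.0, "t1": 1.0, "t2": 1.2, "t3": 0.9}, 3))

Note that the granularity is the whole task: if a single task dominates (t0 above), no reassignment of whole tasks can remove the skew, which is exactly the limitation discussed in the text.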

Mantri [27] identifies three causes of load imbalance among MapReduce tasks, monitors and predicts those causes, and takes appropriate actions to reduce load imbalance by restarting, migrating, and replicating tasks. However, it faces the same problem as


persistence-based load balancers and retentive work stealing. SkewTune [28] finds the task with the greatest expected remaining processing time by scanning the remaining input data object set, and proactively repartitions the unprocessed data object set of the straggling task in a way that fully utilizes the machines of the computer cluster. However, for social network analysis, it prolongs the time needed to handle skew and also induces high runtime overhead, because it must identify straggling tasks and divide the data set for each straggling task one by one at each iteration, even though there may be many straggling FEPs within each iteration.
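A minimal sketch of the SkewTune-style mitigation, picking the task with the greatest remaining work and splitting its unprocessed input across idle workers, is shown below; the remaining-work proxy and the round-robin split are our simplifications of its actual planner:

    def pick_and_repartition(tasks, idle_workers):
        # tasks: {task_id: list of unprocessed items}. Pick the task with the
        # most remaining work (our proxy for expected remaining time) and
        # split its unprocessed input across the idle workers.
        straggler = max(tasks, key=lambda t: len(tasks[t]))
        items = tasks[straggler]
        n = len(idle_workers) + 1                  # straggler keeps one share
        shares = [items[i::n] for i in range(n)]   # round-robin split
        plan = {straggler: shares[0]}
        plan.update({w: s for w, s in zip(idle_workers, shares[1:])})
        return plan

    tasks = {"t0": list(range(12)), "t1": list(range(3))}
    print(pick_and_repartition(tasks, idle_workers=["w2", "w3"]))

Because this is done for one straggling task at a time, the cost of detection and repartitioning is paid repeatedly when many FEPs straggle in the same iteration.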

SAE compared with previous work. Unlike current solutions, SAE addresses the problem of computational and communication skew at both the task and worker levels for social network analysis in a different way, based on the fact that the computation of an FEP is largely decomposable. Specifically, it factors straggling FEPs into several sub-processes and then adaptively distributes these sub-processes over computers, aiming to parallelize the decomposable part of each straggling FEP and to accelerate its convergence.
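As a rough illustration of this decomposition idea (not the actual SAE runtime), a straggling FEP whose update is an associative aggregation over neighbor contributions can be factored into partial aggregations that run in parallel and are then combined:

    from concurrent.futures import ThreadPoolExecutor

    def decomposed_fep_update(neighbor_values, num_subprocesses=4):
        # Factor one straggling FEP update into sub-processes: each
        # sub-process aggregates a slice of the neighbor contributions, and
        # the partial results are combined afterwards. Sketch only; assumes
        # the per-FEP computation is an associative aggregation (e.g., a sum).
        chunks = [neighbor_values[i::num_subprocesses] for i in range(num_subprocesses)]
        with ThreadPoolExecutor(max_workers=num_subprocesses) as pool:
            partials = list(pool.map(sum, chunks))  # sub-processes in parallel
        return sum(partials)                         # combine partial results

    # A high-degree vertex with many incoming contributions.
    print(decomposed_fep_update([0.01] * 1_000_000))

In SAE the sub-processes are spread over different machines rather than local threads, but the principle is the same: the heavy, decomposable part of a straggling FEP is parallelized instead of being processed by a single overloaded worker.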

7 CONCLUSION

For social network analysis, a straggling FEP may need a large number of iterations to converge and may also require very large amounts of computation and communication in each iteration, inducing serious load imbalance. However, current solutions to this problem either require significant overhead, or cannot exploit underutilized computers once some features have converged in early iterations, or perform poorly because of the high load imbalance among the initial tasks.

This paper identifies that most of the computation of a straggling FEP is decomposable. Based on this observation, it proposes a general approach that factors straggling FEPs into several sub-processes, along with a method to adaptively distribute these sub-processes over workers in order to accelerate convergence. The paper also provides a programming model and an efficient runtime to support this approach. Experimental results show that it can greatly improve the performance of social network analysis compared with state-of-the-art approaches. As shown in Section 5, the master in our approach may become a bottleneck; in future work, we will study how to apply our approach in a hierarchical way to reduce the memory overhead and will evaluate the resulting performance gain.

ACKNOWLEDGMENTS

This work was supported by the National High-tech Research and Development Program of China (863 Program) under grant No. 2012AA010905, the China National Natural Science Foundation under grants No. 61322210 and 61272408, and the Natural Science Foundation of Hubei under grant No. 2012FFA007. Xiaofei Liao is the corresponding author.

REFERENCES

[1] Z. Song and N. Roussopoulos, "K-nearest neighbor search for moving query point," Lecture Notes in Computer Science, vol. 2121, pp. 79–96, July 2001.
[2] X. Yu, K. Q. Pu, and N. Koudas, "Monitoring k-nearest neighbor queries over moving objects," in Proceedings of the 21st International Conference on Data Engineering. IEEE, 2005, pp. 631–642.
[3] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: Analysis and implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881–892, July 2002.
[4] L. Di Stefano and A. Bulgarelli, "A simple and efficient connected components labeling algorithm," in Proceedings of the International Conference on Image Analysis and Processing. IEEE, 1999, pp. 322–327.
[5] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good et al., "Pegasus: A framework for mapping complex scientific workflows onto distributed systems," Scientific Programming, vol. 13, no. 3, pp. 219–237, January 2006.
[6] L. Katz, "A new status index derived from sociometric analysis," Psychometrika, vol. 18, no. 1, pp. 39–43, March 1953.
[7] D. Liben-Nowell and J. Kleinberg, "The link prediction problem for social networks," in Proceedings of the 12th International Conference on Information and Knowledge Management. ACM, 2003, pp. 556–559.
[8] S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly, "Video suggestion and discovery for YouTube: taking random walks through the view graph," in Proceedings of the 17th International Conference on World Wide Web. ACM, 2008, pp. 895–904.
[9] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks and ISDN Systems, vol. 30, no. 1, pp. 107–117, April 1998.
[10] S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly, "Video suggestion and discovery for YouTube: taking random walks through the view graph," in Proceedings of the 17th International Conference on World Wide Web. ACM, 2008, pp. 895–904.
[11] H. H. Song, T. W. Cho, V. Dave, Y. Zhang, and L. Qiu, "Scalable proximity estimation and link prediction in online social networks," in Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement. ACM, 2009, pp. 322–335.
[12] Y. Zhang, X. Liao, H. Jin, and G. Min, "Resisting skew-accumulation for time-stepped applications in the cloud via exploiting parallelism," IEEE Transactions on Cloud Computing, doi: 10.1109/TCC.2014.2328594, May 2014.
[13] M. E. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E, vol. 69, no. 2, p. 026113, February 2004.
[14] I. D. Couzin, J. Krause, N. R. Franks, and S. A. Levin, "Effective leadership and decision-making in animal groups on the move," Nature, vol. 433, no. 7025, pp. 513–516, February 2005.
[15] S. Banerjee and N. Agarwal, "Analyzing collective behavior from blogs using swarm intelligence," Knowledge and Information Systems, vol. 33, no. 3, pp. 523–547, December 2012.
[16] G. Wang, M. Salles, B. Sowell, X. Wang, T. Cao, A. Demers, J. Gehrke, and W. White, "Behavioral simulations in MapReduce," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 952–963, September 2010.
[17] B. Raney and K. Nagel, "Iterative route planning for large-scale modular transportation simulations," Future Generation Computer Systems, vol. 20, no. 7, pp. 1101–1118, October 2004.
[18] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a social network or a news media?" http://an.kaist.ac.kr/traces/WWW2010.html, 2010.


[19] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "Skew-resistant parallel processing of feature-extracting scientific user-defined functions," in Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 2010, pp. 75–86.
[20] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, "PowerGraph: Distributed graph-parallel computation on natural graphs," in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 2012, pp. 17–30.
[21] J. Lifflander, S. Krishnamoorthy, and L. V. Kale, "Work stealing and persistence-based load balancers for iterative overdecomposed applications," in Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2012, pp. 137–148.
[22] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a social network or a news media?" in Proceedings of the 19th International Conference on World Wide Web. ACM, 2010, pp. 591–600.
[23] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 2010, pp. 1–14.
[24] O. Pearce, T. Gamblin, B. R. de Supinski, M. Schulz, and N. M. Amato, "Quantifying the effectiveness of load balance algorithms," in Proceedings of the 26th ACM International Conference on Supercomputing. ACM, 2012, pp. 185–194.
[25] G. Zheng, E. Meneses, A. Bhatele, and L. V. Kale, "Hierarchical load balancing for Charm++ applications on large supercomputers," in Proceedings of the 39th International Conference on Parallel Programming Workshops. IEEE, 2010, pp. 436–444.
[26] B. Acun, A. Gupta, N. Jain, A. Langer, H. Menon, E. Mikida, X. Ni, M. Robson, Y. Sun, E. Totoni, L. Wesolowski, and L. Kale, "Parallel programming with migratable objects: Charm++ in practice," in Proceedings of the 2014 International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 2014, pp. 647–658.
[27] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, "Reining in the outliers in map-reduce clusters using Mantri," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 2010, pp. 1–16.
[28] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "SkewTune: Mitigating skew in MapReduce applications," in Proceedings of the 2012 International Conference on Management of Data. ACM, 2012, pp. 25–36.

Yu Zhang is currently a Ph.D. candidate in computer science and technology at Huazhong University of Science and Technology (HUST), Wuhan, China. His research interests include big data processing, cloud computing, and distributed systems. His current work mainly focuses on application-driven big data processing and optimizations.

Xiaofei Liao received a Ph.D. degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China, in 2005. He is now a Professor in the School of Computer Science and Engineering at HUST. His research interests are in the areas of system virtualization, system software, and cloud computing.

Hai Jin is a Cheung Kong Scholars Chair Professor of computer science and engineering at Huazhong University of Science and Technology (HUST) in China. He is now Dean of the School of Computer Science and Technology at HUST. Jin received his Ph.D. in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz in Germany. Jin worked at The University of Hong Kong between 1998 and 2000, and as a visiting scholar at the University of Southern California between 1999 and 2000. He received the Excellent Youth Award from the National Science Foundation of China in 2001. Jin is the chief scientist of ChinaGrid, the largest grid computing project in China, and the chief scientist of the National 973 Basic Research Program projects on Virtualization Technology of Computing System and on Cloud Security. Jin is a senior member of the IEEE and a member of the ACM. He has co-authored 15 books and published over 500 research papers. His research interests include computer architecture, virtualization technology, cluster computing and cloud computing, peer-to-peer computing, network storage, and network security.

Guang Tan is currently an Associate Professor at Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, China, where he works on the design of distributed systems and networks. He received his Ph.D. degree in computer science from the University of Warwick, U.K., in 2007. From 2007 to 2010 he was a postdoctoral researcher at INRIA-Rennes, France. He has published more than 30 research articles in the areas of peer-to-peer computing, wireless sensor networks, and mobile computing. His research is sponsored by the National Science Foundation of China and the Chinese Academy of Sciences. He is a member of the IEEE, ACM, and CCF.