
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007

IMGPU: GPU Accelerated Influence Maximization in Large-scale Social Networks

Xiaodong Liu, Mo Li, Member, IEEE, Shanshan Li, Member, IEEE, Shaoliang Peng, Member, IEEE, Xiangke Liao, Member, IEEE, and Xiaopei Lu, Student Member, IEEE

Abstract—Influence maximization aims to find the top-K influential individuals who maximize the influence spread within a social network, which remains an important yet challenging problem. Proven to be NP-hard, the influence maximization problem has attracted tremendous study. Although basic greedy algorithms exist that provide a good approximation to the optimal result, they suffer from low computational efficiency and excessively long execution time, limiting their application to large-scale social networks. In this paper, we present IMGPU, a novel framework that accelerates influence maximization by leveraging the parallel processing capability of the GPU. We first improve existing greedy algorithms and design a bottom-up traversal algorithm with a GPU implementation, which contains inherent parallelism. To best fit the proposed influence maximization algorithm to the GPU architecture, we further develop an adaptive K-level combination method to maximize parallelism, and reorganize the influence graph to minimize potential divergence. We carry out comprehensive experiments with both real-world and synthetic social network traces and demonstrate that the IMGPU framework outperforms the state-of-the-art influence maximization algorithm by up to a factor of 60 and shows potential to scale to extraordinarily large networks.

Index Terms—Influence maximization, GPU, large-scale social networks, IMGPU, bottom-up traversal algorithm.

1 INTRODUCTION

SOCIAL networks such as Facebook and Twitter play an important role as efficient media for the fast spread of information, ideas, and influence among huge populations [12], and this effect has been greatly magnified by the rapid increase of online users. The immense popularity of social networks presents great opportunities for large-scale viral marketing, a marketing strategy that promotes products through "word-of-mouth" effects. As the power of social networks is increasingly exploited to maximize the benefit of viral marketing, it becomes vital to understand how we can maximize influence over a social network. This problem, referred to as influence maximization, is to select within a given social network a small set of influential individuals as initial users such that the expected number of influenced users, called the influence spread, is maximized.

The influence maximization problem is interesting yet challenging. Kempe et al. [12] proved this problem to be NP-hard and proposed a basic greedy algorithm that provides a good approximation to the optimal result. However, their approach is severely limited in efficiency because it needs to run Monte-Carlo simulation for a considerably long time to guarantee an accurate estimate. Although a number of successive efforts have been made to improve the efficiency, state-of-the-art approaches still suffer from excessively long execution times due to the high computational complexity for large-scale social networks.

• X. Liu, S. Li, S. Peng, X. Liao and X. Lu are with the School of Computer, National University of Defense Technology, No. 137 Yanwachi Street, Changsha 410073, China. E-mail: {liuxiaodong, shanshanli, shaoliangpeng, xkliao, xiaopeilu}@nudt.edu.cn

• M. Li is with the School of Computer Engineering, Nanyang Technological University, N4-02c-108, 50 Nanyang Avenue, 639798 Singapore. E-mail: [email protected]

On the other hand, the Graphics Processing Unit (GPU) has recently been widely used as a popular general-purpose computing device and has shown promising potential in accelerating the computation of graph problems such as Breadth First Search [9] and Minimum Spanning Tree [20], thanks to its parallel processing capacity and ample memory bandwidth. Therefore, in this paper, we explore the use of the GPU to accelerate the computation of the influence maximization problem.

However, the parallel processing capability of the GPU can only be fully exploited when handling tasks with regular data access patterns. Unfortunately, the graph structures of most real-world social networks are highly irregular, making GPU acceleration a non-trivial task. For example, Barack Obama, the U.S. president, has more than 11 million followers on Twitter, while more than 90% of Twitter users have fewer than 100 followers [13]. Such irregularities may lead to severe performance degradation. The main challenges of full GPU acceleration lie in the following aspects. First, the parallelism of the influence spread computation for each possible seed set is limited by the number of nodes at each level. Therefore, the computational power of the GPU cannot be fully exploited if we directly map the problem to the GPU for acceleration. Second, as the node degrees of most social networks follow a power-law distribution, severe divergence

Digital Object Identifier 10.1109/TPDS.2013.41 1045-9219/13/$31.00 © 2013 IEEE

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


between GPU threads will occur during the influence spread computation, seriously degrading the overall performance. Third, due to the irregular nature of real-world social networks, the memory accesses show poor spatial locality, making it hard to fit the GPU computational model.

In order to address the above challenges, we propose a GPU-accelerated influence maximization framework, IMGPU, which aims to fully leverage the parallel processing capability of the GPU. We first convert the social graph into a Directed Acyclic Graph (DAG) to avoid redundant calculation. Then a Bottom-Up Traversal Algorithm (BUTA) is designed and mapped to the GPU with the CUDA programming model. Our approach provides substantial improvement over existing sequential approaches by taking advantage of the inherent parallelism in processing nodes within a social network. Based on the features of the influence maximization problem, we propose a set of adaptive mechanisms to exploit the maximum capacity of the GPU and optimize the performance of IMGPU. In particular, we develop an adaptive K-level combination method to maximize the parallelism among GPU threads. Meanwhile, we reorganize the graph by level and degree distribution in order to minimize potential divergence and coalesce memory accesses to the utmost extent. We conduct extensive experiments with both real-world and synthetic social network traces. Compared with the state-of-the-art algorithm MixGreedy, IMGPU achieves up to 60× speedup in execution time and is able to scale up to extraordinarily large-scale networks that were never expected with existing sequential approaches.

In summary, the contributions of this paper are mainly twofold. First, we present BUTA, an efficient bottom-up traversal algorithm that contains inherent parallelism for the influence maximization problem. We further map BUTA to the GPU architecture to exploit the parallel processing capability of the GPU. Second, to best fit the GPU computational model, we propose several effective optimization techniques to maximize parallelism, avoid potential divergence, and coalesce memory accesses.

The rest of this paper is organized as follows. Section 2 provides preliminaries on influence maximization and reviews related work. The IMGPU framework and the corresponding GPU optimizations are presented in Section 3 and Section 4, respectively. We evaluate the IMGPU design by extensive experiments and report the experimental results in Section 5. We conclude this work in Section 6.

2 PRELIMINARIES AND RELATED WORK

In this section, we present a preliminary introduction to influence maximization and review related work.

In influence maximization, an online social network is modeled as a directed graph G = (V, E, W), where V = {v1, v2, ..., vn} represents the set of nodes in the graph, each of which corresponds to an individual user. Each node can be either active or inactive, and will switch from inactive to active if it is influenced by other nodes. E ⊂ V × V is a set of directed edges representing the relationships between different users. Take Twitter as an example: a directed edge (vi, vj) is established from node vi to vj if vi is followed by vj, which indicates that vj is open to receiving tweets from vi and thus may be influenced by vi. W = {w1, w2, ..., wn} contains the weight of each node, which indicates its contribution to the influence spread. The weight is initialized to 1 for each node, meaning that if a node is influenced by other nodes, its contribution to the influence spread is 1. The size of the node set is n, and the number of edges is m. Node vi is called a sink if its outdegree is 0, and a source if its indegree is 0.

The Independent Cascade (IC) model [12] is one of the most well-studied diffusion models. Given an initial set S, the diffusion process of the IC model unfolds as follows. At step 0, only the nodes in S are active, while all other nodes stay in the inactive state. At step t, each node vi that has just switched from inactive to active has a single chance to activate each currently inactive neighbor vw, and succeeds with probability pi,w. If vi succeeds, vw will become active at step t + 1. If vw has multiple newly activated neighbors, their attempts to activate vw are sequenced in an arbitrary order. This process runs until no more activations are possible [12]. We use σ(S) to denote the influence spread of the initial set S, which is defined as the expected number of active nodes at the end of the influence propagation.
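The diffusion process above can be sketched as a small Monte-Carlo simulation. The sketch below is illustrative, not the paper's implementation; the dictionary-based graph encoding and the `probs` map of edge probabilities are assumptions for exposition.

```python
import random

def ic_spread(graph, probs, seeds, rng=random.Random(0)):
    """One Monte-Carlo realization of the IC model.

    graph: dict mapping node -> list of out-neighbors
    probs: dict mapping edge (u, v) -> activation probability p_{u,v}
    seeds: initial active set S
    Returns the number of active nodes when diffusion stops.
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes activated in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                # each newly active node gets a single chance per neighbor
                if v not in active and rng.random() < probs[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)

def sigma(graph, probs, seeds, rounds=1000):
    """Estimate sigma(S): the expected number of active nodes,
    averaged over repeated simulations."""
    rng = random.Random(42)
    return sum(ic_spread(graph, probs, seeds, rng) for _ in range(rounds)) / rounds
```

With activation probabilities of 1.0 the process is deterministic and activates every reachable node, which matches the intuition that σ(S) counts expected reachability under random edge outcomes.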

Given a graph G = (V, E, W) and a parameter K, the influence maximization problem under the IC model is to select a subset of influential nodes S ⊂ V of size K such that the influence spread σ(S) is maximized at the end of the influence diffusion process.

Algorithm 1 Basic Greedy
1: Initialize S = ∅
2: for i = 1 to K do
3:     Select v = argmax_{u ∈ V\S} (σ(S ∪ {u}) − σ(S))
4:     S = S ∪ {v}
5: end for

The influence maximization problem was first introduced by Richardson and Domingos in [7], [19]. In 2003, Kempe et al. [12] proved that the influence maximization problem is NP-hard and proposed a basic greedy algorithm, shown in Algorithm 1. Their approach works in K iterations, starting with an empty set S (line 1). In each iteration, the node vi that brings the maximum marginal influence spread σS(vi) = σ(S ∪ {vi}) − σ(S) is selected and included in S (lines 3 and 4). The process ends when the size of S reaches K. However, computing the influence spread
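Algorithm 1 can be sketched as follows. Here `sigma` is a stand-in for whatever spread oracle is used (e.g. a Monte-Carlo estimator as above); treating it as an arbitrary callable is an assumption for illustration.

```python
def basic_greedy(nodes, sigma, k):
    """Basic greedy of Kempe et al. (Algorithm 1): k iterations, each
    adding the node with the largest marginal gain
    sigma(S + {v}) - sigma(S).

    nodes: iterable of candidate nodes
    sigma: callable mapping a frozenset of seeds to an estimated spread
    """
    seeds = frozenset()
    for _ in range(k):
        base = sigma(seeds)
        # pick the node with the maximum marginal influence spread
        best = max((v for v in nodes if v not in seeds),
                   key=lambda v: sigma(seeds | {v}) - base)
        seeds = seeds | {best}
    return set(seeds)
```

Each iteration calls `sigma` once per remaining candidate, which is exactly why the running time is dominated by the cost of estimating σ(S): with Monte-Carlo estimation this is the efficiency bottleneck the paper targets.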


σ(S) of an arbitrary set S is proved to be #P-hard in [5]. In order to guarantee computational accuracy, Kempe et al. suggest running the Monte-Carlo simulation for a long time to obtain a close estimate, resulting in the low computational efficiency of the greedy algorithm.

Wei Chen et al. proposed MixGreedy [4], which reduces the computational complexity by computing the marginal influence spread for each node vi ∈ (V\S) in one single simulation. MixGreedy first determines whether each edge is selected for propagation with a given probability. Then all edges not selected are removed to form a new graph G′ = (V, E′, W), where E′ ⊆ E. With this treatment, the marginal gain σS(vi) from adding node vi to S is the number of nodes that are reachable from vi but unreachable from all the nodes in S. To compute the influence spread for each node, a basic implementation does a BFS from every vertex, which takes O(mn) time. Therefore, MixGreedy incorporates Cohen's randomized algorithm [6] to estimate the marginal influence spread for every node, and then selects the node that offers the maximal influence spread. Adopting these optimizations, MixGreedy runs much faster. However, the improvement is not effective enough to reduce the execution time to an acceptable range, especially for large-scale social networks. Moreover, Cohen's randomized algorithm provides no accuracy guarantee.

In [17], Liu et al. propose ESMCE, a power-law exponent supervised Monte-Carlo method that efficiently estimates the influence spread by randomly sampling only a portion of the child nodes. In particular, ESMCE roughly predicts the number of child nodes that need to be sampled according to the power-law exponent of the given social network. Afterwards, multiple iterative steps are employed to randomly sample more nodes until the precision requirement is finally achieved. Although ESMCE can considerably accelerate influence maximization, the cost of its acceleration is accuracy.

Many other algorithms and heuristics have also been proposed to improve efficiency, such as [5], [8], [10], [11], [16], [21]. More details about these works can be found in Appendix A. Different from all existing studies, in this work we demonstrate that by exploiting the parallel processing capability of the GPU, we can greatly accelerate influence maximization while performing consistently among the best algorithms in terms of accuracy.

3 IMGPU FRAMEWORK

In this section, we describe the IMGPU framework that enables GPU-accelerated processing of influence maximization. First, we develop BUTA, which exploits inherent parallelism and effectively reduces the complexity with ensured accuracy. Then we map the algorithm implementation to the GPU environment under the CUDA programming model.

Fig. 1. Bottom-up traversal.

3.1 Bottom-Up Traversal Algorithm

As mentioned in Section 2, we can get a new graph G′ = (V, E′, W) from the original graph after randomly selecting edges from G. Afterwards, we need to compute the marginal gain for each node to find the best candidate. Instead of doing a BFS for each node, which is rather inefficient, we observe that the marginal influence computation of each node relies only on its child nodes; thereby we can obtain the influence spreads of all nodes by traversing the graph only once in a bottom-up way. Moreover, as there is no dependency among nodes at the same level of a DAG, their computation can be done in parallel. Here the level of a node vi, denoted level(vi), is defined by the following equation:

level(vi) = max_{vj ∈ out[vi]} level(vj) + 1,  if |out[vi]| > 0
level(vi) = 0,                                 if |out[vi]| = 0

where out[vi] denotes the set of child nodes of vi. As a motivating example, shown in Fig. 1, the influence spread computation for nodes a, b, c, and d at level 1 can be done concurrently. Therefore, their computation can be mapped to different GPU threads and processed in parallel to overlap the execution time. To this end, we propose BUTA to explore this inherent parallelism. We first convert the graph to a DAG to avoid redundant computation and potential deadlock. Then the DAG is traversed level by level in a bottom-up way to compute the marginal influence for all nodes in parallel on the GPU.
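The level definition above can be evaluated with one bottom-up pass over the DAG. The sketch below is a sequential illustration of that definition (the paper computes levels on the GPU); the adjacency-dict encoding is an assumption.

```python
def node_levels(children):
    """Compute level(v) for every node of a DAG, per the definition above:
    sinks (outdegree 0) get level 0; otherwise
    level(v) = 1 + max level over v's child nodes.

    children: dict mapping node -> list of child nodes (out-neighbors)
    """
    memo = {}

    def level(v):
        if v not in memo:
            outs = children.get(v, [])
            # a sink sits at level 0; otherwise one above its highest child
            memo[v] = 0 if not outs else 1 + max(level(c) for c in outs)
        return memo[v]

    for v in children:
        level(v)
    return memo
```

Grouping nodes by the resulting level value yields exactly the sets that BUTA processes concurrently, since no node depends on another node at its own level.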

To start with, we first prove a theorem that the marginal gains of nodes belonging to one Strongly Connected Component (SCC) are identical. The detailed proof can be found in Appendix B. From this theorem, we know that the influence spreads of nodes in an SCC are equal in quantity, so it is unnecessary to repeat the computation for all of those nodes. Therefore, we merge all nodes belonging to the same SCC into one node whose weight is the size of the SCC. By doing so, first, we avoid a large number of unnecessary computations. Second, the graph G′ can be converted into a DAG G∗ = (V∗, E∗, W∗), and the following BUTA algorithm can be applied to compute the influence spread in parallel.

Note that the influence spread of a node is closely related to its child nodes. As illustrated in Fig. 2(a),


Fig. 2. Relation of nodes.

node c has two child nodes, a and b. Every node reachable from a and b can also be reached from c. As the subgraph of node a has no overlap with that of b, the influence spread of c can be easily calculated as the sum of the influence spreads of a and b plus the weight of node c. Fig. 2(b) shows a special example where the influence spreads of a and b overlap at node d. When computing the influence spread of c, we need to subtract the overlapping part (node d) from the sum. To summarize, the influence spread of node vi can thus be obtained by

σS(vi) = Σ_{vj ∈ out[vi]} σS(vj) + w_{vi} − overlap(out[vi])        (1)

where overlap(out[vi]) denotes the amount of overlap among the influence spreads of all the nodes in out[vi]. Computation of the overlap(·) function is based on node labels. Each node is assigned a label recording where this node may overlap with others. When computing the overlap for a given node set, we check the label of each node in the set and compute the amount of redundancy among their labels. In Appendix C, we give the details of the label assignment criteria and the overlap computation algorithm.
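For the overlap-free case of Fig. 2(a), Equation 1 reduces to a simple bottom-up accumulation. The sketch below assumes disjoint child subgraphs, i.e. overlap(·) = 0, since the label-based overlap computation is deferred to Appendix C; with shared descendants (the Fig. 2(b) case) it would double-count.

```python
def influence_spreads(children, weight):
    """Bottom-up evaluation of Equation 1 on a DAG, under the
    simplifying assumption that child subgraphs are disjoint, so:
        sigma(v) = sum(sigma(c) for c in out[v]) + w_v

    children: dict node -> list of child nodes
    weight:   dict node -> w_v (the node's own contribution)
    """
    sigma = {}

    def compute(v):
        if v not in sigma:
            # a node's spread depends only on its children, so one
            # bottom-up pass suffices (no per-node BFS is needed)
            sigma[v] = weight[v] + sum(compute(c) for c in children.get(v, []))
        return sigma[v]

    for v in children:
        compute(v)
    return sigma
```

This is the property that lets BUTA replace n breadth-first searches with a single traversal: once a level is finished, every value needed by the level above is already available.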

Inspired by Equation 1, we find it unnecessary to do a BFS from each node. Instead, we can traverse the graph only once in a bottom-up way and obtain the influence spreads of all nodes. Moreover, as the influence spread of nodes at the same level depends only on their child nodes, the computation can be done in parallel using the GPU for acceleration.

Algorithm 2 presents the details of BUTA, where R denotes the number of Monte-Carlo simulations. In each round of simulation, the graph is first reconstructed by selecting edges with a given probability (line 5) and converting the result into a DAG (line 6). Then we start the bottom-up traversal level by level (lines 7-13). We use the "in parallel" construct (line 8) to indicate code that can be executed in parallel on the GPU. The influence spreads of all nodes at the same level are calculated in parallel (line 9), and the label of each node is then determined for future overlap computation (line 10). This process iterates until all nodes have been computed. As G∗ is a DAG, there will be no deadlock at run time. After R rounds of simulation,

Algorithm 2 BUTA
Input: Social network G = (V, E, W), parameter K
Output: The set S of K influential nodes

1: Initialize S = ∅
2: for i = 1 to K do
3:     Set Infl to zero for all nodes in G
4:     for j = 1 to R do
5:         G′ ← randomly select edges under the IC model
6:         G∗ ← convert G′ to a DAG
7:         for each level l from bottom up in G∗ do
8:             for each node v at level l in parallel do
9:                 Compute the influence spread σS(v)
10:                Compute the label L(v)
11:                Infl[v] ← Infl[v] + σS(v)
12:            end for
13:        end for
14:    end for
15:    vmax = argmax_{v ∈ V\S} Infl[v]/R
16:    S = S ∪ {vmax}
17: end for

the node providing the maximal marginal gain is selected and added to the set S (lines 15-16).

The total running time of Algorithm 2 is O(KR(m + m∗)), where m and m∗ denote the number of edges in graphs G and G∗, respectively. The detailed complexity analysis can be found in Appendix D. Consequently, compared with the basic greedy algorithm, which takes O(KRnm) time, and MixGreedy, which takes O(KRTm) time, BUTA greatly reduces the time complexity.

To summarize, the advantages of BUTA are as follows. First, we greatly reduce the time complexity to O(KR(m + m∗)) through DAG conversion and bottom-up traversal. Second, BUTA guarantees better accuracy than MixGreedy, as we compute the influence spread exactly for each node while MixGreedy approximates it with Cohen's algorithm. Last but not least, BUTA is designed with inherent parallelism and can be mapped to the GPU to exploit its parallel processing capability, further reducing the execution time, as we show in Section 3.2.

3.2 Baseline GPU Implementation

In this subsection, we first describe the graph data structure used in this work, and then present the baseline implementation of IMGPU in detail. In Appendix E, we give a short introduction to the internal architecture of the GPU and briefly describe the corresponding CUDA programming model.

3.2.1 Data Representation

To implement IMGPU over the GPU architecture, the traditional adjacency-matrix representation is not a good choice, especially for large-scale social networks.


Fig. 3. Graph data representation.

The reasons are twofold. First, it costs n × n memory space, which significantly restricts the size of social network that can be handled by the GPU. Second, the latency of data transfer from host to device, as well as of global memory access, is high, degrading the overall performance. Therefore, we use the compressed sparse row (CSR) format, which is widely used for sparse matrix representation [3]. As depicted in Fig. 3, the graph is organized into arrays. The edgeOut array consists of the concatenated adjacency lists of all the nodes. Each element in the nodeOut array stores the start index pointing to the edges outgoing from that node in the edgeOut array. The CSR representation requires n + 1 + m memory space in total.
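The nodeOut/edgeOut layout can be built with a counting pass and a prefix sum, as sketched below; the edge-list input format is an illustrative assumption.

```python
def build_csr(n, edges):
    """Build the CSR arrays described above for a graph with nodes 0..n-1.

    edges: list of directed (u, v) pairs.
    Returns (node_out, edge_out): node_out has n + 1 entries, so node i's
    out-edges occupy edge_out[node_out[i] : node_out[i + 1]], and the two
    arrays together use n + 1 + m slots.
    """
    # count the outdegree of each node
    counts = [0] * n
    for u, _ in edges:
        counts[u] += 1
    # prefix-sum the counts to obtain the start offsets
    node_out = [0] * (n + 1)
    for i in range(n):
        node_out[i + 1] = node_out[i] + counts[i]
    # scatter each edge's target into its node's slice of edge_out
    edge_out = [0] * len(edges)
    cursor = list(node_out[:n])
    for u, v in edges:
        edge_out[cursor[u]] = v
        cursor[u] += 1
    return node_out, edge_out
```

Because each adjacency list is a contiguous slice, neighboring threads reading neighboring lists touch consecutive memory addresses, which is what makes CSR friendlier to GPU memory coalescing than an adjacency matrix.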

3.2.2 Baseline Implementation

In our CUDA implementation, the graph data is first transferred to the global memory of the GPU. Then we assign one thread to each node to run the influence spread computation kernel. All threads are divided into a set of blocks and scheduled onto GPU stream processors to execute in parallel. The influence spread computation kernel works iteratively by level. In each iteration, the influence spreads of nodes at the same level are calculated in parallel by different stream processors. The iteration ends when all influence spread computations are finished. This process runs R times to select the node that provides the maximal average influence spread. In this way, the parallel processing capability of the GPU is exploited to accelerate influence maximization. In Appendix E, we illustrate a figure that depicts the above CUDA implementation of BUTA on the GPU architecture.

We adopt the fast random graph generation algorithm PPreZER [18] to select edges at random and the CUDA-based Forward-Backward algorithm [1] to find the SCCs of the selected graph in parallel on the GPU. Afterwards, the kernel that computes the influence spread runs iteratively until all nodes are visited. Algorithm 3 shows the influence spread kernel written in CUDA. visited is a boolean array identifying whether a node has been visited or not; unfinished is an array recording the nodeID of the first child node whose

Algorithm 3 Influence Spread Computation Kernel
1:  int tid = Thread ID;
2:  if (visited[tid] == false) {
3:      int index = unfinished[tid];
4:      int num_out = nodeOut[tid + 1] - nodeOut[tid];
5:      int *outpoint = &edgeOut[nodeOut[tid]];
6:      for (; index < num_out; index++) {
7:          if (visited[outpoint[index]] == false) {
8:              unfinished[tid] = index;
9:              break;
10:     }}
11:     if (index == num_out) {
12:         Infl[tid] = weight[tid];
13:         for (int i = 0; i < num_out; i++) {
14:             Infl[tid] += Infl[outpoint[i]];}
15:         int overlapNo = overlap(Infl, label);
16:         Infl[tid] -= overlapNo;
17:         visited[tid] = true;
18: }}

influence spread computation is not yet finished, and Infl[tid] denotes the influence spread of node tid. Lines 2-11 decide whether the thread is ready for influence computation. This is done by checking the visited flags of all its child nodes. We use the unfinished array to record the first unfinished child node, so that we can avoid repeated checks on already finished child nodes. If all the child nodes have been visited, lines 12 to 16 compute the influence spread for each qualified thread in parallel according to Equation 1. Finally, visited[tid] is set to true, denoting that the node has been visited (line 17).

In short, the influence spread kernel works iteratively by level, and nodes at the same level take part in execution in parallel during each iteration. The parallelism depends on the number of nodes at each level. The worst case happens when the graph is extremely linear, which results in one node being processed per iteration. According to tests over real-world social networks, however, such a case never happens. As modern social networks are typically large in scale and richly interconnected, much parallelism can be exploited to reduce execution time.

Note that the baseline implementation is far from optimal. The actual parallelism is limited by the number of nodes at each level. Moreover, execution path divergence occurs when threads in a warp process nodes belonging to different levels or nodes with different degrees. All these factors directly affect the performance. In Section 4 we describe how we tackle these challenges to pursue better performance.

4 GPU-ORIENTED OPTIMIZATION

In this section, we analyze the factors that affect the performance of the baseline GPU implementation and provide three effective optimizations to achieve better


Fig. 4. Node number at different levels (x-axis: level number; y-axis: number of nodes; datasets: NetPHY, NetHEPT).

performance. First, we reorganize the graph data by levels and degrees to minimize potential divergence. Second, we combine the influence spread computation for nodes of multiple levels together for better parallelism. Third, we unroll the memory access loop to coalesce accesses to consecutive memory addresses.

4.1 Data Reorganization

As previously mentioned, execution path divergence may lead to serious performance degradation. In this paper, BUTA executes level by level in a bottom-up way. Threads in a warp are responsible for processing different nodes. However, due to the SIMT feature of the GPU, threads in a warp execute the same instruction at each clock cycle. Consequently, if threads in a warp are assigned to process nodes at different levels (lines 2 and 11 in Algorithm 3), divergence will occur and induce different execution paths, which will significantly degrade the performance.

In addition, during BUTA execution, threads have to obtain the visit information and the influence spreads of their child nodes (lines 6 and 13 in Algorithm 3). As the degrees of nodes in real-world social networks mainly follow a power-law distribution, there may exist great disparity between the degrees of different nodes. Therefore, the workloads of influence spread computation for different nodes may vary widely, and a thread that processes a node with large degree will usually block the other threads in the same warp from proceeding.

Such divergence severely reduces the utilization of GPU cores and degrades the performance. To address these issues, we reorganize the graph by pre-sorting it by level and degree, so that threads in a warp process nodes that are at the same level and of similar degree as much as possible. In this way, threads in the same warp mostly follow the same execution path, minimizing the possible divergence.

However, since edges are selected at random during each Monte-Carlo simulation, a node may have different degrees and belong to different levels across simulations. A naive solution is therefore to sort the selected graph after each simulation to ensure the least divergence. Obviously, such a method is extremely time-consuming and unrealistic to implement. Fortunately, we observe from experiments on real-world datasets that the probability that nodes with the same degree (resp. level) in the original graph still have the same degree (resp. level) in the randomly selected graph is as high as 95% (resp. 88%). Therefore, we can make threads in the same warp process nodes at the same level and with similar degree by pre-sorting the original graph, even though the level and degree of a node in the randomly selected graph may be inconsistent with those in the original graph. To this end, we reorganize the original graph data by pre-sorting the nodes by their levels and degrees, where level is the primary key and outdegree is the secondary key. Experimental results in Section 5 validate the effectiveness of this optimization.

Fig. 5. K-level combination.
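A sketch of this pre-sorting step, assuming an ascending sort on both keys (the paper specifies the keys but not the sort direction, and the function name is ours):

```python
def reorganize(nodes, level, outdegree):
    """Pre-sort node IDs so that consecutive threads, which map to
    consecutive array slots, handle nodes at the same level and of
    similar degree, minimizing warp divergence.

    `level` and `outdegree` map each node ID to its value in the
    ORIGINAL graph; the sort is performed once, offline, because the
    levels and degrees in each randomly selected graph mostly agree
    with the original ones."""
    return sorted(nodes, key=lambda v: (level[v], outdegree[v]))
```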

4.2 Adaptive K-Level Combination

The baseline IMGPU implementation computes the influence spreads of nodes from the bottom up by level, and thereby its parallelism is limited by the number of nodes at each level. We benefit more when there are sufficient nodes belonging to the same level to be processed; otherwise the parallel processing capability of the GPU is underexploited. In most cases, there is adequate parallelism to exploit, since real-world social networks are typically of large scale. However, there do exist particular levels which contain only a small number of nodes, due to the inherent graph irregularity of social networks. Fig. 4 depicts the statistics of the number of nodes at different levels for two real-world datasets, NetHEPT and NetPHY. We can clearly see that the number of nodes varies greatly between levels; more than sixty percent of levels have fewer than 100 nodes. Apparently, the parallel computation capability of the GPU cannot be fully utilized for these levels.

In order to exploit more parallelism and make full use of the GPU cores, we propose an adaptive K-level combination optimization. We logically combine K adjacent levels with small numbers of nodes into a virtual level and compute their influence spreads concurrently. Fig. 5 exemplifies this approach. There are seven nodes at level 0, while the numbers of nodes at levels 1, 2 and 3 are relatively small. Consequently, we logically combine levels 1, 2 and 3 into a virtual level 1′. After the combination, we can compute the influence spreads for nodes 0, 1, ..., 7 concurrently to pursue higher parallelism.

Fig. 6. Degree distribution of Twitter dataset. (Figure: number of nodes versus degree on a log-log scale; only the caption and axis names are preserved here.)

The K-level combination, though providing a good way to explore better parallelism, introduces another problem: we cannot directly compute the influence spread for a node at an upper level (e.g., node 0 in Fig. 5) according to Equation 1, because the influence spreads of its child nodes (nodes 1, 2 and 3 in this example) may be as yet unknown. To address this problem, we divide the influence spread for nodes at upper levels into two parts. The first part is the number of nodes under the critical level, where the critical level is the level that is going to be visited next (e.g., level 1 in Fig. 5). Since the influence spreads of nodes under the critical level have already been computed, this part can be easily computed according to Equation 1. The second part is the number of nodes above the critical level. For this part, we propose to perform BFS from the target node down to the critical level and count the number of reachable nodes. The influence spread of a node at an upper level is thus the sum of these two parts. In Fig. 5, since levels 1, 2 and 3 are logically combined into level 1′, when computing the influence spread of node 0 we first perform BFS from node 0 and find 8 nodes above level 1. Then, based on Equation 1, we compute the number of nodes under level 1 and get 7 nodes. The influence spread of node 0 is thus 15 in total.
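The two-part computation might be sketched as follows (hypothetical names; we assume the BFS counts the target node itself and descends only through nodes at or above the critical level, and that the already-computed below-critical part, the paper's Equation 1 term, is supplied by the caller):

```python
from collections import deque

def spread_with_combination(node, children, level, critical_level, below_count):
    """Influence spread of `node` when the levels below it were combined.

    Part 1: nodes reachable at or above the critical level, counted by
    BFS (their spreads are not yet available).
    Part 2: nodes reachable below the critical level, whose spreads were
    computed in earlier iterations; the caller passes their total as
    `below_count`, standing in for the Equation 1 term."""
    seen, queue, above = {node}, deque([node]), 0
    while queue:
        v = queue.popleft()
        above += 1  # the target node itself is counted here as well
        for c in children[v]:
            if c not in seen and level[c] >= critical_level:
                seen.add(c)
                queue.append(c)
    return above + below_count
```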

Fig. 7. Child node distribution of five large nodes of the Twitter dataset. (Figure: child node IDs, up to roughly 10^7, for five high-degree nodes; only the caption and axis names are preserved here.)

It is important to note that through such an optimization we obtain better parallelism, because nodes at K adjacent levels can be processed concurrently. However, this is gained at the cost of additional time spent on BFS counting. The parameter K thus stands as a tradeoff between parallelism and extra counting time. If K is too small, there may still be room for mining parallelism. On the other hand, if K is too large, although more parallelism can be exploited, the extra time cost for BFS will eventually overwhelm the gained efficiency. To achieve the best performance, the parameter K is adaptively set for different graphs according to Equation 2:

$$K = \arg\max_{K} \Big\{\, \sum_{i=l_0}^{l_0+K} \mathrm{nodeNO}(i) \le 256 \,\Big\} \qquad (2)$$

where l_0 denotes the critical level and nodeNO(i) denotes the number of nodes at level i. In other words, K is adaptively set to the maximal number of adjacent levels such that the total number of nodes at these K levels is less than 256, which is half the number of stream processors of the GPU used in this work. As tested in the experiments, the K-level combination optimization can significantly improve the parallelism, especially for large-scale social networks.
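Read directly, Equation 2 yields a simple greedy scan over the level sizes. A sketch follows (the function name, and the rule that the critical level is processed even if it alone exceeds the budget, are our assumptions):

```python
def adaptive_k(node_counts, l0, budget=256):
    """Largest K such that levels l0 .. l0+K together hold at most
    `budget` nodes (half the stream processor count of the GPU used in
    the paper). `node_counts[i]` is the number of nodes at level i."""
    total, k = 0, -1
    for i in range(l0, len(node_counts)):
        if total + node_counts[i] > budget:
            break
        total += node_counts[i]
        k += 1
    # Assumption: the critical level must be processed regardless,
    # so K never goes below 0.
    return max(k, 0)
```

Returning K = 2 for level sizes [10, 20, 30] starting at l0 means levels l0, l0+1 and l0+2 are combined into one virtual level.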

4.3 Memory Access Coalescence

As described in lines 13-14 of Algorithm 3, when we compute the influence spread of a node, the thread needs to access the influence spreads of all the child nodes. For nodes with large degree, this results in a large number of memory accesses and a correspondingly long execution time. Such nodes, though accounting for a small percentage of the entire graph, substantially exist in many real-world social networks. As depicted in Fig. 6, the degree distribution of Twitter users is highly skewed. Although 90.24% of Twitter users have a degree under 10, there do exist 1089 nodes with a degree larger than 10K, producing an enormous quantity of memory accesses.

Fig. 8. Influence spread of different algorithms on four real-world datasets. (Figure: four panels, (a) NetHEPT, (b) NetPHY, (c) Amazon, (d) Twitter, plotting influence spread versus the number of seeds for MixGreedy, IMGPU/BUTA_CPU, IMGPU_O, ESMCE, PMIA and Random; only the caption, panel labels and axis names are preserved here.)

Fortunately, these memory accesses, though large in amount, show good spatial locality after data reorganization. Fig. 7 depicts the child node distribution of five extra-large nodes in Twitter. We find that a large percentage of the target memory addresses are relatively consecutive, so we can unroll the memory access loop to coalesce these memory accesses. The loop is unrolled by a predefined step length. We tested the performance of loop unrolling under different step lengths (4, 8, 16, 32, 64 and 128), and 16 was finally selected for its best result. In such a way, first, 16 memory accesses are requested in each loop. Due to the spatial locality of the child node distribution, these memory accesses can be coalesced to reduce the execution time. Second, loop unrolling reduces the execution of the conditional statement (line 13 in Algorithm 3). This optimization, though simple in principle, can significantly improve the performance, especially on large-scale datasets. Our experimental results show that such a method significantly reduces the execution time.
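The unrolled loop can be illustrated, purely conceptually, as a chunked sequential scan: the loop test runs once per chunk of 16 rather than once per access, and each chunk reads 16 consecutive entries, which is the access pattern the GPU hardware can coalesce. The actual benefit only materializes in the CUDA kernel; the names below are ours:

```python
UNROLL = 16  # step length chosen empirically in the paper

def sum_child_spreads(child_ids, spread):
    """Mirror of the unrolled GPU loop: children are consumed in chunks
    of UNROLL, so the loop condition is evaluated once per 16 accesses,
    and the 16 reads target (mostly) consecutive addresses after data
    reorganization."""
    total, n = 0, len(child_ids)
    full = n - n % UNROLL
    for base in range(0, full, UNROLL):  # unrolled body: no per-access test
        for j in range(UNROLL):
            total += spread[child_ids[base + j]]
    for i in range(full, n):             # remainder loop for the tail
        total += spread[child_ids[i]]
    return total
```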

5 EXPERIMENTS

5.1 Experimental Setup

In our experiments, we use traces of four real-world social networks of different scales and types: NetHEPT [14], NetPHY [14], Amazon [15] and Twitter [22]. Table 1 summarizes the statistics of the datasets. We compare IMGPU and its optimized version IMGPU_O with two existing greedy algorithms and two heuristic algorithms: MixGreedy [4], ESMCE [17], PMIA [5] and Random. Moreover, we also implement a CPU-based version of BUTA, referred to as BUTA_CPU, to evaluate the performance of BUTA and the effect of parallelization. Detailed descriptions of the datasets and algorithms can be found in Appendix F.

We examine two metrics, influence spread and running time, to evaluate the accuracy and efficiency of the different algorithms. The experiments are carried out on a PC equipped with an NVIDIA GTX 580 GPU and an Intel Core i7 920 CPU.

5.2 Performance Analysis

We first evaluate the performance of the different algorithms in terms of influence spread and running time on real-world social networks. Then we examine the effectiveness of the different optimization methods. To study the generality and scalability of the algorithms, we also test their performance on synthetic networks generated by GTgraph [2] following Random and Power-law distributions. Details can be found in Appendix G.

TABLE 1
Statistics of Real-world Datasets

Datasets   Types        Nodes        Edges        Avg. Degree
NetHEPT    Real-world   15,233       32,235       4.23
NetPHY     Real-world   37,154       180,826      9.73
Amazon     Real-world   262,111      1,234,877    9.42
Twitter    Real-world   11,316,811   85,331,846   15.08

5.2.1 Influence Spread on Real-World Datasets

Influence spread directly indicates how large a portion of the network is finally influenced. A larger influence spread represents a better approximation to influence maximization, i.e., better accuracy. We test the influence spreads of the different algorithms and heuristics under the IC model with propagation probability p = 0.01. To obtain an accurate estimate of σ(S), we execute 20000 runs for each possible seed set S. The seed set size K varies from 1 to 50. This experiment is carried out over the four real-world datasets, and the results are shown in Fig. 8.
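The estimation of σ(S) described above is the standard Monte-Carlo simulation of the IC model; a simplified sequential sketch (our own illustration, not the paper's GPU code):

```python
import random

def estimate_sigma(adj, seeds, p=0.01, runs=20000, rng=None):
    """Monte-Carlo estimate of the influence spread sigma(S) under the
    IC model: in each run every edge is live independently with
    probability p, and the spread is the number of nodes reachable from
    the seeds over live edges; the result is averaged over `runs`."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            v = frontier.pop()
            for w in adj.get(v, ()):
                if w not in active and rng.random() < p:
                    active.add(w)
                    frontier.append(w)
        total += len(active)
    return total / runs
```

With p = 0.01 and 20000 runs per candidate seed set, this simulation cost is exactly what makes the sequential greedy algorithms impractical at scale.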

Fig. 8(a) depicts the experimental result of influence spread on NetHEPT. Note that NetHEPT has only 15K nodes, the smallest among all datasets. In accordance with expectation, the BUTA-series algorithms (IMGPU, IMGPU_O and BUTA_CPU) perform well and their accuracies match that of MixGreedy, producing the best results at all values of K. This result validates the proposed BUTA algorithm. The performance of ESMCE is slightly lower than that of IMGPU in terms of influence spread; the influence spread of ESMCE is 4.53% less than that of IMGPU when K is 50. This is mainly because ESMCE estimates the influence spread by supervised sampling, which affects the accuracy. Meanwhile, Random performs the worst among all tested algorithms (as much as 42.83% less than IMGPU), because it does not consider influence spread when selecting seeds. Compared with Random, PMIA performs much better; however, the influence spread of PMIA is still 3.2% lower than that of IMGPU on average.

Fig. 8(b) depicts the influence spread on NetPHY.


Fig. 9. Running time on four real-world datasets. (Figure: execution time in seconds on a logarithmic scale for MixGreedy, BUTA_CPU, ESMCE, IMGPU, IMGPU_O, PMIA and Random; only the caption and axis names are preserved here.)

In this case, IMGPU, IMGPU_O and BUTA_CPU slightly outperform MixGreedy by 1.5% on average. The reason is that MixGreedy estimates the influence spread, which inevitably induces errors, while IMGPU calculates the influence spread accurately. Compared with IMGPU, ESMCE is 3.35% lower on average for K from 1 to 50. Similar to NetHEPT, Random still produces the worst influence spread, which is only 15.90% of that of IMGPU. This clearly demonstrates that it is important to carefully select the influential nodes so as to obtain the largest influence spread. The gap between PMIA and IMGPU grows larger (3.2% and 5.4% on NetHEPT and NetPHY, respectively) with the size of the social network, showing that heuristics perform particularly poorly on large-scale networks.

For the case of Amazon (262K nodes and 1.23M edges), as shown in Fig. 8(c), the accuracies of the BUTA-series algorithms are slightly better than that of MixGreedy. As expected, the BUTA-series algorithms perform much better than ESMCE, PMIA and Random (by 4.18%, 20.82% and 256.38% on average, respectively). We can see that ESMCE clearly outperforms the PMIA heuristic in this case, because ESMCE controls the estimation error with a specified error threshold and iteratively samples more nodes until the specified accuracy is achieved.

Finally, we carry out experiments on Twitter. This dataset, containing more than 11M nodes and 85M edges, is so far the largest dataset tested in influence maximization experiments. The result is presented in Fig. 8(d). MixGreedy and BUTA_CPU become infeasible to execute due to the unbearably long running time. The disparity between IMGPU and ESMCE stays around 4% when the number of seeds K is larger than 10. On the other hand, IMGPU performs substantially better than PMIA and Random, by as much as 42.25% and 368.21% respectively when K is 50. This clearly demonstrates the effectiveness of IMGPU in terms of influence spread.

5.2.2 Running Time on Real-World Datasets

Fig. 9 reports the execution time of the different approaches for selecting 50 seeds on the four datasets, by which we examine the efficiency. Note that the overhead of DAG conversion is relatively low, taking only 1.74% and 3.7% of the execution time of IMGPU on NetHEPT and Amazon, respectively. The execution times of IMGPU and IMGPU_O include this preprocessing time. The y-axis of Fig. 9 is in logarithmic scale.

As we pointed out, MixGreedy and BUTA_CPU can only run on NetHEPT, NetPHY and Amazon. Furthermore, MixGreedy has the longest running time; it takes as much as 1.93 × 10³ seconds on NetHEPT and 2.02 × 10⁴ seconds on NetPHY. Compared with MixGreedy, BUTA_CPU performs much better, accelerating the influence computation by 2.58×, 6.49× and 8.07×, respectively. These results are in accordance with the complexity analysis and demonstrate the efficiency of BUTA.

ESMCE utilizes supervised sampling to estimate the influence spread, thus greatly reducing its running time compared with MixGreedy. As shown in Fig. 9, the execution time of ESMCE is only 10.21% and 3.73% of that of MixGreedy on NetHEPT and Amazon.

IMGPU and IMGPU_O perform significantly better in comparison with MixGreedy. IMGPU achieves 9.69×, 15.61× and 46.87× speedups on NetHEPT, NetPHY and Amazon, respectively. The optimizations of IMGPU_O work effectively and produce as much as 13.39×, 19.00× and 60.14× speedups, respectively. Moreover, IMGPU and IMGPU_O show better scalability; they are able to handle larger datasets, such as Twitter, which is definitely beyond the capability of MixGreedy. Compared with ESMCE, the efficiency of IMGPU is slightly lower on NetHEPT and NetPHY. However, IMGPU clearly outperforms ESMCE on large-scale datasets such as Amazon and Twitter. The reason is that large-scale networks contain more inherent parallelism, so the parallel processing capability of the GPU can be fully exploited for acceleration.

The two heuristic approaches, Random and PMIA, run much faster. However, according to our accuracy evaluation, they show poor performance in terms of influence spread. This drawback severely limits their application to real-world social networks.

5.2.3 Evaluation of Optimization Methods

To evaluate the effects of the proposed optimization methods, we test the three methods separately. We denote them as follows:

• IMGPU_K: IMGPU with K-level combination.
• IMGPU_R: IMGPU with data reorganization.
• IMGPU_U: IMGPU with memory access coalescence optimization.

The experiments are performed on the four real-world social networks. All speedups are relative to the running time of the baseline IMGPU. Fig. 10 presents the experimental results. IMGPU_K performs much better on large-scale networks, such as Amazon (1.12×) and Twitter (1.21×), because small networks contain relatively few levels and leave little room to combine the computation. On the contrary, IMGPU_R achieves greater performance gains on networks of small scale; the speedup on Twitter is 1.08×, while that on the other three datasets is 1.15× on average. This is because, although we can minimize possible divergence by reorganizing the graph, the gain comes at the cost of pre-sorting nodes, which consumes much time on large-scale networks. IMGPU_U shows substantial improvement on Twitter (1.33×) because Twitter contains many nodes with large numbers of child nodes. Threads processing these "hub" nodes need frequent memory accesses, which degrades the overall performance; IMGPU_U significantly reduces this cost by coalescing accesses to consecutive memory addresses. As these optimization methods are orthogonal in nature, integrating them together lets IMGPU_O achieve a cumulative speedup and thus significantly outperform the baseline IMGPU.

Fig. 10. Effects of different optimizations. (Figure: speedup of IMGPU_K, IMGPU_R, IMGPU_U and IMGPU_O over the baseline IMGPU on the four datasets; only the caption and axis names are preserved here.)

6 CONCLUSIONS

In this paper, we present IMGPU, a novel framework that accelerates influence maximization by taking advantage of the GPU. In particular, we design a bottom-up traversal algorithm, BUTA, which greatly reduces the computational complexity and contains inherent parallelism. To adaptively fit BUTA to the GPU architecture, we also explore three effective optimizations. Extensive experiments demonstrate that IMGPU significantly reduces the execution time of the existing sequential influence maximization algorithm while maintaining satisfying influence spread. Moreover, IMGPU shows better scalability and is able to scale up to large-scale social networks.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their helpful comments. This research was supported by KJ-12-06 and NSFC under grants No. 61272483, 61272056, 61272482 and 61170285.

REFERENCES

[1] J. Barnat, P. Bauch, L. Brim, and M. Ceska, "Computing Strongly Connected Components in Parallel on CUDA," Proc. IEEE 25th Int'l Parallel and Distributed Processing Symposium (IPDPS), pp. 544-555, 2011.
[2] D. Bader and K. Madduri, "GTgraph: A Suite of Synthetic Graph Generators," http://www.cse.psu.edu/~madduri/software/GTgraph/, 2006, accessed Nov. 2012.
[3] N. Bell and M. Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA," NVIDIA Technical Report NVR-2008-04, Dec. 2008.
[4] W. Chen, Y. Wang, and S. Yang, "Efficient Influence Maximization in Social Networks," Proc. ACM Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), pp. 199-208, 2009.
[5] W. Chen, C. Wang, and Y. Wang, "Scalable Influence Maximization for Prevalent Viral Marketing in Large-Scale Social Networks," Proc. ACM Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), pp. 1029-1038, 2010.
[6] E. Cohen, "Size-Estimation Framework with Applications to Transitive Closure and Reachability," J. Comput. Syst. Sci., vol. 55, no. 3, pp. 441-453, 1997.
[7] P. Domingos and M. Richardson, "Mining the Network Value of Customers," Proc. ACM Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), pp. 57-66, 2001.
[8] A. Goyal, W. Lu, and L. V. S. Lakshmanan, "CELF++: Optimizing the Greedy Algorithm for Influence Maximization in Social Networks," Proc. Int'l Conf. World Wide Web (WWW), pp. 47-48, 2011.
[9] P. Harish and P. J. Narayanan, "Accelerating Large Graph Algorithms on the GPU Using CUDA," Proc. High Performance Computing (HiPC), pp. 197-208, 2007.
[10] Q. Jiang, G. Song, G. Cong, Y. Wang, W. Si, and K. Xie, "Simulated Annealing Based Influence Maximization in Social Networks," Proc. 25th AAAI Conf. Artificial Intelligence (AAAI), pp. 127-132, 2011.
[11] K. Jung, W. Heo, and W. Chen, "IRIE: A Scalable Influence Maximization Algorithm for Independent Cascade Model and Its Extensions," CoRR arXiv preprint arXiv:1111.4795, http://arxiv.org/abs/1111.4795/, pp. 1-20, 2011.
[12] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the Spread of Influence through a Social Network," Proc. ACM Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), pp. 137-146, 2003.
[13] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a Social Network or a News Media?," Proc. Int'l Conf. World Wide Web (WWW), pp. 591-600, 2010.
[14] J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations," Proc. ACM Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), pp. 177-187, 2005.
[15] J. Leskovec, L. Adamic, and B. Huberman, "The Dynamics of Viral Marketing," ACM Trans. Web, vol. 1, no. 1, Article 5, 2007.
[16] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance, "Cost-effective Outbreak Detection in Networks,"
[17] X. Liu, S. Li, X. Liao, L. Wang, and Q. Wu, "In-Time Estimation for Influence Maximization in Large-Scale Social Networks," Proc. ACM EuroSys Workshop on Social Network Systems, pp. 1-6, 2012.
[18] S. H. Nobari, X. Lu, P. Karras, and S. Bressan, "Fast Random Graph Generation," Proc. 14th Int'l Conf. Extending Database Technology (EDBT), pp. 331-342, 2011.
[19] M. Richardson and P. Domingos, "Mining Knowledge-Sharing Sites for Viral Marketing," Proc. ACM Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), pp. 61-70, 2002.
[20] V. Vineet, P. Harish, S. Patidar, and P. J. Narayanan, "Fast Minimum Spanning Tree for Large Graphs on the GPU," Proc. High Performance Graphics Conference (HPG), pp. 167-171, 2010.
[21] Y. Wang, G. Cong, G. Song, and K. Xie, "Community-based Greedy Algorithm for Mining Top-K Influential Nodes in Mobile Social Networks," Proc. ACM Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), pp. 1039-1048, 2010.
[22] R. Zafarani and H. Liu, "Social Computing Data Repository at ASU," http://socialcomputing.asu.edu/, 2009, accessed Nov. 2012.


Xiaodong Liu received the BS degree in 2007 and the MS degree in 2009, both from the School of Computer Science, National University of Defense Technology, Changsha, China. He is currently a PhD student at National University of Defense Technology. His main research interests include parallel computing, social network analysis and large-scale data mining. He is a student member of the ACM.

Mo Li received the BS degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 2004, and the PhD degree from the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. He is currently an assistant professor in the School of Computer Engineering, Nanyang Technological University, Singapore. He won the ACM Hong Kong Chapter Professor Francis Chin Research Award in 2009 and the Hong Kong ICT Award Best Innovation and Research Grand Award in 2007. His research interests include sensor networking, pervasive computing, and mobile and wireless computing. He is a member of the IEEE and ACM.

Shanshan Li received the MS and PhD degrees from the School of Computer Science, National University of Defense Technology, Changsha, China, in 2003 and 2007, respectively. She was a visiting scholar at Hong Kong University of Science and Technology in 2007. She is currently an assistant professor in the School of Computer, National University of Defense Technology. Her main research interests include distributed computing, social networks and data center networks. She is a member of the IEEE and ACM.

Shaoliang Peng received the PhD degree from the School of Computer Science, National University of Defense Technology, Changsha, China, in 2008. He is currently an assistant professor at National University of Defense Technology. His research interests include distributed systems, high performance computing, cloud computing and wireless networks. He is a member of the IEEE and ACM.

Xiangke Liao received his BS degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 1985, and his MS degree from National University of Defense Technology, Changsha, China, in 1988. He is currently a full professor and the dean of the School of Computer, National University of Defense Technology. His research interests include parallel and distributed computing, high-performance computer systems, operating systems, cloud computing and networked embedded systems. He is a member of the IEEE and ACM.

Xiaopei Lu received his BS degree from Tsinghua University, Beijing, China, in 2006, and the MS degree from National University of Defense Technology, Changsha, China, in 2008. He is currently a PhD student at the School of Computer Science, National University of Defense Technology. His research interests include distributed systems, wireless networks, and high-performance computer systems. He is a student member of the IEEE.
