
Fast Incremental and Personalized PageRank

Bahman Bahmani∗
Stanford University, [email protected]

Abdur Chowdhury
Twitter Inc., [email protected]

Ashish Goel†
Stanford University, [email protected]

ABSTRACT

In this paper, we analyze the efficiency of Monte Carlo methods for incremental computation of PageRank, personalized PageRank, and similar random walk based methods (with focus on SALSA), on large-scale dynamically evolving social networks. We assume that the graph of friendships is stored in distributed shared memory, as is the case for large social networks such as Twitter.

For global PageRank, we assume that the social network has n nodes, and m adversarially chosen edges arrive in a random order. We show that with a reset probability of ε, the expected total work needed to maintain an accurate estimate (using the Monte Carlo method) of the PageRank of every node at all times is O(n ln m/ε²). This is significantly better than all known bounds for incremental PageRank. For instance, if we naively recompute the PageRanks as each edge arrives, the simple power iteration method needs Ω(m²/ln(1/(1 − ε))) total time and the Monte Carlo method needs O(mn/ε) total time; both are prohibitively expensive. We also show that we can handle deletions equally efficiently.

We then study the computation of the top k personalized PageRanks starting from a seed node, assuming that personalized PageRanks follow a power-law with exponent α < 1. We show that if we store R > q ln n random walks starting from every node for a large enough constant q (using the approach outlined for global PageRank), then the expected number of calls made to the distributed social network database is O(k/R^((1−α)/α)). We also present experimental results from the social networking site, Twitter, verifying our assumptions and analyses. The overall result is that this algorithm is fast enough for real-time queries over a dynamic social network.

∗ Work done while interning at Twitter.
† Work done while at Twitter. This research was supported in part by NSF award IIS-0904325, and the Army Research Laboratory, and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington.
Proceedings of the VLDB Endowment, Vol. 4, No. 3
Copyright 2010 VLDB Endowment 2150-8097/10/12... $ 10.00.

1. INTRODUCTION

Over the last decade, PageRank [30] has emerged as a very effective measure of reputation for both web graphs and social networks (where it was historically known as eigenvector centrality [12]). Also, collaborative filtering has proved to be a very effective method for personalized recommendation systems [9, 28]. In this paper, we will focus on fast incremental computation of (approximate) PageRank, personalized PageRank [14, 30], and similar random walk based methods, particularly SALSA [22] and personalized SALSA [29], over dynamic social networks, and its applications to reputation and recommendation systems over these networks. Incremental computation is useful when edges in a graph arrive over time, and it is desirable to update the PageRank values right away rather than wait for a batched computation.

Surprisingly, despite the fact that computing PageRank is a well studied problem [5], some simple assumptions on the structure of the network and the data layout lead to dramatic improvements in running time, using the simple Monte Carlo estimation technique.

In large scale web applications, the underlying graph is typically stored on disk, and either edges are streamed [10] or a map-reduce computation is performed. The Monte Carlo method requires random access to the graph, and has not found widespread practical use in these applications.

However, for social networking applications, it is crucial to support random access to the underlying network, since messages flow on edges in a network in real-time, and random access to the social graph is necessary for the core functionalities of the network. Hence, the graph is usually stored in distributed shared memory, which we denote as "Social Store", providing a data access model very similar to the Scalable Hyperlink Store [27]. We use this feature strongly in obtaining our results.

In this introduction, we will first provide some background on PageRank and SALSA and typical approaches to computing them, and then outline our results along with some basic efficiency comparisons. The literature related to the problem studied in this paper is vast, and in the body of the paper, due to space constraints, we only compare our results with some of the main previous results in the literature. A detailed review of the related literature and comparisons with our method is given in Appendix A.


1.1 Background and Related Work

In this paper, we focus on (incremental) computation of PageRank [30], personalized PageRank [14, 30], SALSA [22], and personalized SALSA [29]. So, in this subsection, we provide a quick review of these methods. Here and throughout the paper, we denote the number of nodes and edges in the network by, respectively, n and m.

PageRank is the stationary distribution of a random walk which, at each step, with a certain probability ε jumps to a random node, and with probability 1 − ε follows a randomly chosen outgoing edge from the current node. Personalized PageRank is the same as PageRank, except that all the jumps are made to the seed node for which we are personalizing the PageRanks.

SALSA, just like HITS [17], associates two scores with each node v, called the hub score, h_v, and the authority score, a_v. These scores represent how good a hub or authority each node is. The scores are related as follows:

h_v = Σ_{(v,x)∈E} a_x/indeg(x),    a_x = Σ_{(v,x)∈E} h_v/outdeg(v)

where E is the set of edges of the graph, and indeg(x) and outdeg(v) are, respectively, the in-degrees and out-degrees of the nodes. Notice that SALSA corresponds to a forward-backward random walk, where the walk alternates between forward and backward steps. The personalized version of SALSA that we consider allows random jumps (to the seed node) at forward steps. Thus, personalizing over node u corresponds to the following equations:

h_v = ε·δ_{u,v} + (1 − ε) Σ_{(v,x)∈E} a_x/indeg(x),    a_x = Σ_{(v,x)∈E} h_v/outdeg(v)

Notice that in our setting, hub scores and authority scores can be interpreted, respectively, as similarity measures and relevance measures. Hence, we obtain a simple system for recommending additional friends to a user: just recommend those with the highest relevance. We will now outline two broad approaches to computing PageRank and SALSA; a more detailed overview is presented in Appendix A. The first approach is to use linear algebraic techniques, the simplest of which is the power iteration method. In this method, the PageRank of node v is initialized to π_0(v) = 1/n and the following update is repeatedly performed:

∀v: π_{i+1}(v) = ε/n + Σ_{w:(w,v)∈E} π_i(w)·(1 − ε)/outdeg(w).    (1)

This gives an exponential reduction in error per iteration, resulting in a running time of O(m) per iteration and O(m/ln(1/(1 − ε))) for getting the total error down to a constant. The other broad approach is Monte Carlo, where we directly do random walks to estimate the PageRank of each node. In the simplest instantiation, we do R "short random walks" of geometric lengths (with mean 1/ε) starting at each node. Each short random walk simulates one continuous session by a random surfer who is doing the PageRank random walk. While the error does not decay as fast as in the power iteration method, R = ln(n/ε) or even R = 1 gives provably good results, and has a running time of O(nR/ε), which, for non-sparse graphs, is much better than that of power iteration. SALSA can be computed using obvious modifications to either approach.
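To make the linear algebraic approach concrete, here is a minimal sketch of the power iteration update in Equation (1), assuming the graph is given as an out-adjacency list; the function name, the dangling-node convention, and the parameter values are illustrative choices, not part of the paper.

    def power_iteration_pagerank(out_edges, eps=0.2, iters=50):
        # out_edges: dict mapping each node to the list of its out-neighbors
        nodes = list(out_edges)
        n = len(nodes)
        pi = {v: 1.0 / n for v in nodes}                  # pi_0(v) = 1/n
        for _ in range(iters):
            nxt = {v: eps / n for v in nodes}             # reset mass eps/n
            for w, outs in out_edges.items():
                if outs:
                    share = (1 - eps) * pi[w] / len(outs)
                    for v in outs:
                        nxt[v] += share                   # (1-eps) * pi_i(w) / outdeg(w)
                else:
                    for v in nodes:                       # dangling node: one common
                        nxt[v] += (1 - eps) * pi[w] / n   # convention spreads its mass uniformly
            pi = nxt
        return pi

    # Example on a tiny 3-node graph.
    print(power_iteration_pagerank({1: [2], 2: [1, 3], 3: [1]}))

Each iteration touches every edge once, which is the O(m) per-iteration cost noted above.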

1.2 Our Results

We study two problems. First, efficient incremental computation of PageRank (and its variants) over dynamically evolving networks. Second, efficiently computing personalized PageRank (and its variants) under the power-law network model. Both of these problems are considered to be among the most important problems regarding PageRank [20]. Below, we overview our results in more detail.

We present formal analyses of incremental computation of random walk based reputations and collaborative filters in the context of evolving social networks. In particular, we focus on very efficiently approximating both global and personalized variants of PageRank [30] and SALSA [22]. We perform experiments to validate each of the assumptions made in our analysis, and also study the empirical performance of our algorithms.

For global PageRank, we show that in a network with n nodes, and m adversarially chosen edges arriving in random order, we can maintain very accurate estimates of the PageRank (and authority score) of every node at all times with only O(n ln m/ε²) total expected work. This is a dramatic improvement over the naive running time Ω(m²/ln(1/(1 − ε))) (e.g., by recomputing the PageRanks upon arrival of each edge, using the power iteration method) or Ω(mn/ε) (e.g., using the Monte Carlo method from scratch each time an edge arrives). Similarly, we show that in a network with m edges, upon removal of a random edge, we can update all the PageRank approximations using only O(n/(mε²)) expected work. Our algorithm is a Monte Carlo method [2] that works by maintaining a small number of short random walk segments starting at each node in the social graph. The same approach works for SALSA. For global SALSA, the authority score of a node is exactly its in-degree as the reset probability goes to 0, so the primary reason to store these random walk segments is to aid in computing personalized SALSA scores. It is important to note that it is only the efficiency analysis of our algorithms that depends on the random order assumption; it is also important to note that the random order assumption is weaker than assuming the well-known generative models for power-law networks, such as Preferential Attachment [4].

We then study the problem of finding the k nodes with highest personalized PageRank values (or personalized authority scores). We show that we can use the same building blocks used for global PageRank and SALSA, that is, the stored walk segments at each node, to very efficiently find very accurate approximations for the top k nodes. We prove that, assuming that the personalized scores follow a power-law with exponent 0 < α < 1, if we cache R > q ln n random walk segments starting at every node (for a large enough constant q), then the expected number of calls made to the distributed social network database is O(k/R^((1−α)/α)). This is significantly better than n and even k. Notice that without the power-law assumption, in general, one would have to find the scores of all the nodes and then find the top k results (hence, one would need at least Ω(n) work).

Remark 1. While the algorithms we consider do not depend on the assumption 0 < α < 1, and it is only in our analysis that we use this assumption, it should be noted that all of our analyses go through similarly for α > 1. However, in our experiments, we observed that only 2 percent of the nodes in our network have α > 1. So, we omit the case α > 1 in the analyses and experimental results presented in this paper.

We present the results of the experiments we did to validate our assumptions and analyses. We used the network data from the social networking site Twitter. The access to this data was through a database, called FlockDB, stored in distributed shared memory. Our experiments support our random order assumption on edge arrivals (or at least the specific claim that we need for our results to go through). Also, we observe that not only do the global PageRank scores and in-degrees follow the same power-laws (as previously proved under mild assumptions in the literature [24]), but also the personalized PageRanks follow power-laws with average exponent roughly equal to the exponent for PageRank and in-degree. Finally, our experiments also support the proved theoretical bounds on the number of calls to the social network database.

Random walk based methods have been reported to be very effective for the link prediction problem on social networks [23]. We also did some preliminary experiments to explore this further. The results are presented in Appendix B and indicate that random walk based algorithms (i.e., personalized PageRank and personalized SALSA) significantly outperform HITS as a recommendation system for Twitter users; we present this comparison not as significant original research but as an interesting data point for readers who are interested in practical aspects of recommendation systems.

2. INCREMENTAL COMPUTATION OF PAGERANK

In this section, we explain the details of the Monte Carlo method for approximating the (global) PageRank values. Then, we will prove our surprising result on the total amount of work needed to keep the PageRank estimates updated all the time. Even though we will focus on PageRank, our results will extend (with some minor modifications that will be pointed out) to other random walk based methods, such as SALSA.

2.1 Approximating PageRank

To approximate PageRank, we do R random walks starting at each node of the network. Each of these random walks is continued until its first reset (hence, each one has average length 1/ε). We store all these walk segments in a database, where each segment is stored at every node that it passes through. Assume for each node v, X_v is the total number of times that any of the stored walk segments visit v. Then, we approximate the PageRank of v, denoted by π_v, with:

π̃_v = X_v / (nR/ε)

Then, we have the following theorem:

Theorem 1. π̃_v is sharply concentrated around its expectation, which is π_v.

The fact that E[π̃_v] = π_v is already proved in [2]. The proof of the sharp concentration is straightforward and is presented in Appendix C. The exact concentration bounds follow from the proof. But, to summarize, as pointed out in [2], the obtained approximations are quite good even for R = 1. Thus, the defined π̃_v's are accurate approximations of the actual PageRank values π_v.
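A minimal sketch of this Monte Carlo estimator, assuming an in-memory out-adjacency list; the names and parameter values are illustrative. For brevity the segments are stored only under their start node (the paper also indexes them at every node they pass through), and a walk simply stops at a dangling node.

    import random

    def build_segments_and_estimate(out_edges, R=10, eps=0.2, rng=random.Random(0)):
        visits = {v: 0 for v in out_edges}       # X_v: visits by all stored segments
        segments = {v: [] for v in out_edges}    # R walk segments per start node
        for start in out_edges:
            for _ in range(R):
                seg, u = [start], start
                visits[start] += 1
                # continue until the first reset: expected length 1/eps
                while rng.random() >= eps and out_edges[u]:
                    u = rng.choice(out_edges[u])
                    seg.append(u)
                    visits[u] += 1
                segments[start].append(seg)
        n = len(out_edges)
        pi_tilde = {v: eps * visits[v] / (n * R) for v in visits}   # X_v / (nR/eps)
        return segments, pi_tilde

    segments, pi_tilde = build_segments_and_estimate({1: [2], 2: [1, 3], 3: [1]})
    print(pi_tilde)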

2.2 Updating the Approximations

As the underlying network evolves, the PageRank values of the nodes also change. For a lot of applications, we would like to have updated approximations of the PageRank values all the time. Thus, here we analyze the cost of keeping the approximations updated at all times. To keep the approximations updated, we only need to keep the random walk segments stored at the network nodes updated. Thus, we analyze the amount of work in keeping these walks updated.

We first prove the following proposition:

Proposition 2. Assume (u_t, v_t) is the random edge arriving at time t (1 ≤ t ≤ m). Define M_t to be the number of random walk segments that need to be updated at time t (1 ≤ t ≤ m). Then, we have:

E[M_t] ≤ (nR/ε) · E[π_{u_t} / outdeg_{u_t}(t)]

where the expectation on the right hand side is over the random edge arrival at time t (i.e., E[·] = E_{(u_t,v_t)}[·]), and outdeg_u(t) is the outdegree of node u after t edges have arrived.

Proof. The basic intuition in this proposition is that most random walk segments miss most network edges. More precisely, a walk segment needs to change only if it passes through the node u_t and the random step at u_t picks v_t as the next node in the walk. In expectation, the number of times each walk segment visits u_t is π_{u_t}/ε. For each such visit, the probability for the walk to need a reroute is 1/outdeg_{u_t}(t). Hence, by a union bound, the probability that a walk segment needs an update is at most (π_{u_t}/ε) · (1/outdeg_{u_t}(t)). Also, there is a total of nR walk segments. Therefore, by linearity of expectation:

E[M_t] ≤ Σ_u (nR/ε) · π_u · (1/outdeg_u(t)) · Pr[u_t = u] = (nR/ε) · E[π_{u_t}/outdeg_{u_t}(t)]

which proves the proposition.

In the above proposition, E[π_{u_t}/outdeg_{u_t}(t)] depends on the exact network growth model. The model that we assume here is the random permutation model, in which m adversarially chosen directed edges arrive in random order. Notice that this is a weaker assumption on the network evolution model than most of the popular models, such as the Preferential Attachment model [4]. We also experimentally validate this model, and present the results, confirming that this is actually a very good assumption, later in this paper.

For this model, we have the following lemma:

Lemma 3. If (u_t, v_t) is the edge arriving at time t (1 ≤ t ≤ m) in a random permutation over the edge set, then:

E[π_{u_t}/outdeg_{u_t}(t)] = 1/t

Proof. For random arrivals, we have:

Pr[u_t = u] = outdeg_u(t)/t

Hence,

E[π_{u_t}/outdeg_{u_t}(t)] = Σ_u π_u · (1/outdeg_u(t)) · Pr[u_t = u]
                           = Σ_u π_u · (1/outdeg_u(t)) · (outdeg_u(t)/t)
                           = Σ_u π_u/t = (1/t) Σ_u π_u = 1/t

From Lemma 3 and Proposition 2, we get the following theorem:

Theorem 4. The expected amount of update work as the tth network edge arrives is at most nR/(ε²t), and the expected total amount of work needed to keep the approximations updated over m edge arrivals is at most (nR/ε²) ln m.

Proof. Defining M_t as in Proposition 2, we know from the same proposition that

E[M_t] ≤ (nR/ε) · E[π_{u_t}/outdeg_{u_t}(t)]

Also, from Lemma 3, we know:

E[π_{u_t}/outdeg_{u_t}(t)] = 1/t

Hence,

E[M_t] ≤ (nR/ε) · (1/t)

For each walk segment that needs an update, we can redo the walk starting at the updated node, or even more simply starting at the corresponding source node. So, for each such walk segment, on average we need at most 1/ε work (equal to the average length of the walk segment). Hence, the average work needed at time t is at most (nR/ε²) · (1/t), as stated in the theorem.

Summing up over all time instances, we get that the expected total amount of work that we need to do over m edge arrivals (to keep the approximations updated all the time) is:

(nR/ε²) Σ_{t=1}^{m} 1/t = (nR/ε²) H_m ≤ (nR/ε²) ln m

where H_m is the mth harmonic number. Therefore, the total amount of expected work required is at most (nR/ε²) ln m, which finishes the proof.
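To get a feel for the gap that Theorem 4 closes, the following back-of-the-envelope computation plugs illustrative (made-up) values of n, m, R and ε into the bound and into the two naive per-arrival recomputation costs from Section 1; the numbers are not measurements from the paper.

    import math

    n, m, R, eps = 10**7, 10**8, 10, 0.2

    incremental = n * R * math.log(m) / eps**2      # Theorem 4: (nR/eps^2) ln m
    power_iter  = m**2 / math.log(1 / (1 - eps))    # naive power iteration per arrival
    monte_carlo = m * n / eps                       # naive Monte Carlo per arrival, R = O(1)

    print(f"incremental (Theorem 4) : {incremental:.2e}")
    print(f"naive power iteration   : {power_iter:.2e}")
    print(f"naive Monte Carlo       : {monte_carlo:.2e}")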

The above theorem bounds the amount of update work as new edges arrive. Similarly, we can show that we can very efficiently handle edges leaving the graph:

Proposition 5. When the network has m edges, if a randomly chosen edge leaves the graph, then the expected amount of work necessary to update the walk segments is at most nR/(mε²).

Proof. If M is the number of walk segments that need to be updated, and (u*, v*) is the random edge leaving the network, then exactly as in Proposition 2, one can see that E[M] ≤ (nR/ε) · E[π_{u*}/outdeg_{u*}], and exactly as in Lemma 3, one can see that E[π_{u*}/outdeg_{u*}] = 1/m. Finally, as in Theorem 4, the proof is completed by noticing that for each walk segment needing an update, the expected amount of work is 1/ε.

The result in Theorem 4 (and similarly Proposition 5) is quite surprising (at least to the authors). Hence, we will now discuss various aspects of it. First, notice that the amount of work needed to keep the approximations updated at all times is only logarithmically larger than the cost to initialize the approximations (i.e., nR/ε). Also, it is clear that the marginal update cost for not-so-early edges (say, edges after the first Ω(n) arrivals) is so small that we can do the updates in real time, even per social interaction (e.g., clicks, etc.).

Also, notice that in applications over networks such as the web graph, we cannot do the real-time updates. That is because there is no way to figure out the changes in the network, other than recrawling the network, which is not feasible in real time. Also, random access to edges in the network is expensive. However, in social networking applications, the network and all the events happening on it are always available and visible to the network provider. Therefore, it is indeed possible to do the real-time updates, and hence this method is very well suited for social networking applications.

It should be mentioned that the update cost analyzed above is the extra cost due to updating the PageRank approximations. In other words, as a new edge is added to the network, it should be added to the database containing the network. We can keep the random walk segments in another database, say the PageRank Store. For each node v, we also keep two counters: one, denoted by W(v), keeping track of the number of walk segments visiting v, and one, denoted by d(v), keeping track of the outdegree of v. Then, when a new edge arrives at node v, first we add it to the Social Store. Then, with probability 1 − (1 − 1/d(v))^W(v) we call the PageRank Store to do the updates, in which case the PageRank Store will incur an additional cost as analyzed above in Theorem 4. We are assuming that the preprocessing (to generate the random number) can be done for free, which is reasonable, as it does not require any extra network transmissions or disk accesses; without this assumption, the running time would be O(m + n ln m/ε²), which is still much better than the existing results.

In Theorem 4, we analyzed the update costs under the random permutation model. Another model of interest is the Dirichlet model, in which Pr[u_t = u] = [outdeg_u(t − 1) + 1]/[t − 1 + n]. Following the same proof steps as in Theorem 4, we can again prove that the total expected update cost over m edge arrivals in this model is (nR/ε²) ln((m + n)/n). Again, the total update cost grows only logarithmically, and the marginal update costs are small enough to allow real-time updates.
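Below is a minimal sketch of the edge-arrival path described above, assuming walk segments are kept per source node as in Section 2.1; the helper names are illustrative. For clarity it scans the stored segments directly instead of using the W(v)/d(v) counter trick (which only decides whether the PageRank Store needs to be contacted at all), and it reroutes an affected segment by simply redoing it from its source node, the simpler of the two options mentioned in the proof of Theorem 4.

    import random

    def regenerate_segment(source, out_edges, eps, rng):
        seg, u = [source], source
        while rng.random() >= eps and out_edges.get(u):
            u = rng.choice(out_edges[u])
            seg.append(u)
        return seg

    def on_edge_arrival(u, v, out_edges, segments, eps=0.2, rng=random.Random(0)):
        out_edges.setdefault(u, []).append(v)       # add the edge to the Social Store
        out_edges.setdefault(v, [])
        deg_u = len(out_edges[u])
        for src, segs in segments.items():          # a real store would index segments by node
            for i, seg in enumerate(segs):
                hits = seg.count(u)
                if hits == 0:
                    continue                        # most segments miss most edges
                # each visit to u re-picks its outgoing step; with probability 1/outdeg(u)
                # it would now follow the new edge, so the segment must be rerouted
                if rng.random() < 1.0 - (1.0 - 1.0 / deg_u) ** hits:
                    segs[i] = regenerate_segment(src, out_edges, eps, rng)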

Furthermore, the same ideas work to give a (16nR/ε²) ln m total expected amount of update work for SALSA (over m edge arrivals in random order). The details are given in Appendix D.


Algorithm 1 Personalized PageRank Walk Using Walk Segments

Input: Source node w, required length L of the walk
Output: A personalized PageRank walk P_w for source node w of length at least L

Start the walk at w: P_w ← [w]
while length(P_w) < L do
    u ← last node in P_w
    Generate a uniformly random number β ∈ [0, 1]
    if β < ε then
        Reset the walk to w: P_w ← P_w · append(w)
    else
        if u has an unused walk segment Q remaining in memory then
            Add Q to the end of P_w: P_w ← P_w · append(Q)
            Then, reset the walk to w: P_w ← P_w · append(w)
        else
            if u was previously fetched then
                Take a random edge (u, v) out of u
                Add v to the end of P_w: P_w ← P_w · append(v)
            else
                Do a fetch at u
            end if
        end if
    end if
end while

3. APPROXIMATING PERSONALIZED PAGERANK AND SALSA

In the previous section, we showed how we can approximate PageRank and SALSA by keeping a small number of random walk segments per node. In this section, we show how we can reuse the same stored walk segments to also approximate the personalized variants of PageRank and SALSA. Again, we will focus the discussion on (personalized) PageRank. All the results also extend to the case of SALSA.

We start by explaining the algorithm. Here, the basic idea is that we will perform the personalized PageRank random walk, but opportunistically use the R stored random walk segments (described in Section 2) for each node, where possible. To access these walk segments, we have to query the database containing them. A query to this database for a node u returns all R walk segments starting at u as well as all the neighbors of u. We call such a query a "fetch" operation. Then, taking a random walk starting from a source node w, based on the stored walk segments, can be done as presented in Algorithm 1.
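A minimal Python sketch of Algorithm 1, assuming a fetch(u) callable that returns the pair (walk segments starting at u, out-neighbors of u); variable names, the dangling-node fallback, and the toy example are illustrative, not the paper's implementation. Stored segments are assumed to begin with their start node, so that first node is skipped when a segment is spliced in.

    import random

    def personalized_walk(w, L, fetch, eps=0.2, rng=random.Random(0)):
        walk = [w]
        seg_cache, nbr_cache, fetches = {}, {}, 0
        while len(walk) < L:
            u = walk[-1]
            if rng.random() < eps:
                walk.append(w)                            # reset to the seed node
            elif seg_cache.get(u):
                walk.extend(seg_cache[u].pop()[1:])       # splice in an unused segment of u
                walk.append(w)                            # then reset to the seed node
            elif u in nbr_cache:
                nbrs = nbr_cache[u]
                walk.append(rng.choice(nbrs) if nbrs else w)   # one random outgoing edge
            else:
                segs, nbrs = fetch(u)                     # the only call to the distributed store
                fetches += 1
                seg_cache[u], nbr_cache[u] = list(segs), list(nbrs)
        return walk, fetches

    # Example with an in-memory fetch over a toy graph and no precomputed segments.
    toy = {1: [2, 3], 2: [1], 3: [1, 2]}
    walk, num_fetches = personalized_walk(1, 50, lambda u: ([], toy[u]))
    print(len(walk), num_fetches)

To produce recommendations, one would count visits in the resulting walk (excluding the seed and its direct neighbors, as in Section 4.4) and return the k most visited nodes.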

The main cost in this algorithm is the fetch operations it does. Everything else is done in main memory, which is very fast. Thus, we would like to analyze the number of fetches made by this algorithm.

But, unlike the case of global PageRank and SALSA, we notice that in applications of personalized PageRank and SALSA, what we are interested in is the nodes with the largest values of the personalized score. For instance, in a recommendation system based on personalized SALSA or personalized PageRank, we are only interested in the nodes with the largest authority scores, because the system is eventually going to find and recommend only those nodes anyway. Thus, our objective here is to find the k nodes (for some suitably chosen k) with the largest personalized scores. We show that, under a power-law network model, the above algorithm does this very efficiently.

We start by first describing our network model precisely.

3.1 Network Model

If π is the vector of the scores of interest (e.g., personalized PageRanks), we assume that π follows a power-law model. That is, if π_j is the jth largest entry in π, then we assume:

π_j ∝ j^(−α)    (2)

for some 0 < α < 1. This is a very well-motivated model. First, it has been proved [24] that (under some assumptions), in a network with power-law indegrees, the PageRank vector also follows a power-law, which has the same exponent as the one for indegrees. Our experiments with the Twitter social network, whose results are presented in Section 4, not only confirm this result, but also show that personalized PageRank vectors also follow power-laws, and that the average exponent for the personalized PageRank vectors is roughly the same as (yes, you guessed it!) the one for indegree and global PageRank.

By approximating the summation with an integration, we can approximately calculate the normalizing factor in Equation 2, denoted by η:

1 = Σ_{j=1}^{n} π_j = η Σ_{j=1}^{n} j^(−α) ≈ η n^(1−α) ∫_0^1 x^(−α) dx = η n^(1−α)/(1 − α)

Hence η = (1 − α)/n^(1−α), and:

π_j = (1 − α) j^(−α) / n^(1−α)    (3)

So, we will assume that the values of the π_j's are given by Equation 3 (i.e., we ignore the very small error in estimating the summation with integration). This completes the description of our model.
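As a quick sanity check of Equation 3 (with illustrative values of n and α; the α is in the range reported in Section 4.3), the modeled scores do sum to roughly 1, up to the small summation-versus-integration error the text ignores.

    n, alpha = 10**6, 0.77
    pi = [(1 - alpha) * j ** (-alpha) / n ** (1 - alpha) for j in range(1, n + 1)]
    print(sum(pi))   # about 0.96 for these values; the gap is the integration error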

3.2 Approximating the top k nodes

Fixing a number c, we do a long enough random walk (with resets to the seed node) that for each of the top k nodes, we expect to see that node at least c times. Then, we will return the k nodes most visited in this random walk. To this end, we first give a definition and prove a small technical lemma, whose proof is presented in Appendix E:

Definition 1. X_{s,v} is the number of times that we visit node v in a random walk of length s.

Lemma 6. Σ_v |E[X_{s,v}] − s·π_v| ≤ 2/ε

The above lemma shows that if s·π_v is not very small (e.g., compared to 1/ε), we can approximate E[X_{s,v}] (and, because of the sharp concentration of X_{s,v}, even X_{s,v} itself) with s·π_v. This is what we will do in the rest of this section.

Therefore, in order to see each of the top k nodes c times in expectation, the minimum length of the walk that we need to take is determined by s·π_k = c, which gives:

s_k = (c/(1 − α)) · k · (n/k)^(1−α)    (4)


This gives us the length of the walk that we need to do using our algorithm. So, now we can analyze the algorithm. We prove the following theorem, whose proof is presented in Appendix F:

Theorem 7. If we store R > q ln n walk segments at each node for a large enough constant q, then the expected number of fetches done to take a random walk of length s is at most:

1 + (2(1 − α)/(nR))^((1/α)−1) · s^(1/α)

Remark 2. We defined a fetch operation to return all the stored walk segments as well as all the outgoing edges of the queried node. While the number of stored walks per node is small, the outdegree of a node can be very large (indeed, as large as Ω(n)). For instance, in the Twitter social network, @BarackObama has more than 750,000 outgoing edges. Thus, a fetch at such nodes may cause memory problems. However, one can see that if we change the fetch operation for node w to return either all R stored walk segments starting at w or just one randomly sampled outgoing edge from w, then in the analysis in Theorem 7 we will only get at most a factor of 2 more fetches (because, denoting the number of fetches by F, as in Appendix F, we will have F ≤ 2 Σ_v (X_{s,v} − R)^+). So, we will just stick with our original definition of a fetch operation.

Using the value of s_k from Equation 4 in the result of Theorem 7 directly gives the following corollary:

Corollary 8. Storing R > q ln n walk segments per node for a large enough constant q, the expected number of fetches needed to find the top k nodes is at most

1 + (c^(1/α) / ((1 − α)(R/2)^((1/α)−1))) · k.

It should be noted that the bounds given in Theorem 7 and Corollary 8 are, respectively, O(s/(nR/s)^((1−α)/α)) and O(k/R^((1−α)/α)). Also, as our experiments (described later) show, the theoretical bounds are fairly accurate for values of R as small as 5.

Remark 3. To compare the bounds from Equation 4 and Corollary 8, let α = 0.75, c = 5, R = 10, k = 100, and n = 10^8. Then, Equation 4 bounds the number of required steps (also equal to the number of database queries, if done in the crude way) by 632k = 63200, while Corollary 8 bounds the number of required fetches by the much smaller number 20k = 2000. Also, notice how significantly smaller than n = 10^8 both these bounds are. This is because we are taking advantage of the power-law assumption/property for the random walks we are interested in. Without this assumption, in general, even to find the top k nodes, one would need to calculate all n entries of the stationary distribution, and then return the top-k values.
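A quick computation reproducing the numbers in Remark 3 from Equation 4 and Corollary 8 (the parameters are exactly those chosen in the remark):

    alpha, c, R, k, n = 0.75, 5, 10, 100, 10**8

    steps   = c / (1 - alpha) * k * (n / k) ** (1 - alpha)                           # Equation 4
    fetches = 1 + c ** (1 / alpha) / ((1 - alpha) * (R / 2) ** (1 / alpha - 1)) * k  # Corollary 8

    print(round(steps), round(fetches))   # roughly 63246 walk steps vs about 2001 fetches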

4. EXPERIMENTS

In this section, we present the results of the experiments that we did to test our assumptions and methods.

4.1 Experimental Setup

We used data from the social networking site, Twitter, for our experiments. This network consists of directed edges. The access to the data was through Twitter's Social Store, called FlockDB, stored in distributed shared memory. We emulated the PageRank Store on top of FlockDB. We used the reset probability ε = 0.2 in our experiments. For the personalized experiments, we picked 100 random users from the network, who had a reasonable number of friends (between 20 and 30).

4.2 Verification of the Random Permutation Assumption

Given a single permutation (i.e., the one that we actually observed on Twitter), it is impossible to validate whether edges do arrive in random order. However, we can validate some associated statistics, and as it turns out, these will also provide a sufficient precondition for our analysis to hold:

1. Let X denote the expected value of π_v/outdeg_v for an arriving edge (v, w). We assumed that mX = 1 in our proof; this is the only assumption that we need in order for our proof to be applicable. In order to validate this assumption, we looked at 4.63 million edges arriving between two snapshots of the social graph (we removed edges originating from new nodes). The average observed value of mX for these 4.63 million arrivals was 0.81 (a small sketch of this check appears after this list). This validates the running time we obtained in this paper; in fact, this is a little better than what we assumed, since smaller values of mX imply reduced computation time.

2. While not strictly required for our proof, another interesting consequence of the random permutation model is that the probability of an edge arriving out of node v is proportional to the out-degree of v (in fact, 1 + the out-degree, but that distinction will not be empirically important). We verified this as well; details are omitted for brevity and will be presented in the full version of this paper [3].
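A minimal sketch of the check in item 1, assuming we are given current PageRank estimates, outdegrees, the observed arrivals between the two snapshots, and the edge count m; all names are illustrative.

    def average_mX(arrivals, pagerank, outdeg, m):
        # arrivals: list of observed (v, w) edges; returns the empirical value of mX
        vals = [m * pagerank[v] / outdeg[v] for (v, w) in arrivals if outdeg.get(v, 0) > 0]
        return sum(vals) / len(vals)   # the paper observed about 0.81 over 4.63M arrivals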

4.3 Network Model Verification

As we explained in Section 3.1, we assumed a power-law model on the personalized PageRank values. In addition to considerable literature [24], this assumption was also based on our experiments, showing the following results:

1. Our network has power-law indegrees and global PageRank, with exponent roughly 0.76.

2. Personalized PageRank vectors follow power-laws. Around 2% of the nodes had α > 1. Our analysis is easily adapted to this case, but we omit the results for brevity.

3. There is a variation in the exponents of the power-laws followed by the personalized PageRank vectors of different users. However, the mean of these exponents is almost the same as the exponent for in-degree and PageRank. In our experiment, the average exponent was 0.77 and the standard deviation was 0.08.

Further details of the experiments for this section are presented in the extended version of this paper [3].

4.4 A Few Random Steps Go a Long Way

As we showed in Section 3.2, a relatively small number of random walk steps (e.g., as calculated in Equation 4) are enough to approximately find the top k personalized PageRank nodes. We experimentally tested this idea. To do so, notice that the stationary distribution of a random walk is the limit of the empirical walk distribution as the length of the walk approaches ∞. Thus, to accurately find the top k nodes, we can theoretically do an infinite walk, and find the top k most frequently visited nodes. Based on this idea, we did the following experiment: for each of the 100 randomly selected users, we did a random walk, personalized over that user, with 50000 steps. We calculated the top 100 most visited nodes, and considered them as the "true" top 100 results. Then, we did a 5000 step random walk for each user, and retrieved the top 1000 most visited nodes. For both experiments, we excluded nodes that were directly connected to the user. Then, we calculated the 11 point interpolated average precision curve [25]. The result is given in Figure 1. Notice that the curve validates our approach. For instance, the precision at recall level 0.8 is almost equal to 0.8, meaning that 80 of the top 100 "true" results were returned among the top 100 results of the short (5000 step) random walks. Similarly, a precision of almost 0.9 at recall level 0.7 means that 70 of the top 100 "true" results were retrieved among the top 77 results. This shows that even short random walks are good enough to find the top scoring nodes (in the personalized setting).

Figure 1: 11 point interpolated average precision for top 1000 results. (Plot omitted; x-axis: Recall from 0 to 1, y-axis: Interpolated Average Precision.)
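A minimal sketch of the 11-point interpolated average precision computation behind Figure 1, following the standard definition in [25]; here truth is the set of "true" top-100 nodes from the long walk and ranked is the top-1000 list from the short walk, and Figure 1 averages the resulting curves over the 100 users. The names and the tiny example are illustrative.

    def interpolated_precision_curve(ranked, truth):
        hits, points = 0, []
        for i, node in enumerate(ranked, start=1):
            if node in truth:
                hits += 1
                points.append((hits / len(truth), hits / i))   # (recall, precision) at each hit
        curve = []
        for level in [r / 10 for r in range(11)]:              # recall levels 0.0, 0.1, ..., 1.0
            precs = [p for rec, p in points if rec >= level]
            curve.append(max(precs) if precs else 0.0)         # interpolated precision
        return curve

    print(interpolated_precision_curve([3, 7, 1, 9], {1, 3}))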

4.5 Number of Fetches

In Theorem 7, we gave an upper bound on the number of fetches needed to compose a random walk out of the stored walk segments. We did an experiment to test this theoretical bound. In our experiments, we found the average (over 100 users) number of fetches actually done to make a walk of length s, for s between 100 and 50000, when we store R walk segments per node, for each of the cases R ∈ {5, 10, 20}. These are the thin lines in the plots in Figure 2. We also calculated the corresponding theoretical upper bound on the number of fetches for each user (using its own power-law exponent), and then calculated the average over the 100 users. The results are the thick lines in the plots in Figure 2.

As can be seen in this figure, our theoretical bounds actually give an upper bound on the actual number of fetches in our experiments. Also, we see that the number of fetches that we make is not very sensitive to the number of stored random walks per node (i.e., R). Note that the theoretical guarantees are only valid for R > q ln n for a large enough constant q; hence, the theoretical bound appears to be robust well before the range where we proved it.

Figure 2: Number of fetches to FlockDB versus the number of random walk steps, for R = 5, 10, and 20 (thin line: observed, thick line: theoretical bound). (Plots omitted.)

5. CONCLUSIONS

We showed how PageRank, personalized PageRank, and related random walk based measures can be computed very efficiently in an incremental update model if the graph is stored in distributed memory, under mild assumptions on edge arrival order. We validated our assumptions using the social graph of Twitter. It would be an interesting result to extend our running time guarantees to adversarial edge arrival. We also presented empirical results indicating that the collaborative filtering algorithm, SALSA, performs much better than another well-known collaborative filtering algorithm, HITS, for the link prediction problem in social networks.

6. REFERENCES

[1] A. Altman and M. Tennenholtz. Ranking systems: the PageRank axioms. In Proceedings of the 6th ACM Conference on Electronic Commerce, pages 1–8, 2005.
[2] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova. Monte Carlo methods in PageRank computation: When one iteration is sufficient. SIAM J. Numer. Anal., 45(2):890–904, 2007.
[3] B. Bahmani, A. Chowdhury, and A. Goel. Fast incremental and personalized PageRank, http://arxiv.org/abs/1006.2880, 2010.
[4] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[5] P. Berkhin. Survey: A survey on PageRank computing. Internet Mathematics, 2(1), 2005.
[6] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Link analysis ranking: algorithms, theory, and experiments. ACM Trans. Internet Technol., 5(1):231–297, 2005.
[7] M. Brand. Fast online SVD revisions for lightweight recommender systems. In SDM, 2003.
[8] S. Chien, C. Dwork, S. Kumar, and D. Sivakumar. Towards exploiting link evolution, 2001.
[9] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, pages 271–280, 2007.
[10] A. Das Sarma, S. Gollapudi, and R. Panigrahy. Estimating PageRank on graph streams. In Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 69–78, 2008.
[11] D. Fogaras, B. Racz, K. Csalogany, and T. Sarlos. Towards scaling fully personalized PageRank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3), 2005.
[12] R. A. Hanneman and M. Riddle. Introduction to Social Network Methods, chapter 10. Published in digital form at http://faculty.ucr.edu/~hanneman/, 2005.
[13] T. H. Haveliwala. Topic-sensitive PageRank. In WWW, pages 517–526, 2002.
[14] G. Jeh and J. Widom. Scaling personalized web search. In WWW, pages 271–279, 2003.
[15] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Exploiting the block structure of the web for computing PageRank. Technical report, Stanford University, 2003.
[16] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Extrapolation methods for accelerating PageRank computations. In WWW, pages 261–270, 2003.
[17] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
[18] A. N. Langville and C. D. Meyer. Updating PageRank using the group inverse and stochastic complementation. Technical report, North Carolina State University, Mathematics Department, 2002.
[19] A. N. Langville and C. D. Meyer. Updating the stationary vector of an irreducible Markov chain. Technical report, North Carolina State University, Mathematics Department, 2002.
[20] A. N. Langville and C. D. Meyer. Deeper inside PageRank. Internet Mathematics, 1:2004, 2004.
[21] A. N. Langville and C. D. Meyer. Updating PageRank with iterative aggregation. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pages 392–393, 2004.
[22] R. Lempel and S. Moran. SALSA: the stochastic approach for link-structure analysis. ACM Trans. Inf. Syst., 19(2):131–160, 2001.
[23] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM '03, pages 556–559, 2003.
[24] N. Litvak, W. R. W. Scheinhardt, and Y. Volkovich. In-degree and PageRank: Why do they follow similar power laws? Internet Math., 4(2-3):175–198, 2007.
[25] C. D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[26] F. McSherry. A uniform approach to accelerated PageRank computation. In Proceedings of the 14th International Conference on World Wide Web, pages 575–582, 2005.
[27] M. A. Najork. Comparing the effectiveness of HITS and SALSA. In CIKM, pages 157–164, 2007.
[28] Netflix Cinematch. http://www.netflix.com.
[29] A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 258–266, 2001.
[30] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[31] T. Sarlos, A. A. Benczur, K. Csalogany, D. Fogaras, and B. Racz. To randomize or not to randomize: space optimal summaries for hyperlink analysis. In WWW, pages 297–306, 2006.
[32] H. Tong, S. Papadimitriou, P. S. Yu, and C. Faloutsos. Proximity tracking on time-evolving bipartite graphs. In SDM, pages 704–715, 2008.


APPENDIX

A. DETAILED REVIEW OF THE RELATED WORK

Any PageRank computation or approximation method on social networks is desired to have the following properties:

1. Ability to keep the values (or approximations) updated all the time as the network evolves

2. Large scale full personalization capability

3. Very high computational efficiency

Also, as briefly mentioned in Section 1, in social networking applications, the data access model is dictated by the need for random access to the network, and implemented using a Social Store. Therefore, a desirable PageRank computation (or approximation) scheme should achieve the above mentioned features in this data access model.

A simple way to keep the PageRank values updated is to just recompute the values for each incremental change in the network. But this can be very costly. For instance, the simple power iteration method [30] to approximate (to a constant precision) the PageRank values (with reset probability ε) takes Ω(x/ln(1/(1 − ε))) time over a graph with x edges. Hence, over m edge arrivals, this takes Σ_{x=1}^{m} Ω(x/ln(1/(1 − ε))) = Ω(m²/ln(1/(1 − ε))) total time, which is many orders of magnitude larger than our approach. Similarly, the Ω(n/ε) time complexity of the Monte Carlo method results in a total Ω(mn/ε) work over m edge arrivals, which is also very inefficient. So, we need more efficient ways of doing the computations.

There have been a lot of methods proposed for the computation or approximation of PageRank and similar measures [5]. Broadly speaking, these methods can be divided into two general categories, based on the core techniques they use:

1. Linear algebraic methods: These methods mainly use techniques from linear and matrix algebra, perhaps with some application of structural properties of the networks of interest (e.g., the world wide web) [8, 13–16, 18, 19, 21, 26, 31, 32].

2. Monte Carlo methods: These methods use a small number of simulated random walks per node to approximate PageRanks (or other variants) [2, 10, 11, 31].

A great survey of many of the methods in the first category is given by Langville and Meyer [20]. However, for completeness and also to compare the state of the art with our own results, we provide an overview of the methods and results in this category here.

A family of the methods proposed in this category deals with accelerating the basic power iteration method for computing the PageRank values [15, 16]. However, they all provide only very modest (i.e., small constant factor) speed-ups. For instance, Kamvar et al. [16] propose a method to accelerate the power iteration, using an extrapolation based on the Aitken Δ² method for accelerating linearly convergent sequences. However, as discussed in their paper, the time complexity of their method is Ω(n), which is prohibitively large for a real-time application. Also, their experiments show only a 25-300% speed-up compared to the crude power iteration. So, the method does not perform well enough for our applications.

Another family of the methods in the first category deals with efficiently updating the PageRank values using the "aggregation" idea [8, 18, 19, 21]. The basic idea behind these methods is that when an incremental change happens in the network, the effect of this change on the PageRank vector is mostly local. That is, only the PageRanks of the nodes in the vicinity of the change may change significantly. To utilize this observation, these methods partition the set of the network nodes into two subsets G and Ḡ, where G is a subset of nodes close to where the incremental change happened, and Ḡ is the set of all other nodes. Then, all the nodes in Ḡ are lumped/aggregated into a single hyper-node, so a smaller network (composed of G and this hyper-node) is formed. Then, the PageRanks of the nodes in G are updated using this network, and finally the result is translated back to the original network.

None of these methods seems well suited for real-time applications. First, the performance of these methods heavily depends on the partitioning of the network, and as pointed out in [21], a bad choice of this partitioning can cause these methods to be as slow as the power iteration. It is not known how to do this partitioning; while a number of heuristic ideas have been proposed [8, 19], there is also considerable evidence that these networks are expanders, and no such partitioning is possible. Further, it is easy to see that independent of how the partitioning is done, partitioning and aggregation together will need Ω(n) time. Also, notice that this work is in addition to the actual PageRank computation that needs to be done on the aggregated network, and this computational load is also not negligible. For instance, as reported by Chien et al. [8], for a single edge addition to a network with 60M nodes, they need to do a PageRank computation on a network with almost 8K nodes. After all, these methods start with a precise PageRank vector and give an approximate PageRank vector for the network after the incremental change. Therefore, even if these methods were run for a real-time application, the approximation error would potentially accumulate, and the estimations would drift away from the true PageRanks [18]. Of course, we should mention that there exist exact aggregation based update methods, but all of those methods are more costly than power iteration [19]!

A number of other methods in the first category also deal with updating PageRanks [26, 32]. However, the method in [32] does not scale well for large scale social networking applications. It achieves O(l²) update time for random walk based methods on an n × l bipartite graph. This may work well when the graph is very skewed (i.e., l ≪ n). But, for instance, in the friend recommendation application on social networks, l = n, so this gives only O(n²) update time, and hence O(mn²) total time over m edge arrivals, which is very bad. Also, McSherry [26] combines a number of heuristics to provide some improvement in the computation and updating of PageRank, using a sequential implementation of the power iteration. The method works in the streaming model where edges are stored and then streamed from the disk. No guarantee is given about the tradeoff between precision and the time complexity of the method.

Another family of methods in the first category [13-15] deals with personalization. In this family, Haveliwala's work [13] achieves personalization only to the level of a few (e.g., 16) topics and provides no efficiency improvement over the power iteration.


Kamvar et al. [15] use the host-induced block structure of the web link graph to speed up the computation and updating of PageRank, and also provide personalization. However, first, in social networks, there is no equivalent of a web host, and more generally it is not easy to find a corresponding block structure (even if such a structure actually exists). Therefore, it is not even clear how to apply this idea to social networking applications. Also, they use the block structure only to come up with a better initialization for a standard PageRank computation, such as power iteration. Therefore, even though, after an incremental change, the local PageRank vectors (corresponding to unchanged hosts) may be reused, doing the power iteration alone would need Ω(m) work per change (and hence a total Ω(m²) work over m edge arrivals). Finally, on the personalization front, their method achieves personalization only at the host level.

Jeh and Widom [14] achieve personalization only over a subset of the network nodes, denoted as the "Hub Set". Even though the paper provides no precise time complexity analysis, it is easy to see that, as mentioned in the paper, the time performance of the presented algorithm heavily depends on the choice of the hub set. In our application, where we wish to have full personalization, the hub set needs to be simply the set of all vertices, in which case the algorithms in [14] reduce to a simple dynamic programming which provides no performance improvement over power iteration.

Another notable work in the first category is [31]. It uses deterministic rounding or randomized sketching techniques along with the dynamic programming approach proposed in [14] to achieve full personalization. However, the time complexity of their (rounding) method, to achieve constant error, is O(m/ε), while the time complexity of the simple Monte Carlo method to achieve constant error with high probability is just O(n ln n/ε). Therefore, if m = ω(n ln n), which is expected in our applications of interest, then the simple Monte Carlo method is asymptotically faster than the method introduced in [31]. Also, it is not clear how one would be able to efficiently update this method's estimations as the network changes.

The methods in the second category above, namely Monte Carlo methods, have the advantage that they are very efficient and can achieve full personalization on a large scale [2, 11]. However, all the literature in this category deals with static networks. It has been mentioned in [2] that one can keep the approximations updated continuously, but neither the details of how exactly to do this nor any analysis of the efficiency of doing these updates is provided. For instance, the method in [2] uses Ω(n/ε) work in each computation, so if we naively recompute the PageRank using this method for each edge arrival, then over m edge arrivals we incur Ω(mn/ε) total work.

In contrast, in this paper we show that one can use Monte Carlo techniques to achieve very cheap incremental updates. Indeed, we prove a surprising result: up to a logarithmic factor, the total work required to keep the approximations updated at all times is the same as the work needed to just initialize the approximations. More precisely, we prove that over m edge arrivals in a random order, we can keep the approximations updated using only O(n ln m/ε²) total work. This is significantly better than all the previously proposed methods for doing the updates.

Another issue with the methods in the second category is that if we want to directly use the simulated random walk segments to approximate personalized PageRank, we would get only limited precision. For instance, if we store R random walks per node (and use only the end points of these walks for our estimates), the approximate personalized PageRank vectors that we get have at most R non-zero entries, which is significantly fewer than what we need in all applications of interest; previous works [10, 11] do not explore this tradeoff in any detail.
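For illustration, the following is a minimal Python sketch of such an incremental Monte Carlo scheme. The class name WalkStore, the constants EPS and R, and the brute-force scan over all stored segments are illustrative only, and the re-routing rule shown is one natural distribution-preserving choice, not necessarily the exact update procedure analyzed in this paper.

import random
from collections import defaultdict

EPS = 0.2   # reset probability (epsilon); illustrative value
R = 3       # stored walk segments per node; illustrative value

class WalkStore:
    """Toy, single-machine stand-in for a 'Social Store': adjacency lists
    plus R reset-terminated walk segments stored at every node.  `nodes`
    must contain all endpoints of edges that will be added later."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.out = defaultdict(list)
        # On an initially empty graph every segment is just [u].
        self.segments = {u: [self._walk(u) for _ in range(R)] for u in self.nodes}

    def _walk(self, start):
        """Sample one segment: at each node, stop (reset) with probability EPS,
        otherwise move to a uniformly random out-neighbor; stop at dangling nodes."""
        path, cur = [start], start
        while self.out[cur] and random.random() > EPS:
            cur = random.choice(self.out[cur])
            path.append(cur)
        return path

    def add_edge(self, u, v):
        """On arrival of edge (u, v), each outgoing step previously taken from u
        is re-routed through the new edge with probability 1/outdeg(u), and the
        remainder of that segment is resampled, so the stored segments stay
        distributed like fresh walks on the updated graph."""
        self.out[u].append(v)
        d = len(self.out[u])
        for segs in self.segments.values():        # a real system would index
            for i, path in enumerate(segs):        # segments by visited node
                for pos in range(len(path) - 1):   # positions with an outgoing step
                    if path[pos] == u and random.random() < 1.0 / d:
                        segs[i] = path[:pos + 1] + self._walk(v)
                        break

An efficient implementation would keep, for every node, an index of the stored segments that visit it, so that an arriving edge only touches the affected segments; the number of stored walk positions touched over m random-order arrivals is roughly the quantity that the total-work bound above controls.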

In this paper, we present a formal analysis of this tradeoff in the random access model, and prove that under the power-law model for the network, one can do long enough walks (and hence get the desired level of precision) very efficiently. Indeed, Das Sarma et al. [10] achieve time performance O(m/√ε), so using their method for each incremental change in the network would require O(m²/√ε) total work over m edge arrivals. While this is better than the naive power iteration's O(m²/ln(1/(1−ε))) time complexity, it is still very inefficient. In contrast, we show that in our model we can achieve an O(n ln m/ε²) time complexity, assuming a power-law network model, which is significantly better and good enough for real-time incremental applications.
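To illustrate where the fetches analyzed in Appendix F come from, here is a companion Python sketch, reusing the illustrative WalkStore from the previous sketch; the function name and the exact bookkeeping are choices of this sketch rather than the algorithm as analyzed in the paper. It simulates a personalized walk of roughly s steps from a seed node, consuming an unused stored segment whenever the current node still has one, and otherwise taking a single on-demand random step, counted as a fetch.

import random
from collections import Counter

def personalized_visit_counts(store, seed, s, eps):
    """Walk roughly s steps of a personalized PageRank walk from `seed`,
    reusing stored segments where possible; returns visit counts and the
    number of fetches (on-demand steps taken at exhausted nodes)."""
    unused = {u: list(segs) for u, segs in store.segments.items()}
    visits, fetches, steps, cur = Counter(), 0, 0, seed
    while steps < s:
        if unused[cur]:                    # consume one unused segment of cur
            for node in unused[cur].pop():
                visits[node] += 1
                steps += 1
            cur = seed                     # the segment ended with a reset
        else:                              # no unused segment left: fetch
            fetches += 1
            visits[cur] += 1
            steps += 1
            if random.random() < eps or not store.out[cur]:
                cur = seed                 # reset back to the seed
            else:
                cur = random.choice(store.out[cur])
    return visits, fetches

Normalizing visits[v] by the total number of steps gives the personalized PageRank estimate for v, and the fetch count is the kind of quantity that the proof of Theorem 7 bounds by charging each fetch to an extra visit at a node whose R segments are exhausted.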

Collaborative filtering on social networks is very closely related to the Link Prediction problem [23], for which random walk based methods have been reported to work very well [23] (we also verified this experimentally; the results are given in Appendix B). Most of the literature on Link Prediction deals with static networks. There are some proposed algorithms that deal with dynamically evolving networks [32]; however, none of these methods scale to today's very large social networks.

We would also like to mention that there are incremental collaborative filtering methods based on low-rank SVD updates; see, for instance, [7] and the references therein. However, these methods also do not scale very well. For instance, the method proposed in [7] requires O(pqr) total work to calculate the rank-r SVD of a p × q matrix. For the friend recommendation application, p = q = n, so this method needs Ω(n²) total work, while we only need O(n ln m/ε²) total work, as mentioned above, which is significantly better.

B. EFFECTIVENESS OF RANDOM WALK BASED METHODS FOR LINK PREDICTION

As mentioned in the Introduction section, random walk based methods have been reported to be very effective for the link prediction problem on social networks [23]. We also did some experiments to explore this further. These are somewhat tangential to the rest of this paper, and by no means exhaustive. We present them not as significant original research but as an interesting data point for readers who are interested in practical aspects of recommendation systems.

We picked 100 random nodes from the Twitter social network. To select these users, we considered the network on two different dates, five weeks apart. We then selected random users who had a reasonable number of friends (between 20 and 30) on the first date and increased the number of their friends by a factor between 50% and 100% by the second date. For the second date, we only counted the friends who already existed (and were reasonably followed, i.e., had at least 10 followers) on the first date, because otherwise there is no way for a collaborative filtering or link prediction system to find these users.



             HITS    COSINE    PageRank    SALSA
Top 100      0.25      4.93        5.07     6.29
Top 1000     0.86     11.69       12.71    13.58

Table 1: Link Prediction Effectiveness

The reason we enforced the above criteria was that these users are among the typical reasonably active users on the network. Also, because they are increasing their friend sets, they are good targets for a recommendation system.

For each of the 100 users, we used the network data from the first date to generate a personalized list of predicted links. We considered four link prediction methods: personalized PageRank, personalized SALSA, personalized HITS, and COSINE. We already explained personalized PageRank and personalized SALSA in the paper. Personalized HITS and COSINE also assign hub and authority scores to each node. For personalized HITS, when personalizing over node u, these scores are related as follows:

\[ h_v = (1-\epsilon)\Big(\sum_{x \mid (v,x)\in E} a_x\Big) + \epsilon\,\delta_{u,v}, \qquad a_x = \sum_{v \mid (v,x)\in E} h_v \]

For the COSINE method, the hub score h_v is defined as the cosine similarity of the neighbor sets of u and v (considered as 0-1 vectors). Then, the authority score, as in HITS, is defined by:

\[ a_x = \sum_{v \mid (v,x)\in E} h_v \]

We performed 10 iterations of each method to calculate the hub and authority scores. After generating the lists of predicted links, we calculated how many of the new friendships made by each user between the two dates were captured by the top 100 or top 1000 predictions. Finally, we averaged these numbers over the 100 selected users. The results are presented in Table 1.
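For concreteness, the following Python sketch implements the two baseline scorers directly from the update equations above; the graph is given as a dict out of out-neighbor lists, and the function names, the zero initialization of the authority scores, and the absence of normalization are choices of this sketch rather than details specified by the experiment.

from collections import defaultdict

def personalized_hits(out, u, eps, iters=10):
    """Personalized HITS for seed u:
    h_v = (1 - eps) * sum_{x : (v,x) in E} a_x + eps * [v == u],
    a_x = sum_{v : (v,x) in E} h_v."""
    nodes = set(out) | {x for xs in out.values() for x in xs}
    a = {x: 0.0 for x in nodes}
    for _ in range(iters):
        h = {v: (1 - eps) * sum(a[x] for x in out.get(v, ())) + (eps if v == u else 0.0)
             for v in nodes}
        a = {x: 0.0 for x in nodes}
        for v, xs in out.items():
            for x in xs:
                a[x] += h[v]
    return h, a

def cosine_scores(out, u):
    """COSINE baseline: h_v is the cosine similarity of the neighbor sets of u
    and v (as 0-1 vectors); a_x = sum_{v : (v,x) in E} h_v, as in HITS."""
    nu = set(out.get(u, ()))
    h = {}
    for v in set(out) | {u}:
        nv = set(out.get(v, ()))
        h[v] = len(nu & nv) / ((len(nu) * len(nv)) ** 0.5) if nu and nv else 0.0
    a = defaultdict(float)
    for v, xs in out.items():
        for x in xs:
            a[x] += h[v]
    return h, dict(a)

Ranking candidate nodes by their authority score a_x (excluding the seed's existing friends) yields prediction lists of the kind evaluated in Table 1.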

Notice that we do not expect these numbers to be large, because they count friendships from the prediction/recommendation list that the user made without being exposed to the recommendations at all. Also, in our experiments, each user had only 10-30 new friends, which is an upper bound on these numbers. These numbers would presumably be much larger if the user first received the recommendations and then decided which friendships to make. Nonetheless, the relative values of these numbers for different algorithms are good indicators of the predictive ability of those algorithms, especially when the differences are as pronounced as in our experiments. As we see from Table 1, the systems based on random walks (i.e., personalized PageRank and SALSA) perform the best: they significantly outperform HITS, and they also do better than the cosine similarity based link prediction system. These results are in accordance with the previous literature indicating the effectiveness of random walk based methods for the link prediction problem [23]. Moreover, it should be mentioned that there is also axiomatic support for this outcome [1, 6].

C. PROOF OF THEOREM 1

It is already proved in [2] that $E[\hat{\pi}_v] = \pi_v$, so we only need to prove the sharp concentration result. First, assume R = 1. Fix an arbitrary node v. Define $X_u$ to be $\epsilon$ times the number of visits to v in the walk stored at u, $Y_u$ to be the length of this walk, $W_u = \epsilon Y_u$, and $x_u = E[X_u]$. Then the $X_u$'s are independent, $\hat{\pi}_v = \frac{1}{n}\sum_u X_u$ (hence $\pi_v = \frac{1}{n}\sum_u x_u$), $0 \le X_u \le W_u$, and $E[W_u] = 1$. Then, it is easy to see that:

\[ E[e^{tX_u}] \le x_u E[e^{tW_u}] + 1 - x_u \le e^{-x_u(1 - E[e^{tW_u}])} \]

Thus:

\[
\begin{aligned}
\Pr[\hat{\pi}_v \ge (1+\delta)\pi_v]
&\le \frac{E[e^{tn\hat{\pi}_v}]}{e^{tn(1+\delta)\pi_v}}
 = \frac{\prod_u E[e^{tX_u}]}{e^{tn(1+\delta)\pi_v}}
 \le \frac{\prod_u e^{-x_u(1 - E[e^{tW_u}])}}{e^{tn(1+\delta)\pi_v}} \\
&= \frac{e^{-n\pi_v(1 - E[e^{tW}])}}{e^{tn(1+\delta)\pi_v}}
 \le e^{-n\pi_v \delta'}
\end{aligned}
\]

where $W = \epsilon Y$ is a random variable with Y having a geometric distribution with parameter $\epsilon$, and $\delta'$ is a constant depending on $\delta$ (and $\epsilon$), found by an optimization over t.

Therefore, we see that if $\pi_v = \Omega(\ln n / n)$ (i.e., if $\pi_v$ is slightly larger than the average PageRank value 1/n), then we already get a sharp concentration with R = 1 (the analysis for $\Pr[\hat{\pi}_v \le (1-\delta)\pi_v]$ is similar, and hence we omit it here).

Now, assume we have R walk segments stored at each node, where we do not necessarily have R = 1. Then, similarly to the above tail analysis, we get:

\[ \Pr[\hat{\pi}_v \ge (1+\delta)\pi_v] \le e^{-nR\pi_v \delta'} \]

Therefore, choosing $R = \Omega\big(\frac{\ln n}{n\pi_v}\big)$ we get exponentially decaying tails. Notice that this means that even for average values of $\pi_v$ (i.e., for $\pi_v = \Theta(1/n)$), we have sharp concentration with R as small as $O(\ln n)$.

This finishes the proof of the theorem.
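As a small illustration of the estimator used in this proof, the following Python snippet computes it from stored walks; the container layout (a dict mapping each node to its R stored walks) is an assumption of the sketch.

def estimate_pagerank(segments, v, eps, R):
    """pi_hat_v = (1/n) * sum_u X_u, where X_u is eps times the number of
    visits to v in the walks stored at u, averaged over the R walks."""
    x = {u: eps * sum(path.count(v) for path in walks) / R
         for u, walks in segments.items()}
    return sum(x.values()) / len(segments)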

D. EXTENSION OF INCREMENTAL COMPUTATION BOUNDS TO SALSA

To approximate the hub and authority scores in SALSA, we need to keep 2R random walk segments per node: R random walks starting with a forward step from the seed node, and R walks starting with a backward step. Then, the approximations are done similarly to the case of PageRank, and the sharp concentration of the approximations can be proved in a similar way.

For the update cost, we notice that if $(u_t, v_t)$ is the edge arriving at time t, then, unlike the PageRank case where only $u_t$ could cause updates, both $u_t$ and $v_t$ can cause walk segments to need an update. Again, we assume the random permutation model for edge arrivals. Then:

\[ \Pr[u_t = u] = \frac{\mathrm{outdeg}_u(t)}{t}, \qquad \Pr[v_t = v] = \frac{\mathrm{indeg}_v(t)}{t} \]

and we get the following theorem:



Theorem 9. The expected amount of work needed to keep the approximations updated over m edge arrivals is at most $\frac{16nR}{\epsilon^2}\ln m$.

Rather than presenting the complete proof, which follows exactly the same steps as the one for Theorem 4, we just explain where the extra factor of 16 comes from. Instead of R walks, we are storing 2R walks per node, introducing a factor of 2. Rather than 1/ε, each walk segment has average length 2/ε (because we only allow resets at forward steps), introducing a factor of 4 (since the segment length enters the bound squared, as 1/ε²). Also, each time an edge (u_t, v_t) arrives, both u_t and v_t can cause updates, so twice as many walks need to be updated at each time. Together, these three modifications cause a factor of 16 difference.
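To illustrate the 2/ε average segment length mentioned above, here is a small Python sketch of generating one SALSA-style walk segment; the function name, the dangling-node handling, and flipping the reset coin immediately before each forward step are choices of the sketch. Here out and inn map each node to its out- and in-neighbor lists.

import random

def salsa_segment(out, inn, start, eps, first_step_forward=True):
    """One SALSA walk segment from `start`: steps alternate between forward
    (along out-edges) and backward (along in-edges), and the reset coin with
    probability eps is flipped only before forward steps, so the expected
    number of steps is about 2/eps."""
    path, cur, forward = [start], start, first_step_forward
    while True:
        if forward:
            if random.random() < eps:      # resets are allowed only at forward steps
                return path
            nbrs = out.get(cur, [])
        else:
            nbrs = inn.get(cur, [])
        if not nbrs:                       # no neighbor in this direction: stop
            return path
        cur = random.choice(nbrs)
        path.append(cur)
        forward = not forward

Each node would then store R such segments generated with first_step_forward=True and R with first_step_forward=False, giving the 2R segments per node used above.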

E. PROOF OF LEMMA 6

If $\tilde{X}_{s,v}$ is the number of times that we visit v when we take a random walk starting at the stationary distribution, then by coupling our walk with this stationary walk at the first reset time $t_s$ and all the steps afterwards, we can see that $X_{s,v} - \tilde{X}_{s,v} = X_{t_s,v} - \tilde{X}_{t_s,v}$. Since $\sum_v X_{t_s,v} = \sum_v \tilde{X}_{t_s,v} = t_s$, we get:

\[ \sum_v \big|E[X_{s,v}] - s\pi_v\big| = \sum_v \big|E[X_{s,v}] - E[\tilde{X}_{s,v}]\big| = \sum_v \big|E[X_{t_s,v}] - E[\tilde{X}_{t_s,v}]\big| \le 2E[t_s] = \frac{2}{\epsilon} \]

which proves the lemma.

F. PROOF OF THEOREM 7

A fetch is made at a node u only if we arrive at u from a parent node v that has run out of unused walk segments. In other words, each fetch at a node u can be charged to an extra visit (beyond the first R) to one of u's parents. Therefore, denoting the number of fetches made during the algorithm by F, we have:

\[ F \le \sum_v (X_{s,v} - R)^+ \]

Hence,

\[
\begin{aligned}
E[F] &\le \sum_v E[(X_{s,v} - R)^+]
 = \sum_v E[X_{s,v} - R \mid X_{s,v} \ge R]\,\Pr(X_{s,v} \ge R) \\
&\le n \sum_{v \mid E[X_{s,v}] < R/2} \Pr(X_{s,v} \ge R) \;+\; \sum_{v \mid E[X_{s,v}] \ge R/2} E[X_{s,v}] \\
&\le 1 + \sum_{v \mid s\pi_v > R/2} E[X_{s,v}]
\end{aligned}
\]

where the second-to-last inequality holds because (due to the memoryless property of the random walk) $E[X_{s,v} - R \mid X_{s,v} \ge R] \le E[X_{s,v}]$, and the last inequality holds because, with $R > q \ln n$ for a large enough constant q, if $E[X_{s,v}] < R/2$ then $\Pr(X_{s,v} \ge R) = o(1/n)$ by Chernoff bounds (and if $E[X_{s,v}] \ge R/2$ then $E[X_{s,v}]$ is almost equal to $s\pi_v$, as mentioned after Lemma 6).

But $s\pi_v > R/2$ if and only if $v \le \tau$, where $\tau$ is such that $s\pi_\tau = R/2$. This gives:

\[ \tau = (1-\alpha)^{1/\alpha}\, n^{1 - 1/\alpha}\, \Big(\frac{2s}{R}\Big)^{1/\alpha} \]

and $E[F] \le 1 + \sum_{j=1}^{\tau} s\pi_j$. Upper bounding the summation by an integral, we get:

\[ E[F] \le 1 + s(1-\alpha)\int_0^{\tau/n} x^{-\alpha}\,dx = 1 + s\Big(\frac{\tau}{n}\Big)^{1-\alpha} = 1 + \Big(\frac{2(1-\alpha)}{nR}\Big)^{\frac{1}{\alpha}-1} s^{1/\alpha} \]

which finishes the proof.
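To spell out the final equality, substitute the expression for τ derived above:

\[
\frac{\tau}{n} = \Big(\frac{2(1-\alpha)s}{nR}\Big)^{1/\alpha}
\quad\Longrightarrow\quad
s\Big(\frac{\tau}{n}\Big)^{1-\alpha}
 = s\,\Big(\frac{2(1-\alpha)s}{nR}\Big)^{\frac{1-\alpha}{\alpha}}
 = \Big(\frac{2(1-\alpha)}{nR}\Big)^{\frac{1}{\alpha}-1} s^{1/\alpha}.
\]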
