arXiv:1604.04185v1 [cs.DB] 14 Apr 2016

SLING: A Near-Optimal Index Structure for SimRank

Boyu Tian
Shanghai Jiao Tong University, China
[email protected]

Xiaokui Xiao
Nanyang Technological University, Singapore
[email protected]

ABSTRACT
SimRank is a similarity measure for graph nodes that has numerous applications in practice. Scalable SimRank computation has been the subject of extensive research for more than a decade, and yet, none of the existing solutions can efficiently derive SimRank scores on large graphs with provable accuracy guarantees. In particular, the state-of-the-art solution requires up to a few seconds to compute a SimRank score in million-node graphs, and does not offer any worst-case assurance in terms of the query error.

This paper presents SLING, an efficient index structure for SimRank computation. SLING guarantees that each SimRank score returned has at most ε additive error, and it answers any single-pair and single-source SimRank queries in $O(1/\varepsilon)$ and $O(n/\varepsilon)$ time, respectively. These time complexities are near-optimal, and are significantly better than the asymptotic bounds of the most recent approach. Furthermore, SLING requires only $O(n/\varepsilon)$ space (which is also near-optimal in an asymptotic sense) and $O\left(m/\varepsilon + n \log\frac{n}{\delta}/\varepsilon^2\right)$ pre-computation time, where δ is the failure probability of the preprocessing algorithm. We experimentally evaluate SLING with a variety of real-world graphs with up to several millions of nodes. Our results demonstrate that SLING is up to 10000 times (resp. 110 times) faster than competing methods for single-pair (resp. single-source) SimRank queries, at the cost of higher space overheads.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

SIGMOD'16, June 26-July 01, 2016, San Francisco, CA, USA
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-3531-7/16/06...$15.00
DOI: http://dx.doi.org/10.1145/2882903.2915243

1. INTRODUCTION
Assessing the similarity of nodes based on graph topology is an important problem with numerous applications, including social network analysis [21], web mining [16], collaborative filtering [5], natural language processing [26], and spam detection [27]. A number of similarity measures have been proposed, among which SimRank [14] is one of the most widely adopted. The formulation of SimRank is based on two intuitive arguments:

• A node should have the maximum similarity to itself;

• The similarity between two different nodes can be measured by the average similarity between the two nodes' neighbors.

Formally, the SimRank score of two nodes $v_i$ and $v_j$ is defined as:

$$s(v_i, v_j) = \begin{cases} 1, & \text{if } v_i = v_j \\ \dfrac{c}{|I(v_i)| \cdot |I(v_j)|} \displaystyle\sum_{a \in I(v_i),\, b \in I(v_j)} s(a, b), & \text{otherwise} \end{cases} \tag{1}$$

where $I(v)$ denotes the set of in-neighbors of a node v, and $c \in (0, 1)$ is a decay factor typically set to 0.6 or 0.8 [14, 23]. Previous work [5, 8, 16, 21, 22, 26, 27, 32, 34] has applied SimRank (and its variants) to various problem domains, and has demonstrated that it often provides high-quality measurements of node similarity.

1.1 Motivation
Despite the effectiveness of SimRank, computing SimRank scores efficiently on large graphs is a challenging task, and has been the subject of extensive research for more than a decade. In particular, Jeh and Widom [14] propose the first SimRank algorithm, which returns the SimRank scores of all pairs of nodes in the input graph G. The algorithm incurs prohibitive costs: it requires $O(n^2)$ space and $O\left(m^2 \log\frac{1}{\varepsilon}\right)$ time, where n and m denote the numbers of nodes and edges in G, respectively, and ε is the maximum additive error allowed in any SimRank score. Subsequently, Lizorkin et al. [23] improve the time complexity of the algorithm to $O\left(\log\frac{1}{\varepsilon} \cdot \min\{nm, n^3/\log n\}\right)$, which is further improved to $O\left(\log\frac{1}{\varepsilon} \cdot \min\{nm, n^{\omega}\}\right)$ by Yu et al. [33], where $\omega \approx 2.373$. However, the space complexity of the algorithm remains $O(n^2)$, as is inherent in any algorithm that computes all-pair SimRank scores.

Fogaras and Rácz [8] present the first study on single-pair SimRank computation, and propose a Monte-Carlo method that requires $O\left(n \log\frac{1}{\delta}/\varepsilon^2\right)$ pre-computation time and space. The method returns the SimRank score of any node pair in $O\left(\log\frac{1}{\delta}/\varepsilon^2\right)$ time, where δ is the failure probability of the Monte-Carlo method. Subsequently, Li et al. [20] propose a deterministic algorithm for single-pair SimRank queries; it has the same time complexity as Jeh and Widom's solution [14], but provides much better practical efficiency. However, existing work [24] shows that neither Li et al.'s [20] nor Fogaras and Rácz's solution [8] is able to handle million-node graphs in reasonable time and space. There is a line of research [10, 13, 19, 29–31] that attempts to mitigate this efficiency issue based on an alternative formulation of SimRank, but the formulation is shown to be incorrect [17], in that it does not return the same SimRank scores as defined in Equation (1).

The most recent approach to SimRank computation is the linearization technique [24] by Maehara et al., which is shown to considerably outperform existing solutions in terms of efficiency and scalability. Nevertheless, it still requires up to a few seconds to answer a single-pair SimRank query on sizable graphs, which is inadequate for large-scale applications. More importantly, the
technique is unable to provide any worst-case guarantee in terms of query accuracy. In particular, the technique has a preprocessing step that requires solving a system L of linear equations; assuming that the solution to L is exact, Maehara et al. [24] show that the technique can ensure ε worst-case query error, and can answer any single-pair and single-source SimRank queries in $O\left(m \log\frac{1}{\varepsilon}\right)$ and $O\left(m \log^2\frac{1}{\varepsilon}\right)$ time, respectively. (A single-source SimRank query from a node $v_i$ asks for the SimRank score between $v_i$ and every other node.) Unfortunately, as we discuss in Section 3.3, the linearization technique cannot precisely solve L, nor can it offer non-trivial guarantees in terms of the query errors incurred by the imprecision of L's solution. Consequently, the technique in [24] only provides heuristic solutions to SimRank computation. In summary, after more than ten years of research on SimRank, there is still no solution for efficient SimRank computation on large graphs with provable accuracy guarantees.

Table 1: Comparison of SimRank computation methods with at most ε additive error and at least 1 − δ success probability.

| Algorithm | Query Time (Single Pair) | Query Time (Single Source) | Space Overhead | Preprocessing Time |
|---|---|---|---|---|
| Fogaras and Rácz [8] | $O\left(\log\frac{1}{\varepsilon}\log\frac{n}{\delta}/\varepsilon^2\right)$ | $O\left(n\log\frac{1}{\varepsilon}\log\frac{n}{\delta}/\varepsilon^2\right)$ | $O\left(n\log\frac{1}{\varepsilon}\log\frac{n}{\delta}/\varepsilon^2\right)$ | $O\left(n\log\frac{1}{\varepsilon}\log\frac{n}{\delta}/\varepsilon^2\right)$ |
| Maehara et al. [24] (under heuristic assumptions) | $O\left(m\log\frac{1}{\varepsilon}\right)$ | $O\left(m\log^2\frac{1}{\varepsilon}\right)$ | $O(n+m)$ | no formal result |
| this paper | $O(1/\varepsilon)$ | $O(n/\varepsilon)$ (Algorithm 3); $O\left(m\log^2\frac{1}{\varepsilon}\right)$ (Algorithm 6) | $O(n/\varepsilon)$ | $O\left(m/\varepsilon + n\log\frac{n}{\delta}/\varepsilon^2\right)$ |
| lower bound | $\Omega(1)$ | $\Omega(n)$ | $\Omega(n)$ | - |

1.2 Contributions and Organization
This paper presents SLING (SimRank via Local Updates and Sampling), an efficient index structure for SimRank computation. SLING guarantees that each SimRank score returned has at most ε additive error, and answers any single-pair and single-source SimRank queries in $O(1/\varepsilon)$ and $O(n/\varepsilon)$ time, respectively. These time complexities are near-optimal, since any SimRank method requires $\Omega(1)$ (resp. $\Omega(n)$) time to output the result of any single-pair (resp. single-source) query. In addition, they are significantly better than the asymptotic bounds of the state of the art (including Maehara et al.'s technique [24] under their heuristic assumptions), as we show in Table 1. Furthermore, SLING requires only $O(n/\varepsilon)$ space (which is also near-optimal in an asymptotic sense) and $O\left(m/\varepsilon + n\log\frac{n}{\delta}/\varepsilon^2\right)$ pre-computation time, where δ is the failure probability of the preprocessing algorithm.

Apart from its superior asymptotic bounds, SLING also incorporates several optimization techniques to enhance its practical performance. In particular, we show that its preprocessing algorithm can be improved with a technique that estimates the expectation of a Bernoulli variable using an asymptotically optimal number of samples. Additionally, its space consumption can be heuristically reduced without affecting its theoretical guarantees, while its empirical efficiency for single-source SimRank queries can be considerably improved, at the cost of a slight increase in its query time complexity. Last but not least, its construction algorithms can be easily parallelized, and it can efficiently process queries even when its index structure does not fit in the main memory.

We experimentally evaluate SLING with a variety of real-world graphs with up to several millions of nodes, and show that it significantly outperforms the state of the art in terms of query efficiency. Specifically, SLING requires at most 2.3 milliseconds to process a single-pair SimRank query on our datasets, and is up to 10000 times faster than the linearization method [24]. To our knowledge, this is the first result in the literature that demonstrates millisecond-scale query time for single-pair SimRank computation on million-node graphs. For single-source SimRank queries, SLING is up to 110 times more efficient than the linearization method. As a tradeoff, SLING incurs larger space overheads than the linearization method, but it is still a much more favorable choice in the common scenario where query time and accuracy (instead of space consumption) are the main concern.

The remainder of the paper is organized as follows. Section 2 defines the problem that we study. Section 3 discusses the major existing methods for SimRank computation. Section 4 presents the SLING index, with a focus on single-pair queries. Section 5 proposes techniques to optimize the practical performance of SLING. Section 6 details how SLING supports single-source queries. Section 7 experimentally evaluates SLING against the state of the art.

2. PRELIMINARIES
Let G be a directed and unweighted graph with n nodes and m edges. We aim to construct an index structure on G to support single-pair and single-source SimRank queries, which are defined as follows:

• A single-pair SimRank query takes as input two nodes u and v in G, and returns their SimRank score $s(u, v)$ (see Equation (1)).

• A single-source SimRank query takes as input a node u, and returns $s(u, v)$ for each node v in G.

Following previous work [8, 23, 24, 32], we allow an additive error of at most $\varepsilon \in (0, 1)$ in each SimRank score returned for any SimRank query.

For ease of exposition, we focus on single-pair SimRank queries in Sections 3-5, and then discuss single-source queries in Section 6. Table 2 shows the notations frequently used in the paper. Unless otherwise specified, all logarithms in this paper are to base e.

Table 2: Table of notations.

| Notation | Description |
|---|---|
| $G$ | the input graph |
| $n, m$ | the numbers of nodes and edges in G |
| $v_i$ | the i-th node in G |
| $I(v)$ | the set of in-neighbors of a node v in G |
| $s(v_i, v_j)$ | the SimRank score of two nodes $v_i$ and $v_j$ in G |
| $c$ | the decay factor in the definition of SimRank |
| $\varepsilon$ | the maximum additive error allowed in a SimRank score |
| $\delta$ | the failure probability of a Monte-Carlo algorithm |
| $M(i, j)$ | the entry on the i-th row and j-th column of a matrix M |
| $d_k$ | the correction factor for node $v_k$ |
| $h^{(\ell)}(v_i, v_j)$ | the hitting probability (HP) from node $v_i$ to node $v_j$ at step ℓ (see Section 4.2) |

3. ANALYSIS OF EXISTING METHODS
This section revisits the three major approaches to SimRank computation: the power method [14], the Monte Carlo method [8], and the linearization method [17, 24, 25, 32]. The asymptotic performance of the Monte Carlo method and the linearization method has been studied in the literature, but to our knowledge, there is no formal analysis regarding their space and time complexities when ensuring ε worst-case errors. We remedy this issue with detailed discussions on each method's asymptotic bounds and limitations.

3.1 The Power Method
The power method [14] is an iterative method for computing the SimRank scores of all pairs of nodes in an input graph. The method uses an n × n matrix S, where the element $S(i, j)$ on the i-th row and j-th column ($i, j \in [1, n]$) denotes the SimRank score of the i-th node $v_i$ and j-th node $v_j$. Initially, the method sets

$$S(i, j) = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

After that, in the t-th ($t \ge 1$) iteration, the method updates S based on the following equation:

$$S(i, j) = \begin{cases} 1, & \text{if } i = j \\ \dfrac{c}{|I(v_i)|\,|I(v_j)|} \displaystyle\sum_{v_k \in I(v_i),\, v_\ell \in I(v_j)} S(k, \ell), & \text{otherwise} \end{cases}$$

Let $S^{(t)}$ denote the version of S right after the t-th iteration. Lizorkin et al. [23] establish the following connection between t and the errors in the SimRank scores in $S^{(t)}$:

LEMMA 1 ([23]). If $t \ge \log_c(\varepsilon \cdot (1 - c)) - 1$, then for any $i, j \in [1, n]$, we have $\left|S^{(t)}(i, j) - s(v_i, v_j)\right| \le \varepsilon$.

Based on Lemma 1 and the fact that each iteration of the power method takes $O(m^2)$ time, we conclude that the power method runs in $O\left(m^2 \log\frac{1}{\varepsilon}\right)$ time when ensuring ε worst-case error. In addition, it requires $O(n^2)$ space (caused by S). These large complexities in time and space make the power method applicable only to small graphs.
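The iteration above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `power_method`, the adjacency-list input `in_nbrs`, and the convention that a pair involving a node without in-neighbors gets score 0 are assumptions made here. The iteration count follows Lemma 1.

```python
import math

def power_method(in_nbrs, c=0.6, eps=1e-3):
    """All-pair SimRank via the iteration of Section 3.1.

    in_nbrs[i] lists the in-neighbors of node i (0-based); hypothetical input format.
    """
    n = len(in_nbrs)
    # Lemma 1: t >= log_c(eps * (1 - c)) - 1 iterations give at most eps error.
    t = max(1, math.ceil(math.log(eps * (1 - c), c) - 1))
    S = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        nxt = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    nxt[i][j] = 1.0
                elif in_nbrs[i] and in_nbrs[j]:
                    total = sum(S[a][b] for a in in_nbrs[i] for b in in_nbrs[j])
                    nxt[i][j] = c * total / (len(in_nbrs[i]) * len(in_nbrs[j]))
                # Assumption: if either node has no in-neighbors, the score stays 0.
        S = nxt
    return S

# Toy graph with edges 0 -> 1 and 0 -> 2: nodes 1 and 2 share their only in-neighbor.
S = power_method([[], [0], [0]], c=0.6)
print(S[1][2])
```

On this toy graph, Equation (1) gives s(v1, v2) = c · s(v0, v0) = c directly, which makes it a convenient sanity check. The nested loops also make the quadratic space and per-iteration cost discussed above visible.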

3.2 The Monte Carlo Method
The Monte Carlo method [8] is motivated by an alternative definition of SimRank scores [14] that utilizes the concept of reverse random walks. Given a node $w_0$ in G, a reverse random walk from $w_0$ is a sequence of nodes $W = \langle w_0, w_1, w_2, \ldots \rangle$, such that $w_{i+1}$ ($i \ge 0$) is selected uniformly at random from the in-neighbors of $w_i$. We refer to $w_i$ as the i-th step of W.

Suppose that we have two reverse random walks $W_i$ and $W_j$ that start from two nodes $v_i$ and $v_j$, respectively, and they first meet at the τ-th step. That is, the τ-th steps of $W_i$ and $W_j$ are identical, but for any $\ell \in [0, \tau)$, the ℓ-th step of $W_i$ differs from the ℓ-th step of $W_j$. Jeh and Widom [14] establish the following connection between τ and the SimRank score of $v_i$ and $v_j$:

$$s(v_i, v_j) = E[c^{\tau}], \tag{2}$$

where $E[\cdot]$ denotes the expectation of a random variable.

Based on Equation (2), the Monte Carlo method [8] pre-computes a set $\mathcal{W}_i$ of reverse random walks from each node $v_i$ in G, such that (i) each set $\mathcal{W}_i$ has the same number $n_w$ of walks, and (ii) each walk in $\mathcal{W}_i$ is truncated at step t, i.e., the nodes after the t-th step are omitted. (This truncation is necessary to ensure that the walk is computed efficiently.) Then, given two nodes $v_i$ and $v_j$, the method estimates their SimRank score as

$$\hat{s}(v_i, v_j) = \frac{1}{n_w} \sum_{\ell=1}^{n_w} c^{\tau_\ell},$$

where $\tau_\ell$ denotes the step at which the ℓ-th walk in $\mathcal{W}_i$ first meets the ℓ-th walk in $\mathcal{W}_j$ (a pair of truncated walks that never meet contributes 0 to the sum). Fogaras and Rácz [8] show that, with at least $1 - 2\exp\left(-\frac{6}{7} n_w \varepsilon^2\right)$ probability,

$$\left|\hat{s}(v_i, v_j) - E[\hat{s}(v_i, v_j)]\right| \le \varepsilon. \tag{3}$$

However, we note that $E[\hat{s}(v_i, v_j)] \ne s(v_i, v_j)$, due to the truncation imposed on the reverse random walks in $\mathcal{W}_i$ and $\mathcal{W}_j$. To address this issue, we present the following inequality:

$$\begin{aligned} \left|E[\hat{s}(v_i, v_j)] - s(v_i, v_j)\right| &= \left|E[c^{\tau}] - \Pr[\tau \le t] \cdot E[c^{\tau} \mid \tau \le t]\right| \\ &= \left|\Pr[\tau > t] \cdot E[c^{\tau} \mid \tau > t]\right| \\ &\le c^{t+1} \end{aligned} \tag{4}$$

By Equations (3) and (4) and the union bound, it can be verified that when $t > \log_c \frac{\varepsilon}{2}$ and $n_w \ge \frac{14}{3\varepsilon^2}\left(\log\frac{2}{\delta} + 2\log n\right)$,

$$\left|\hat{s}(v_i, v_j) - s(v_i, v_j)\right| \le \varepsilon$$

holds for all pairs of $v_i$ and $v_j$ with at least 1 − δ probability. In that case, the space and preprocessing time complexities of the Monte Carlo method are both $O(n \cdot n_w \cdot t) = O\left(\frac{n}{\varepsilon^2}\log\frac{1}{\varepsilon}\log\frac{n}{\delta}\right)$. In addition, the method takes $O\left(\frac{1}{\varepsilon^2}\log\frac{1}{\varepsilon}\log\frac{n}{\delta}\right)$ time to answer a single-pair SimRank query, and $O\left(\frac{n}{\varepsilon^2}\log\frac{1}{\varepsilon}\log\frac{n}{\delta}\right)$ time to process a single-source SimRank query. These space and time complexities are rather unfavorable under typical settings of ε in practice (e.g., ε = 0.01). Fogaras and Rácz [8] alleviate this issue with a coupling technique, which improves the practical performance of the Monte Carlo method in terms of pre-computation time and space consumption. Nevertheless, the method still incurs significant overheads, due to which it is unable to handle graphs with over one million nodes, as we show in Section 7.
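As a rough illustration of the estimator described in this section, the sketch below samples truncated reverse random walks and averages $c^{\tau_\ell}$ over the sampled pairs. It is a simplification under stated assumptions: walks are generated at query time rather than pre-computed and stored, the coupling technique of [8] is not modelled, a walk is assumed to end early at a node without in-neighbors, and the names `reverse_walk` and `mc_simrank` are hypothetical.

```python
import random

def reverse_walk(in_nbrs, start, t, rng):
    """Reverse random walk from `start`, truncated at step t."""
    walk = [start]
    for _ in range(t):
        nbrs = in_nbrs[walk[-1]]
        if not nbrs:
            break  # assumption: the walk ends at a node with no in-neighbors
        walk.append(rng.choice(nbrs))
    return walk

def mc_simrank(in_nbrs, i, j, c=0.6, t=10, n_w=2000, seed=0):
    """Average of c^tau over n_w pairs of truncated walks (0 if a pair never meets)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_w):
        wi = reverse_walk(in_nbrs, i, t, rng)
        wj = reverse_walk(in_nbrs, j, t, rng)
        for step, (a, b) in enumerate(zip(wi, wj)):
            if a == b:            # first step at which the two walks coincide
                total += c ** step
                break
    return total / n_w

graph = [[], [0], [0]]            # edges 0 -> 1 and 0 -> 2
print(mc_simrank(graph, 1, 2))
```

On the toy graph every pair of walks from nodes 1 and 2 meets at step 1, so the estimate matches c up to rounding. On larger graphs the choice of `t` and `n_w` would follow the bounds derived above.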

3.3 The Linearization Method
Let S and P be two n × n matrices, with $S(i, j) = s(v_i, v_j)$ and

$$P(i, j) = \begin{cases} 1/|I(v_j)|, & \text{if } v_i \in I(v_j) \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

Yu et al. [33] show that Equation (1) (i.e., the definition of SimRank) can be rewritten as

$$S = (cP^{\top} S P) \vee I, \tag{6}$$

where I is an n × n identity matrix, $P^{\top}$ is the transpose of P, and ∨ is the element-wise maximum operator, i.e., $(A \vee B)(i, j) = \max\{A(i, j), B(i, j)\}$ for any two matrices A and B and any i, j.

Maehara et al. [24] point out that solving Equation (6) is difficult since it is a non-linear problem due to the ∨ operator. To circumvent this difficulty, they prove that there exists an n × n diagonal matrix D (referred to as the diagonal correction matrix), such that

$$S = cP^{\top} S P + D. \tag{7}$$


Furthermore, once D is given, one can uniquely derive S based on the following lemma by Maehara et al. [24]:

LEMMA 2 ([24]). Given the diagonal correction matrix D,

$$S = \sum_{\ell=0}^{+\infty} c^{\ell} \left(P^{\ell}\right)^{\top} D P^{\ell}, \tag{8}$$

where $P^{\ell}$ denotes the ℓ-th power of P.

Given Lemma 2, Maehara et al. [24] propose the linearization method, which pre-computes D and then uses it to answer SimRank queries based on Equation (8). In particular, for any two nodes $v_i$ and $v_j$, Equation (8) leads to

$$s(v_i, v_j) = \sum_{\ell=0}^{+\infty} c^{\ell} \left(P^{\ell} \cdot \vec{e}_i\right)^{\top} D \left(P^{\ell} \cdot \vec{e}_j\right), \tag{9}$$

where $\vec{e}_k$ denotes an n-element column vector where the k-th element equals 1 and all other elements equal 0. To avoid the infinite series in Equation (9), the linearization method approximates $s(v_i, v_j)$ with

$$\tilde{s}(v_i, v_j) = \sum_{\ell=0}^{t} c^{\ell} \left(P^{\ell} \cdot \vec{e}_i\right)^{\top} D \left(P^{\ell} \cdot \vec{e}_j\right), \tag{10}$$

which can be computed in $O(m \cdot t)$ time. It can be shown that if D is precise and $t \ge \log_c(\varepsilon \cdot (1 - c)) - 1$, then

$$\left|\tilde{s}(v_i, v_j) - s(v_i, v_j)\right| \le \varepsilon. \tag{11}$$

Therefore, given an exact D, the linearization method answers any single-pair SimRank query in $O\left(m \log\frac{1}{\varepsilon}\right)$ time. With a slight modification of Equation (10), the method can also process any single-source SimRank query in $O\left(m \log^2\frac{1}{\varepsilon}\right)$ time.
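The truncated series in Equation (10) can be evaluated with two sparse vectors that are repeatedly multiplied by P, which is where the $O(m \cdot t)$ cost comes from. The following sketch illustrates this, assuming that an exact diagonal D is already available (the very assumption questioned in the next paragraph). The names `apply_P` and `linearized_simrank` are hypothetical, and the values of D for the toy graph were worked out by hand for c = 0.6.

```python
def apply_P(in_nbrs, x):
    """Return P @ x, where P(a, b) = 1/|I(v_b)| if v_a is an in-neighbor of v_b."""
    y = [0.0] * len(x)
    for b, mass in enumerate(x):
        if mass and in_nbrs[b]:
            share = mass / len(in_nbrs[b])
            for a in in_nbrs[b]:
                y[a] += share
    return y

def linearized_simrank(in_nbrs, D, i, j, c=0.6, t=20):
    """Truncated series of Equation (10); D holds the diagonal of the correction matrix."""
    n = len(in_nbrs)
    x = [0.0] * n
    y = [0.0] * n
    x[i], y[j] = 1.0, 1.0                 # e_i and e_j
    score, decay = 0.0, 1.0
    for _ in range(t + 1):
        score += decay * sum(D[k] * x[k] * y[k] for k in range(n))
        x, y = apply_P(in_nbrs, x), apply_P(in_nbrs, y)   # P^(l+1) e_i, P^(l+1) e_j
        decay *= c
    return score

graph = [[], [0], [0]]                    # edges 0 -> 1 and 0 -> 2
D = [1.0, 0.4, 0.4]                       # hand-derived for this toy graph with c = 0.6
print(linearized_simrank(graph, D, 1, 2))
```

Each call to `apply_P` touches every edge at most once, so the loop performs O(m) work per term of the series.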

Unfortunately, the linearization method does not precisely derive D, due to which the above time complexities do not hold in general. Specifically, Maehara et al. [24] formulate D as the solution to a linear system, and propose to solve an approximate version of the system to derive an estimation $\tilde{D}$ of D. However, there is no formal analysis on the errors in $\tilde{D}$ and their effects on the accuracy of SimRank computation. In addition, the technique used to solve the approximate linear system is not guaranteed to converge, i.e., it may not return $\tilde{D}$ in bounded time. Furthermore, even if the technique does converge, its time complexity relies on a parameter that is unknown in advance, and may even dominate n, m, and 1/ε. This makes it rather difficult to analyze the pre-computation time of the linearization method. We refer interested readers to Appendix A for detailed discussions on these issues.

In summary, the linearization method by Maehara et al. [24] does not guarantee ε worst-case error in each SimRank score returned, and there is no non-trivial bound on its preprocessing time. This problem is partially addressed in recent work [32] by Yu and McCann, who propose a variant of the linearization method that does not pre-compute the diagonal correction matrix D, but implicitly derives D during query processing. Yu and McCann's technique is able to ensure ε worst-case error in SimRank computation, but as a trade-off, it requires $O\left(mn \log\frac{1}{\varepsilon}\right)$ time to answer a single-pair SimRank query, which renders it inapplicable on any sizable graph.

4. OUR SOLUTION
This section presents our SLING index for SimRank queries. SLING is based on a new interpretation of SimRank scores, which we clarify in Section 4.1. After that, Sections 4.3-4.5 provide details of SLING and analyze its theoretical guarantees.

4.1 New Interpretation of SimRank
Let c be the decay factor in the definition of SimRank (see Equation (1)). Suppose that we perform a reverse random walk from any node u in G, such that

• At each step of the walk, we stop with $1 - \sqrt{c}$ probability;

• With the other $\sqrt{c}$ probability, we inspect the in-neighbors of the node at the current step, and select one of them uniformly at random as the next step.

We refer to such a reverse random walk as a $\sqrt{c}$-walk from u. In addition, we say that two $\sqrt{c}$-walks meet, if for a certain $\ell \ge 0$, the ℓ-th steps of the two walks are identical. (Note that the 0-th step of a $\sqrt{c}$-walk is its starting node.) The following lemma shows an interesting connection between $\sqrt{c}$-walks and SimRank.

LEMMA 3. Let $W_i$ and $W_j$ be two $\sqrt{c}$-walks from two nodes $v_i$ and $v_j$, respectively. Then, $s(v_i, v_j)$ equals the probability that $W_i$ and $W_j$ meet.

The above formulation of SimRank is similar in spirit to the one used in the Monte Carlo method [8] (see Section 3.2), but differs in one crucial aspect: each $\sqrt{c}$-walk in our formulation has an expected length of $\frac{1}{1 - \sqrt{c}}$, whereas each reverse random walk in the previous formulation is infinite. As a consequence, if we are to estimate $s(v_i, v_j)$ using a sample set of $\sqrt{c}$-walks from $v_i$ and $v_j$, we do not need to truncate any $\sqrt{c}$-walk for efficiency; in contrast, the Monte Carlo method [8] must trim each reverse random walk to trade estimation accuracy for bounded computation time. In fact, if we incorporate $\sqrt{c}$-walks into the Monte Carlo method, then its query time complexities are immediately improved by a factor of $\log\frac{1}{\varepsilon}$. Nonetheless, the space and time overheads of this revised method still leave much room for improvement, since it requires $O\left(\log\frac{n}{\delta}/\varepsilon^2\right)$ $\sqrt{c}$-walks for each node, where δ is the upper bound on the method's failure probability. This motivates us to develop the SLING method for more efficient SimRank computation, which we elaborate in the following sections.
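Lemma 3 can be checked empirically by sampling pairs of $\sqrt{c}$-walks and counting how often they meet. The sketch below does exactly that; it is meant only to illustrate the interpretation and is not the SLING index itself. The helper names (`sqrt_c_walk`, `walks_meet`, `estimate_simrank`), the sample count, and the rule that a walk also stops at a node without in-neighbors are assumptions made for this example.

```python
import math
import random

def sqrt_c_walk(in_nbrs, start, c, rng):
    """Reverse walk that continues with probability sqrt(c) at every step."""
    walk = [start]
    p_continue = math.sqrt(c)
    while True:
        nbrs = in_nbrs[walk[-1]]
        # Stop with probability 1 - sqrt(c); assumption: also stop at a dead end.
        if rng.random() >= p_continue or not nbrs:
            return walk
        walk.append(rng.choice(nbrs))

def walks_meet(w1, w2):
    """Two walks meet if they occupy the same node at the same step."""
    return any(a == b for a, b in zip(w1, w2))

def estimate_simrank(in_nbrs, i, j, c=0.6, samples=20000, seed=0):
    """Fraction of sampled walk pairs that meet, an estimate of s(v_i, v_j) by Lemma 3."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        wi = sqrt_c_walk(in_nbrs, i, c, rng)
        wj = sqrt_c_walk(in_nbrs, j, c, rng)
        hits += walks_meet(wi, wj)
    return hits / samples

graph = [[], [0], [0]]             # edges 0 -> 1 and 0 -> 2
print(estimate_simrank(graph, 1, 2))
```

No truncation parameter appears anywhere in this sketch, since each walk terminates on its own after $\frac{1}{1-\sqrt{c}}$ steps in expectation. On the toy graph the two walks meet exactly when both continue past step 0, which happens with probability $\sqrt{c} \cdot \sqrt{c} = c$.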

4.2 Key Idea of SLING
Let $h^{(\ell)}(v_a, v_b)$ denote the probability that a $\sqrt{c}$-walk from $v_a$ arrives at $v_b$ in its ℓ-th step. We refer to $h^{(\ell)}(v_a, v_b)$ as the hitting probability (HP) from $v_a$ to $v_b$ at step ℓ. Observe that, for any two $\sqrt{c}$-walks $W_i$ and $W_j$ from two nodes $v_i$ and $v_j$, respectively, the probability that they meet at $v_k$ at the ℓ-th step is

$$h^{(\ell)}(v_i, v_k) \cdot h^{(\ell)}(v_j, v_k).$$

Since $s(v_i, v_j)$ equals the probability that $W_i$ and $W_j$ meet, one may attempt to compute $s(v_i, v_j)$ by summing the probability that $W_i$ and $W_j$ meet over all combinations of meeting nodes and meeting steps, i.e.,

$$s^{*}(v_i, v_j) = \sum_{\ell=0}^{+\infty} \sum_{k=1}^{n} \left(h^{(\ell)}(v_i, v_k) \cdot h^{(\ell)}(v_j, v_k)\right). \tag{12}$$

However, this formulation is incorrect, because the events that "$W_i$ and $W_j$ meet at node $v_x$ at step ℓ" and "$W_i$ and $W_j$ meet at node $v_y$ at step ℓ′ > ℓ" are not mutually exclusive. For example, assume that $v_i = v_j$, and $v_i$ has only one in-neighbor $v_k$. In that case, $W_i$ and $W_j$ have 100% probability to meet at $v_i$ at the 0-th step, and a non-zero probability to meet at $v_k$ at the first step. This leads to $s^{*}(v_i, v_j) > 1$, whereas $s(v_i, v_j) = 1$ by definition.

Interestingly, Equation (12) can be fixed if we substitute $h^{(\ell)}(v_i, v_k) \cdot h^{(\ell)}(v_j, v_k)$ with the probability of the event that "$W_i$ and $W_j$ meet at $v_k$ at step ℓ, but never meet again afterwards".


To explain this, observe that the above event indicates that $W_i$ and $W_j$ last meet at $v_k$ at step ℓ. If we change $v_k$ (resp. ℓ) in the event, then $W_i$ and $W_j$ should last meet at a different node (resp. step), in which case the changed event and the original one are mutually exclusive. Based on this observation, the following lemma presents a remedy to Equation (12).

LEMMA 4. Let $d_k$ be the probability that two $\sqrt{c}$-walks from node $v_k$ do not meet each other after the 0-th step. Then, for any two nodes $v_i$ and $v_j$,

$$s(v_i, v_j) = \sum_{\ell=0}^{+\infty} \sum_{k=1}^{n} \left(h^{(\ell)}(v_i, v_k) \cdot d_k \cdot h^{(\ell)}(v_j, v_k)\right). \tag{13}$$

In what follows, we refer to $d_k$ as the correction factor for $v_k$.

Based on Lemma 4, we propose to pre-compute approximate versions of $d_k$ and HPs $h^{(\ell)}(v_i, v_k)$, and then use them to estimate SimRank scores based on Equation (13). The immediate problem here is that there exists an infinite number of HPs $h^{(\ell)}(v_i, v_k)$ to approximate, since we need to consider all $\ell \ge 0$. However, we observe that if we allow an additive error in the approximate values, then most of the HPs can be estimated as zero and be omitted. In particular, we have the following observation:

OBSERVATION 1. For any node $v_i$ and $\ell \ge 0$, there exist at most $(\sqrt{c})^{\ell}/\varepsilon_h$ nodes $v_k$ such that $h^{(\ell)}(v_i, v_k) \ge \varepsilon_h$.

To understand this, recall that each $\sqrt{c}$-walk has only $(\sqrt{c})^{\ell}$ probability to not stop before the ℓ-th step, i.e.,

$$\sum_{k=1}^{n} h^{(\ell)}(v_i, v_k) = (\sqrt{c})^{\ell}.$$

Therefore, at most $(\sqrt{c})^{\ell}/\varepsilon_h$ of the HPs at step ℓ can be larger than $\varepsilon_h$. Even if we take into account all $\ell \ge 0$, the total number of HPs above $\varepsilon_h$ is only

$$\sum_{\ell=0}^{+\infty} (\sqrt{c})^{\ell}/\varepsilon_h = O(1/\varepsilon_h).$$

In other words, we only need to retain a constant number of HPs for each node, if we permit a constant additive error in each HP.

Based on the above analysis, we propose the SLING index, which pre-computes an approximate version $\tilde{d}_k$ of each correction factor $d_k$, as well as a constant-size set $H(v_i)$ of approximate HPs for each node $v_i$. To derive the SimRank score of two nodes $v_i$ and $v_j$, SLING first retrieves $\tilde{d}_k$, $H(v_i)$, and $H(v_j)$, and then estimates $s(v_i, v_j)$ in constant time based on an approximate version of Equation (13). The challenge in the design of SLING is threefold. First, how can we derive an accurate estimation of $d_k$? Second, how can we efficiently construct $H(v_i)$ without iterating over all HPs? Third, how do we ensure that all $\tilde{d}_k$ and $H(v_i)$ can jointly guarantee ε worst-case error in each SimRank score computed? In Sections 4.3-4.5, we elaborate how we address these challenges.
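To make Observation 1 and Equation (13) concrete, the sketch below computes the hitting probabilities from a source node level by level, stores only those that are at least $\varepsilon_h$, and combines two such sets with given correction factors. It is a naive illustration under stated assumptions, not SLING's construction algorithm (which is described in later sections): HPs are propagated exactly and filtered only when stored, the correction factors are assumed to be known, and the names `hitting_probs` and `simrank_from_hps` are hypothetical.

```python
import math

def hitting_probs(in_nbrs, src, c=0.6, eps_h=1e-3):
    """Return [H_0, H_1, ...], where H_l maps v_k to h^(l)(src, v_k) for values >= eps_h."""
    sc = math.sqrt(c)
    cur = {src: 1.0}                      # h^(0): the walk starts at src
    kept, ell = [], 0
    # Once sqrt(c)^l < eps_h, no HP at step l can reach eps_h, so we can stop.
    while cur and sc ** ell >= eps_h:
        kept.append({k: p for k, p in cur.items() if p >= eps_h})
        nxt = {}
        for b, p in cur.items():
            nbrs = in_nbrs[b]
            for a in nbrs:                # continue w.p. sqrt(c), then pick an in-neighbor
                nxt[a] = nxt.get(a, 0.0) + sc * p / len(nbrs)
        cur, ell = nxt, ell + 1
    return kept

def simrank_from_hps(H_i, H_j, d):
    """Approximate Equation (13) using only the stored hitting probabilities."""
    score = 0.0
    for hi, hj in zip(H_i, H_j):          # same step l in both walks
        for k, p in hi.items():
            if k in hj:
                score += p * d[k] * hj[k]
    return score

graph = [[], [0], [0]]                    # edges 0 -> 1 and 0 -> 2
d = [1.0, 0.4, 0.4]                       # correction factors, hand-derived for c = 0.6
H1, H2 = hitting_probs(graph, 1), hitting_probs(graph, 2)
print(simrank_from_hps(H1, H2, d))
```

On graphs where some nodes lack in-neighbors, as in this toy example, the HPs at step ℓ sum to at most $(\sqrt{c})^{\ell}$, so the bound in Observation 1 still holds for the stored sets.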

Before we proceed, we note that there is an interesting connection between Lemmas 2 and 4:

LEMMA 5. Let $P$ and $D$ be as in Lemma 2, and $d_k$ and $h^{(\ell)}(v_i, v_k)$ be as in Lemma 4. For any $i, k \in [1, n]$, $h^{(\ell)}(v_i, v_k) = (\sqrt{c})^{\ell} \cdot P(k, i)$, and $d_k$ equals the $k$-th diagonal element in $D$.

In other words, $h^{(\ell)}(v_i, v_k)$ (resp. $d_k$) can be regarded as a random-walk-based interpretation of the entries in $P$ (resp. diagonal elements in $D$). Therefore, Lemmas 2 and 4 are different interpretations of the same result. The main advantage of our new interpretation is that it gives a physical meaning to $d_k$ which, as we show in Section 4.3, enables us to devise a simple and rigorous algorithm to estimate $d_k$ to any desired precision. In contrast, the only existing method for approximating $D$ [24] fails to provide any non-trivial guarantees in terms of accuracy and efficiency, as we discuss in Section 3.3.

Algorithm 1: A sampling method for estimating $d_k$
Input: a node $v_k$, an error bound $\varepsilon_d$, and a failure probability $\delta_d$
Output: an estimation $\tilde{d}_k$ of $d_k$ with at most $\varepsilon_d$ error, with at least $1 - \delta_d$ probability
1  Let $n_r = \frac{2c^2 + c \cdot \varepsilon_d}{\varepsilon_d^2} \log \frac{2}{\delta_d}$;
2  Let $cnt = 0$;
3  for $x = 1, 2, \cdots, n_r$ do
4      Select two nodes $v_i$ and $v_j$ from $I(v_k)$ uniformly at random;
5      if $v_i \ne v_j$ then
6          Generate two $\sqrt{c}$-walks from $v_i$ and $v_j$, respectively;
7          if the two $\sqrt{c}$-walks meet then
8              $cnt = cnt + 1$;
9  return $\tilde{d}_k = 1 - \frac{c}{|I(v_k)|} - c \cdot \frac{cnt}{n_r}$;

4.3 Estimation of $d_k$

Let $W$ and $W'$ be two $\sqrt{c}$-walks from $v_k$. By definition, $1 - d_k$ is the probability that any of the following events occurs:

1. $W$ and $W'$ meet at the first step.

2. In the first step, $W$ and $W'$ arrive at two different nodes $v_i$ and $v_j$, respectively; but sometime after the first step, $W$ and $W'$ meet.

Note that the above two events are mutually exclusive, and the first event occurs with $\frac{c}{|I(v_k)|}$ probability. For the second event, if we fix a pair of $v_i$ and $v_j$, then the probability that $W$ and $W'$ meet after the first step equals the probability that a $\sqrt{c}$-walk from $v_i$ meets a $\sqrt{c}$-walk from $v_j$; by Lemma 3, this probability is exactly $s(v_i, v_j)$. Therefore, we have

$$d_k = 1 - \frac{c}{|I(v_k)|} - \frac{c}{|I(v_k)|^2} \sum_{\substack{v_i, v_j \in I(v_k) \\ v_i \ne v_j}} s(v_i, v_j). \tag{14}$$

Equation (14) indicates that, if we are to estimate $d_k$, it suffices to derive an estimation of

$$\mu = \frac{1}{|I(v_k)|^2} \sum_{v_i, v_j \in I(v_k) \,\wedge\, v_i \ne v_j} s(v_i, v_j) \tag{15}$$

by sampling $\sqrt{c}$-walks from $v_i$ and $v_j$. In particular, as long as $\mu$ is estimated with an error no more than $\varepsilon_d/c$, the resulting estimation of $d_k$ would have at most $\varepsilon_d$ error. Motivated by this, we propose a sampling method for approximating $d_k$, as shown in Algorithm 1.

In a nutshell, Algorithm 1 generates $n_r$ pairs of $\sqrt{c}$-walks, such that each walk starts from a randomly selected node in $I(v_k)$; after that, the algorithm counts the number $cnt$ of pairs that meet at or after the first step; finally, it returns $\tilde{d}_k = 1 - \frac{c}{|I(v_k)|} - c \cdot \frac{cnt}{n_r}$ as an estimation of $d_k$. By the Chernoff bound (see Appendix D) and the properties of $\sqrt{c}$-walks, we have the following lemma on the theoretical guarantees of Algorithm 1.

LEMMA 6. Algorithm 1 runs in $O\left(\frac{1}{\varepsilon_d^2} \log \frac{1}{\delta_d}\right)$ expected time, and returns $\tilde{d}_k$ such that $|\tilde{d}_k - d_k| \le \varepsilon_d$ holds with at least $1 - \delta_d$ probability.
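The sampling procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the implementation evaluated in the paper: the graph representation (`in_nbrs`, a dict mapping each node to its list of in-neighbors), the function names, the use of the natural logarithm, and the convention that a node without in-neighbors has correction factor 1 are all assumptions made for the example. The two walks are simulated in lockstep, since they can only meet at a step that both of them reach.

```python
import math
import random


def sqrt_c_walks_meet(in_nbrs, u, v, c, rng):
    """Simulate two sqrt(c)-walks from u and v in lockstep; True if they meet."""
    sqrt_c = math.sqrt(c)
    while True:
        if u == v:
            return True
        # Each walk independently survives the next step with probability sqrt(c).
        if rng.random() >= sqrt_c or rng.random() >= sqrt_c:
            return False
        # A walk with no in-neighbor to move to cannot continue (assumption).
        if not in_nbrs[u] or not in_nbrs[v]:
            return False
        u = rng.choice(in_nbrs[u])
        v = rng.choice(in_nbrs[v])


def estimate_dk(in_nbrs, vk, c, eps_d, delta_d, rng=None):
    """Monte Carlo estimate of the correction factor d_k, following Algorithm 1."""
    rng = rng or random.Random()
    nbrs = in_nbrs[vk]
    if not nbrs:
        return 1.0  # assumed convention: walks from vk cannot leave vk
    # Number of walk pairs from Line 1 (natural logarithm assumed).
    n_r = math.ceil((2 * c * c + c * eps_d) / eps_d ** 2 * math.log(2 / delta_d))
    cnt = 0
    for _ in range(n_r):
        vi, vj = rng.choice(nbrs), rng.choice(nbrs)
        if vi != vj and sqrt_c_walks_meet(in_nbrs, vi, vj, c, rng):
            cnt += 1
    return 1 - c / len(nbrs) - c * cnt / n_r
```

For instance, if $v_k$ has two in-neighbors that themselves have no in-neighbors, no sampled pair can meet, and the estimate equals $1 - c/2$ regardless of the random seed.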


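Algorithm 2: A local update method for constructing $H(v_i)$
Input: $G$ and a threshold $\theta$
Output: A set $H(v_i)$ of approximate HPs for each node $v_i$ in $G$
1   Initialize $H(v_i) = \emptyset$ for each node $v_i$;
2   for each node $v_k$ in $G$ do
3       Initialize a set $R_k = \emptyset$ for storing approximate HPs;
4       Insert $\tilde{h}^{(0)}(v_k, v_k) = 1$ into $R_k$;
5       for $\ell = 0, 1, 2, \ldots$ do
6           for each $\tilde{h}^{(\ell)}(v_x, v_k) \in R_k$ do
7               if $\tilde{h}^{(\ell)}(v_x, v_k) \le \theta$ then
8                   remove $\tilde{h}^{(\ell)}(v_x, v_k)$ from $R_k$;
9                   continue;
10              for each out-neighbor $v_i$ of $v_x$ do
11                  if $\tilde{h}^{(\ell+1)}(v_i, v_k) \notin R_k$ then
12                      Insert $\tilde{h}^{(\ell+1)}(v_i, v_k) = \frac{\sqrt{c}}{|I(v_i)|} \cdot \tilde{h}^{(\ell)}(v_x, v_k)$ into $R_k$;
13                  else
14                      Increase $\tilde{h}^{(\ell+1)}(v_i, v_k)$ by $\frac{\sqrt{c}}{|I(v_i)|} \cdot \tilde{h}^{(\ell)}(v_x, v_k)$;
15          if $R_k$ does not contain any HP at step $\ell + 1$ then
16              break;
17      for each $\tilde{h}^{(\ell)}(v_i, v_k) \in R_k$ do
18          Insert $\tilde{h}^{(\ell)}(v_i, v_k)$ into $H(v_i)$;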

4.4 Construction of $H(v_i)$

As mentioned in Section 4.2, we aim to construct a constant-size set $H(v_i)$ for each node $v_i$, such that $H(v_i)$ contains an approximate version $\tilde{h}^{(\ell)}(v_i, v_x)$ of each HP $h^{(\ell)}(v_i, v_x)$ that is sufficiently large. Towards this end, a relatively straightforward solution is to sample a set $W_i$ of $\sqrt{c}$-walks from each $v_i$, and then use $W_i$ to derive approximate HPs. This solution, however, requires $O(1/\varepsilon_h^2)$ walks in $W_i$ to ensure that the additive error in each $\tilde{h}^{(\ell)}(v_i, v_x)$ is at most $\varepsilon_h$, which leads to considerable computation costs when $\varepsilon_h$ is small.

Instead of sampling $\sqrt{c}$-walks, we devise a deterministic method for constructing all $H(v_i)$ in $O(m/\varepsilon_h)$ time while allowing at most $\varepsilon_h$ additive error in each approximate HP. The key idea of our method is to utilize the following equation on HPs:

$$h^{(\ell+1)}(v_i, v_k) = \frac{\sqrt{c}}{|I(v_i)|} \sum_{v_x \in I(v_i)} h^{(\ell)}(v_x, v_k), \tag{16}$$

for any $\ell \ge 0$. Intuitively, Equation (16) indicates that once we have derived the HPs to $v_k$ at step $\ell$, then we can compute the HPs to $v_k$ at step $\ell + 1$. Based on this intuition, our method generates approximate HPs to $v_k$ by processing the steps $\ell$ in ascending order of $\ell$. We note that our method is similar in spirit to the local update algorithm [4,9,15] for estimating personalized PageRanks [15], and we refer interested readers to Appendix B for a discussion on the connections between our method and those in [4,9,15].

Algorithm 2 shows the pseudo-code of our method. Given $G$ and a threshold $\theta$, the algorithm first initializes $H(v_i) = \emptyset$ for each node $v_i$ (Line 1). After that, for each node $v_k$, the algorithm performs a graph traversal from $v_k$ to generate approximate HPs from other nodes to $v_k$. Specifically, for each $v_k$, it first initializes a set $R_k = \emptyset$, and then inserts an HP $\tilde{h}^{(0)}(v_k, v_k) = 1$ into $R_k$, which captures the fact that every $\sqrt{c}$-walk from $v_k$ has 100% probability to hit $v_k$ itself at the 0-th step (Lines 3-4). Then, the algorithm enters an iterative process, such that the $\ell$-th iteration ($\ell \ge 0$) processes the HPs to $v_k$ at step $\ell$ that have been inserted into $R_k$.

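Algorithm 3: An algorithm for single-pair SimRank queries
Input: $\tilde{d}_k$, $H(v_k)$, and two nodes $v_i$ and $v_j$
Output: An approximate SimRank score $\tilde{s}(v_i, v_j)$
1  Let $\tilde{s}(v_i, v_j) = 0$;
2  for each $\tilde{h}^{(\ell)}(v_i, v_k) \in H(v_i)$ do
3      if there exists $\tilde{h}^{(\ell)}(v_j, v_k) \in H(v_j)$ then
4          $\tilde{s}(v_i, v_j) = \tilde{s}(v_i, v_j) + \tilde{h}^{(\ell)}(v_i, v_k) \cdot \tilde{d}_k \cdot \tilde{h}^{(\ell)}(v_j, v_k)$;
5  return $\tilde{s}(v_i, v_j)$;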

In particular, in the $\ell$-th iteration, the algorithm first identifies the approximate HPs $\tilde{h}^{(\ell)}(v_x, v_k)$ in $R_k$ that are at step $\ell$, and processes each of them in turn (Lines 6-16). If $\tilde{h}^{(\ell)}(v_x, v_k) \le \theta$, then it is removed from $R_k$, i.e., the algorithm omits an approximate HP if it is sufficiently small. Meanwhile, if $\tilde{h}^{(\ell)}(v_x, v_k) > \theta$, then the algorithm inspects each out-neighbor $v_i$ of $v_x$, and updates the approximate HP from $v_i$ to $v_k$ at step $\ell + 1$, according to Equation (16). After all approximate HPs at step $\ell$ are processed, the algorithm moves on to step $\ell + 1$, and it terminates the iterative process once $R_k$ contains no HP at the next step (Lines 15-16). Finally, the algorithm inserts each $\tilde{h}^{(\ell)}(v_i, v_k) \in R_k$ into $H(v_i)$, after which it proceeds to the next node $v_{k+1}$.

The following lemma states the guarantees of Algorithm 2.

LEMMA 7. Algorithm 2 runs in $O(m/\theta)$ time, and constructs a set $H(v_i)$ of approximate HPs for each node $v_i$, such that $|H(v_i)| = O(1/\theta)$. In addition, for each $\tilde{h}^{(\ell)}(v_i, v_k) \in H(v_i)$, we have

$$0 \ge \tilde{h}^{(\ell)}(v_i, v_k) - h^{(\ell)}(v_i, v_k) \ge -\frac{1 - (\sqrt{c})^{\ell}}{1 - \sqrt{c}} \cdot \theta.$$
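The local update procedure of Algorithm 2 can be sketched as below. The sketch is illustrative only: the dict-based graph representation (`in_nbrs` and `out_nbrs`, each mapping a node to a list of neighbors, with every node present as a key of `in_nbrs`), the function name, and the choice to key each stored HP by the pair `(step, target)` are assumptions made for this example. Instead of a single set $R_k$, the sketch keeps one frontier per step, which yields the same stored values.

```python
import math
from collections import defaultdict


def build_hp_index(in_nbrs, out_nbrs, c, theta):
    """Return H with H[vi][(l, vk)] = approximate hitting probability h^(l)(vi, vk).

    Assumes theta > 0, so that the traversal from each vk terminates.
    """
    sqrt_c = math.sqrt(c)
    H = defaultdict(dict)
    for vk in in_nbrs:
        frontier = {vk: 1.0}  # approximate HPs to vk at step 0
        level = 0
        while frontier:
            nxt = defaultdict(float)
            for vx, hp in frontier.items():
                if hp <= theta:
                    continue  # pruned: neither stored nor propagated
                H[vx][(level, vk)] = hp
                # Push the HP to every out-neighbor, following Equation (16).
                for vi in out_nbrs.get(vx, ()):
                    nxt[vi] += sqrt_c * hp / len(in_nbrs[vi])
            frontier = nxt
            level += 1
    return dict(H)
```

On a two-node graph with a single edge from `a` to `b`, the index stores $h^{(1)}(b, a) = \sqrt{c}$ as long as $\sqrt{c} > \theta$, and drops it otherwise.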

4.5 Query Method and Complexity Analysis

Given an approximate correction factor $\tilde{d}_k$ and a set $H(v_k)$ of approximate HPs for each node $v_k$, we estimate the SimRank score between any two nodes $v_i$ and $v_j$ according to a revised version of Equation (13):

$$\tilde{s}(v_i, v_j) = \sum_{\ell=0}^{\infty} \sum_{k=1}^{n} \left( \tilde{h}^{(\ell)}(v_i, v_k) \cdot \tilde{d}_k \cdot \tilde{h}^{(\ell)}(v_j, v_k) \right). \tag{17}$$

Algorithm 3 shows the details of our query processing method. To analyze the accuracy guarantee of Algorithm 3, we first present a lemma that quantifies the error in $\tilde{s}(v_i, v_j)$ based on the errors in $\tilde{d}_k$ and $H(v_k)$.

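LEMMA 8. Suppose that $\left| \tilde{d}_k - d_k \right| \le \varepsilon_d$ for any $k$, and

$$0 \ge \tilde{h}^{(\ell)}(v_k, v_x) - h^{(\ell)}(v_k, v_x) \ge -\varepsilon_h^{(\ell)},$$

for any $k, x, \ell$. Then, we have $|\tilde{s}(v_i, v_j) - s(v_i, v_j)| \le \varepsilon$ if

$$\frac{\varepsilon_d}{1 - c} + 2 \sum_{\ell=0}^{+\infty} \left( (\sqrt{c})^{\ell} \cdot \varepsilon_h^{(\ell)} \right) \le \varepsilon.$$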

Combining Lemmas 6, 7, and 8, we have the following theorem.

THEOREM 1. Suppose that we derive each $\tilde{d}_k$ using Algorithm 1 with input $\varepsilon_d$ and $\delta_d$, and we construct each $H(v_k)$ using Algorithm 2 with input $\theta$. If $\delta_d \le \delta/n$ and

$$\frac{\varepsilon_d}{1 - c} + \frac{2\sqrt{c}}{(1 - \sqrt{c})(1 - c)} \, \theta \le \varepsilon,$$

then Algorithm 3 incurs an additive error at most $\varepsilon$ in each SimRank score returned, with at least $1 - \delta$ probability.
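The coefficient of $\theta$ in Theorem 1 can be recovered by substituting the per-step error bound of Lemma 7 into the condition of Lemma 8. The following is a sketch of that substitution (a reconstruction of the intermediate algebra, assumed here rather than quoted from the formal proof), taking $\varepsilon_h^{(\ell)} = \frac{1 - (\sqrt{c})^{\ell}}{1 - \sqrt{c}} \, \theta$:

$$2 \sum_{\ell=0}^{+\infty} (\sqrt{c})^{\ell} \cdot \frac{1 - (\sqrt{c})^{\ell}}{1 - \sqrt{c}} \, \theta = \frac{2\theta}{1 - \sqrt{c}} \left( \frac{1}{1 - \sqrt{c}} - \frac{1}{1 - c} \right) = \frac{2\sqrt{c}}{(1 - \sqrt{c})(1 - c)} \, \theta.$$

The last step uses $\frac{1}{1-\sqrt{c}} - \frac{1}{1-c} = \frac{(1+\sqrt{c}) - 1}{1-c} = \frac{\sqrt{c}}{1-c}$.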


By Theorem 1, we can ensure $\varepsilon$ worst-case error in each SimRank score by setting $\varepsilon_d = O(\varepsilon)$, $\theta = O(\varepsilon)$, and $\delta_d = \delta/n$. In that case, our SLING index requires $O\left(m/\varepsilon + n \log \frac{n}{\delta} / \varepsilon^2\right)$ pre-computation time and $O(n/\varepsilon)$ space, and it answers any single-pair SimRank query in $O(1/\varepsilon)$ time. The space (resp. query time) complexity of SLING is only $O(1/\varepsilon)$ times larger than the optimal value, since any SimRank method (that ensures $\varepsilon$ worst-case error) requires $\Omega(n)$ space for storing the information about all nodes, and takes at least $\Omega(1)$ time to output the result of a single-pair SimRank query.
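The single-pair query of Algorithm 3 amounts to a join of two small hash maps, as the following sketch illustrates. The data layout is an assumption made for the example: `H[v]` is a dict keyed by `(step, target)` pairs holding approximate HPs, and `d` maps each node to its approximate correction factor; neither name comes from the paper.

```python
def single_pair_simrank(H, d, vi, vj):
    """Approximate s(vi, vj) by summing h(vi, vk) * d_k * h(vj, vk) over shared keys."""
    Hi, Hj = H.get(vi, {}), H.get(vj, {})
    # Iterate over the smaller set and probe the larger one.
    if len(Hi) > len(Hj):
        Hi, Hj = Hj, Hi
    return sum(
        hp * d[vk] * Hj[(level, vk)]
        for (level, vk), hp in Hi.items()
        if (level, vk) in Hj
    )


# Toy index for a graph with edges a -> b and a -> x, using c = 0.64.
H = {
    "a": {(0, "a"): 1.0},
    "b": {(0, "b"): 1.0, (1, "a"): 0.8},
    "x": {(0, "x"): 1.0, (1, "a"): 0.8},
}
d = {"a": 1.0, "b": 0.36, "x": 0.36}
score_bx = single_pair_simrank(H, d, "b", "x")
score_bb = single_pair_simrank(H, d, "b", "b")
```

In this toy index the HPs and correction factors are exact, so `score_bx` agrees with Equation (1), which gives $s(b, x) = c \cdot s(a, a) = 0.64$, and `score_bb` evaluates to 1 up to floating-point rounding.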

5. OPTIMIZATIONS

This section presents optimization techniques to (i) improve the efficiency of estimating each correction factor $d_k$ (Section 5.1), (ii) reduce the space consumption of SLING (Section 5.2), (iii) enhance the accuracy of SLING (Section 5.3), and (iv) incorporate parallel and out-of-core computation into SLING's index construction algorithm (Section 5.4).

5.1 Improved Estimation of $d_k$

As discussed in Section 4.3, Algorithm 1 generates an approximate correction factor $\tilde{d}_k$ in $O\left(\varepsilon_d^{-2} \log \delta_d^{-1}\right)$ expected time, where $\varepsilon_d$ is the maximum error allowed in $\tilde{d}_k$, and $\delta_d$ is the failure probability. As the algorithm's time complexity is quadratic in $1/\varepsilon_d$, it is not particularly efficient when $\varepsilon_d$ is small. This relative inefficiency is caused by the fact that the algorithm requires $O\left(\varepsilon_d^{-2} \log \delta_d^{-1}\right)$ pairs of $\sqrt{c}$-walks to estimate the value $\mu$ (in Equation (15)) with $\varepsilon_d/c$ worst-case error.

However, we observe that we can often use a much smaller number of $\sqrt{c}$-walk pairs to derive an estimation of $\mu$ with at most $\varepsilon_d/c$ error. Specifically, by the Chernoff bound (see Appendix D), we only need $O\left((\mu + \varepsilon_d) \cdot \varepsilon_d^{-2} \log \delta_d^{-1}\right)$ pairs of $\sqrt{c}$-walks to estimate $\mu$. Evidently, this number is much smaller than $O\left(\varepsilon_d^{-2} \log \delta_d^{-1}\right)$ when $\mu \ll 1$ (which is often the case in practice). For example, if $\mu \le \varepsilon_d$, then the number of $\sqrt{c}$-walk pairs required is only $O(\varepsilon_d^{-1} \log \delta_d^{-1})$. The main issue here is that we do not know $\mu$ in advance. Nevertheless, we can derive an upper bound of $\mu$, and use it to decide an appropriate number of $\sqrt{c}$-walks needed.

Based on the above observation, we propose an improved algorithm for computing $\tilde{d}_k$, as shown in Algorithm 4. The algorithm first generates $n_r = O(\varepsilon_d^{-1} \log \delta_d^{-1})$ pairs of $\sqrt{c}$-walks from randomly selected nodes in $I(v_k)$, and counts the number $cnt$ of pairs that meet (Lines 1-8). Then, it computes $\tilde{\mu} = cnt/n_r$ as an estimation of $\mu$. If $\tilde{\mu} \le \varepsilon_d$, then the algorithm determines that $n_r$ pairs of $\sqrt{c}$-walks are sufficient for an accurate estimation of $\mu$; in that case, it terminates and returns an estimation of $d_k$ based on $\tilde{\mu}$ (Lines 9-11).

On the other hand, if $\tilde{\mu} > \varepsilon_d$, then the algorithm proceeds to generate a larger number of $\sqrt{c}$-walks to derive a more accurate estimation of $\mu$. Towards this end, it first computes $\mu^* = \tilde{\mu} + \sqrt{\tilde{\mu} \cdot \varepsilon_d}$ as an upper bound of $\mu$, and uses $\mu^*$ to decide the total number $n_r^* = O(\mu^* \varepsilon_d^{-2} \log \delta_d^{-1})$ of $\sqrt{c}$-walk pairs that are needed (Lines 12-13). After that, it increases the total number of $\sqrt{c}$-walk pairs to $n_r^*$, and recounts the number $cnt$ of pairs that meet (Lines 14-19). Finally, it derives $\tilde{\mu} = cnt/n_r^*$ as an improved estimation of $\mu$, and returns an approximate correction factor $\tilde{d}_k$ computed based on $\tilde{\mu}$ (Lines 20-21).

The following lemmas establish the asymptotic guarantees of Algorithm 4.

LEMMA 9. With at least $1 - \delta_d$ probability, Algorithm 4 returns $\tilde{d}_k$ such that $|\tilde{d}_k - d_k| \le \varepsilon_d$ holds.

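Algorithm 4: An improved method for estimating $d_k$
Input: a node $v_k$, an error bound $\varepsilon_d$, and a failure probability $\delta_d$
Output: an estimation $\tilde{d}_k$ of $d_k$ with at most $\varepsilon_d$ error, with at least $1 - \delta_d$ probability
1   Let $n_r = \frac{14c}{3\varepsilon_d} \log \frac{4}{\delta_d}$;
2   Let $cnt = 0$;
3   for $x = 1, 2, \cdots, n_r$ do
4       Select two nodes $v_i$ and $v_j$ from $I(v_k)$ uniformly at random;
5       if $v_i \ne v_j$ then
6           Generate two $\sqrt{c}$-walks from $v_i$ and $v_j$, respectively;
7           if the two $\sqrt{c}$-walks meet then
8               $cnt = cnt + 1$;
9   Let $\tilde{\mu} = cnt/n_r$;
10  if $\tilde{\mu} \le \varepsilon_d$ then
11      return $\tilde{d}_k = 1 - \frac{c}{|I(v_k)|} - c \cdot \tilde{\mu}$;
12  Let $\mu^* = \tilde{\mu} + \sqrt{\tilde{\mu} \cdot \varepsilon_d}$;
13  Let $n_r^* = \frac{2c^2 \cdot \mu^* + \frac{2}{3} c \cdot \varepsilon_d}{\varepsilon_d^2} \log \frac{4}{\delta_d}$;
14  for $x = 1, 2, \cdots, n_r^* - n_r$ do
15      Select two nodes $v_i$ and $v_j$ from $I(v_k)$ uniformly at random;
16      if $v_i \ne v_j$ then
17          Generate two $\sqrt{c}$-walks from $v_i$ and $v_j$, respectively;
18          if the two $\sqrt{c}$-walks meet then
19              $cnt = cnt + 1$;
20  $\tilde{\mu} = cnt/n_r^*$;
21  return $\tilde{d}_k = 1 - \frac{c}{|I(v_k)|} - c \cdot \tilde{\mu}$;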

LEMMA 10. Algorithm 4 generates $O\left(\frac{\mu + \varepsilon_d}{\varepsilon_d^2} \log \frac{1}{\delta_d}\right)$ $\sqrt{c}$-walks in expectation, and runs in $O\left(\frac{\mu + \varepsilon_d}{\varepsilon_d^2} \log \frac{1}{\delta_d}\right)$ expected time.

By Lemma 10, Algorithm 4 uses a number of $\sqrt{c}$-walks that is roughly $\max\{\mu, \varepsilon_d\}$ times the number in Algorithm 1, which leads to significantly improved efficiency. In addition, we note that Algorithm 4 can be easily revised into a general method that estimates the expectation $\mu_z$ of a Bernoulli distribution by taking $O\left(\frac{\mu_z + \varepsilon}{\varepsilon^2} \log \frac{1}{\delta}\right)$ samples, while ensuring at most $\varepsilon$ estimation error with at least $1 - \delta$ success probability. In particular, the only major change needed is to replace each $\sqrt{c}$-walk pair in Algorithm 4 with a sample from the Bernoulli distribution. In this context, we can prove that the number of samples used by Algorithm 4 is asymptotically optimal.
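The general Bernoulli version of Algorithm 4 can be sketched as follows. This is a hedged illustration, not the revised algorithm as formally analyzed: the function name, the callable `sample` (assumed to return 0 or 1), the use of the natural logarithm, and the choice to reuse the constants of Algorithm 4 with $c = 1$ (so that the target error is $\varepsilon$ itself) are all assumptions made for the example.

```python
import math


def adaptive_bernoulli_mean(sample, eps, delta):
    """Two-phase estimate of a Bernoulli mean, mirroring the structure of Algorithm 4.

    Returns (estimate, number_of_samples_used).
    """
    log_term = math.log(4 / delta)
    # Phase 1: a small pilot sample of size O((1/eps) * log(1/delta)).
    n1 = math.ceil(14 / (3 * eps) * log_term)
    cnt = sum(sample() for _ in range(n1))
    mu = cnt / n1
    if mu <= eps:
        return mu, n1
    # Phase 2: size the full sample from an upper bound on the mean.
    mu_star = mu + math.sqrt(mu * eps)
    n2 = math.ceil((2 * mu_star + (2 / 3) * eps) / eps ** 2 * log_term)
    n2 = max(n2, n1)  # defensive: never shrink the sample
    cnt += sum(sample() for _ in range(n2 - n1))
    return cnt / n2, n2
```

A typical call would pass something like `lambda: int(random.random() < p)` as `sample`. When the pilot estimate is at most `eps`, the function stops after the first phase, which is where the savings over a fixed sample size of order $\varepsilon^{-2} \log \delta^{-1}$ come from.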

Specifically, let $z_1, z_2, \ldots$ be a sequence of i.i.d. Bernoulli random variables, and $\mu_z = E[z_i]$. Let $A$ be an algorithm that inspects $z_i$ in ascending order of $i$, and stops at a certain $z_j$ before returning an estimation $\tilde{\mu}_z$ of $\mu_z$. In addition, for any possible sequence of $z_i$, $A$ runs in finite expected time, and ensures that $|\tilde{\mu}_z - \mu_z| \le \varepsilon$ with at least $1 - \delta$ probability. It can be verified that the revised Algorithm 4 is an instance of $A$. The following lemma shows that no other instance of $A$ can be asymptotically more efficient than Algorithm 4.

LEMMA 11. Any instance of $A$ has $\Omega\left(\frac{\max\{\mu_z, \varepsilon\}}{\varepsilon^2} \log \frac{1}{\delta}\right)$ expected time complexity when $\mu_z < 0.5$.

Our proof of Lemma 11 utilizes an important result by Dagum et al. [7] that establishes a lower bound of the expected time complexity of $A$, when it provides a worst-case guarantee in terms of the relative error (instead of absolute error) in $\mu_z$. Dagum et al. [7] also provide a sampling algorithm whose time complexity matches their lower bound, but the algorithm is inapplicable in our context, since it requires as input a relative error bound, which cannot be translated into an absolute error bound unless $\mu_z$ is known.

Algorithm 5: An algorithm for constructing $H'(v_i)$
Input: a node $v_i$
Output: A set $H'(v_i)$ of precise HPs from $v_i$ at steps 1 and 2
1   Initialize $H'(v_i) = \emptyset$;
2   Insert $h^{(0)}(v_i, v_i) = 1$ into $H'(v_i)$;
3   for each node $v_x \in I(v_i)$ do
4       Insert $h^{(1)}(v_i, v_x) = \frac{\sqrt{c}}{|I(v_i)|}$ into $H'(v_i)$;
5       for each node $v_y \in I(v_x)$ do
6           if $h^{(2)}(v_i, v_y) \notin H'(v_i)$ then
7               Insert $h^{(2)}(v_i, v_y) = \frac{\sqrt{c}}{|I(v_x)|} \cdot h^{(1)}(v_i, v_x)$ into $H'(v_i)$;
8           else
9               Increase $h^{(2)}(v_i, v_y)$ by $\frac{\sqrt{c}}{|I(v_x)|} \cdot h^{(1)}(v_i, v_x)$ in $H'(v_i)$;
10  return $H'(v_i)$

5.2 Reduction of Space Consumption

Recall that our SLING index pre-computes a set $H(v_i)$ of approximate HPs for each node $v_i$, such that each $\tilde{h}^{(\ell)}(v_i, v_k) \in H(v_i)$ is no smaller than a threshold $\theta = O(\varepsilon)$. The total size of all $H(v_i)$ is $O(n/\varepsilon)$, which is asymptotically near-optimal, but may still be costly from a practical perspective (especially when $\varepsilon$ is small). To address this issue, we aim to reduce the size of $H(v_i)$ without affecting the time complexity of SLING.

We observe that, in each $H(v_i)$, a significant portion of the approximate HPs are in the form of $\tilde{h}^{(1)}(v_i, v_k)$ or $\tilde{h}^{(2)}(v_i, v_k)$, i.e., they concern the HPs from $v_i$ to the nodes within two hops away from $v_i$. On the other hand, such HPs can be easily computed using a two-hop traversal from $v_i$, as we will show shortly. This leads to the following idea for space reduction: we remove from $H(v_i)$ all approximate HPs that are at steps 1 and 2, and we recompute those HPs on the fly during query processing. The re-computation may lead to slightly increased query cost, but as long as it takes $O(1/\varepsilon)$ time, it would not affect the asymptotic performance of SLING. In the following, we clarify how we implement this idea.

First, we present a simple and precise algorithm for computing the set $H'(v_i)$ of HPs from node $v_i$ to other nodes at steps 1 and 2, as shown in Algorithm 5. The algorithm first initializes a set $H'(v_i) = \emptyset$ for storing HPs, and then inserts $h^{(0)}(v_i, v_i) = 1$ into $H'(v_i)$. After that, for each in-neighbor $v_x$ of $v_i$, it sets $h^{(1)}(v_i, v_x) = \frac{\sqrt{c}}{|I(v_i)|}$, which is the exact probability that a $\sqrt{c}$-walk from $v_i$ would hit $v_x$ at step 1. In turn, for each in-neighbor $v_y$ of $v_x$, the algorithm initializes $h^{(2)}(v_i, v_y) = \frac{\sqrt{c}}{|I(v_x)|} \cdot h^{(1)}(v_i, v_x)$ in $H'(v_i)$, if it is not yet inserted into $H'(v_i)$; otherwise, the algorithm increases $h^{(2)}(v_i, v_y)$ by $\frac{\sqrt{c}}{|I(v_x)|} \cdot h^{(1)}(v_i, v_x)$ in $H'(v_i)$. The reason is that if a $\sqrt{c}$-walk from $v_i$ hits $v_x$ at step 1, then it has $\frac{\sqrt{c}}{|I(v_x)|}$ probability to hit $v_y$ at step 2. After all of $v_i$'s in-neighbors are processed, the algorithm terminates and returns $H'(v_i)$.

Algorithm 5 runs in time linear to the total number $\eta(v_i)$ of incoming edges of $v_i$ and its in-neighbors, i.e.,

$$\eta(v_i) = |I(v_i)| + \sum_{v_x \in I(v_i)} |I(v_x)|.$$

If $\eta(v_i) = O(1/\varepsilon)$, then we can omit all step-1 and step-2 approximate HPs in $H(v_i)$, and compute them with Algorithm 5 during query processing without degrading the time complexity of SLING; otherwise, we need to retain all approximate HPs in $H(v_i)$. In our implementation of SLING, we set a constant $\gamma = 10$, and we exclude step-1 and step-2 HPs from $H(v_i)$ whenever $\eta(v_i) \le \gamma/\theta$, where $\theta = \Omega(\varepsilon)$ is the HP threshold used in the construction of $H(v_i)$ (see Algorithm 2). Notice that each $\eta(v_i)$ can be computed in $O(|I(v_i)|)$ time by inspecting $v_i$ and all of its in-neighbors; therefore, the total computation cost of all $\eta(v_i)$ is $O(m)$, which does not affect SLING's preprocessing time complexity. Furthermore, the on-the-fly computation of step-1 and step-2 HPs does not degrade SLING's accuracy guarantee, since all HPs returned by Algorithm 5 are precise.
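The two-hop traversal of Algorithm 5 can be sketched as below. The sketch assumes a graph stored as a dict `in_nbrs` from each node to its list of in-neighbors, and it keys each HP by `(step, target)`; these names and the layout are illustrative choices, not part of the paper.

```python
import math


def two_hop_hps(in_nbrs, vi, c):
    """Exact hitting probabilities from vi at steps 0, 1, and 2 (Algorithm 5)."""
    sqrt_c = math.sqrt(c)
    hps = {(0, vi): 1.0}
    for vx in in_nbrs[vi]:
        # Probability of surviving one step and moving to vx.
        h1 = sqrt_c / len(in_nbrs[vi])
        hps[(1, vx)] = h1
        for vy in in_nbrs[vx]:
            # Several step-1 nodes may lead to the same vy, so accumulate.
            hps[(2, vy)] = hps.get((2, vy), 0.0) + sqrt_c * h1 / len(in_nbrs[vx])
    return hps


# Example: a has in-neighbors b and x, which both have the single in-neighbor y.
example_graph = {"a": ["b", "x"], "b": ["y"], "x": ["y"], "y": []}
example_hps = two_hop_hps(example_graph, "a", 0.64)
```

With $c = 0.64$ in this example, each step-1 HP is $0.8/2 = 0.4$, and the two paths to `y` add up to a step-2 HP of about $0.64$, i.e., $(\sqrt{c})^2$.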

5.3 Enhancement of Accuracy

The approximation error of each $H(v_i)$ arises from the fact that it omits the HPs from $v_i$ that are smaller than a threshold $\theta$. A straightforward solution to reduce this error is to decrease $\theta$, but it would increase the space overhead of $H(v_i)$. Instead, we propose to generate additional HPs in $H(v_i)$ on the fly during query processing, to increase the accuracy of query results.

Specifically, for each node $v_i$, after $H(v_i)$ is constructed (with the space reduction procedure in Section 5.2 applied), we inspect the set of approximate HPs $\tilde{h}^{(\ell)}(v_i, v_j)$ in $H(v_i)$ such that $v_j$ has no more than $1/\sqrt{\varepsilon}$ in-neighbors, and then mark the $1/\sqrt{\varepsilon}$ largest HPs in the set. After that, whenever a SimRank query requires utilizing $H(v_i)$, we substitute $H(v_i)$ with an enhanced version $H^*(v_i)$ constructed on the fly. In particular, we first set $H^*(v_i) = H(v_i)$. Then, for every marked HP $\tilde{h}^{(\ell)}(v_i, v_j)$ in $H(v_i)$, we process each in-neighbor $v_k$ of $v_j$ as follows:

• If there exists $\tilde{h}^{(\ell+1)}(v_i, v_k)$ in $H(v_i)$, then we omit $v_k$;

• If $\tilde{h}^{(\ell+1)}(v_i, v_k)$ is not in $H(v_i)$ and has not been inserted into $H^*(v_i)$, then we set $\tilde{h}^{(\ell+1)}(v_i, v_k) = \frac{\sqrt{c}}{|I(v_j)|} \tilde{h}^{(\ell)}(v_i, v_j)$, and insert it into $H^*(v_i)$;

• Otherwise, we update $\tilde{h}^{(\ell+1)}(v_i, v_k)$ in $H^*(v_i)$ as follows:

$$\tilde{h}^{(\ell+1)}(v_i, v_k) = \tilde{h}^{(\ell+1)}(v_i, v_k) + \frac{\sqrt{c}}{|I(v_j)|} \tilde{h}^{(\ell)}(v_i, v_j).$$

In other words, if $H(v_i)$ does not contain an approximate HP from $v_i$ to $v_k$ at step $\ell + 1$, then we generate $\tilde{h}^{(\ell+1)}(v_i, v_k)$ in $H^*(v_i)$.

It can be verified that $0 < \tilde{h}^{(\ell+1)}(v_i, v_k) \le h^{(\ell+1)}(v_i, v_k)$, and hence, $H^*(v_i)$ provides higher accuracy than $H(v_i)$. In addition, the construction of $H^*(v_i)$ requires only $O(1/\varepsilon)$ time, and hence, it does not affect the $O(1/\varepsilon)$ query time complexity of SLING. Furthermore, marking HPs in all $H(v_i)$ requires only $O(n/\sqrt{\varepsilon})$ space and $O(n \log(1/\varepsilon)/\varepsilon)$ preprocessing time, which does not degrade the $O(n/\varepsilon)$ space and $O\left(m/\varepsilon + n \log \frac{n}{\delta} / \varepsilon^2\right)$ preprocessing time complexity of SLING.

5.4 Parallel and Out-of-Core Constructions

The preprocessing algorithms of SLING (i.e., Algorithms 1, 2, and 4) are embarrassingly parallelizable. In particular, Algorithm 1 (and Algorithm 4) can be simultaneously applied to multiple nodes $v_k$ to compute the corresponding approximate correction factors $\tilde{d}_k$. Meanwhile, the main loop of Algorithm 2 (i.e., Lines 2-16) can be parallelized to construct the "reverse" HP sets $R_k$ for multiple nodes $v_k$ at the same time.

Furthermore, SLING does not require the complete index structure to fit in the main memory. Instead, we only need to keep all approximate correction factors $\tilde{d}_k$ ($k \in [1, n]$) in the memory, but can store the approximate HP set $H(v_x)$ for each node $v_x$ on the disk. To process a single-pair SimRank query on two nodes $v_i$ and $v_j$, we retrieve $H(v_i)$ and $H(v_j)$ from the disk and combine them with the correction factors $\tilde{d}_k$ to derive the query result, which incurs a constant I/O cost, since $H(v_i)$ and $H(v_j)$ take only $O(1/\varepsilon)$ space. In addition, the index construction process of SLING does not require maintaining all HP sets $H(v_x)$ simultaneously in the memory. Specifically, in Algorithm 2, we can construct each "reverse" HP set $R_k$ in turn and write them to the disk; after that, we can construct all approximate HP sets $H(v_x)$ in a batch, by using an external sorting algorithm to sort all HPs $\tilde{h}^{(\ell)}(v_x, v_k)$ by $v_x$. This process requires only $O\left(\frac{n}{\varepsilon} \log \frac{n}{\varepsilon}\right)$ I/O accesses, since the total size of all $H(v_x)$ is $O(n/\varepsilon)$.

6. EXTENSION TO SINGLE-SOURCE QUERIES

Given the SLING index introduced in Section 4, we can easily answer any single-source SimRank query from a node $v_i$, by invoking Algorithm 3 $n$ times to compute $\tilde{s}(v_i, v_j)$ for each node $v_j$. This leads to a total query cost of $O(n/\varepsilon)$, which is near-optimal since any single-source SimRank method requires $\Omega(n)$ time to output the results. This straightforward algorithm, however, can be improved in terms of practical efficiency. To explain this, let us consider two nodes $v_i$ and $v_j$, such that $H(v_i)$ and $H(v_j)$ do not contain any HPs to the same node at the same step, i.e.,

$$\nexists \, v_k, \ell : \; \tilde{h}^{(\ell)}(v_i, v_k) \in H(v_i) \wedge \tilde{h}^{(\ell)}(v_j, v_k) \in H(v_j).$$

Then, SLING would return $\tilde{s}(v_i, v_j) = 0$. We say that $H(v_i)$ and $H(v_j)$ do not intersect in this case. Intuitively, if we can avoid accessing those HP sets $H(v_j)$ that do not intersect with $H(v_i)$, then we can improve the efficiency of the single-source SimRank query from $v_i$. For this purpose, a straightforward approach is to maintain, for each combination of $v_k$ and $\ell$, an inverted list $L(v_k, \ell)$ that records the approximate HPs $\tilde{h}^{(\ell)}(v_x, v_k)$ from any node $v_x$ to $v_k$. Then, to process a single-source SimRank query from node $v_i$, we first examine each approximate HP $\tilde{h}^{(\ell)}(v_i, v_k) \in H(v_i)$ and retrieve $L(v_k, \ell)$, based on which we compute $\tilde{s}(v_i, v_j)$ for any node $v_j$ with $\tilde{s}(v_i, v_j) > 0$.

Although the inverted list approach improves efficiency for single-source SimRank queries, it doubles the space consumption of SLING, since the inverted lists have the same total size as the approximate HP sets $H(v_i)$. Furthermore, the approach cannot be combined with the space reduction technique in Section 5.2, because the former requires storing all approximate HPs in the inverted lists, whereas the latter aims to omit certain HPs to save space. To address this issue, we propose a single-source SimRank algorithm for SLING that finds a middle ground between the inverted list approach and the straightforward approach. The basic idea is that, given node $v_i$, we first retrieve all approximate HPs $\tilde{h}^{(\ell)}(v_i, v_k) \in H(v_i)$, and then apply a variant of Algorithm 2 to compute the HPs from other nodes to each $v_k$; after that, we combine all HPs obtained to derive the query results. In other words, we construct the inverted lists relevant for the single-source query on the fly, instead of pre-computing them in advance.

Algorithm 6 shows the details of our method. It takes as input a query node $v_i$ and the threshold $\theta$ used in constructing $H(v_i)$ (see Algorithm 2), and returns an approximate SimRank score $\tilde{s}(v_i, v_j)$ for each node $v_j$. The algorithm starts by initializing $\tilde{s}(v_i, v_j) = 0$ for all $v_j$ (Line 1). Then, it identifies the steps $\ell$ such that there is at least one step-$\ell$ approximate HP in $H(v_i)$; after that, it processes each of those steps in turn (Lines 2-13). The general idea of processing is as follows. By Equation (13), if $v_i$ has a positive HP to a node $v_k$ at step $\ell$, then for any other node $v_j$ with a positive HP to $v_k$ at step $\ell$, we have $s(v_i, v_j) > 0$. To identify such nodes $v_j$ and their SimRank scores with $v_i$, we can apply the local update approach in Algorithm 2 to traverse $\ell$ steps from $v_k$; however, the local update procedure needs to be slightly modified to deal with the fact that we may need to traverse from multiple $v_k$ simultaneously, i.e., when $v_i$ has positive HPs to multiple nodes at step $\ell$.

Algorithm 6: An algorithm for single-source SimRank queries
Input: query node $v_i$ and threshold $\theta$
Output: an approximate SimRank score $\tilde{s}(v_i, v_j)$ for each node $v_j$
1   Initialize $\tilde{s}(v_i, v_j) = 0$ for all $v_j$;
2   for each $\ell$ such that $H(v_i)$ contains some approximate HP at step $\ell$ do
3       for each node $v_k$ such that $\tilde{h}^{(\ell)}(v_i, v_k) \in H(v_i)$ do
4           Initialize $\rho^{(0)}(v_k) = \tilde{h}^{(\ell)}(v_i, v_k) \cdot \tilde{d}_k$;
5       for $t = 1$ to $\ell$ do
6           for each node $v_x$ such that $\rho^{(t-1)}(v_x) > (\sqrt{c})^{\ell} \cdot \theta$ do
7               for each out-neighbor $v_y$ of $v_x$ do
8                   if $\rho^{(t)}(v_y)$ does not exist then
9                       $\rho^{(t)}(v_y) = \frac{\sqrt{c}}{|I(v_y)|} \cdot \rho^{(t-1)}(v_x)$;
10                  else
11                      $\rho^{(t)}(v_y) = \rho^{(t)}(v_y) + \frac{\sqrt{c}}{|I(v_y)|} \cdot \rho^{(t-1)}(v_x)$;
12      for each $v_j$ such that $\rho^{(\ell)}(v_j) > 0$ do
13          $\tilde{s}(v_i, v_j) = \tilde{s}(v_i, v_j) + \rho^{(\ell)}(v_j)$;
14  return $\tilde{s}(v_i, v_j)$ for each node $v_j$;

Specifically, for each particular $\ell$, Algorithm 6 first identifies each node $v_k$ such that $\tilde{h}^{(\ell)}(v_i, v_k) \in H(v_i)$, and initializes a temporary score $\rho^{(0)}(v_k) = \tilde{h}^{(\ell)}(v_i, v_k) \cdot \tilde{d}_k$ for $v_k$ (Lines 3-4). After that, it traverses $\ell$ steps from all $v_k$ simultaneously (Lines 5-11). In the $t$-th step ($t \in [1, \ell]$), it inspects the temporary scores created in the $(t-1)$-th step, and omits those scores that are no larger than $(\sqrt{c})^{\ell} \cdot \theta$ (Line 6). This omission is similar to the pruning of HPs applied in Algorithm 2, except that the threshold used here is $(\sqrt{c})^{\ell}$ times smaller than the threshold $\theta$ used in Algorithm 2. The reason is that the local update procedure in Algorithm 2 starts from a node whose approximate HP equals 1, whereas the procedure in Algorithm 6 begins from a node whose temporary score $\rho^{(0)}(v_k) \le (\sqrt{c})^{\ell}$, due to which we need to scale down the threshold to ensure accuracy.

For each temporary score $\rho^{(t-1)}(v_x)$ that is above the threshold, Algorithm 6 examines each out-neighbor $v_y$ of $v_x$, and checks whether the temporary score of $v_y$ at step $t$ (denoted as $\rho^{(t)}(v_y)$) exists. If it does not exist, then the algorithm initializes it as $\rho^{(t)}(v_y) = \frac{\sqrt{c}}{|I(v_y)|} \cdot \rho^{(t-1)}(v_x)$; otherwise, the algorithm increases it by $\frac{\sqrt{c}}{|I(v_y)|} \cdot \rho^{(t-1)}(v_x)$ (Lines 7-11). (Observe that this update rule is identical to that in Algorithm 2.) Finally, after the $\ell$-step traversal is finished, the algorithm adds each temporary score $\rho^{(\ell)}(v_j)$ at step $\ell$ into $\tilde{s}(v_i, v_j)$, and then proceeds to consider the next $\ell$ (Lines 12-13). Once all steps $\ell$ are processed, the algorithm returns each $\tilde{s}(v_i, v_j)$ as the final result (Line 14).

We have the following lemma regarding the theoretical guarantees of Algorithm 6.

LEMMA 12. Algorithm 6 runs in $O\left(m \log^2 \frac{1}{\varepsilon}\right)$ time, and ensures that each SimRank score returned has $\varepsilon$ worst-case error.

The time complexity of Algorithm 6 is not as attractive as those of the inverted list approach and the straightforward approach, but is roughly comparable to the latter when $m = O(n/\varepsilon)$ (as is often the case in practice). In addition, we note that the time complexity of Algorithm 6 matches that of the most recent method for single-source SimRank queries [24], even though the latter relies on heuristic assumptions that do not hold in general (see Section 3.3).
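The per-step traversal of Algorithm 6 can be sketched in Python as follows. As with the earlier sketches, the data layout is an assumption made for illustration: `H[v]` maps `(step, target)` pairs to approximate HPs, `d` maps nodes to approximate correction factors, and `in_nbrs` / `out_nbrs` are adjacency dicts. None of these names come from the paper.

```python
import math
from collections import defaultdict


def single_source_simrank(H, d, in_nbrs, out_nbrs, vi, c, theta):
    """Approximate s(vi, vj) for every vj with a positive score (Algorithm 6)."""
    sqrt_c = math.sqrt(c)
    scores = defaultdict(float)
    # Group the HPs of vi by step, scaled by the correction factors (Lines 3-4).
    by_level = defaultdict(dict)
    for (level, vk), hp in H.get(vi, {}).items():
        by_level[level][vk] = hp * d[vk]
    for level, rho in by_level.items():
        cutoff = sqrt_c ** level * theta  # scaled-down pruning threshold (Line 6)
        for _ in range(level):
            nxt = defaultdict(float)
            for vx, score in rho.items():
                if score <= cutoff:
                    continue
                for vy in out_nbrs.get(vx, ()):
                    nxt[vy] += sqrt_c / len(in_nbrs[vy]) * score
            rho = nxt
        for vj, score in rho.items():
            if score > 0:
                scores[vj] += score
    return dict(scores)


# Toy graph with edges a -> b and a -> x, using c = 0.64.
in_nbrs = {"a": [], "b": ["a"], "x": ["a"]}
out_nbrs = {"a": ["b", "x"], "b": [], "x": []}
H = {
    "a": {(0, "a"): 1.0},
    "b": {(0, "b"): 1.0, (1, "a"): 0.8},
    "x": {(0, "x"): 1.0, (1, "a"): 0.8},
}
d = {"a": 1.0, "b": 0.36, "x": 0.36}
result = single_source_simrank(H, d, in_nbrs, out_nbrs, "b", 0.64, 0.1)
```

In this toy example, the query from `b` yields a score of about 0.64 for `x` and about 1 for `b` itself, matching what the single-pair formula in Equation (17) gives on the same index.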


Figure 1: Average query costs for single-pair SimRank queries. (Legend: MC, SLING, Linearize. Y-axis: running time (ms), log scale from 10^-2 to 10^5. X-axis: datasets GrQc, AS, Wiki-Vote, HepTh, Enron, Slashdot, EuAll, NotreDame, Google, In-2004, LiveJournal, Indochina.)

Figure 2: Average query costs for single-source SimRank queries. (Legend: MC, SLING (Alg. 6), Linearize, SLING (Alg. 3). Y-axis: running time (ms), log scale from 10^-2 to 10^5. X-axis: datasets GrQc, AS, Wiki-Vote, HepTh, Enron, Slashdot, EuAll, NotreDame, Google, In-2004, LiveJournal, Indochina.)

Table 3: Datasets.

Dataset       Type          n            m
GrQc          undirected    5,242        14,496
AS            undirected    6,474        13,895
Wiki-Vote     directed      7,155        103,689
HepTh         undirected    9,877        25,998
Enron         undirected    36,692       183,831
Slashdot      directed      77,360       905,468
EuAll         directed      265,214      400,045
NotreDame     directed      325,728      1,497,134
Google        directed      875,713      5,105,049
In-2004       directed      1,382,908    17,917,053
LiveJournal   directed      4,847,571    68,993,773
Indochina     directed      7,414,866    194,109,311

7. EXPERIMENTS

This section experimentally evaluates SLING. Section 7.1 clarifies the experimental settings, and Section 7.2 presents the experimental results.

7.1 Experimental Settings

Datasets and Environment. We use twelve graph datasets that are publicly available from [1, 2] and are commonly used in the literature. Table 3 shows the statistics of each graph. We conduct all of our experiments on a Linux machine with a 2.6GHz CPU and 64GB memory. All methods tested are implemented in C++. (Our code is available at [3].)

Methods and Parameters. We compare SLING against two state-of-the-art methods for SimRank computation: the linearization method [24, 25] (referred to as Linearize) and the Monte Carlo method [8] (referred to as MC). Linearize has three parameters $T$, $R$, and $L$. Following the recommendations in [24], we set $T = 11$, $R = 100$, and $L = 3$. In addition, we set the decay factor $c$ in the SimRank model to 0.6, as suggested in previous work [23,24,30-32]. Under this setting, Linearize ensures a worst-case error $\varepsilon = c^T/(1 - c) \approx 0.01$ in each SimRank score, if it is able to derive an exact diagonal correction matrix $D$. However, as we discuss in Section 3.3, Linearize utilizes an approximate version of $D$ that provides no quality assurance, due to which the above error bound does not hold.

For SLING, we set its maximum error $\varepsilon = 0.025$, which is roughly comparable to the quality assurance of the linearization method given a precise $D$. Towards this end, we set $\varepsilon_d = 0.005$ and $\theta = 0.000725$, which ensures $\varepsilon < 0.025$ by Theorem 1. In addition, we set $\delta_d = 1/n^2$, which guarantees that the preprocessing algorithm of SLING succeeds with at least $1 - 1/n$ probability. For MC, we set $\varepsilon = 0.025$, as in SLING.
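The parameter choice above can be checked numerically against the condition of Theorem 1. The short sketch below (the function name is our own; the formula is the left-hand side of the inequality in Theorem 1) evaluates the bound for $c = 0.6$, $\varepsilon_d = 0.005$, and $\theta = 0.000725$.

```python
import math


def simrank_error_bound(c, eps_d, theta):
    """Left-hand side of the condition in Theorem 1."""
    sqrt_c = math.sqrt(c)
    return eps_d / (1 - c) + 2 * sqrt_c / ((1 - sqrt_c) * (1 - c)) * theta


bound = simrank_error_bound(0.6, 0.005, 0.000725)
```

The first term contributes $0.005/0.4 = 0.0125$ and the second roughly $0.01246$, so `bound` comes out just under 0.025, consistent with the claim that these settings ensure $\varepsilon < 0.025$.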

7.2 Experimental Results

In the first set of experiments, we randomly generate 1000 single-pair SimRank queries on each dataset, and evaluate the average computation time of each method in answering the queries. Figure 1 shows the results. We omit MC on all but the four smallest datasets, since its index size exceeds 64GB on the large graphs. Observe that the query time of SLING is at most 2.2ms in all cases, and is often several orders of magnitude smaller than that of Linearize. In particular, on LiveJournal, SLING is around 10000 times faster than Linearize. This is consistent with the fact that SLING and Linearize have $O(1/\varepsilon)$ and $O\left(m \log \frac{1}{\varepsilon}\right)$ query time complexities, respectively. Meanwhile, Linearize incurs a smaller query cost than MC on the four smallest datasets, which is also observed in previous work [24].

Our second set of experiments evaluates the average computation cost of each method in answering 500 random single-source SimRank queries. For SLING, we consider two different methods: one that directly uses Algorithm 6, and another one that invokes Algorithm 3 once for each node. Figure 2 illustrates the results. Notice that the method that applies Algorithm 3 is significantly slower than Algorithm 6, even though the former (resp. latter) runs in $O(n/\varepsilon)$ time (resp. $O\left(m \log^2 \frac{1}{\varepsilon}\right)$ time). This is in accordance with our analysis in Section 6, which shows that adopting Algorithm 3 for single-source queries would incur unnecessary overheads and lead to inferior query time. Since the method that employs Algorithm 3 is not competitive, we omit it on all but the four smallest datasets.

Among all methods for single-source SimRank queries, SLING (with Algorithm 6) achieves the best performance, but its improvement over Linearize is less pronounced than in the case of single-pair queries. This, as we mention in Section 6, is because the local update procedure in Algorithm 6 incurs super-linear overheads, due to which the algorithm's time complexity is the same as Linearize's. Nonetheless, SLING is still at least 9 times faster than Linearize on 7 out of the 12 datasets, and is 110 times more efficient on Slashdot. Meanwhile, MC is consistently outperformed by Linearize.

[Figure 3: Preprocessing cost of each method. x-axis: datasets (GrQc, AS, Wiki-Vote, HepTh, Enron, Slashdot, EuAll, NotreDame, Google, In-2004, LiveJournal, Indochina); y-axis: preprocessing time (seconds), log scale; methods: SLING, Linearize, MC.]

[Figure 4: Space consumption of each method. x-axis: the same 12 datasets; y-axis: space (MB), log scale; methods: SLING, Linearize, MC.]

[Figure 5: Maximum SimRank error of each method measured in 10 different runs. Panels: (a) GrQc, (b) AS, (c) Wiki-Vote, (d) HepTh; x-axis: run (1 to 10); y-axis: maximum error, log scale; methods: SLING, Linearize, MC.]

Next, we plot the preprocessing cost (resp. space consumption) of each method in Figure 3 (resp. Figure 4). Linearize incurs a smaller pre-computation cost than SLING does; in turn, SLING is more efficient than MC in terms of pre-computation. The index size of SLING is considerably larger than that of Linearize, since SLING has an $O(n/\varepsilon)$ space complexity, while Linearize only incurs $O(n + m)$ space overhead. Nevertheless, SLING outperforms MC in terms of space efficiency. Overall, SLING is inferior to Linearize in terms of space overheads and preprocessing costs, but this is justified by the fact that SLING offers superior query efficiency and a rigorous accuracy guarantee, whereas Linearize incurs significantly larger query costs and does not offer non-trivial bounds on its query errors. Furthermore, the pre-computation algorithm of SLING can be easily parallelized, as we discuss in Section 5.4 and demonstrate in Appendix C.

Our last three experiments focus on the query accuracy of each method. We first apply the power method (see Section 3.1) on each of the four smallest graphs to compute the SimRank score of each node pair, setting the number of iterations in the method to 50 (which results in a worst-case error below $10^{-11}$). We take the SimRank scores thus obtained as the ground truth, and use them to gauge the error of each method in computing all-pair SimRank scores. We do not repeat this experiment on larger graphs, due to the tremendous overheads in computing all-pair SimRank results.
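As a rough illustration of how such ground-truth scores can be obtained, the sketch below iterates Equation (1) from the identity matrix. It is a minimal version under stated assumptions: the graph is given as in-neighbor lists, a pair involving a node without in-neighbors gets score 0, and the helper name and toy graph are hypothetical. It is not the implementation used in the experiments.

```python
from itertools import product


def simrank_power_method(in_nbrs, c=0.6, iterations=50):
    """Iterate Equation (1), starting from the identity matrix.

    in_nbrs[v] lists the in-neighbors of node v (nodes are 0..n-1).
    """
    n = len(in_nbrs)
    s = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(iterations):
        nxt = [[0.0] * n for _ in range(n)]
        for a, b in product(range(n), repeat=2):
            if a == b:
                nxt[a][b] = 1.0  # a node is maximally similar to itself
            elif in_nbrs[a] and in_nbrs[b]:
                total = sum(s[x][y] for x in in_nbrs[a] for y in in_nbrs[b])
                nxt[a][b] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        s = nxt
    return s


# Toy graph with edges 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
in_nbrs = [[], [0], [0], [1, 2]]
scores = simrank_power_method(in_nbrs)
print(scores[1][2])  # nodes 1 and 2 share their only in-neighbor
```

Each iteration costs $O(n^2)$ score updates, each summing over a pair of in-neighbor lists, which is why this approach is only practical on the smallest graphs.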

Figure 5 illustrates the maximum query error incurred by each method in all-pair SimRank computation over 10 different runs, where each run rebuilds the index of each method from scratch. Observe that the maximum error of SLING is always below 0.0025, which is considerably smaller than the stipulated error bound $\varepsilon = 0.025$. MC's maximum error is also below $\varepsilon = 0.025$, but is consistently larger than that of SLING, and is over 0.01 on Wiki-Vote. In contrast, the maximum error of Linearize is above 0.025 in most runs on GrQc, AS, and HepTh, which is consistent with our analysis that Linearize does not offer any worst-case guarantee in terms of query accuracy.

To further assess each method's query accuracy, we divide the ground-truth SimRank scores into three groups $S_1$, $S_2$, and $S_3$, such that $S_1$ (resp. $S_2$) contains SimRank scores in the range of $[0.1, 1]$ (resp. $[0.01, 0.1]$), while $S_3$ concerns SimRank scores smaller than $0.01$. Intuitively, the scores in $S_1$ and $S_2$ are more important than those in $S_3$, since the former correspond to node pairs that are highly similar. Figure 6 shows the average query errors of each method for $S_1$, $S_2$, and $S_3$. Observe that, compared with Linearize, SLING incurs much smaller (resp. slightly smaller) errors on $S_1$ (resp. $S_2$). This indicates that SLING is more effective than Linearize in measuring the similarity of important node pairs. Meanwhile, MC is less accurate than SLING on $S_1$, and is considerably outperformed by both SLING and Linearize on Wiki-Vote.

[Figure 6: Average SimRank error vs. SimRank group. Panels: (a) GrQc, (b) AS, (c) Wiki-Vote, (d) HepTh; x-axis: SimRank group (S1, S2, S3); y-axis: average error, log scale; methods: SLING, Linearize, MC.]

[Figure 7: Precision of the top-k SimRank pairs returned by each method. Panels: (a) GrQc, (b) AS, (c) Wiki-Vote, (d) HepTh; x-axis: k (400 to 2000); y-axis: precision (0.95 to 1); methods: SLING, Linearize, MC.]

Finally, we use the all-pair SimRank scores computed by each method to identify the $k$ node pairs with the highest SimRank scores¹, and we measure the precision of those $k$ pairs, i.e., the fraction of them among the ground-truth top-$k$ pairs. Figure 7 illustrates the results when $k$ varies from 400 to 2000. The precision of SLING is never worse than that of Linearize, and is up to 4% higher than the latter in many cases. This is consistent with our results in Figure 6 that, for node pairs with large SimRank scores, SLING provides much higher accuracy than Linearize does. Meanwhile, MC yields lower accuracy than SLING does, and is significantly outperformed by both SLING and Linearize on Wiki-Vote. These results are also in agreement with those in Figure 6.
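The precision measure described above can be sketched as follows. This is an illustrative reading, not the authors' evaluation script: it assumes unordered pairs of distinct nodes, breaks ties by enumeration order, and uses hypothetical helper names and toy score matrices.

```python
def top_k_pairs(scores, k):
    """Return the k unordered pairs (i, j), i < j, with the largest scores."""
    n = len(scores)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda p: scores[p[0]][p[1]], reverse=True)
    return set(pairs[:k])


def precision_at_k(approx, truth, k):
    """Fraction of the approximate top-k pairs that are ground-truth top-k pairs."""
    return len(top_k_pairs(approx, k) & top_k_pairs(truth, k)) / k


# Toy 3-node example: the approximation ranks pair (1, 2) above pair (0, 2).
truth = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.1], [0.3, 0.1, 1.0]]
approx = [[1.0, 0.45, 0.05], [0.45, 1.0, 0.2], [0.05, 0.2, 1.0]]
print(precision_at_k(approx, truth, 1), precision_at_k(approx, truth, 2))
```

In the toy example the top pair agrees, but only one of the two top-2 pairs does, so the precision drops from 1.0 at k = 1 to 0.5 at k = 2.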

8. OTHER RELATED WORK

The previous sections have discussed the existing techniques that are most relevant to ours. In what follows, we survey other related work on SimRank computation. First, there is a line of research [10, 13, 19, 29–31] on SimRank queries based on the following formulation of SimRank:

$$S = cP^{\top}SP + (1 - c)I,$$

where $S$, $P$, and $I$ are $n \times n$ matrices such that $S(i, j) = s(v_i, v_j)$ for any $i, j$, $P$ is as defined in Equation 5, and $I$ is an identity matrix. However, as pointed out by Kusumoto et al. [17], the above formulation is incorrect since it assumes that $(1 - c)I$ equals the diagonal correction matrix $D$ (see Equation 7), which does not hold in general. As a consequence, the methods in [10, 13, 19, 29–31] fail to offer any guarantees in terms of the accuracy of SimRank scores, due to which we do not consider them in this paper.
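A small worked case illustrates why $(1 - c)I$ cannot be the diagonal correction matrix in general. The derivation below is a sketch that assumes $S = cP^{\top}SP + D$ with $P(a, k) = 1/|I(v_k)|$ whenever $v_a \in I(v_k)$ (and $0$ otherwise), which is how $P$ is used in the proof of Lemma 5. Since $S(k, k) = 1$, the $k$-th diagonal entry of $D$ is

$$D(k, k) = 1 - \frac{c}{|I(v_k)|^2} \sum_{a, b \in I(v_k)} s(a, b).$$

If $v_k$ has exactly one in-neighbor, the sum equals $1$ and $D(k, k) = 1 - c$. If $v_k$ has two distinct in-neighbors $a$ and $b$, then

$$D(k, k) = 1 - \frac{c}{4}\bigl(s(a, a) + s(b, b) + 2\,s(a, b)\bigr) = 1 - \frac{c}{2}\bigl(1 + s(a, b)\bigr),$$

which equals $1 - c$ only if $s(a, b) = 1$. By Equation (1), $s(a, b) \le c < 1$ for $a \ne b$, so $D(k, k) > 1 - c$ for such a node.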

Second, several variants [5, 8, 22, 32, 34] of SimRank have been proposed to enhance the quality of similarity measure and mitigate certain limitations of SimRank. Antonellis et al. [5] present SimRank++, which extends SimRank by taking into account the weights of edges and prior knowledge of node similarities. Jin et al. [16] introduce RoleSim, which guarantees to recognize automorphically or structurally equivalent nodes. Fogaras and Rácz [8] propose PSimRank, which improves the quality of SimRank by allowing random walks that are close to each other to have a higher probability to meet. Yu and McCann [32] present SimRank#, which defines the similarity between two nodes based on the cosine similarity of their neighbors. Zhao et al. [34] introduce P-Rank, which considers both in-neighbors and out-neighbors of two nodes when measuring their similarity.

¹ Note that we ignore any node pair containing two nodes that are identical.

Finally, there is existing work [10, 17, 18, 25, 28, 35] that studies top-$k$ SimRank queries and SimRank similarity joins. In particular, a top-$k$ SimRank query takes as input a node $v_i$, and asks for the $k$ nodes $v_j$ with the largest SimRank scores $s(v_i, v_j)$. Meanwhile, a SimRank similarity join asks for all pairs of nodes whose SimRank scores are among the largest $k$, or are larger than a predefined threshold. Techniques designed for these two types of queries are generally inapplicable to single-pair and single-source SimRank queries.

9. CONCLUSIONS

This paper presents the SLING index for answering single-pair and single-source SimRank queries with $\varepsilon$ worst-case error in each SimRank score. SLING requires $O(n/\varepsilon)$ space and $O(m/\varepsilon + n \log \frac{n}{\delta} / \varepsilon^2)$ pre-computation time, and it handles any single-pair (resp. single-source) query in $O(1/\varepsilon)$ (resp. $O(n/\varepsilon)$) time. The space and query time complexities of SLING are near-optimal, and are significantly better than those of the existing solutions. In addition, SLING incorporates several optimization techniques that considerably improve its practical performance. Our experiments show that SLING provides superior query efficiency against the state of the art. For future work, we plan to (i) investigate techniques to reduce the index size of SLING, and (ii) extend SLING to handle other similarity measures for graphs.


10. REFERENCES

[1] http://snap.stanford.edu/data/index.html.
[2] http://law.di.unimi.it/datasets.php.
[3] https://sourceforge.net/projects/slingsimrank/.
[4] R. Andersen, F. R. K. Chung, and K. J. Lang. Local graph partitioning using pagerank vectors. In FOCS, pages 475–486, 2006.
[5] I. Antonellis, H. G. Molina, and C. C. Chang. Simrank++: query rewriting through link analysis of the click graph. PVLDB, 1(1):408–421, 2008.
[6] F. R. K. Chung and L. Lu. Concentration inequalities and martingale inequalities: A survey. Internet Mathematics, 3(1):79–127, 2006.
[7] P. Dagum, R. M. Karp, M. Luby, and S. M. Ross. An optimal algorithm for monte carlo estimation. SIAM J. Comput., 29(5):1484–1496, 2000.
[8] D. Fogaras and B. Rácz. Scaling link-based similarity search. In WWW, pages 641–650, 2005.
[9] D. Fogaras, B. Rácz, K. Csalogány, and T. Sarlós. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3):333–358, 2005.
[10] Y. Fujiwara, M. Nakatsuji, H. Shiokawa, and M. Onizuka. Efficient search algorithm for simrank. In ICDE, pages 589–600, 2013.
[11] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3 edition, 2012.
[12] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. WTF: the who to follow service at twitter. In WWW, pages 505–514, 2013.
[13] G. He, H. Feng, C. Li, and H. Chen. Parallel simrank computation on large graphs with iterative aggregation. In KDD, pages 543–552, 2010.
[14] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In SIGKDD, pages 538–543, 2002.
[15] G. Jeh and J. Widom. Scaling personalized web search. In WWW, pages 271–279, 2003.
[16] R. Jin, V. E. Lee, and H. Hong. Axiomatic ranking of network role similarity. In KDD, pages 922–930, 2011.
[17] M. Kusumoto, T. Maehara, and K. Kawarabayashi. Scalable similarity search for simrank. In SIGMOD, pages 325–336, 2014.
[18] P. Lee, L. V. S. Lakshmanan, and J. X. Yu. On top-k structural similarity search. In ICDE, pages 774–785, 2012.
[19] C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, and T. Wu. Fast computation of simrank for static and dynamic information networks. In EDBT, pages 465–476, 2010.
[20] P. Li, H. Liu, J. X. Yu, J. He, and X. Du. Fast single-pair simrank computation. In SDM, pages 571–582, 2010.
[21] D. Liben-Nowell and J. M. Kleinberg. The link-prediction problem for social networks. JASIST, 58(7):1019–1031, 2007.
[22] Z. Lin, M. R. Lyu, and I. King. Matchsim: a novel similarity measure based on maximum neighborhood matching. KAIS, 32(1):141–166, 2012.
[23] D. Lizorkin, P. Velikhov, M. N. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for simrank computation. VLDB J., 19(1):45–66, 2010.
[24] T. Maehara, M. Kusumoto, and K. Kawarabayashi. Efficient simrank computation via linearization. CoRR, abs/1411.7228, 2014.
[25] T. Maehara, M. Kusumoto, and K. Kawarabayashi. Scalable simrank join algorithm. In ICDE, pages 603–614, 2015.
[26] S. Rothe and H. Schütze. Cosimrank: A flexible & efficient graph-theoretic similarity measure. In ACL, pages 1392–1402, 2014.
[27] N. Spirin and J. Han. Survey on web spam detection: principles and algorithms. SIGKDD Explorations, 13(2):50–64, 2011.
[28] W. Tao, M. Yu, and G. Li. Efficient top-k simrank-based similarity join. PVLDB, 8(3):317–328, 2014.
[29] W. Yu, X. Lin, and W. Zhang. Fast incremental simrank on link-evolving graphs. In ICDE, pages 304–315, 2014.
[30] W. Yu, X. Lin, W. Zhang, L. Chang, and J. Pei. More is simpler: Effectively and efficiently assessing node-pair similarities based on hyperlinks. PVLDB, 7(1):13–24, 2013.
[31] W. Yu and J. A. McCann. Efficient partial-pairs simrank search for large networks. PVLDB, 8(5):569–580, 2015.
[32] W. Yu and J. A. McCann. High quality graph-based similarity search. In SIGIR, pages 83–92, 2015.

[Figure 8: An adversarial case for the linearization method. The figure shows a graph on the four nodes v1, v2, v3, and v4.]

[33] W. Yu, W. Zhang, X. Lin, Q. Zhang, and J. Le. A space and time efficient algorithm for simrank computation. World Wide Web, 15(3):327–353, 2012.
[34] P. Zhao, J. Han, and Y. Sun. P-rank: a comprehensive structural similarity measure over information networks. In CIKM, pages 553–562, 2009.
[35] W. Zheng, L. Zou, Y. Feng, L. Chen, and D. Zhao. Efficient simrank-based similarity join over large graphs. PVLDB, 6(7):493–504, 2013.

APPENDIX

A. LIMITATIONS OF THE LINEARIZATION METHOD

Recall that the linearization method [24] requires pre-computing the diagonal correction matrix $D$. Maehara et al. [24] prove that the diagonal elements in $D$ satisfy the following linear system:

$$\text{for all } k \in [1, n], \quad \sum_{\ell=0}^{\infty} \sum_{i=1}^{n} c^{\ell} \left(p^{(\ell)}_{k,i}\right)^2 D(i, i) = 1, \qquad (18)$$

where $p^{(\ell)}_{k,i}$ is the probability that $v_i$ is the $\ell$-th step of a reverse random walk from $v_k$. Based on this, the linearization method estimates $p^{(\ell)}_{k,i}$ with a set of reverse random walks, and then incorporates the estimated values into a truncated version of Equation (18):

$$\text{for all } k \in [1, n], \quad \sum_{\ell=0}^{t} \sum_{i=1}^{n} c^{\ell} \left(\tilde{p}^{(\ell)}_{k,i}\right)^2 \tilde{D}(i, i) = 1, \qquad (19)$$

where $\tilde{p}^{(\ell)}_{k,i}$ denotes the estimated version of $p^{(\ell)}_{k,i}$. After that, it applies the Gauss-Seidel technique [11] to solve Equation (19), and obtains an $n \times n$ diagonal matrix $\tilde{D}$ that approximates $D$.

The above approach for deriving the approximate matrix $\tilde{D}$ is interesting, but it fails to provide any worst-case guarantee in terms of the pre-computation time and the accuracy of SimRank queries, for the following reasons. First, because of the sampling error in the estimates $\tilde{p}^{(\ell)}_{k,i}$ and the truncation applied in Equation (19), $\tilde{D}$ could differ considerably from $D$, which may in turn lead to significant errors in SimRank computation. There is no formal result on how large the error in $\tilde{D}$ could be. Instead, Maehara et al. [24] only show that the error in $\tilde{p}^{(\ell)}_{k,i}$ can be bounded by using a sufficiently large sample set of reverse random walks; however, this does not translate into any accuracy guarantee on $\tilde{D}$.

Second, even if $\tilde{p}^{(\ell)}_{k,i} = p^{(\ell)}_{k,i}$ and Equation (19) is not truncated, the Gauss-Seidel technique [11] used by the linearization method to solve Equation (19) may not converge. In particular, if we define an $n \times n$ matrix $M$ as

$$M(k, i) = \sum_{\ell=0}^{\infty} c^{\ell} \left(p^{(\ell)}_{k,i}\right)^2,$$

then the linearization method requires that $M$ be diagonally dominant, i.e., for any $i$, $|M(i, i)| \ge \sum_{j \ne i} |M(i, j)|$. However, this requirement is not always satisfied. For example, consider the graph in Figure 8. The linear system corresponding to the graph is

$$\frac{1}{1 - c^4}
\begin{pmatrix}
1 & c & c^2 & c^3 \\
c^3 & 1 & c & c^2 \\
c^2 & c^3 & 1 & c \\
c & c^2 & c^3 & 1
\end{pmatrix}
\begin{pmatrix}
D(1, 1) \\ D(2, 2) \\ D(3, 3) \\ D(4, 4)
\end{pmatrix}
=
\begin{pmatrix}
1 \\ 1 \\ 1 \\ 1
\end{pmatrix}.$$

It can be verified that the $4 \times 4$ matrix $M$ on the left-hand side is not diagonally dominant when $c = 0.6$: up to the common factor $1/(1 - c^4)$, each diagonal entry equals $1$, whereas the off-diagonal entries in each row sum to $c + c^2 + c^3 = 1.176$.

[Figure 9: Preprocessing time vs. number of threads. Panels: (a) Google, (b) In-2004, (c) LiveJournal, (d) Indochina; x-axis: number of threads (1, 2, 4, 8, 16); y-axis: preprocessing time (s).]

[Figure 10: Preprocessing time vs. memory buffer size. Panels: (a) Google, (b) In-2004, (c) LiveJournal, (d) Indochina; x-axis: memory buffer size (MB): 256, 512, 1024, 2048, all; y-axis: preprocessing time (s).]

Finally, the number of iterations required by the Gauss-Seidel method is $O(\log \varepsilon^* / \log \rho)$, where $\varepsilon^*$ is the maximum error allowed in the solution to the linear system, and $\rho$ is the spectral radius of the iteration matrix used by the method [11]. The value of $\rho$ depends on the input graph, and might be very close to $1$, in which case $\log \varepsilon^* / \log \rho$ can be an extremely large number.
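Both observations can be checked numerically. The sketch below rebuilds the $4 \times 4$ matrix of the linear system for the graph in Figure 8, measures how far each row is from diagonal dominance, and evaluates the quantity $\log \varepsilon^* / \log \rho$ for a few hypothetical spectral radii. The helper names and the sample values of $\varepsilon^*$ and $\rho$ are assumptions made for illustration only.

```python
import math


def figure8_matrix(c):
    """M = 1/(1 - c^4) * circulant(1, c, c^2, c^3), as in the 4x4 system."""
    scale = 1.0 / (1.0 - c ** 4)
    return [[scale * c ** ((j - i) % 4) for j in range(4)] for i in range(4)]


def off_diagonal_excess(matrix):
    """Per row: sum of |off-diagonal entries| minus |diagonal entry|.

    A positive value means the row violates diagonal dominance.
    """
    return [
        sum(abs(x) for j, x in enumerate(row) if j != i) - abs(row[i])
        for i, row in enumerate(matrix)
    ]


def iterations_needed(eps_star, rho):
    """Evaluate ceil(log(eps*) / log(rho)) for 0 < rho < 1."""
    return math.ceil(math.log(eps_star) / math.log(rho))


excess = off_diagonal_excess(figure8_matrix(0.6))
print(excess)  # every entry is positive: no row is diagonally dominant
for rho in (0.5, 0.9, 0.999):
    print(rho, iterations_needed(1e-6, rho))
```

The iteration estimate grows quickly as $\rho$ approaches 1, which is the behavior the paragraph above warns about.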

B. HITTING PROBABILITIES VS. PERSONALIZED PAGERANKS

Suppose that we start a random walk from a node $v_i$ following the outgoing edges of each node, with $1 - c_p$ probability to stop at each step. The probability that the walk stops at a node $v_j$ is referred to as the personalized PageRank (PPR) [15] from $v_i$ to $v_j$. PPR is well-adopted as a metric for measuring the relevance of nodes with respect to the input node $v_i$, and it has important applications in web search [15] and social network analysis [12].

Our notion of hitting probabilities (HP) bears similarity to PPR, but differs in the following aspects:

1. HP concerns the probability that the random walk reaches node $v_j$ at a particular step $\ell$, but disregards whether the random walk stops at $v_j$;

2. PPR only concerns the endpoint $v_j$ of the random walk, and disregards all nodes before it.

Our Algorithm 2 for computing approximate HPs is inspired by the local update algorithm [4, 9, 15] proposed for computing approximate PPRs. Specifically, given a node $v_j$ and an error bound $\varepsilon$, the local update algorithm returns an approximate version of the PPRs from other nodes to $v_j$, with $\varepsilon$ worst-case errors. The algorithm starts by assigning a residual $1$ to $v_j$, and $0$ to any other node. Subsequently, the algorithm iteratively propagates the residual of each node to its in-neighbors, during which it computes the approximate PPR from each node to $v_j$. When the largest residual in all nodes is smaller than $\varepsilon$, the algorithm terminates. This algorithm is similar in spirit to our Algorithm 2, but it cannot be directly applied in our context, due to the inherent differences between PPRs and HPs.
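The local update procedure described above can be sketched as follows. The text does not spell out the exact update rule, so this sketch assumes the commonly used backward-push formulation: a node with residual $r$ keeps $\alpha r$ as its PPR estimate and passes $(1 - \alpha) r / d_{out}(u)$ to each in-neighbor $u$, where $\alpha = 1 - c_p$ is the stopping probability. The function name, the queue-based scheduling, and the assumption that every node has at least one out-neighbor are choices made for this illustration.

```python
from collections import deque


def backward_local_update(out_nbrs, target, alpha, eps):
    """Approximate the PPR from every node to `target` by pushing residuals.

    out_nbrs[u] lists the out-neighbors of u; alpha is the stopping
    probability of the walk. Returns one estimate per source node.
    """
    n = len(out_nbrs)
    in_nbrs = [[] for _ in range(n)]
    for u, nbrs in enumerate(out_nbrs):
        for v in nbrs:
            in_nbrs[v].append(u)

    reserve = [0.0] * n           # approximate PPR values
    residual = [0.0] * n          # probability mass not yet accounted for
    residual[target] = 1.0
    queue = deque([target])
    while queue:
        v = queue.popleft()
        r = residual[v]
        if r < eps:               # stale queue entry, nothing to push
            continue
        residual[v] = 0.0
        reserve[v] += alpha * r
        for u in in_nbrs[v]:
            residual[u] += (1.0 - alpha) * r / len(out_nbrs[u])
            if residual[u] >= eps:
                queue.append(u)
    return reserve


# Toy graph with edges 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
estimates = backward_local_update([[1, 2], [2], [0]], target=2, alpha=0.2, eps=1e-4)
print(estimates)
```

Under these assumptions each estimate never exceeds the true PPR and falls short of it by less than `eps`, because the procedure stops only once every residual is below `eps`.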

C. ADDITIONAL EXPERIMENTS

In this section, we evaluate the parallel and out-of-core algorithms for constructing the index structures of SLING (presented in Section 5.4), using the four largest datasets in Table 3. First, we implement a multi-threaded version of SLING's pre-computation algorithm, and measure its running time when the number of threads varies from 1 to 16 and all 64GB of main memory on our machine is available. (The total number of CPU cores on our machine is 16.) Figure 9 illustrates the results. Observe that the algorithm achieves a near-linear speed-up as the number of threads increases, which is consistent with our analysis (in Section 5.4) that SLING's preprocessing algorithm is embarrassingly parallelizable.

Next, we implement an I/O-efficient version of SLING's preprocessing algorithm, based on our discussions in Section 5.4. Then, we measure the running time of the algorithm when it uses one CPU core along with a memory buffer of a pre-defined size. (We assume that the input graph is memory-resident, and we exclude it when calculating the memory buffer size.) Figure 10 shows the processing time of the algorithm as the buffer size varies. Observe that the algorithm can efficiently process all tested graphs even when the buffer size is as small as 256MB. In addition, the overhead of the algorithm does not increase significantly when the buffer size decreases, since the algorithm is CPU-bound. In particular, its only I/O cost is incurred by (i) writing each entry in the index once to the disk, and (ii) performing an external sort on the entries.


D. CONCENTRATION INEQUALITIES

This section introduces the concentration inequalities used in our proofs. We start from the classic Chernoff bound.

LEMMA 13 (CHERNOFF BOUND [6]). For any set $x_i$ ($i \in [1, n_x]$) of i.i.d. random variables with mean $\mu$ and $x_i \in [0, 1]$,

$$\Pr\left[\left|\sum_{i=1}^{n_x} x_i - n_x \mu\right| \ge n_x \varepsilon\right] \le \exp\left(-\frac{n_x \cdot \varepsilon^2}{\frac{2}{3}\varepsilon + 2\mu}\right).$$

Our proofs also use a concentration bound on martingales, as detailed in the following.

DEFINITION 1 (MARTINGALE). A sequence of random variables $y_1, y_2, y_3, \cdots$ is a martingale if and only if $\mathrm{E}[|y_i|] < +\infty$ and $\mathrm{E}[y_{i+1} \mid y_1, y_2, \cdots, y_i] = y_i$ for any $i$.

LEMMA 14 ([6]). Let $y_1, y_2, y_3, \cdots$ be a martingale, such that $|y_1| \le a$, $|y_{j+1} - y_j| \le a$ for any $j \in [1, i - 1]$, and

$$\mathrm{Var}[y_1] + \sum_{j=2}^{i} \mathrm{Var}[y_j \mid y_1, y_2, \cdots, y_{j-1}] \le b_i,$$

where $\mathrm{Var}[\cdot]$ denotes the variance of a random variable. Then, for any $\lambda > 0$,

$$\Pr\left[y_i - \mathrm{E}[y_i] \ge \lambda\right] \le \exp\left(-\frac{\lambda^2}{\frac{2}{3} a \lambda + 2 b_i}\right).$$

E. PROOFS

Proof of Lemma 3. Let $s'(v_i, v_j)$ be the probability that $W_i$ and $W_j$ meet. If $v_i = v_j$, then $s'(v_i, v_j) = 1$, since $W_i$ and $W_j$ always meet at the first step. Suppose that $v_i \ne v_j$. Then, $s'(v_i, v_j)$ is the probability that $W_i$ and $W_j$ meet at or after the second step. Assume without loss of generality that the second steps of $W_i$ and $W_j$ are $v_k$ and $v_\ell$, respectively. By definition, $s'(v_k, v_\ell)$ equals the probability that $W_i$ and $W_j$ meet at or after $v_k$ and $v_\ell$. Taking into account all possible second steps of $W_i$ and $W_j$, we have

$$\begin{aligned}
s'(v_i, v_j) &= \sum_{v_k \in I(v_i),\, v_\ell \in I(v_j)} \frac{\sqrt{c}}{|I(v_i)|} \cdot \frac{\sqrt{c}}{|I(v_j)|} \cdot s'(v_k, v_\ell) \\
&= \frac{c}{|I(v_i)| \cdot |I(v_j)|} \sum_{v_k \in I(v_i),\, v_\ell \in I(v_j)} s'(v_k, v_\ell).
\end{aligned}$$

As such, $s'(v_i, v_j)$ has the same definition as $s(v_i, v_j)$ (see Equation (1)), which indicates that $s'(v_i, v_j) = s(v_i, v_j)$.
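The meeting-probability interpretation can be illustrated with a small simulation. The sketch assumes that a $\sqrt{c}$-walk stops with probability $1 - \sqrt{c}$ at each step (or when the current node has no in-neighbor) and otherwise moves to a uniformly random in-neighbor, which matches the transition probability $\sqrt{c}/|I(v)|$ used in the proof; the function names, the toy graph, and the sample size are hypothetical.

```python
import math
import random


def sqrt_c_walk(in_nbrs, start, c, rng):
    """Yield the nodes visited by one sqrt(c)-walk, beginning with `start`."""
    node = start
    while True:
        yield node
        # Stop with probability 1 - sqrt(c), or when there is nowhere to go.
        if not in_nbrs[node] or rng.random() >= math.sqrt(c):
            return
        node = rng.choice(in_nbrs[node])


def estimate_simrank(in_nbrs, vi, vj, c=0.6, samples=20000, seed=0):
    """Fraction of sampled walk pairs that visit the same node at the same step."""
    rng = random.Random(seed)
    meets = 0
    for _ in range(samples):
        walk_i = sqrt_c_walk(in_nbrs, vi, c, rng)
        walk_j = sqrt_c_walk(in_nbrs, vj, c, rng)
        if any(a == b for a, b in zip(walk_i, walk_j)):
            meets += 1
    return meets / samples


# Toy graph with edges 0 -> 1 and 0 -> 2, so Equation (1) gives s(1, 2) = c.
in_nbrs = [[], [0], [0]]
print(estimate_simrank(in_nbrs, 1, 2))
```

On this toy graph the two walks meet exactly when both take their first step to node 0, which happens with probability $\sqrt{c} \cdot \sqrt{c} = c$, so the estimate should be close to 0.6.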

Proof of Lemma 4. First, we define the following events:

• $E(v_i, v_j)$: Two $\sqrt{c}$-walks starting from $v_i$ and $v_j$, respectively, meet each other.

• $L(v_i, v_j, v_k, \ell)$: Two $\sqrt{c}$-walks starting from $v_i$ and $v_j$, respectively, last meet each other at the $\ell$-th step at $v_k$.

As we discuss in Section 4.2, two different events $L(v_i, v_j, v_k, \ell)$ and $L(v_i, v_j, v_k', \ell')$ are mutually exclusive whenever $v_k \ne v_k'$ or $\ell \ne \ell'$. Therefore,

$$\Pr[E(v_i, v_j)] = \sum_{\ell=0}^{+\infty} \sum_{k=1}^{n} \Pr[L(v_i, v_j, v_k, \ell)].$$

Observe that the probability of $L(v_i, v_j, v_k, \ell)$ can be computed by multiplying the following two probabilities:

1. The probability that two $\sqrt{c}$-walks $W_i$ and $W_j$ from $v_i$ and $v_j$, respectively, meet at $v_k$ at step $\ell$.

2. Given that $W_i$ and $W_j$ meet at $v_k$ at step $\ell$, the probability that they do not meet at steps $\ell + 1, \ell + 2, \ldots$.

The first probability equals $h^{(\ell)}(v_i, v_k) \cdot h^{(\ell)}(v_j, v_k)$. Meanwhile, since the $(x + 1)$-th step of any $\sqrt{c}$-walk depends only on its $x$-th step, the second probability should equal the probability that two $\sqrt{c}$-walks from $v_k$ never meet after the $0$-th step, which in turn equals $d_k$. Hence, we have

$$\begin{aligned}
s(v_i, v_j) = \Pr[E(v_i, v_j)] &= \sum_{\ell=0}^{+\infty} \sum_{k=1}^{n} \Pr[L(v_i, v_j, v_k, \ell)] \\
&= \sum_{\ell=0}^{+\infty} \sum_{k=1}^{n} \left(h^{(\ell)}(v_i, v_k) \cdot d_k \cdot h^{(\ell)}(v_j, v_k)\right),
\end{aligned}$$

which completes the proof.

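Proof of Lemma 5. Let $R = \sqrt{c} \cdot P$, and $R^{\ell}$ be the $\ell$-th power of $R$. We have

$$R^0(k, i) = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{otherwise.} \end{cases}$$

Hence, $R^0(k, i) = h^{(0)}(v_i, v_k)$ for all $v_i$ and $v_k$. Assume that for a certain $\ell$, we have $R^{\ell}(k, i) = h^{(\ell)}(v_i, v_k)$ for all $v_i$ and $v_k$. Then,

$$\begin{aligned}
R^{\ell+1}(k, i) &= \left(\sqrt{c} \cdot P \cdot R^{\ell}\right)(k, i) \\
&= \sum_{j=1}^{n} \left(\sqrt{c} \cdot P(k, j) \cdot R^{\ell}(j, i)\right) \\
&= \sum_{\text{each out-neighbor } v_j \text{ of } v_k} \left(\frac{\sqrt{c}}{|I(v_j)|} \cdot h^{(\ell)}(v_i, v_j)\right) \\
&= h^{(\ell+1)}(v_i, v_k).
\end{aligned}$$

Therefore, $R^{\ell}(k, i) = h^{(\ell)}(v_i, v_k)$ for all $v_i$, $v_k$, and $\ell$.

Let $D'$ be the $n \times n$ diagonal matrix whose $k$-th diagonal element is $d_k$. Then, Equation (13) can be written as:

$$S = \sum_{\ell=0}^{+\infty} \left((R^{\ell})^{\top} D' R^{\ell}\right).$$

By multiplying $R^{\top}$ and $R$ on the left and right, respectively, on both sides of the equation, we obtain $R^{\top} S R = \sum_{\ell=1}^{+\infty} (R^{\ell})^{\top} D' R^{\ell} = S - D'$, i.e.,

$$S = R^{\top} S R + D' = c P^{\top} S P + D'.$$

This indicates that $D'$ is a diagonal correction matrix. Since the diagonal correction matrix is unique [24], we have $D' = D$.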

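Proof of Lemma 6. By the Chernoff Bound in Lemma 13,

$$\begin{aligned}
\Pr\left[\left|\frac{cnt}{n_r} - \mu\right| \ge \varepsilon/c\right] &\le 2 \exp\left(-\frac{n_r (\varepsilon/c)^2}{\frac{2}{3}\varepsilon/c + 2\mu}\right) \\
&= 2 \exp\left(-\frac{\frac{2}{3}\varepsilon/c + 2}{\frac{2}{3}\varepsilon/c + 2\mu} \log\frac{2}{\delta_d}\right) \\
&\le \delta_d,
\end{aligned}$$

where the last step uses $\mu \le 1$. Therefore, $|\tilde{d}_k - d_k| = c \cdot \left|\frac{cnt}{n_r} - \mu\right| \le \varepsilon$ with at least $1 - \delta_d$ probability.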
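The substitution in the second line of the proof implies a sample size of $n_r = \frac{\frac{2}{3}\varepsilon/c + 2}{(\varepsilon/c)^2} \log\frac{2}{\delta_d}$. The sketch below evaluates this quantity and the resulting tail bound. Note that the formula for $n_r$ is inferred from the proof rather than quoted from the lemma statement, and the function names and parameter values are illustrative assumptions.

```python
import math


def num_walk_pairs(eps, c, delta_d):
    """n_r = (2/3 * eps/c + 2) / (eps/c)^2 * log(2 / delta_d)."""
    e = eps / c
    return (2.0 / 3.0 * e + 2.0) / (e * e) * math.log(2.0 / delta_d)


def two_sided_bound(n_r, eps, c, mu):
    """Tail bound 2 * exp(-n_r * (eps/c)^2 / (2/3 * eps/c + 2 * mu))."""
    e = eps / c
    return 2.0 * math.exp(-n_r * e * e / (2.0 / 3.0 * e + 2.0 * mu))


n_r = num_walk_pairs(eps=0.005, c=0.6, delta_d=1e-6)
print(round(n_r))
print(two_sided_bound(n_r, 0.005, 0.6, mu=1.0))  # worst case mu = 1
print(two_sided_bound(n_r, 0.005, 0.6, mu=0.3))  # smaller mu, smaller bound
```

The bound equals $\delta_d$ only in the worst case $\mu = 1$; for smaller $\mu$ it is strictly lower, which suggests that a fixed sample size of this form is conservative when $\mu$ is small.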

Proof of Lemma 7. According to Algorithm 2, for all $\tilde{h}^{(\ell)}(v_i, v_j) \in H(v_i)$, we have

$$\theta \le \tilde{h}^{(\ell)}(v_i, v_j) \le h^{(\ell)}(v_i, v_j).$$

Then, for each node $v_i$ and each step $\ell$,

$$\sum_{v_j \in V} \tilde{h}^{(\ell)}(v_i, v_j) \le \sum_{v_j \in V} h^{(\ell)}(v_i, v_j) = (\sqrt{c})^{\ell}.$$

Therefore, there are at most $(\sqrt{c})^{\ell}/\theta$ nodes $v_j$ such that $\tilde{h}^{(\ell)}(v_i, v_j) \in H(v_i)$. Hence, the size of $H(v_i)$ is

$$|H(v_i)| \le \sum_{\ell=0}^{+\infty} \frac{(\sqrt{c})^{\ell}}{\theta} = O(1/\theta).$$

Let $d$ be the average out-degree of $G$. Since a local update is performed on each entry $\tilde{h}^{(\ell)}(v_i, v_j) \in H(v_i)$, the running time of the algorithm is $O(dn/\theta) = O(m/\theta)$.

Let $\varepsilon_{\ell}$ be the upper bound of $|\tilde{h}^{(\ell)}(v_i, v_j) - h^{(\ell)}(v_i, v_j)|$ for all nodes $v_i$ and $v_j$ at step $\ell$. When $\ell = 0$, we have $\varepsilon_0 \le \frac{1 - (\sqrt{c})^0}{1 - \sqrt{c}}\theta$. Assume that $\varepsilon_{\ell} \le \frac{1 - (\sqrt{c})^{\ell}}{1 - \sqrt{c}}\theta$ holds for a certain $\ell$. Then,

$$\varepsilon_{\ell+1} \le \sqrt{c} \cdot \varepsilon_{\ell} + \theta \le \frac{1 - (\sqrt{c})^{\ell+1}}{1 - \sqrt{c}}\theta.$$

Thus, the lemma is proved.

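Proof of Lemma 8. Given that $|\tilde{d}_k - d_k| \le \varepsilon_d$ and $-\varepsilon^{(\ell)}_h \le \tilde{h}^{(\ell)}(v_k, v_x) - h^{(\ell)}(v_k, v_x) \le 0$ for any $k, x, \ell$, we have

$$\begin{aligned}
&\tilde{h}^{(\ell)}(v_i, v_k) \cdot \tilde{d}_k \cdot \tilde{h}^{(\ell)}(v_j, v_k) - h^{(\ell)}(v_i, v_k) \cdot d_k \cdot h^{(\ell)}(v_j, v_k) \\
&\quad\le h^{(\ell)}(v_i, v_k) \cdot \tilde{d}_k \cdot h^{(\ell)}(v_j, v_k) - h^{(\ell)}(v_i, v_k) \cdot d_k \cdot h^{(\ell)}(v_j, v_k) \\
&\quad\le h^{(\ell)}(v_i, v_k) \cdot \varepsilon_d \cdot h^{(\ell)}(v_j, v_k).
\end{aligned}$$

Therefore,

$$\begin{aligned}
\tilde{s}(v_i, v_j) - s(v_i, v_j) &\le \sum_{\ell=0}^{+\infty} \sum_{k=1}^{n} h^{(\ell)}(v_i, v_k) \cdot \varepsilon_d \cdot h^{(\ell)}(v_j, v_k) \\
&\le \sum_{\ell=0}^{+\infty} \varepsilon_d \cdot (\sqrt{c})^{\ell} \cdot (\sqrt{c})^{\ell} = \frac{\varepsilon_d}{1 - c}.
\end{aligned}$$

Meanwhile,

$$\begin{aligned}
&h^{(\ell)}(v_i, v_k) \cdot d_k \cdot h^{(\ell)}(v_j, v_k) - \tilde{h}^{(\ell)}(v_i, v_k) \cdot \tilde{d}_k \cdot \tilde{h}^{(\ell)}(v_j, v_k) \\
&\quad\le h^{(\ell)}(v_i, v_k) \cdot d_k \cdot h^{(\ell)}(v_j, v_k) - \left(h^{(\ell)}(v_i, v_k) - \varepsilon^{(\ell)}_h\right) \cdot (d_k - \varepsilon_d) \cdot \left(h^{(\ell)}(v_j, v_k) - \varepsilon^{(\ell)}_h\right) \\
&\quad= \left(h^{(\ell)}(v_i, v_k) - \varepsilon^{(\ell)}_h\right) \cdot \varepsilon_d \cdot \left(h^{(\ell)}(v_j, v_k) - \varepsilon^{(\ell)}_h\right) \\
&\qquad + \varepsilon^{(\ell)}_h \cdot d_k \cdot \left(h^{(\ell)}(v_j, v_k) - \varepsilon^{(\ell)}_h\right) + d_k \cdot \varepsilon^{(\ell)}_h \cdot h^{(\ell)}(v_i, v_k).
\end{aligned}$$

Hence,

$$\begin{aligned}
s(v_i, v_j) - \tilde{s}(v_i, v_j) &\le \sum_{\ell=0}^{+\infty} \sum_{k=1}^{n} \Bigl(\left(h^{(\ell)}(v_i, v_k) - \varepsilon^{(\ell)}_h\right) \cdot \varepsilon_d \cdot \left(h^{(\ell)}(v_j, v_k) - \varepsilon^{(\ell)}_h\right) \\
&\qquad\qquad + \varepsilon^{(\ell)}_h \cdot d_k \cdot \left(h^{(\ell)}(v_j, v_k) - \varepsilon^{(\ell)}_h\right) + d_k \cdot \varepsilon^{(\ell)}_h \cdot h^{(\ell)}(v_i, v_k)\Bigr) \\
&\le \sum_{\ell=0}^{+\infty} \left((\sqrt{c})^{\ell} \cdot \varepsilon_d \cdot (\sqrt{c})^{\ell} + 2 \cdot \varepsilon^{(\ell)}_h \cdot (\sqrt{c})^{\ell}\right) \\
&= \frac{\varepsilon_d}{1 - c} + 2 \sum_{\ell=0}^{+\infty} \left((\sqrt{c})^{\ell} \cdot \varepsilon^{(\ell)}_h\right).
\end{aligned}$$

This completes the proof.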

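Proof of Theorem 1. By Lemma 7, for all $k, x, \ell$,

$$-\frac{1 - (\sqrt{c})^{\ell}}{1 - \sqrt{c}}\theta \le \tilde{h}^{(\ell)}(v_k, v_x) - h^{(\ell)}(v_k, v_x) \le 0.$$

Then,

$$\begin{aligned}
\frac{\varepsilon_d}{1 - c} + 2 \sum_{\ell=0}^{+\infty} \left((\sqrt{c})^{\ell} \cdot \varepsilon^{(\ell)}_h\right) &= \frac{\varepsilon_d}{1 - c} + 2 \sum_{\ell=0}^{+\infty} \left((\sqrt{c})^{\ell} \cdot \frac{1 - (\sqrt{c})^{\ell}}{1 - \sqrt{c}}\theta\right) \\
&= \frac{\varepsilon_d}{1 - c} + \frac{2\sqrt{c}}{(1 - c)(1 - \sqrt{c})}\theta.
\end{aligned}$$

By Lemma 6, $|\tilde{d}_k - d_k| \le \varepsilon_d$ holds with at least $1 - \delta_d$ probability. Since $\delta_d \le \delta/n$, with at least $1 - \delta$ probability,

$$|\tilde{d}_k - d_k| < \varepsilon_d \quad \text{for all } k.$$

Therefore, by Lemma 8, $|\tilde{s}(v_i, v_j) - s(v_i, v_j)| \le \varepsilon$ holds with at least $1 - \delta$ probability.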
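The closed form of the sum used in the proof follows from two geometric series. As a short worked derivation (using only $0 < c < 1$):

$$\begin{aligned}
2 \sum_{\ell=0}^{+\infty} (\sqrt{c})^{\ell} \cdot \frac{1 - (\sqrt{c})^{\ell}}{1 - \sqrt{c}}\theta
&= \frac{2\theta}{1 - \sqrt{c}} \left(\sum_{\ell=0}^{+\infty} (\sqrt{c})^{\ell} - \sum_{\ell=0}^{+\infty} c^{\ell}\right)
= \frac{2\theta}{1 - \sqrt{c}} \left(\frac{1}{1 - \sqrt{c}} - \frac{1}{1 - c}\right) \\
&= \frac{2\theta}{1 - \sqrt{c}} \cdot \frac{\sqrt{c} - c}{(1 - \sqrt{c})(1 - c)}
= \frac{2\sqrt{c}}{(1 - c)(1 - \sqrt{c})}\theta,
\end{aligned}$$

where the last step uses $\sqrt{c} - c = \sqrt{c}(1 - \sqrt{c})$.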

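Proof of Lemma 9. Let $\varepsilon^* = \varepsilon_d/c$, and let $\hat{\mu}$ and $\tilde{\mu}$ denote the estimates of $\mu$ obtained in the first and second parts of the sampling procedure, respectively. Then $n_r = \frac{14}{3\varepsilon^*} \log\frac{4}{\delta_d}$, $n^*_r = \frac{2\mu^* + \frac{2}{3}\varepsilon^*}{\varepsilon^{*2}} \log\frac{4}{\delta_d}$, and $\mu^* = \hat{\mu} + \sqrt{\hat{\mu}\varepsilon^*}$. In the second part of the sampling procedure, if $\mu^* \ge \mu$, then by Lemma 14,

$$\begin{aligned}
\Pr\left[|\tilde{\mu} - \mu| > \varepsilon^*\right] &\le 2 \exp\left(-\frac{n^*_r \varepsilon^{*2}}{2\mu + \frac{2}{3}\varepsilon^*}\right) \\
&= 2 \exp\left(-\frac{2\mu^* + \frac{2}{3}\varepsilon^*}{2\mu + \frac{2}{3}\varepsilon^*} \log\frac{4}{\delta_d}\right) \le \frac{\delta_d}{2}.
\end{aligned}$$

We differentiate two cases: $\mu \ge 2\varepsilon^*$ and $\mu < 2\varepsilon^*$. Assume that $\mu \ge 2\varepsilon^*$. Then,

$$\begin{aligned}
\Pr\left[\hat{\mu} \le \mu/2\right] &\le \exp\left(-\frac{n_r (\mu/2)^2}{2\mu + \frac{2}{3} \cdot \mu/2}\right) = \exp\left(-\frac{n_r \mu}{28/3}\right) \\
&\le \exp\left(-\frac{\frac{14}{3\varepsilon^*} \log\frac{4}{\delta_d} \cdot 2\varepsilon^*}{28/3}\right) = \frac{\delta_d}{4}.
\end{aligned}$$

Then, given $\mu \ge 2\varepsilon^*$ and $\hat{\mu} \ge \mu/2 \ge \varepsilon^*$,

$$\begin{aligned}
\Pr\left[\mu > \hat{\mu} + \sqrt{\hat{\mu}\varepsilon^*}\right] &\le \exp\left(-\frac{n_r \hat{\mu} \varepsilon^*}{2\mu + \frac{2}{3}\sqrt{\hat{\mu}\varepsilon^*}}\right) \le \exp\left(-\frac{n_r \hat{\mu} \varepsilon^*}{4\hat{\mu} + \frac{2}{3}\hat{\mu}}\right) \\
&= \exp\left(-\frac{n_r \varepsilon^*}{14/3}\right) = \exp\left(-\frac{\frac{14}{3\varepsilon^*} \log\frac{4}{\delta_d} \cdot \varepsilon^*}{14/3}\right) \le \frac{\delta_d}{4}.
\end{aligned}$$

Therefore, $\mu$ is estimated with at most $\varepsilon^*$ error with at least $1 - \left(\frac{\delta_d}{4} + \frac{\delta_d}{4} + \frac{\delta_d}{2}\right) = 1 - \delta_d$ probability.

Now consider that $\mu < 2\varepsilon^*$. We have

$$\begin{aligned}
\Pr\left[|\hat{\mu} - \mu| \ge \varepsilon^*\right] &\le 2 \exp\left(-\frac{n_r \varepsilon^{*2}}{2\mu + \frac{2}{3}\varepsilon^*}\right) \\
&\le 2 \exp\left(-\frac{n_r \varepsilon^{*2}}{4\varepsilon^* + \frac{2}{3}\varepsilon^*}\right) = 2 \exp\left(-\frac{\frac{14}{3\varepsilon^*} \log\frac{4}{\delta_d} \cdot \varepsilon^*}{14/3}\right) = \frac{\delta_d}{2}.
\end{aligned}$$

If $\hat{\mu} < \varepsilon^*$, then $\mu$ is estimated with at most $\varepsilon^*$ error with at least $1 - \delta_d/2$ probability. On the other hand, if $\hat{\mu} > \varepsilon^*$, then with at least $1 - \delta_d/2$ probability,

$$\mu < \hat{\mu} + \varepsilon^* < \hat{\mu} + \sqrt{\hat{\mu}\varepsilon^*}.$$

Therefore, $\mu$ is estimated with at most $\varepsilon^*$ error with at least $1 - \left(\frac{\delta_d}{2} + \frac{\delta_d}{2}\right) = 1 - \delta_d$ probability.

In summary, with at least $1 - \delta_d$ probability, $\mu$ is estimated with at most $\varepsilon^* = \varepsilon_d/c$ error, in which case $d_k$ is estimated with at most $\varepsilon_d$ error.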

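Proof of Lemma 10. Let $\varepsilon^* = \varepsilon_d/c$, and let $\hat{\mu}$ denote the estimate of $\mu$ obtained in the first part of the sampling procedure. Then, $n_r = \frac{14}{3\varepsilon^*} \log\frac{4}{\delta_d}$, $n^*_r = \frac{2\mu^* + \frac{2}{3}\varepsilon^*}{\varepsilon^{*2}} \log\frac{4}{\delta_d}$, and $\mu^* = \hat{\mu} + \sqrt{\hat{\mu}\varepsilon^*}$. By Lemma 14, for any $k \ge 1$,

$$\begin{aligned}
&\Pr\left[k \cdot (\mu + \varepsilon^*) \le \hat{\mu} - \mu \le (k + 1) \cdot (\mu + \varepsilon^*)\right] \\
&\quad\le \Pr\left[\hat{\mu} - \mu \ge k(\mu + \varepsilon^*)\right] \\
&\quad\le \exp\left(-\frac{k^2 \cdot (\mu + \varepsilon^*)^2 \cdot n_r}{2\mu + \frac{2}{3}k(\mu + \varepsilon^*)}\right) \\
&\quad\le \exp\left(-\frac{k^2 \cdot (\mu + \varepsilon^*)^2 \cdot n_r}{\frac{8}{3} \cdot k \cdot (\mu + \varepsilon^*)}\right) \le \exp\left(-k \log\frac{4}{\delta_d}\right) \le \left(\frac{1}{4}\right)^k.
\end{aligned}$$

Then, we have the following upper bound on the expectation of $\hat{\mu}$:

$$\begin{aligned}
\mathrm{E}[\hat{\mu}] &\le (2\mu + \varepsilon^*) \cdot \Pr\left[\hat{\mu} \le 2\mu + \varepsilon^*\right] + \sum_{k=1}^{+\infty} \frac{(k + 1)(\mu + \varepsilon^*)}{4^k} \\
&\le 2\mu + \varepsilon^* + \sum_{k=1}^{+\infty} \frac{(k + 1)(\mu + \varepsilon^*)}{4^k} = \frac{25\mu + 16\varepsilon^*}{9}.
\end{aligned}$$

Hence, we have the following upper bound on the expected number of $\sqrt{c}$-walk pairs needed:

$$\begin{aligned}
&\frac{14}{3\varepsilon^*} \log\frac{4}{\delta_d} + \frac{2 \cdot \mathrm{E}[\mu^*] + \frac{2}{3}\varepsilon^*}{\varepsilon^{*2}} \log\frac{4}{\delta_d} \\
&\quad\le \frac{14}{3\varepsilon^*} \log\frac{4}{\delta_d} + \frac{2 \cdot \mathrm{E}[\hat{\mu} + (\hat{\mu} + \varepsilon^*)] + \frac{2}{3}\varepsilon^*}{\varepsilon^{*2}} \log\frac{4}{\delta_d} \\
&\quad= O\left(\frac{\mu + \varepsilon_d}{\varepsilon_d^2} \log\frac{1}{\delta_d}\right).
\end{aligned}$$

Let $\ell_i$ be the length of the shorter one of the $i$-th pair of $\sqrt{c}$-walks. Then the $\ell_i$ are identically geometrically distributed with success probability $1 - \sqrt{c}$. The upper bound on the expected running time of the algorithm is then given by

$$\mathrm{E}\left[\sum_{i=1}^{n^*_r} \ell_i\right] = \mathrm{E}[n^*_r] \cdot \mathrm{E}[\ell_i] = O\left(\frac{\mu + \varepsilon_d}{\varepsilon_d^2} \log\frac{1}{\delta_d}\right).$$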
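The constant $\frac{25\mu + 16\varepsilon^*}{9}$ in the bound on the expected first-stage estimate can be verified with the standard series $\sum_{k \ge 1} x^k = \frac{x}{1 - x}$ and $\sum_{k \ge 1} k x^k = \frac{x}{(1 - x)^2}$, evaluated at $x = 1/4$:

$$\sum_{k=1}^{+\infty} \frac{k + 1}{4^k} = \frac{1/4}{(3/4)^2} + \frac{1/4}{3/4} = \frac{4}{9} + \frac{1}{3} = \frac{7}{9},$$

so that

$$2\mu + \varepsilon^* + \frac{7}{9}(\mu + \varepsilon^*) = \frac{18\mu + 9\varepsilon^* + 7\mu + 7\varepsilon^*}{9} = \frac{25\mu + 16\varepsilon^*}{9}.$$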

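Proof of Lemma 11. Let $A^*$ be the algorithm $A$ defined at the end of Section 5.1, except that it returns an estimation $\tilde{\mu}_Z$ with a relative error guarantee, i.e., with at least $1 - \delta$ probability,

$$(1 - \lambda) \cdot \mu_Z \le \tilde{\mu}_Z \le (1 + \lambda) \cdot \mu_Z.$$

The lower bound theorem in [7] shows that the expected number of samples taken by $A^*$ is

$$\Omega\left(\frac{\max\{\mu_Z(1 - \mu_Z),\ \lambda\mu_Z\}}{\lambda^2 \cdot \mu_Z^2} \log\frac{1}{\delta}\right).$$

Observe that if $\lambda = \varepsilon/\mu_Z$, then $A^*$ ensures at most $\varepsilon$ additive error. In that case, if $\mu_Z < 0.5$, the expected number of samples taken by $A^*$ is

$$\Omega\left(\frac{\max\{\mu_Z(1 - \mu_Z),\ \lambda\mu_Z\}}{\lambda^2 \cdot \mu_Z^2} \log\frac{1}{\delta}\right) = \Omega\left(\frac{\max\{\mu_Z,\ \varepsilon\}}{\varepsilon^2} \log\frac{1}{\delta}\right),$$

which completes the proof.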

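Proof of Lemma 12. For given $\ell$ and $t$, consider two random walks $W_1$ and $W_2$, such that $W_1$ starts from $v_i$ at time $0$, while $W_2$ starts at $v_j$ at time $\ell - t$. Consider the probability that $W_1$ and $W_2$ meet at time $\ell$ and do not meet again, denoted by $\rho^{(t)}_{\ell}(v_j)$. Then, we have

• $\rho^{(0)}_{\ell}(v_k) = h^{(\ell)}(v_i, v_k) \cdot d_k$, and

• $\rho^{(t+1)}_{\ell}(v_j) = \frac{\sqrt{c}}{|I(v_j)|} \sum_{v_k \in I(v_j)} \rho^{(t)}_{\ell}(v_k)$.

Observe that, in Algorithm 6, if we ignore the thresholding approximation and the errors in $\tilde{h}^{(\ell)}(v_i, v_k)$, then the algorithm exactly corresponds to the iterative process defined by the above equations. Moreover, $\rho^{(\ell)}_{\ell}(v_j)$ is the probability that two random walks $W_1$ and $W_2$ starting at $v_i$ and $v_j$ together meet each other after $\ell$ steps, for the last time, which exactly equals $\sum_{k=1}^{n} h^{(\ell)}(v_i, v_k) \cdot d_k \cdot h^{(\ell)}(v_j, v_k)$. Therefore,

$$s(v_i, v_j) = \sum_{\ell=0}^{+\infty} \rho^{(\ell)}_{\ell}(v_j).$$

Now, we consider the error in each $\rho^{(t)}_{\ell}(v_j)$. Let $\varepsilon^{(t)}_{\ell}$ be the error at the $t$-th step. Then,

• For $t = 0$, $\varepsilon^{(0)}_{\ell} = |\tilde{\rho}^{(0)}_{\ell}(v_k) - \rho^{(0)}_{\ell}(v_k)| \le (\sqrt{c})^{\ell}\varepsilon_d + \frac{1 - (\sqrt{c})^{\ell}}{1 - \sqrt{c}} \cdot \theta$.

• $\varepsilon^{(t+1)}_{\ell} \le \sqrt{c} \cdot \varepsilon^{(t)}_{\ell} + (\sqrt{c})^{\ell}\theta$.

By solving the inequality for $t = \ell$, we get

$$\varepsilon^{(\ell)}_{\ell} \le c^{\ell}\varepsilon_d + \frac{2(\sqrt{c})^{\ell} - 2c^{\ell}}{1 - \sqrt{c}}\theta.$$

Therefore, the total error of the algorithm is at most

$$\sum_{\ell=0}^{+\infty} \left(c^{\ell}\varepsilon_d + \frac{2(\sqrt{c})^{\ell} - 2c^{\ell}}{1 - \sqrt{c}}\theta\right) = \frac{\varepsilon_d}{1 - c} + \frac{2\sqrt{c}}{(1 - c)(1 - \sqrt{c})}\theta \le \varepsilon.$$

For each $\ell$ and $t$, the calculation of $\rho^{(t+1)}_{\ell}$ requires at most $O(m)$ time (i.e., scanning all the edges in the worst case). Since all entries in $H(v_i)$ are greater than $\theta$, $\ell$ is at most $\log\frac{1}{\theta}$. Therefore, the total running time of the algorithm is bounded by

$$O\left(\sum_{\ell=0}^{\log(1/\theta)} \ell \cdot m\right) = O\left(m \log^2\frac{1}{\varepsilon}\right).$$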