Purnamrita Sarkar (UC Berkeley) Deepayan Chakrabarti (Yahoo! Research) Andrew W. Moore (Google,...
38
Theoretical Justification of Popular Link Prediction Heuristics Purnamrita Sarkar (UC Berkeley) Deepayan Chakrabarti (Yahoo! Research) Andrew W. Moore (Google, Inc.) 1
Purnamrita Sarkar (UC Berkeley) Deepayan Chakrabarti (Yahoo! Research) Andrew W. Moore (Google, Inc.) 1
Purnamrita Sarkar (UC Berkeley) Deepayan Chakrabarti (Yahoo!
Research) Andrew W. Moore (Google, Inc.) 1
Slide 2
Which pair of nodes {i,j} should be connected? Alice Bob
Charlie Goal: Recommend a movie
Slide 3
Which pair of nodes {i,j} should be connected? Goal: Suggest
friends
Slide 4
Predict link between nodes Connected by the shortest path With
the most common neighbors (length 2 paths) More weight to
low-degree common nbrs (Adamic/Adar) 8 followers 1000 followers
Prolific common friends Less evidence Less prolific Much more
evidence Alice Bob Charlie
Slide 5
Predict link between nodes Connected by the shortest path With
the most common neighbors (length 2 paths) More weight to
low-degree common nbrs (Adamic/Adar) With more short paths (e.g.
length 3 paths ) exponentially decaying weights to longer paths
(Katz measure)
Slide 6
RandomShortest Path Common Neighbors Adamic/AdarEnsemble of
short paths Link prediction accuracy* *Liben-Nowell &
Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007 How do we
justify these observations? Especially if the graph is sparse
Slide 7
7 Unit volume universe Model: 1.Nodes are uniformly distributed
points in a latent space 2.This space has a distance metric
3.Points close to each other are likely to be connected in the
graph Logistic distance function (Raftery+/2002)
Slide 8
8 1 Higher probability of linking radius r determines the
steepness The problem of link prediction is to find the nearest
neighbor who is not currently linked to the node. Equivalent to
inferring distances in the latent space Model: 1.Nodes are
uniformly distributed points in a latent space 2.This space has a
distance metric 3.Points close to each other are likely to be
connected in the graph Logistic distance function
(Raftery+/2002)
Slide 9
RandomShortest Path Common Neighbors Adamic/AdarEnsemble of
short paths Link prediction accuracy *Liben-Nowell & Kleinberg,
2003; Brand, 2005; Sarkar & Moore, 2007 Especially if the graph
is sparse
Slide 10
Pr 2 (i,j) = Pr(common neighbor|d ij ) Product of two logistic
probabilities, integrated over a volume determined by d ij As
Logistic Step function Much easier to analyze! i j
Slide 11
11 Everyone has same radius r i j # common nbrs gives a bound
on distance =Number of common neighbors V(r)=volume of radius r in
D dims Unit volume universe
Slide 12
OPT = node closest to i MAX = node with max common neighbors
with i Theorem: w.h.p Link prediction by common neighbors is
asymptotically optimal d OPT d MAX d OPT + 2[ /V(1)] 1/D
Slide 13
Node k has radius r k. i k if d ik r k (Directed graph) r k
captures popularity of node k Weighted common neighbors: Predict
(i,j) pairs with highest w(r)(r) 13 i k j rkrk m Weight for nodes
of radius r # common neighbors of radius r
Slide 14
Presence of common neighbor is very informative r is close to
max radius Absence is very informative Adamic/Adar 1/r Real world
graphs generally fall in this range i k j radius
Slide 15
RandomShortest Path Common Neighbors Adamic/AdarEnsemble of
short paths Link prediction accuracy *Liben-Nowell & Kleinberg,
2003; Brand, 2005; Sarkar & Moore, 2007 Especially if the graph
is sparse
Slide 16
Common neighbors = 2 hop paths For longer paths: Bounds are
weaker For we need >> to obtain similar bounds justifies the
exponentially decaying weight given to longer paths by the Katz
measure
Slide 17
Three key ingredients 1. Closer points are likelier to be
linked. Small World Model- Watts, Strogatz, 1998, Kleinberg 2001 2.
Triangle inequality holds necessary to extend to - hop paths 3.
Points are spread uniformly at random Otherwise properties will
depend on location as well as distance
Slide 18
RandomShortest Path Common Neighbors Adamic/AdarEnsemble of
short paths Link prediction accuracy* *Liben-Nowell &
Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007 The number
of paths matters, not the length For large dense graphs, common
neighbors are enough Differentiating between different degrees is
important In sparse graphs, length 3 or more paths help in
prediction.
Slide 19
Slide 20
20 1 Higher probability of linking Two sources of randomness
Point positions: uniform in D dimensional space Linkage
probability: logistic with parameters , r , r and D are known
radius r determines the steepness The problem of link prediction is
to find the nearest neighbor who is not currently linked to the
node. Equivalent to inferring distances in the latent space
Slide 21
1 Factor weak bound for Logistic Can be made tighter, as
logistic approaches the step function.
Slide 22
22 Generative model Link Prediction Heuristics node a Most
likely neighbor of node i ? node b Compare A few properties Can
justify the empirical observations We also offer some new
prediction algorithms
Slide 23
Combine bounds from different radii But there might not be
enough data to obtain individual bounds from each radius New sweep
estimator Q r = Fraction of nodes w. radius r, which are common
neighbors. Higher Q r smaller d ij w.h.p
Slide 24
Q r = Fraction of nodes w. radius r, which are common neighbors
larger Q r smaller d ij w.h.p T R : = Fraction of nodes w. radius
R, which are common neighbors. Smaller T R large d ij w.h.p
Slide 25
Q r = Fraction of nodes with radius r which are common
neighbors T R = Fraction of nodes with radius R which are common
neighbors Number of common neighbors of a given radius Large Q r
small d ij Small T R large d ij r
Slide 26
Which pair of nodes {i,j} should be connected? Variant: node i
is given Friend suggestion in Facebook Alice Bob Charlie Movie
recommendation in Netflix
Slide 27
27 Nodes are uniformly distributed in a latent space The
problem of link prediction is to find the nearest neighbor who is
not currently linked to the node. Equivalent to inferring distances
in the latent space Raftery et al.s Model: Unit volume universe
Points close in this space are more likely to be connected.
Slide 28
28 1 Higher probability of linking Two sources of randomness
Point positions: uniform in D dimensional space Linkage
probability: logistic with parameters , r , r and D are known
radius r determines the steepness
Slide 29
i j k 1 ~ Bin[N 1, A(r 1, r 1, d ij )] 2 ~ Bin[N 2, A(r 2, r 2,
d ij )] Example graph: N 1 nodes of radius r 1 and N 2 nodes of
radius r 2 r 1
Bound d ij as a function of For we need >> to obtain
similar bounds justifies the exponentially decaying weight given to
longer paths by the Katz measure Also, we can obtain much tighter
bounds for long paths if shorter paths are known to exist.