Scaling Personalized Web Search
Glen Jeh, Jennifer Widom – Stanford University
Presented by Li-Tal Mashiach, Search Engine Technology course (236620)
Technion
Today’s topics
Overview
Motivation
Personalized PageRank Vector (PPV)
Efficient calculation of PPV
Experimental results
Discussion
PageRank Overview
Ranking method of web pages based on the link structure of the web
Important pages are those linked-to by many important pages
Original PageRank has no initial preference for any particular pages
PageRank Overview
The ranking is based on the probability that a random surfer will visit a certain page at a given time
E(p) can be: uniformly distributed, or biased toward certain pages
PR(p) = (1 - c) \sum_{q: q \to p} \frac{PR(q)}{|Out(q)|} + c \, E(p)
Motivation
We would like to give higher importance to user-selected pages
A user may have a set P of preferred pages
Instead of jumping to any random page with probability c, the jump is restricted to P
That way, we increase the probability that the random surfer will stay in the near environment of the pages in P
Considering P creates a personalized view of the importance of pages on the web
Personalized PageRank Vector (PPV)
Restrict preference sets P to subsets of H, a set of hub pages with high PageRank
PPV is a vector of length n, where n is the number of pages on the web
PPV[p] = the importance of page p
P ⊆ H ⊆ V
PPV Equation
u – preference vector, |u| = 1; u(p) = the amount of preference for page p
A – n×n matrix
c – the probability that the random surfer jumps to a page in P
PPV = (1 - c) \, A \cdot PPV + c \, u

A_{ij} = \begin{cases} \frac{1}{|Out(j)|} & \text{if there is an edge } j \to i \\ 0 & \text{otherwise} \end{cases}
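To make the fixed-point equation concrete, here is a minimal Python sketch (not from the paper) that iterates PPV = (1 - c) A · PPV + c u on a small illustrative graph; the graph, preference vector, and value of c are assumptions for the example.

import numpy as np

# Minimal sketch: iterate PPV = (1 - c) * A * PPV + c * u to a fixed point.
# out_links, u, and c below are illustrative assumptions, not data from the paper.
def personalized_pagerank(out_links, u, c=0.15, iterations=100):
    ppv = np.array(u, dtype=float)
    for _ in range(iterations):
        nxt = c * np.array(u, dtype=float)
        for q, targets in out_links.items():
            if targets:
                share = (1 - c) * ppv[q] / len(targets)
                for p in targets:          # A[p][q] = 1/|Out(q)| for each edge q -> p
                    nxt[p] += share
        ppv = nxt
    return ppv

# Example: 4-page graph, all preference concentrated on page 0.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
print(personalized_pagerank(out_links, [1.0, 0.0, 0.0, 0.0]))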
PPV – Problem
Not practical to compute PPVs during query time
Not practical to compute and store offline: there are 2^{|H|} possible preference sets
How to calculate PPV? How to do it efficiently?
Main Steps to solution
Break down preference vectors into common components
Computation divided between offline (lots of time) and online (focused computation)
Eliminates redundant computation
Linearity Theorem
The solution for a linear combination of preference vectors is the same linear combination of the corresponding PPVs.
Let x_i be a unit vector
Let r_i be the PPV corresponding to x_i, called a hub vector
For preference vectors u_1 and u_2 with corresponding PPVs v_1 and v_2:
PPV(a_1 u_1 + a_2 u_2) = a_1 \, PPV(u_1) + a_2 \, PPV(u_2)
In general, u = \sum_{i=1}^{n} a_i x_i \;\Rightarrow\; v = \sum_{i=1}^{n} a_i r_i
Example
[Figure: precomputed hub vectors r_1, r_2, …, r_k, one per hub page]
Personal preferences of David: pages 1, 2 and 12, i.e. u = \frac{1}{3}(x_1 + x_2 + x_{12})
PPV_{David} = \frac{1}{3} r_1 + \frac{1}{3} r_2 + \frac{1}{3} r_{12}
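A tiny numerical illustration of the same combination (the hub vectors below are made-up placeholders, not data from the paper):

import numpy as np

# Linearity Theorem in action: for u = (1/3)(x1 + x2 + x12),
# PPV(u) = (1/3)(r1 + r2 + r12). The hub vectors here are made-up placeholders.
r1  = np.array([0.50, 0.20, 0.20, 0.10])
r2  = np.array([0.10, 0.60, 0.20, 0.10])
r12 = np.array([0.20, 0.20, 0.50, 0.10])

ppv_david = (r1 + r2 + r12) / 3.0
print(ppv_david)   # David's personalized ranking over the (toy) page set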
Good, but not enough…
If the hub vector r_i for each page in H can be computed ahead of time and stored, then computing a PPV is easier
The number of pre-computed PPVs decreases from 2^{|H|} to |H|
But… each hub vector computation requires multiple scans of the web graph
Time and space grow linearly with |H|
The solution so far is impractical
Decomposition of Hub Vectors
In order to compute and store the hub vectors efficiently, we can further break them down into…
Partial vectors – the unique component of each hub vector
Hubs skeleton – encodes the interrelationships among hub vectors
Construct the full hub vector from these during query time
Saves computation time and storage due to sharing of components among hub vectors
Inverse P-distance
A hub vector r_p can be represented as an inverse P-distance vector:
r_p(q) = \sum_{t: p \rightsquigarrow q} P[t] \, c \, (1 - c)^{l(t)}
l(t) – the number of edges in path t
P[t] – the probability of traveling on path t:
P[t] = \prod_{i=1}^{k-1} \frac{1}{|Out(w_i)|}, \text{ where } t = \langle w_1, \ldots, w_k \rangle
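The formula can be read as "enumerate every path from p to q and weight it by its probability and length". A brute-force Python sketch of that reading, on an assumed toy graph (only feasible for tiny graphs; the efficient algorithms come later in the talk):

# Sketch: r_p(q) = sum over paths t from p to q of P[t] * c * (1 - c)^l(t),
# approximated by enumerating all paths of up to max_len edges (toy graph assumed).
def inverse_p_distance(out_links, p, q, c=0.15, max_len=12):
    total = 0.0
    stack = [(p, 1.0, 0)]          # (current page, P[t] so far, edges so far)
    while stack:
        node, prob, length = stack.pop()
        if node == q:
            total += prob * c * (1 - c) ** length
        if length < max_len and out_links.get(node):
            step = prob / len(out_links[node])
            for w in out_links[node]:
                stack.append((w, step, length + 1))
    return total

out_links = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(inverse_p_distance(out_links, 0, 3))   # two length-2 paths contribute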
Partial Vectors
Breaking r_p into two components:
Partial vector r_p^H – computed without using any intermediate nodes from H
The rest, (r_p - r_p^H) – paths that go through some page h \in H
r_p = r_p^H + (r_p - r_p^H)
For well-chosen sets H, it will be true that for many pages p: r_p - r_p^H \approx 0
Good, but not enough…
Precompute and store the partial vector r_p^H
Cheaper to compute and store than r_p, and decreases as |H| increases
Add (r_p - r_p^H) at query time to compute the full hub vector
But… computing and storing (r_p - r_p^H) could be as expensive as r_p itself
Hubs Skeleton
Breaking down (r_p - r_p^H):
Hubs skeleton – the set of distances among hub pages, giving the interrelationships among the partial vectors
For each p, the hubs skeleton r_p(H) = \{ r_p(h) : h \in H \} has size at most |H|, much smaller than the full hub vector
r_p - r_p^H = \frac{1}{c} \sum_{h \in H} \left( r_p(h) - c \, x_p(h) \right) \left( r_h^H - c \, x_h \right)
The r_h^H are the partial vectors; the values r_p(h) form the hubs skeleton
The c \, x_p(h) and c \, x_h terms handle the case where p or q is itself in H
Example
[Figure: a small web graph with pages a, b, c, d, where page c is a hub page (c \in H)]
r_a^H(b) = \frac{1}{2} c (1 - c)^2
r_a(b) - r_a^H(b) = \frac{1}{c} \left( r_a(c) - c \, x_a(c) \right) \left( r_c^H(b) - c \, x_c(b) \right) = \frac{1}{2} c (1 - c)^2
r_a(b) = r_a^H(b) + \left( r_a(b) - r_a^H(b) \right)
Putting it all together
Given a chosen preference set P ⊆ H:
1. Form a preference vector u = a_1 x_{i_1} + a_2 x_{i_2} + \ldots + a_z x_{i_z}
2. Compute the full hub vector for each i_k:
r_{i_k} = r_{i_k}^H + \frac{1}{c} \sum_{h \in H} \left( r_{i_k}(h) - c \, x_{i_k}(h) \right) \left( r_h^H - c \, x_h \right)
3. Combine the hub vectors: PPV = a_1 r_{i_1} + a_2 r_{i_2} + \ldots + a_z r_{i_z}
Partial vectors are precomputed offline; applying the hubs skeleton may be deferred to query time
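A small Python sketch of this query-time assembly; the function name and the placeholder partial vectors and skeleton values are illustrative assumptions, not the paper's pseudocode:

import numpy as np

# Query-time assembly sketch:
#   r_p = r_p^H + (1/c) * sum_{h in H} (r_p(h) - c*x_p(h)) * (r_h^H - c*x_h)
# 'partial' maps page -> partial vector r_p^H, 'skeleton' maps page -> {h: r_p(h)}.
def assemble_hub_vector(p, partial, skeleton, hubs, c, n):
    x = np.eye(n)
    r = partial[p].copy()
    for h in hubs:
        weight = skeleton[p].get(h, 0.0) - (c if h == p else 0.0)   # r_p(h) - c*x_p(h)
        if weight:
            r += (weight / c) * (partial[h] - c * x[h])
    return r

# Placeholder precomputed data for a 4-page toy graph with hubs H = {0, 1}.
n, c, hubs = 4, 0.15, [0, 1]
partial  = {i: np.full(n, 0.05) for i in range(n)}
skeleton = {i: {h: 0.2 for h in hubs} for i in range(n)}

# Preference set P = {0, 1}, u = 0.5*x_0 + 0.5*x_1: PPV is the same combination of hub vectors.
ppv = 0.5 * assemble_hub_vector(0, partial, skeleton, hubs, c, n) \
    + 0.5 * assemble_hub_vector(1, partial, skeleton, hubs, c, n)
print(ppv)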
Algorithms
Decomposition theorem - Basic dynamic programming algorithm
Partial vectors - Selective expansion algorithm
Hubs skeleton - Repeated squaring algorithm
Decomposition theorem
The basis vector r_p is the average of the basis vectors of its out-neighbors, plus a compensation factor:
r_p = \frac{1 - c}{|Out(p)|} \sum_{q: p \to q} r_q + c \, x_p
Defines relationships among the basis vectors
Having computed the basis vectors of p's out-neighbors to a certain precision, we can use the theorem to compute r_p to greater precision
Basic dynamic programming algorithm
\hat{r}_p^{(k)} – approximation of r_p in iteration k
E_p^{(k)} – error in iteration k of the basis vector for p
\hat{r}_p^{(k+1)} = \frac{1 - c}{|Out(p)|} \sum_{q: p \to q} \hat{r}_q^{(k)} + c \, x_p
E_p^{(k+1)} = \frac{1 - c}{|Out(p)|} \sum_{q: p \to q} E_q^{(k)}
Using the decomposition theorem, we can build a dynamic programming algorithm which iteratively improves the precision of the calculation
On iteration k, only paths with length ≤ k-1 are being considered
The error is reduced by a factor of 1-c on each iteration
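A compact Python sketch of this iteration, computing approximate basis vectors for every page of an assumed toy graph (illustrative, not the paper's exact pseudocode):

import numpy as np

# Basic dynamic programming sketch: on each iteration,
#   r_p^(k+1) = c*x_p + (1-c)/|Out(p)| * sum of r_q^(k) over out-neighbors q.
# R[p] is the current approximation of the basis (hub) vector r_p.
def basis_vectors(out_links, c=0.15, iterations=50):
    n = len(out_links)
    R = c * np.eye(n)                      # iteration 0: r_p ~ c * x_p
    for _ in range(iterations):
        new_R = c * np.eye(n)
        for p, targets in out_links.items():
            if targets:
                new_R[p] += (1 - c) / len(targets) * sum(R[q] for q in targets)
        R = new_R
    return R

out_links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
print(basis_vectors(out_links))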
Computing partial vectors
Selective expansion algorithm:
Tours passing through a hub page h \in H are never considered
The expansion from p stops when reaching a page in H
Output: the partial vector r_p^H
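A rough Python sketch of the selective-expansion idea, written as a residual-pushing approximation on an assumed toy graph; it omits the special handling of the case where p itself is in H and is not the paper's exact pseudocode:

# Selective expansion sketch: push residual probability out from p, but never expand
# hub pages, so only tours avoiding H in their interior contribute to r_p^H.
# The special handling needed when p itself is in H is omitted for simplicity.
def partial_vector(out_links, p, hubs, c=0.15, tol=1e-8):
    r = {}                       # approximation of r_p^H (sparse)
    residual = {p: 1.0}          # probability mass not yet expanded
    while residual:
        q, mass = max(residual.items(), key=lambda kv: kv[1])
        if mass < tol:
            break
        del residual[q]
        r[q] = r.get(q, 0.0) + c * mass            # tours that stop at q
        if q not in hubs and out_links.get(q):     # hub pages are never expanded
            share = (1 - c) * mass / len(out_links[q])
            for w in out_links[q]:
                residual[w] = residual.get(w, 0.0) + share
    return r

out_links = {0: [1, 2], 1: [3], 2: [3], 3: [0]}
print(partial_vector(out_links, 0, hubs={3}))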
Computing hubs skeleton
Repeated squaring algorithm:
Uses the intermediate results from the computation of partial vectors
The error is squared on each iteration – reduces the error much faster
Running time and storage depend only on the size of r_p(H)
This allows the computation to be deferred to query time
Experimental results
Experiments were performed using real web data from Stanford's WebBase, containing 80 million pages after removing leaf pages
Experiments were run using a 1.4 GHz CPU on a machine with 3.5 GB of memory
Experimental results
Partial vector approach is much more effective when H contains high-PageRank pages
H was taken from the top 1000 to the top 100,000 pages with the highest PageRank
Experimental results
Compute hubs skeleton for |H|=10,000
The average size is 9,021 entries, much smaller than the dimension of the full hub vectors
Experimental results
Instead of using the entire set r_p(H), use only the highest m entries
A hub vector containing 14 million nonzero entries can be constructed from partial vectors in 6 seconds
Discussion
Are personalized PageRanks even useful?
What if the personally chosen pages are not representative enough? Too focused?
Even if the overhead scales with the number of pages, do light web users want to accept that overhead?
Performance depends on the choice of personal pages
References
Glen Jeh and Jennifer Widom. Scaling Personalized Web Search. WWW 2003.
Personalized PageRank seminar: Link mining. http://www.informatik.uni-freiburg.de/~ml/teaching/ws04/lm/20041207_PageRank_Alcazar.ppt