Keyword Search in Databases using PageRank
By Michael Sirivianos
April 11, 2003
Roadmap
• PageRank: Ranking Web Pages using link structure of the web
• Ranking Keyword Search Results in Structured Databases
• Ranking Combining Individual PageRanks
PageRank (1)
• Stanford project: Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd.
• “The PageRank Citation Ranking: Bringing Order to the Web”.
• Started Google.
PageRank (2)
• Makes use of the link structure of the web to calculate a quality ranking (PageRank) for each web page.
• Citation counting is a metric for measuring page or paper quality.
• PageRank is a more sophisticated citation-counting method that is less prone to manipulation.
• Each page has a single PageRank, independent of the keyword query.
• PageRank does NOT express the relevance of a page to a query.
PageRank (3)
• Calculation intuition: the PageRank of page P increases when pages with large PageRanks point to P.
• The rank of a page is distributed evenly among its forward links.
• A problem: when two pages form a loop by pointing to each other but to no other page, and the loop is fed by a link from outside, the loop accumulates rank in every iteration and never distributes it. This is called a rank sink.
PageRank is a Usage Simulation
The “random surfer”:
• Is given a random URL.
• Clicks randomly on links.
• After a while gets bored and gets a new random URL.
The number of visits to each page is its PageRank.
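The random-surfer picture can be tried out directly. The following is a minimal sketch, not the method used in the slides: the four-page graph, the function name `simulate_random_surfer`, the step count and the seed are all assumptions made for illustration, and "getting bored" is modeled as jumping to a random page with probability 1 - d at each step.

```python
import random
from collections import Counter


def simulate_random_surfer(links, steps=200_000, d=0.85, seed=0):
    """Return the fraction of visits each page receives from a random surfer."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = Counter()
    current = rng.choice(pages)              # the surfer is given a random URL
    for _ in range(steps):
        visits[current] += 1
        out_links = links[current]
        if out_links and rng.random() < d:
            current = rng.choice(out_links)  # click randomly on a link
        else:
            current = rng.choice(pages)      # bored: get a new random URL
    return {page: visits[page] / steps for page in pages}


# Hypothetical graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
fractions = simulate_random_surfer(links)
for page, share in sorted(fractions.items(), key=lambda item: -item[1]):
    print(f"{page}: {share:.3f}")
```

The printed shares are visit frequencies, so they sum to 1. Under the assumptions above, multiplying each share by the number of pages should put it on roughly the same scale as a formula of the form PR(A) = (1-d) + d*(…).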
PageRank Calculation
PR(A) = (1-d) + d*( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
• d: damping factor, normally set to 0.85.
• T1, …, Tn: pages pointing to page A.
• PR(A): PageRank of page A.
• PR(Ti): PageRank of page Ti.
• C(Ti): the number of links going out of page Ti.
Note: the damping factor d accounts for PageRank sinks.
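To see why the damping factor matters for rank sinks, the formula can be iterated on a tiny graph with and without damping. This is a sketch under assumptions: the three-page graph, the function name `iterate_pagerank`, the starting value of 1 per page and the iteration count are all chosen for illustration.

```python
def iterate_pagerank(links, d, iterations):
    """Apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) repeatedly, starting from 1."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Rank arriving from every page T that links to this page.
            incoming = sum(ranks[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical graph: A feeds a two-page loop B <-> C that links nowhere else.
sink_graph = {"A": ["B"], "B": ["C"], "C": ["B"]}

no_damping = iterate_pagerank(sink_graph, d=1.0, iterations=50)
with_damping = iterate_pagerank(sink_graph, d=0.85, iterations=50)
print("d = 1.00:", no_damping)
print("d = 0.85:", with_damping)
```

With d = 1 the loop B, C ends up holding all the rank and A drops to 0. With d = 0.85 every page keeps at least 1 - d = 0.15.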
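Example of Calculation (1)
[Figure: a link graph over four pages, A, B, C and D. As the transfers on the following slides show, A links to B and C, B links to C, C links to A, and D links to C.]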
Example of Calculation (2)
[Figure: every page starts with PageRank 1. Page A passes 1*0.85/2 along each of its two links; pages B, C and D each pass 1*0.85 along their single link.]
Example of Calculation (3)
Each page has not passed on 0.15, so we get:
• Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
• Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
• Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
• Page D: receives none, but has not transferred 0.15 = 0.15
[Figure: the graph labeled with the new values: A = 1, B = 0.575, C = 2.275, D = 0.15.]
Example of Calculation (4)
• Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
• Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
• Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
• Page D: receives none, but has not transferred 0.15 = 0.15
[Figure: the graph labeled with the new values: A = 2.08375, B = 0.575, C = 1.19125, D = 0.15.]
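Example - Conclusions
• Once the values converge, page C has the highest PageRank and page A has the next highest: page C has the highest importance in this page graph. After only the two iterations shown above, A is still ahead of C.
• More iterations lead to convergence of the PageRanks.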
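The two iterations of the example, and the converged values, can be checked with a few lines of code. This is a sketch: the link structure is read off the transfers listed in the example, while the function name `pagerank_step` and the choice of 200 extra iterations are assumptions for illustration.

```python
def pagerank_step(ranks, links, d=0.85):
    """One synchronous update of PR(A) = (1-d) + d * sum(PR(T)/C(T))."""
    new_ranks = {}
    for page in links:
        incoming = sum(ranks[src] / len(targets)
                       for src, targets in links.items() if page in targets)
        new_ranks[page] = (1 - d) + d * incoming
    return new_ranks


# Link structure implied by the transfers in the example.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = {page: 1.0 for page in links}   # every page starts at 1
first = pagerank_step(ranks, links)
second = pagerank_step(first, links)

converged = second
for _ in range(200):                    # far more iterations than this graph needs
    converged = pagerank_step(converged, links)

for label, values in (("iteration 1", first), ("iteration 2", second),
                      ("converged", converged)):
    print(label, {page: round(value, 5) for page, value in values.items()})
```

The first two lines of output match the values on the slides. In the converged line C is ahead of A, followed by B and then D.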
Base set
• In practice, when the user gets bored he tends to jump to one of his bookmarked pages instead of a random one. These bookmarked pages constitute the base set.
• The PR formula is modified to reflect this behavior:
PR(A) = (1-d)*E + d*( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
where E = 1 if A is in the base set and E = 0 otherwise.
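The effect of the E term is easy to observe on a small graph. The sketch below is illustrative only: the graph, the function name `pagerank_with_base_set`, the starting values and the iteration count are assumptions, and the base set is a single made-up bookmarked page.

```python
def pagerank_with_base_set(links, base_set, d=0.85, iterations=200):
    """Iterate PR(A) = (1-d)*E + d * sum(PR(T)/C(T)), with E = 1 only in the base set."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            incoming = sum(ranks[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            e = 1.0 if page in base_set else 0.0
            new_ranks[page] = (1 - d) * e + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical graph and base set: only page A is bookmarked.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
bookmarked = pagerank_with_base_set(links, base_set={"A"})
print({page: round(value, 4) for page, value in bookmarked.items()})
```

Page D is neither in the base set nor linked to by any page, so its rank falls to 0, while pages reachable from A keep a positive rank.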
Keyword Query
• Input: set of keywords
• Output: list of nodes ranked according to their relevance to the keywords
• Score of a result node:
  • Sum of keyword-specific PRs (OR semantics)
  • Product of keyword-specific PRs (AND semantics)
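The two scoring rules can be written down in a few lines. In this sketch the function name `score`, the node ids and all PR values are made up for illustration, and a node that does not appear in a keyword's list is assumed to have PR 0 for that keyword.

```python
from math import prod


def score(node, keyword_ranks, semantics="OR"):
    """Combine the keyword-specific PRs of one node into a single score."""
    values = [ranks.get(node, 0.0) for ranks in keyword_ranks.values()]
    return sum(values) if semantics == "OR" else prod(values)


# Made-up keyword-specific PRs: keyword -> {node id: PR}.
keyword_ranks = {
    "pagerank": {"p1": 0.5, "p2": 0.25},
    "database": {"p1": 0.5, "p3": 0.75},
}

nodes = ("p1", "p2", "p3")
or_scores = {node: score(node, keyword_ranks, "OR") for node in nodes}
and_scores = {node: score(node, keyword_ranks, "AND") for node in nodes}
print("OR :", or_scores)
print("AND:", and_scores)
```

Under AND semantics a node scores above 0 only if it has a nonzero PR for every keyword, which is why only p1 survives in this example.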
Database Schema
• C(cid, name): conference
• Y(yid, year, cid): conference year
• P(pid, title, yid): paper
• A(aid, name): author
• PP(pid1, pid2)
• PA(pid, aid)
[Figure: schema graph; arrows denote primary-key-to-foreign-key relations.]
• Tuples in C, Y, P and A are objects that represent nodes in the graph.
• Primary-key-to-foreign-key relations represent edges in the graph.
• All connections are two-way except P – P, which goes only from a paper to the paper it cites.
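The mapping from tables to graph edges can be sketched as follows. The sample rows, the `Edge` tuple, the edge-type labels and the function name `build_edges` are all hypothetical. The sketch also assumes that ids are unique across tables and that in PP the first column is the citing paper and the second the cited one.

```python
from collections import namedtuple

Edge = namedtuple("Edge", "source target kind")

# Hypothetical sample rows, one list per table.
C = [("c1", "Conference X")]
Y = [("y1", 2001, "c1")]
P = [("p1", "Paper one", "y1"), ("p2", "Paper two", "y1")]
A = [("a1", "Author one")]
PP = [("p1", "p2")]          # assumed meaning: p1 cites p2
PA = [("p1", "a1")]


def build_edges(Y, P, PP, PA):
    """Turn key relations into typed, directed edges."""
    edges = []
    for yid, _year, cid in Y:        # conference <-> conference year
        edges += [Edge(cid, yid, "C->Y"), Edge(yid, cid, "Y->C")]
    for pid, _title, yid in P:       # conference year <-> paper
        edges += [Edge(yid, pid, "Y->P"), Edge(pid, yid, "P->Y")]
    for pid, aid in PA:              # paper <-> author
        edges += [Edge(pid, aid, "P->A"), Edge(aid, pid, "A->P")]
    for citing, cited in PP:         # one way only: citing paper -> cited paper
        edges.append(Edge(citing, cited, "P->P"))
    return edges


nodes = [row[0] for table in (C, Y, P, A) for row in table]
edges = build_edges(Y, P, PP, PA)
print(nodes)
for edge in edges:
    print(edge)
```

Keeping the edge type on every edge makes it possible to attach a different weight to each kind of connection later on.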
Architecture
[Figure: two-stage architecture.]
Preprocessing stage: Database → Create PR index → PRindex
• Parameters of Create PR index: d, edge weights, epsilon, threshold
• Attributes of the PRindex table: Keyword; CLOB of (id, PR) list
Query stage: Keywords, k → Query Module → Results
• Results: list of node id, node text, PR with respect to all keywords
Modified PageRank Formula
PR(A)=(1-d) + d*(weight(T1→A)*PR(T1)/C(T1)+…+ weight(Tn→A)*PR(Tn)/C(Tn)), if A has keyword
PR(A)=d*(weight(T1→A)*PR(T1)/C(T1)+…+ weight(Tn→A)*PR(Tn)/C(Tn)), if A doesn’t have keyword
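A keyword-specific run of the modified formula might look like the sketch below. Everything concrete here is an assumption for illustration: the function name `keyword_pagerank`, the tiny graph, the edge-type labels and weights, the starting values, and the use of `epsilon` as a stopping threshold on the largest change between iterations.

```python
def keyword_pagerank(nodes, edges, weights, base_set, d=0.85,
                     epsilon=1e-9, max_iterations=1000):
    """Iterate the weighted formula; only nodes containing the keyword get (1-d)."""
    out_degree = {node: 0 for node in nodes}
    for source, _target, _kind in edges:
        out_degree[source] += 1          # C(T): number of links going out of T

    ranks = {node: 1.0 for node in nodes}
    for _ in range(max_iterations):
        new_ranks = {node: (1 - d) if node in base_set else 0.0 for node in nodes}
        for source, target, kind in edges:
            new_ranks[target] += d * weights[kind] * ranks[source] / out_degree[source]
        change = max(abs(new_ranks[node] - ranks[node]) for node in nodes)
        ranks = new_ranks
        if change < epsilon:             # assumed stopping rule
            break
    return ranks


# Hypothetical graph: paper p1 contains the keyword, cites p2, and has author a1.
nodes = ["p1", "p2", "a1"]
edges = [("p1", "p2", "P->P"), ("p1", "a1", "P->A"), ("a1", "p1", "A->P")]
weights = {"P->P": 0.7, "P->A": 0.2, "A->P": 0.2}   # made-up edge weights

ranks = keyword_pagerank(nodes, edges, weights, base_set={"p1"})
print({node: round(value, 4) for node, value in ranks.items()})
```

Rank originates only at p1, the node that contains the keyword, and reaches p2 and a1 in proportion to the weight of the edge type that connects them.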
Preprocessing stage (1)
• Load the whole database in memory.
• Create edges Hashtable (nodeId, nodeId, type of edge).
• Create nodes Hashtable (nodeId).
• Create text Hashtable (nodeId, text).
• For each keyword: find all nodes that contain the keyword and put them in the base set, then execute the PR algorithm with that base set.
Preprocessing stage (2)
• Create a descending list of (nodeid, PR) pairs.
• Store the list in a CLOB in the PRindex table, indexed by keyword.
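The index step can be sketched with Python's built-in `sqlite3` and `json` modules. This is an illustration, not the system's actual storage layer: a TEXT column stands in for the CLOB, the column name `id_pr_list`, the helper names, the whole-word keyword match, the node texts and the PR values are all assumptions.

```python
import json
import sqlite3


def base_set_for(keyword, node_text):
    """Nodes whose text contains the keyword as a whole word (case-insensitive)."""
    return {node_id for node_id, text in node_text.items()
            if keyword.lower() in text.lower().split()}


def store_index(conn, keyword, ranks):
    """Store the (id, PR) pairs for one keyword, sorted by descending PR."""
    ordered = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
    conn.execute("INSERT OR REPLACE INTO PRindex (keyword, id_pr_list) VALUES (?, ?)",
                 (keyword, json.dumps(ordered)))


def load_index(conn, keyword):
    """Return the stored (id, PR) list for a keyword, or [] if it is not indexed."""
    row = conn.execute("SELECT id_pr_list FROM PRindex WHERE keyword = ?",
                       (keyword,)).fetchone()
    return [tuple(pair) for pair in json.loads(row[0])] if row else []


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRindex (keyword TEXT PRIMARY KEY, id_pr_list TEXT)")

node_text = {"p1": "Keyword search in databases", "p2": "Ranking web pages"}
made_up_ranks = {"p2": 0.05, "p1": 0.15}   # stand-in for the output of the PR run

print(base_set_for("search", node_text))   # nodes that would seed the PR run
store_index(conn, "search", made_up_ranks)
print(load_index(conn, "search"))
```

Sorting once at preprocessing time means the query stage can read each list from the top without sorting again.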
Query Stage
• For each keyword in the input, retrieve its (id, PR) list from the database.
• Resolve the top-k ids with respect to the sum of PageRanks using Fagin’s algorithm (PODS 2001).
Fagin’s Algorithm
• Input: the keyword-specific PR lists, sorted in descending order and scanned from the top.
• For each node, keep the maximum possible value: the PR extracted so far for that node from the scanned lists, plus the PR of the currently pointed nodes in the other lists.
• For each node, also keep the minimum value: the PR extracted so far for that node.
• The algorithm terminates when it finds k objects whose minimum value is greater than the maximum possible value of all the remaining nodes.
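The bound-keeping procedure described above can be sketched as follows. This is an illustrative reading of the slide rather than a reference implementation: the function name `top_k_by_sum` and the example lists are made up, scores are assumed to be non-negative, a node missing from a list is assumed to score 0 there, and all lists are scanned one row at a time in lockstep.

```python
def top_k_by_sum(sorted_lists, k):
    """Return k node ids with the largest summed score, reading as few rows as possible."""
    seen = {}                                   # node id -> {list index: score}
    max_depth = max((len(lst) for lst in sorted_lists), default=0)
    for depth in range(max_depth):
        # Read the next row of every list that still has one.
        for index, lst in enumerate(sorted_lists):
            if depth < len(lst):
                node, node_score = lst[depth]
                seen.setdefault(node, {})[index] = node_score
        # Score at the current scan position; 0 once a list is exhausted.
        frontier = [lst[depth][1] if depth < len(lst) else 0.0 for lst in sorted_lists]
        minimum = {node: sum(scores.values()) for node, scores in seen.items()}
        maximum = {node: minimum[node] + sum(frontier[i] for i in range(len(sorted_lists))
                                             if i not in seen[node])
                   for node in seen}
        best = sorted(seen, key=lambda node: minimum[node], reverse=True)[:k]
        if len(best) == k:
            rest = [maximum[node] for node in seen if node not in best]
            rest.append(sum(frontier))          # bound for nodes not seen anywhere yet
            if min(minimum[node] for node in best) > max(rest):
                return best
    # Every list has been read completely, so the minimum values are exact.
    totals = {node: sum(scores.values()) for node, scores in seen.items()}
    return sorted(totals, key=lambda node: totals[node], reverse=True)[:k]


# Made-up keyword-specific lists, each sorted by descending PR.
lists = [
    [("n1", 0.9), ("n2", 0.5), ("n3", 0.1)],
    [("n2", 0.8), ("n3", 0.3), ("n1", 0.2)],
]
print(top_k_by_sum(lists, 1))
print(top_k_by_sum(lists, 2))
```

On these lists both calls stop after the second row: at that point the minimum value of the leading nodes already exceeds the maximum possible value of everything else. The returned ids are ordered by their minimum value at the time of stopping, which is not necessarily their exact order by total score.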
Conclusions
• We implemented a system for keyword search in databases using PageRank.
• It uses an index of keyword-specific Object Ranks.