Keyword Search in Databases using PageRank By Michael Sirivianos April 11, 2003


Roadmap
• PageRank: Ranking Web Pages using the link structure of the web
• Ranking Keyword Search Results in Structured Databases
• Ranking: Combining Individual PageRanks

PageRank (1)
• Stanford project by Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “The PageRank Citation Ranking: Bringing Order to the Web”.
• Started Google.

PageRank (2)
• Makes use of the link structure of the web to calculate a quality ranking (PageRank) for each web page.
• Citation counting: a metric for measuring page/paper quality.
• PageRank: a more sophisticated citation counting method that is less prone to manipulation.
• Each page has a unique PageRank, independent of the keyword query.
• PageRank does NOT express the relevance of a page to a query.

PageRank (3): Calculation
• Intuition: the PageRank of page P increases when pages with large PageRanks point to P.
• The rank of a page is evenly distributed among its forward links.
• A problem: when two pages form a loop by pointing to each other but to no other page, and some third page points into the loop, the loop accumulates rank in every iteration and never distributes it. This is called a rank sink.

PageRank is a Usage Simulation
“Random surfer”:
• Is given a random URL.
• Clicks randomly on links.
• After a while gets bored and gets a new random URL.
• The number of visits to each page is its PageRank.

PageRank Calculation
PR(A) = (1-d) + d*( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
• d: damping factor, normally set to 0.85.
• T1, …, Tn: pages pointing to page A.
• PR(A): PageRank of page A.
• PR(Ti): PageRank of page Ti.
• C(Ti): the number of links going out of page Ti.
Note: d accounts for rank sinks. The (1-d) term gives every page some rank that does not come from links.
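The formula above can be applied repeatedly until the values settle. The following is a minimal sketch, not the implementation behind these slides; the function name `pagerank`, the `links` dictionary format, and the fixed iteration count are assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    `links` maps each page to the list of pages it links to (assumed input format).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # every page starts with PageRank 1
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}  # the part that is not transferred
        for source, targets in links.items():
            if not targets:
                continue  # a page without forward links passes on nothing
            share = d * pr[source] / len(targets)  # rank is split evenly over forward links
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Two pages that point only to each other keep a PageRank of about 1 each.
ranks = pagerank({"X": ["Y"], "Y": ["X"]})
```

Each pass computes all new values from the previous pass, which matches the step-by-step style of the worked example in these slides.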

Example of Calculation (1)
[Figure: link graph of four pages A, B, C, D. The links used in the following calculation are A→B, A→C, B→C, C→A, and D→C.]

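Example of Calculation (2)
[Figure: every page starts with PageRank 1 (Page A = 1, Page B = 1, Page C = 1, Page D = 1). A page with two outgoing links passes 1*0.85/2 along each link; a page with one outgoing link passes 1*0.85.]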

Example of Calculation (3)
Each page has not passed on 0.15, so we get:
• Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
• Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
• Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
• Page D: receives none, but has not transferred 0.15 = 0.15
PageRanks after the first iteration: Page A = 1, Page B = 0.575, Page C = 2.275, Page D = 0.15

Example of Calculation (4)
Repeating the calculation with the values from the first iteration:
• Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
• Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
• Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
• Page D: receives none, but has not transferred 0.15 = 0.15
PageRanks after the second iteration: Page A = 2.08375, Page B = 0.575, Page C = 1.19125, Page D = 0.15

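Example - Conclusions
• After two iterations Page A (2.08375) is ahead of Page C (1.19125), but the values are still oscillating.
• More iterations lead to convergence of the PageRanks, at roughly C ≈ 1.58, A ≈ 1.49, B ≈ 0.78, D = 0.15.
• At convergence Page C has the highest PageRank and Page A the next highest: Page C has the highest importance in this page graph!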
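The worked example can be checked with a few lines of code. This sketch assumes the link structure implied by the arithmetic in the example (A→B, A→C, B→C, C→A, D→C); the names `LINKS`, `DAMPING`, and `step` are illustrative.

```python
DAMPING = 0.85
# Forward links, inferred from which page receives rank from which in the example.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}


def step(pr):
    """One iteration: every page keeps 0.15 and receives 0.85 * PR / C from its backlinks."""
    new_pr = {page: 1 - DAMPING for page in LINKS}
    for source, targets in LINKS.items():
        for target in targets:
            new_pr[target] += DAMPING * pr[source] / len(targets)
    return new_pr


start = {page: 1.0 for page in LINKS}  # all pages start at 1
first = step(start)                    # values of "Example of Calculation (3)"
second = step(first)                   # values of "Example of Calculation (4)"

converged = second
for _ in range(200):                   # far more passes than needed for this tiny graph
    converged = step(converged)
ranking = sorted(converged, key=converged.get, reverse=True)
```

Up to floating-point rounding, `first` and `second` reproduce the two iterations shown on the slides, and `ranking` places Page C first and Page A second once the values have settled.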

Base set
• In practice, a user who gets bored tends to jump to one of his bookmarked pages instead of a random one. These bookmarked pages constitute the base set.
• The PR formula is modified to reflect this behavior:
PR(A) = (1-d)*E + d*( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )
where E = 1 if A is in the base set, else E = 0.
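A small sketch of the base-set variant follows. It differs from the plain formula only in the constant term, which is granted to base-set pages alone; the function name `base_set_pagerank` and the toy graph are assumptions for illustration.

```python
def base_set_pagerank(links, base_set, d=0.85, iterations=100):
    """PR(A) = (1 - d) * E + d * sum(PR(T) / C(T)), with E = 1 only for base-set pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Only pages in the base set receive the (1 - d) term.
        new_pr = {page: (1 - d) if page in base_set else 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        pr = new_pr
    return pr


# Page C has no backlinks and is not in the base set, so its rank drops to 0.
ranks = base_set_pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}, base_set={"A"})
```

Rank now originates only at base-set pages and flows outward along links, so pages that cannot be reached from the base set end up with no rank.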


Keyword Query
Input: set of keywords
Output: list of nodes ranked according to their relevance to the keywords
Score of a result node:
• Sum of keyword-specific PRs (OR semantics)
• Product of keyword-specific PRs (AND semantics)
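The two scoring rules can be sketched as follows. The function name `combine_scores`, the dictionary layout, and the choice to treat a node missing from a keyword's list as having PR 0 for that keyword are assumptions, not details given in the slides.

```python
from math import prod


def combine_scores(keyword_prs, semantics="OR"):
    """Combine keyword-specific PRs into one score per node.

    `keyword_prs` maps keyword -> {node_id: PR}. A node that is missing for a
    keyword is treated as PR 0 (assumption).
    """
    nodes = set().union(*(prs.keys() for prs in keyword_prs.values()))
    scores = {}
    for node in nodes:
        values = [prs.get(node, 0.0) for prs in keyword_prs.values()]
        # OR semantics: sum of the PRs; AND semantics: product of the PRs.
        scores[node] = sum(values) if semantics == "OR" else prod(values)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


keyword_prs = {
    "database": {"p1": 0.5, "p2": 0.25},
    "ranking": {"p1": 0.5, "p3": 0.75},
}
or_ranking = combine_scores(keyword_prs, "OR")
and_ranking = combine_scores(keyword_prs, "AND")
```

With the product, a node scores above 0 only if it has a nonzero PR for every keyword, which is what gives the AND behavior.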

Database Schema

C(cid,name)

Y(yid,year,cid)

P(pid,title,yid)

A(aid,name)

PP(pid1,pid2)

PA(pid,aid)

• C: conference
• Y: conference year
• P: paper
• A: author
Arrows in the schema graph run from primary key to foreign key.
• Tuples in C, Y, P, A are objects that represent nodes in the schema graph.
• Primary-to-foreign-key relations represent edges in the graph.
• All connections are two-way except P – P, which goes only from a paper to the paper it cites.

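Architecture
[Figure: two-stage architecture.]
• Preprocessing stage: the Database and the parameters (d, edge weights, epsilon, threshold) feed the Create PR index module, which writes the PRindex table.
• Query stage: the Query Module takes the keywords and k, reads the PRindex table, and returns the Results.
• Attributes of the PRindex table: Keyword; CLOB of (id, PR) list.
• Results: list of Node id, Node text, PR wrt all keywords.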

Modified PageRank Formula

PR(A)=(1-d) + d*(weight(T1→A)*PR(T1)/C(T1)+…+ weight(Tn→A)*PR(Tn)/C(Tn)), if A has keyword

PR(A)=d*(weight(T1→A)*PR(T1)/C(T1)+…+ weight(Tn→A)*PR(Tn)/C(Tn)), if A doesn’t have keyword
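One way to read the two cases is sketched below: the (1-d) term goes only to nodes that contain the keyword, and every transferred share is scaled by the weight of its edge. The edge types, the weight values, and the use of `epsilon` as a convergence tolerance are assumptions for illustration; the slides do not specify them.

```python
def keyword_pagerank(edges, weights, keyword_nodes, d=0.85, epsilon=1e-6, max_iterations=1000):
    """Weighted, keyword-specific PR.

    `edges` is a list of (source, target, edge_type) triples and `weights` maps
    edge_type -> weight (both hypothetical formats).
    """
    nodes = {n for s, t, _ in edges for n in (s, t)} | set(keyword_nodes)
    out_degree = {n: 0 for n in nodes}
    for source, _, _ in edges:
        out_degree[source] += 1  # C(T): number of links going out of T
    pr = {n: 1.0 for n in nodes}
    for _ in range(max_iterations):
        # Only nodes that contain the keyword receive the (1 - d) term.
        new_pr = {n: (1 - d) if n in keyword_nodes else 0.0 for n in nodes}
        for source, target, edge_type in edges:
            new_pr[target] += d * weights[edge_type] * pr[source] / out_degree[source]
        change = max(abs(new_pr[n] - pr[n]) for n in nodes)
        pr = new_pr
        if change < epsilon:  # stop once no value moves by more than epsilon
            break
    return pr


edges = [
    ("paper1", "paper2", "cites"),
    ("paper1", "author1", "paper-author"),
    ("author1", "paper1", "author-paper"),
]
weights = {"cites": 0.7, "paper-author": 0.2, "author-paper": 0.2}  # made-up weights
ranks = keyword_pagerank(edges, weights, keyword_nodes={"paper1"})
```

In this toy graph only `paper1` contains the keyword, yet `paper2` and `author1` still receive rank through their edges, with the more heavily weighted citation edge passing on more.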

Preprocessing stage (1)
• Load the whole database in memory:
  • Create edges Hashtable ( nodeId, nodeId, type of edge )
  • Create nodes Hashtable ( nodeId )
  • Create text Hashtable ( nodeId, text )
• For each keyword:
  • Find all nodes that contain the keyword and put them in the base set.
  • Execute the PR algorithm with that base set.

Preprocessing stage (2)
• Create a list of (nodeid, PR) pairs in descending PR order.
• Store the list in a CLOB in the PRindex table, indexed by keyword.
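The preprocessing loop can be sketched in memory as follows. This is an illustration under assumptions: a plain dictionary with JSON strings stands in for the PRindex table and its CLOB column, keywords are taken to be whitespace-separated lowercase words, and `rank_with_base_set` is a hypothetical callable that runs the PR algorithm for a given base set.

```python
import json


def build_pr_index(node_text, rank_with_base_set):
    """Build {keyword: serialized descending (node_id, PR) list}.

    `node_text` maps node_id -> text; `rank_with_base_set(base_set)` returns
    {node_id: PR} (hypothetical interface).
    """
    keywords = {word for text in node_text.values() for word in text.lower().split()}
    index = {}
    for keyword in keywords:
        # Base set: all nodes whose text contains the keyword.
        base_set = {n for n, text in node_text.items() if keyword in text.lower().split()}
        ranks = rank_with_base_set(base_set)
        ordered = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
        index[keyword] = json.dumps(ordered)  # stands in for the CLOB value
    return index


node_text = {"p1": "keyword search", "p2": "pagerank citation ranking", "p3": "top k search"}
# Toy stand-in for the PR algorithm: base-set nodes get 1.0, all others 0.1.
toy_rank = lambda base_set: {n: (1.0 if n in base_set else 0.1) for n in node_text}
index = build_pr_index(node_text, toy_rank)
```

Storing each list already sorted means the query stage can read it from the top without sorting again.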

Query Stage
• For each keyword in the input, retrieve its ( id, PR ) list from the database.
• Resolve the top-k ids with respect to the sum of PageRanks using Fagin’s algorithm (PODS 2001).

Fagin’s Algorithm
• Input: the keyword-specific PR lists, sorted in descending order and scanned in parallel from the top.
• For each node seen so far, keep its maximum possible value: the PRs already found for the node in the scanned lists, plus the PR at the current position of every list in which the node has not appeared yet.
• Also keep its minimum value: the sum of the PRs found for the node so far.
• The algorithm terminates when it has found k nodes whose minimum value is greater than the maximum possible value of all remaining nodes.
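The bound-keeping procedure described above can be sketched as follows. This is a simplified illustration, not the system's implementation: the name `top_k_sum` is hypothetical, PRs are assumed non-negative, a node missing from a list is assumed to have PR 0 there, and the stop test accepts ties (greater than or equal).

```python
def top_k_sum(sorted_lists, k):
    """Return k (node, minimum value) pairs with the largest PR sums.

    `sorted_lists` holds one list per keyword of (node_id, PR) pairs in descending PR order.
    """
    seen = [dict() for _ in sorted_lists]  # per list: node -> PR read so far
    max_depth = max((len(lst) for lst in sorted_lists), default=0)
    for depth in range(1, max_depth + 1):
        for i, lst in enumerate(sorted_lists):  # read the next entry of every list
            if depth <= len(lst):
                node, pr = lst[depth - 1]
                seen[i][node] = pr
        # The PR at the current position bounds all unread entries of that list.
        frontier = [lst[depth - 1][1] if depth < len(lst) else 0.0 for lst in sorted_lists]
        nodes = set().union(*seen)
        lower = {n: sum(s.get(n, 0.0) for s in seen) for n in nodes}
        upper = {n: sum(s[n] if n in s else f for s, f in zip(seen, frontier)) for n in nodes}
        top = sorted(nodes, key=lambda n: lower[n], reverse=True)[:k]
        if len(top) == k:
            # Best case for any other node, including nodes not seen at all yet.
            rest = [upper[n] for n in nodes if n not in top] + [sum(frontier)]
            if lower[top[-1]] >= max(rest):
                return [(n, lower[n]) for n in top]
    # Fewer than k nodes exist: return everything that was found.
    nodes = set().union(*seen)
    totals = {n: sum(s.get(n, 0.0) for s in seen) for n in nodes}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


lists = [
    [("p1", 0.9), ("p2", 0.5), ("p3", 0.1)],
    [("p1", 0.8), ("p3", 0.3), ("p2", 0.2)],
]
top_two = top_k_sum(lists, 2)
top_one = top_k_sum(lists, 1)
```

For `k = 1` the scan can stop after the first row, because no other node can exceed the sum already found for `p1`. When the scan stops early, the returned values are minimum values and may be lower than the nodes' full sums.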

Conclusions

We implemented a system for keyword search in databases using PageRank.

It uses an index of keyword-specific Object Ranks.
