Google Search Engine*, CS461 Lecture, Department of Computer Science, Iowa State University. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, S. Brin and L. Page, in Proceedings of WWW’98
Google Search Engine*
CS461 Lecture
Department of Computer Science
Iowa State University
1. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, S. Brin and L. Page, in Proceedings of WWW’98
2. “The PageRank Citation Ranking: Bringing Order to the Web”, L. Page, S. Brin, R. Motwani, and T. Winograd, Technical Report, Stanford University, 1998
What to cover today
PageRank
Google Architecture
Problem Statement
Ultimate version
  Find what I want
  In most cases, I don’t know exactly, or cannot express clearly, what I want
  “What I want” can be estimated using a set of keywords
Simplified version
  Find the files that are most related to a set of keywords
Naïve Solution
How it works
  Download the entire Internet to a local machine
  Search and return all files containing the set of keywords
Problems
  All files are treated as equally important
  Could return tons of files, most of which are not what I want
  Since most users check only the first few files, this scheme rarely finds anything useful
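The naive scheme above can be sketched in a few lines. This is only an illustration: the `files` dictionary is a hypothetical stand-in for the downloaded copy of the web, and `naive_search` is a name made up for this sketch. It uses plain substring matching and returns every match with no ordering by importance, which is exactly the weakness listed under Problems.

```python
def naive_search(files, keywords):
    """Return the names of all files whose text contains every keyword."""
    wanted = [k.lower() for k in keywords]
    return [name for name, text in files.items()
            if all(k in text.lower() for k in wanted)]


# Hypothetical "downloaded" files, standing in for the local copy of the web.
files = {
    "a.html": "PageRank ranks web pages by their links",
    "b.html": "Cooking pages: pasta and more pasta",
    "c.html": "Web pages about links, links, links",
}

# Matches come back in insertion order, with no notion of importance.
print(naive_search(files, ["web", "pages"]))
```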
Ranking Based on Hit Rate
How it works
  A file is ranked higher if it is visited more frequently
Problems
  Could be affected by faked hits
  Self-reinforcing: a highly ranked file attracts more visits, so it will be ranked higher and higher
Ranking based on Citation
Basic idea
  A paper is important if it is cited by many papers
    Each paper has a set of references that link to the related work
    A pioneering paper typically has a high citation count
  An HTML page is more important if it is linked to by many other pages
    Each page may link to other pages
Problems
  Publication of academic papers is well controlled
    Many are peer-reviewed
    Chronologically ordered
  Internet files could be anything
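Citation-style ranking amounts to counting back links. The sketch below assumes a small hypothetical link graph stored as a dictionary from each page to the pages it links to; `rank_by_backlinks` is an illustrative name, not something from the lecture. Every link counts the same here, which is the limitation that PageRank addresses.

```python
from collections import Counter


def rank_by_backlinks(forward_links):
    """Order pages by how many distinct pages link to them (most first)."""
    counts = Counter({page: 0 for page in forward_links})
    for source, targets in forward_links.items():
        for target in set(targets):
            if target != source:          # ignore self-links
                counts[target] += 1
    # Sort by count descending, then by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link graph: each page maps to the pages it links to.
forward_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

print(rank_by_backlinks(forward_links))
```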
Proposed: PageRank
Basic idea
  A page with many links to it is more likely to be useful than one with few links to it
    Just like citation
  The links from a page that is itself the target of many links are likely to be particularly important
    This is something new
[Figure: a page's back links (incoming) and forward links (outgoing)]
Each link has a different weight
Proposed: PageRank
How it works
  Each page is ranked using a value called PageRank (PR)
  A page’s PR depends on the PRs of its back-link pages
PR(A) = (1 - d) + d * [PR(T1)/C(T1) + … + PR(Tn)/C(Tn)]
  d: damping factor, normally set to 0.85
  T1, …, Tn: the pages that point to page A
  PR(A): PageRank of page A
  PR(Ti): PageRank of page Ti, which points to page A
  C(Ti): the number of links going out of page Ti
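As a minimal sketch, the formula can be evaluated for a single page as follows. The function name `pagerank_of` and the numbers for T1 and T2 are hypothetical, chosen only to show how each back-link page contributes PR(Ti)/C(Ti).

```python
def pagerank_of(page, ranks, backlinks, out_count, d=0.85):
    """Evaluate PR(page) = (1 - d) + d * sum(PR(T) / C(T)) over back links T."""
    received = sum(ranks[t] / out_count[t] for t in backlinks[page])
    return (1 - d) + d * received


# Hypothetical values: T1 and T2 both link to A.
ranks = {"T1": 1.0, "T2": 2.0}       # current PR(T1), PR(T2)
out_count = {"T1": 2, "T2": 1}       # C(T1), C(T2)
backlinks = {"A": ["T1", "T2"]}      # pages pointing to A

# PR(A) = 0.15 + 0.85 * (1.0/2 + 2.0/1)
print(pagerank_of("A", ranks, backlinks, out_count))
```

T2 contributes four times as much as T1 in this example: its own rank is twice as high, and it does not split that rank between two outgoing links.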
Properties of the PageRank formula
  PageRanks form a probability distribution over web pages, so the normalized sum of all web pages' PageRanks will be one
Challenge of calculating PageRanks
  The links can be circular, e.g., A → B → A
Proposed: PageRank
[Figure: Page A and Page B link to each other]
Assign each page an initial rank value
  Could be any number (seed)
Repeat calculations until the rank of each page does not change much
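The two steps above (seed every page, then repeat until the ranks stop changing) can be sketched as a loop with a stopping tolerance. The function name `iterate_pagerank`, the tolerance `tol`, and the cap `max_iter` are assumptions of this sketch, not values from the lecture. Each round here uses only the previous round's ranks.

```python
def iterate_pagerank(forward_links, seed=1.0, d=0.85, tol=1e-6, max_iter=1000):
    """Repeat the PageRank update until no rank changes by more than tol.

    Returns (ranks, number_of_iterations). Assumes every page appears as a
    key of forward_links and has at least one outgoing link.
    """
    ranks = {page: seed for page in forward_links}
    for iteration in range(1, max_iter + 1):
        new_ranks = {page: 1 - d for page in forward_links}
        for source, targets in forward_links.items():
            share = d * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        change = max(abs(new_ranks[p] - ranks[p]) for p in forward_links)
        ranks = new_ranks
        if change < tol:
            return ranks, iteration
    return ranks, max_iter


# The circular case A -> B -> A still settles, even though each rank depends on the other.
ranks, steps = iterate_pagerank({"A": ["B"], "B": ["A"]}, seed=0.0)
print(ranks, steps)
```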
PageRank Calculation
[Figure: Page A and Page B link to each other]
d = 0.85
PR(A) = (1 - d) + d * (PR(B)/1)
PR(B) = (1 - d) + d * (PR(A)/1)
Seed = 1
PR(A) = 0.15 + 0.85 * 1 = 1
PR(B) = 0.15 + 0.85 * 1 = 1
PageRank Calculation
d = 0.85
PR(A) = (1 - d) + d * (PR(B)/1)
PR(B) = (1 - d) + d * (PR(A)/1)
Seed = 0
1) PR(A) = 0.15 + 0.85 * 0 = 0.15
   PR(B) = 0.15 + 0.85 * 0.15 = 0.2775
2) PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
   PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375
3) PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
   PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375
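The arithmetic for seed 0 can be checked with a short script. Note one detail of the worked steps: PR(B) is computed from the PR(A) obtained in the same step, so the sketch below updates the two values in place. The variable names are illustrative.

```python
d = 0.85
pr_a = pr_b = 0.0                      # seed = 0

trace = []
for step in range(1, 4):
    pr_a = (1 - d) + d * pr_b          # uses the latest PR(B)
    pr_b = (1 - d) + d * pr_a          # uses the PR(A) just computed
    trace.append((pr_a, pr_b))
    print(f"{step}) PR(A) = {pr_a:.12f}  PR(B) = {pr_b:.12f}")
```

Up to floating-point rounding, the printed values should agree with the three steps of the seed 0 calculation.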
PageRank Calculation
d = 0.85
PR(A) = (1 - d) + d * (PR(B)/1)
PR(B) = (1 - d) + d * (PR(A)/1)
Seed = 40
1) PR(A) = 0.15 + 0.85 * 40 = 34.15
   PR(B) = 0.15 + 0.85 * 34.15 = 29.1775
2) PR(A) = 0.15 + 0.85 * 29.1775 = 24.950875
   PR(B) = 0.15 + 0.85 * 24.950875 = 21.35824375
3) ......
Observation: whatever seed value you use, once the PageRank calculations settle down, the “normalized probability distribution” (the average PageRank over all pages) will be 1.0
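For the two-page example, the observation can be checked by hand. The short derivation below is a sketch under the same update order used in the worked steps (PR(B) is computed from the PR(A) of the same step); the symbols s for the seed and k for the step number are introduced here for illustration.

```latex
\begin{align*}
PR(A) &= (1-d) + d\,PR(B), \qquad PR(B) = (1-d) + d\,PR(A) \\
\text{fixed point: } x &= (1-d) + d\,x \;\Rightarrow\; x = 1 \\
PR_{k+1}(A) - 1 &= d\,\bigl(PR_k(B) - 1\bigr), \qquad
PR_{k+1}(B) - 1 = d\,\bigl(PR_{k+1}(A) - 1\bigr) \\
\Rightarrow\; PR_k(B) - 1 &= d^{2k}\,(s - 1) \xrightarrow{\;k \to \infty\;} 0 \qquad (0 < d < 1)
\end{align*}
```

With d = 0.85 and s = 40, the first step gives PR(B) = 1 + 0.7225 * 39 = 29.1775. The distance from 1 shrinks by a factor of 0.7225 per step regardless of the seed.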
Example of Calculation (0)
[Figure: link graph of Page A, Page B, Page C, and Page D]
Example of Calculation (1)
Initial values: Page A = 1, Page B = 1, Page C = 1, Page D = 1
Example of Calculation (2)
Start: Page A = 1, Page B = 1, Page C = 1, Page D = 1
Amounts passed along the links:
  Page A passes 1*0.85/2 to Page B and 1*0.85/2 to Page C
  Page B passes 1*0.85 to Page C
  Page C passes 1*0.85 to Page A
  Page D passes 1*0.85 to Page C
Each page has not passed on 0.15, so we get:
Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
Page D: receives none, but has not transferred 0.15 = 0.15
Result: Page A = 1, Page B = 0.575, Page C = 2.275, Page D = 0.15
Example of Calculation (3)
Start: Page A = 1, Page B = 0.575, Page C = 2.275, Page D = 0.15
Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives none, but has not transferred, remains at 0.15
Result: Page A = 2.08375, Page B = 0.575, Page C = 1.19125, Page D = 0.15
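The two rounds of the four-page example can be reproduced with the sketch below. The link graph is inferred from the amounts transferred in the calculation (A links to B and C; B, C, and D each have one outgoing link), and `one_round` is a name chosen for this illustration. Unlike the two-page walkthrough, each round here uses only the previous round's values, which is how the example computes Page B from Page A = 1 in the second round.

```python
# Link graph inferred from the amounts transferred in the worked example.
forward_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
d = 0.85


def one_round(ranks):
    """One synchronous update: every page uses the previous round's ranks."""
    new_ranks = {page: 1 - d for page in ranks}          # the 0.15 not transferred
    for source, targets in forward_links.items():
        for target in targets:
            new_ranks[target] += d * ranks[source] / len(targets)
    return new_ranks


ranks = {page: 1.0 for page in forward_links}            # every page starts at 1
history = []
for round_number in (1, 2):
    ranks = one_round(ranks)
    history.append(ranks)
    print(round_number, {p: round(v, 5) for p, v in sorted(ranks.items())})
```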
Example of calculation (4)
After 20 iterations, we get
Page A = 1.490, Page B = 0.783, Page C = 1.577, Page D = 0.15
In reality: PageRank for 26,000,000 web pages can be computed in a few hours on a medium-size workstation (1998)
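For a graph this small, the values reached after 20 iterations can be cross-checked by solving the four PageRank equations directly. The derivation below assumes the link graph implied by the worked example (A links to B and C; B, C, and D each link to one page; nothing links to D) and d = 0.85.

```latex
\begin{align*}
PR(D) &= 0.15 \\
PR(B) &= 0.15 + 0.425\,PR(A) \\
PR(C) &= 0.15 + 0.85\,\bigl(\tfrac{1}{2}PR(A) + PR(B) + PR(D)\bigr) = 0.405 + 0.78625\,PR(A) \\
PR(A) &= 0.15 + 0.85\,PR(C) = 0.49425 + 0.6683125\,PR(A) \\
\Rightarrow\; PR(A) &= \frac{0.49425}{0.3316875} \approx 1.490, \qquad
PR(B) \approx 0.783, \qquad PR(C) \approx 1.577
\end{align*}
```

The four values add up to about 4.0, so the average PageRank is 1.0, as in the two-page case.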
Result
Page C has the highest PageRank, and Page A has the next highest: Page C is the most important page in this link graph!
More iterations lead to stable PageRank values for ranking the pages returned by a keyword search.
PageRank Summary
PageRank is a citation importance ranking
  Approximate measure of importance or quality
  Number of citations or backlinks
  Each citation has a different weight
The pages with high PageRanks are those that are linked to by many pages and/or by important pages (e.g., Yahoo!)
Questions: how can you improve the ranking of your web pages?
  Creating dummy sites to link to the main site?
  Increasing internal links and/or decreasing external links?
Google Architecture