Authoritative Sources in a Hyperlinked Environment
Jon M. Kleinberg
Presentation by Julian Zinn
Searching the Web
• Goal: find pages relevant to a query.
• The basic text-based search algorithms retrieve pages that contain the query keywords.
• Improved searching algorithms can examine the link structure of the web to learn about the contents of web pages.
• This paper introduces an algorithm for identifying authoritative pages and hub pages.
Overview
• Issues in Searching
• Algorithm Overview
• Iterative Algorithm
• Wrap-up
Types of Queries
• Specific queries: information about the topic is scarce.
• Broad-topic queries: information about the topic is overabundant. We want to return the most ‘authoritative’ pages.
• Similar-page queries: find pages that are ‘like’ a given page.
This paper examines broad-topic queries.
Complications with Text-based Search
• An authoritative page for a query may not contain the query terms.
– Example: www.uh.edu contains neither ‘University’ nor ‘Houston’, and has ‘UH’ only six times.
– Text may be in the form of images or Flash animations.
• A page might not be self-descriptive.
– Example: Honda does not describe itself as an automobile manufacturer and Google does not describe itself as a search engine.
Examining Link Structure
• By including a link to a page q, the creator of a page p in some way confers authority on q.
• How can we exploit this latent human judgment information?
• Pitfall: Many links, such as navigational and advertising links, do not confer authority.
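One way to act on this pitfall is to discard links that stay within a single site, since those are mostly navigational. The sketch below assumes links are given as (source URL, target URL) pairs and uses "same host name" as the test. The function name `drop_same_host_links` and the example URLs are hypothetical, and the slides do not say which heuristic was actually used.

```python
from urllib.parse import urlparse

def drop_same_host_links(edges):
    """Keep only links whose source and target are on different hosts.

    Assumption: a link between two pages on the same host is navigational
    and should not count as an endorsement.
    """
    kept = []
    for source, target in edges:
        if urlparse(source).hostname != urlparse(target).hostname:
            kept.append((source, target))
    return kept

# Hypothetical example: one navigational link, one cross-site link.
edges = [
    ("http://www.example.com/index.html", "http://www.example.com/about.html"),
    ("http://fans.example.org/cars.html", "http://www.example.com/index.html"),
]
cross_host = drop_same_host_links(edges)
print(cross_host)
```

Only the cross-site link survives. Advertisement links would need a separate test, which this sketch does not attempt.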
Exploiting Link Structure 1
• An authoritative page must be popular.
• So, of all pages that contain the query terms, return those with the highest in-degree.
• Pitfall: Still misses authoritative pages that do not contain the query terms.
• Pitfall: Universally popular pages (like www.yahoo.com) will be considered highly authoritative for any query terms they contain.
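A minimal sketch of this in-degree heuristic, assuming the link graph is available as a dictionary that maps each page to the pages it links to. The function name `rank_by_in_degree` and the toy graph are illustrative only.

```python
from collections import Counter

def rank_by_in_degree(links, candidates):
    """Rank candidate pages by how many distinct pages link to them.

    links: dict mapping a page to the pages it links to (assumed format).
    candidates: pages that matched the query terms.
    """
    in_degree = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:          # ignore self-links
                in_degree[target] += 1
    # Highest in-degree first; ties broken alphabetically for determinism.
    return sorted(candidates, key=lambda p: (-in_degree[p], p))

# Hypothetical toy graph: "portal" is popular for reasons unrelated to the query.
links = {
    "a": ["portal", "honda"],
    "b": ["portal", "honda"],
    "c": ["portal", "toyota"],
    "d": ["portal"],
}
ranking = rank_by_in_degree(links, ["honda", "toyota", "portal"])
print(ranking)
```

In this toy graph the universally popular page comes out on top even though it is not specific to the query, which is the second pitfall listed above.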
Exploiting Link Structure 2
• Authoritative sources often do not link to other authoritative sources.
– Examples: Toyota does not link to Honda, and Google does not link to Teoma.
• Other pages, which we call hub pages, link to multiple authoritative sources.
– Example: Auto enthusiast websites linking to multiple manufacturers’ websites.
• The authoritative pages for a query share many hub pages.
Algorithm Overview
• For a given query, start with a text-based search to generate an initial root set R.
• Enlarge the root set to a base set S.
• Identify authoritative pages and hub pages in S.
• Return the most authoritative pages in S.
Desiderata for S
S should:
1. Be relatively small.
2. Be rich in relevant pages.
3. Contain most (or many) of the strongest authorities.
R will satisfy 1 and 2, but not 3. Even the set of all pages that contain the query terms may not satisfy 3.
Enlarging R to S
• Pages in R may not be authoritative, but most authoritative pages are probably pointed to by at least one member of R.
• Pages in R may not point to each other.
• Let S = R + all pages pointed to by pages of R + some pages that point to pages of R.
• Use a heuristic to avoid navigation links.
Kleinberg’s experiments used a root set R of 200 pages and base sets S of roughly 1000 to 5000 pages.
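The expansion from R to S can be sketched as follows, assuming out-links and in-links are available as dictionaries (for example from a crawler or a search engine's link index). The function name `build_base_set`, the parameter `max_in_per_page`, and its default of 50 are assumptions made for this illustration; the cap is one way to realise "some pages that point to pages of R" while keeping S small.

```python
def build_base_set(root, out_links, in_links, max_in_per_page=50):
    """Expand a root set R into a base set S.

    out_links / in_links: dicts mapping a page to the pages it points to /
    the pages that point to it (assumed input format).
    max_in_per_page: assumed cap on how many in-linking pages are added per
    root page, so one very popular page cannot blow up the size of S.
    """
    base = set(root)
    for page in root:
        base.update(out_links.get(page, ()))          # all pages R points to
        pointing_in = sorted(in_links.get(page, ()))  # sorted for determinism
        base.update(pointing_in[:max_in_per_page])    # some pages pointing to R
    return base

# Hypothetical toy data.
root = {"r1", "r2"}
out_links = {"r1": {"auth1"}, "r2": {"auth1", "auth2"}}
in_links = {"r1": {"hub1", "hub2", "hub3"}, "r2": {"hub1"}}
base = build_base_set(root, out_links, in_links, max_in_per_page=2)
print(sorted(base))
```

With the cap set to 2, `hub3` is left out. Taking the first pages in sorted order is an arbitrary choice made here only so the result is reproducible.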
Identifying Hubs and Authorities
• Our set S still has the problem of non-authoritative pages of high in-degree.
• The authoritative pages are the popular pages that have a large overlap in the sets of pages that point to them.
• The hub pages are the pages that point to many of the authoritative pages.
Hubs and Authorities Picture
[Figure: hubs pointing to authorities, plus an unrelated page of large in-degree.]
Mutually Reinforcing Relationship
• Good hubs point to many good authorities.
• Good authorities are pointed to by many good hubs.
• This circular definition calls for an iterative algorithm.
Iterative Algorithm 1
• For each page p, we associate a non-negative authority weight x(p) and a non-negative hub weight y(p).
• Values are normalized so that their squares sum to 1: $\sum_{p \in S} x(p)^2 = 1$ and $\sum_{p \in S} y(p)^2 = 1$.
• Larger values indicate better pages.
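A small sketch of the normalization step, assuming the convention from Kleinberg's paper that the squares of the weights sum to 1. The helper name `normalize` and the example weights are illustrative.

```python
import math

def normalize(weights):
    """Scale non-negative weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)   # nothing to scale
    return {page: w / norm for page, w in weights.items()}

x = normalize({"p": 3.0, "q": 4.0})
print(x)
```

Only the relative sizes of the weights matter for ranking, so any fixed normalization would serve. This one keeps the values bounded across iterations.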
Iterative Algorithm 2
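• If p points to many pages with large x-values, then p receives a large y-value: $y(p) \leftarrow \sum_{q:\, p \to q} x(q)$
• If p is pointed to by many pages with large y-values, then p receives a large x-value: $x(p) \leftarrow \sum_{q:\, q \to p} y(q)$
(Here $p \to q$ means that page p links to page q.)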
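The two rules above can be sketched as a pair of functions, assuming the link graph is a dictionary from each page to the pages it points to. The names `update_hubs` and `update_authorities` and the toy graph are illustrative.

```python
def update_hubs(out_links, x):
    """y(p) = sum of x(q) over the pages q that p points to."""
    return {p: float(sum(x.get(q, 0.0) for q in out_links.get(p, ())))
            for p in x}

def update_authorities(out_links, y):
    """x(p) = sum of y(q) over the pages q that point to p."""
    new_x = {p: 0.0 for p in y}
    for q, targets in out_links.items():
        for p in targets:
            if p in new_x:
                new_x[p] += y.get(q, 0.0)
    return new_x

# Hypothetical graph: one hub pointing at two candidate authorities.
out_links = {"hub": ["a1", "a2"], "a1": [], "a2": []}
x = {"hub": 1.0, "a1": 1.0, "a2": 1.0}
y = update_hubs(out_links, x)
x = update_authorities(out_links, y)
print(y, x)
```

After one round, the page that links out gets all of the hub weight and the pages it points to get all of the authority weight.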
Iterative Algorithm 3
• We iterate and renormalize until values converge.
• Therefore, we need to prove convergence.
• The algorithm is a discrete-time evolution and can be written as multiplications of matrices and vectors.
• A result of linear algebra guarantees convergence of the authority vector and the hub vector to the principal eigenvectors of $M^T M$ and $M M^T$, respectively, where M is the adjacency matrix of S.
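A minimal sketch of the iterate-and-renormalize loop, assuming the graph is a dictionary from each page to the pages it links to. The function name `hits`, the tolerance `tol`, the cap `max_iter`, and the stopping test are choices made for this illustration and are not taken from the slides.

```python
import math

def _unit(weights):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: w / norm for p, w in weights.items()} if norm else dict(weights)

def hits(out_links, tol=1e-10, max_iter=1000):
    """Return (authority weights x, hub weights y) after iterating."""
    pages = set(out_links) | {q for ts in out_links.values() for q in ts}
    x = _unit({p: 1.0 for p in pages})
    y = _unit({p: 1.0 for p in pages})
    for _ in range(max_iter):
        # Authority update: sum hub weights of pages pointing in.
        new_x = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            for p in targets:
                new_x[p] += y[q]
        new_x = _unit(new_x)
        # Hub update: sum authority weights of pages pointed to.
        new_y = _unit({p: sum(new_x[q] for q in out_links.get(p, ()))
                       for p in pages})
        change = max(abs(new_x[p] - x[p]) + abs(new_y[p] - y[p]) for p in pages)
        x, y = new_x, new_y
        if change < tol:      # assumed stopping rule
            break
    return x, y

# Hypothetical graph: h1 links to a and b, h2 links only to a.
out_links = {"h1": ["a", "b"], "h2": ["a"]}
x, y = hits(out_links)
best_authority = max(x, key=x.get)
best_hub = max(y, key=y.get)
print(best_authority, best_hub)
```

This is power iteration written with dictionaries instead of explicit matrices, which is why the eigenvector result applies to it.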
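Example: Mini Web
[Figure: a mini web of three pages, X, Y, and Z.]
Hub and authority weight vectors:
\[ H = \begin{pmatrix} h_x \\ h_y \\ h_z \end{pmatrix}, \qquad A = \begin{pmatrix} a_x \\ a_y \\ a_z \end{pmatrix} \]
Single updates:
\[ H_{i+1} = M A_i, \qquad A_{i+1} = M^T H_i \]
One full round (both updates), re-indexing i to count rounds:
\[ H_{i+1} = M M^T H_i, \qquad A_{i+1} = M^T M A_i \]
Adjacency matrix, with rows and columns in the order X, Y, Z (entry 1 in row i, column j means page i links to page j):
\[ M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \]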
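Example
\[ M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad M^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \]
\[ M M^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}, \qquad M^T M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix} \]
Iterations 0, 1, 2, 3, … (unnormalized, entries in the order X, Y, Z), followed by the limiting direction:
\[ H:\ \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \to \begin{pmatrix} 6 \\ 2 \\ 4 \end{pmatrix} \to \begin{pmatrix} 28 \\ 8 \\ 20 \end{pmatrix} \to \begin{pmatrix} 132 \\ 36 \\ 96 \end{pmatrix} \to \cdots \to \begin{pmatrix} 2+\sqrt{3} \\ 1 \\ 1+\sqrt{3} \end{pmatrix} \]
\[ A:\ \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \to \begin{pmatrix} 5 \\ 5 \\ 4 \end{pmatrix} \to \begin{pmatrix} 24 \\ 24 \\ 18 \end{pmatrix} \to \begin{pmatrix} 114 \\ 114 \\ 84 \end{pmatrix} \to \cdots \to \begin{pmatrix} 1+\sqrt{3} \\ 1+\sqrt{3} \\ 2 \end{pmatrix} \]
[Figure: the mini web of pages X, Y, and Z.]
• X is the best hub.
• X and Y tie as the most authoritative pages.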
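The iteration in this example can be checked with a few lines of plain Python. The sketch assumes the adjacency matrix has rows and columns in the order X, Y, Z, with a 1 in row i, column j meaning page i links to page j. The helper names are illustrative, and no normalization is applied, so the raw integer values are printed.

```python
# Assumed reading of the slide's matrix M (rows/columns ordered X, Y, Z).
M = [[1, 1, 1],   # X links to X, Y, Z
     [0, 0, 1],   # Y links to Z
     [1, 1, 0]]   # Z links to X, Y

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

MT = transpose(M)
MMT = matmul(M, MT)   # drives the hub weights
MTM = matmul(MT, M)   # drives the authority weights

H = [1, 1, 1]
A = [1, 1, 1]
history = [(list(H), list(A))]
for _ in range(3):
    H = matvec(MMT, H)
    A = matvec(MTM, A)
    history.append((H, A))

for i, (h, a) in enumerate(history):
    print(i, "H =", h, "A =", a)
```

Each printed row gives the hub and authority vectors for one iteration, starting from all ones.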
Overview
• Issues in Searching
• Algorithm Overview
• Iterative Algorithm
• Example
• Wrap-up
Notes to Consider
• In general, we don’t need to iterate to convergence.
• Paper contains a list of good results for various queries.
• After initial text-based search, the text was ignored in favor of the link structure.
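The first note above can be illustrated with a sketch that runs a fixed number of rounds and returns only the rank order, since the order is what a search engine needs. The function name `hits_rankings` and the four-page graph are hypothetical, and the sketch assumes the graph has at least one link.

```python
def hits_rankings(adj, iterations):
    """Return (authority order, hub order) after a fixed number of rounds.

    adj[i][j] == 1 means page i links to page j. Weights are rescaled by
    their maximum each round, which does not change the ordering.
    """
    n = len(adj)
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        auth = [sum(adj[q][p] * hub[q] for q in range(n)) for p in range(n)]
        hub = [sum(adj[p][q] * auth[q] for q in range(n)) for p in range(n)]
        top_auth, top_hub = max(auth), max(hub)
        auth = [a / top_auth for a in auth]
        hub = [h / top_hub for h in hub]

    def order(weights):
        # Best page first; ties broken by page index.
        return sorted(range(n), key=lambda i: (-weights[i], i))

    return order(auth), order(hub)

# Hypothetical graph: page 0 links to 2 and 3; pages 1 and 3 link to 2.
adj = [[0, 0, 1, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [0, 0, 1, 0]]
early = hits_rankings(adj, 3)
late = hits_rankings(adj, 50)
print(early, late)
```

On this small graph the ordering after 3 rounds already matches the ordering after 50. How many rounds a real base set needs is an empirical question that this toy example does not answer.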
Related Areas
• Similar-page queries.
• Connections with:
– Social networks
– Bibliometrics (citations)
– Stand-alone hypertext environments
– Clustering of link structures
– Multiple sets of hubs and authorities
– Diffusion and Generalization
Conclusion
• Influential paper – many citations.
• Published at around the same time as Google’s PageRank algorithm.
• HITS – Hyperlink-Induced Topic Search
• Clever (IBM)
• Basis of the Teoma search engine’s algorithm.
References
Kleinberg, Jon. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, Vol. 46, No. 5, September 1999, pp. 604-632.
The mini-web example comes from http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt
The End