Authoritative Sources in a Hyperlinked Environment

Jon M. Kleinberg

Presentation by Julian Zinn

Searching the Web

• Goal: find pages relevant to a query.

• The basic text-based search algorithms retrieve pages that contain the query keywords.

• Improved searching algorithms can examine the link structure of the web to learn about the contents of web pages.

• This paper introduces an algorithm for identifying authoritative pages and hub pages.

Overview

• Issues in Searching

• Algorithm Overview

• Iterative Algorithm

• Wrap-up

Types of Queries

• Specific queries: information about the topic is scarce.

• Broad-topic queries: information about the topic is overabundant. We want to return the most ‘authoritative’ pages.

• Similar-page queries: find pages that are ‘like’ a given page.

This paper examines broad-topic queries.

Complications with Text-based Search

• An authoritative page for a query may not contain the query terms.
– Example: www.uh.edu contains neither ‘University’ nor ‘Houston’, and has ‘UH’ only six times.
– Text may be in the form of images or Flash animations.

• A page might not be self-descriptive.
– Example: Honda does not describe itself as an automobile manufacturer and Google does not describe itself as a search engine.

Examining Link Structure

• The creator of a page p, by including a link to a page q, confers authority in some way to page q.

• How can we exploit this latent human judgment information?

• Pitfall: Many links, such as navigational links and advertisement links, do not confer authority.

Exploiting Link Structure 1

• An authoritative page must be popular.

• So, of all pages that contain the query terms, return those with the highest in-degree.

• Pitfall: Still misses authoritative pages that do not contain the query terms.

• Pitfall: Universally popular pages (like www.yahoo.com) will be considered highly authoritative for any query terms they contain.
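
The in-degree heuristic above can be sketched in a few lines. This is only an illustration: the toy link graph, the page names, and the `matching` set (standing in for the result of a text search) are all made up, and `rank_by_in_degree` is a hypothetical helper, not something from the paper.

```python
from collections import Counter

def rank_by_in_degree(links, matching_pages):
    """Rank matching_pages by the number of distinct pages linking to them."""
    in_degree = Counter()
    for source, targets in links.items():
        for target in set(targets):      # count each source at most once
            if target != source:         # ignore self-links
                in_degree[target] += 1
    # Highest in-degree first; ties broken alphabetically for determinism.
    return sorted(matching_pages, key=lambda p: (-in_degree[p], p))

# Hypothetical toy graph: page -> pages it links to.
links = {
    "fan-site": ["honda", "toyota", "portal"],
    "blog": ["honda", "portal"],
    "news": ["portal"],
    "honda": [],
    "toyota": [],
    "portal": [],
}
# Pages assumed to contain the query terms.
matching = {"honda", "toyota", "portal"}

ranking = rank_by_in_degree(links, matching)
print(ranking)
```

In this toy graph the universally popular `portal` page outranks both manufacturers, which is the second pitfall listed above.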

Exploiting Link Structure 2

• Authoritative sources often do not link to other authoritative sources.
– Examples: Toyota does not link to Honda, and Google does not link to Teoma.

• Other pages, which we call hub pages, link to multiple authoritative sources.
– Example: Auto enthusiast websites linking to multiple manufacturers’ websites.

• The authoritative pages for a query share many hub pages.

Algorithm Overview

• For a query, start with a text-based search to generate an initial root set R.

• Enlarge the root set to a base set S.

• Identify authoritative pages and hub pages in S.

• Return the most authoritative pages in S.

Desiderata for S

S should:
1. Be relatively small.
2. Be rich in relevant pages.
3. Contain most (or many) of the strongest authorities.

R will satisfy 1 and 2, but not 3.

Even the set of all pages that contain the query terms may not satisfy 3.

Enlarging R to S

• Pages in R may not be authoritative, but most authoritative pages are probably pointed to by at least one member of R.

• Pages in R may not point to each other.

• Let S = R + all pages pointed to by pages of R + some pages that point to pages of R.

• Use a heuristic to avoid navigation links.

Kleinberg’s experiments used a root set R of 200 pages and a base set S of roughly 1000 to 5000 pages.
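
The expansion from R to S can be sketched as below. Several details are assumptions made for illustration: the link tables `out_links` and `in_links` are hypothetical pre-built dictionaries, the cap `max_in_per_page` is one way to interpret "some pages that point to pages of R", and the same-host test in `is_navigational` is just one possible heuristic for skipping navigation links.

```python
from urllib.parse import urlparse

def is_navigational(source, target):
    """Assumed heuristic: a link between pages on the same host is navigational."""
    return urlparse(source).hostname == urlparse(target).hostname

def build_base_set(root, out_links, in_links, max_in_per_page=2):
    """Expand root set R into base set S.

    S = R, plus pages that pages in R point to, plus at most
    max_in_per_page pages pointing to each page in R.
    """
    base = set(root)
    for page in root:
        for target in out_links.get(page, ()):
            if not is_navigational(page, target):
                base.add(target)
        pointing_in = [s for s in sorted(in_links.get(page, ()))
                       if not is_navigational(s, page)]
        base.update(pointing_in[:max_in_per_page])
    return base

# Hypothetical link data for a one-page root set.
out_links = {
    "http://r1.example/": ["http://auth.example/", "http://r1.example/about"],
}
in_links = {
    "http://r1.example/": ["http://hub-a.example/", "http://hub-b.example/",
                           "http://hub-c.example/"],
}
root = {"http://r1.example/"}

base = build_base_set(root, out_links, in_links)
print(sorted(base))
```

Capping the number of in-linking pages per root page keeps S relatively small even when a root page is extremely popular.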

Identifying Hubs and Authorities

• Our set S still has the problem of non-authoritative pages of high in-degree.

• The authoritative pages are the popular pages that have a large overlap in the sets of pages that point to them.

• The hub pages are the pages that point to many of the authoritative pages.

Hubs and Authorities Picture

[Figure: hubs and authorities, with an unrelated page of large in-degree]

Mutually Reinforcing Relationship

• Good hubs point to many good authorities.

• Good authorities are pointed to by many good hubs.

• Because each definition depends on the other, the weights must be computed with an iterative algorithm.

Iterative Algorithm 1

• For each page p, we associate a non-negative authority weight x(p) and a non-negative hub weight y(p).

• Values are normalized so that the squared weights sum to 1: $\sum_p x(p)^2 = 1$ and $\sum_p y(p)^2 = 1$.

• Larger values indicate better pages.


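• If p points to many pages with large x-values, then p receives a large y-value:

$$y(p) \leftarrow \sum_{q \,:\, p \text{ points to } q} x(q)$$

• If p is pointed to by many pages with large y-values, then p receives a large x-value:

$$x(p) \leftarrow \sum_{q \,:\, q \text{ points to } p} y(q)$$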

• We iterate and renormalize until values converge.
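
The update rules can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the graph representation (a dict from page to the pages it links to), the fixed iteration count, the choice to apply the authority update before the hub update, and the toy page names are all assumptions.

```python
import math

def normalize(weights):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {p: v / norm for p, v in weights.items()}

def hits(links, iterations=20):
    """Return (authority, hub) weights for a graph given as page -> linked pages."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}   # x(p)
    hub = {p: 1.0 for p in pages}    # y(p)
    for _ in range(iterations):
        # x(p): sum of the hub weights of the pages that point to p.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # y(p): sum of the new authority weights of the pages p points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Hypothetical graph: three hub-like pages and two authority-like pages.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}
auth, hub = hits(links)
for page in sorted(auth):
    print(page, round(auth[page], 3), round(hub[page], 3))
```

A fixed number of iterations is used here for simplicity; a convergence test on the change in weights between rounds would work equally well.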

• Therefore, we need to prove convergence.

• The algorithm is a discrete-time evolution and can be written as multiplications of matrices and vectors.

• A result of linear algebra guarantees convergence of X and Y to the principal eigenvectors of $M^T M$ and $M M^T$, respectively, where M is the adjacency matrix of the pages in S.
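
The matrix form behind this convergence claim can be written out as follows. This sketch assumes that M is the adjacency matrix of S with $M_{pq} = 1$ when p points to q, that the authority update is applied before the hub update in each round, and that $Y_0$ is the starting hub vector; the iteration index k is notation introduced here.

```latex
X_k \propto M^{T} Y_{k-1}, \qquad Y_k \propto M X_k
\quad\Longrightarrow\quad
X_k \propto (M^{T} M)^{k-1} M^{T} Y_0, \qquad Y_k \propto (M M^{T})^{k}\, Y_0 .
```

Both sequences are power iterations on symmetric positive semidefinite matrices. Provided the largest eigenvalue is strictly larger than the others and the starting vector is not orthogonal to the principal eigenvector, the normalized vectors converge to that eigenvector.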

Example: Mini Web

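[Figure: a three-page web with pages X, Y, and Z]

$$H = \begin{bmatrix} h_x \\ h_y \\ h_z \end{bmatrix}, \qquad A = \begin{bmatrix} a_x \\ a_y \\ a_z \end{bmatrix}$$

Hub and authority updates:

$$H = M A, \qquad A = M^{T} H$$

Combining the two updates, one full iteration is:

$$H_i = M M^{T} H_{i-1}, \qquad A_i = M^{T} M A_{i-1}$$

Adjacency matrix (rows and columns ordered X, Y, Z; entry (i, j) is 1 when page i links to page j):

$$M = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$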

Example

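$$M = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \qquad M^{T} = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$

$$M M^{T} = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{bmatrix}, \qquad M^{T} M = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}$$

Iteration     0     1     2     3    …    limit (up to scale)
H   X         1     6    28   132    …    2 + √3
    Y         1     2     8    36    …    1
    Z         1     4    20    96    …    1 + √3
A   X         1     5    24   114    …    1 + √3
    Y         1     5    24   114    …    1 + √3
    Z         1     4    18    84    …    2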

[Figure: the mini web with pages X, Y, and Z]

X is the best hub.

X and Y are the most authoritative (tied); Z has the lowest authority weight.
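
The integer iterates of this example can be reproduced with a short script. It is a sketch under stated assumptions: the adjacency matrix `M` is taken to have rows and columns ordered X, Y, Z with `M[i][j] = 1` when page i links to page j, the weights start at all ones, and normalization is skipped so the numbers stay integers. The helper names `mat_vec` and `transpose` are introduced here for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]

# Assumed adjacency matrix, rows and columns ordered X, Y, Z.
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
MT = transpose(M)

hubs = [1, 1, 1]
auths = [1, 1, 1]
history = [(hubs, auths)]
for _ in range(3):
    # H_i = M M^T H_{i-1}  and  A_i = M^T M A_{i-1}
    hubs = mat_vec(M, mat_vec(MT, hubs))
    auths = mat_vec(MT, mat_vec(M, auths))
    history.append((hubs, auths))

for i, (h, a) in enumerate(history):
    print(i, "H =", h, "A =", a)
```

Normalizing after each step would rescale every vector without changing the relative order of the pages.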

Notes to Consider

• In general, we don’t need to iterate to convergence.

• Paper contains a list of good results for various queries.

• After initial text-based search, the text was ignored in favor of the link structure.

Related Areas

• Similar-page queries.

• Connections with:
– Social networks
– Bibliometrics (citations)
– Stand-alone hypertext environments
– Clustering of link structures
– Multiple sets of hubs and authorities
– Diffusion and Generalization

Conclusion

• Influential paper – many citations.

• Published at about the same time as Google’s PageRank algorithm.

• HITS – Hyperlink-Induced Topic Search

• Clever (IBM)

• Basis of Teoma search engine algorithm.

References

Kleinberg, Jon. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, Vol. 46, No. 5, September 1999, pp. 604-632.

The mini-web example comes from http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt

The End