Hyperlink Analysis: A Survey (In Progress)



Overview of This Talk

Introduction to Hyperlink Analysis

Classification of Hyperlink Analysis

Two sub-topics:
• Measures and Metrics
• Interesting Web Structures


Definition of Hyperlink Analysis

Hyperlink Analysis can be defined as an area of Web Information Retrieval that uses the hyperlink structure of the Web.


Motivation

Hyperlinks serve two main purposes:

• Pure navigation.
• Pointing to pages with authority* on the same topic as the page containing the link.

This can be used to retrieve useful information from the Web.

* Authority: a set of ideas or statements supporting a topic.


What Information Can Be Retrieved?

• Quality of a Web page.
  - The authority of a page on a topic.
  - Ranking of Web pages.
• Interesting Web structures.
  - Graph patterns such as co-citation, social choice, complete bipartite graphs, etc.
• Web page classification.
  - Classifying Web pages according to various topics.


What Information Can Be Retrieved? (Cont…)

• Which pages to crawl.
  - Deciding which Web pages to add to the collection of Web pages.
• Finding related pages.
  - Given one relevant page, find all related pages.
• Detection of duplicated pages.
  - Detection of near-mirror sites to eliminate duplication.


Classification of Hyperlink Analysis Research

Hyperlink Analysis:
• Measures and Metrics
• Interesting Web Structures
• Web Page Classification
• Web Search

(Still needs to be refined. Suggestions welcome.)


Measures/Metrics

Standards for measuring properties of a page or a Web structure:

• Quality of a page.
• Distance between pages.
• Web page reputation.


PageRank Citation Ranking [1]

Aim. A ranking metric for hypertext documents.

Approach. A page has a high rank if the sum of the ranks of its backlinks is high.


Authoritative Sources in a Hyperlinked Environment [3]

Aim. Determining the relative “authority” of pages.

Approach.
• A good authority page is one pointed to by many good hubs.
• A good hub page is one that points to many good authorities.

Results. Efficient when the query topic is sufficiently “broad”.

Benefits. Locating dense bipartite communities.


Does “Authority” Mean Quality? [4]

Aim. Are any metrics we compute for Web documents good predictors of document quality?

Approach.
• Do experts agree in their quality judgments?
• Are different link-based metrics different?
  - In-degree, PageRank and Authority.
• Can we predict human quality judgments?
  - Compute correlations between each pair of metrics and also compare them with expert judgment.
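The pairwise comparison of metrics described above can be sketched as follows. This is only an illustration: the slide does not say which correlation coefficient was used, so Spearman rank correlation (in its simple form, valid only when there are no tied scores) is an assumption, and the page scores are made-up numbers, not data from [4].

```python
from itertools import combinations


def ranks(scores):
    """Return the rank (1 = highest score) of each entry; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(a, b):
    """Spearman rank correlation for two equally long, tie-free score lists."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores for five pages (illustrative values only).
metrics = {
    "indegree": [120, 45, 300, 10, 80],
    "pagerank": [0.21, 0.09, 0.40, 0.02, 0.28],
    "authority": [0.30, 0.10, 0.55, 0.01, 0.04],
    "expert": [4.5, 3.0, 5.0, 1.0, 4.0],
}

# Compare every pair of metrics, including the expert judgment.
for name_a, name_b in combinations(metrics, 2):
    print(name_a, name_b, round(spearman(metrics[name_a], metrics[name_b]), 2))
```

A coefficient near 1 for a pair means the two metrics order the pages almost identically, which is the kind of evidence behind a "no significant difference" conclusion.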


Does “Authority” Mean Quality? [4]

Results.
• Experts agree on the nature of quality within a topic.
• No significant difference between link-based metrics.
• In-degree performed as well as PageRank and Authority.


Web Page Reputations [5]

Aim. Input: a URL. Output: a ranked set of topics for which the page has a reputation.

Approach.
• A page can acquire a high reputation on a topic because it is pointed to by many pages on that topic, or because it is pointed to by some high-reputation pages on that topic.
• A page is deemed an authority on a topic if it is pointed to by good hubs on the topic, and a good hub is one that points to good authorities.


One-level Influence Propagation

The reputation of a page p on a topic t is the probability that a random surfer looking for topic t will visit page p.

At each step, the surfer:
• with probability d > 0, jumps to a random page that contains term t, or
• with probability (1 - d), follows a random link from the current page.

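\[
R^{n}(p,t) =
\begin{cases}
\dfrac{d}{N_t} + (1-d) \displaystyle\sum_{(q,p) \in G} \dfrac{R^{n-1}(q,t)}{Out(q)} & \text{if term } t \text{ appears in page } p \\[2ex]
(1-d) \displaystyle\sum_{(q,p) \in G} \dfrac{R^{n-1}(q,t)}{Out(q)} & \text{otherwise}
\end{cases}
\]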
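A minimal sketch of the one-level iteration, under stated assumptions: N_t is taken to be the number of pages containing term t, Out(q) is the number of out-links of q, and the graph is a plain dictionary. The function name, the default d = 0.15, the fixed iteration count and the toy graph are all illustrative choices, not part of [5].

```python
def one_level_reputation(links, pages_with_term, d=0.15, iterations=50):
    """Iterate R(p, t) for a single term t.

    links: dict mapping each page to the list of pages it links to.
    pages_with_term: non-empty set of pages that contain term t.
    """
    pages = set(links) | {p for outs in links.values() for p in outs}
    n_t = len(pages_with_term)

    # Assumption: start with all mass spread over the pages containing t.
    rank = {p: (1.0 / n_t if p in pages_with_term else 0.0) for p in pages}

    for _ in range(iterations):
        # Random jump: only pages containing the term receive d / N_t.
        new_rank = {p: (d / n_t if p in pages_with_term else 0.0) for p in pages}
        # Link following: q passes (1 - d) * R(q, t) / Out(q) to each target.
        for q, outs in links.items():
            if not outs:
                continue
            share = (1 - d) * rank[q] / len(outs)
            for p in outs:
                new_rank[p] += share
        rank = new_rank
    return rank


# Toy graph: a -> b -> c -> a, where only page "a" contains the term.
toy_links = {"a": ["b"], "b": ["c"], "c": ["a"]}
reputation = one_level_reputation(toy_links, {"a"})
print(sorted(reputation, key=reputation.get, reverse=True))
```

In this toy graph the page containing the term ends up with the highest reputation, and the score decays along the chain of links leading away from it.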


Two Level Influence Propagation

At each step, the random surfer:
• with probability d > 0, jumps to a random page that contains term t, or
• with probability (1 - d), follows a random link forward or backward from the current page, alternating directions.

Authority reputation of a page p on a topic t: the probability that a random surfer looking for topic t makes a forward visit to page p.

Hub reputation of a page p on a topic t: the probability that a random surfer looking for topic t makes a backward visit to page p.


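Two Level Influence Propagation

\[
A^{n}(p,t) =
\begin{cases}
\dfrac{d}{2N_t} + (1-d) \displaystyle\sum_{q \to p} \dfrac{H^{n-1}(q,t)}{Out(q)} & \text{if term } t \text{ appears in page } p \\[2ex]
(1-d) \displaystyle\sum_{q \to p} \dfrac{H^{n-1}(q,t)}{Out(q)} & \text{otherwise}
\end{cases}
\]

\[
H^{n}(p,t) =
\begin{cases}
\dfrac{d}{2N_t} + (1-d) \displaystyle\sum_{p \to q} \dfrac{A^{n-1}(q,t)}{In(q)} & \text{if term } t \text{ appears in page } p \\[2ex]
(1-d) \displaystyle\sum_{p \to q} \dfrac{A^{n-1}(q,t)}{In(q)} & \text{otherwise}
\end{cases}
\]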

A(p,t) = probability of a forward visit to page p when searching for term t = Authority rank of page p on term t

H(p,t) = probability of a backward visit to page p when searching for term t = Hub rank of page p on term t
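A rough sketch of the two-level iteration for A(p,t) and H(p,t), assuming N_t is the number of pages containing term t, Out(q) and In(q) are the out-degree and in-degree of q, and the graph is a dictionary of out-links. The function name, the default d = 0.15, the starting values and the toy graph are illustrative assumptions, not taken from [5].

```python
def two_level_reputation(links, pages_with_term, d=0.15, iterations=50):
    """Iterate authority A(p, t) and hub H(p, t) scores for one term t.

    links: dict mapping each page to the list of pages it links to.
    pages_with_term: non-empty set of pages that contain term t.
    """
    pages = set(links) | {p for outs in links.values() for p in outs}
    out_links = {p: list(links.get(p, [])) for p in pages}
    in_links = {p: [] for p in pages}
    for q, outs in out_links.items():
        for p in outs:
            in_links[p].append(q)

    n_t = len(pages_with_term)
    # Random jump: d / (2 N_t) for pages containing the term, else 0.
    jump = {p: (d / (2 * n_t) if p in pages_with_term else 0.0) for p in pages}

    # Assumption: start with equal authority and hub mass on term pages.
    auth = {p: (0.5 / n_t if p in pages_with_term else 0.0) for p in pages}
    hub = dict(auth)

    for _ in range(iterations):
        new_auth = dict(jump)
        new_hub = dict(jump)
        for q in pages:
            # Forward step: hub mass at q flows along q's out-links.
            for p in out_links[q]:
                new_auth[p] += (1 - d) * hub[q] / len(out_links[q])
            # Backward step: authority mass at q flows back along q's in-links.
            for p in in_links[q]:
                new_hub[p] += (1 - d) * auth[q] / len(in_links[q])
        auth, hub = new_auth, new_hub
    return auth, hub


# Toy graph: two pages point to "a"; every page contains the term.
toy_links = {"h1": ["a"], "h2": ["a"], "a": []}
authority, hubs = two_level_reputation(toy_links, {"h1", "h2", "a"})
print(max(authority, key=authority.get))
```

In the toy graph, "a" collects the authority score because both other pages point to it, while "h1" and "h2" share the hub score equally.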


Factors Affecting Page Reputation

• How well a topic is represented.
• How well pages on a topic are connected.


Link Analysis and Stability [6]

Aim. Determine when to expect stable rankings under small perturbations to hyperlink patterns.

Approach.
• The eigengap directly affects the stability of the eigenvectors in the HITS algorithm.
• Coupled Markov chain theory (?).
  - As long as the perturbed Web pages do not have high overall PageRank scores, the perturbed PageRank scores will not be far from the original.

Result. HITS – Unstable; PageRank – Stable.


Stable Algorithms [7]

Aim. Stable link analysis methods.

Approach.
• Randomized HITS
  - Merging the hubs-and-authorities notion with the “reset” mechanism from PageRank.
• Subspace HITS
  - Combining multiple eigenvectors from HITS to yield aggregate authority scores.

Results. Both approaches are more stable than HITS; the latter is a little worse than PageRank.


Average Clicks [8]

Aim. A new definition of distance between two pages.

Approach. Based on the probability of clicking a link during random surfing.

Benefit. A good justification for fetching neighboring pages in practical search.

Result. Distance by average clicks seems to fit well intuitively.
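The slide gives only the idea, not the formula. The sketch below is one way to turn it into a computable distance, under assumptions that are not in the slide: a surfer on a page with k out-links clicks each one with probability 1/k, the length of a link is the negative logarithm of that probability, and the logarithm base (7 here, a hypothetical choice meaning that one click among 7 links counts as one "average click") sets the unit. The distance between two pages is then the shortest-path length.

```python
import heapq
import math


def average_click_distances(links, source, base=7.0):
    """Shortest click-distance from source to every reachable page.

    links: dict mapping each page to the list of pages it links to.
    base: assumed number of links per 'average click' (hypothetical unit).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist.get(u, math.inf):
            continue  # stale heap entry
        outs = links.get(u, [])
        if not outs:
            continue
        # Link length = -log_base(1 / k) = log_base(k) for k out-links.
        step = math.log(len(outs), base)
        for v in outs:
            candidate = d_u + step
            if candidate < dist.get(v, math.inf):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


# Toy graph: "hub" has 7 links, so each link costs one average click;
# "p1" has a single link, which costs nothing extra under this definition.
toy_links = {"hub": ["p1", "p2", "p3", "p4", "p5", "p6", "p7"], "p1": ["x"]}
print(average_click_distances(toy_links, "hub"))
```

Under this definition a link on a crowded page is "longer" than a link on a sparse page, which is the intuition behind measuring distance by click probability rather than by hop count.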


Interesting Web Structure

Analyzing interesting graph patterns or Web Structures.

Helpful in identification of ‘Web Communities.’


Interesting Web Structures [11]

[Figure: graph patterns]
• Endorsement
• Mutual Reinforcement
• Co-Citation
• Social Choice
• Transitive Endorsement


Interesting Web Structures [11]

[Figure: two graph patterns]
• Directed complete bipartite graph
• NK-clan with N=2, K=10

An NK-clan is a set of K nodes in which there is a path of length N or less (ignoring edge directions) between every pair of nodes.
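The NK-clan definition can be checked mechanically. The sketch below is a hedged illustration: it assumes that the connecting paths must stay inside the candidate node set (the slide does not say whether paths may pass through outside nodes), and the function name and edge-list representation are hypothetical.

```python
from collections import deque


def is_nk_clan(edges, nodes, n):
    """Check whether `nodes` forms an NK-clan with K = len(nodes), N = n.

    edges: iterable of (source, target) pairs; directions are ignored.
    Assumption: paths may only use nodes from the candidate set.
    """
    nodes = set(nodes)
    neighbors = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes and u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    for start in nodes:
        # Breadth-first search limited to depth n.
        depth = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if depth[u] == n:
                continue
            for w in neighbors[u]:
                if w not in depth:
                    depth[w] = depth[u] + 1
                    queue.append(w)
        if len(depth) != len(nodes):
            return False  # some node is more than n steps away
    return True


# A star: every leaf is within two steps of every other leaf via the center.
star = [("c", "l1"), ("c", "l2"), ("l3", "c"), ("l4", "c")]
print(is_nk_clan(star, ["c", "l1", "l2", "l3", "l4"], n=2))
```

With N = 2 the star qualifies, since any two leaves are joined through the center; with N = 1 it would not, because the leaves are not directly linked.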


Interesting Web Structures [11]

[Figure: two graph patterns]
• In-Tree
• Out-Tree


Interesting Web Structures

Web Communities


Friends and Neighbors [9]

Aim. Techniques to mine information in order to predict relationships between individuals.

Approach. Similarity is measured by analyzing text, in-links, out-links and mailing lists.

Result. In-links were ‘good’ predictors.


References

[1] S. Brin and L. Page (1998), The PageRank Citation Ranking: Bringing Order to the Web, Technical Report available at http://www-db.stanford.edu/~backrub/pageranksub.ps, January 1998.

[2] T. Haveliwala (1999), Efficient Computation of PageRank, Technical Report, Stanford University, CA.

[3] J. M. Kleinberg (1998), Authoritative Sources in a Hyperlinked Environment.


[4] B. Amento, L. Terveen, and W. Hill (2000), Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents, ACM, 2000.

[5] D. Rafiei and A. O. Mendelzon (2000), What is this Page Known for? Computing Web Page Reputations, Proceedings of the Ninth International WWW Conference.


[6] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Link Analysis, Eigenvectors and Stability, IJCAI-01.

[7] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Stable Algorithms for Link Analysis, Proc. 24th International Conference on Research and Development in Information Retrieval (SIGIR), 2001.

[8] Y. Matsuo, Y. Ohsawa, and M. Ishizuka (2001), Average-clicks: A New Measure of Distance on the WWW, WI-2001, 2001.


[9] L. A. Adamic and E. Adar (2000), Friends and Neighbors on the Web, Xerox Palo Alto Research Center, Palo Alto, CA 94304.

[10] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas (2000), Finding Authorities and Hubs From Link Structures on the World Wide Web, WWW10 Proceedings.


[11] K. Efe, V. Raghavan, C. H. Chu, A. L. Broadwater, L. Bolelli, and S. Ertekin (2000), The Shape of the Web and Its Implications for Searching the Web, International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, Proceedings at http://www.ssgrr.it/en/ssgrr2000/proceedings.htm, Rome, Italy, Jul.-Aug. 2000.

[12] M. Henzinger (2000), Link Analysis in Web Information Retrieval, ICDE Bulletin, Sept. 2000, Vol. 23, No. 3.


PageRank Approach

PageRank of a page p:

\[
PR(p) = \frac{d}{n} + (1-d) \sum_{(q,p) \in G} \frac{PR(q)}{outdegree(q)}
\]

• d is the damping factor (or the probability that a page is chosen uniformly at random from all pages).
• n is the number of nodes in graph G.
• outdegree(q) is the number of edges leaving page q.
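The PageRank definition on this slide, PR(p) = d/n + (1 - d) times the sum of PR(q)/outdegree(q) over all pages q linking to p, can be computed by simple iteration. The sketch below is a minimal illustration, not the implementation from [1]: the function name, the default d = 0.15, the fixed iteration count and the toy graph are assumptions, and pages without out-links are simply skipped rather than handled specially.

```python
def pagerank(links, d=0.15, iterations=50):
    """Iterate PR(p) = d/n + (1 - d) * sum(PR(q) / outdegree(q)).

    links: dict mapping each page to the list of pages it links to.
    d: probability of jumping to a uniformly random page.
    """
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives the random-jump share d / n.
        new_rank = {p: d / n for p in pages}
        # Each page q splits (1 - d) * PR(q) evenly among its out-links.
        for q, outs in links.items():
            if not outs:
                continue  # assumption: dangling pages pass on nothing
            share = (1 - d) * rank[q] / len(outs)
            for p in outs:
                new_rank[p] += share
        rank = new_rank
    return rank


# Toy graph: "a" and "b" both link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy_links)
print(sorted(scores, key=scores.get, reverse=True))
```

In the toy graph, "c" ranks first because it has two backlinks, "a" follows because its single backlink comes from the highly ranked "c", and "b" has no backlinks at all.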


HITS Approach

Let z denote the vector (1, 1, 1, …, 1).

Initially set x ← z and y ← z.

For i = 1, 2, 3, …
• Apply the I operation: \( x_p \leftarrow \sum_{q : (q,p) \in E} y_q \)
• Apply the O operation: \( y_p \leftarrow \sum_{q : (p,q) \in E} x_q \)
• Normalize x and y.

The sequence of (x, y) pairs produced converges to a limit (x*, y*).

Return (x*, y*) as the authority and hub weights.
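The iteration on this slide can be sketched in a few lines. This is an illustrative version only: it runs a fixed number of rounds instead of testing for convergence, normalizes to unit Euclidean length (the slide does not name the norm), and uses a hypothetical edge-list input and toy graph.

```python
import math


def normalize(weights):
    """Scale a weight dict to unit Euclidean length (no-op if all zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: v / norm for p, v in weights.items()}


def hits(edges, iterations=50):
    """Return (authority, hub) weights for a list of (source, target) edges."""
    pages = {p for edge in edges for p in edge}
    x = {p: 1.0 for p in pages}  # authority weights, initially z
    y = {p: 1.0 for p in pages}  # hub weights, initially z

    for _ in range(iterations):
        # I operation: authority of p = sum of hub weights of pages linking to p.
        new_x = {p: 0.0 for p in pages}
        for q, p in edges:
            new_x[p] += y[q]
        # O operation: hub of p = sum of (updated) authority weights p links to.
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Toy graph: h1 -> a, h2 -> a, h2 -> b.
toy_edges = [("h1", "a"), ("h2", "a"), ("h2", "b")]
authority, hub = hits(toy_edges)
print(max(authority, key=authority.get), max(hub, key=hub.get))
```

In the toy graph, "a" becomes the strongest authority (two hubs point to it) and "h2" the strongest hub (it points to both authorities), which mirrors the mutual reinforcement between hubs and authorities.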


Friends and Neighbors

Predicting Friendship

\[
similarity(A,B) = \sum_{\text{shared items}} \frac{1}{\log[\text{frequency}(\text{shared item})]}
\]

Items that are unique to a few users are weighted more than commonly occurring items:
• 2 people mention an item: weight = 1/log(2) = 1.4
• 5 people mention an item: weight = 1/log(5) = 0.62
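The similarity score, a sum of 1/log(frequency) over the items two people share, can be computed as in the sketch below. The weights quoted on the slide (1.4 and 0.62) match the natural logarithm, so that is what the code uses; the function name, the item names and the frequency table are hypothetical.

```python
import math


def similarity(items_a, items_b, frequency):
    """Sum 1 / log(frequency) over the items shared by two people.

    frequency: dict mapping each item to the number of people mentioning it.
    Items mentioned by fewer than 2 people are skipped, since log(1) = 0.
    """
    shared = set(items_a) & set(items_b)
    return sum(
        1.0 / math.log(frequency[item])
        for item in shared
        if frequency[item] > 1
    )


# Hypothetical data: "rare-band" is mentioned by 2 people, "pizza" by 5.
frequency = {"rare-band": 2, "pizza": 5, "chess": 3}
alice = ["rare-band", "pizza", "chess"]
bob = ["rare-band", "pizza"]

print(round(1 / math.log(2), 1))  # weight of an item shared by 2 people
print(round(1 / math.log(5), 2))  # weight of an item shared by 5 people
print(round(similarity(alice, bob, frequency), 2))
```

The rare item contributes more than twice as much to the score as the common one, which is why shared unusual interests are stronger evidence of a relationship than shared popular ones.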