Link Analysis {week 09}


The College of Saint Rose, CSC 460 / CIS 560 – Search and Information Retrieval, David Goldschmidt, Ph.D. From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.


Are you connected?

The Internet (1969) is a network that's:
▪ Global
▪ Decentralized
▪ Redundant
▪ Made up of many different types of machines

How many machines make up the Internet?

Browsing the Web

from Fluency with Information Technology, 4th edition by Lawrence Snyder, Addison-Wesley, 2010, ISBN 0-13-609182-2

The World Wide Web

Sir Tim Berners-Lee

Weaving the Web

The World Wide Web (or just Web) is:
▪ Global
▪ Decentralized
▪ Redundant (sometimes)
▪ Made up of Web pages and interactive Web services

How many Web pages are on the Web?

Links

Links are useful to us humans for navigating Web sites and finding things

Links are also useful to search engines:

<a href="http://cnn.com">Latest News</a>

▪ Destination link (URL): http://cnn.com
▪ Anchor text: Latest News

Anchor text

How does anchor text apply to ranking?
▪ Anchor text describes the content of the destination page
▪ Anchor text is short, descriptive, and often coincides with query text
▪ Anchor text is typically written by a non-biased third party
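To make the link anatomy concrete, here is a minimal sketch of how a crawler might pull the destination URL and anchor text out of a page. It uses Python's standard `html.parser` module; the names `AnchorCollector` and `extract_links` are hypothetical and chosen only for illustration, and a real search engine would need far more robust HTML handling.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (destination URL, anchor text) pairs from HTML (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text pieces seen inside the open <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (destination URL, anchor text) pairs found in html."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_links('<a href="http://cnn.com"> Latest News\n</a>'))
# prints [('http://cnn.com', 'Latest News')]
```

A search engine could then index the anchor text as a short description of the destination page, not of the page containing the link.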

The Web as a graph (i)

We often represent Web pages as vertices and links as edges in a webgraph

http://www.openarchives.org/ore/0.1/datamodel-images/WebGraphBase.jpg

The Web as a graph (ii)

http://www.growyourwritingbusiness.com/images/web_graph_flower.jpg

An example:

Using webgraphs for ranking

Links may be interpreted as describing a destination Web page in terms of its:
▪ Popularity
▪ Importance

We focus on incoming links (inlinks)
▪ And use this for ranking matching documents
▪ Drawback is obtaining incoming link data

▪ Authority
▪ Incoming link count
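As a rough illustration of the simplest link-based signal, the sketch below stores a webgraph as a dictionary mapping each page to the pages it links to, then counts inlinks. The three-page graph (A links to B and C, B links to C, C links to A) and the function name `inlink_counts` are assumptions made up for this example. It also hints at the drawback noted above: inlinks are only known after every other page's outlinks have been scanned.

```python
def inlink_counts(outlinks):
    """Count incoming links per page, given {page: [pages it links to]}."""
    counts = {page: 0 for page in outlinks}
    for source, targets in outlinks.items():
        # set() ignores duplicate links from the same source page.
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical webgraph: vertices are pages, edges are links.
webgraph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(inlink_counts(webgraph))
# prints {'A': 1, 'B': 1, 'C': 2}
```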

PageRank (i)

▪ PageRank is a link analysis algorithm
▪ PageRank is credited to Sergey Brin and Lawrence Page (the Google guys!)
▪ The original PageRank paper: http://infolab.stanford.edu/~backrub/google.html

PageRank (ii)

Browse the Web as a random surfer:
▪ Choose a random number r between 0 and 1
▪ If r < λ, then go to a random page
▪ Otherwise, follow a random link from the current page
▪ Repeat!

The PageRank of page A (denoted PR(A)) is the probability that this “random surfer” will be looking at that page

PageRank (iii)

Jumping to a random page avoids getting stuck in:
▪ Pages that have no links
▪ Pages that only have broken links
▪ Pages that loop back to previously visited pages
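The random surfer can be simulated directly. The sketch below is one possible reading of the procedure, not a definitive implementation: the function name, the default of λ = 0.15, the fixed seed, and the example graph are all assumptions. It also assumes every link target appears as a key in `outlinks`, and it forces a random jump from a page with no links, since the surfer would otherwise be stuck there.

```python
import random


def simulate_random_surfer(outlinks, lam=0.15, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)          # seeded so that runs are repeatable
    pages = sorted(outlinks)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        r = rng.random()               # random number r between 0 and 1
        links = outlinks[current]
        if r < lam or not links:
            current = rng.choice(pages)    # go to a random page
        else:
            current = rng.choice(links)    # follow a random link
    return {page: count / steps for page, count in visits.items()}


# Hypothetical three-page webgraph.
estimate = simulate_random_surfer({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(estimate)
```

With enough steps, the visit fractions settle near fixed values. Those values are the PageRank scores.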

PageRank (iv)

PageRank of page C is the probability a random surfer is viewing page C
▪ Based on inlinks: PR(C) = PR(A) / 2 + PR(B) / 1 (A has two outgoing links, B has one)
▪ We assume PageRank is initially distributed evenly across all pages (so 0.33 for A, B, and C)
▪ PR(C) = 0.33 / 2 + 0.33 / 1 ≈ 0.50

PageRank (v)

More generally:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$$

▪ B_u is the set of pages that point to u
▪ L_v is the number of outgoing links from page v (not counting duplicate links)
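A single update of this rule can be sketched in a few lines: each page v passes PR(v) / L_v to every page it links to, and page u sums what it receives from the pages in B_u. The function name `pagerank_step` and the three-page graph are assumptions for illustration. The graph was chosen so that A has two outgoing links and B has one, both pointing to C.

```python
def pagerank_step(ranks, outlinks):
    """Apply one update: PR(u) = sum of PR(v) / L_v over pages v linking to u."""
    new_ranks = {page: 0.0 for page in ranks}
    for v, targets in outlinks.items():
        distinct = set(targets)        # duplicate links are not counted in L_v
        for u in distinct:
            new_ranks[u] += ranks[v] / len(distinct)
    return new_ranks


# Hypothetical webgraph, starting from an even distribution of 1/3 per page.
webgraph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
start = {page: 1 / 3 for page in webgraph}
updated = pagerank_step(start, webgraph)
print(round(updated["C"], 2))  # approximately 0.5, i.e. (1/3)/2 + (1/3)/1
```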

PageRank (vi)

We can account for the “random jumps” by incorporating constant λ into the equation:

$$PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$$

▪ N is the number of pages
▪ Typically, λ is low (e.g. λ = 0.15)
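Putting the pieces together, PageRank is usually computed by repeating the update until the values stop changing. The sketch below assumes the update PR(u) = λ/N + (1 − λ) · Σ PR(v)/L_v over the pages v that link to u. The function name, the fixed iteration count, and the example graph are illustrative assumptions. Spreading the rank of a page with no outgoing links evenly over all pages is one common convention, not something the slides specify.

```python
def pagerank(outlinks, lam=0.15, iterations=50):
    """Iteratively compute PageRank for {page: [pages it links to]}."""
    pages = list(outlinks)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}      # start evenly distributed
    for _ in range(iterations):
        # Every page receives the random-jump share lam / N.
        new_ranks = {page: lam / n for page in pages}
        for v in pages:
            targets = set(outlinks[v])             # ignore duplicate links
            if targets:
                share = (1 - lam) * ranks[v] / len(targets)
                for u in targets:
                    new_ranks[u] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = (1 - lam) * ranks[v] / n
                for u in pages:
                    new_ranks[u] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page webgraph.
scores = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({page: round(score, 3) for page, score in scores.items()})
```

On this graph C ends up slightly ahead of A, and B trails both, because B's only inlink carries just half of A's score.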

Link quality (and avoiding spam)

A cycle tends to negate the effectiveness of the PageRank algorithm

What next?

Read and study Chapter 4.5