15-396 Science of teh Interwebs


Web Search II

Lecture 13 (October 14, 2008)

What Does the Web Look Like?

Can Think of the Web as a Directed Graph

What is a Node?

There is an “infinite” number of pages in Google alone

Spider Traps

http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
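One way a crawler might guard against URLs like the one above is a simple path heuristic. This is only a sketch: the function name `looks_like_spider_trap` and the thresholds `max_depth` and `max_repeats` are assumptions made for illustration, not something the lecture specifies.

```python
from urllib.parse import urlparse


def looks_like_spider_trap(url, max_depth=8, max_repeats=2):
    """Heuristic: flag URLs whose path is very deep or keeps repeating a segment.

    max_depth and max_repeats are illustrative guesses, not tuned values.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # A segment that shows up many times suggests an endlessly nested path.
    return any(segments.count(s) > max_repeats for s in set(segments))


print(looks_like_spider_trap("http://foo.com/bar/foo/bar/foo/bar/foo/bar/"))  # True
print(looks_like_spider_trap("http://foo.com/bar/foo"))                       # False
```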

Modern Search Engines Focus on Relatively Stable Pages

What Does the Web Look Like?

A strongly connected component (SCC) in a directed graph is a subset of the nodes such that every node in the subset has a path to every other node in the subset
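The definition can be checked directly on a small graph. The sketch below assumes the graph is stored as a dictionary from each node to the list of nodes it links to, and it returns the maximal subsets that satisfy the definition. It applies the definition literally (mutual reachability), which is fine for toy graphs but far too slow for the real web; the function names are made up for this example.

```python
def reachable(graph, start):
    """Return every node that has a path from start (including start itself)."""
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def strongly_connected_components(graph):
    """Group nodes so that each group is a maximal set of mutually reachable nodes."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    reach = {n: reachable(graph, n) for n in nodes}
    components, assigned = [], set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        # m belongs with n when n reaches m and m reaches n.
        component = {m for m in reach[n] if n in reach[m]}
        components.append(component)
        assigned |= component
    return components


# Toy graph: a -> b -> c -> a forms a cycle, and c also links out to d.
toy_web = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
print(strongly_connected_components(toy_web))
```

Here a, b, and c end up in one component, while d sits alone because nothing leads back from it.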

[Figure: the web graph split into regions of 56 million, 44 million, 44 million, and 44 million nodes. Data from 1999.]

How Should We Use Rankings in Search?

1. Collect all pages that are relevant through text-only techniques: the query occurs in the title of the page, the query occurs in the page itself, etc.

2. Sort the results by, e.g., global PageRank

Problem: If Yahoo! contains the text “flower”, it will be one of the first few results for the query
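A minimal sketch of this two-step approach, which also shows the problem. The page contents, the `.example` sites, and the PageRank numbers are all made up for illustration, and `naive_search` is a hypothetical name.

```python
def naive_search(query, pages, pagerank):
    """Step 1: keep pages whose title or text contains the query.
    Step 2: sort the survivors by a precomputed global PageRank score."""
    q = query.lower()
    relevant = [
        url for url, page in pages.items()
        if q in page["title"].lower() or q in page["text"].lower()
    ]
    return sorted(relevant, key=lambda url: pagerank[url], reverse=True)


# Illustrative toy data; the scores are invented, not real PageRank values.
pages = {
    "yahoo.com": {"title": "Yahoo!", "text": "news, mail, flower delivery ads"},
    "flowers.example": {"title": "Flower Care Guide", "text": "how to grow roses"},
    "cars.example": {"title": "Used Cars", "text": "sedans and trucks"},
}
pagerank = {"yahoo.com": 0.9, "flowers.example": 0.01, "cars.example": 0.02}

print(naive_search("flower", pages, pagerank))  # ['yahoo.com', 'flowers.example']
```

The globally popular page wins even though the page that is actually about flowers is the better answer.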

Naïve Approach

Forget about PageRank for a Second…

1. Collect all pages that are relevant through text-only techniques: the query occurs in the title of the page, the query occurs in the page itself, etc.

2. Let pages in this sample “vote” through links

Problem: Super popular pages like Yahoo! still pose problems
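Voting within the sample could look roughly like this. The sketch assumes the relevant pages have already been collected by the text-only step and that `links` maps each page to the pages it links to; counting one vote per distinct link and ignoring self-links are choices made here, not something the slides state.

```python
def rank_by_votes(relevant, links):
    """Rank relevant pages by the number of links they receive from other relevant pages."""
    relevant = set(relevant)
    votes = {page: 0 for page in relevant}
    for src in relevant:
        for dst in set(links.get(src, [])):
            # Only links between two different pages in the sample count as votes.
            if dst in relevant and dst != src:
                votes[dst] += 1
    # Most votes first; ties broken alphabetically to keep the order stable.
    return sorted(votes, key=lambda page: (-votes[page], page))


links = {"a": ["c"], "b": ["c", "a"], "c": [], "x": ["a"]}
print(rank_by_votes(["a", "b", "c"], links))  # ['c', 'a', 'b']
```

The link from x is ignored because x is not in the relevant sample.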

Lists

Some pages are “lists” of things

A page’s value as a list = sum of votes received by all pages that it voted for
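As a small sketch of this formula, assuming `links` maps each page to the pages it votes for (the page names are invented):

```python
def list_values(links):
    """For each page, sum the votes received by all the pages it voted for."""
    votes = {}
    for targets in links.values():
        for dst in targets:
            votes[dst] = votes.get(dst, 0) + 1
    return {
        src: sum(votes.get(dst, 0) for dst in targets)
        for src, targets in links.items()
    }


links = {"list1": ["a", "b", "c"], "list2": ["a", "b"], "blog": ["c"]}
print(list_values(links))  # {'list1': 6, 'list2': 4, 'blog': 2}
```

Each of a, b, and c receives two votes, so the page that lists all three gets the highest value.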

Hubs and Authorities: A Precursor of PageRank

Hubs = high-value lists for the query

Authorities = highly endorsed answers to the query

To each page p, we assign two values: hub(p) and auth(p)

Start: for all p, hub(p) = 1, auth(p) = 1

Authority Update Rule: For each page p, update auth(p) to be the sum of the hub scores of all pages that point to it

Hub Update Rule: For each page p, update hub(p) to be the sum of the authority scores of all pages that it points to

Repeat k times:

Apply Authority Update Rule

Apply Hub Update Rule

To keep the numbers from growing without bound, always normalize

This process converges!
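Putting the start values, the two update rules, and normalization together gives roughly the following. The slides do not say how to normalize, so dividing each set of scores by its sum is an assumption here, as are the default `k=20` and the function names.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (assumed normalization; left as-is if all zero)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {page: value / total for page, value in scores.items()}


def hubs_and_authorities(links, k=20):
    """links maps each page to the pages it points to. Returns (hub, auth)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Start: for all p, hub(p) = 1, auth(p) = 1
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority Update Rule: sum of hub scores of all pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub Update Rule: sum of authority scores of all pages p points to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize so the numbers do not grow without bound.
        auth, hub = normalize(auth), normalize(hub)
    return hub, auth


hub, auth = hubs_and_authorities({"h1": ["a1", "a2"], "h2": ["a1"]})
print(sorted(auth, key=auth.get, reverse=True)[:2])  # ['a1', 'a2']
```

In this toy graph a1 is endorsed by both hubs, so it ends up with the highest authority score, and h1 ends up as the better hub because it points to both authorities.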

Combining Anchor Text

A great newspaper

Check out this picture

Which link is better for the query “newspaper”?

How do we incorporate this information into PageRank or “Hubs and Authorities”?

We can multiply link contributions by a factor that indicates the quality of the link’s anchor text for the query
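One possible form of such a factor, sketched for the authority update. The rule used here (a fixed `boost` when the query appears in the anchor text, 1 otherwise), the value 2.0, and the `.example` sites are all assumptions for illustration; the lecture does not prescribe a particular weighting.

```python
def weighted_authority(links, hub, query, boost=2.0):
    """links is a list of (source, target, anchor_text) triples.

    Each link contributes the source's hub score times a quality factor:
    boost if the anchor text mentions the query, otherwise 1.
    """
    q = query.lower()
    auth = {}
    for src, dst, anchor in links:
        factor = boost if q in anchor.lower() else 1.0
        auth[dst] = auth.get(dst, 0.0) + factor * hub.get(src, 1.0)
    return auth


links = [
    ("blog", "times.example", "A great newspaper"),
    ("blog", "photos.example", "Check out this picture"),
]
print(weighted_authority(links, {"blog": 1.0}, "newspaper"))
# {'times.example': 2.0, 'photos.example': 1.0}
```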

Impact Factor of Scientific Journals

Nature

Science

New England Journal of Medicine

Cell

PNAS

Journal of Biological Chemistry

JAMA

The Lancet

Nature Genetics

Nature Medicine

Supreme Court Cases

g2g

ttyl