Google Pagerank: how Google orders your webpages Dan Teague NCSSM


Google Pagerank: how Google orders your webpages

Dan Teague, NCSSM


The Problem

• Imagine a library containing 40 billion documents but with no centralized organization and no librarians.

• In addition, anyone may add a document at any time without telling anyone.

• If one of these documents is vitally important to you, how could you find it?


Why This Order?


Google Pagerank System

Google was developed by Sergey Brin and Larry Page.

This is the method that Larry Page developed to rank and order the pages.

Hence, the Pagerank.


Larry Page (new CEO of Google)

Co-founder Larry Page once described the “perfect search engine” as something that “understands exactly what you mean and gives you back exactly what you want.”


Eagle Ray at Eden Rock


How would you order these sites?

• Suppose each of the nodes at right has the links shown in the directed graph. Which node is most important and should appear first?


The Basic Idea

• PageRank is a numeric value that represents how important a page is on the web. Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is.

http://www.webworkshop.net/pagerank.html


Bucket Brigade Matrix


Outdegree

Matrix H
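As a minimal sketch of how such a matrix can be built, the code below fills H from a list of links, dividing each page's vote evenly by its outdegree. The 4-page graph and the names `links` and `build_h` are hypothetical, and the column convention (column j holds the outgoing weights of page j) is an assumption chosen to agree with the equation HX = X used in this deck.

```python
# Hypothetical 4-page web: links[j] lists the pages that page j links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}


def build_h(links):
    """Return H as a list of rows, with H[i][j] = 1/outdegree(j) if j links to i."""
    n = len(links)
    h = [[0.0] * n for _ in range(n)]
    for j, targets in links.items():
        for i in targets:
            # Page j splits its single vote evenly among the pages it links to.
            h[i][j] = 1.0 / len(targets)
    return h


H = build_h(links)
for row in H:
    print(row)
```

Every column of this small H sums to 1 because every page has at least one outgoing link.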


Markov Chain

We would like to think of this matrix as a transition matrix (like a Markov chain). If we move around on the graph at random, at which nodes will we spend most of our time?

These most important nodes can be found in a Markov chain by considering the powers of H.
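The idea of taking powers can be tried on a small case. The sketch below uses a hypothetical 3-page web (not the graph from the slides) and plain Python lists; `mat_mul` and `mat_power` are illustrative helper names. For this well-behaved example every column of a high power of H settles to the same vector, which gives the long-run share of time spent at each node.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, k):
    """Return m raised to the k-th power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


# Hypothetical web: page 0 links to 1 and 2, page 1 links to 0 and 2,
# page 2 links to 0. H[i][j] is the probability of moving from j to i.
H = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0]]

P = mat_power(H, 50)
for row in P:
    print([round(v, 4) for v in row])
```

Each column of the printed matrix is close to (4/9, 2/9, 3/9), so page 0 is visited most often in this example.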


• Or we can look for solutions to HX = X. This means we want the eigenvector X associated with the eigenvalue of 1.

• This is why the Pagerank is known as the $25,000,000,000 eigenvector.
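A small worked case may make the eigenvector condition concrete. The 3-page matrix below is a hypothetical example (not the graph from the slides), with column j holding the outgoing link weights of page j.

```latex
% Hypothetical 3-page web: page 0 links to 1 and 2, page 1 links to 0 and 2,
% page 2 links to 0.
\[
H = \begin{pmatrix} 0 & \tfrac12 & 1 \\ \tfrac12 & 0 & 0 \\ \tfrac12 & \tfrac12 & 0 \end{pmatrix},
\qquad
HX = X \iff
\begin{cases}
x_0 = \tfrac12 x_1 + x_2 \\
x_1 = \tfrac12 x_0 \\
x_2 = \tfrac12 x_0 + \tfrac12 x_1
\end{cases}
\]
% Substituting x_1 = x_0/2 gives x_2 = 3x_0/4; then normalize x_0 + x_1 + x_2 = 1.
\[
X = \left(\tfrac49,\ \tfrac29,\ \tfrac39\right)
\]
```

Here the eigenvector ranks page 0 first, then page 2, then page 1.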


Consider powers of H


Where did all the Importance go?


Things that go wrong:


Dangling Nodes


Dangling Node


Cycles


Dangling Subgraphs


Graph not strongly connected


Powers of Hs


States 4-7 Disappear


How Do We Handle These Problems?

• The Dangling Node

• The Cycle

• The Sub-graph Sink


The Dangling Node

• The Dangling Node we handle by requiring a transition to another node at random.

• Pick a node, move there, and then move forward.


We alter our Bucket Brigade matrix by adding in matrix A.
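One way to read this step: A is zero everywhere except in the columns of dangling nodes, where every entry is 1/n, so a surfer stuck on a dangling node moves to a page chosen uniformly at random. The sketch below assumes that reading and uses a hypothetical 3-page web; `dangling_fix` is an illustrative name.

```python
def dangling_fix(h):
    """Return A: columns of H that are all zero become uniform columns of 1/n."""
    n = len(h)
    a = [[0.0] * n for _ in range(n)]
    for j in range(n):
        if all(h[i][j] == 0.0 for i in range(n)):
            # Page j has no outgoing links, so jump anywhere with equal chance.
            for i in range(n):
                a[i][j] = 1.0 / n
    return a


# Hypothetical web: page 0 links to 1, page 1 links to 0 and 2,
# page 2 has no outgoing links (a dangling node).
H = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.5, 0.0]]

A = dangling_fix(H)
HA = [[H[i][j] + A[i][j] for j in range(3)] for i in range(3)]
for row in HA:
    print([round(v, 3) for v in row])
```

After the fix, every column of H + A sums to 1, so it can serve as a transition matrix.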


Matrix H + A


What About the Other Problems?

Dangling Nodes are easy to find. Cycles and Sub-graph sinks are more difficult and time-consuming to find. Pagerank handles these problems without actually finding them.

• The Cycle

• The Sub-graph Sink


Probabilistic Movement

• Roll a die.

• If anything but a 6 shows, then follow the web, that is, use our matrix (H + A).

• However, if you roll a 6, then pick a page at random and go there.

• This gives us an out when we are trapped either by a cycle or by a sub-graph sink.
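The die-rolling rule can be simulated directly. The sketch below is an illustration under stated assumptions: the 4-page web is hypothetical, pages 0 and 1 form a cycle that traps the surfer, page 3 is a dangling node, and `random_surfer` is an illustrative name. A fixed seed keeps the run repeatable.

```python
import random


def random_surfer(links, steps, follow_prob=5 / 6, seed=0):
    """Simulate the surfer and return the share of steps spent on each page."""
    rng = random.Random(seed)
    n = len(links)
    visits = [0] * n
    page = 0
    for _ in range(steps):
        visits[page] += 1
        targets = links[page]
        if targets and rng.random() < follow_prob:
            # Anything but a 6: follow one of the page's links.
            page = rng.choice(targets)
        else:
            # Rolled a 6, or the page is dangling: jump to a random page.
            page = rng.randrange(n)
    return [v / steps for v in visits]


# Hypothetical web: 0 and 1 link only to each other, 2 links to 0 and 3,
# 3 has no outgoing links.
links = [[1], [0], [0, 3], []]

shares = random_surfer(links, 100000)
trapped = random_surfer(links, 1000, follow_prob=1.0)
print([round(s, 3) for s in shares])
print(trapped)
```

With `follow_prob=1.0` the surfer never leaves the cycle between pages 0 and 1, while the die roll lets it reach every page.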


How Often Should We Look for an Escape?

• Would it be better to roll a 20-sided die or flip a coin?


How do you implement the coin flip?

• Create a matrix all of whose entries are 1.

This is the One matrix. If we multiply this matrix by 1/n, where n is the number of nodes in the graph (in our example 11, in reality 40 billion), then we have an equal chance of traveling from any point to any other point. We pretend that the web is a complete graph.


Roll the die

• We will use the Web-ordered matrix H+A with probability p and the One matrix with probability (1-p).

• What’s a good value for p?
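The mixing rule in the first bullet can be sketched entry by entry: each entry of the combined matrix is p times the web-ordered entry plus (1-p)/n. The 3-page matrix and the name `google_matrix` below are hypothetical, and p = 0.85 is used only as a sample value.

```python
def google_matrix(h_plus_a, p):
    """Mix H + A (weight p) with the One matrix scaled by 1/n (weight 1 - p)."""
    n = len(h_plus_a)
    return [[p * h_plus_a[i][j] + (1.0 - p) / n for j in range(n)]
            for i in range(n)]


# Hypothetical H + A for a 3-page web whose last page was a dangling node.
HA = [[0.0, 0.5, 1.0 / 3.0],
      [1.0, 0.0, 1.0 / 3.0],
      [0.0, 0.5, 1.0 / 3.0]]

G = google_matrix(HA, 0.85)
for row in G:
    print([round(v, 4) for v in row])
```

As an inference rather than something the slides state, the entries 0.05, 0.2, 0.275, and 0.5 in the deck's 11-node matrix are consistent with p = 0.45, since 0.55/11 = 0.05 and 0.45/2 + 0.05 = 0.275.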


The Basic Google Equation

G =

        0      1      2      3      4      5      6      7      8
  0   0.05   0.05   0.5    0.05   0.2    0.05   0.05   0.05   0.091
  1   0.2    0.05   0.05   0.05   0.2    0.05   0.05   0.05   0.091
  2   0.05   0.05   0.05   0.2    0.05   0.05   0.05   0.05   0.091
  3   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.091
  4   0.2    0.275  0.05   0.05   0.05   0.05   0.05   0.05   0.091
  5   0.05   0.275  0.05   0.2    0.2    0.05   0.05   0.05   0.091
  6   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.5    0.091
  7   0.05   0.05   0.05   0.2    0.05   0.5    0.275  0.05   0.091
  8   0.2    0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.091
  9   0.05   0.05   0.05   0.05   0.05   0.05   0.275  0.05   0.091
 10   0.05   0.05   0.05   0.05   0.05   0.05   0.05   0.05   ...


G = p(H + A) + (1-p)One (1/n)

We know that (H + A) and One(1/n) are both transition matrices of Markov chains. Is G one also?

So, powers of G should tell us what we want to know.
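The question can be checked with a short calculation, assuming (as the matrices in this deck suggest) that each column of H + A and of One(1/n) sums to 1 and that 0 ≤ p ≤ 1.

```latex
% Column sums of G, for any column j:
\[
\sum_{i=1}^{n} G_{ij}
  = p \sum_{i=1}^{n} (H + A)_{ij} + (1 - p) \sum_{i=1}^{n} \frac{1}{n}
  = p \cdot 1 + (1 - p) \cdot 1
  = 1 .
\]
% Entries are nonnegative because both terms are nonnegative:
\[
G_{ij} = p\,(H + A)_{ij} + \frac{1 - p}{n} \;\ge\; 0 .
\]
```

So G is again a transition matrix, and for p < 1 every entry is strictly positive, which is what removes the cycle and sub-graph sink problems.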


G = p(H + A) + (1-p)One (1/n)

• But computing powers of G is an incredibly inefficient way to go in the “real world” of the web.

• Instead, the iterative method is employed.


Iterating X_{n+1} = G X_n
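A minimal sketch of this iteration follows. It never builds G explicitly: it works from the link lists, spreads the mass sitting on dangling nodes evenly, and then applies the p and (1-p) mix. The 5-page web, the name `pagerank`, and the defaults p = 0.85 and 50 iterations are assumptions for illustration, not the deck's 11-node example.

```python
def pagerank(links, p=0.85, iterations=50):
    """Iterate X_{n+1} = G X_n, starting from the uniform vector."""
    n = len(links)
    x = [1.0 / n] * n
    for _ in range(iterations):
        new_x = [0.0] * n
        dangling_mass = 0.0
        for j, targets in enumerate(links):
            if targets:
                # Column j of H: page j shares its rank among its targets.
                share = x[j] / len(targets)
                for i in targets:
                    new_x[i] += share
            else:
                # Column j of A: a dangling page spreads its rank over all pages.
                dangling_mass += x[j]
        # The entries of x sum to 1, so One(1/n) x contributes 1/n to every entry.
        x = [p * (v + dangling_mass / n) + (1.0 - p) / n for v in new_x]
    return x


# Hypothetical web: links[j] lists the pages that page j links to.
links = [[1, 2], [2], [0], [0, 2], []]

ranks = pagerank(links)
order = sorted(range(len(links)), key=lambda i: ranks[i], reverse=True)
print([round(r, 4) for r in ranks])
print(order)
```

Each pass costs time proportional to the number of links rather than n squared, which is the practical reason to iterate instead of taking powers of G.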


The Pagerank order is 8-7-11-10-9-6-5-1-2-3-4


What about p?

• What role does p play and what value is actually used?


p determines the rate of convergence


p = 0.95 has not yet converged
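The effect of p on convergence can be seen on a deliberately bad case. The sketch below uses a hypothetical 3-page cycle (0 to 1 to 2 to 0), where plain link-following never settles, and counts how many iterations are needed before successive vectors differ by less than a tolerance. The name `iterations_to_converge`, the tolerance, and the graph are assumptions; the code also assumes no dangling nodes.

```python
def iterations_to_converge(links, p, tol=1e-8, max_iter=10000):
    """Count iterations until the L1 change between successive vectors is below tol.

    Assumes every page has at least one outgoing link.
    """
    n = len(links)
    x = [1.0] + [0.0] * (n - 1)  # start with all the weight on page 0
    for k in range(1, max_iter + 1):
        new_x = [(1.0 - p) / n] * n
        for j, targets in enumerate(links):
            for i in targets:
                new_x[i] += p * x[j] / len(targets)
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            return k
    return max_iter


# Hypothetical 3-page cycle: 0 -> 1 -> 2 -> 0.
cycle = [[1], [2], [0]]

counts = {p: iterations_to_converge(cycle, p) for p in (0.5, 0.85, 0.95)}
print(counts)
```

On this example the iteration count grows as p approaches 1, since a larger p means the escape jump is taken less often.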


Google Pagerank

• Google claims that it uses p = 0.85 (so the roll of the die, with p = 5/6 ≈ 0.83, is just about right) and about 50 iterations with the matrix G, where

G = p(H + A) + (1-p)One (1/n).

It recomputes every month.
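A rough calculation suggests why about 50 iterations can be enough. It relies on a standard result that is not stated in the slides: the second-largest eigenvalue of G has modulus at most p, so the error after k iterations shrinks at least as fast as p^k, up to a constant.

```latex
% Size of the decay factor p^k after k = 50 iterations:
\[
0.85^{50} = e^{50 \ln 0.85} \approx e^{-8.13} \approx 3 \times 10^{-4},
\qquad
0.95^{50} = e^{50 \ln 0.95} \approx e^{-2.56} \approx 0.08 .
\]
```

With p = 0.85 the factor is already small after 50 steps, while with p = 0.95 it is still several percent.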


Convergence?
