A talk I gave at the annual meeting for the MetroNY section of the MAA about how Google works from a link-ranking perspective. (http://sections.maa.org/metrony/) Based on a talk by Margot Gerritsen (which used elements from another talk I gave years ago, yay co-author improvements!)
How Does Google Google?
David F. Gleich
Computer Science
Purdue University
A journey into the wondrous mathematics behind your favorite websites
1
Mathematics underlies an enormous number of the websites we use every day!
2
1. Google's PageRank
2. Multi-armed bandits and internet experiments
3
4
Larry Page and Sergey Brin
• Created a web-search algorithm called “backrub”
• Spun off a company “Googol” based on the paper
• The importance of a page is determined by the importance of pages that link to it.
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. “The PageRank Citation Ranking: Bringing Order to the Web.” Technical Report, Stanford InfoLab, 1999.
5
A websearch primer
1. Crawl webpages
2. Analyze webpage text (information retrieval)
3. Analyze webpage links
4. Fit over 200 measures to human evaluations
5. Produce rankings
6. Continuously update
6
Pages, nodes, incoming links, outgoing links, and “importance”
7
“Important” pages that link to me!
[Figure: pages a, b, and c, each linking to one target page]
“Important” pages that link to Purdue!
8
Tim Davis and Yifan Hu Sparse Matrix Gallery
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
1000 vertices on 8.5-by-11 paper
1,000,000,000,000 vertices (one trillion): paper the size of Manhattan island (23 sq miles)?
The web
10
We need something better!
11
A wee web-graph: link counting is too easy to game!
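[Figure: a six-node web graph. Node 1 links to nodes 2, 3, and 4 (weight 1/3 each); node 2 links to nodes 3 and 6 (weight 1/2 each); node 3 links to node 4; nodes 4 and 5 link to each other; node 6 has no outgoing links.]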
12
A wee web-graph: link counting is too easy to game!
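[Figure: the six-node wee web-graph, as on the previous slide]
The importance of a page is determined by the importance of pages that link to it.
$$\begin{aligned}
x_1 &= 0 \\
x_2 &= \tfrac{1}{3} x_1 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 \\
x_4 &= \tfrac{1}{3} x_1 + x_3 + x_5 \\
x_5 &= x_4 \\
x_6 &= \tfrac{1}{2} x_2
\end{aligned}$$
13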
The importance of a page is determined by the importance of pages that link to it
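$$x_i = \sum_{j \in B_i} \frac{1}{d_j} x_j$$
• $B_i$: the “back-links” of page i, that is, the pages j that link to page i. Why it was called Backrub!
• $x_i$: “Importance” of page i
• $x_j$: “Importance” of page j
• $d_j$: Number of links page j uses (out-degree in graph theory)
Example: nodes 1 and 2 link to node 3 with weights 1/3 and 1/2, so
$$x_3 = \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2$$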
14
We can rewrite this equation in a more mathematically convenient way
$$\begin{aligned}
x_1 &= 0\,x_1 + 0\,x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6 \\
x_2 &= \tfrac{1}{3}x_1 + 0\,x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6 \\
x_3 &= \tfrac{1}{3}x_1 + \tfrac{1}{2}x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6 \\
x_4 &= \tfrac{1}{3}x_1 + 0\,x_2 + 1\,x_3 + 0\,x_4 + 1\,x_5 + 0\,x_6 \\
x_5 &= 0\,x_1 + 0\,x_2 + 0\,x_3 + 1\,x_4 + 0\,x_5 + 0\,x_6 \\
x_6 &= 0\,x_1 + \tfrac{1}{2}x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6
\end{aligned}$$
15
or
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}$$
And even more conveniently!
$$x = Px$$
Element k in column m = "probability" of going from node m to node k
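The matrix P above can be assembled mechanically from the link structure. The following is a minimal sketch, assuming NumPy is available; the `links` dictionary and the function name `build_transition_matrix` are illustrative names, not part of the talk, and the link lists are read off the nonzero entries of the matrix on this slide.

```python
import numpy as np

# Outgoing links of the six-node wee web-graph, read off the matrix P.
# Node labels are 1-based, as on the slides (assumed encoding).
links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}

def build_transition_matrix(links):
    """Return P with P[k, m] = 1/d_m when node m+1 links to node k+1, else 0."""
    n = len(links)
    P = np.zeros((n, n))
    for source, targets in links.items():
        for target in targets:
            # Column = source node, row = target node; weight = 1 / out-degree.
            P[target - 1, source - 1] = 1.0 / len(targets)
    return P

P = build_transition_matrix(links)
print(P)
# Column sums are 1 for nodes with outgoing links and 0 for node 6.
print(P.sum(axis=0))
```

Storing the source node in the column matches the slide's convention that element k in column m is the "probability" of going from node m to node k.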
16
The matrix P for websites shows a lot of structure
Every dot is a non-zero element indicating a link.
Matrices are sparse, and generally have block structure.
Block structure can be exploited to speed up the ranking algorithm.
17
But this idea doesn’t work for the wee web-graph
[Figure: the six-node wee web-graph]
Nodes 1, 4 and 5 determine everything!
$$\begin{aligned}
x_1 &= 0 \\
x_2 &= \tfrac{1}{3} x_1 = 0 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 = 0 \\
x_4 &= \tfrac{1}{3} x_1 + x_3 + x_5 = x_5 \\
x_5 &= x_4 \\
x_6 &= \tfrac{1}{2} x_2 = 0
\end{aligned}$$
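The degenerate outcome above can be checked numerically. This is a small sketch, assuming NumPy; the matrix is the P from the earlier slide, and the candidate vector (all importance split between nodes 4 and 5) is chosen here only for illustration.

```python
import numpy as np

# P for the unmodified wee web-graph (column m = links out of node m).
P = np.array([
    [0,   0,   0, 0, 0, 0],
    [1/3, 0,   0, 0, 0, 0],
    [1/3, 1/2, 0, 0, 0, 0],
    [1/3, 0,   1, 0, 1, 0],
    [0,   0,   0, 1, 0, 0],
    [0,   1/2, 0, 0, 0, 0],
])

# Candidate solution: nodes 4 and 5 share everything, the rest get zero.
x = np.array([0, 0, 0, 0.5, 0.5, 0])
satisfies_equation = np.allclose(P @ x, x)

# rank(I - P) = 5 means the solutions of x = Px form a one-dimensional
# family, so every solution is a multiple of the vector above.
rank = np.linalg.matrix_rank(np.eye(6) - P)

print(satisfies_equation, rank)
```

Since every solution is a multiple of that vector, nodes 1, 2, 3, and 6 always receive zero importance, which is the problem the following slides set out to fix.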
18
But this idea doesn’t work for the wee web-graph
[Figure: the six-node wee web-graph]
• Node 1: “lonely”
• Nodes 4 and 5: “mutual admiration societies”
• Node 6: “anti-social”
These nodes need to be “fixed” to get a reliable and useful ranking!
19
The gang of four to the rescue
Andrei Markov
Oskar Perron
Georg Frobenius
Richard von Mises
20
Let’s fix it up and force node 6 to choose, or link to everyone
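[Figure: the six-node wee web-graph]
Before the fix, column 6 of P is all zeros because node 6 has no outgoing links:
$$P = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}$$
After the fix, node 6 links to every node with probability 1/6:
$$P = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}$$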
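The fix for node 6 generalizes to any page without outgoing links. The sketch below assumes NumPy; `fix_dangling_columns` is a hypothetical helper name, and replacing each all-zero column with a uniform column is the rule the slide applies to node 6.

```python
import numpy as np

def fix_dangling_columns(P):
    """Replace every all-zero column of P with the uniform column 1/n."""
    P = P.copy()
    n = P.shape[0]
    dangling = np.isclose(P.sum(axis=0), 0.0)  # nodes with no out-links
    P[:, dangling] = 1.0 / n
    return P

# P for the unmodified wee web-graph; node 6 (last column) has no out-links.
P = np.array([
    [0,   0,   0, 0, 0, 0],
    [1/3, 0,   0, 0, 0, 0],
    [1/3, 1/2, 0, 0, 0, 0],
    [1/3, 0,   1, 0, 1, 0],
    [0,   0,   0, 1, 0, 0],
    [0,   1/2, 0, 0, 0, 0],
])

P_fixed = fix_dangling_columns(P)
print(P_fixed[:, 5])         # column for node 6 is now 1/6 everywhere
print(P_fixed.sum(axis=0))   # every column now sums to 1
```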
21
Taxation is the way to representation!
[Figure: pages a, b, and c linking to one page]
If a page is a good page, then it’ll still be a good page if we “tax” the importance from a, b, and c.
We can redistribute the taxed amounts to all pages, including lonely nodes!
22
The importance of a page is determined by the importance of pages that link to it*
* After tax and any benefits
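$$x_i = \sum_{j \in B_i} \alpha \frac{x_j}{d_j} + (1 - \alpha) b_i$$
• $\alpha x_j / d_j$: The total importance that page j contributes to page i
• $b_i$: Benefits to page i
• $\alpha$: The taxation rate of all (each page passes on a fraction $\alpha$ of its importance; the remaining $1 - \alpha$ is redistributed as benefits)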
23
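$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
= \alpha
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
+ (1 - \alpha)
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix}$$
Perron and Frobenius showed the new equation always has a unique solution
$$x = \alpha P x + (1 - \alpha) b$$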
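Because the equation is linear, one way to find the unique solution is to rearrange it as (I − αP)x = (1 − α)b and hand it to a linear solver. The sketch below assumes NumPy, and it assumes α = 0.85 and a uniform benefit vector b = 1/6; neither value is stated on this slide, so treat both as illustrative choices.

```python
import numpy as np

# The fixed-up P from the previous slides (node 6 links to everyone).
P = np.array([
    [0,   0,   0, 0, 0, 1/6],
    [1/3, 0,   0, 0, 0, 1/6],
    [1/3, 1/2, 0, 0, 0, 1/6],
    [1/3, 0,   1, 0, 1, 1/6],
    [0,   0,   0, 1, 0, 1/6],
    [0,   1/2, 0, 0, 0, 1/6],
])

alpha = 0.85             # assumed tax parameter, not given on the slide
b = np.full(6, 1 / 6)    # assumed uniform benefits, not given on the slide

# x = alpha P x + (1 - alpha) b   <=>   (I - alpha P) x = (1 - alpha) b
x = np.linalg.solve(np.eye(6) - alpha * P, (1 - alpha) * b)

print(np.round(x, 2))
print(x.sum())  # the importances sum to 1 when b sums to 1
```

A direct solve is practical for six pages; for a web-sized graph the sparse, iterative approach is the realistic option.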
24
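[Figure: the six-node wee web-graph]
What von Mises and Richardson showed is that guess, check, and correct works!
$$x^{(\text{new})} = \alpha P x^{(\text{old})} + (1 - \alpha) b$$
$$x^{(\text{start})} = \begin{bmatrix} 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \end{bmatrix},\quad
x^{(1)} = \begin{bmatrix} 0.05 \\ 0.10 \\ 0.17 \\ 0.38 \\ 0.19 \\ 0.12 \end{bmatrix},\quad
x^{(2)} = \begin{bmatrix} 0.04 \\ 0.06 \\ 0.10 \\ 0.36 \\ 0.36 \\ 0.08 \end{bmatrix},\quad
x^{(\infty)} = \begin{bmatrix} 0.03 \\ 0.04 \\ 0.06 \\ 0.43 \\ 0.39 \\ 0.05 \end{bmatrix}$$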
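The guess, check, and correct loop is short enough to write out. This is a minimal sketch, assuming NumPy, with α = 0.85 and uniform b = 1/6 as assumptions; the slide does not state them, but these values reproduce the iterates shown above to two decimals. The function names and the stopping tolerance are illustrative.

```python
import numpy as np

# The fixed-up P (node 6 links to everyone with probability 1/6).
P = np.array([
    [0,   0,   0, 0, 0, 1/6],
    [1/3, 0,   0, 0, 0, 1/6],
    [1/3, 1/2, 0, 0, 0, 1/6],
    [1/3, 0,   1, 0, 1, 1/6],
    [0,   0,   0, 1, 0, 1/6],
    [0,   1/2, 0, 0, 0, 1/6],
])
alpha = 0.85             # assumed; matches the slide's numbers
b = np.full(6, 1 / 6)    # assumed uniform benefits

def pagerank_step(P, x_old, b, alpha):
    """One correction: x_new = alpha * P x_old + (1 - alpha) * b."""
    return alpha * (P @ x_old) + (1 - alpha) * b

def pagerank_iterate(P, b, alpha, tol=1e-10, max_iter=1000):
    """Repeat the step from a uniform guess until the change is below tol."""
    x = np.full(len(b), 1.0 / len(b))
    for _ in range(max_iter):
        x_new = pagerank_step(P, x, b, alpha)
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

x_start = np.full(6, 1 / 6)
x1 = pagerank_step(P, x_start, b, alpha)
x2 = pagerank_step(P, x1, b, alpha)
x_final = pagerank_iterate(P, b, alpha)
print(np.round(x1, 2), np.round(x2, 2), np.round(x_final, 2))
```

Each step only needs a matrix-vector product, which is why the method scales to sparse matrices far larger than this example.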
25
26
There’s still a lot of work left to do to make a search engine
• Make it fast!
• Watch out for spam
• Watch out for manipulation
• Personalize
• Experiment!
27
1. Google's PageRank
2. Multi-armed bandits and internet experiments
28
http://adamlofting.com/736/drawn-multi-armed-bandit-experiments/multi-armed-bandit/
Not this!
29
http://upload.wikimedia.org/wikipedia/en/8/82/Las_Vegas_slot_machines.jpg
This!
Pays out $0.92/dollar
Pays out $0.98/dollar
Pays out $0.95/dollar
Pays out $0.99/dollar
30
What in the heck does a multi-armed bandit have to do with Google?
31
What in the heck does a multi-armed bandit have to do with Google?
Pays out $0.92/view
Pays out $0.66/view
Pays out $0.91/view to show ads
Pays out -$0.02/view: hide ads
32
How to optimize your website without exploiting the bandits
• Try condition A 100 times, find 45 “wins”
• Try condition B 100 times, find 85 “wins”
• Try condition C 100 times, find 10 “wins”
• …
Choose the best!
33
This field has some of the best terminology
• Explore
• Exploit
• Regret
34
This field has some of the best terminology
• Explore: Visiting Las Vegas!
• Exploit: Your new winning strategy!
• Regret: That you didn’t quit after winning the first round
35
This field has some of the best terminology
• Explore: Testing slot machines/experiments for their reward
• Exploit: Playing the best reward you’ve found so far
• Regret: How much you lost due to exploration
36
How to optimize your website without exploiting the bandits
• Try condition A 100 times, find 45 “wins”
• Try condition B 100 times, find 85 “wins”
• Try condition C 100 times, find 10 “wins”
• …
Choose the best!
Pure exploration! We only exploit our findings at the end!
37
How to optimize your website exploiting the bandits
Pure exploration!
• Try condition A 5 times, find 4 wins
• Try condition B 5 times, find 4 wins
• Try condition C 5 times, find 2 wins
Exploit our knowledge
• Try condition A 7 times, find 3 wins
• Try condition B 7 times, find 5 wins
• Try condition C 1 time, find 0 wins

Condition     A     B     C
Est. Return   0.58  0.75  0.33
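The estimated returns in the table are the pooled win rates over both rounds. A short sketch of that arithmetic follows; the dictionary layout and variable names are illustrative, and the counts are the ones listed on this slide.

```python
# (plays, wins) per round for each condition, copied from the slide.
rounds = {
    "A": [(5, 4), (7, 3)],
    "B": [(5, 4), (7, 5)],
    "C": [(5, 2), (1, 0)],
}

estimated_return = {}
for condition, results in rounds.items():
    plays = sum(n for n, _ in results)
    wins = sum(w for _, w in results)
    # Pooled estimate: total wins divided by total plays.
    estimated_return[condition] = wins / plays

for condition, value in estimated_return.items():
    print(condition, round(value, 2))
```

Note how the second round spends most of its plays on A and B, which looked best after the first round, and only one play on C.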
38
The goal of these problems is to construct optimal strategies to minimize regret
Regret: how much you left “on the table” by exploring
$$\text{regret} = E[\text{play best always} - \text{plays made based on data}]$$
A zero-regret strategy is one where regret(T trials) is sublinear in T as the number of plays T → ∞
regret, 100-each: 255/300 − 140/300 = 0.38
regret, 30-mixed: 25.5/30 − (0.45 × 12 + 0.85 × 12 + 0.1 × 6)/30 = 0.31
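Both regret figures follow from the same calculation: compare always playing the best condition with the expected reward of the plays actually made, then divide by the number of plays. The sketch below assumes, as the slide's arithmetic does, that the win rates observed in the 100-each experiment (0.45, 0.85, 0.10) are the true payout rates; `average_regret` is an illustrative name.

```python
# Assumed true win rates, taken from the 100-each experiment on the slides.
true_rates = {"A": 0.45, "B": 0.85, "C": 0.10}

def average_regret(plays, rates):
    """Per-play regret: (best rate * total plays - expected wins) / total plays."""
    total = sum(plays.values())
    best = max(rates.values())
    expected_wins = sum(rates[c] * n for c, n in plays.items())
    return (best * total - expected_wins) / total

# Pure exploration: 100 plays of each condition.
regret_100_each = average_regret({"A": 100, "B": 100, "C": 100}, true_rates)

# Mixed strategy: 12 plays of A, 12 of B, and 6 of C over two rounds.
regret_30_mixed = average_regret({"A": 12, "B": 12, "C": 6}, true_rates)

print(round(regret_100_each, 2), round(regret_30_mixed, 2))
```

The mixed strategy has lower per-play regret because it shifts plays away from condition C once early results suggest it pays poorly.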
39
[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.
Peter Whittle (Whittle, 1979) Discussion of “Bandit processes and dynamical allocation indices”
Their importance to website optimization, advertising, and recommendation has rejuvenated research on these problems with fascinating new questions.
40
Math is everywhere, and especially in your favorite websites! Matrices and probability are key ingredients.
41
PageRank on Wikipedia
α = 0.50
United States
C:Living people
France
Germany
England
United Kingdom
Canada
Japan
Poland
Australia
α = 0.85
United States
C:Main topic classif.
C:Contents
C:Living people
C:Ctgs. by country
United Kingdom
C:Fundamental
C:Ctgs. by topic
C:Wikipedia admin.
France
α = 0.99
C:Contents
C:Main topic classif.
C:Fundamental
United States
C:Wikipedia admin.
P:List of portals
P:Contents/Portals
C:Portals
C:Society
C:Ctgs. by topic
Note: Top 10 articles on Wikipedia with highest PageRank
42