1 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms Winter Semester 2004/2005 22 Nov 2004 6th Lecture Christian Schindelhauer [email protected]

Chapter III: Searching the Web

22 Nov 2004

Searching the Web

Introduction

The Anatomy of a Search Engine

Google’s Pagerank algorithm

– The Simple Algorithm

– Periodicity and convergence

Kleinberg’s HITS algorithm

– The algorithm

– Convergence

The Structure of the Web

– Pareto distributions

– Search in Pareto-distributed graphs

Overview: Search Engines

http://www.searchengineshowdown.com/ (March 2002)

Number of documents

Search Engine     Showdown Estimate (millions)   Claim (millions)
Google            968                            1,500
WiseNut           579                            1,500
AllTheWeb         580                            507
Northern Light    417                            358
AltaVista         397                            500
Hotbot            332                            500
MSN Search        292                            500

Overview: Search Engines

http://www.searchengineshowdown.com/ (Dec. 2002)

Number of documents

Search Engine     Showdown Estimate (millions)   Claim (millions)
Google            3,033                          3,083
AlltheWeb         2,106                          2,116
AltaVista         1,689                          1,000
WiseNut           1,453                          1,500
Hotbot            1,147                          3,000
MSN Search        1,018                          3,000
Teoma             1,015                          500
NLResearch        733                            125
Gigablast         275                            150

Problems of Searching the Web

Currently (Nov 2004) there are more than 8 billion (8,000 million) web-pages

– 10,000 words cover more than 95% of each text

– There are many more web-pages than words

– Users hardly ever look through more than 40 results

The problem is not to find a pattern, but to find the most important pages

Problems:

– Important pages do not contain the search pattern

• www.porsche.com does not contain sports car or even car

• www.google.com does not contain web search engine

• www.airbus.com does not contain airplane

– Certain pages have nearly every word (dictionary)

– Names are misleading

• http://www.whitehouse.org/ is not the web-site of the White House

• www.theonion.com is not about vegetables

– Certain patterns can be found everywhere, e.g. page, web, windows, ...

How to rank Web-pages

The main problem in searching the web is ranking pages by importance

Links are very helpful:

– Links are usually introduced by humans on purpose

– The context of a link gives some clues about the meaning of the web-page

– Pages that many people link to are probably very important

– Most search engines rely on links

Other approach: Ontology of words

– Compare the combination of words with the search word

– Good for comparing text

– Difficult if single word patterns are given

The Anatomy of a Web Search Engine

“The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page, Computer Networks and ISDN Systems, Vol. 30, 1-6, p. 107-117, 1998

Design of the prototype

– Stanford University 1998

Key components:

– Web Crawler

– Indexer

– Pagerank

– Searcher

Main difference between Google and other search engines (in 1998)

– The Pagerank mechanism

Simplified PageRank Algorithm

– Rank of a web-page u: R(u) ∈ [0,1]

– Important pages hand their rank down to the pages they link to.

– c is a normalisation factor such that ||R||_1 = 1, i.e.

• the page ranks of all pages sum to 1

– Predecessor nodes B_u

– Successor nodes F_u
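
In this notation, the rank update can be written as below. This is a common way to state the rule and it agrees with the matrix form R = c M R used later in the lecture; the symbol N_v for the number of outgoing links of page v is shorthand introduced here, not taken from the slide.

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}, \qquad N_v = |F_v|
```

Each predecessor v of u splits its rank evenly among its N_v successors, and u collects one such share from every predecessor.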

The Simplified Pagerank Algorithm and an Example
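
A minimal sketch of the simplified algorithm in Python, assuming every page has at least one outgoing link. The function name `simplified_pagerank` and the three-page graph are hypothetical illustrations, not the example shown in the lecture.

```python
def simplified_pagerank(links, rounds=50):
    """links maps each page to the list of pages it links to (its successors)."""
    pages = sorted(links)
    rank = {u: 1.0 / len(pages) for u in pages}  # uniform start, sums to 1
    for _ in range(rounds):
        new_rank = {u: 0.0 for u in pages}
        for v in pages:
            for u in links[v]:
                # page v hands an equal share of its rank to every successor
                new_rank[u] += rank[v] / len(links[v])
        total = sum(new_rank.values())
        # dividing by the total plays the role of c: it restores ||R||_1 = 1
        rank = {u: r / total for u, r in new_rank.items()}
    return rank


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(example_links)
```

For this graph the ranks approach R(A) = 0.4, R(B) = 0.2, R(C) = 0.4: B receives only half of A's rank, while C receives the other half plus all of B's.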

Matrix representation

R = c M R,

where R is the vector (R(1), R(2), …, R(n)) and M denotes the n × n matrix with entries M_ij = 1/|F_j| if page j links to page i, and M_ij = 0 otherwise

The Simplified Pagerank Algorithm

Does it converge?

If it converges, does it converge to a single result?

Is the result reasonable?

The Eigenvector and Eigenvalue of the Matrix

For a vector x ≠ 0, an n × n matrix M and a number λ:

– If M x = λ x then x is called an eigenvector and λ an eigenvalue of M

– Every n × n matrix M has at most n eigenvalues

Compute the eigenvalues by eigen-decomposition

M x = λ x ⇔ (M − λ I) x = 0,

where I is the identity matrix

– This equation has non-trivial solutions only if

det(M − λ I) = 0

– This leads to a polynomial equation of degree n, which always has n (possibly complex, not necessarily distinct) solutions λ_1, λ_2, ..., λ_n

• (Fundamental theorem of algebra)

– Solving the linear equations (M − λ_i I) x = 0 leads to the eigenvectors

An eigenvector of the matrix is a fixed point of the recursion of the simplified Pagerank algorithm
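
A small worked example of these steps, for a hypothetical web of two pages that link only to each other (chosen here for illustration, not taken from the lecture):

```latex
M = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
\det(M - \lambda I) = \lambda^2 - 1 = 0
\;\Rightarrow\; \lambda_1 = 1,\ \lambda_2 = -1

(M - \lambda_1 I)\,x = 0 \;\Rightarrow\; x \propto (1, 1)^T, \qquad
(M - \lambda_2 I)\,x = 0 \;\Rightarrow\; x \propto (1, -1)^T
```

Normalised to ||x||_1 = 1, the eigenvector for λ_1 = 1 is (1/2, 1/2), which is a fixed point of the recursion: both pages keep rank 1/2 in every round.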

Stochastic Matrices

Consider n discrete states and a sequence of random variables X_1, X_2, ... over this set of states

The sequence X_1, X_2, ... is a Markov chain if the next state depends only on the current state:

Pr[X_{t+1} = j | X_1 = i_1, ..., X_t = i_t] = Pr[X_{t+1} = j | X_t = i_t]

A stochastic matrix M is the transition matrix of a finite Markov chain, also called a Markov matrix:

– The elements of M must be real numbers in [0, 1].

– Each column of M sums to 1

Observation for the matrix M of the simplified Pagerank algorithm

– M is stochastic if all nodes have at least one outgoing link
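
The observation can be checked directly on small graphs. The following sketch builds the matrix with entry 1/(number of outgoing links of j) wherever page j links to page i, and then tests the two conditions above; the helper names `transition_matrix` and `is_column_stochastic` and both example graphs are assumptions made for illustration.

```python
def transition_matrix(links, pages):
    """M[i][j] = 1/(out-degree of page j) if page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for j, page in enumerate(pages):
        for target in links[page]:
            M[index[target]][j] += 1.0 / len(links[page])
    return M


def is_column_stochastic(M, tol=1e-12):
    """Check that all entries lie in [0, 1] and every column sums to 1."""
    n = len(M)
    entries_ok = all(-tol <= M[i][j] <= 1.0 + tol
                     for i in range(n) for j in range(n))
    columns_ok = all(abs(sum(M[i][j] for i in range(n)) - 1.0) <= tol
                     for j in range(n))
    return entries_ok and columns_ok


# Every page has an outgoing link: the matrix is stochastic.
no_sink = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
# Page B has no outgoing link: its column sums to 0.
with_sink = {"A": ["B"], "B": []}

M_no_sink = transition_matrix(no_sink, ["A", "B", "C"])
M_with_sink = transition_matrix(with_sink, ["A", "B"])
```

In the second graph the column of B is all zeros, so the matrix fails the column-sum condition.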

The Random Surfer

Consider the following algorithm

– Start in a random web-page according to a probability distribution

– Repeat the following for t rounds

• If no link is on this page, exit and produce no output

• Uniformly and randomly choose a link of the web-page

• Follow that link and go to this web-page

– Output the web-page

Lemma

The probability that a web-page i is output by the random surfer after t rounds started with probability distribution x1, .., xn is described by the i-th entry of the output of the simplified Pagerank-algorithm iterated for t rounds without normalization.

The proof follows by applying the definition of Markov chains
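
The lemma can be checked on a small graph by comparing two computations: summing the probabilities of all click paths of the surfer, and iterating the rank update without normalisation. This is a sketch under assumptions; the function names and the four-page graph (with a sink D) are hypothetical. Exact fractions are used so that the two results can be compared for equality.

```python
from fractions import Fraction


def surfer_output_probability(links, start, t, page):
    """Sum the probabilities of all t-step click paths that end in `page`."""
    def walk(current, steps_left, prob):
        if steps_left == 0:
            return prob if current == page else Fraction(0)
        out = links[current]
        # a page without links ends the walk and produces no output
        return sum((walk(nxt, steps_left - 1, prob / len(out)) for nxt in out),
                   Fraction(0))
    return sum((walk(s, t, p) for s, p in start.items()), Fraction(0))


def iterate_unnormalized(links, start, t):
    """t rounds of the simplified rank update without the factor c."""
    rank = {p: Fraction(start.get(p, 0)) for p in links}
    for _ in range(t):
        new_rank = {p: Fraction(0) for p in links}
        for v, out in links.items():
            for u in out:
                new_rank[u] += rank[v] / len(out)
        rank = new_rank
    return rank


# Hypothetical graph; D is a page without links.
surfer_links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": []}
uniform_start = {p: Fraction(1, 4) for p in surfer_links}
by_iteration = iterate_unnormalized(surfer_links, uniform_start, 4)
by_paths = {p: surfer_output_probability(surfer_links, uniform_start, 4, p)
            for p in surfer_links}
```

Because of the sink, the probabilities sum to less than 1: the missing mass is the probability that the surfer exits without output.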

Eigenvalues of Stochastic Matrices

Notations

– The L1-norm of a vector x is defined as ||x||_1 = Σ_i |x_i|

– x ≥ 0, if for all i: x_i ≥ 0

– x ≤ 0, if for all i: x_i ≤ 0

Lemma

For every stochastic matrix M and every vector x we have

• ||M x||_1 ≤ ||x||_1

• ||M x||_1 = ||x||_1, if x ≥ 0 or x ≤ 0

Eigenvalues of M: |λ_i| ≤ 1

Theorem

For every stochastic matrix M there is an eigenvector x with eigenvalue 1 such that x ≥ 0 and ||x||_1 = 1
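
One way to see the lemma, sketched here using only that the entries of M are non-negative and that each column sums to 1:

```latex
\|Mx\|_1 = \sum_i \Bigl|\sum_j M_{ij}\, x_j\Bigr|
\;\le\; \sum_i \sum_j M_{ij}\, |x_j|
\;=\; \sum_j |x_j| \sum_i M_{ij}
\;=\; \sum_j |x_j| \;=\; \|x\|_1
```

The inequality is the triangle inequality; it holds with equality when all x_j have the same sign, which gives the second statement. For an eigenvector x with eigenvalue λ this yields |λ| · ||x||_1 = ||M x||_1 ≤ ||x||_1, hence |λ| ≤ 1.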

The problem of periodicity - Example
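
A minimal illustration of the problem, assuming a hypothetical web of two pages that link only to each other (this is not necessarily the example from the lecture). The helper `step` applies one round of x ← M x.

```python
def step(M, x):
    """One round x <- M x for a matrix given as a list of rows."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]


# Page 1 links to page 2 and page 2 links to page 1.
M_cycle = [[0.0, 1.0],
           [1.0, 0.0]]

x = [1.0, 0.0]  # all rank starts on page 1
history = [x]
for _ in range(4):
    x = step(M_cycle, x)
    history.append(x)

fixed = step(M_cycle, [0.5, 0.5])  # the uniform vector does not move
```

Starting from (1, 0), the rank jumps back and forth between the two pages and never settles, although the uniform vector (1/2, 1/2) is a fixed point.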

Periodicity - Example 2

Periodic Matrices

Definition

– A square matrix M such that the matrix power M^(k+1) = M for a positive integer k is called a periodic matrix.

– If k is the least such integer, then the matrix is said to have period k.

– If k = 1, then M^2 = M and M is called idempotent.

Fact

– For periodic matrices with period at least 2 there are vectors x such that lim_{j→∞} M^j x does not exist.

Definition

– The directed graph G = (V,E) of an n × n matrix M consists of the node set V = {1,..., n} and has the edges

• E = {(i,j) | M_ij ≠ 0}

– A path is a sequence of edges (u_1,u_2), (u_2,u_3), (u_3,u_4), ..., (u_t,u_{t+1}) of a graph

– A graph cycle is a path where the start node is the end node

– A strongly connected subgraph S is a maximal subgraph such that every graph cycle starting and ending in a node of S is contained in S.

Necessary and Sufficient Conditions for Periodicity

Theorem (necessary condition)

– If the stochastic matrix M is periodic with period t ≥ 2, then for the graph G of M there exists a strongly connected subgraph S of at least two nodes such that every directed graph cycle within S has a length of the form i · t for a natural number i.

Theorem (sufficient condition)

– Let the graph consist of one strongly connected subgraph and

– let L_1, L_2, ..., L_m be the lengths of all directed graph cycles of length at most n

– Then M is non-periodic if and only if gcd(L_1, L_2, ..., L_m) = 1

Notation:

– gcd(L_1, L_2, ..., L_m) = greatest common divisor of the numbers L_1, L_2, ..., L_m

Corollary

– If the graph is strongly connected and there exists a graph cycle of length 1 (i.e. a loop), then M is non-periodic.
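
The gcd criterion can be evaluated mechanically on small graphs. The sketch below does not enumerate cycles one by one; as a swapped-in simplification it records every length k ≤ n for which some closed walk of length k exists. Closed walks decompose into simple cycles, so this gives the same gcd. The function name `cycle_length_gcd` and the example graphs are hypothetical, and strong connectivity is assumed rather than checked.

```python
from math import gcd


def cycle_length_gcd(adj):
    """gcd of all lengths k <= n for which a closed walk of length k exists.

    adj[i][j] is True if the graph has the edge (i, j).
    Returns 0 if the graph has no cycle at all.
    """
    n = len(adj)
    reach = [row[:] for row in adj]  # reach[i][j]: walk of length k from i to j
    g = 0
    for k in range(1, n + 1):
        if any(reach[i][i] for i in range(n)):
            g = gcd(g, k)
        # extend every walk by one more edge
        reach = [[any(reach[i][m] and adj[m][j] for m in range(n))
                  for j in range(n)]
                 for i in range(n)]
    return g


T, F = True, False
two_cycle = [[F, T], [T, F]]                    # 1 -> 2 -> 1
three_cycle = [[F, T, F], [F, F, T], [T, F, F]]  # 1 -> 2 -> 3 -> 1
with_loop = [[T, T], [T, F]]                    # two_cycle plus a loop at node 1
mixed = [[F, T, T], [F, F, T], [T, F, F]]       # cycles of length 2 and 3
```

A result of 1 means the criterion reports a non-periodic matrix. Adding a loop to the two-cycle drops the gcd from 2 to 1, as the corollary states.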

Disadvantages of the Simplified Pagerank-Algorithm

The Web-graph has sinks, i.e. pages without links

⇒ M is not a stochastic matrix

The Web-graph is periodic

⇒ Convergence is uncertain

The Web-graph is not strongly connected

⇒ Several convergence vectors are possible

Rank-sinks

– Strongly connected subgraphs absorb all the weight of their predecessors

– All predecessors pointing to a web-page in such a subgraph lose their weight.

The (non-simplified) Pagerank-Algorithm

Add to each sink links to all web-pages

Uniformly and randomly choose a web-page

– With some probability q < 1 perform a step of the simplified Pagerank algorithm

– With probability 1-q start again with the first step (and choose a random web-page)

Note: M is stochastic
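
The two modifications can be sketched as a small iteration. The function name `pagerank`, the default q = 0.85 (a commonly quoted value, not stated in the lecture), the number of rounds and the example graph are all assumptions made for illustration.

```python
def pagerank(links, q=0.85, rounds=100, start=None):
    """Rank iteration with random restarts; sinks behave as if linking everywhere."""
    pages = sorted(links)
    n = len(pages)
    rank = dict(start) if start else {p: 1.0 / n for p in pages}
    for _ in range(rounds):
        # with probability 1 - q the surfer restarts at a uniformly random page
        new_rank = {p: (1.0 - q) / n for p in pages}
        for v in pages:
            # a sink is treated as if it linked to every page
            targets = links[v] if links[v] else pages
            for u in targets:
                # with probability q: one step of the simplified algorithm
                new_rank[u] += q * rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph with a periodic two-cycle (A, B) and a sink (D).
web = {"A": ["B"], "B": ["A"], "C": ["A"], "D": []}
from_uniform = pagerank(web)
from_single = pagerank(web, start={"A": 0.0, "B": 0.0, "C": 0.0, "D": 1.0})
```

No normalisation step is needed: each round distributes total weight (1 − q) + q = 1. Both starting vectors lead to the same ranks up to numerical tolerance, even though the underlying graph contains a sink and a periodic cycle.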

Properties of the Pagerank-Algorithm

The graph of the matrix is strongly connected

There are graph cycles of length 1

Theorem

For non-periodic matrices of strongly connected graphs, the Markov chain converges to a unique eigenvector with eigenvalue 1.

PageRank converges to this unique eigenvector

Thanks for your attention

End of 6th lecture

Next lecture: Mo 29 Nov 2004, 11.15 am, FU 116

Next exercise class: Mo 22 Nov 2004, 1.15 pm, F0.530 or We 24 Nov 2004, 1.00 pm, E2.316