HEINZ NIXDORF INSTITUTE, University of Paderborn
Algorithms and Complexity
Christian Schindelhauer

Search Algorithms, Winter Semester 2004/2005
10th Lecture, 20 Dec 2004


Chapter III: Searching the Web


Searching the Web

Introduction
The Anatomy of a Search Engine
Google’s PageRank algorithm
– The Simple Algorithm
– Periodicity and convergence
Kleinberg’s HITS algorithm
– The algorithm
– Convergence
The Structure of the Web
– Pareto distributions
– Search in Pareto-distributed graphs


The Webgraph

G_WWW:
– Static HTML pages are nodes
– Links are directed edges

Outdegree of a node: number of links on a web page
Indegree of a node: number of links pointing to a web page

Directed path from node u to node v
– a sequence of web pages obtained by following links from page u to page v

Undirected path (u = w_0, w_1, …, w_{m-1}, w_m = v) from page u to page v
– For all i:
  • There is a link from w_i to w_{i+1} or from w_{i+1} to w_i

Strongly (weakly) connected subgraph
– minimal node set including all nodes which have a directed (undirected) path from and to a reference node
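The definitions above can be sketched in a few lines of Python. This is only an illustration under assumptions: the Web graph is given as a list of (source, target) link pairs, and the function names `degrees`, `reachable` and `components` as well as the toy link list are hypothetical, not part of the lecture.

```python
from collections import deque

def degrees(edges):
    """Return (outdegree, indegree) dictionaries for directed links (u, v)."""
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
        out_deg.setdefault(v, 0)
        in_deg.setdefault(u, 0)
    return out_deg, in_deg

def reachable(adj, start):
    """Breadth-first search: all nodes reachable from start via adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def components(edges, ref):
    """Strongly and weakly connected node sets around the reference node ref."""
    fwd, bwd, und = {}, {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)   # follow links
        bwd.setdefault(v, []).append(u)   # follow links backwards
        und.setdefault(u, []).append(v)   # ignore direction
        und.setdefault(v, []).append(u)
    # Strong: directed path from ref to the node AND from the node to ref.
    strong = reachable(fwd, ref) & reachable(bwd, ref)
    # Weak: undirected path between ref and the node.
    weak = reachable(und, ref)
    return strong, weak

# Hypothetical toy web: a -> b -> c -> a is a cycle, d and e hang off it.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "d")]
out_deg, in_deg = degrees(links)
strong, weak = components(links, "a")
```

Intersecting forward and backward reachability is one simple way to obtain the strongly connected set of a single reference node; it is not necessarily how the experiments cited in the lecture were carried out.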


The Web-Graph (1999)


Distributions of indegree/outdegree

In- and outdegree obey a power law
– i.e. in- and outdegree i appear with probability ~ 1/i^α

According to experiments of
– Kumar et al 97: 40 million web pages
– Barabasi et al 99: Domain *.nd.edu + web pages with distance 3
– Broder et al 00: 204 million web pages (scans in May and Oct 1999)


Is the Web-Graph a Random Graph? No!

Random graph G_{n,p}:
– n nodes
– Every directed edge occurs independently with probability p

Is the Web graph a random graph G_{n,p}?
– In a random graph, degrees are distributed according to a binomial distribution, which is approximately a Poisson distribution for large n
– The probability of high degrees therefore decreases exponentially

Therefore: The degree distribution of a random graph does not obey a power law
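To make the contrast concrete, the following sketch compares the probability of a high degree under a Poisson distribution and under a power law. The parameters (mean degree 5, exponent 2.1) and the helper names are assumptions chosen for illustration only; the zeta normalization is approximated by a truncated sum.

```python
import math

def poisson_pmf(k, lam):
    """P[X = k] for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zeta_partial(alpha, terms=100_000):
    """Truncated sum approximating the Riemann zeta function (alpha > 1)."""
    return sum(i ** -alpha for i in range(1, terms + 1))

def power_law_pmf(k, alpha, zeta):
    """P[X = k] = k^(-alpha) / zeta(alpha) for k = 1, 2, 3, ..."""
    return k ** -alpha / zeta

if __name__ == "__main__":
    zeta = zeta_partial(2.1)
    for k in (1, 5, 10, 50):
        # The Poisson column collapses much faster than the power law column.
        print(k, poisson_pmf(k, 5.0), power_law_pmf(k, 2.1, zeta))
```

For small degrees both probabilities are of comparable size, while for degree 50 the Poisson probability is smaller by many orders of magnitude. This is the sense in which a G_{n,p} graph has essentially no nodes of very high degree.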


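Pareto Distribution

Discrete Pareto (power law) distribution for x ∈ {1,2,3,…}

  $P[X = x] = \frac{1}{\zeta(\alpha)} \cdot \frac{1}{x^{\alpha}}$

with constant factor

  $\zeta(\alpha) = \sum_{i=1}^{\infty} \frac{1}{i^{\alpha}}$

(also known as the Riemann Zeta function)

Heavy tail property
– not all moments E[X^k] are defined
– Expected value exists if and only if α > 2
– Variance and E[X^2] exist if and only if α > 3
– E[X^k] is defined if and only if α > k+1

Density function of the continuous distribution for x > x_0

  $f(x) = \frac{\alpha - 1}{x_0} \left( \frac{x_0}{x} \right)^{\alpha}$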
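The moment conditions can be checked with a short calculation, sketched here for the discrete distribution with probabilities proportional to 1/x^α (the symbols are the ones used on this slide):

```latex
\begin{align*}
E[X^k] &= \sum_{x=1}^{\infty} x^k \cdot \frac{1}{\zeta(\alpha)\, x^{\alpha}}
        = \frac{1}{\zeta(\alpha)} \sum_{x=1}^{\infty} \frac{1}{x^{\alpha - k}}
        = \frac{\zeta(\alpha - k)}{\zeta(\alpha)} .
\end{align*}
% The series \sum_x x^{-(\alpha - k)} converges if and only if \alpha - k > 1,
% i.e. E[X^k] is defined if and only if \alpha > k + 1.
```

Setting k = 1 gives the condition α > 2 for the expected value, and k = 2 gives α > 3 for E[X^2] and the variance.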


Special Case: Zipf Distribution

George Kingsley Zipf claimed that the n-th most frequent word occurs with frequency f(n) such that f(n) · n = c

Zipf probability distribution for x ∈ {1,2,3,…,n}

  $P[X = x] = \frac{c}{x}$

with constant factor c, only defined for finite sets, since

  $\sum_{i=1}^{n} \frac{1}{i}$

tends to infinity for growing n

Zipf distributions refer to ranks
– The Zipf exponent can be larger than 1, i.e. f(n) = c/n^α

Pareto distributions refer to absolute size
– e.g. number of inhabitants
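A minimal sketch of the normalization, assuming a finite set of n ranks and the plain Zipf form with probabilities proportional to 1/x. The function names `harmonic` and `zipf_pmf` are hypothetical.

```python
def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, which grows without bound (roughly like ln n)."""
    return sum(1.0 / i for i in range(1, n + 1))

def zipf_pmf(n):
    """Probabilities c/x for ranks x = 1..n with c = 1/H_n."""
    c = 1.0 / harmonic(n)
    return [c / x for x in range(1, n + 1)]

# Example: probabilities of the 10 most frequent words under this model.
top10 = zipf_pmf(10)
```

Because H_n keeps growing with n, the constant c shrinks towards 0, which is why no such distribution exists on the infinite set {1,2,3,…}.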


Pareto Distribution (I)

Examples of power laws (= Pareto distributions)
– Pareto 1897: Wealth/income in population
– Yule 1944: Word frequency in languages
– Zipf 1949: Size of towns
– Length of molecule chains
– File length of UNIX files
– …
– Access density of web pages
– Access density of a web surfer at a particular web page
– …


City Size Distribution
Scaling Laws and Urban Distributions, Denise Pumain, 2003

[Figure: Zipf distribution]


Zipf’s Law and the Internet
Lada A. Adamic, Bernardo A. Huberman, 2002

[Figures: Pareto distribution]


Heavy-Tailed Probability Distributions in the World Wide Web
Mark Crovella, Murad Taqqu, Azer Bestavros, 1996

[Figures]


Size of connected components

Strongly and weakly connected components obey a power law
– Largest weakly connected component contains 91% of all web pages
– Largest strongly connected component has size 28%
  • Diameter ≥ 28

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. “Graph Structure in the Web: Experiments and Models.” In Proc. of the 9th World Wide Web Conference, pp. 309–320. Amsterdam: Elsevier Science, 2000.


Searching in Power Law Networks

Task:
– Given a network with undirected edges
– Degrees follow a power law
– From a source node
– Find a target node

Features
– Keep it simple
  • no markers
– Visit one node at a time
– Every node knows its neighbors (and their degrees)

Three approaches
– Neighbors of random nodes
– Neighbors of a random walk:
  • Start with a random neighbor and continue
– Neighbors of High Degree Seeking:
  • Start with random node
  • Prefer neighbors with larger degree

From Adamic, Lukose, Puniyani, Huberman, “Search in power-law networks”, Physical Review E, Vol. 64, 046135


Power Law Networks

Undirected graph of n nodes
– The probability that a node has k neighbors is p_k
– where p_k = c k^{-τ} for a normalization factor c

For search in power law networks
– Consider the largest connected component and
– exponent τ with 2 < τ < 3

Theorem
– For large enough power law graphs with exponent τ
  • For τ < 1 the graph is almost surely connected
  • For 1 < τ < 2: There is a giant connected component of size Θ(n)
  • For 2 < τ < 3.4785: There is a giant component and all smaller components are of size O(log n)
  • For τ > 3.4785: The graph has almost surely no giant component, i.e. all components have size o(n)
  • For τ > 4: The sizes of all connected components follow a power law
– by William Aiello, Fan Chung, Linyuan Lu, A Random Graph Model for Massive Graphs, Symposium on Theory of Computing (STOC) 2000


Random Walk

Random Walk:
  Start with random node as node u
  while no neighbor of u is the target do
    u ← random neighbor of u
  od

Theorem
In undirected connected graphs every node is visited by a random walk with probability proportional to its degree (in the long run).

Conclusion:
– High degree nodes are preferred

Possible improvements
– Avoid going back
– Avoid visiting already visited nodes
– Scan also second degree neighbors for the target node

[Figure: RW, random walk in a power law graph with exponent 2.1
– avoiding going back
– second degree scanning]
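The random walk pseudocode can be turned into a small runnable sketch. It assumes the graph is a dictionary mapping each node to a list of its neighbors; the function name `random_walk_search`, the step limit `max_steps` and the toy path graph are hypothetical additions, and none of the listed improvements are included.

```python
import random

def random_walk_search(adj, source, target, rng, max_steps=10_000):
    """Walk randomly until the target is the current node or one of its neighbors.

    Returns the number of steps taken, or None if max_steps is exceeded
    (the step limit is an assumption; the pseudocode has no such bound).
    """
    u = source
    for steps in range(max_steps + 1):
        if u == target or target in adj[u]:
            return steps
        u = rng.choice(adj[u])  # move to a uniformly random neighbor
    return None

# Hypothetical toy graph: the undirected path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
steps = random_walk_search(path, 0, 3, random.Random(0))
```

Passing an explicit `random.Random` instance keeps runs reproducible; the walk itself only uses local knowledge (the neighbor list of the current node), as required by the task description.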


Degree Seeking

Degree Seeking:
  Start with random node as node u
  while no neighbor of u is the target do
    u ← neighbor of u with highest degree that was not visited so far
  od

Improvement:
– Scan also second degree neighbors for the target

Observation:
– The search in power law networks is considerably faster

Why?

[Figure: RW, random walk in a power law graph with exponent 2.1; DS, degree seeking in the same graph
– avoiding already visited neighbors
– second degree scanning]
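A minimal sketch of the degree seeking strategy, again assuming an adjacency dictionary from node to neighbor list. The names `degree_seeking_search` and `max_steps` are hypothetical, and giving up when every neighbor has already been visited is an assumption of this sketch: the pseudocode does not say what to do in that case.

```python
def degree_seeking_search(adj, source, target, max_steps=10_000):
    """Always move to the unvisited neighbor with the highest degree.

    Returns the number of steps taken until the target is the current node
    or one of its neighbors, or None if the search gets stuck.
    """
    u = source
    visited = {u}
    for steps in range(max_steps + 1):
        if u == target or target in adj[u]:
            return steps
        unvisited = [v for v in adj[u] if v not in visited]
        if not unvisited:
            return None  # assumption: stop at a dead end instead of backtracking
        u = max(unvisited, key=lambda v: len(adj[v]))  # highest degree first
        visited.add(u)
    return None

# Hypothetical toy graph: hub 1 with leaves 0, 2, 3 and a tail 1 - 4 - 5 - 6.
graph = {
    0: [1],
    1: [0, 2, 3, 4],
    2: [1],
    3: [1],
    4: [1, 5],
    5: [4, 6],
    6: [5],
}
steps = degree_seeking_search(graph, 0, 6)
```

In the toy graph the search goes 0, 1, 4, 5 and never enters the degree-one leaves 2 and 3, which a random walk would visit with positive probability.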


Comparison of Random Walk and Degree Seeking


Probability Generating Functions

For a discrete probability distribution X over {0,1,2,3,…} let p_k be the probability of event k ∈ {0,1,2,3,…}

Then the generating function for the probability distribution is

  $G(x) = \sum_{k=0}^{\infty} p_k x^k$

Probability values

  $p_k = \frac{G^{(k)}(0)}{k!}$

– where G^{(k)} is the k-th derivative of G

For probability distributions X and Y and their generating functions G_X, G_Y we have

  $X = Y \iff G_X = G_Y$


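Probability Generating Functions: Properties

Sum of probabilities

  $G(1) = \sum_{k} p_k = 1$

Expectation

  $E[X] = \sum_{k} k \, p_k = G'(1)$

If X_i are independent discrete random variables and G_{X_i} their generating functions, then for

  $S = \sum_{i=1}^{n} X_i$

the generating function is

  $G_S(x) = \prod_{i=1}^{n} G_{X_i}(x)$

This implies for S = X_1 − X_2, where X_1 and X_2 are independent

  $G_S(x) = G_{X_1}(x) \cdot G_{X_2}(1/x)$

Let N be an independent random variable over {0,1,2,…}. Let X_1, X_2, … be independent and identically distributed random variables with generating function G_X. Then for the random sum X_N = X_1 + … + X_N the generating function is given by

  $G_{X_N}(x) = G_N(G_X(x))$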
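For distributions with finitely many outcomes these properties can be tried out directly. The sketch below represents a distribution as a list of probabilities p[0], p[1], …, i.e. the coefficients of its generating function. The function names and the coin example are hypothetical.

```python
def pgf_eval(p, x):
    """G(x) = sum_k p[k] * x^k."""
    return sum(pk * x ** k for k, pk in enumerate(p))

def pgf_mean(p):
    """E[X] = G'(1) = sum_k k * p[k]."""
    return sum(k * pk for k, pk in enumerate(p))

def pgf_product(p, q):
    """Coefficients of G_X(x) * G_Y(x): the distribution of X + Y (independent)."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

def pgf_compose(pn, px):
    """Coefficients of G_N(G_X(x)): the distribution of X_1 + ... + X_N."""
    result = []
    power = [1.0]  # coefficients of G_X(x)^0
    for prob in pn:  # prob = P[N = n], power = coefficients of G_X(x)^n
        if len(result) < len(power):
            result += [0.0] * (len(power) - len(result))
        for k, c in enumerate(power):
            result[k] += prob * c
        power = pgf_product(power, px)
    return result

coin = [0.5, 0.5]              # fair coin: 0 or 1
two_coins = pgf_product(coin, coin)
n_dist = [0.0, 0.5, 0.5]       # N is 1 or 2 with equal probability
random_sum = pgf_compose(n_dist, coin)
```

Multiplying generating functions is the same as convolving the probability lists, and the mean of the random sum equals E[N] · E[X], as the chain rule applied to G_N(G_X(x)) at x = 1 suggests.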


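Probability Generating Functions: Examples

Remember that

  $G(x) = \sum_{k=0}^{\infty} p_k x^k$

Example:
– Consider the random variable […]
– then the generating function is […]

Poisson probability distribution with

  $p_k = e^{-\lambda} \frac{\lambda^k}{k!}$

– Generating function:

  $G(x) = e^{\lambda (x - 1)}$

Pareto (power law) probability distribution

  $p_k = c \, k^{-\tau}$ for k ≥ 1, with generating function $G(x) = c \sum_{k=1}^{\infty} k^{-\tau} x^k$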
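The Poisson generating function follows from the exponential series. As a short worked derivation, writing λ for the mean of the Poisson distribution:

```latex
\begin{align*}
G(x) &= \sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} \, x^k
      = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda x)^k}{k!}
      = e^{-\lambda} e^{\lambda x}
      = e^{\lambda (x - 1)} , \\
G(1) &= e^{0} = 1 , \qquad
G'(x) = \lambda \, e^{\lambda (x - 1)} , \qquad
E[X] = G'(1) = \lambda .
\end{align*}
```

Both general properties are visible here: the probabilities sum to G(1) = 1, and the expectation is G'(1) = λ.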


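Analyzing Power Law Graphs

Consider the generating function for the degree

  $G_0(x) = \sum_{k=0}^{\infty} p_k x^k$

Let p_k = 0 for all k > m = n^{1/τ} and for k = 0
– If k > n^{1/τ} then p_k < n^{-1}
– This means that fewer than one node of degree k exists in expectation

Hence, the generating function is

  $G_0(x) = c \sum_{k=1}^{m} k^{-\tau} x^k$

Choose the normalization factor c such that

  $G_0(1) = c \sum_{k=1}^{m} k^{-\tau} = 1$

Then, the average degree is given by

  $\langle k \rangle = G_0'(1) = c \sum_{k=1}^{m} k^{1-\tau}$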
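The cut-off, the normalization factor and the average degree can be computed numerically. This sketch assumes the degree probabilities c·k^(-tau) for k = 1..m with cut-off m equal to the n-th power 1/tau, where tau stands for the power law exponent. Rounding m down to an integer, the function names and the parameter values are assumptions.

```python
def truncated_power_law(n, tau):
    """Return (m, c, p) with p[k] = c * k^(-tau) for 1 <= k <= m and p[0] = 0."""
    m = max(1, int(n ** (1.0 / tau)))      # cut-off, rounded down (assumption)
    weights = [k ** -tau for k in range(1, m + 1)]
    c = 1.0 / sum(weights)                 # normalization so that G_0(1) = 1
    p = [0.0] + [c * w for w in weights]
    return m, c, p

def average_degree(p):
    """G_0'(1) = sum_k k * p[k]."""
    return sum(k * pk for k, pk in enumerate(p))

# Example with hypothetical parameters: one million nodes, exponent 2.5.
n, tau = 10 ** 6, 2.5
m, c, p = truncated_power_law(n, tau)
avg = average_degree(p)
```

With these parameters the cut-off is m = 251, far below n, and the first degree beyond the cut-off would already have fewer than one expected node.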


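The Average Degree

Average degree of a node

  $\langle k \rangle = \sum_{k} k \, p_k = G_0'(1)$

A random edge chooses high degree nodes with higher probability
– if a node has k edges then the probability increases (for large networks) by a factor of k
– i.e. probability p'(k) ~ k p_k
– the corresponding normalized generating function is

  $\frac{\sum_k k \, p_k x^k}{\sum_k k \, p_k} = x \, \frac{G_0'(x)}{G_0'(1)}$

The degree distribution of a node reached after one step of a random walk, not counting the edge used to reach it, is given by this function shifted by one place, i.e.

  $G_1(x) = \frac{G_0'(x)}{G_0'(1)}$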


The Neighbor’s Degree

Assume that
– a node “knows” the degree of all neighbors
– the probability that any second neighbor is connected to more than one first neighbor can be neglected
  • Then, the degrees of the first neighbors and second neighbors are independent
  • Second neighbors are the neighbors in the next step

Let z_{2a} denote the average number of second neighbors starting from a random node
– Choose N according to G_0
– Choose X_i according to G_1
– Consider X_N and the generating function

  $G_{X_N}(x) = G_0(G_1(x))$

– Then

  $z_{2a} = \frac{d}{dx} G_0(G_1(x)) \Big|_{x=1} = G_0'(1) \, G_1'(1)$

Let z_{2b} denote the average number of second neighbors starting from a node chosen by a random edge
– Choose N according to G_1
– Choose X_i according to G_1
– Consider X_N and the generating function

  $G_{X_N}(x) = G_1(G_1(x))$

– Then

  $z_{2b} = \frac{d}{dx} G_1(G_1(x)) \Big|_{x=1} = \left[ G_1'(1) \right]^2$


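Random Walks outperform Random Nodes

Let z_{2a} denote the average number of second neighbors starting from a random node
– The degree depends on the cut-off value m = Θ(n^{1/τ})
– For 2 < τ < 3 one can obtain

  $G_0'(1) = c \sum_{k=1}^{m} k^{1-\tau} = \Theta(1), \qquad G_1'(1) = \frac{G_0''(1)}{G_0'(1)} = \Theta(m^{3-\tau}) = \Theta(n^{3/\tau - 1})$

– Hence,

  $z_{2a} = G_0'(1) \, G_1'(1) = \Theta(n^{3/\tau - 1})$

Let z_{2b} denote the average number of second neighbors starting from a node chosen by a random edge
– The degree depends on the cut-off value m = Θ(n^{1/τ})
– For 2 < τ < 3 one can obtain

  $G_1'(1) = \Theta(m^{3-\tau}) = \Theta(n^{3/\tau - 1})$

– Hence,

  $z_{2b} = \left[ G_1'(1) \right]^2 = \Theta(n^{2(3/\tau - 1)})$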
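The two quantities can also be evaluated numerically for a concrete network size. The sketch assumes the truncated power law with probabilities c·k^(-tau) for k = 1..m and cut-off m equal to the n-th power 1/tau (rounded down), and uses that the derivative of G_1 at 1 equals the second derivative of G_0 at 1 divided by the first. The function name `second_neighbors` and the parameter values are hypothetical.

```python
def second_neighbors(n, tau):
    """Return (z2a, z2b) for a truncated power law degree distribution.

    z2a: expected second neighbors starting from a random node.
    z2b: expected second neighbors starting from a node chosen by a random edge.
    """
    m = max(1, int(n ** (1.0 / tau)))                  # cut-off (rounded down)
    ks = range(1, m + 1)
    c = 1.0 / sum(k ** -tau for k in ks)               # normalization
    g0_prime = c * sum(k * k ** -tau for k in ks)      # G_0'(1), average degree
    g0_second = c * sum(k * (k - 1) * k ** -tau for k in ks)  # G_0''(1)
    g1_prime = g0_second / g0_prime                    # G_1'(1)
    z2a = g0_prime * g1_prime
    z2b = g1_prime ** 2
    return z2a, z2b

# Example with hypothetical parameters: exponent 2.1, two network sizes.
small = second_neighbors(10 ** 4, 2.1)
large = second_neighbors(10 ** 6, 2.1)
```

The ratio z2b / z2a equals the ratio of the derivative of G_1 at 1 to the average degree, so it grows with n for exponents between 2 and 3: starting from a node reached via a random edge exposes increasingly more second neighbors than starting from a random node.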


Conclusions

The number of nodes in the neighborhood of the nodes of a random walk is approximately the square of the number of nodes neighboring random points of the network

This effect can be increased if we prefer the neighbor with the highest degree

This improves the search in power law networks
– because more neighbors are in reach

In random graphs (Poisson graphs) this technique does not help as much
– since the degree distribution is sharply concentrated around the expectation.


Thanks for your attention
End of 10th lecture
Happy X-mas and a happy new year
Next lecture: Mo 10 Jan 2005, 11.15 am, FU 116
Next exercise class: Mo 20 Dec 2004, 1.15 pm, F0.530 or We 22 Dec 2004, 1.00 pm, E2.316