Search Algorithms Winter Semester 2004/2005 20 Dec 2004 10th Lecture


HEINZ NIXDORF INSTITUTE
University of Paderborn

Algorithms and Complexity
Christian Schindelhauer




Chapter III: Searching the Web




Searching the Web

Introduction

The Anatomy of a Search Engine

Google’s PageRank algorithm
– The simple algorithm
– Periodicity and convergence

Kleinberg’s HITS algorithm
– The algorithm
– Convergence

The Structure of the Web
– Pareto distributions
– Search in Pareto-distributed graphs




The Webgraph

$G_{WWW}$:
– Static HTML pages are nodes
– Links are directed edges

Outdegree of a node: number of links on a web page (pointing to other pages)
Indegree of a node: number of links pointing to a web page

Directed path from node u to v
– A sequence of web pages obtained by following links from page u to page v

Undirected path $(u = w_0, w_1, \dots, w_{m-1}, v = w_m)$ from page u to page v
– For all i:
• There is a link from $w_i$ to $w_{i+1}$ or from $w_{i+1}$ to $w_i$

Strongly (weakly) connected subgraph
– Minimal node set including all nodes that have a directed (undirected) path from and to a reference node
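The definitions above can be tried out on a toy graph. The following is a minimal Python sketch, assuming the web graph is given as a list of directed (source, target) link pairs; the function names and the example link list are illustrative and not part of the lecture.

```python
from collections import defaultdict, deque

def degrees(edges):
    """Return (outdegree, indegree) dictionaries for directed (u, v) links."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return dict(out_deg), dict(in_deg)

def reachable(adj, start):
    """All nodes reachable from start by following the adjacency sets (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def components_of(edges, ref):
    """Strongly and weakly connected node sets of the reference node ref."""
    fwd, bwd, und = defaultdict(set), defaultdict(set), defaultdict(set)
    for u, v in edges:
        fwd[u].add(v)      # follow links forwards
        bwd[v].add(u)      # follow links backwards
        und[u].add(v)      # ignore link direction
        und[v].add(u)
    strong = reachable(fwd, ref) & reachable(bwd, ref)
    weak = reachable(und, ref)
    return strong, weak

# Hypothetical toy web graph: a cycle a -> b -> c -> a plus pages d and e.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "d")]
print(degrees(links))
print(components_of(links, "a"))
```

A node belongs to the strongly connected set of the reference node exactly if it is reachable in both the forward and the backward graph, which is why two searches and an intersection suffice here.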




The Web-Graph (1999)




Distributions of indegree/outdegree

In- and out-degree obey a power law
– i.e. in-degree (or out-degree) $i$ appears with probability $\sim 1/i^{\alpha}$

According to experiments of
– Kumar et al. 97: 40 million web pages
– Barabasi et al. 99: domain *.nd.edu + web pages with distance 3
– Broder et al. 00: 204 million web pages (scans in May and Oct 1999)




Is the Web Graph a Random Graph? No!

Random graph $G_{n,p}$:
– n nodes
– Every directed edge occurs with probability p

Is the web graph a random graph $G_{n,p}$?

The probability of high degrees decreases exponentially
In a random graph, degrees are distributed according to a Poisson distribution (in the limit of large n with constant np)

Therefore: the degree distribution of a random graph does not obey a power law
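To make the contrast concrete, here is a small Python sketch that compares the probability of a given degree under a Poisson distribution with that under a power law. The mean degree 5, the exponent 2.1 and the truncation of the normalisation sum at `k_max` are assumptions chosen for illustration only.

```python
import math

def poisson_pmf(k, lam):
    """P[degree = k] for a Poisson distribution with mean lam (computed in log space)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def power_law_pmf(k, alpha, k_max=100_000):
    """P[degree = k] proportional to k^-alpha, normalised over 1..k_max (assumed truncation)."""
    norm = sum(i ** -alpha for i in range(1, k_max + 1))
    return k ** -alpha / norm

for k in (1, 10, 100):
    print(k, poisson_pmf(k, 5.0), power_law_pmf(k, 2.1))
```

For degree 100 the Poisson probability is astronomically small, while the power-law probability is still of a size that shows up in a crawl of millions of pages.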




Pareto Distribution

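Discrete Pareto (power law) distribution for $x \in \{1,2,3,\dots\}$

$P[X = x] = \dfrac{1}{\zeta(\alpha)\, x^{\alpha}}$

with constant factor

$\zeta(\alpha) = \sum_{i=1}^{\infty} \dfrac{1}{i^{\alpha}}$

(also known as the Riemann zeta function)

Heavy tail property
– Not all moments $E[X^k]$ are defined
– Expected value exists if and only if $\alpha > 2$
– Variance and $E[X^2]$ exist if and only if $\alpha > 3$
– $E[X^k]$ is defined if and only if $\alpha > k+1$

Density function of the continuous distribution for $x > x_0$

$f(x) = \dfrac{\alpha - 1}{x_0} \left(\dfrac{x_0}{x}\right)^{\alpha}$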
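The heavy tail property can be observed numerically. The sketch below, in Python, sums $x^k \cdot P[X = x]$ over the first terms of the discrete Pareto distribution; the truncation of the zeta sum at 200,000 terms is an assumption made to keep the example short, so the values are approximations.

```python
def zeta_partial(alpha, terms=200_000):
    """Truncated Riemann zeta sum, used as an approximate normalisation constant."""
    return sum(i ** -alpha for i in range(1, terms + 1))

def partial_moment(alpha, k, terms):
    """Sum of x^k * P[X = x] over x = 1..terms for the discrete Pareto distribution."""
    norm = zeta_partial(alpha)
    return sum(x ** (k - alpha) for x in range(1, terms + 1)) / norm

# alpha = 3.5 > 2: the partial sums of E[X] settle down.
# alpha = 2.0:     the partial sums of E[X] keep growing like a logarithm.
for alpha in (3.5, 2.0):
    print(alpha, partial_moment(alpha, 1, 10_000), partial_moment(alpha, 1, 100_000))
```

With ten times as many terms the first value barely changes, whereas the second grows by roughly $\ln 10 / \zeta(2)$, in line with the condition $\alpha > 2$ for the expected value.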




Special Case: Zipf Distribution

George Kingsley Zipf claimed that the n-th most frequent word occurs with frequency f(n) such that $f(n) \cdot n = c$

Zipf probability distribution for $x \in \{1,2,3,\dots,n\}$

$P[X = x] = \dfrac{c}{x}$

with constant factor $c$ only defined for finite sets, since

$\sum_{i=1}^{n} \dfrac{1}{i} = \ln n + O(1)$

tends to infinity for growing n

Zipf distributions refer to ranks
– The Zipf exponent can be larger than 1, i.e. $f(n) = c/n^{\alpha}$ with $\alpha > 1$

Pareto distributions refer to absolute size
– e.g. number of inhabitants
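As a small illustration of the Zipf distribution on a finite set, the following Python sketch computes the probabilities $c/x$ exactly with rational arithmetic. The function name `zipf_pmf` and the choice of 10 ranks are assumptions for the example.

```python
from fractions import Fraction

def zipf_pmf(n):
    """Zipf probabilities c/x for ranks x = 1..n, where c = 1/H_n and H_n is the n-th harmonic number."""
    harmonic = sum(Fraction(1, i) for i in range(1, n + 1))
    c = 1 / harmonic
    return [c / x for x in range(1, n + 1)]

pmf = zipf_pmf(10)
print([float(p) for p in pmf])
# The constant c shrinks as the set grows, because the harmonic sum diverges.
print(float(zipf_pmf(10)[0]), float(zipf_pmf(1000)[0]))
```

Rank times probability is the same for every rank, which is exactly the statement $f(n) \cdot n = c$.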




Pareto Distribution (I)

Examples of power laws (= Pareto distributions)

– Pareto 1897: wealth/income in a population
– Yule 1944: word frequency in languages
– Zipf 1949: size of towns
– Length of molecule chains
– File length of UNIX files
– …
– Access density of web pages
– Access density of a web surfer at a particular web page
– …




City Size Distribution
Scaling Laws and Urban Distributions, Denise Pumain, 2003

[Figure: Zipf distribution]




Zipf’s Law and the Internet
Lada A. Adamic, Bernardo A. Huberman, 2002

[Figures: Pareto distribution]




Heavy-Tailed Probability Distributions in the World Wide Web
Mark Crovella, Murad Taqqu, Azer Bestavros, 1996




Size of connected components

The sizes of strongly and weakly connected components obey a power law

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. “Graph Structure in the Web: Experiments and Models.” In Proc. of the 9th World Wide Web Conference, pp. 309-320. Amsterdam: Elsevier Science, 2000.

Largest weakly connected component contains 91% of all web pages
Largest strongly connected component contains 28% of all web pages
– Diameter ≥ 28




Searching in Power Law Networks

Task:
– Given a network with undirected edges
– Degrees follow a power law
– From a source node
– Find a target node

Features
– Keep it simple
• No markers
– Visit one node at a time
– Every node knows its neighbors (and their degrees)

Three approaches
– Neighbors of random nodes
– Neighbors of a random walk:
• Move to a random neighbor and continue
– Neighbors of high degree seeking:
• Start with a random node
• Prefer neighbors with larger degree

From Adamic, Lukose, Puniyani, Huberman, “Search in power-law networks”, Physical Review E, Vol. 64, 046135




Power Law Networks

Undirected graph of n nodes
– The probability that a node has k neighbors is $p_k$
– where $p_k = c\,k^{-\tau}$ for a normalization factor c

For search in power law networks
– Consider the largest connected component and
– exponent $\tau$ with $2 < \tau < 3$

Theorem
– For large enough power law graphs with exponent $\tau$
• For $\tau < 1$ the graph is almost surely connected
• For $1 < \tau < 2$: there is a giant connected component of size $\Theta(n)$
• For $2 < \tau < 3.47875$: there is a giant component and all smaller components are of size $O(\log n)$
• For $\tau > 3.47875$: the graph almost surely has no giant component, i.e. all components have size $o(n)$
• For $\tau > 4$: the sizes of all connected components follow a power law
– by William Aiello, Fan Chung, Linyuan Lu, A Random Graph Model for Massive Graphs, Symposium on Theory of Computing (STOC) 2000
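For experiments it helps to have a concrete power law graph at hand. The Python sketch below draws degrees with $p_k \propto k^{-\tau}$ and pairs up edge stubs at random. This stub-pairing construction (often called the configuration model), the maximum degree `k_max`, and all function names are assumptions for illustration; it is not claimed to be the exact model analysed in the theorem above.

```python
import random
from collections import defaultdict

def sample_degrees(n, tau, k_max, rng):
    """Draw n degrees with P[k] proportional to k^-tau for 1 <= k <= k_max."""
    ks = list(range(1, k_max + 1))
    weights = [k ** -tau for k in ks]
    degrees = rng.choices(ks, weights=weights, k=n)
    if sum(degrees) % 2 == 1:   # stubs are paired up, so the total must be even
        degrees[0] += 1
    return degrees

def configuration_graph(degrees, rng):
    """Pair up edge stubs at random; self-loops and duplicate edges are dropped."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    adj = defaultdict(set)
    for a, b in zip(stubs[0::2], stubs[1::2]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

rng = random.Random(0)
degrees = sample_degrees(n=1000, tau=2.1, k_max=100, rng=rng)
graph = configuration_graph(degrees, rng)
print(max(len(nbrs) for nbrs in graph.values()), "is the largest realised degree")
```

Because self-loops and multi-edges are discarded, the realised degree of a node can be slightly smaller than the degree that was drawn for it.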




Random Walk

Random Walk:
  Start with a random node as node u
  while no neighbor of u is the target do
    u ← random neighbor of u
  od

Theorem
In undirected connected graphs every node is visited by a random walk with probability proportional to its degree (in the long run).

Conclusion:
– High degree nodes are preferred

Possible improvements
– Avoid going back
– Avoid visiting already visited nodes
– Also scan second-degree neighbors for the target node

[Figure] RW: random walk in a power law graph with exponent 2.1
– avoiding going back
– second-degree scanning
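The basic random walk search can be written down in a few lines of Python. This is a sketch under assumptions: the graph is a dictionary of neighbor sets, it is connected, and the step limit `max_steps` and the toy graph are added for the example. None of the listed improvements are included.

```python
import random

def random_walk_search(adj, start, target, rng, max_steps=100_000):
    """Walk until the target is the current node or one of its neighbors.

    Returns the number of steps taken, or None if max_steps is exceeded.
    """
    u, steps = start, 0
    while u != target and target not in adj[u]:
        if steps >= max_steps:
            return None
        u = rng.choice(sorted(adj[u]))   # uniformly random neighbor
        steps += 1
    return steps

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 with an extra leaf 4 attached to node 1.
adj = {0: {1}, 1: {0, 2, 4}, 2: {1, 3}, 3: {2}, 4: {1}}
steps = random_walk_search(adj, start=0, target=3, rng=random.Random(1))
print("target seen after", steps, "steps")
```

The walk stops as soon as the target shows up among the neighbors of the current node, matching the loop condition of the pseudocode.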




Degree Seeking

Degree Seeking:
  Start with a random node as node u
  while no neighbor of u is the target do
    u ← neighbor of u with the highest degree that was not visited so far
  od

Improvement:
– Also scan second-degree neighbors for the target

Observation:
– The search in power law networks is considerably faster

Why?

[Figure] RW: random walk in a power law graph with exponent 2.1
DS: degree seeking in the same graph
– avoiding already visited neighbors
– second-degree scanning
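A Python sketch of degree seeking, under the assumption that the graph is a dictionary of neighbor sets. The slides do not say what happens when all neighbors have already been visited, so returning None in that case is a simplification made here, as is the deterministic tie-breaking by node label.

```python
def degree_seeking_search(adj, start, target):
    """Always move to the unvisited neighbor with the highest degree.

    Returns the number of steps taken, or None at a dead end
    (behaviour at dead ends is an assumption, not from the lecture).
    """
    u, visited, steps = start, {start}, 0
    while u != target and target not in adj[u]:
        unvisited = sorted(v for v in adj[u] if v not in visited)
        if not unvisited:
            return None
        u = max(unvisited, key=lambda v: len(adj[v]))   # highest-degree neighbor
        visited.add(u)
        steps += 1
    return steps

# Hypothetical toy graph: node 2 is a small hub.
adj = {0: {1, 2}, 1: {0}, 2: {0, 3, 4}, 3: {2}, 4: {2, 5}, 5: {4}}
print(degree_seeking_search(adj, start=0, target=5))
```

In the toy graph the search moves from node 0 to the hub 2 and then to node 4, whose neighbor is the target.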




Comparison of Random Walk and Degree Seeking




Probability Generating Functions

For a discrete probability distribution X over {0,1,2,3,…} let $p_k$ be the probability of event $k \in \{0,1,2,3,\dots\}$

Then the generating function for the probability distribution is

$G(x) = \sum_{k=0}^{\infty} p_k x^k$

Probability values

$p_k = \dfrac{G^{(k)}(0)}{k!}$

– where $G^{(k)}$ is the k-th derivative of G

For probability distributions X and Y and their generating functions $G_X$, $G_Y$ we have

$G_X = G_Y \iff X \text{ and } Y \text{ have the same distribution}$




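Probability Generating Functions: Properties

Sum of probabilities

$G(1) = \sum_{k=0}^{\infty} p_k = 1$

Expectation

$E[X] = \sum_{k=0}^{\infty} k\,p_k = G'(1)$

If $X_1, \dots, X_n$ are independent discrete random variables and $G_{X_i}$ their generating functions, then for

$S = \sum_{i=1}^{n} X_i$

the generating function is

$G_S(x) = \prod_{i=1}^{n} G_{X_i}(x)$

This implies for $S = X_1 - X_2$, where $X_1$ and $X_2$ are independent

$G_S(x) = G_{X_1}(x)\,G_{X_2}(1/x)$

Let N be an independent random variable with values in {0,1,2,…}. Let $X_1, X_2, \dots$ be independent and identically distributed random variables with generating function $G_X$. Then for the random sum $X_N = \sum_{i=1}^{N} X_i$ the generating function is given by

$G_{X_N}(x) = G_N(G_X(x))$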
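These properties can be checked on small examples. In the Python sketch below a generating function with finitely many terms is stored as its coefficient list $[p_0, p_1, \dots]$; the helper names and the die and coin examples are assumptions chosen for illustration.

```python
from fractions import Fraction

def multiply(a, b):
    """Coefficients of the product of two generating functions (sum of independent variables)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def compose(outer, inner):
    """Coefficients of outer(inner(x)), evaluated with Horner's rule (random sum)."""
    result = [Fraction(0)]
    for coeff in reversed(outer):
        result = multiply(result, inner)
        result[0] += coeff
    return result

def pgf_mean(g):
    """Expectation G'(1) = sum of k * p_k."""
    return sum(k * p for k, p in enumerate(g))

die = [Fraction(0)] + [Fraction(1, 6)] * 6      # fair die with values 1..6
coin = [Fraction(1, 2), Fraction(1, 2)]         # fair coin with values 0 and 1

two_dice = multiply(die, die)                   # sum of two independent dice
heads = compose(die, coin)                      # number of heads in N tosses, N a die roll
print(pgf_mean(die), two_dice[7], pgf_mean(heads))
```

The mean of the random sum equals the product of the two means, which is what differentiating $G_N(G_X(x))$ at $x = 1$ gives.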




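Probability Generating Functions: Examples

Remember that

$e^{x} = \sum_{k=0}^{\infty} \dfrac{x^k}{k!}$

Example:
– Consider the random variable
– then the generating function is

Poisson probability distribution with parameter $\lambda$

$p_k = e^{-\lambda}\,\dfrac{\lambda^k}{k!}$

– Generating function: $G(x) = \sum_{k=0}^{\infty} e^{-\lambda}\,\dfrac{(\lambda x)^k}{k!} = e^{\lambda(x-1)}$

Pareto (power law) probability distribution

$p_k = c\,k^{-\alpha}$ for $k \ge 1$

– Generating function: $G(x) = c \sum_{k=1}^{\infty} k^{-\alpha} x^k$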




Analyzing Power Law Graphs

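Consider the generating function for the degree

$G_0(x) = \sum_{k=0}^{\infty} p_k x^k$

Let $p_k = 0$ for all $k > m = n^{1/\tau}$ and for $k = 0$

Hence, the generating function is

$G_0(x) = c \sum_{k=1}^{m} k^{-\tau} x^k$

Choose the normalization factor c such that

$G_0(1) = c \sum_{k=1}^{m} k^{-\tau} = 1$

Then, the average degree is given by

$G_0'(1) = c \sum_{k=1}^{m} k^{1-\tau}$

If $m > n^{1/\tau}$ then $p_m < n^{-1}$
– This means that fewer than one node of degree m exists in expectation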
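The effect of the cut-off can be checked numerically. The following Python sketch computes the cut-off, the normalization factor and the average degree for one choice of n and $\tau$; rounding the cut-off down to an integer and the values $n = 10^6$, $\tau = 2.1$ are assumptions for the example.

```python
def truncated_power_law(n, tau):
    """Return (m, c, p) with cut-off m = floor(n^(1/tau)) and p[k] = c * k^-tau for 1 <= k <= m."""
    m = int(n ** (1.0 / tau))
    c = 1.0 / sum(k ** -tau for k in range(1, m + 1))
    p = {k: c * k ** -tau for k in range(1, m + 1)}
    return m, c, p

n, tau = 10**6, 2.1
m, c, p = truncated_power_law(n, tau)
average_degree = sum(k * pk for k, pk in p.items())    # G_0'(1)
expected_above_cutoff = n * c * (m + 1) ** -tau         # expected number of nodes of degree m + 1
print(m, round(c, 4), round(average_degree, 3), round(expected_above_cutoff, 3))
```

Just above the cut-off the expected number of nodes with that degree drops below one, which is the justification for setting those probabilities to zero.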




The Average Degree

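Average degree of a node

$\langle k \rangle = \sum_{k} k\,p_k = G_0'(1)$

A random edge chooses high degree nodes with higher probability
– if a node has k edges then the probability increases (for large networks) by a factor of k
– i.e. probability $p'(k) \sim k\,p_k$
– the corresponding normalized generating function is

$\dfrac{\sum_k k\,p_k x^k}{\sum_k k\,p_k} = x\,\dfrac{G_0'(x)}{G_0'(1)}$

The degree distribution of a node reached by one random walk step, not counting the edge used to arrive, is given by this function shifted by one place, i.e.

$G_1(x) = \dfrac{G_0'(x)}{G_0'(1)}$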




The Neighbor’s Degree

Assume that
– a node “knows” the degree of all neighbors
– the probability that any second neighbor is connected to more than one first neighbor can be neglected
• Then, the degrees of the first neighbors and second neighbors are independent
• Second neighbors are the neighbors in the next step

Let $z_{2a}$ denote the average number of second neighbors starting from a random node
– Choose N according to $G_0$
– Choose $X_i$ according to $G_1$
– Consider $X_N = \sum_{i=1}^{N} X_i$ and its generating function $G_0(G_1(x))$
– Then $z_{2a} = \left.\dfrac{d}{dx}\, G_0(G_1(x))\right|_{x=1} = G_0'(1)\,G_1'(1)$

Let $z_{2b}$ denote the average number of second neighbors starting from a node chosen by a random edge
– Choose N according to $G_1$
– Choose $X_i$ according to $G_1$
– Consider $X_N = \sum_{i=1}^{N} X_i$ and its generating function $G_1(G_1(x))$
– Then $z_{2b} = \left.\dfrac{d}{dx}\, G_1(G_1(x))\right|_{x=1} = [G_1'(1)]^2$




Random Walks outperform Random Nodes

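Let $z_{2a}$ denote the average number of second neighbors starting from a random node

The degree is dependent on the cut-off value $m = \Theta(n^{1/\tau})$

For $2 < \tau < 3$ one can obtain

$G_0'(1) = c \sum_{k=1}^{m} k^{1-\tau} = \Theta(1)$ and $G_1'(1) = \dfrac{c \sum_{k=1}^{m} k(k-1)\,k^{-\tau}}{G_0'(1)} = \Theta(m^{3-\tau})$

Hence,

$z_{2a} = G_0'(1)\,G_1'(1) = \Theta(m^{3-\tau}) = \Theta(n^{3/\tau - 1})$

Let $z_{2b}$ denote the average number of second neighbors starting from a node chosen by a random edge

The degree is dependent on the cut-off value $m = \Theta(n^{1/\tau})$

For $2 < \tau < 3$ one can obtain

$G_1'(1) = \Theta(m^{3-\tau})$

Hence,

$z_{2b} = [G_1'(1)]^2 = \Theta(m^{2(3-\tau)}) = \Theta(n^{2(3/\tau - 1)})$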
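The two quantities can also be evaluated directly from the truncated degree distribution. The Python sketch below assumes $p_k = c\,k^{-\tau}$ for $1 \le k \le m$ with the cut-off rounded down to an integer, and uses $\tau = 2.1$ as an example value; the function name is hypothetical.

```python
def second_neighbor_estimates(n, tau):
    """Return (G0'(1), z2a, z2b) for p_k = c * k^-tau, 1 <= k <= m = floor(n^(1/tau))."""
    m = int(n ** (1.0 / tau))
    c = 1.0 / sum(k ** -tau for k in range(1, m + 1))
    # G_0'(1): average degree of a random node
    g0_prime = c * sum(k ** (1 - tau) for k in range(1, m + 1))
    # G_1'(1): average number of further neighbors of a node reached along a random edge
    g1_prime = c * sum(k * (k - 1) * k ** -tau for k in range(1, m + 1)) / g0_prime
    return g0_prime, g0_prime * g1_prime, g1_prime ** 2

for n in (10**4, 10**6, 10**8):
    mean, z2a, z2b = second_neighbor_estimates(n, 2.1)
    print(n, round(mean, 2), round(z2a, 1), round(z2b, 1))
```

Since the average degree stays bounded for $2 < \tau < 3$, the second value grows like the square of the first up to a constant factor.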




Conclusions

The number of nodes in the neighborhood of the nodes of a random walk is approximately the square of the number of nodes neighboring random points of the network

This effect can be increased if we prefer the neighbor with the highest degree

This improves the search in power law networks
– because more neighbors are in reach

In random graphs (Poisson graphs) this technique does not help as much
– since the degree distribution is sharply concentrated around the expectation.




Thanks for your attention
End of 10th lecture
Happy X-mas and a happy new year
Next lecture: Mo 10 Jan 2005, 11.15 am, FU 116
Next exercise class: Mo 20 Dec 2004, 1.15 pm, F0.530 or We 22 Dec 2004, 1.00 pm, E2.316
