HEINZ NIXDORF INSTITUTE, University of Paderborn
Algorithms and Complexity, Christian Schindelhauer
Search Algorithms, Winter Semester 2004/2005
20 Dec 2004, 10th Lecture
Chapter III: Searching the Web
20 Dec 2004
Searching the Web
Introduction
The Anatomy of a Search Engine
Google’s PageRank algorithm
– The simple algorithm
– Periodicity and convergence
Kleinberg’s HITS algorithm
– The algorithm
– Convergence
The Structure of the Web
– Pareto distributions
– Search in Pareto-distributed graphs
The Webgraph
G_WWW:
– Static HTML pages are nodes
– Links are directed edges
Outdegree of a node: number of links on a web page (links leaving it)
Indegree of a node: number of links pointing to a web page
Directed path from node u to node v
– A sequence of web pages in which one follows links from page u to page v
Undirected path (u = w_0, w_1, …, w_{m-1}, w_m = v) from page u to page v
– For all i:
• There is a link from w_i to w_{i+1} or from w_{i+1} to w_i
Strongly (weakly) connected subgraph
– Minimal node set that includes all nodes having a directed (undirected) path from and to a reference node
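The definitions above can be tried out on a tiny example. The following is a minimal sketch, assuming the web graph is given as a list of directed links (u, v); the function names and the toy graph are hypothetical and only serve as illustration.

```python
from collections import defaultdict

def degrees(edges):
    """Return (outdegree, indegree) dictionaries for a list of directed links (u, v)."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return dict(out_deg), dict(in_deg)

def reachable(start, adjacency):
    """All nodes reachable from start by following adjacency lists (iterative DFS)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def strong_component(ref, edges):
    """Nodes that have a directed path both from and to the reference node."""
    forward, backward = defaultdict(list), defaultdict(list)
    for u, v in edges:
        forward[u].append(v)
        backward[v].append(u)
    return reachable(ref, forward) & reachable(ref, backward)

def weak_component(ref, edges):
    """Nodes connected to the reference node when link directions are ignored."""
    undirected = defaultdict(list)
    for u, v in edges:
        undirected[u].append(v)
        undirected[v].append(u)
    return reachable(ref, undirected)

# Hypothetical toy web graph: a, b, c form a cycle; d and e hang off it.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "d")]
out_deg, in_deg = degrees(links)
print(out_deg, in_deg)
print(strong_component("a", links), weak_component("a", links))
```

In this toy graph the strongly connected set of page a is the cycle a, b, c, while the weakly connected set contains all five pages.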
The Web-Graph (1999)
Distributions of indegree/outdegree
In- and out-degree obey a power law
– i.e. in- and out-degree i appear with probability ~ 1/i^α
According to experiments of
– Kumar et al. 97: 40 million web pages
– Barabasi et al. 99: domain *.nd.edu + web pages with distance 3
– Broder et al. 00: 204 million web pages (scans in May and Oct 1999)
Is the Web-Graph a Random Graph? No!
Random graph G_{n,p}:
– n nodes
– Every directed edge occurs with probability p
Is the web graph a random graph G_{n,p}?
– In a random graph the degrees are distributed according to a Poisson distribution
– Hence the probability of high degrees decreases exponentially
Therefore: the degree distribution of a random graph does not obey a power law
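To see how different the two tails are, one can compare the probability of a large degree under both models. The sketch below is only an illustration: the mean degree 5 and the exponent 2.1 are assumed values, and the zeta normalization is approximated by a truncated sum.

```python
import math

def poisson_pmf(k, lam):
    """P[X = k] for a Poisson distribution with mean lam, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def power_law_pmf(k, alpha, terms=100_000):
    """P[X = k] = k^(-alpha) / zeta(alpha); zeta is approximated by a truncated sum."""
    zeta = sum(i ** -alpha for i in range(1, terms + 1))
    return k ** -alpha / zeta

# Assumed parameters: mean degree 5 for the random graph, exponent 2.1 for the power law.
for k in (1, 10, 100):
    print(k, poisson_pmf(k, 5.0), power_law_pmf(k, 2.1))
```

For degree 100 the Poisson probability is astronomically small, while the power-law probability is still of a size that shows up in a graph with millions of nodes.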
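Pareto Distribution
Discrete Pareto (power law) distribution for x ∈ {1,2,3,…}:
$P[X = x] = \frac{1}{\zeta(\alpha)\, x^{\alpha}}$
with constant factor
$\zeta(\alpha) = \sum_{i=1}^{\infty} \frac{1}{i^{\alpha}}$
(also known as the Riemann zeta function)
Heavy tail property
– Not all moments E[X^k] are defined
– The expected value exists if and only if α > 2
– Variance and E[X^2] exist if and only if α > 3
– E[X^k] is defined if and only if α > k+1
Density function of the continuous distribution for x > x_0:
$f(x) = \frac{\alpha - 1}{x_0} \left(\frac{x_0}{x}\right)^{\alpha}$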
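The moment conditions can be checked directly from the definition. The following short derivation is a sketch that assumes the discrete form $P[X = x] = x^{-\alpha} / \zeta(\alpha)$ for x ∈ {1,2,3,…}:

```latex
\mathbb{E}[X^k]
  = \sum_{x=1}^{\infty} x^k \cdot \frac{x^{-\alpha}}{\zeta(\alpha)}
  = \frac{1}{\zeta(\alpha)} \sum_{x=1}^{\infty} \frac{1}{x^{\alpha-k}}
  = \frac{\zeta(\alpha-k)}{\zeta(\alpha)}
  \quad\text{if } \alpha - k > 1,
\qquad
\mathbb{E}[X^k] = \infty
  \quad\text{if } \alpha - k \le 1 .
```

The series $\sum_x x^{-s}$ converges exactly for s > 1, which gives the condition α > k+1 for the k-th moment.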
Special Case: Zipf Distribution
George Kingsley Zipf claimed that the n-th most frequent word occurs with frequency f(n) such that f(n) · n = c
Zipf probability distribution for x ∈ {1,2,3,…,n}:
$P[X = x] = \frac{c}{x}$
with constant factor c; only defined for finite sets, since
$\sum_{i=1}^{n} \frac{1}{i}$
tends to infinity for growing n
Zipf distributions refer to ranks
– The Zipf exponent can be larger than 1, i.e. f(n) = c/n^α
Pareto distributions refer to absolute size
– e.g. number of inhabitants
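As a small illustration of why the Zipf distribution needs a finite set, the sketch below normalizes the weights 1/x over the ranks 1..n. The function name is hypothetical; the constant c is the reciprocal of the n-th harmonic number, which shrinks as n grows.

```python
def zipf_pmf(n):
    """Zipf probabilities c/x for the ranks x = 1..n, with c = 1 / (1 + 1/2 + ... + 1/n)."""
    harmonic = sum(1.0 / x for x in range(1, n + 1))
    c = 1.0 / harmonic
    return [c / x for x in range(1, n + 1)]

probs = zipf_pmf(1000)
# The most frequent item is ten times as likely as the item of rank 10.
print(probs[0], probs[9], probs[0] / probs[9])
```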
Pareto Distribution (I)
Examples of power laws (= Pareto distributions)
– Pareto 1897: Wealth/income in a population
– Yule 1944: Word frequency in languages
– Zipf 1949: Size of towns
– Length of molecule chains
– File length of UNIX files
– …
– Access density of web pages
– Access density of a web surfer at a particular web page
– …
City Size Distribution
Scaling Laws and Urban Distributions, Denise Pumain, 2003
[Figure: Zipf distribution]
Zipf’s Law and the Internet
Lada A. Adamic, Bernardo A. Huberman, 2002
[Figures on three slides; the first pair is labeled "Pareto distribution"]
Heavy-Tailed Probability Distributions in the World Wide Web
Mark Crovella, Murad Taqqu, Azer Bestavros, 1996
[Two figures]
Size of connected components
Strongly and weakly connected components obey a power law
– A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. “Graph Structure in the Web: Experiments and Models.” In Proc. of the 9th World Wide Web Conference, pp. 309-320. Amsterdam: Elsevier Science, 2000.
Largest weakly connected component contains 91% of all web pages
Largest strongly connected component contains 28% of all web pages
– Diameter ≥ 28
Searching in Power Law Networks
Task:
– Given a network with undirected edges
– Degrees follow a power law
– Starting from a source node
– Find a target node
Features
– Keep it simple
• no markers
– Visit one node at a time
– Every node knows its neighbors (and their degrees)
Three approaches
– Neighbors of random nodes
– Neighbors of a random walk:
• Move to a random neighbor and continue
– Neighbors of high degree seeking:
• Start with a random node
• Prefer neighbors with larger degree
From Adamic, Lukose, Puniyani, Huberman, “Search in power-law networks”, Physical Review E, Vol. 64, 046135
Power Law Networks
Undirected graph of n nodes
– The probability that a node has k neighbors is p_k
– where p_k = c k^{-τ} for a normalization factor c
For search in power law networks
– Consider the largest connected component and
– exponent τ with 2 < τ < 3
Theorem
– For large enough power law graphs with exponent τ:
• For τ < 1 the graph is almost surely connected
• For 1 < τ < 2: there is a giant connected component of size Θ(n)
• For 2 < τ < 3.4785: there is a giant component and all smaller components are of size O(log n)
• For τ > 3.4785: the graph almost surely has no giant component, i.e. all components have size o(n)
• For τ > 4: the sizes of all connected components follow a power law
– by William Aiello, Fan Chung, Linyuan Lu, “A Random Graph Model for Massive Graphs”, Symposium on Theory of Computing (STOC) 2000
Random Walk
Random Walk:
  Start with a random node as node u
  while no neighbor of u is the target do
    u ← random neighbor of u
  od
Theorem
– In undirected connected graphs every node is visited by a random walk with probability proportional to its degree (in the long run).
Conclusion:
– High degree nodes are preferred
Possible improvements
– Avoid going back
– Avoid visiting already visited nodes
– Also scan second-degree neighbors for the target node
[Figure: RW, a random walk in a power law graph with exponent 2.1, avoiding going back and using second-degree scanning]
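The random walk search can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the experiments: the graph representation (a dictionary mapping each node to a list of neighbors), the function name, the step limit, and the toy graph are all assumptions, and none of the listed improvements are included.

```python
import random

def random_walk_search(graph, start, target, rng, max_steps=10_000):
    """Walk randomly until the current node is the target or has it as a neighbor.

    graph maps each node to the list of its neighbors (undirected graph).
    Returns the list of visited nodes, or None if max_steps moves were not enough.
    """
    u = start
    path = [u]
    while True:
        if u == target or target in graph[u]:
            return path
        if len(path) > max_steps:
            return None
        u = rng.choice(graph[u])  # move to a uniformly random neighbor
        path.append(u)

# Hypothetical toy graph; node 5 can only be reached through node 4.
toy_graph = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
path = random_walk_search(toy_graph, start=0, target=5, rng=random.Random(1))
print(path)
```

Passing the random number generator as a parameter keeps runs reproducible, which makes it easier to compare search strategies on the same graph.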
Degree Seeking
Degree Seeking:
  Start with a random node as node u
  while no neighbor of u is the target do
    u ← neighbor of u with the highest degree that was not visited so far
  od
Improvement:
– Also scan second-degree neighbors for the target
Observation:
– The search in power law networks is considerably faster
Why?
[Figure: RW, a random walk in a power law graph with exponent 2.1; DS, degree seeking in the same graph, avoiding already visited neighbors and using second-degree scanning]
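A degree seeking search can be sketched in the same style. Again this is only an illustration under assumptions: the graph is a dictionary of neighbor lists, every node has at least one neighbor, and the fallback used when all neighbors were already visited (move to the highest-degree neighbor anyway) is a choice made here, since the pseudocode does not specify that case.

```python
def degree_seeking_search(graph, start, target, max_steps=10_000):
    """Always move to the unvisited neighbor with the highest degree.

    graph maps each node to the list of its neighbors (undirected graph).
    Returns the list of visited nodes, or None if max_steps moves were not enough.
    """
    u = start
    path = [u]
    visited = {u}
    while True:
        if u == target or target in graph[u]:
            return path
        if len(path) > max_steps:
            return None
        unvisited = [v for v in graph[u] if v not in visited]
        # Assumption: if every neighbor was already visited, consider all neighbors.
        candidates = unvisited or graph[u]
        u = max(candidates, key=lambda v: len(graph[v]))
        visited.add(u)
        path.append(u)

# Hypothetical toy graph with one hub; the target "t" hangs off node "c".
hub_graph = {
    "s": ["a"], "a": ["s", "hub"], "hub": ["a", "b", "c", "d"],
    "b": ["hub"], "c": ["hub", "t"], "d": ["hub"], "t": ["c"],
}
ds_path = degree_seeking_search(hub_graph, start="s", target="t")
print(ds_path)
```

The search is drawn to the hub first and then to the best-connected unvisited neighbor of the hub, which is the node adjacent to the target in this example.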
Comparison of Random Walk and Degree Seeking
Probability Generating Functions
For a discrete probability distribution X over {0,1,2,3,…} let p_k be the probability of the event k ∈ {0,1,2,3,…}
Then the generating function for the probability distribution is
$G(x) = \sum_{k=0}^{\infty} p_k x^k$
Probability values
$p_k = \frac{G^{(k)}(0)}{k!}$
– where G^{(k)} is the k-th derivative of G
For probability distributions X and Y and their generating functions G_X, G_Y we have
[formula missing]
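Probability Generating Functions: Properties
Sum of probabilities
$G(1) = \sum_{k=0}^{\infty} p_k = 1$
Expectation
$E[X] = \sum_{k=0}^{\infty} k\, p_k = G'(1)$
If X_i are independent discrete random variables and G_{X_i} their generating functions, then for
$S = \sum_{i=1}^{n} X_i$
the generating function is
$G_S(x) = \prod_{i=1}^{n} G_{X_i}(x)$
This implies for S = X_1 - X_2, where X_1 and X_2 are independent
$G_S(x) = G_{X_1}(x)\, G_{X_2}(1/x)$
Let N be an independent random variable and let X_1, X_2, … be independent and identically distributed random variables with generating function G_X. Then for the random sum of the first N variables, denoted X_N, the generating function is given by
$G_{X_N}(x) = G_N(G_X(x))$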
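For distributions with finitely many values, these properties are easy to check numerically, because the generating function is then a polynomial whose coefficients are the probabilities. The sketch below uses plain Python lists as coefficient vectors; the helper names and the die and coin examples are assumptions for illustration.

```python
def pgf_eval(p, x):
    """Evaluate G(x) = sum_k p[k] * x**k for a finite distribution p."""
    return sum(pk * x ** k for k, pk in enumerate(p))

def pgf_derivative_at_one(p):
    """G'(1) = sum_k k * p[k], which equals the expectation."""
    return sum(k * pk for k, pk in enumerate(p))

def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q.

    The result holds the coefficients of the product polynomial G_X(x) * G_Y(x).
    """
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

die = [0.0] + [1 / 6] * 6   # fair die with values 1..6
coin = [0.5, 0.5]           # fair coin with values 0 and 1
total = convolve(die, coin)

print(pgf_eval(die, 1.0))             # sum of probabilities
print(pgf_derivative_at_one(die))     # expectation of the die
print(pgf_derivative_at_one(total))   # expectation of die + coin
```

Evaluating both sides at an arbitrary point, for example x = 0.7, confirms that the generating function of the sum is the product of the individual generating functions.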
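Probability Generating Functions: Examples
Remember that
[formula missing]
Example:
– Consider the random variable
[formula missing]
– then the generating function is
[formula missing]
Poisson probability distribution with
$p_k = e^{-\lambda} \frac{\lambda^k}{k!}$
– Generating function:
$G(x) = \sum_{k=0}^{\infty} e^{-\lambda} \frac{(\lambda x)^k}{k!} = e^{\lambda (x-1)}$
Pareto (power law) probability distribution
$p_k = c\, k^{-\tau}$ for k ≥ 1, with generating function $G(x) = c \sum_{k=1}^{\infty} k^{-\tau} x^k$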
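Analyzing Power Law Graphs
Consider the generating function for the degree
$G_0(x) = \sum_{k=0}^{\infty} p_k x^k$
Let p_k = 0 for k = 0 and for all k > m = n^{1/τ}
– If m > n^{1/τ} then p_m < n^{-1}
– This means that fewer than one node of degree m exists in expectation
Hence, the generating function is
$G_0(x) = c \sum_{k=1}^{m} k^{-\tau} x^k$
Choose the normalization factor c such that
$G_0(1) = c \sum_{k=1}^{m} k^{-\tau} = 1$
Then, the average degree is given by
$G_0'(1) = c \sum_{k=1}^{m} k^{1-\tau}$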
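The cut-off, the normalization factor, and the average degree are easy to compute for concrete numbers. The following sketch assumes a power-law exponent written tau, a degree distribution p_k = c k^(-tau) for 1 ≤ k ≤ m, and the cut-off m = floor(n^(1/tau)); the function names and the values n = 10^6 and tau = 2.1 are illustrative assumptions.

```python
def truncated_power_law(n, tau):
    """Return (m, c, p) with p[k] = c * k**(-tau) for 1 <= k <= m and p[0] = 0.

    The cut-off is m = floor(n**(1/tau)); c normalizes the probabilities to sum 1.
    """
    m = int(n ** (1.0 / tau))
    weights = [0.0] + [k ** -tau for k in range(1, m + 1)]
    c = 1.0 / sum(weights)
    p = [c * w for w in weights]
    return m, c, p

def average_degree(p):
    """Expectation sum_k k * p[k], i.e. the derivative of the generating function at 1."""
    return sum(k * pk for k, pk in enumerate(p))

# Assumed example values: one million nodes, exponent 2.1.
m, c, p = truncated_power_law(n=10**6, tau=2.1)
z1 = average_degree(p)
print(m, c, z1)
```

Even with a million nodes the cut-off degree stays in the hundreds and the average degree stays a small constant, although individual nodes can have very many neighbors.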
The Average Degree
Average degree of a node
$\sum_{k} k\, p_k = G_0'(1)$
A random edge chooses high degree nodes with higher probability
– if a node has k edges then the probability increases (for large networks) by a factor of k
– i.e. probability p'(k) ∝ k p_k
– the corresponding normalized probability generating function is
$\frac{\sum_k k\, p_k x^k}{\sum_k k\, p_k} = x\, \frac{G_0'(x)}{G_0'(1)}$
The probability function of a node after one random walk step is given by this function shifted by one place, i.e.
$G_1(x) = \frac{G_0'(x)}{G_0'(1)}$
The Neighbor’s Degree
Assume that
– a node “knows” the degree of all neighbors
– the probability that any second neighbor is connected to more than one first neighbor can be neglected
• Then, the degrees of the first neighbors and of the second neighbors are independent
• Second neighbors are the neighbors in the next step
Let z_2a denote the average number of second neighbors starting from a random node
– Choose N according to G_0
– Choose X_i according to G_1
– Consider X_N and the generating function
$G_0(G_1(x))$
– Then
$z_{2a} = \frac{d}{dx}\, G_0(G_1(x)) \Big|_{x=1} = G_0'(1)\, G_1'(1)$
Let z_2b denote the average number of second neighbors starting from a node chosen by a random edge
– Choose N according to G_1
– Choose X_i according to G_1
– Consider X_N and the generating function
$G_1(G_1(x))$
– Then
$z_{2b} = \frac{d}{dx}\, G_1(G_1(x)) \Big|_{x=1} = [G_1'(1)]^2$
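Random Walks outperform Random Nodes
Let z_2a denote the average number of second neighbors starting from a random node
– The degree is dependent on the cut-off value m = Θ(n^{1/τ})
– For 2 < τ < 3 one can obtain
$G_0'(1) = \Theta(1)$ and $G_1'(1) = \Theta(m^{3-\tau}) = \Theta(n^{3/\tau - 1})$
– Hence,
$z_{2a} = G_0'(1)\, G_1'(1) = \Theta(n^{3/\tau - 1})$
Let z_2b denote the average number of second neighbors starting from a node chosen by a random edge
– The degree is dependent on the cut-off value m = Θ(n^{1/τ})
– For 2 < τ < 3 one can obtain
$G_1'(1) = \Theta(n^{3/\tau - 1})$
– Hence,
$z_{2b} = [G_1'(1)]^2 = \Theta(n^{2(3/\tau - 1)})$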
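The two second-neighbor counts can also be evaluated numerically. The sketch below assumes a degree distribution p_k = c k^(-tau) for 1 ≤ k ≤ m with cut-off m = floor(n^(1/tau)), uses G_1'(1) = G_0''(1) / G_0'(1), and takes the second-neighbor counts as G_0'(1) G_1'(1) for a random start node and [G_1'(1)]^2 for a node reached over a random edge. The function name and the parameter values are illustrative assumptions.

```python
def second_neighbor_counts(n, tau):
    """Return (z1, z2a, z2b) for p_k = c * k**(-tau), 1 <= k <= m = floor(n**(1/tau)).

    z1  = G0'(1)          average degree of a random node
    z2a = G0'(1) * G1'(1) second neighbors of a random node
    z2b = G1'(1) ** 2     second neighbors of a node reached over a random edge
    """
    m = int(n ** (1.0 / tau))
    ks = range(1, m + 1)
    norm = sum(k ** -tau for k in ks)
    g0_prime = sum(k ** (1 - tau) for k in ks) / norm            # E[k]
    g0_second = sum(k * (k - 1) * k ** -tau for k in ks) / norm  # E[k(k-1)]
    g1_prime = g0_second / g0_prime                              # G1'(1)
    return g0_prime, g0_prime * g1_prime, g1_prime ** 2

# Assumed example values: exponent 2.1 and growing network sizes.
for n in (10**4, 10**6, 10**8):
    z1, z2a, z2b = second_neighbor_counts(n, tau=2.1)
    print(n, round(z1, 2), round(z2a, 1), round(z2b, 1))
```

Since the average degree stays bounded, z2b grows like the square of z2a, which is the quantitative reason why following edges reaches far more second neighbors than sampling random nodes.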
Conclusions
The number of nodes in the neighborhood of the nodes of a random walk is approximately the square of the number of nodes neighboring random points of the network
This effect can be increased if we prefer the neighbor with the highest degree
This improves the search in power law networks
– because more neighbors are within reach
In random graphs (Poisson graphs) this technique does not help as much
– since the degree distribution is sharply concentrated around the expectation.
Thanks for your attention
End of 10th lecture
Happy X-mas and a happy new year
Next lecture: Mo 10 Jan 2005, 11.15 am, FU 116
Next exercise class: Mo 20 Dec 2004, 1.15 pm, F0.530 or We 22 Dec 2004, 1.00 pm, E2.316