HEINZ NIXDORF INSTITUTE, University of Paderborn
Algorithms and Complexity, Christian Schindelhauer
Search Algorithms, Winter Semester 2004/2005
15 Nov 2004, 5th Lecture
Christian Schindelhauer [email protected]




Chapter II: Searching in Compressed Text
15 Nov 2004


Searching in Compressed Text (Overview)

What is Text Compression
  – Definition
  – The Shannon Bound
  – Huffman Codes
  – The Kolmogorov Measure

Searching in Non-adaptive Codes
  – KMP in Huffman Codes

Searching in Adaptive Codes
  – The Lempel-Ziv Codes
  – Pattern Matching in Z-Compressed Files
  – Adapting Compression for Searching


Ziv-Lempel-Welch (LZW)-Codes

From the Ziv-Lempel family
  – LZ77, LZSS, LZ78, LZW, LZMW, LZAP

Literature
  – LZW: Terry A. Welch: "A Technique for High-Performance Data Compression", IEEE Computer, vol. 17, no. 6, June 1984, pp. 8-19
  – LZ77: J. Ziv, A. Lempel: "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, pp. 337-343
  – LZ78: J. Ziv, A. Lempel: "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, pp. 530-536

Known as the Unix command “compress”

Uses:
  – Tries


Trie = “reTRIEval TREE”

Name taken from “reTRIEval” tree
  – for storing/encoding text
  – efficient search for equal prefixes

Structure
  – Edges are labelled with letters
  – Nodes are numbered

Mapping
  – Every node encodes a word of the text
  – The text of a node can be read on the path from the root to the node
      • Node 1 = “m”
      • Node 6 = “at”
  – Inverse direction: every word uniquely points at a node (or at least some prefix points to a leaf)
      • “it” = node 11
      • “manaman” points with “m” to node 1

Encoding of “manamanatapitipitipi”
  – 1,2,3,4,5,6,7,8,9,10,11,12 or
  – 1,5,4,5,6,7,11,10,11,10,8

Decoding of 5,11,2
  – “an”, “it”, “a” = anita

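[Figure: trie with root 0 and nodes 1-12. Edges below the root: m→1, a→2, n→3, i→8, t→9; below node 2: m→4, n→5, t→6, p→7; below node 8: p→10, t→11; below node 10: i→12]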
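The mapping between node numbers and words can be sketched in a few lines of Python. The parent table below is an assumption reconstructed from the node examples and the two encodings of “manamanatapitipitipi” given on the slide; the names PARENT, word and decode are illustrative only.

```python
# Assumed trie: node -> (parent node, edge letter); node 0 is the root.
PARENT = {
    1: (0, "m"), 2: (0, "a"), 3: (0, "n"), 4: (2, "m"),
    5: (2, "n"), 6: (2, "t"), 7: (2, "p"), 8: (0, "i"),
    9: (0, "t"), 10: (8, "p"), 11: (8, "t"), 12: (10, "i"),
}


def word(node):
    """Read the word of a node on the path from the root to the node."""
    letters = []
    while node != 0:
        node, letter = PARENT[node]
        letters.append(letter)
    return "".join(reversed(letters))


def decode(nodes):
    """Concatenate the words of a sequence of node numbers."""
    return "".join(word(v) for v in nodes)


print(decode([5, 11, 2]))
print(decode([1, 5, 4, 5, 6, 7, 11, 10, 11, 10, 8]))
```

Both encodings listed on the slide decode to the same text, which illustrates that a text can have several encodings over one trie.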


How LZW builds a TRIE

LZW
  – works bytewise
  – starts with the 256-leaf trie with leaves “a”, “b”, ... numbered with “a”, “b”, ...

LZW-Trie-Builder(T)
    n ← length(T)
    i ← 1
    TRIE ← start-TRIE
    m ← largest node number in TRIE
    u ← root(TRIE)
    while i ≤ n do
        if no edge with label T[i] under u then
            m ← m+1
            append leaf m to u with edge label T[i]
            u ← root(TRIE)
        else
            u ← node under u with edge label T[i]
        fi
        i ← i+1
    od

[Figure: start-TRIE with root “-” and leaves a, b, c, d, ..., z]

Example: nanananananana
  – Scanned: na (appended as a new leaf “na” below the leaf “n”)


How LZW builds a TRIE (continued)

Example: nanananananana
  – Scanned: na
  – Continue with: nan
  – Residual part: anananana
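The trie builder can be sketched in Python as follows. This is a minimal sketch under two assumptions: the trie is stored as a dictionary keyed by (parent node, letter) in place of an explicit tree, and every character of the text has a code below 256 so that the start-TRIE leaves can be addressed by ord(). The function name lzw_trie_nodes is hypothetical.

```python
def lzw_trie_nodes(text):
    """Return {node number: word} for all nodes added to the start-TRIE."""
    children = {}      # (parent node, letter) -> child node
    label = {}         # new node number -> word spelled by that node
    m = 255            # largest node number so far (start-TRIE uses 0..255)
    u, word = None, ""  # None stands for the root
    for ch in text:
        word += ch
        if u is None:
            u = ord(ch)                  # start-TRIE: every byte is a leaf
        elif (u, ch) in children:
            u = children[(u, ch)]        # follow the existing edge
        else:
            m += 1                       # no edge: append new leaf m to u
            children[(u, ch)] = m
            label[m] = word
            u, word = None, ""           # restart at the root
    return label


print(lzw_trie_nodes("nanananananana"))
```

For the example text the first two nodes created are “na” and “nan”, matching the scanning steps listed above.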


How LZW produces the encoding

LZW-Encoder(T)
    n ← length(T)
    i ← 1
    TRIE ← start-TRIE
    m ← largest node number in TRIE (255)
    u ← root(TRIE)
    while i ≤ n do
        if no edge with label T[i] under u then
            m ← m+1
            output (m, u, T[i])
            append leaf m to u with edge label T[i]
            u ← root(TRIE)
        else
            u ← node under u with edge label T[i]
        fi
        i ← i+1
    od
    if u ≠ root(TRIE) then
        output (u)
    fi

The output m is predictable: 256, 257, 258, ...

Therefore use only output (u, T[i])

start-TRIE = 256-leaf trie with bytes encoded as 0, 1, 2, ..., 255


An Example Encoding

LZW-Encoder(T) as on the previous slide, with output (u, T[i]) in place of output (m, u, T[i])

Encoding of m a n a m a n a t a p i t i p i t i p i

    Output:    (m,a)  (n,a)  (256,n)  (a,t)  (a,p)  (i,t)  (i,p)  (261,i)  (p,i)
    New node:  256    257    258      259    260    261    262    263      264
    Phrase:    ma     na     (ma)n    at     ap     it     ip     (it)i    pi

[Figure: LZW trie after the encoding, with the new nodes 256-264 below the start-TRIE leaves m, n, a, i, p]
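The example can be reproduced with a short Python sketch of the encoder. It assumes the same dictionary representation of the trie as an illustration device, byte-sized characters, and it writes start-TRIE leaves as their byte codes (m = 109, n = 110, a = 97, and so on) where the slide writes the letters. The name lzw_encode and the use of None for a missing second symbol are assumptions of this sketch.

```python
ROOT = -1  # marker for the root of the trie


def lzw_encode(text):
    """Return the list of (node, letter) pairs for the given text."""
    children = {}   # (parent node, letter) -> child node
    m = 255         # largest node number so far
    u = ROOT
    out = []
    for ch in text:
        if u == ROOT:
            u = ord(ch)                  # start-TRIE leaf for this byte
        elif (u, ch) in children:
            u = children[(u, ch)]
        else:
            m += 1                       # new leaf m below u
            children[(u, ch)] = m
            out.append((u, ch))
            u = ROOT
    if u != ROOT:
        out.append((u, None))            # final node without a new letter
    return out


for pair in lzw_encode("manamanatapitipitipi"):
    print(pair)
```

The nine pairs correspond one to one to the new nodes 256 to 264.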


The Decoder

LZW-Decoder(Code)
    i ← 1
    TRIE ← start-TRIE
    m ← 255
    for i ← 0 to 255 do C(i) ← “i” od
    while not end of file do
        (u,c) ← read-next-two-symbols(Code)
        if c exists then
            output (C(u), c)
            m ← m+1
            append leaf m to u with edge label c
            C(m) ← (C(u), c)
        else
            output (C(u))
        fi
    od

If the last string of the code did not produce a new node in the trie, then output the corresponding string.

[Figure: the trie and the example encoding of “manamanatapitipitipi” from the previous slide]
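A matching decoder sketch in Python is given below. It keeps the full string C(u) for every node, which is simple but uses more memory than a trie with parent pointers; that choice, the name lzw_decode, and the use of None for a missing second symbol are assumptions of this sketch.

```python
def lzw_decode(code):
    """Rebuild the text from a list of (node, letter) pairs."""
    C = {i: chr(i) for i in range(256)}   # start-TRIE: node i spells byte i
    m = 255
    out = []
    for u, c in code:
        if c is not None:
            m += 1
            C[m] = C[u] + c               # new node m spells C(u) followed by c
            out.append(C[m])
        else:
            out.append(C[u])              # last node produced no new leaf
    return "".join(out)


pairs = [(109, "a"), (110, "a"), (256, "n"), (97, "t"), (97, "p"),
         (105, "t"), (105, "p"), (261, "i"), (112, "i")]
print(lzw_decode(pairs))
```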


Performance of LZW

Encoding can be performed in time O(n)
  – where n is the length of the given text

Decoding can be performed in time O(n)
  – where n is the length of the uncompressed output

The memory consumption is linear in the size of the compressed code

LZW can be nicely implemented in hardware

There is no longer a software patent in force
  – so it is very popular, see “compress” for UNIX

LZW can be further compressed using Huffman codes
  – Every second character is a plain copy from the text!

Search in LZW is difficult
  – The encoding is embedded in the text (adaptive encoding)
  – For one search in a text there is a linear number of possible encodings of the search pattern (!)


The Algorithm of Amir, Benson & Farach: “Let Sleeping Files Lie”

Ideas
  – Build the trie, but do not decode
  – Use the KMP-Matcher with the nodes of the LZW trie
  – Prepare a data structure based on the pattern of length m
  – Then scan the text and update this data structure

Goal: running time of O(n + f(m))
  – where n is the code length
  – f(m) is some small polynomial depending on the pattern length m
  – for well-compressed codes and f(m) < n it should be faster than decoding and then running a text search


Searching in LZW-Codes: Inside a node

Example: Search for tapioca

[Figure: LZW nodes “abtapiocaab” and “blahblahabarb”; tapioca is “inside” a node]

Then we have found tapioca

For all nodes u of a trie:
  – Set Is_inside[u] = 1 if the text of u contains the pattern


Searching in LZW-Codes: Torn apart

Example: Search for tapioca

[Figure: consecutive LZW nodes “abrastap”, “io”, “carasi”]

  – Starting somewhere in a node
  – Parts are hidden in some other nodes
  – The end is the start of another node
  – All parts are nodes of the LZW trie


Finding the start: longest_prefix
Suffix of the node = prefix of the pattern

Is the suffix of the node a prefix of the pattern?
  – And if yes, how long is it?
  – Classify all nodes of the trie

For a very long text encoded by a node, only the last m letters matter

Can be computed using the KMP-Matcher algorithm while building the trie

Example:
  – Pattern: “manamana”
  – pamana: the last four letters are the first four of the pattern, result: 4
  – mama: the length of the suffix of the node which is a prefix of the pattern is 2
  – papa: result: 0
  – mana: result: 4
  – amanaplanacanalpamana: result: 4
  – amanaplanacanalpamana + m = amanaplanacanalpamanam (new node obtained by appending m)
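The longest_prefix value of a node text can be computed with the usual KMP prefix function. The sketch below scans the whole node text for clarity, whereas the matcher derives the value of a new node from its parent in the trie; the function names are illustrative and a non-empty pattern is assumed.

```python
def prefix_function(P):
    """KMP prefix function: pi[q] = length of the longest proper border of P[:q+1]."""
    pi = [0] * len(P)
    k = 0
    for q in range(1, len(P)):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi


def longest_prefix(node_text, P):
    """Length of the longest suffix of node_text that is a prefix of P."""
    pi = prefix_function(P)
    k = 0
    for ch in node_text:
        if k == len(P):          # full match seen: fall back before extending
            k = pi[k - 1]
        while k > 0 and P[k] != ch:
            k = pi[k - 1]
        if P[k] == ch:
            k += 1
    return k


for text in ["pamana", "mama", "papa", "mana", "amanaplanacanalpamana"]:
    print(text, longest_prefix(text, "manamana"))
```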


Is the node inside of a Pattern?

Find positions where the text of the node is inside the pattern

Several occurrences are possible
  – e.g. one letter
  – There are at most m(m-1)/2 encodings of such sub-strings
  – For every sub-string there is exactly one node that fits

Define table Inside-Node of size O(m^2)
  – Inside-Node[start,end] := node that encodes the pattern sub-string P[start]..P[end]

From Inside-Node[start,end] one can derive Inside-Node[start,end+1] as soon as the corresponding node is created

To quickly find all occurrences use pointers
  – Next-inside-occurrence(start,end) indicates the next position where the sub-string lies
  – It is initialized for start=end with the next occurrence of the letter

Example:
  – Pattern: “manamana”
  – ana: this text could be in positions 2-4 or positions 6-8 of the pattern
  – anam (= ana + m): result: (2,5)
  – rorororororo: result: (0,0), is not in the pattern
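The positions at which a node text lies inside the pattern can be listed by brute force, as in the following Python sketch. The function name inside_positions is hypothetical, positions are 1-based as on the slide, and an empty list plays the role of the slide's result (0,0).

```python
def inside_positions(node_text, P):
    """All 1-based (start, end) pairs with P[start..end] equal to node_text."""
    k = len(node_text)
    return [(s + 1, s + k)
            for s in range(len(P) - k + 1)
            if P[s:s + k] == node_text]


print(inside_positions("ana", "manamana"))
print(inside_positions("anam", "manamana"))
print(inside_positions("rorororororo", "manamana"))
```

The matcher itself does not rescan the pattern for every node; it extends the entries of the parent node by one letter, which is what the Inside-Node table and the occurrence pointers are for.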


Finding the End: longest_suffix
Prefix of the node = suffix of the pattern

Is the prefix of the node a suffix of the pattern?
  – And if yes, does it complete the pattern, if already i letters were found?
  – Classify all nodes of the trie

For a very long text encoded by a node, only the first m letters matter

Since the text is added at the right side, this property can be derived from the ancestor

Example:
  – Pattern: “manamana”
  – ananimal: here 3 and 1 could be the solution. We take 3, because 1 can be derived from 3 using the technique shown in the KMP-Matcher (applied to the reversed string)
  – manamanamana: result: 8
  – panamacanal: result: 0
  – manammanaaaaaaaaaaaa + m = manammanaaaaaaaaaaaam (new node obtained by appending m)
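The definition on this slide (longest prefix of the node that is a suffix of the pattern) can be checked directly with a brute-force Python sketch. It tries every candidate length and therefore takes quadratic time in the worst case; it is meant to illustrate the definition, not the incremental update used by the matcher. The function name is an assumption.

```python
def longest_suffix(node_text, P):
    """Length of the longest prefix of node_text that is a suffix of P."""
    for s in range(min(len(node_text), len(P)), 0, -1):
        if P.endswith(node_text[:s]):
            return s
    return 0


for text in ["ananimal", "manamanamana", "panamacanal"]:
    print(text, longest_suffix(text, "manamana"))
```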


How does it fit?

On the left side we have the maximum prefix of the pattern

On the right side we have the maximum suffix of the pattern

Example: 10-letter pattern pamapamapa
  – Left node: panapamapama (8-letter prefix found)
  – Right node: pamapana (6-letter suffix found)
  – 14 letters? Yet the pattern is inside, since the first 8 letters of the pattern followed by the last 6 letters of the pattern contain the pattern

Solution: Define the prefix-suffix table PS-T(p,s) = 1 if the p-letter prefix of P followed by the s-letter suffix of P contains the pattern


Computing the PS-Table in time O(m^3)

For all p and s such that p+s ≥ m compute PS-T[p,s]

Run the KMP-Matcher for pattern P in P[1..p]P[m-s+1..m]
  – needs time O(m) for each combination of p and s

Leads to a run time of O(m^3)

Example: 10-letter pattern pamapamapa
  – Nodes: xyzpamapama, pamapaxyz
  – If the pattern pamapamapa is found in the text pamapamapamapa, then PS-T[8,6] = 1
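The cubic-time construction can be sketched in Python as follows. Python's substring test stands in for the KMP-Matcher here, and the table is a dictionary keyed by (p, s); both choices and the name ps_table are assumptions of this sketch.

```python
def ps_table(P):
    """PS-T[p, s] = 1 if the p-letter prefix of P followed by the
    s-letter suffix of P contains P, for all p, s with p + s >= m."""
    m = len(P)
    table = {}
    for p in range(1, m + 1):
        for s in range(1, m + 1):
            if p + s >= m:
                text = P[:p] + P[m - s:]   # prefix followed by suffix
                table[(p, s)] = 1 if P in text else 0
    return table


T = ps_table("pamapamapa")
print(T[(8, 6)], T[(5, 5)], T[(6, 5)])
```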


Computing the Prefix-Suffix-Table in Time O(m^2): Preparation

ptr[i,j] = next position to the left of i where the suffix of P of length j occurs
         = max{k < i | P[m-j+1..m] = P[k..k+j-1] or k = 0}

[Figure: the pattern p a m a p a m a p a, repeated five times, with pointers between the occurrences of a suffix]


Computing the Prefix-Suffix-Table in time O(m^2): Initialization

Init-ptr(P)
    m ← length(P)
    for i ← 1 to m do ptr[i,0] ← i-1 od
    for j ← 1 to m-1 do
        last ← m-j+1
        i ← ptr[last+1,j-1]-1
        while i > 0 do
            if P[i]=P[last] then
                ptr[last,j] ← i
                last ← i
            fi
            i ← ptr[i+1,j-1]-1
        od
    od

[Figure: the pattern p a m a p a m a p a, repeated five times, with the pointers set by Init-ptr]

Run time: O(m^2)


Computing the Prefix-Suffix-Table in time O(m^2)

Init-PS-T(P)
    m ← length(P)
    ptr ← Init-ptr(P)
    for i ← 1 to m-1 do
        j ← i+1
        while j > 0 do
            PS-T[i,m-j+1] = 1
            j ← ptr[j,m-i]
        od
    od

Example: pattern p a m a p a m a p a, i = 8
  – j = 9: PS-T[8,2] = 1
  – j = ptr[9,2] = 5: PS-T[8,6] = 1
  – j = ptr[5,2] = 1: PS-T[8,10] = 1
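The pointer chain followed by Init-PS-T can be traced with a small Python sketch. Here ptr is evaluated directly from its definition by a linear scan and not precomputed by Init-ptr, so the sketch does not reach the O(m^2) bound; the helper names ptr and ps_entries are illustrative.

```python
def ptr(P, i, j):
    """max{k < i | P[m-j+1..m] = P[k..k+j-1]}, or 0 if there is no such k.
    Positions are 1-based as on the slides."""
    suffix = P[len(P) - j:]
    for k in range(i - 1, 0, -1):
        if P[k - 1:k - 1 + j] == suffix:
            return k
    return 0


def ps_entries(P, i):
    """Entries (i, s) that the inner loop of Init-PS-T sets to 1."""
    m = len(P)
    entries = []
    j = i + 1
    while j > 0:
        entries.append((i, m - j + 1))
        j = ptr(P, j, m - i)
    return entries


print(ps_entries("pamapamapa", 8))
```

For i = 8 the loop visits j = 9, 5 and 1, the three positions where the 2-letter suffix “pa” occurs in the pattern.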


ABF-LZW-Matcher(LZW-Code C, uncompressed pattern P)
    n ← length(C), m ← length(P)
    Init-PS-T(P)
    longest_prefix[P[1]] ← 1            /* longest prefix of P can be found in node P[1] */
    longest_suffix[P[m]] ← 1            /* longest suffix of P can be found in node P[m] */
    for i ← 1 to m do
        inside_node[i,i] ← P[i]         /* only single-character nodes can be inside of P */
    od
    Compute-Prefix(P)
    TRIE ← start-TRIE                   /* standard LZW trie initialization */
    v ← 255
    prefix ← 0
    for i ← 0 to 255 do
        C(i) ← “i”
    od
    for l ← 1 to n do
        (u,c) ← read-next-two-symbols(Code)
        v ← v+1
        Update_DS()                     /* insert new node v into the data structure */
        Check_for_Occurrences()         /* check for occurrences of P */
    od


Update Data Structure

Update_DS()
    length[v] ← length[u]+1
    /* omitted: C[v] ← C[u]c (standard LZW code) */
    is_inside[v] ← is_inside[u]         /* if u contains the pattern, so does v */
    if longest_prefix[u] < m and P[longest_prefix[u]+1] = c then
        longest_prefix[v] ← longest_prefix[u]+1
    fi
    if length[u] < m then
        /* there is a linked list of u for all positions of inside_node pointing to u */
        for all entries (start,end) of u in inside_node do
            if P[end+1] = c and end < m then
                inside_node[start,end+1] ← v    /* this occurs at most m^2 times over all rounds */
                Link new entry of v
            fi
        od
    fi
    if longest_suffix[u] < length[u] or P[length[v]] ≠ c then
        longest_suffix[v] ← longest_suffix[u]
    else
        longest_suffix[v] ← 1+longest_suffix[u]
        if longest_suffix[v] = m then is_inside[v] ← 1 fi
    fi

Examples (node u + letter c = new node v):
  – xyzmana + m = xyzmanam
  – manama + n = manaman
  – manamm + x = manammx


Check for Occurrences

Check_for_Occurrences()
    if is_inside[v] = 1 then
        return “pattern found at l”
        prefix ← longest_prefix[v]
    else if prefix = 0 then
        prefix ← longest_prefix[v]
    else if prefix + length[v] < m then
        while prefix ≠ 0 and inside_node[prefix+1,prefix+length[v]] ≠ v do
            prefix ← π(prefix)          /* like in KMP-Matcher */
        od
        if prefix = 0 then prefix ← longest_prefix[v]
        else prefix ← prefix+length[v]
        fi
    else
        suffix ← longest_suffix[v]
        if PS-T[prefix,suffix] = 1 then
            return “pattern found at l”
            prefix ← longest_prefix[v]
        else
            prefix ← longest_prefix[v]
        fi
    fi

[Figure: example nodes “xyzmanamanaxyz”; “xyzmana” followed by “man”; “xyzmana” followed by “namanaxy”]

This occurs at most |Σ| m^2 times


Running Time of the Matcher

Initialization needs time O(m^2)

Amortized analysis leads to (additional) time for checking of inner words of O(min{N, |Σ| m^2})
  – Every inner word occurs at most |Σ| times
  – where N is the length of the uncompressed text
  – and n is the length of the compressed text

Run time: O(n + m^2 + min{N, |Σ| m^2})

For small search patterns this is faster than the alternative
  – which is to decompress and apply the Boyer-Moore-Matcher


Text Compression Allowing Fast Searching Directly

“A Text Compression Scheme that Allows Fast Searching Directly in the Compressed File”, Udi Manber, ACM Trans. Inf. Systems, Vol. 15, No. 2, 1997, pp. 124-136

Idea:
  – Do not use LZ-compression or Huffman codes
  – Combine some letter pairs (a,b) and encode them into the “free” ASCII space (128-255)
  – Let f(a,b) denote the weight of such a pair
  – Encode the 128 most frequent pairs into a letter of {128,..,255} each
  – Use only non-overlapping pairs from V1 × V2, where V1 and V2 are disjoint, i.e. V1 ∩ V2 = ∅
  – The sum of the weights f(a,b) gives the compression ratio
  – Then one can apply the Boyer-Moore algorithm directly on the code
  – since pattern and text will be encoded with the same byte string

Problem: Choosing these sets optimally is NP-complete!

Solution:
  – Greedy heuristic (of unclear performance) gives a compression rate of 28-33%


Example

The most common “digraphs”:
  – th er on an re he in ed nd ha ...

Encoding: f(th)=128, f(er)=129, f(on)=130, f(an)=131, f(ed)=132

No compression: re, he, in, nd, ha

[Figure: graph over the letters t, h, e, r, o, n, i, a, d showing the sets V1 and V2]
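The pair encoding of this example can be sketched in Python. The table PAIRS holds only the five codes listed above; the function name encode and the sample strings are assumptions of this sketch, and the input is assumed to be plain ASCII (codes below 128). Handling of patterns that begin with a V2 letter or end with a V1 letter is not covered here.

```python
PAIRS = {"th": 128, "er": 129, "on": 130, "an": 131, "ed": 132}

# First letters form V1, second letters form V2; the sets must be disjoint.
V1 = {pair[0] for pair in PAIRS}
V2 = {pair[1] for pair in PAIRS}
assert not (V1 & V2)


def encode(text):
    """Replace every listed letter pair by its one-byte code."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in PAIRS:
            out.append(PAIRS[pair])
            i += 2
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


code = encode("another brother")
print(len("another brother"), len(code))
print(code.find(encode("other")))
```

Because V1 and V2 are disjoint, a letter that ends one pair can never start another, so the encoded pattern appears as the same byte string inside the encoded text and an ordinary byte search finds it.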


Chapter III: Searching the Web
15 Nov 2004


Problems of Searching the Web

Currently (Nov 2004) more than 8 billion = 8,000 million web pages
  – 10,000 words cover more than 95% of each text
  – many more web pages than words
  – Users hardly ever look through more than 40 results

The problem is not to find a pattern, but to find the most important pages

Problems:
  – Important pages do not contain the search pattern
      • www.porsche.com does not contain “sports car” or even “car”
      • www.google.com does not contain “web search engine”
      • www.airbus.com does not contain “airplane”
  – Certain pages have nearly every word (dictionary)
  – Names are misleading
      • http://www.whitehouse.org/ is not the web site of the White House
      • www.theonion.com is not about vegetables
  – Certain patterns can be found everywhere, e.g. page, web, windows, ...


How to rank Web-pages

The main problem of searching the web is to rank pages by importance

Links are very helpful:
  – Links are usually introduced by humans on purpose
  – The context of a link gives some clues about the meaning of the web page
  – Pages that many people point to are probably very important
  – Most search engines rely on links

Other approach: Ontology of words
  – Compare the combination of words with the search word
  – Good for comparing texts
  – Difficult if single-word patterns are given


Thanks for your attention
End of 5th lecture
Next lecture: Mo 22 Nov 2004, 11.15 am, FU 116
Next exercise class: Mo 15 Nov 2004, 1.15 pm, F0.530 or We 17 Nov 2004, 1.00 pm, E2.316