Building a Distributed Full-Text Index for the Web
S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina


• Introduction.

• Testbed architecture.

• Design of the indexer.

• Distributed indexing.

Introduction

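Example: three web pages and their inverted index

Web page 1: Pig Cat Fish Cat
Web page 2: Fly Dog Pig
Web page 3: Dog Cat Fish Dog

Inverted index (one inverted list per term; each entry in a list is a location):

Cat -> (1,2), (1,4), (3,2)
Dog -> (2,2), (3,1), (3,4)
Fish -> (1,3), (3,3)
Pig -> (1,1), (2,3)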

An inverted index consists of one inverted list for each index term, with the terms kept in sorted order.

An inverted list consists of the locations of its term, in sorted order.

A location consists of (page identifier, position in the page).

A posting consists of (index term, location).

Building an inverted index over a collection of web pages involves:

1. Processing each page to extract postings.

2. Building an inverted list for each term.

3. Writing the inverted lists out to disk.
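The three steps above can be sketched on the small Pig/Cat/Fish example. This is a minimal illustration, not the authors' indexer: the `PAGES` dictionary, the function names, and the choice of JSON as the on-disk format are all assumptions made for the example.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical toy collection: page identifier -> page text.
PAGES = {1: "pig cat fish cat", 2: "fly dog pig", 3: "dog cat fish dog"}


def extract_postings(page_id, text):
    """Step 1: emit one (term, (page_id, position)) posting per word."""
    return [(term, (page_id, pos))
            for pos, term in enumerate(text.split(), start=1)]


def build_inverted_lists(pages):
    """Step 2: group postings by term; terms and locations are kept sorted."""
    lists = defaultdict(list)
    for page_id, text in pages.items():
        for term, location in extract_postings(page_id, text):
            lists[term].append(location)
    return {term: sorted(locs) for term, locs in sorted(lists.items())}


def write_index(index, path):
    """Step 3: write the inverted lists to disk (JSON is an arbitrary choice)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


index = build_inverted_lists(PAGES)
index_path = os.path.join(tempfile.mkdtemp(), "index.json")
write_index(index, index_path)
```

At web scale the postings do not fit in memory, which is why the rest of the talk deals with pipelining, storage schemes, and distribution.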

Important problems when building a web-scale inverted index:

1. Scale and growth rate.

2. Rate of change.

Testbed architecture

• Distributors.

• Indexers.

• Query servers.


Distributed inverted index organization:

1. Local inverted files.

2. Global inverted files.

Global inverted files

Web page 1: Pig Cat Fish Cat
Web page 2: Fly Dog Pig
Web page 3: Dog Cat Fish Dog

Query server 1 (terms a-e):
Cat -> (1,2), (1,4), (3,2)
Dog -> (2,2), (3,1), (3,4)

Query server 2 (terms f-z):
Fish -> (1,3), (3,3)
Pig -> (1,1), (2,3)
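A global inverted file partitions the index by term: each query server holds the complete inverted lists for its range of terms. The sketch below assumes partitioning on the first letter of the term, as the a-e and f-z labels suggest; the names `TERM_RANGES`, `server_for_term`, and `partition_by_term` are hypothetical.

```python
# Inverted lists from the slide (term -> sorted locations).
INDEX = {
    "cat": [(1, 2), (1, 4), (3, 2)],
    "dog": [(2, 2), (3, 1), (3, 4)],
    "fish": [(1, 3), (3, 3)],
    "pig": [(1, 1), (2, 3)],
}

# Hypothetical term ranges, one per query server (inclusive first letters).
TERM_RANGES = {"query server 1": ("a", "e"), "query server 2": ("f", "z")}


def server_for_term(term):
    """Return the server whose letter range covers the term's first letter."""
    first = term[0].lower()
    for server, (low, high) in TERM_RANGES.items():
        if low <= first <= high:
            return server
    raise ValueError(f"no server owns term {term!r}")


def partition_by_term(index):
    """Place each complete inverted list on exactly one server."""
    shards = {server: {} for server in TERM_RANGES}
    for term, locations in index.items():
        shards[server_for_term(term)][term] = locations
    return shards


global_shards = partition_by_term(INDEX)
```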

Local inverted files

Web page 1: Pig Cat Fish Cat
Web page 2: Fly Dog Pig
Web page 3: Dog Cat Fish Dog

Query server 1 (pages 1 and 2):
Cat -> (1,2), (1,4)
Dog -> (2,2)
Fish -> (1,3)
Fly -> (2,1)
Pig -> (1,1), (2,3)

Query server 2 (page 3):
Cat -> (3,2)
Dog -> (3,1), (3,4)
Fish -> (3,3)
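A local inverted file partitions the index by page: each query server indexes only the pages assigned to it, so the list for one term is spread over several servers. A minimal sketch, assuming pages 1 and 2 go to the first server and page 3 to the second (the `PAGE_OWNER` mapping and function name are hypothetical):

```python
from collections import defaultdict

PAGES = {1: "pig cat fish cat", 2: "fly dog pig", 3: "dog cat fish dog"}

# Hypothetical assignment of pages to query servers.
PAGE_OWNER = {1: "query server 1", 2: "query server 1", 3: "query server 2"}


def build_local_indexes(pages, page_owner):
    """Build one independent inverted index per server, over its own pages."""
    shards = defaultdict(lambda: defaultdict(list))
    for page_id in sorted(pages):
        words = pages[page_id].split()
        for pos, term in enumerate(words, start=1):
            shards[page_owner[page_id]][term].append((page_id, pos))
    return {server: dict(lists) for server, lists in shards.items()}


local_shards = build_local_indexes(PAGES, PAGE_OWNER)
```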


Local vs. Global

• Resilience to failures.

• Network load.
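The resilience point can be illustrated with a conjunctive query such as "cat AND fish" on the two layouts. The sketch below is a simplified model under stated assumptions: queries are evaluated at page granularity, a failed server is simply skipped, and the function names are hypothetical. Network load is not modeled here.

```python
LOCAL = {
    "query server 1": {"cat": [(1, 2), (1, 4)], "dog": [(2, 2)],
                       "fish": [(1, 3)], "fly": [(2, 1)],
                       "pig": [(1, 1), (2, 3)]},
    "query server 2": {"cat": [(3, 2)], "dog": [(3, 1), (3, 4)],
                       "fish": [(3, 3)]},
}
GLOBAL = {
    "query server 1": {"cat": [(1, 2), (1, 4), (3, 2)],
                       "dog": [(2, 2), (3, 1), (3, 4)]},
    "query server 2": {"fish": [(1, 3), (3, 3)], "pig": [(1, 1), (2, 3)]},
}


def pages_of(locations):
    return {page_id for page_id, _ in locations}


def and_query_local(shards, terms, failed=()):
    """Every live server answers the whole query over its own pages."""
    result = set()
    for server, lists in shards.items():
        if server in failed:
            continue  # pages held by a failed server drop out of the answer
        result |= set.intersection(*(pages_of(lists.get(t, [])) for t in terms))
    return result


def and_query_global(shards, terms, failed=()):
    """Each term's complete list lives on one server and must be reachable."""
    page_sets = []
    for term in terms:
        owner = next((s for s, lists in shards.items() if term in lists), None)
        if owner is None:
            return set()  # term is not indexed anywhere
        if owner in failed:
            return None  # the term's list is unreachable: no answer at all
        page_sets.append(pages_of(shards[owner][term]))
    return set.intersection(*page_sets)
```

With both servers up, the two layouts return the same pages. If query server 2 fails, the local layout still returns a partial answer, while the global layout cannot answer any query that involves a term in the f-z range.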

Testbed environment:

The indexers and the query servers are single-processor PCs with 350-500 MHz processors and 300-500 MB of main memory, each equipped with multiple disks.

All the machines are interconnected by a 100 Mbps Ethernet LAN.


The WebBase collection:

To study some properties of web pages that are relevant to text indexing, we analyzed 5 samples, of 100,000 pages each, from different portions of the WebBase repository.

Property                                               Value
-----------------------------------------------------  -----
Average number of words per page                         438
Average number of distinct words per page                171
Average size of each page (as HTML)                     8650
Average size of each page after removing HTML tags      2815
Average size of a word in the vocabulary                   8

Table 1: Properties of the WebBase collection
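Two ratios follow directly from Table 1 and are useful when sizing an index. The snippet below only divides the table's own figures; the variable names are made up for the example.

```python
# Values copied from Table 1.
words_per_page = 438
distinct_words_per_page = 171
page_size_html = 8650
page_size_text = 2815

# Fraction of a page that remains once HTML tags are removed.
text_fraction = page_size_text / page_size_html

# Average number of times each distinct word occurs within a page.
occurrences_per_distinct_word = words_per_page / distinct_words_per_page

print(f"text fraction: {text_fraction:.3f}")
print(f"occurrences per distinct word: {occurrences_per_distinct_word:.2f}")
```

Roughly a third of each page is indexable text, and each distinct word occurs about two and a half times per page on average.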


Design of the Indexer

• Software pipeline.

• The storage of the inverted files generated by the process.


Software pipeline

The indexing process can logically be split into three phases, each bound by a different resource:

• Processing -> CPU-intensive.

• Flushing -> disk-intensive.

• Loading -> network-intensive.


The goal of our pipelining technique is to design an execution schedule for the different indexing phases that will result in minimal overall running time.
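Why overlapping the phases helps can be seen with an idealized timing model. The sketch assumes every buffer takes the same time l to load, p to process, and f to flush, that the network, CPU, and disk each serve one buffer at a time, and that a fixed number of buffers (`buffers`, set to 3 here as an assumption) is available. It is not the authors' analytical model.

```python
def sequential_time(n, l, p, f):
    """Total time when each buffer is loaded, processed, and flushed in turn."""
    return n * (l + p + f)


def pipelined_time(n, l, p, f, buffers=3):
    """Total time when the three phases of different buffers may overlap."""
    load_done, proc_done, flush_done = [], [], []
    for i in range(n):
        # Loading needs the network to be free and a buffer to be available.
        start = load_done[i - 1] if i >= 1 else 0.0
        if i >= buffers:
            start = max(start, flush_done[i - buffers])
        load_done.append(start + l)
        # Processing needs the loaded buffer and a free CPU.
        proc_start = max(load_done[i], proc_done[i - 1] if i >= 1 else 0.0)
        proc_done.append(proc_start + p)
        # Flushing needs the processed buffer and a free disk.
        flush_start = max(proc_done[i], flush_done[i - 1] if i >= 1 else 0.0)
        flush_done.append(flush_start + f)
    return flush_done[-1] if n else 0.0


print(sequential_time(4, 2, 3, 1), pipelined_time(4, 2, 3, 1))
```

In this model the pipelined schedule is limited by the slowest phase: once the pipeline is full, one buffer completes every max(l, p, f) time units.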

Figure: execution of the pipeline, showing the loading (L), processing (P), and flushing (F) phases against time t (pipeline time).

Theoretical analysis vs. experimental results


Storage schemes:

We considered three schemes for storing inverted files as sets of (key, value) pairs in a B-tree:

1. Full list.

2. Single payload.

3. Mixed list.
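The slide only names the three schemes. The sketch below shows one way to read them, under these assumptions: in the full list scheme the key is a term and the value is its whole inverted list; in the single payload scheme every posting is its own key with an empty value; in the mixed list scheme the key is a posting and the value holds the next few postings in sorted order, possibly crossing term boundaries. A plain Python dict stands in for the B-tree, and a fixed number of postings per pair (`chunk_size`, hypothetical) stands in for a fixed-size compressed value field.

```python
# Sorted postings for the example pages: (term, (page_id, position)).
POSTINGS = sorted([
    ("cat", (1, 2)), ("cat", (1, 4)), ("cat", (3, 2)),
    ("dog", (2, 2)), ("dog", (3, 1)), ("dog", (3, 4)),
    ("fish", (1, 3)), ("fish", (3, 3)),
    ("fly", (2, 1)),
    ("pig", (1, 1)), ("pig", (2, 3)),
])


def full_list(postings):
    """Key: term. Value: the complete inverted list of that term."""
    pairs = {}
    for term, location in postings:
        pairs.setdefault(term, []).append(location)
    return pairs


def single_payload(postings):
    """Key: one posting. Value: empty."""
    return {posting: None for posting in postings}


def mixed_list(postings, chunk_size=4):
    """Key: first posting of a chunk. Value: the postings that follow it."""
    pairs = {}
    for start in range(0, len(postings), chunk_size):
        chunk = postings[start:start + chunk_size]
        pairs[chunk[0]] = chunk[1:]
    return pairs


for name, scheme in [("full list", full_list(POSTINGS)),
                     ("single payload", single_payload(POSTINGS)),
                     ("mixed list", mixed_list(POSTINGS))]:
    print(name, len(scheme), "pairs")
```

The number of (key, value) pairs differs sharply between the schemes, which is what drives the trade-offs in index size, zig-zag joins, and hot updates.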


A qualitative comparison of these storage schemes:

• Index size

• Zig-zag joins

• Hot updates


Zig-zag join using ordered indexes

List 1: 1 2 3 4 7 9 18

List 2: 1 7 9 11 12 17 19
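A zig-zag join intersects two sorted lists by alternately seeking in one list to the current value of the other, so long runs of non-matching entries are skipped rather than scanned. The sketch below uses binary search on in-memory lists as a stand-in for seeking in an ordered index, and reads the second list on the slide as 1 7 9 11 12 17 19, which is an assumption about the garbled figure.

```python
from bisect import bisect_left


def zigzag_join(a, b):
    """Intersect two sorted lists, seeking in each to the other's current value."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Jump in a to the first entry >= b[j].
            i = bisect_left(a, b[j], i + 1)
        else:
            # Jump in b to the first entry >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return result


print(zigzag_join([1, 2, 3, 4, 7, 9, 18], [1, 7, 9, 11, 12, 17, 19]))
```

On the example, the seek from 2 jumps straight to 7 in the first list, and the seek from 11 jumps straight to 19 in the second.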


Experimental results (using mixed list)

Number of pages (million)   Input size (GB)   Index size (GB)   Index size (% of input)
-------------------------   ---------------   ---------------   -----------------------
0.1                                    0.81              0.05                      6.17
0.5                                    4.03              0.27                      6.70
2.0                                   16.11              1.13                      7.01
5.0                                   40.28              2.78                      6.90

Table 5: Mixed-list scheme index sizes

Only one posting was generated for all the occurrences of a word in a page.
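The percentage column of Table 5 is the index size divided by the input size. The check below recomputes it from the two size columns; the `ROWS` list and function name are made up for the example, and the small differences come from the rounding of the sizes in the table.

```python
# Rows of Table 5: (pages in millions, input GB, index GB, reported percentage).
ROWS = [
    (0.1, 0.81, 0.05, 6.17),
    (0.5, 4.03, 0.27, 6.70),
    (2.0, 16.11, 1.13, 7.01),
    (5.0, 40.28, 2.78, 6.90),
]


def index_percentage(input_gb, index_gb):
    """Index size as a percentage of the input size."""
    return 100.0 * index_gb / input_gb


for pages, input_gb, index_gb, reported in ROWS:
    computed = index_percentage(input_gb, index_gb)
    print(f"{pages} M pages: computed {computed:.2f}%, reported {reported:.2f}%")
```

The index stays close to 7% of the input across a 50-fold increase in collection size.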

Distributed indexing

Two problems that must be addressed when building an inverted index on a distributed architecture:

• Page distribution: the question of when and how to distribute pages to the indexing nodes.

• Collecting global statistics: the question of where, when, and how to compute and distribute global statistics.


Two strategies for page distribution:

• A priori distribution.

• Runtime distribution.


Three advantages of runtime distribution:

• Space.

• Load balancing.

• Effective pipelining.
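The load-balancing advantage of runtime distribution can be sketched as follows. The policy shown, sending each page to the indexer that has received the fewest bytes so far, is an assumption chosen for illustration; the slides do not say which policy the distributors use, and the function name is hypothetical.

```python
import heapq


def distribute_at_runtime(page_sizes, num_indexers):
    """Assign each page, as it is read, to the currently least-loaded indexer."""
    # Heap entries are (bytes assigned so far, indexer number).
    heap = [(0, idx) for idx in range(num_indexers)]
    heapq.heapify(heap)
    assignment = {idx: [] for idx in range(num_indexers)}
    for page_id, size in page_sizes:
        load, idx = heapq.heappop(heap)
        assignment[idx].append(page_id)
        heapq.heappush(heap, (load + size, idx))
    return assignment


# One large page followed by three small ones, spread over two indexers.
print(distribute_at_runtime([(1, 900), (2, 100), (3, 100), (4, 100)], 2))
```

With a priori distribution the split would be fixed before indexing starts, so it could not react to page sizes or to how fast each indexer is actually working.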


Collecting global statistics

Global statistics are computed by a dedicated server known as the statistician. This approach:

• Allows parallel computation.

• Minimizes the number of conversations among servers.

• Avoids extra disk I/O.

• Reduces network overhead.


Two strategies for sending information to the statistician:

• ME strategy: sending local information during merging.

• FL strategy: sending local information during flushing.
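The difference between the two strategies can be sketched with a toy statistician that sums per-term document frequencies. Several things here are assumptions: that the statistic is a document frequency, that FL sends one message per flushed run, that ME sends one message per indexer while it merges its runs, and all class and variable names.

```python
from collections import Counter


class Statistician:
    """Hypothetical aggregator of per-term document frequencies."""

    def __init__(self):
        self.global_df = Counter()
        self.messages = 0

    def receive(self, local_df):
        self.global_df.update(local_df)  # adds the local counts term by term
        self.messages += 1


def document_frequency(pages):
    """Number of pages in which each term occurs at least once."""
    df = Counter()
    for text in pages:
        df.update(set(text.split()))
    return df


# Runs flushed by two hypothetical indexers (each run is a list of page texts).
RUNS = {
    "indexer 1": [["pig cat fish cat"], ["fly dog pig"]],
    "indexer 2": [["dog cat fish dog"]],
}

fl = Statistician()  # FL: local information is sent at every flush
for runs in RUNS.values():
    for run in runs:
        fl.receive(document_frequency(run))

me = Statistician()  # ME: local information is sent once, during merging
for runs in RUNS.values():
    merged = [page for run in runs for page in run]
    me.receive(document_frequency(merged))

print(fl.messages, me.messages, fl.global_df == me.global_df)
```

Both strategies arrive at the same global statistics; in this model they differ in how many messages the statistician has to absorb and when it receives them.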


comparison