Modern Information Retrieval Chapter 9: Parallel and Distributed IR Section 9.1: Introduction Section 9.2.2.: MIMD Architectures Inverted Files November 5, 1999

Page 1: Modern Information Retrieval

Modern Information Retrieval

Chapter 9: Parallel and Distributed IR

Section 9.1: Introduction

Section 9.2.2.: MIMD Architectures

Inverted Files

November 5, 1999

Page 2: Modern Information Retrieval

Summary

  • Introduction
  • Review of parallel computing and parallel program performance measures
  • Exploration of techniques for implementing inverted files on a MIMD parallel architecture
  • Conclusion

Page 3: Modern Information Retrieval

Introduction

  • The volume of electronic text available online today is staggering. The WWW contains over 800 million pages of text, comprising nearly 6 terabytes of data (Nature, Vol. 400, 8 July 1999, www.nature.com).
  • As document collections grow larger, they become more expensive to manage with an information retrieval system.
  • To support the demanding requirements of modern search environments, we must turn to alternative architectures and algorithms.

Page 4: Modern Information Retrieval

Parallel Computing

  • Parallel computing is the simultaneous application of multiple processors to solve a single problem.
  • Flynn’s Taxonomy:
      SISD: single instruction, single data
      SIMD: single instruction, multiple data
      MISD: multiple instruction, single data
      MIMD: multiple instruction, multiple data

Page 5: Modern Information Retrieval

Parallel Program Performance Measures

Speedup

$$S = \frac{\text{Running time of best available sequential algorithm}}{\text{Running time of parallel algorithm}}$$

Amdahl’s Law

$$S \le \frac{1}{f + (1 - f)/N} \le \frac{1}{f}$$

where f is the fraction of the problem that must be computed sequentially;
N is the number of processors.
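As a worked illustration of the bound, the following Python sketch evaluates Amdahl's Law for a sequential fraction f = 0.1. The function name `amdahl_speedup_bound` and the chosen values of f and N are assumptions made for this example only.

```python
def amdahl_speedup_bound(f: float, n: int) -> float:
    """Upper bound on speedup when a fraction f of the work is sequential
    and the remaining (1 - f) is spread evenly over n processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


if __name__ == "__main__":
    f = 0.1  # assumed sequential fraction, chosen for illustration
    for n in (1, 4, 16, 64, 1024):
        print(f"N = {n:5d}  bound on S = {amdahl_speedup_bound(f, n):.2f}")
    print(f"limit as N grows: 1/f = {1.0 / f:.2f}")
```

With f = 0.1 the bound stays below 1/f = 10 however many processors are added, which is why reducing the sequential fraction matters more than adding hardware once N is large.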

Page 6: Modern Information Retrieval

Parallel Program Performance Measures

Efficiency

$$\text{Efficiency} = \frac{S}{N}$$

where S is speedup;
N is the number of processors.
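A small numeric sketch of the efficiency measure follows. The running times are hypothetical measurements invented for the example, and `efficiency` is an illustrative name, not one taken from the slides.

```python
def efficiency(speedup: float, n: int) -> float:
    """Speedup per processor: 1.0 means every processor is fully used."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return speedup / n


if __name__ == "__main__":
    # Hypothetical measurements (not from the slides).
    t_sequential = 120.0  # seconds, best available sequential algorithm
    t_parallel = 18.75    # seconds, parallel algorithm on 16 processors
    n = 16
    speedup = t_sequential / t_parallel
    print(f"speedup = {speedup:.2f}, efficiency = {efficiency(speedup, n):.2f}")
```

An efficiency well below 1 signals that much of the added processing capacity is idle or spent on coordination.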

Page 7: Modern Information Retrieval

MIMD Architectures

  • MIMD architectures offer a great deal of flexibility in how parallelism is defined and exploited to solve a problem.
  • There are two ways in which a retrieval system can exploit a MIMD machine:
      Parallel multitasking;
      Partitioned parallel processing.

Page 8: Modern Information Retrieval

MIMD Architectures

Parallel multitasking on a MIMD machine

[Figure: user queries and results pass through a broker, which is connected to several search engines.]

Page 9: Modern Information Retrieval

MIMD Architectures

Partitioned parallel processing on a MIMD machine

[Figure: a user query and its result pass through a broker, which exchanges subqueries and results with several search processes.]

Page 10: Modern Information Retrieval

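MIMD Architectures

Basic data elements processed by a search algorithm (columns: indexing items; rows: documents)

$$
\begin{array}{c|cccccc}
       & k_1     & k_2     & \cdots & k_i     & \cdots & k_t     \\ \hline
d_1    & w_{1,1} & w_{2,1} & \cdots & w_{i,1} & \cdots & w_{t,1} \\
d_2    & w_{1,2} & w_{2,2} & \cdots & w_{i,2} & \cdots & w_{t,2} \\
\vdots & \vdots  & \vdots  &        & \vdots  &        & \vdots  \\
d_j    & w_{1,j} & w_{2,j} & \cdots & w_{i,j} & \cdots & w_{t,j} \\
\vdots & \vdots  & \vdots  &        & \vdots  &        & \vdots  \\
d_N    & w_{1,N} & w_{2,N} & \cdots & w_{i,N} & \cdots & w_{t,N}
\end{array}
$$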

Page 11: Modern Information Retrieval

MIMD Architectures

There are two possible methods for partitioning the data:

  • Document partitioning: the N documents are distributed across the P processors; each parallel process evaluates the query on the subcollection of N/P documents assigned to it;
  • Term partitioning: the t indexing items are distributed across the P processors; the evaluation process for each document is spread over multiple processors.
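The two partitioning methods can be sketched as two ways of splitting the same data: by rows (documents) or by columns (indexing items). The assignment policies below, contiguous blocks for documents and round-robin for terms, are assumptions chosen for brevity; the slides do not prescribe a policy.

```python
def document_partition(doc_ids, p):
    """Split the documents into p contiguous blocks of about N/P documents."""
    size = -(-len(doc_ids) // p)  # ceiling of N / P
    return [doc_ids[i * size:(i + 1) * size] for i in range(p)]


def term_partition(terms, p):
    """Deal the indexing items out to p processors in round-robin order."""
    return [terms[i::p] for i in range(p)]


if __name__ == "__main__":
    docs = [1, 2, 3, 4, 5, 6]
    terms = ["cold", "hot", "in", "not", "pease", "porridge", "pot", "the"]
    print("document partitioning:", document_partition(docs, 3))
    print("term partitioning:    ", term_partition(terms, 3))
```

Under document partitioning a processor can score its documents completely on its own; under term partitioning a document's score is only complete once the partial scores from several processors are combined.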

Page 12: Modern Information Retrieval

Inverted Files: Logical Document Partitioning

Data Partitioning

  • The data partitioning is done logically using essentially the same basic underlying inverted file index as in the original sequential algorithm;
  • The inverted file is extended to give each parallel process direct access to that portion of the index related to the processor’s subcollection of documents.

Page 13: Modern Information Retrieval

Inverted Files: Logical Document Partitioning

Extended dictionary entry for document partitioning

[Figure: the dictionary entry for term i holds pointers P1, P2, P3 and P4 into the inverted list for term i.]

Page 14: Modern Information Retrieval

Inverted Files: Logical Document Partitioning

Query Evaluation

  • The broker initiates P parallel processes to evaluate the query;
  • Each process executes the same document scoring algorithm on its document subcollection;
  • The search processes record document scores in a single shared array of document score accumulators;
  • The broker produces the final ranked list of documents.
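The query evaluation steps for logical document partitioning can be sketched as follows. This is a minimal illustration under stated assumptions: Python threads stand in for the parallel processes, the score of a document is simply the sum of the query term frequencies, and the index is an excerpt of the "pease porridge" collection with documents 1-2, 3-4 and 5-6 assigned to three processes. None of these choices comes from the slides.

```python
import threading

# Extended dictionary (excerpt): each term maps to one postings sublist per
# process, so process p reads only INDEX[term][p]. Postings are (doc, freq).
INDEX = {
    "cold": [[(2, 1)], [(4, 1)], [(5, 1)]],
    "hot": [[(1, 1)], [(4, 1)], [(5, 1), (6, 1)]],
    "pease": [[(1, 1), (2, 1)], [(3, 1), (4, 2)], [(5, 2), (6, 1)]],
}
NUM_PROCESSES = 3
NUM_DOCS = 6


def search_process(p, query, accumulators):
    """Score the documents of subcollection p into the shared accumulators."""
    for term in query:
        sublists = INDEX.get(term)
        if sublists is None:
            continue  # term does not occur in the collection
        for doc_id, freq in sublists[p]:
            # Subcollections are disjoint, so no two processes ever write
            # to the same accumulator slot.
            accumulators[doc_id] += freq


def broker(query):
    """Start the search processes, wait for them, and rank the documents."""
    accumulators = [0] * (NUM_DOCS + 1)  # slot 0 unused; doc ids start at 1
    workers = [
        threading.Thread(target=search_process, args=(p, query, accumulators))
        for p in range(NUM_PROCESSES)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    hits = [(doc, score) for doc, score in enumerate(accumulators) if score > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    print(broker(["pease", "hot"]))
```

Because every process writes only to the accumulator slots of its own documents, the shared array needs no locking in this sketch.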

Page 15: Modern Information Retrieval

Inverted Files: Logical Document Partitioning

Inverted File Construction

  • The indexer partitions the documents among the processors;
  • Each indexing process generates a batch of inverted lists, sorted by indexing item;
  • A merge step is performed to create the final inverted file.
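The construction steps can be sketched as batch generation followed by a k-way merge. The tokenizer (lowercased alphabetic words), the helper names, and the use of `heapq.merge` for the merge step are assumptions for illustration; the batches are also built one after another here, whereas each would run on its own processor in a parallel system.

```python
import heapq
import re
from collections import Counter

DOCS = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def index_batch(doc_ids):
    """One indexing process: inverted lists for its documents, sorted by term."""
    lists = {}
    for doc_id in doc_ids:
        counts = Counter(re.findall(r"[a-z]+", DOCS[doc_id].lower()))
        for term, freq in counts.items():
            lists.setdefault(term, []).append((doc_id, freq))
    return sorted(lists.items())


def merge_batches(batches):
    """Merge step: combine the sorted batches into the final inverted file."""
    inverted = {}
    # heapq.merge keeps entries with equal terms in batch order, so postings
    # stay sorted by document id when the batches cover increasing id ranges.
    for term, postings in heapq.merge(*batches, key=lambda entry: entry[0]):
        inverted.setdefault(term, []).extend(postings)
    return inverted


if __name__ == "__main__":
    batches = [index_batch(part) for part in ([1, 2], [3, 4], [5, 6])]
    for term, postings in merge_batches(batches).items():
        print(term, postings)
```

Sorting each batch by indexing item is what lets the merge step run as a single sequential pass over all batches.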

Page 16: Modern Information Retrieval

Inverted Files: Physical Document Partitioning

Data Partitioning

  • The documents are physically partitioned into separate subcollections, one for each parallel processor;
  • Each subcollection has its own inverted file.

Page 17: Modern Information Retrieval

Inverted Files: Physical Document Partitioning

Query Evaluation

  • The broker distributes the query to all of the parallel search processes;
  • Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list;
  • The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list.
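A minimal sketch of this broker and search-process exchange follows. It assumes a thread pool in place of separate processors, term-frequency sums as scores, and a final hit-list cut off at the top k results; the three inverted files are those of the "pease porridge" collection split into documents 1-2, 3-4 and 5-6.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# One inverted file per subcollection; postings are (doc, freq).
PARTITIONS = [
    {"cold": [(2, 1)], "hot": [(1, 1)],
     "pease": [(1, 1), (2, 1)], "porridge": [(1, 1), (2, 1)]},
    {"cold": [(4, 1)], "hot": [(4, 1)], "in": [(3, 1)], "not": [(4, 1)],
     "pease": [(3, 1), (4, 2)], "porridge": [(3, 1), (4, 2)],
     "pot": [(3, 1)], "the": [(3, 1)]},
    {"cold": [(5, 1)], "hot": [(5, 1), (6, 1)], "in": [(6, 1)],
     "not": [(5, 1)], "pease": [(5, 2), (6, 1)],
     "porridge": [(5, 2), (6, 1)], "pot": [(6, 1)], "the": [(6, 1)]},
]


def rank_key(hit):
    """Order hits by descending score, then ascending document id."""
    return (-hit[1], hit[0])


def search_partition(index, query):
    """Evaluate the query on one subcollection; return a ranked hit-list."""
    scores = {}
    for term in query:
        for doc_id, freq in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + freq
    return sorted(scores.items(), key=rank_key)


def broker(query, k=3):
    """Send the query to every partition and merge the intermediate hit-lists."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        hit_lists = list(pool.map(lambda idx: search_partition(idx, query),
                                  PARTITIONS))
    # Each hit-list is already ranked, so a k-way merge yields the final order.
    return list(islice(heapq.merge(*hit_lists, key=rank_key), k))


if __name__ == "__main__":
    print(broker(["hot", "cold"]))
```

Since each intermediate hit-list arrives sorted, the broker only has to consume the first k entries of the merged stream rather than re-sort everything.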

Page 18: Modern Information Retrieval

Inverted Files: Physical Document Partitioning

Inverted File Construction

  • Each processor creates, in parallel, its own complete index corresponding to its document partition;
  • A merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries.
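The merge step for global statistics can be sketched as below. The statistic chosen here, the collection-wide document frequency of each term, is an assumption: the slides do not say which statistics are accumulated. The partition indexes are a three-term excerpt of the example collection, and the helper names are hypothetical.

```python
from collections import Counter

# Excerpt of the per-partition inverted files; postings are (doc, freq).
PARTITIONS = [
    {"cold": [(2, 1)], "hot": [(1, 1)]},
    {"cold": [(4, 1)], "hot": [(4, 1)], "pot": [(3, 1)]},
    {"cold": [(5, 1)], "hot": [(5, 1), (6, 1)], "pot": [(6, 1)]},
]


def merge_statistics(partitions):
    """Accumulate the document frequency of every term over all partitions."""
    global_df = Counter()
    for index in partitions:
        for term, postings in index.items():
            global_df[term] += len(postings)  # local document frequency
    return global_df


def distribute_statistics(partitions, global_df):
    """Give each partition dictionary the global value for its own terms."""
    return [{term: global_df[term] for term in index} for index in partitions]


if __name__ == "__main__":
    global_df = merge_statistics(PARTITIONS)
    for p, dictionary in enumerate(distribute_statistics(PARTITIONS, global_df), 1):
        print(f"P{p}:", dictionary)
```

Without this step each partition would see only its local counts, so scores computed from collection-wide statistics would not be comparable when the broker merges hit-lists from different partitions.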

Page 19: Modern Information Retrieval

Inverted Files: Term Partitioning

Data Partitioning

  • Inverted lists are spread across the processors.

Page 20: Modern Information Retrieval

Inverted Files: Term Partitioning

Query Evaluation

  • The query is decomposed into indexing items and each indexing item is sent to the processor that holds the corresponding inverted list;
  • The processors create hit-lists with partial document scores and return them to the broker;
  • The broker combines the hit-lists.
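Query evaluation under term partitioning can be sketched as routing, partial scoring and combination. The assignment of terms to the three processors below is hypothetical, as are the helper names and the use of summed term frequencies as scores; the processors are also called one after another rather than in parallel.

```python
from collections import defaultdict

FULL = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]

# Hypothetical assignment of inverted lists to processors; postings are (doc, freq).
PROCESSORS = [
    {"cold": [(2, 1), (4, 1), (5, 1)],
     "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
     "in": [(3, 1), (6, 1)]},
    {"not": [(4, 1), (5, 1)], "pease": list(FULL)},
    {"porridge": list(FULL), "pot": [(3, 1), (6, 1)], "the": [(3, 1), (6, 1)]},
]
TERM_OWNER = {term: p for p, lists in enumerate(PROCESSORS) for term in lists}


def decompose(query):
    """Group the query's indexing items by the processor holding their lists."""
    subqueries = defaultdict(list)
    for term in query:
        if term in TERM_OWNER:
            subqueries[TERM_OWNER[term]].append(term)
    return dict(subqueries)


def partial_hit_list(p, terms):
    """Processor p: partial document scores from the lists it holds."""
    scores = defaultdict(int)
    for term in terms:
        for doc_id, freq in PROCESSORS[p][term]:
            scores[doc_id] += freq
    return dict(scores)


def broker(query):
    """Route the subqueries, then add up the partial scores per document."""
    combined = defaultdict(int)
    for p, terms in decompose(query).items():
        for doc_id, partial in partial_hit_list(p, terms).items():
            combined[doc_id] += partial
    return sorted(combined.items(), key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    print(decompose(["pease", "porridge", "hot"]))
    print(broker(["pease", "porridge", "hot"]))
```

Unlike document partitioning, the broker here cannot simply merge ranked lists: it has to add partial scores for the same document arriving from different processors before ranking.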

Page 21: Modern Information Retrieval

Inverted Files: Term Partitioning

Inverted File Construction

  • The inverted file is created using the parallel construction technique described for logical document partitioning.

Page 22: Modern Information Retrieval

Example

Document collection

Document   Text
1          Pease porridge hot
2          Pease porridge cold
3          Pease porridge in the pot
4          Pease porridge hot, pease porridge not cold
5          Pease porridge cold, pease porridge not hot
6          Pease porridge hot in the pot

Page 23: Modern Information Retrieval

Example

Inverted File

Dictionary   Inverted Lists
cold         <2,1> <4,1> <5,1>
hot          <1,1> <4,1> <5,1> <6,1>
in           <3,1> <6,1>
not          <4,1> <5,1>
pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot          <3,1> <6,1>
the          <3,1> <6,1>
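The inverted file of the example can be reproduced with a short sequential sketch, where each posting <d,f> is a document number and the frequency of the term in that document. The tokenizer (lowercased alphabetic words) and the function name are assumptions made for this illustration.

```python
import re
from collections import Counter

DOCS = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_file(docs):
    """Map each term to its inverted list of (document, frequency) postings."""
    inverted = {}
    for doc_id in sorted(docs):  # visiting documents in order keeps lists sorted
        counts = Counter(re.findall(r"[a-z]+", docs[doc_id].lower()))
        for term, freq in counts.items():
            inverted.setdefault(term, []).append((doc_id, freq))
    return dict(sorted(inverted.items()))  # dictionary in alphabetical order


if __name__ == "__main__":
    for term, postings in build_inverted_file(DOCS).items():
        print(f"{term:10s}", " ".join(f"<{d},{f}>" for d, f in postings))
```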

Page 24: Modern Information Retrieval

Example

Logical Document Partitioning

Dictionary: cold, hot, in, not, pease, porridge, pot, the

Inverted list for the term “pease”, with one dictionary pointer per processor:

P1   <1,1> <2,1>
P2   <3,1> <4,2>
P3   <5,2> <6,1>

Page 25: Modern Information Retrieval

Example

Physical Document Partitioning

P1
cold         <2,1>
hot          <1,1>
pease        <1,1> <2,1>
porridge     <1,1> <2,1>

P2
cold         <4,1>
hot          <4,1>
in           <3,1>
not          <4,1>
pease        <3,1> <4,2>
porridge     <3,1> <4,2>
pot          <3,1>
the          <3,1>

P3
cold         <5,1>
hot          <5,1> <6,1>
in           <6,1>
not          <5,1>
pease        <5,2> <6,1>
porridge     <5,2> <6,1>
pot          <6,1>
the          <6,1>

Page 26: Modern Information Retrieval

Example

Term Partitioning

The dictionary terms and their inverted lists are divided among processors P1, P2 and P3:

cold         <2,1> <4,1> <5,1>
hot          <1,1> <4,1> <5,1> <6,1>
in           <3,1> <6,1>
not          <4,1> <5,1>
pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot          <3,1> <6,1>
the          <3,1> <6,1>

Page 27: Modern Information Retrieval

Conclusion

  • The task of indexing and searching in very large text collections is costly;
  • Faster indexing and searching algorithms are always desirable and the use of parallel hardware is an obvious alternative;
  • We discussed two possible organizations for the document collection index on a MIMD parallel architecture:
      Document partitioning;
      Term partitioning.

Page 28: Modern Information Retrieval

Conclusion

  • Document partitioning affords simpler inverted index construction and maintenance than term partitioning;
  • When term distributions in the documents and queries are more skewed, document partitioning performs better;
  • When terms are uniformly distributed in user queries, term partitioning performs better.

Page 29: Modern Information Retrieval

Additional References

Lawrence, S., Giles, C.L. 1999. Accessibility of Information on the Web. Nature, Vol. 400, pp. 107-109.

Ribeiro-Neto, B.A., Barbosa, R.A. 1998. Query Performance for Tightly Coupled Distributed Digital Libraries. Digital Libraries 98, pp. 182-190.

Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S., Ziviani, N. 1999. Efficient Distributed Algorithms to Build Inverted Files. SIGIR’99, pp. 105-112.