Parallel and Distributed IR Eric Brown

Page 1: Parallel and Distributed IR

Parallel and Distributed IR

Eric Brown

Page 2: Parallel and Distributed IR

Parallel Computing

SISD: single instruction stream, single data stream.
SIMD: single instruction stream, multiple data streams.
MISD: multiple instruction streams, single data stream.
MIMD: multiple instruction streams, multiple data streams.

Page 3: Parallel and Distributed IR

Performance Measures

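Speedup:

$$S = \frac{\text{running time of best available sequential algorithm}}{\text{running time of parallel algorithm}}$$

Bound on speedup, where $f$ is the fraction of the computation that must run sequentially and $N$ is the number of processors:

$$S \le \frac{1}{f + (1 - f)/N} \le \frac{1}{f}$$

Efficiency:

$$\text{efficiency} = \frac{S}{N}$$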
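As a numeric illustration, the following minimal sketch evaluates the speedup bound 1 / (f + (1 - f) / N) and the corresponding efficiency S / N for a few processor counts. The function names and the sample value f = 0.1 are assumptions chosen for illustration; they do not come from the slides.

```python
def amdahl_speedup_bound(f, n):
    """Upper bound on speedup when a fraction f of the work is sequential
    and the remainder is spread over n processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup, n):
    """Speedup divided by the number of processors."""
    return speedup / n


if __name__ == "__main__":
    f = 0.1  # assumed sequential fraction, for illustration only
    for n in (1, 4, 16, 64):
        s = amdahl_speedup_bound(f, n)
        print(f"N={n:3d}  S<={s:6.3f}  efficiency<={efficiency(s, n):.3f}")
    print(f"limit as N grows: 1/f = {1 / f:.1f}")
```

The bound approaches 1/f as N grows, so adding processors yields diminishing returns and efficiency falls.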

Page 4: Parallel and Distributed IR

Parallel IR

Introduction: There are two approaches to parallel IR.

Develop new retrieval strategies that lend themselves directly to parallel implementation.

Adapt existing, well-studied information retrieval algorithms to parallel processing.

Page 5: Parallel and Distributed IR

MIMD Architecture

Page 6: Parallel and Distributed IR

MIMD Architecture

Inverted Files

Logical Document Partitioning: Uses essentially the same underlying inverted file index as the original sequential algorithm.

Physical Document Partitioning: Each subcollection has its own inverted file, and the search processes share nothing during query evaluation.

Page 7: Parallel and Distributed IR

MIMD Architecture

Logical document partitioning requires less communication than physical document partitioning with similar parallelization, and so is likely to provide better overall performance.

Physical document partitioning, on the other hand, offers more flexibility, and converting an existing IR system into a parallel IR system is simpler with physical document partitioning.
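Physical document partitioning can be sketched as follows: the collection is split into subcollections, each subcollection gets its own inverted file, and each search touches only its own index before the partial results are combined. The round-robin split, whitespace tokenization, Boolean AND query, and all function names are simplifying assumptions for illustration; in a real MIMD system each subcollection search would run on its own processor rather than in a sequential loop.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def partition_documents(docs, n_parts):
    """Split the collection round-robin into n_parts subcollections."""
    parts = [{} for _ in range(n_parts)]
    for i, (doc_id, text) in enumerate(sorted(docs.items())):
        parts[i % n_parts][doc_id] = text
    return parts


def search_partition(index, terms):
    """Boolean AND: ids of documents in this partition containing every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def parallel_search(indexes, terms):
    """Evaluate the query on every subcollection and combine the results.
    The searches share no data, so they could run on separate processors."""
    hits = set()
    for index in indexes:
        hits |= search_partition(index, terms)
    return hits


if __name__ == "__main__":
    docs = {
        1: "parallel information retrieval",
        2: "distributed information retrieval",
        3: "parallel computing",
        4: "parallel retrieval systems",
    }
    indexes = [build_inverted_index(p) for p in partition_documents(docs, 2)]
    print(sorted(parallel_search(indexes, ["parallel", "retrieval"])))
```

Because every document lives in exactly one subcollection, the union of the per-partition results equals the result over the whole collection.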

Page 8: Parallel and Distributed IR

MIMD Architectures

Term partitioning: When term partitioning is used with an inverted file system, a single inverted file is created for the document collection and the inverted lists are spread across the processors.

Assuming each processor has its own I/O channel and disks: when the term distributions in the documents and the queries are more skewed, document partitioning performs better; when terms are uniformly distributed in user queries, term partitioning performs better.
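A minimal sketch of term partitioning: one inverted file is built for the whole collection, and each inverted list is then assigned to exactly one processor. The slides do not say how lists are assigned, so the CRC32-based assignment below is an assumption, as are the function names, the whitespace tokenization, and the Boolean AND query.

```python
import zlib
from collections import defaultdict


def build_inverted_file(docs):
    """Single inverted file for the whole collection: term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def owner(term, n_procs):
    """Processor that holds the inverted list for this term.
    CRC32 is used because it is stable across runs (assumed scheme)."""
    return zlib.crc32(term.encode("utf-8")) % n_procs


def spread_inverted_lists(index, n_procs):
    """Assign every inverted list to exactly one processor."""
    shards = [{} for _ in range(n_procs)]
    for term, postings in index.items():
        shards[owner(term, n_procs)][term] = postings
    return shards


def evaluate_and_query(shards, terms):
    """Fetch each term's list from its owning processor, then intersect."""
    n_procs = len(shards)
    lists = [shards[owner(term, n_procs)].get(term, set()) for term in terms]
    return set.intersection(*lists) if lists else set()


if __name__ == "__main__":
    docs = {
        1: "parallel information retrieval",
        2: "distributed information retrieval",
        3: "parallel computing",
    }
    shards = spread_inverted_lists(build_inverted_file(docs), 3)
    print(sorted(evaluate_and_query(shards, ["information", "retrieval"])))
```

A query touches only the processors that own its terms, which is why the balance of load depends on how terms are distributed across queries.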

Page 9: Parallel and Distributed IR

MIMD Architecture

Page 10: Parallel and Distributed IR

SIMD Architecture

Signature Files

Page 11: Parallel and Distributed IR

SIMD Architecture

Signature Files

Page 12: Parallel and Distributed IR

SIMD Architecture

Signature Files

Page 13: Parallel and Distributed IR

SIMD Architectures

Inverted Files

Page 14: Parallel and Distributed IR

SIMD Architectures

Page 15: Parallel and Distributed IR

SIMD Architectures

Inverted Files

Page 16: Parallel and Distributed IR

SIMD Architectures

Page 17: Parallel and Distributed IR

Distributed IR

Introduction: A distributed computing system can be viewed as a MIMD parallel processor with a relatively slow inter-processor communication channel and the freedom to employ a heterogeneous collection of processors in the system.

Page 18: Parallel and Distributed IR

Distributed IR

Introduction: The distributed model is very similar to the MIMD parallel processing model.

The main difference is that the subtasks run on different computers, and communication between the subtasks is performed using a network protocol such as TCP/IP.

Page 19: Parallel and Distributed IR

Collection Partitioning

The procedure used to add documents to search servers in a distributed IR system depends on a number of factors. Consider whether or not the system is centrally administered.

Page 20: Parallel and Distributed IR

Collection Partitioning

When the distributed system is centrally administered, more options are available.

The first option is simple replication of the collection across all of the search servers.

The second option is random distribution of the documents.

The final option is explicit semantic partitioning of the documents.
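The three options for a centrally administered system can be sketched as three small functions. Everything here is hypothetical scaffolding: the function names, the seeded random generator, and in particular the `topic_of` mapping, which assumes that a topic label is already known for every document.

```python
import random


def replicate(docs, n_servers):
    """Option 1: every search server holds a full copy of the collection."""
    return [dict(docs) for _ in range(n_servers)]


def distribute_randomly(docs, n_servers, seed=0):
    """Option 2: each document is placed on one randomly chosen server."""
    rng = random.Random(seed)
    servers = [{} for _ in range(n_servers)]
    for doc_id, text in docs.items():
        servers[rng.randrange(n_servers)][doc_id] = text
    return servers


def partition_by_topic(docs, topic_of):
    """Option 3: semantic partitioning, one server per topic.
    topic_of maps doc id -> topic label (assumed to be given)."""
    servers = {}
    for doc_id, text in docs.items():
        servers.setdefault(topic_of[doc_id], {})[doc_id] = text
    return servers


if __name__ == "__main__":
    docs = {1: "parallel retrieval", 2: "pasta recipes", 3: "distributed indexing"}
    topics = {1: "computing", 2: "cooking", 3: "computing"}
    print(len(replicate(docs, 3)))
    print([sorted(s) for s in distribute_randomly(docs, 2)])
    print({t: sorted(s) for t, s in partition_by_topic(docs, topics).items()})
```

With replication any server can answer any query; with random distribution every server must be searched; with semantic partitioning a query can be sent only to the servers whose topics match it.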

Page 21: Parallel and Distributed IR

Source Selection

Source selection is the process of determining which of the distributed document collections are most likely to contain relevant documents for the current query, and therefore should receive the query for processing.

The basic technique is to treat each collection as if it were a single large document, index the collections, and evaluate the query against the collections to produce a ranked listing of collections.
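The basic technique can be sketched by collapsing each collection into one term-count profile and ranking the profiles against the query. The scoring formula below (length-normalized term frequency weighted by an inverse collection frequency) is an assumed stand-in, not the method used by any particular system, and the function names are hypothetical.

```python
import math
from collections import Counter


def collection_profile(docs):
    """Term counts for a collection treated as a single large document."""
    counts = Counter()
    for text in docs:
        counts.update(text.lower().split())
    return counts


def rank_collections(collections, query):
    """Return (collection name, score) pairs, best first."""
    profiles = {name: collection_profile(docs) for name, docs in collections.items()}
    n_collections = len(profiles)
    terms = query.lower().split()
    scores = {}
    for name, profile in profiles.items():
        length = sum(profile.values()) or 1
        score = 0.0
        for term in terms:
            # Number of collections that contain the term at least once.
            cf = sum(1 for p in profiles.values() if term in p)
            if cf == 0:
                continue
            icf = math.log(1.0 + n_collections / cf)
            score += (profile[term] / length) * icf
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    collections = {
        "sports": ["football match results", "tennis match report"],
        "computing": ["parallel retrieval algorithms", "distributed retrieval systems"],
        "cooking": ["pasta recipes"],
    }
    for name, score in rank_collections(collections, "distributed retrieval"):
        print(f"{name}: {score:.3f}")
```

Only the top-ranked collections would then receive the query; where to cut the ranking is a policy choice that this sketch leaves open.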

Page 22: Parallel and Distributed IR

Query Processing

Query processing in a distributed IR system proceeds as follows:

Select collections to search.

Distribute the query to the selected collections.

Evaluate the query at the distributed collections in parallel.

Combine the results from the distributed collections into the final result.
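The four steps above can be sketched end to end. This is an illustration under stated assumptions: threads stand in for remote search servers (a real system would send the query over a network protocol such as TCP/IP), selection keeps any collection containing a query term, and scores are simple term-occurrence counts that are assumed to be comparable across collections. All names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def select_collections(collections, terms):
    """Step 1: keep collections that contain at least one query term."""
    return [
        name
        for name, docs in collections.items()
        if any(t in text.lower().split() for text in docs.values() for t in terms)
    ]


def evaluate(docs, terms):
    """Step 3, per collection: score = number of query term occurrences."""
    results = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        score = sum(tokens.count(t) for t in terms)
        if score > 0:
            results.append((score, doc_id))
    return results


def process_query(collections, query, k=10):
    terms = query.lower().split()
    selected = select_collections(collections, terms)
    if not selected:
        return []
    # Steps 2 and 3: send the query to each selected collection and
    # evaluate in parallel (threads stand in for separate servers).
    with ThreadPoolExecutor(max_workers=len(selected)) as pool:
        partial = list(pool.map(lambda name: evaluate(collections[name], terms), selected))
    # Step 4: combine the partial result lists into one ranked list.
    merged = [hit for hits in partial for hit in hits]
    return heapq.nlargest(k, merged)


if __name__ == "__main__":
    collections = {
        "a": {"a1": "parallel retrieval", "a2": "cooking pasta"},
        "b": {"b1": "distributed retrieval retrieval", "b2": "parallel computing"},
        "c": {"c1": "gardening tips"},
    }
    print(process_query(collections, "retrieval"))
```

Merging by raw score works here only because every collection uses the same scoring function; when servers score differently, the combination step needs some form of score normalization.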

Page 23: Parallel and Distributed IR

Web Issues

The parallel and distributed techniques described above can be used directly, as if the Web were any other large document collection. This is the approach currently taken by most of the popular Web search services.

Page 24: Parallel and Distributed IR

Trends and Research Issues

The trend in parallel hardware is the development of general MIMD machines.

Many challenges remain in the area of parallel and distributed text retrieval.

The first challenge is measuring retrieval effectiveness on large text collections.

The second significant challenge is interoperability, or building distributed IR systems from heterogeneous components.