Modern Information Retrieval
Chapter 9: Parallel and Distributed IR
Section 9.1: Introduction
Section 9.2.2: MIMD Architectures
Inverted Files
November 5, 1999
Summary
Introduction
Review of parallel computing and parallel program performance measures
Exploration of techniques for implementing inverted files on a MIMD parallel architecture
Conclusion
Introduction
The volume of electronic text available online today is staggering. The WWW contains over 800 million pages of text, comprising nearly 6 terabytes of data (Nature, Vol. 400, 8 July 1999, www.nature.com).
As document collections grow larger, they become more expensive to manage with an information retrieval system.
To support the demanding requirements of modern search environments, we must turn to alternative architectures and algorithms.
Parallel Computing
Parallel computing is the simultaneous application of multiple processors to solve a single problem.
Flynn’s Taxonomy:
SISD: single instruction, single data
SIMD: single instruction, multiple data
MISD: multiple instruction, single data
MIMD: multiple instruction, multiple data
Parallel Program Performance Measures
Speedup
$$S = \frac{\text{Running time of best available sequential algorithm}}{\text{Running time of parallel algorithm}}$$
Amdahl’s Law
$$S \le \frac{1}{f + (1 - f)/N} \le \frac{1}{f}$$
where f is the fraction of the problem that must be computed sequentially;
N is the number of processors.
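As a rough illustration of Amdahl's Law, the sketch below evaluates the bound 1 / (f + (1 - f)/N) for a growing number of processors. The function name and the sample value f = 0.1 are assumptions chosen for illustration, not part of the original slides.

```python
def amdahl_speedup_bound(f, n):
    """Upper bound on speedup for sequential fraction f on n processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


# Hypothetical workload: 10% of the problem must run sequentially.
for n in (1, 4, 16, 64, 1024):
    print(n, round(amdahl_speedup_bound(0.1, n), 2))
```

However many processors are added, the bound stays below 1/f, which is 10 for this sample value.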
Efficiency
$$\text{Efficiency} = \frac{S}{N}$$
where S is speedup;
N is the number of processors.
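The two measures can be computed directly from running times. The sketch below is a minimal example; the function names and the timings (120 s sequential, 20 s on 8 processors) are hypothetical values, not measurements from the source.

```python
def speedup(t_sequential, t_parallel):
    """Running time of the best sequential algorithm over the parallel running time."""
    return t_sequential / t_parallel


def efficiency(s, n):
    """Speedup divided by the number of processors."""
    return s / n


# Hypothetical timings: 120 s sequentially, 20 s on 8 processors.
s = speedup(120.0, 20.0)
print("speedup:", s)
print("efficiency:", efficiency(s, 8))
```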
MIMD Architectures
MIMD architectures offer a great deal of flexibility in how parallelism is defined and exploited to solve a problem.
There are two ways in which a retrieval system can exploit a MIMD machine:
Parallel multitasking;
Partitioned parallel processing.
[Figure: Parallel multitasking on a MIMD machine. Labels: User Query, Result, Broker, Search Engine (several instances).]
[Figure: Partitioned parallel processing on a MIMD machine. Labels: User Query, Result, Broker, Subquery/Results, Search Process (several instances).]
Basic data elements processed by a search algorithm
(columns: indexing items k1 ... kt; rows: documents d1 ... dN)

        k1      k2      ...   ki      ...   kt
d1      w1,1    w2,1    ...   wi,1    ...   wt,1
d2      w1,2    w2,2    ...   wi,2    ...   wt,2
...     ...     ...     ...   ...     ...   ...
dj      w1,j    w2,j    ...   wi,j    ...   wt,j
...     ...     ...     ...   ...     ...   ...
dN      w1,N    w2,N    ...   wi,N    ...   wt,N
There are two possible methods for partitioning the data:
Document partitioning: the N documents are distributed across the P processors; each parallel process evaluates the query on the subcollection of N/P documents assigned to it;
Term partitioning: the t indexing items are distributed across the P processors; the evaluation process for each document is spread over multiple processors.
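The two partitioning methods amount to splitting the document-by-term matrix either by rows or by columns. The sketch below is one possible illustration: the helper `split_contiguous` and the choice of contiguous blocks are assumptions (the slides do not say how items are assigned to processors), and the six documents and eight terms are taken from the example later in the slides.

```python
def split_contiguous(items, p):
    """Split a list into p contiguous blocks whose sizes differ by at most one."""
    q, r = divmod(len(items), p)
    parts, start = [], 0
    for i in range(p):
        size = q + (1 if i < r else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


doc_ids = [1, 2, 3, 4, 5, 6]
terms = ["cold", "hot", "in", "not", "pease", "porridge", "pot", "the"]

# Document partitioning: each processor gets about N/P rows (documents).
document_partitions = split_contiguous(doc_ids, 3)

# Term partitioning: each processor gets about t/P columns (indexing items).
term_partitions = split_contiguous(terms, 3)

print(document_partitions)
print(term_partitions)
```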
Inverted Files: Logical Document Partitioning
Data Partitioning
The data partitioning is done logically, using essentially the same underlying inverted file index as in the original sequential algorithm;
The inverted file is extended to give each parallel process direct access to the portion of the index related to its processor’s subcollection of documents.
[Figure: Extended dictionary entry for document partitioning. The dictionary entry for term i carries pointers P1, P2, P3, P4 into the inverted list for term i.]
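One way to picture the extended dictionary entry is as a list of offsets into a single inverted list, one offset per processor. The sketch below assumes that each processor owns a contiguous range of document ids and that the inverted list is sorted by document id; both are assumptions, as are the names `extended_entry` and `block_for`. The postings for "pease" come from the example later in the slides.

```python
from bisect import bisect_left


def extended_entry(postings, first_doc_ids):
    """Return one start offset per processor into a doc-id-sorted inverted list.

    first_doc_ids[j] is the smallest document id owned by processor j.
    """
    doc_ids = [doc_id for doc_id, _ in postings]
    return [bisect_left(doc_ids, first) for first in first_doc_ids]


def block_for(postings, pointers, j):
    """Slice of the inverted list that belongs to processor j."""
    start = pointers[j]
    end = pointers[j + 1] if j + 1 < len(pointers) else len(postings)
    return postings[start:end]


# <document, term frequency> entries for the term "pease".
pease = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]
# Hypothetical assignment: documents 1-2, 3-4, 5-6 on three processors.
pointers = extended_entry(pease, [1, 3, 5])

print(pointers)
print(block_for(pease, pointers, 1))
```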
Query Evaluation
The broker initiates P parallel processes to evaluate the query;
Each process executes the same document scoring algorithm on its document subcollection;
The search processes record document scores in a single shared array of document score accumulators;
The broker produces the final ranked list of documents.
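The query evaluation steps above can be sketched as follows. This is a minimal illustration under several assumptions: threads stand in for the parallel processes, the score is a plain sum of term frequencies (the slides do not name a scoring function), and the index layout (postings plus per-processor start offsets) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# term -> (doc-id-sorted postings, start offset per processor); hypothetical layout
# with documents 1-2, 3-4, 5-6 assigned to processors 0, 1, 2.
INDEX = {
    "hot": ([(1, 1), (4, 1), (5, 1), (6, 1)], [0, 1, 2]),
    "pot": ([(3, 1), (6, 1)], [0, 0, 1]),
}


def search_process(j, index, query_terms, accumulators):
    """Score only the documents of processor j's subcollection."""
    for term in query_terms:
        if term not in index:
            continue
        postings, pointers = index[term]
        start = pointers[j]
        end = pointers[j + 1] if j + 1 < len(pointers) else len(postings)
        for doc_id, tf in postings[start:end]:
            # Each process touches a disjoint set of accumulator slots.
            accumulators[doc_id] += tf


def broker(index, query_terms, num_docs, p):
    """Start p search processes, then rank documents from the shared array."""
    accumulators = [0] * (num_docs + 1)  # slot 0 unused; doc ids start at 1
    with ThreadPoolExecutor(max_workers=p) as pool:
        futures = [
            pool.submit(search_process, j, index, query_terms, accumulators)
            for j in range(p)
        ]
        for future in futures:
            future.result()  # re-raise any worker error
    hits = [(doc, score) for doc, score in enumerate(accumulators) if score > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


ranked = broker(INDEX, ["hot", "pot"], num_docs=6, p=3)
print(ranked)
```

Because every document belongs to exactly one subcollection, the processes never write to the same accumulator slot, so this sketch needs no locking.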
Inverted File Construction
The indexer partitions the documents among the processors;
Each indexing process generates a batch of inverted lists, sorted by indexing item;
A merge step is performed to create the final inverted file.
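The construction steps above can be sketched as a batch-then-merge pipeline. In this illustration the batches are built one after another rather than in parallel, the tokenizer is a simple assumption (lowercase alphabetic words), and the names `index_batch` and `merge_batches` are hypothetical. The documents are those of the example later in the slides.

```python
import heapq
import re
from collections import Counter


def index_batch(docs):
    """Build one batch of (term, doc_id, tf) entries, sorted by indexing item."""
    entries = []
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        entries.extend((term, doc_id, tf) for term, tf in counts.items())
    return sorted(entries)


def merge_batches(batches):
    """Merge the sorted batches into one inverted file (term -> postings)."""
    inverted = {}
    for term, doc_id, tf in heapq.merge(*batches):
        inverted.setdefault(term, []).append((doc_id, tf))
    return inverted


# Hypothetical assignment of documents to three indexing processes.
partitions = [
    {1: "Pease porridge hot", 2: "Pease porridge cold"},
    {3: "Pease porridge in the pot",
     4: "Pease porridge hot, pease porridge not cold"},
    {5: "Pease porridge cold, pease porridge not hot",
     6: "Pease porridge hot in the pot"},
]
merged = merge_batches([index_batch(docs) for docs in partitions])
print(merged["pease"])
```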
Inverted Files: Physical Document Partitioning
Data Partitioning
The documents are physically partitioned into separate subcollections, one for each parallel processor;
Each subcollection has its own inverted file.
Query Evaluation
The broker distributes the query to all of the parallel search processes;
Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list;
The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list.
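The broker's merge of intermediate hit-lists can be sketched with a k-way merge. This is a minimal illustration, not the source's algorithm: the partitions are evaluated one after another instead of in parallel, the score is a plain sum of term frequencies, and the per-partition indexes hold only the terms needed for the sample query.

```python
import heapq
import itertools


def rank_key(hit):
    """Order hits by descending score, then ascending document id."""
    doc_id, score = hit
    return (-score, doc_id)


def evaluate(partition_index, query_terms):
    """Evaluate the query on one partition and return its sorted hit-list."""
    scores = {}
    for term in query_terms:
        for doc_id, tf in partition_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return sorted(scores.items(), key=rank_key)


def broker_merge(hit_lists, k):
    """Merge the sorted intermediate hit-lists and keep the top k hits."""
    merged = heapq.merge(*hit_lists, key=rank_key)
    return list(itertools.islice(merged, k))


# One inverted file per partition (only "hot" and "pot" are shown).
P1 = {"hot": [(1, 1)]}
P2 = {"hot": [(4, 1)], "pot": [(3, 1)]}
P3 = {"hot": [(5, 1), (6, 1)], "pot": [(6, 1)]}

hit_lists = [evaluate(index, ["hot", "pot"]) for index in (P1, P2, P3)]
final_hits = broker_merge(hit_lists, k=3)
print(final_hits)
```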
Inverted File Construction
Each processor creates, in parallel, its own complete index corresponding to its document partition;
A merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries.
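The slides do not say which global statistics are accumulated. The sketch below assumes document frequency (the number of documents containing a term) as a representative statistic; that choice, the function names, and the dictionary layout are all assumptions made for illustration.

```python
from collections import Counter


def local_document_frequencies(partition_index):
    """Per-partition statistic: number of local documents containing each term."""
    return Counter({term: len(postings) for term, postings in partition_index.items()})


def merge_global_statistics(partition_indexes):
    """Accumulate the local statistics of all partitions into global ones."""
    global_df = Counter()
    for index in partition_indexes:
        global_df.update(local_document_frequencies(index))
    return global_df


def distribute(global_df, partition_indexes):
    """Attach the global statistic to every entry of each partition dictionary."""
    return [
        {term: {"global_df": global_df[term], "postings": postings}
         for term, postings in index.items()}
        for index in partition_indexes
    ]


# Abbreviated per-partition inverted files (three of the eight terms).
P1 = {"cold": [(2, 1)], "hot": [(1, 1)]}
P2 = {"cold": [(4, 1)], "hot": [(4, 1)], "pot": [(3, 1)]}
P3 = {"cold": [(5, 1)], "hot": [(5, 1), (6, 1)], "pot": [(6, 1)]}

global_df = merge_global_statistics([P1, P2, P3])
dictionaries = distribute(global_df, [P1, P2, P3])
print(dict(global_df))
```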
Inverted Files: Term Partitioning
Data Partitioning
Inverted lists are spread across the processors.
Query Evaluation
The query is decomposed into indexing items, and each indexing item is sent to the processor that holds the corresponding inverted list;
The processors create hit-lists with partial document scores and return them to the broker;
The broker combines the hit-lists.
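Term-partitioned query evaluation can be sketched as routing each query term to its owner and summing the partial scores. The assignment of terms to processors below is hypothetical, as are the function names and the term-frequency scoring; the processors are called one after another rather than in parallel.

```python
# Hypothetical assignment of whole inverted lists to three processors.
PROCESSORS = [
    {"cold": [(2, 1), (4, 1), (5, 1)],
     "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
     "in": [(3, 1), (6, 1)]},
    {"not": [(4, 1), (5, 1)],
     "pease": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)],
     "porridge": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]},
    {"pot": [(3, 1), (6, 1)],
     "the": [(3, 1), (6, 1)]},
]
TERM_OWNER = {term: j for j, lists in enumerate(PROCESSORS) for term in lists}


def partial_scores(inverted_lists, terms):
    """Hit-list of partial document scores for the terms held by one processor."""
    scores = {}
    for term in terms:
        for doc_id, tf in inverted_lists[term]:
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return scores


def evaluate_query(query_terms):
    """Broker: decompose the query, collect partial hit-lists, combine them."""
    terms_by_processor = {}
    for term in query_terms:
        owner = TERM_OWNER.get(term)
        if owner is not None:
            terms_by_processor.setdefault(owner, []).append(term)
    combined = {}
    for j, terms in terms_by_processor.items():
        for doc_id, score in partial_scores(PROCESSORS[j], terms).items():
            combined[doc_id] = combined.get(doc_id, 0) + score
    return sorted(combined.items(), key=lambda hit: (-hit[1], hit[0]))


print(evaluate_query(["hot", "pot"]))
```

Only the processors that hold a query term do any work here, which is the main contrast with document partitioning, where every processor evaluates every query.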
Inverted File Construction
The inverted file is created using the parallel construction technique described for logical document partitioning.
Example
Document collection
Document Text
1 Pease porridge hot
2 Pease porridge cold
3 Pease porridge in the pot
4 Pease porridge hot, pease porridge not cold
5 Pease porridge cold, pease porridge not hot
6 Pease porridge hot in the pot
Example: Inverted File
Each entry <d,f> gives a document number d and the number of occurrences f of the term in that document.

Dictionary   Inverted Lists
cold         <2,1> <4,1> <5,1>
hot          <1,1> <4,1> <5,1> <6,1>
in           <3,1> <6,1>
not          <4,1> <5,1>
pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot          <3,1> <6,1>
the          <3,1> <6,1>
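The inverted file for the six example documents can be reproduced with a few lines of code. The sketch below is a minimal sequential builder; the tokenizer (lowercase alphabetic words, punctuation dropped) and the function name are assumptions.

```python
import re
from collections import Counter

DOCUMENTS = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_file(documents):
    """Map each term to its list of (document number, term frequency) entries."""
    inverted = {}
    for doc_id in sorted(documents):
        counts = Counter(re.findall(r"[a-z]+", documents[doc_id].lower()))
        for term, tf in counts.items():
            inverted.setdefault(term, []).append((doc_id, tf))
    return dict(sorted(inverted.items()))  # dictionary in alphabetical order


inverted_file = build_inverted_file(DOCUMENTS)
for term, postings in inverted_file.items():
    print(term, " ".join(f"<{doc_id},{tf}>" for doc_id, tf in postings))
```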
Example: Logical Document Partitioning
Dictionary: cold, hot, in, not, pease, porridge, pot, the
Inverted list for term “pease”, with the pointers P1, P2, P3 held in its dictionary entry:

P1 -> <1,1> <2,1>
P2 -> <3,1> <4,2>
P3 -> <5,2> <6,1>
Example: Physical Document Partitioning

P1
cold         <2,1>
hot          <1,1>
pease        <1,1> <2,1>
porridge     <1,1> <2,1>

P2
cold         <4,1>
hot          <4,1>
in           <3,1>
not          <4,1>
pease        <3,1> <4,2>
porridge     <3,1> <4,2>
pot          <3,1>
the          <3,1>

P3
cold         <5,1>
hot          <5,1> <6,1>
in           <6,1>
not          <5,1>
pease        <5,2> <6,1>
porridge     <5,2> <6,1>
pot          <6,1>
the          <6,1>
Example: Term Partitioning
The inverted lists are spread across processors P1, P2, and P3; each list is held whole by one processor.

cold         <2,1> <4,1> <5,1>
hot          <1,1> <4,1> <5,1> <6,1>
in           <3,1> <6,1>
not          <4,1> <5,1>
pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot          <3,1> <6,1>
the          <3,1> <6,1>
Conclusion
The task of indexing and searching in very large text collections is costly;
Faster indexing and searching algorithms are always desirable, and the use of parallel hardware is an obvious alternative;
We discussed two possible organizations for the document collection index on a MIMD parallel architecture:
Document partitioning;
Term partitioning.
Document partitioning affords simpler inverted index construction and maintenance than term partitioning;
When term distributions in the documents and queries are more skewed, document partitioning performs better;
When terms are uniformly distributed in user queries, term partitioning performs better.
Additional References
Lawrence, S., Giles, C.L. 1999. Accessibility of Information on the Web. Nature. Vol. 400. pp. 107-109.
Ribeiro-Neto, B.A., Barbosa, R.A. 1998. Query Performance for Tightly Coupled Distributed Digital Libraries. Digital Libraries 98. pp. 182-190.
Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S., Ziviani, N. 1999. Efficient Distributed Algorithms to Build Inverted Files. SIGIR’99. pp. 105-112.