Parallel and Distributed IR


Papers on Parallel and Distributed IR

Introduction

Paper A: Inverted file partitioning schemes in Multiple Disk Systems by Byeong-Soo Jeong and Edward Omiecinski [404]

Paper B: Methodologies for Distributed Information Retrieval by Alistair Moffat, Justin Zobel, Owen de Kretser, and Tim Shimmin [1998]

Comparison and Conclusion

URLs

Agenda


Introduction

Exponential growth in size of online electronic text.

Per surveys conducted, the publicly indexable web contained:
  350 million pages in July 1998
  800 million pages in July 1999
  1 billion pages in January 2000

To manage this size and growth, we need a scalable model, multitasking algorithms, parallel and distributed IR.


Parallel and Distributed IR Comparison

The computation models of Distributed IR and Parallel IR are very similar: both divide the main task into sub-tasks and execute the sub-tasks in parallel.

The main difference is that in Distributed IR the sub-tasks run on different processing units, and interprocess communication uses network protocols rather than shared memory.

Distributed IR employs a procedure to select the subset of processes that receive a request, whereas Parallel IR broadcasts every request to every process.

Paper A discusses two schemes for implementing Parallel IR.

Paper B gives methodologies for Distributed IR.


Paper A: Inverted file partitioning schemes - Objective

The goal of the paper is to reduce average response time by partitioning the inverted file.

The paper identifies I/O time as a major cost factor in an IR system. It exploits the potential of I/O parallelism and balances the I/O workload for better response time by partitioning and distributing files.

The paper discusses two partitioning schemes for inverted file systems.

Inverted file partitioning schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski [1995]


Paper A: Inverted file structure


Paper A: Inverted file partitioning schemes

1) Based on term-id

2) Based on document-id

Scheme 1: All postings for a term are on one disk.

Scheme 2: All postings for a document are on one disk (so the postings for one term are distributed across disks).
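The two schemes can be sketched on a toy posting file. This is a minimal illustration, not the paper's implementation: the sample postings, the function names, and the round-robin (modulo) assignment of terms or documents to disks are all assumptions made for the example.

```python
from collections import defaultdict

# Toy posting file: term -> list of (doc_id, term_frequency). Illustrative data only.
POSTINGS = {
    "parallel": [(1, 2), (2, 1), (4, 3)],
    "disk": [(2, 5), (3, 1)],
    "index": [(1, 1), (3, 2), (4, 1)],
}


def partition_by_term(postings, num_disks):
    """Scheme 1: every posting of a term is placed on a single disk."""
    disks = [defaultdict(list) for _ in range(num_disks)]
    # Assumption: term ids are positions in sorted order, assigned round-robin.
    for term_id, term in enumerate(sorted(postings)):
        disks[term_id % num_disks][term].extend(postings[term])
    return disks


def partition_by_doc(postings, num_disks):
    """Scheme 2: every posting of a document is placed on a single disk."""
    disks = [defaultdict(list) for _ in range(num_disks)]
    for term, plist in postings.items():
        for doc_id, tf in plist:
            # Assumption: documents are assigned to disks by doc_id modulo num_disks.
            disks[doc_id % num_disks][term].append((doc_id, tf))
    return disks


def disks_holding(disks, term):
    """Indices of the disks that store at least one posting for the term."""
    return [i for i, disk in enumerate(disks) if term in disk]


by_term = partition_by_term(POSTINGS, num_disks=2)
by_doc = partition_by_doc(POSTINGS, num_disks=2)
print(disks_holding(by_term, "parallel"))  # one disk under scheme 1
print(disks_holding(by_doc, "parallel"))   # several disks under scheme 2
```

With two disks, the postings for "parallel" sit on one disk under scheme 1 but are split across both disks under scheme 2, which is the property the slides describe.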


Paper A: Partitioning schemes – Pictorial presentation


Paper A: Two schemes - comparison

Space usage:
  Document-ID based: The index file needs to store disk information to indicate where the posting entries for a term are stored. More space usage for the index file.
  Term-ID based: All postings for one term are on the same disk; less space usage for the index file.

Number of I/O:
  Document-ID based: Posting entries for one term are spread across disks; hence the number of I/Os for the posting file is equal to the number of disks containing posting file entries for the given term. More posting file I/O.
  Term-ID based: For one term, a single posting file I/O.

Load distribution:
  Document-ID based: Though there is more posting file I/O, it can be done in parallel. Hence the I/O load distribution is balanced.
  Term-ID based: Maximum I/O parallelism is limited by the number of terms in the query [on the web, average number of words per query = 2.35 per a survey in Sept-98].

I/O time:
  Document-ID based: Small I/O size. Hence, the resulting I/O time could be less.
  Term-ID based: I/O size is equal to the complete posting entry for a term. The resulting I/O time depends on the size of the longest posting entry.
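The comparison above can be made concrete with a toy cost count. The sketch below is a simplification and not the simulation model used in the paper: the function names, the posting-list lengths, and the placement of terms on disks are assumed values chosen only to show the pattern (fewer, larger reads for term-ID; more, smaller, parallel reads for document-ID).

```python
def term_partition_profile(query, posting_len, term_disk):
    """Return (posting reads, disks used, largest single read) for term-ID partitioning."""
    reads = len(query)                               # one posting-file read per query term
    disks_used = len({term_disk[t] for t in query})  # parallelism is capped by the query length
    largest = max(posting_len[t] for t in query)     # each read fetches a whole posting list
    return reads, disks_used, largest


def doc_partition_profile(query, posting_len_per_disk):
    """Return the same triple for document-ID partitioning."""
    # One read per (term, disk) pair that actually holds postings for the term.
    reads = [n for t in query for n in posting_len_per_disk[t] if n > 0]
    disks_used = len({i for t in query
                      for i, n in enumerate(posting_len_per_disk[t]) if n > 0})
    return len(reads), disks_used, max(reads)


# Assumed example: a two-term query on a four-disk system.
query = ["parallel", "disk"]
by_term = term_partition_profile(
    query,
    posting_len={"parallel": 1000, "disk": 200},
    term_disk={"parallel": 0, "disk": 3},
)
by_doc = doc_partition_profile(
    query,
    posting_len_per_disk={"parallel": [250, 250, 250, 250], "disk": [50, 50, 50, 50]},
)
print(by_term)  # (2, 2, 1000)
print(by_doc)   # (8, 4, 250)
```

In this assumed setting the term-ID scheme issues 2 reads on 2 disks with a largest read of 1000 postings, while the document-ID scheme issues 8 reads spread over all 4 disks with a largest read of 250 postings.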


Paper A: Two schemes performance comparison

Query Model:

Under the skewed query model, partition by document-ID performs better because its I/O load is more balanced, whereas partition by term-ID performs better under the uniform query model.

Query length:

Under a uniform query environment, partition by term-ID performs twice as fast for long queries and 5-10 times as fast for short queries.

Number of disks:

Adding disks improves the performance of the partition by document-ID scheme at a higher rate, since its I/O load is more evenly distributed.

Performance comparison under different parameters

Conclusion: Partition by term-ID performs better under uniform query models, but its response time fluctuates strongly depending on the terms in the query. With partition by document-ID, there is little variation in response time in almost all cases.


Paper B: Methodologies for Distributed IR - Objective

This paper appears in the proceedings of the 18th International Conference on Distributed Computing Systems, 1998.

This paper discusses three different methodologies for Distributed IR and compares their effectiveness, efficiency and response time.

Methodologies for Distributed Information Retrieval, by Alistair Moffat, Justin Zobel, Owen de Kretser, and Tim Shimmin [1998]


Paper B: Methodologies for Distributed IR

“Parallel Text Search Methods”, a 1988 paper by Salton and Buckley [701], makes interesting comments about early implementations of Parallel IR, challenging their effectiveness and efficiency.

Moffat and Zobel, in this paper, conclude that Distributed IR can be fast and effective, but agree with Salton and Buckley that it is not efficient.

[We will see why it is not efficient in the coming slides]


Paper B: Distributed IR Model

Librarian – An individual node that has its own sub-collection, maintains an index for that sub-collection, evaluates queries, and fetches documents.

Receptionist – Provides the user interface, posts user queries to all or a subset of the librarians, merges the results from the librarians, and generates the final ranked list of results using global information.

After global ranking by the receptionist, many of the documents returned by the librarians may never be presented to the user. The resources spent computing similarity for those unwanted documents and transmitting them are wasted, which is why efficiency is low in the distributed model.

[Figure: Distributed IR model. A user query enters the Receptionist (User-Interface, Global information about Librarians), which returns the Result; each Librarian has its own Index and Sub-Collection.]
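The librarian and receptionist roles, and the wasted work described above, can be sketched as follows. This is a hypothetical illustration only: the class and method names, the sample documents, and the crude term-count similarity are assumptions, not the system evaluated in the paper.

```python
class Librarian:
    """Holds one sub-collection and ranks its own documents for a query."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # doc_id -> text

    def search(self, query, k):
        terms = query.lower().split()
        scored = []
        for doc_id, text in self.docs.items():
            words = text.lower().split()
            # Assumption: similarity is a raw count of query-term occurrences.
            score = sum(words.count(t) for t in terms)
            if score > 0:
                scored.append((score, doc_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return scored[:k]


class Receptionist:
    """Posts the query to every librarian and merges the local rankings."""

    def __init__(self, librarians):
        self.librarians = librarians

    def search(self, query, k):
        candidates = []
        for librarian in self.librarians:
            candidates.extend(librarian.search(query, k))
        ranked = sorted(candidates, key=lambda pair: (-pair[0], pair[1]))
        top = ranked[:k]
        wasted = len(candidates) - len(top)  # scored and transmitted, never shown
        return top, wasted


lib_a = Librarian("A", {"a1": "parallel parallel retrieval", "a2": "parallel index"})
lib_b = Librarian("B", {"b1": "parallel parallel parallel disk", "b2": "distributed retrieval"})
top, wasted = Receptionist([lib_a, lib_b]).search("parallel", k=2)
print(top, wasted)  # [(3, 'b1'), (2, 'a1')] 1
```

Here three candidates are scored and sent to the receptionist but only two are shown to the user, which is a small-scale version of the inefficiency the slide describes.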


Paper B: Distributed IR methodologies

Three different methodologies are defined based on the global information stored at the receptionist.

Central Nothing – CN: The only global information maintained by the receptionist is a list of librarians.

Central Vocabulary – CV: The global information stored by the receptionist is the vocabularies of the sub-collections.

Central Index – CI: The receptionist has full access to the indexes of the sub-collections.


Paper B: Central Nothing–Distributed IR

Advantages:
  Little or no storage space is required for global information at the receptionist.
  Simple implementation.

Disadvantages:
  The receptionist has no basis for excluding any sub-collection, so every librarian processes the query in full.
  Final ranking quality is poor: a term might be common in one sub-collection and be assigned a minimal weight there, yet be rare in the context of the collection as a whole. When results from different sub-collections are merged, there is no basis for ranking collection-wide.

Global Information: List of librarians
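The ranking problem under Central Nothing can be illustrated with a small weight calculation. The sketch assumes a plain inverse document frequency weight, log(N / df); the paper's actual weighting formula is not given in these slides, and the collection sizes below are made up for the example.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency (assumed weighting): log(N / df)."""
    return math.log(num_docs / doc_freq)


# Assumed statistics for one term in two sub-collections.
sub1_docs, sub1_df = 100, 90   # common in sub-collection 1
sub2_docs, sub2_df = 900, 10   # rare in sub-collection 2

local_weight_sub1 = idf(sub1_docs, sub1_df)
local_weight_sub2 = idf(sub2_docs, sub2_df)
global_weight = idf(sub1_docs + sub2_docs, sub1_df + sub2_df)

print(round(local_weight_sub1, 3))  # small: the term looks unimportant locally
print(round(local_weight_sub2, 3))  # large: the term looks important locally
print(round(global_weight, 3))      # the collection-wide weight differs from both
```

Because each librarian scores with its own local weight, the same term contributes very different amounts to scores coming from different sub-collections, so the merged scores are not directly comparable.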


Paper B: Central Vocabulary-Distributed IR

Advantages:
  The receptionist can make a better choice of sub-collections for query distribution, and sub-collections can be avoided completely if they contain none or few of the query terms.
  It has a better global ranking (compared to CN) because it can use the central vocabulary.

Disadvantage:
  More storage is required for storing the collection-wide vocabulary.

Global Information: Vocabularies of all sub-collections.
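Librarian selection with a central vocabulary can be sketched as a set intersection. The function name, the `min_matches` threshold, and the sample vocabularies are hypothetical; the paper's actual selection rule is not described in these slides.

```python
def select_librarians(query_terms, vocabularies, min_matches=1):
    """Return the librarians whose vocabulary contains enough of the query terms.

    vocabularies maps a librarian name to the set of terms in its sub-collection.
    min_matches is an assumed threshold for "none or few" query terms.
    """
    wanted = set(query_terms)
    selected = []
    for name, vocab in vocabularies.items():
        if len(wanted & vocab) >= min_matches:
            selected.append(name)
    return selected


# Assumed central vocabulary held by the receptionist.
VOCABULARIES = {
    "lib_a": {"parallel", "disk"},
    "lib_b": {"index", "retrieval"},
    "lib_c": {"parallel", "retrieval"},
}

print(select_librarians(["disk"], VOCABULARIES))                      # ['lib_a']
print(select_librarians(["parallel", "retrieval"], VOCABULARIES, 2))  # ['lib_c']
```

Raising `min_matches` trades recall for fewer librarians contacted, which is one way to read "none or few of the query terms".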


Paper B: Central Index–Distributed IR

Advantages:
  The receptionist can perform all index processing and request from the librarians only the documents required to make the final ranking.
  Better selection of librarians.

Disadvantages:
  More storage is required for storing the collection-wide vocabulary and index.
  More preprocessing is required at the receptionist to request documents from librarians.

Global Information: The receptionist has full access to the indexes of the sub-collections.
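Under Central Index, ranking can happen entirely at the receptionist, which then asks each librarian only for the documents it needs. The sketch below is an assumed illustration: the index layout, the function names, and the sum-of-term-frequencies score are not taken from the paper.

```python
from collections import defaultdict

# Assumed central index: term -> list of (librarian, doc_id, term_frequency).
CENTRAL_INDEX = {
    "parallel": [("lib_a", "a1", 2), ("lib_b", "b1", 3)],
    "disk": [("lib_b", "b1", 1), ("lib_b", "b2", 1)],
}


def rank_centrally(query_terms, index, k):
    """Score documents at the receptionist and return the top-k (librarian, doc_id) pairs."""
    scores = defaultdict(int)
    for term in query_terms:
        for librarian, doc_id, tf in index.get(term, []):
            scores[(librarian, doc_id)] += tf  # assumption: score is summed term frequency
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [key for key, _ in ranked[:k]]


def fetch_plan(top_docs):
    """Group the selected documents by the librarian that must supply them."""
    plan = defaultdict(list)
    for librarian, doc_id in top_docs:
        plan[librarian].append(doc_id)
    return dict(plan)


top_docs = rank_centrally(["parallel", "disk"], CENTRAL_INDEX, k=2)
print(top_docs)              # [('lib_b', 'b1'), ('lib_a', 'a1')]
print(fetch_plan(top_docs))  # {'lib_b': ['b1'], 'lib_a': ['a1']}
```

Only the documents in the fetch plan are requested, so no librarian scores or transmits documents that the user will not see; the cost is the storage and processing at the receptionist.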


Paper A & Paper B comparison - Conclusion

Query Processing:
  Paper A (Inverted file partitioning schemes): Breaks query into keywords.
  Paper B (Methodologies for distributed IR): Sends complete query to librarians.

Document Partitioning:
  Paper A: May or may not partition corpus.
  Paper B: Corpus is partitioned.

Optimization:
  Paper A: Attempts to optimize I/O.
  Paper B: Attempts to optimize network delays and processing time.

Efficiency:
  Paper A: More efficient model.
  Paper B: Less efficient model.


Paper A and Paper B - URLs

Paper A: Inverted File Partitioning Schemes in Multiple Disk Systems by Byeong-Soo Jeong and Edward Omiecinski (IEEE Transactions on Parallel and Distributed Systems, Vol. 6, Feb 1995)

http://csdl.computer.org/comp/trans/td/1995/02/l0142abs.htm

Paper B: Methodologies for Distributed Information Retrieval by Owen de Kretser, Alistair Moffat, Tim Shimmin, and Justin Zobel (Proceedings of the 18th International Conference on Distributed Computing Systems)

http://csdl.computer.org/comp/proceedings/icdcs/1998/8292/00/82920066abs.htm
