Parallel and Distributed IR
2
Papers on Parallel and Distributed IR
Agenda
Introduction
Paper A: Inverted File Partitioning Schemes in Multiple Disk Systems by Byeong-Soo Jeong and Edward Omiecinski [1995]
Paper B: Methodologies for Distributed Information Retrieval by Alistair Moffat, Justin Zobel, Owen de Kretser, Tim Shimmin [1998]
Comparison and Conclusion
URLs
3
Introduction
Exponential growth in size of online electronic text.
According to surveys, the publicly indexable web contained approximately:
350 million pages in July 1998
800 million pages in July 1999
1 billion pages in January 2000
To manage this size and growth, we need a scalable model, multitasking algorithms, parallel and distributed IR.
4
Parallel and Distributed IR Comparison
The computation models of Distributed IR and Parallel IR are very similar: both divide the main task into sub-tasks and execute the sub-tasks in parallel.
The main difference is that in Distributed IR the sub-tasks run on different processing units, and interprocess communication uses network protocols rather than shared memory.
Distributed IR employs a procedure to select the subset of processes that receive a request, whereas Parallel IR broadcasts every request to every process.
Paper A discusses two schemes for a Parallel IR implementation.
Paper B gives methodologies for Distributed IR.
5
Paper A: Inverted file partitioning schemes - Objective
The goal of the paper is to reduce average response time by partitioning the inverted file.
The paper identifies I/O time as a major cost factor in an IR system.
It exploits the potential of I/O parallelism and balances the I/O workload for better response time by partitioning and distributing files.
The paper discusses two partitioning schemes for inverted file systems.
Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski [1995]
6
Paper A: Inverted file structure
7
Paper A: Inverted file partitioning schemes
1) Based on term-id
2) Based on document-id
Scheme 1: All postings for a term are on one disk.
Scheme 2: All postings for a document are on one disk (so the postings for one term are distributed across disks).
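The two schemes can be illustrated with a minimal Python sketch. The toy postings, the function names, and the modulo placement of term ids and document ids onto disks are assumptions made for illustration; they are not the placement policy from the paper.

```python
# Toy inverted file: term -> list of document ids (postings). Hypothetical data.
POSTINGS = {
    "disk": [1, 2, 3, 4],
    "index": [2, 4],
    "query": [1, 3, 4],
}

def partition_by_term(postings, num_disks):
    """Scheme 1: all postings for a term are placed on a single disk."""
    disks = [{} for _ in range(num_disks)]
    # Assumption: term ids are assigned in sorted order and placed by modulo.
    for term_id, term in enumerate(sorted(postings)):
        disks[term_id % num_disks][term] = list(postings[term])
    return disks

def partition_by_document(postings, num_disks):
    """Scheme 2: all postings for a document are placed on a single disk."""
    disks = [{} for _ in range(num_disks)]
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            # Assumption: documents are placed by document id modulo disk count.
            disks[doc_id % num_disks].setdefault(term, []).append(doc_id)
    return disks

by_term = partition_by_term(POSTINGS, 2)
by_doc = partition_by_document(POSTINGS, 2)
```

In `by_term` each term appears on exactly one disk, while in `by_doc` the postings of "disk" and "query" are split over both disks.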
8
Paper A: Partitioning schemes – Pictorial presentation
9
Paper A: Two schemes - comparison
Space usage:
Document-ID based: The index file needs to store disk information to indicate where the posting entries for a term are stored. More space usage for the index file.
Term-ID based: All postings for one term are on the same disk; less space usage for the index file.
Number of I/O:
Document-ID based: Posting entries for one term are spread across disks; hence the number of posting-file I/Os equals the number of disks that hold posting entries for the given term. More posting-file I/O.
Term-ID based: For one term, a single posting-file I/O.
Load distribution:
Document-ID based: Although there are more posting-file I/Os, they can run in parallel. Hence the I/O load distribution is balanced.
Term-ID based: Maximum I/O parallelism is limited by the number of terms in the query [on the web, the average number of words per query was 2.35, per a survey in September 1998].
I/O time:
Document-ID based: Small I/O size. Hence the resulting I/O time could be less.
Term-ID based: The I/O size equals the complete posting entry for the term. The resulting I/O time depends on the size of the longest posting entry.
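The I/O counts in this comparison can be made concrete with a small sketch. The two-disk layouts and the helper `posting_ios` are hypothetical, and counting one read per (term, disk) pair is a simplifying assumption rather than the cost model from the paper.

```python
# Hypothetical two-disk layouts: each disk maps term -> postings stored there.
TERM_LAYOUT = [
    {"disk": [1, 2, 3, 4], "query": [1, 3, 4]},
    {"index": [2, 4]},
]
DOC_LAYOUT = [
    {"disk": [2, 4], "index": [2, 4], "query": [4]},
    {"disk": [1, 3], "query": [1, 3]},
]

def posting_ios(layout, query_terms):
    """Return the number of posting-file reads each disk serves for a query.

    Assumption: one read per (term, disk) pair that holds postings.
    """
    return [sum(1 for term in query_terms if term in disk) for disk in layout]

query = ["disk", "query"]
term_ios = posting_ios(TERM_LAYOUT, query)
doc_ios = posting_ios(DOC_LAYOUT, query)
```

For this query the term-ID layout needs one read per term, both on the same disk, while the document-ID layout needs twice as many reads but spreads them evenly over both disks.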
10
Paper A: Two schemes performance comparison
Performance comparison under different parameters
Query model:
Under a skewed query model, partition by document-ID performs better because its I/O load is more balanced. Under a uniform query model, partition by term-ID performs better.
Query length:
In a uniform query environment, partition by term-ID is about twice as fast for long queries and 5-10 times as fast for short queries.
Number of disks:
Adding disks improves the performance of the partition by document-ID scheme at a higher rate, since its I/O load is more evenly distributed.
Conclusion: Partition by term-ID performs better under uniform query models, but its response time fluctuates widely depending on the terms in the query. With partition by document-ID, response time varies little in almost all cases.
11
Paper B: Methodologies for Distributed IR - Objective
This paper appears in the proceedings of the 18th International Conference on Distributed Computing Systems (1998).
It discusses three different methodologies for Distributed IR and compares their effectiveness, efficiency, and response time.
Methodologies for Distributed Information Retrieval, by Alistair Moffat, Justin Zobel, Owen de Kretser, Tim Shimmin [1998]
12
Paper B: Methodologies for Distributed IR
"Parallel Text Search Methods", a 1988 paper by Salton and Buckley [701], makes interesting comments about an early implementation of Parallel IR, challenging its effectiveness and efficiency.
Moffat and Zobel, in this paper, conclude that Distributed IR can be fast and effective, but agree with Salton and Buckley that it is not efficient.
[We will see why it is not efficient in the coming slides]
13
Paper B: Distributed IR Model
Librarian: an individual node that has its own sub-collection, maintains the index for that sub-collection, evaluates queries, and fetches documents.
Receptionist: provides the user interface, posts user queries to all or a subset of the librarians, merges the results from the librarians, and generates the final ranked list of results using global information.
After global ranking by the receptionist, many of the documents returned by a librarian may never be presented to the user. The resources spent calculating similarity for those unwanted documents and transmitting them are wasted, so efficiency is low in the distributed model.
[Figure: Distributed IR model. Labels: User query, User-Interface, Receptionist (global information about Librarians), Result, and two Librarians, each with an Index and a Sub-Collection.]
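The waste described above can be sketched as follows. The librarian names, scores, and the helper `merge_results` are hypothetical, and the sketch assumes the similarity scores returned by different librarians are directly comparable.

```python
import heapq

# Hypothetical per-librarian results: (similarity score, document id), best first.
LIBRARIAN_RESULTS = {
    "lib_a": [(0.9, "a1"), (0.4, "a2"), (0.1, "a3")],
    "lib_b": [(0.8, "b1"), (0.7, "b2"), (0.2, "b3")],
}

def merge_results(results, k):
    """Receptionist step: merge all returned lists and keep the top k overall."""
    all_hits = [hit for hits in results.values() for hit in hits]
    return heapq.nlargest(k, all_hits)

top = merge_results(LIBRARIAN_RESULTS, k=3)
returned = sum(len(hits) for hits in LIBRARIAN_RESULTS.values())
# Documents that were scored and transmitted but never shown to the user.
wasted = returned - len(top)
```

Here the librarians return six documents in total, but only three survive the global ranking; the other three were scored and transmitted for nothing.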
14
Paper B: Distributed IR methodologies
Three different methodologies are defined based on the global information stored at the receptionist.
Central Nothing (CN): The only global information maintained by the receptionist is a list of librarians.
Central Vocabulary (CV): The global information stored by the receptionist is the vocabularies of the sub-collections.
Central Index (CI): The receptionist has full access to the indexes of the sub-collections.
15
Paper B: Central Nothing–Distributed IR
Advantage:
Little or no storage space is required for global information at the receptionist.
Simple implementation.
Disadvantage:
The receptionist has no basis for excluding any sub-collection, so every sub-collection processes the query in full.
Final ranking quality is poor: a term might be common in one sub-collection and be assigned a minimal weight there, yet be rare in the collection as a whole. When results from different sub-collections are merged, there is no basis for a collection-wide ranking.
Global information: List of librarians.
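A small worked example shows how local and collection-wide term weights can diverge. It assumes a plain inverse document frequency weight, log(N / df), which may differ from the weighting used in the paper; the sub-collection statistics are made up.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms receive larger weights."""
    return math.log(num_docs / doc_freq)

# Hypothetical statistics for one term:
# librarian -> (documents in sub-collection, documents containing the term).
SUB_COLLECTIONS = {
    "lib_a": (1000, 900),  # the term is common in this sub-collection
    "lib_b": (9000, 100),  # the term is rare in this sub-collection
}

# Weight each librarian would assign using only its own statistics.
local_weights = {name: idf(n, df) for name, (n, df) in SUB_COLLECTIONS.items()}

# Weight computed from statistics of the collection as a whole.
total_docs = sum(n for n, _ in SUB_COLLECTIONS.values())
total_df = sum(df for _, df in SUB_COLLECTIONS.values())
global_weight = idf(total_docs, total_df)
```

The same term gets a near-zero weight at `lib_a` and a large weight at `lib_b`, and neither matches the collection-wide weight, so scores computed locally are not comparable when merged.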
16
Paper B: Central Vocabulary-Distributed IR
Advantage:
The receptionist can make a better choice of sub-collections for query distribution; sub-collections that contain none or few of the query terms can be avoided completely.
It gives a better global ranking than CN because it can use the central vocabulary.
Disadvantage:
More storage is required for storing the collection-wide vocabulary.
Global information: Vocabularies of all sub-collections.
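Librarian selection with a central vocabulary might look like the sketch below. The vocabulary contents, the helper `select_librarians`, and the `min_matches` threshold are hypothetical; the paper's actual selection procedure may differ.

```python
# Hypothetical central vocabulary held by the receptionist: librarian -> known terms.
CENTRAL_VOCABULARY = {
    "lib_a": {"disk", "index", "query"},
    "lib_b": {"network", "protocol"},
    "lib_c": {"query", "ranking"},
}

def select_librarians(vocabulary, query_terms, min_matches=1):
    """Return librarians whose vocabulary holds at least min_matches query terms."""
    terms = set(query_terms)
    return sorted(
        name for name, vocab in vocabulary.items()
        if len(terms & vocab) >= min_matches
    )

selected = select_librarians(CENTRAL_VOCABULARY, ["query", "index"])
strict = select_librarians(CENTRAL_VOCABULARY, ["query", "index"], min_matches=2)
```

With the default threshold the query skips `lib_b`, which contains none of the query terms; raising `min_matches` also skips librarians that contain only a few of them.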
17
Paper B: Central Index–Distributed IR
Advantage:
The receptionist can perform all index processing and request from the librarians only the documents required to make the final ranking.
Better selection of librarians.
Disadvantage:
More storage is required for storing the collection-wide vocabulary and index.
More preprocessing is required at the receptionist to request documents from librarians.
Global information: The receptionist has full access to the indexes of the sub-collections.
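A sketch of central index processing follows. The index layout, the additive scoring, and the helper `plan_requests` are assumptions for illustration only, not the method from the paper.

```python
from collections import Counter

# Hypothetical central index at the receptionist:
# term -> list of (librarian, document id, term weight).
CENTRAL_INDEX = {
    "disk": [("lib_a", "a1", 0.5), ("lib_b", "b1", 0.2)],
    "query": [("lib_a", "a1", 0.3), ("lib_b", "b2", 0.6)],
}

def plan_requests(index, query_terms, k):
    """Rank documents centrally, then group the top k by owning librarian."""
    scores = Counter()
    for term in query_terms:
        for librarian, doc_id, weight in index.get(term, []):
            # Assumption: a document's score is the sum of its term weights.
            scores[(librarian, doc_id)] += weight
    requests = {}
    for (librarian, doc_id), _ in scores.most_common(k):
        requests.setdefault(librarian, []).append(doc_id)
    return requests

requests = plan_requests(CENTRAL_INDEX, ["disk", "query"], k=2)
```

Only the documents that make the final ranking are requested, so `b1` is never fetched from `lib_b`.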
18
Paper A & Paper B comparison - Conclusion
Query processing:
Paper A (inverted file partitioning schemes): Breaks the query into keywords.
Paper B (methodologies for distributed IR): Sends the complete query to the librarians.
Document partitioning:
Paper A: May or may not partition the corpus.
Paper B: The corpus is partitioned.
Optimization:
Paper A: Attempts to optimize I/O.
Paper B: Attempts to optimize network delays and processing time.
Efficiency:
Paper A: More efficient model.
Paper B: Less efficient model.
19
Paper A and Paper B - URLs
Paper A: Inverted File Partitioning Schemes in Multiple Disk Systems, by Byeong-Soo Jeong and Edward Omiecinski (IEEE Transactions on Parallel and Distributed Systems, Vol. 6, Feb. 1995)
http://csdl.computer.org/comp/trans/td/1995/02/l0142abs.htm
Paper B: Methodologies for Distributed Information Retrieval, by Owen de Kretser, Alistair Moffat, Tim Shimmin, and Justin Zobel (Proceedings of the 18th International Conference on Distributed Computing Systems)
http://csdl.computer.org/comp/proceedings/icdcs/1998/8292/00/82920066abs.htm