«Full-text federated search of text-based digital libraries in peer-to-peer networks», Information Retrieval 2006, Springer

Jie Lu, Jamie Callan

Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Paper presentation:

Konstantinos Zaharis, Dept. of Comp. & Comm. Engineering, UTH

Paper Outline

• Introduction

• Overview / prior research

• Full-text federated search in p2p

• Test data

• Evaluation methodology – experimental settings

• Results

• Conclusions and future work

Introduction

• Federated ~ distributed

• Problem addressed: use of p2p nets as a search layer for text-based digital libraries (dl’s)

• Why p2p? They need no central authority (decentralized) and can connect heterogeneous, multi-vendor and lightly managed enterprise nets. In short, they are robust and scalable

Two types of environments in a p2p net

• Cooperative p2p environments: each provider gives its own accurate resource description to each neighbouring directory service (hub)

• Uncooperative p2p environments: each directory service independently runs query-based sampling to obtain sample documents from its neighbouring providers, and builds its own resource descriptions from those samples
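The sampling loop in the uncooperative case can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ToyProvider`, its `search(term, k)` method, the seed term and the stopping rule (`max_docs`) are all hypothetical, and query terms are simply drawn at random from the vocabulary seen so far.

```python
import random
from collections import Counter

class ToyProvider:
    """Hypothetical uncooperative provider: exposes only a search interface."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, term, k):
        # Return up to k documents that contain the query term.
        return [d for d in self.docs if term in d.split()][:k]

def query_based_sampling(provider, seed_term, max_docs=4, docs_per_query=2, rng=None):
    """Build a resource description (term -> frequency) from sampled documents."""
    rng = rng or random.Random(0)
    sampled, description = [], Counter()
    vocabulary, tried = [seed_term], set()
    while len(sampled) < max_docs:
        candidates = [t for t in vocabulary if t not in tried]
        if not candidates:
            break  # no unused query terms left
        term = rng.choice(candidates)
        tried.add(term)
        for doc in provider.search(term, docs_per_query):
            if doc not in sampled and len(sampled) < max_docs:
                sampled.append(doc)
                description.update(doc.split())
        # Later queries are drawn from terms seen in the sampled documents.
        vocabulary = sorted(description) or vocabulary
    return description, sampled

provider = ToyProvider([
    "p2p search hubs",
    "p2p query routing",
    "digital libraries search",
    "query sampling",
])
description, sampled = query_based_sampling(provider, "p2p")
print(description.most_common(3))
```

The directory service never sees the provider's index; the description is only as good as the sample, which is one plausible reason for the accuracy gap reported later for uncooperative environments.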

Overview

Distributed IR poses three main problems:

1. Resource representation: discover content areas covered by each dl

2. Resource selection: decide which dl’s are most appropriate for an information need based on their descriptions

3. Result merging: merge ranked retrieval results from a set of selected dl’s

Resource representation (prior research)

• STARTS cooperative protocol (Gravano et al., Proc. of ACM SIGMOD, 1997)

• Query-based sampling for uncooperative environments (Callan, 2000). Directly addresses the “hidden web” problem

Resource selection (prior research)

• Algorithms based on resource ranking (CORI, gGlOSS, Kullback-Leibler divergence based)

• Threshold for resource selection usually set to a heuristic value (e.g. 5 or 10)

Result merging (prior research)

• 1st approach: normalize resource-specific document scores into resource-independent document scores (CORI, SSL merging algorithms)

• 2nd approach: recalculate document scores at directory service (Kirsch algorithm – each resource provides summary statistics)

P2p network architecture

1. Clients (information consumers): issue requests (queries)

2. Servers (information providers, dl’s): route requests (query routing) to other servers (directory services) and respond to requests (retrieval)

3. Lower level leaf nodes: providers and consumers. Only connect to hubs

4. Upper level hub nodes: directory services. Connect with leaves and other hubs

5. Query routing is the unique/critical issue in p2p nets

Structured vs hierarchical p2p architecture

• Important distinction: structured architectures use a DHT (distributed hash table), which maps every data object to a key distributed across the peers. Hierarchical architectures, by contrast, automatically discover the contents of each dl (appropriate for dynamic, heterogeneous, privacy-protected nets)

• Hierarchical architectures support sophisticated search techniques that are not constrained to controlled or small vocabularies (more appropriate for full-text search). However, they are more complex and incur higher communication costs

• Common characteristic: construction of an overlay to organize peers for efficient query routing (semantic overlay networks)
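To make the DHT idea concrete, here is a minimal sketch of mapping a key to the peer responsible for it. It assumes a Chord-style ring (the first peer clockwise from the key's hash owns the key); the slides do not say which DHT scheme is meant, and the names `ring_position` and `responsible_peer` are illustrative.

```python
import hashlib
from bisect import bisect_left

def ring_position(name, bits=32):
    """Hash a peer name or data key onto a ring of 2**bits positions."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")

def responsible_peer(key, peers):
    """Return the first peer clockwise from the key's position on the ring."""
    ring = sorted((ring_position(p), p) for p in peers)
    positions = [pos for pos, _ in ring]
    # Wrap around to the first peer when the key hashes past the last one.
    index = bisect_left(positions, ring_position(key)) % len(ring)
    return ring[index][1]

peers = ["peer-a", "peer-b", "peer-c"]
print(responsible_peer("term:federated", peers))
```

A lookup of this kind works for exact keys only, which is why the slide contrasts it with hierarchical architectures for full-text search over open vocabularies.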

Existing implementations

• PlanetP: each peer uses a TF.IDF algorithm to decide which peers to contact for an information request (Cuenca-Acuna and Nguyen, 2002)

• pSearch: uses the semantic vector of each document (obtained through LSI) to distribute document indices in a structured p2p net (Tang et al., 2003)

Paper contribution

• Revise and adapt existing methods to solve these problems more efficiently in hierarchical p2p nets

• Develop new approaches (e.g. resource ranking)

• Distinguish between cooperative and uncooperative environments

• Support the thesis with extensive experimental results

Resource description (1)

• Format: a collection language model (lists of terms and frequencies along with corpus statistics)

• Resource: can be a single provider (dl), a hub (multiple connected providers) or a neighborhood (all peers reachable from a hub)

• Description of providers: cf slide #9

• Description of hubs: aggregation of the descriptions of neighboring providers (within 1 hop)
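A hub description could be built from provider descriptions roughly as below. The sketch assumes that "aggregation" means summing term frequencies and document counts, which the slides do not spell out; the dictionary layout (`terms`, `num_docs`) is likewise hypothetical.

```python
from collections import Counter

def aggregate_descriptions(provider_descriptions):
    """Sum term frequencies and document counts of the providers within one hop."""
    terms, num_docs = Counter(), 0
    for desc in provider_descriptions:
        terms.update(desc["terms"])   # adds counts term by term
        num_docs += desc["num_docs"]
    return {"terms": terms, "num_docs": num_docs}

hub = aggregate_descriptions([
    {"terms": {"p2p": 3, "search": 1}, "num_docs": 2},
    {"terms": {"search": 4}, "num_docs": 5},
])
print(hub["terms"]["search"], hub["num_docs"])  # 5 7
```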

Neighborhood description

• Routing indices: terms+freq+path to other docs (Crespo and Garcia-Molina, 2002b)

• Each hub calculates the resource description of its neighborhood and sends it to its neighboring hubs

• Total # of documents aggregated in exponential time

• Graph cycles must be detected and avoided because they affect the accuracy of the descriptions

Resource selection (2)

Query routing: directing queries to peers that are most likely to contain relevant documents. Cost proportional to # of messages carrying the query

• Flooding technique: accurate but inefficient (exponential # of query messages)

• Random forward technique: relatively efficient but inaccurate
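The cost difference between the two techniques can be seen by counting messages on a toy topology. This is an illustrative sketch only: the complete binary tree, the TTL of 3 and the function names are assumptions, not the paper's test setup.

```python
import random

def flood(graph, start, ttl):
    """Count query messages when every node forwards to all neighbours but the sender."""
    messages, seen = 0, {start}
    frontier = [(start, None)]
    for _ in range(ttl):
        next_frontier = []
        for node, sender in frontier:
            for nb in graph[node]:
                if nb == sender:
                    continue
                messages += 1
                if nb not in seen:  # duplicate queries are dropped on arrival
                    seen.add(nb)
                    next_frontier.append((nb, node))
        frontier = next_frontier
    return messages, seen

def random_forward(graph, start, ttl, rng):
    """Count messages when each node forwards to a single random neighbour."""
    messages, seen, node = 0, {start}, start
    for _ in range(ttl):
        node = rng.choice(graph[node])
        messages += 1
        seen.add(node)
    return messages, seen

# Complete binary tree with 15 nodes; node i has children 2i+1 and 2i+2.
graph = {i: [] for i in range(15)}
for parent in range(7):
    for child in (2 * parent + 1, 2 * parent + 2):
        graph[parent].append(child)
        graph[child].append(parent)

print(flood(graph, 0, ttl=3)[0])                             # 14 messages, all 15 nodes reached
print(random_forward(graph, 0, ttl=3, rng=random.Random(1))[0])  # 3 messages, at most 4 nodes reached
```

Flooding doubles its message count at every hop here, while random forwarding sends one message per hop but covers only a single path.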

Then what?

Resource ranking (full-text)

• Providers: use of K-L divergence resource ranking algorithm to calculate P(Pi | Q) (Si and Callan, 2004)

• Hubs: same as above with aggregation over selected neighborhoods

• After ranking, the idea is to select the top-ranked entities by either a) specifying a predetermined number (less suitable for dynamically changing nets) or b) letting the entities learn their own threshold values automatically and autonomously
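A rough sketch of ranking resources by P(Pi | Q) from their language-model descriptions follows. It uses a generic query-likelihood formulation with linear smoothing against a pooled background model and a uniform prior over resources; the smoothing weight `lam`, the prior and the function name are assumptions and are not taken from Si and Callan (2004).

```python
import math

def rank_resources(query_terms, descriptions, lam=0.5):
    """Return (resource, P(resource | query)) pairs, best first."""
    # Background model: term counts pooled over all resource descriptions.
    background, total = {}, 0
    for counts in descriptions.values():
        for term, c in counts.items():
            background[term] = background.get(term, 0) + c
            total += c
    log_scores = {}
    for name, counts in descriptions.items():
        size = sum(counts.values())  # assumes a non-empty description
        log_p = 0.0
        for q in query_terms:
            p_res = counts.get(q, 0) / size
            p_bg = background.get(q, 0) / total
            # Tiny floor keeps log() defined for terms unseen everywhere.
            log_p += math.log(lam * p_res + (1 - lam) * p_bg + 1e-12)
        log_scores[name] = log_p
    # Uniform prior, so the posterior is the normalised query likelihood.
    norm = sum(math.exp(s) for s in log_scores.values())
    posterior = {n: math.exp(s) / norm for n, s in log_scores.items()}
    return sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)

descriptions = {
    "A": {"p2p": 8, "search": 2},
    "B": {"music": 9, "search": 1},
}
ranking = rank_resources(["p2p", "search"], descriptions)
print(ranking[0][0])  # A
```

Option a) on the slide then amounts to taking `ranking[:k]` for a fixed `k`, whereas option b) would replace `k` with a score threshold learned by each hub or provider.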

Unsupervised threshold learning method

• Providers estimate ranking scores of relevant and non-relevant documents using the merged retrieval results of a set of training queries (set-based threshold learning)

• Hubs, however, use individual training queries for each member of their neighborhoods (individual-based threshold learning)

Result merging (3)

• Cooperative environments: use of the Kirsch algorithm (Kirsch, 1997), modified so that it no longer needs global statistics (lower cost)

• Uncooperative environments: no summary statistics are available, so adapt the Semi-Supervised Learning (SSL) algorithm (Si and Callan, 2003a). Use linear regression with local weights and overlapping documents
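The regression step for the uncooperative case can be sketched as below. For each provider, a line is fitted that maps provider-specific scores to hub-side reference scores using the overlapping documents, and all of that provider's results are then rescaled with it. This is a simplified illustration: it uses plain unweighted least squares (the local weighting mentioned on the slide is omitted), and `reference_scores`, `fit_linear` and `merge_results` are hypothetical names.

```python
def fit_linear(xs, ys):
    """Least-squares fit y = a * x + b (needs at least two distinct x values)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x

def merge_results(provider_results, reference_scores):
    """Map each provider's scores onto the reference scale, then sort globally."""
    merged = []
    for provider, results in provider_results.items():
        # Overlapping documents: returned by the provider and scored by the hub.
        overlap = [(s, reference_scores[d]) for d, s in results if d in reference_scores]
        a, b = fit_linear([x for x, _ in overlap], [y for _, y in overlap])
        merged.extend((doc, a * score + b, provider) for doc, score in results)
    return sorted(merged, key=lambda item: item[1], reverse=True)

provider_results = {
    "P1": [("a", 10.0), ("b", 6.0), ("c", 2.0), ("d", 8.0)],
    "P2": [("e", 0.5), ("f", 0.3), ("g", 0.1), ("h", 0.45)],
}
reference_scores = {"a": 1.0, "b": 0.6, "c": 0.2, "e": 0.75, "f": 0.55, "g": 0.35}
for doc, score, provider in merge_results(provider_results, reference_scores):
    print(doc, round(score, 2), provider)
```

Each provider in the example has three overlapping documents, matching the number of training points used in the experimental settings.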

A real P2P network search model

Source: “Full text federated search in P2P networks”, Lu J., PhD Dissertation, CMU 2007

Test data and evaluations

• Use of WT10g-based testbed collection

• # providers (websites) ~ 2500

• # hubs (content-based clustering) ~ 25

• # documents ~ 1500000

• Queries automatically generated (by extracting key terms from documents)

• Evaluation criteria: a) search accuracy and b) query routing efficiency

Experimental settings

• Four methods for resource selection
– Flooding
– Random selection
– Full-text selection using a fixed threshold (e.g. 1% of the top-ranked neighbouring hubs)
– Full-text selection using learned thresholds

• TTL (time-to-live) value for each query message set to 6

• Query-based sampling for resource representation in uncooperative environments

• # of training points to apply SSL method set to 3

Experimental results (numbers)

Experimental results show …

• Full-text selection performs better than flooding or random selection

• Using learned thresholds for resource selection yields a few more query messages (than using a fixed threshold) but improves search accuracy

• Uncooperative environments exhibit ~10% search performance degradation in comparison to cooperative ones, which is generally considered acceptable

Conclusions and future work

• Enhance hub functionality so that a hub not only provides sufficient information about its connected providers, but also calculates a path to other, potentially useful, peers (provider routing technique)

• Method works well for small/medium sized p2p nets with regulated network structures and organized content distribution. But what happens in larger-scale networks?

• What happens in nets that evolve dynamically over time? What about load balancing, dynamic clustering and fault tolerance?

Comments …

• Paper contained no intriguing ideas but proposed practical modifications to existing methods

• Writing style showed frequent repetition, verbosity and, often, vagueness

• The researchers are clearly more interested in better empirical results and tools for real-world applications than in theoretical models

• All references are taken from the reference list of the paper under discussion

• Any questions?

Thank you for your attention!
