«Full-text federated search of text-based digital libraries in peer-to-peer networks», Information Retrieval 2006, Springer
Jie Lu, Jamie Callan
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Paper presentation:
Konstantinos Zaharis, Dept. of Comp. & Comm. Engineering, UTH
Paper Outline
• Introduction
• Overview / prior research
• Full-text federated search in p2p
• Test data
• Evaluation methodology – experimental settings
• Results
• Conclusions and future work
Introduction
• Federated ~ distributed
• Problem addressed: use of p2p nets as a search layer for text-based digital libraries (dl's).
• Why p2p? Because they need no central authority (decentralized) and they connect heterogeneous, multi-vendor and lightly managed enterprise nets. In short, they are robust and scalable
Two types of environments in a p2p net
• Cooperative p2p environments: each provider gives its own accurate resource description to each neighbouring directory service (hub)
• Uncooperative p2p environments: each directory service independently conducts query-based sampling to obtain sample documents from its neighbouring providers, and builds its own resource descriptions from those samples
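A minimal sketch of query-based sampling in the uncooperative case, in Python. It is an illustration under assumptions, not the paper's implementation: `search` is a hypothetical callable standing in for a provider's public query interface, and the sampling limits are arbitrary defaults.

```python
import random
from collections import Counter

def query_based_sampling(search, seed_term, max_docs=300, docs_per_query=4,
                         max_queries=1000, rng=None):
    """Build a resource description by probing a provider with one-term queries.

    `search(term, k)` is a hypothetical interface returning up to k
    (doc_id, text) pairs; all limits here are assumed values.
    """
    rng = rng or random.Random(0)
    description = Counter()  # term -> frequency over the sampled documents
    seen = set()             # ids of documents already sampled
    term = seed_term
    for _ in range(max_queries):
        if len(seen) >= max_docs:
            break
        for doc_id, text in search(term, docs_per_query):
            if doc_id not in seen:
                seen.add(doc_id)
                description.update(text.lower().split())
        if not description:
            break  # the seed term retrieved nothing, so there is no next query term
        # The next probe is a term drawn from what has been learned so far.
        term = rng.choice(sorted(description))
    return description, len(seen)

# Toy provider used only to exercise the sketch.
toy_docs = {
    1: "peer to peer search",
    2: "digital library search",
    3: "query routing in peer networks",
}

def toy_search(term, k):
    hits = [(i, t) for i, t in sorted(toy_docs.items()) if term in t.split()]
    return hits[:k]

description, sampled = query_based_sampling(toy_search, "search", max_docs=3)
```

The hub never sees the provider's index; it only sees the documents that its own probe queries happen to retrieve, which is why the resulting description is an estimate.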
Overview
Distributed IR poses three main problems:
1. Resource representation: discover content areas covered by each dl
2. Resource selection: decide which dl’s are most appropriate for an information need based on their descriptions
3. Result merging: merge ranked retrieval results from a set of selected dl’s
Source representation (prior research)
• STARTS cooperative protocol (Gravano et al., Proc. of ACM SIGMOD, 1997)
• Query-based sampling for uncooperative environments (Callan, 2000). Directly refers to “hidden web” problem
Source selection (prior research)
• Algorithms based on resource ranking (CORI, gGlOSS, Kullback-Leibler divergence based)
• Threshold for resource selection usually set to a heuristic value (e.g. 5 or 10)
Result merging (prior research)
• 1st approach: normalize resource-specific document scores into resource-independent document scores (CORI, SSL merging algorithms)
• 2nd approach: recalculate document scores at directory service (Kirsch algorithm – each resource provides summary statistics)
P2p network architecture
1. Clients (information consumers): issue requests (queries)
2. Servers (information providers, dl’s): route requests (query routing) to other servers (directory services) and respond to requests (retrieval)
3. Lower level leaf nodes: providers and consumers. Only connect to hubs
4. Upper level hub nodes: directory services. Connect with leaves and other hubs
5. Query routing is the unique/critical issue in p2p nets
Structured vs hierarchical p2p architecture
• Important distinction: a structured architecture uses a DHT (distributed hash table), which maps every data object to a distributed key. Hierarchical architectures, by contrast, automatically discover the contents of each dl (appropriate for dynamic, heterogeneous, privacy-protected nets)
• Hierarchical architectures support sophisticated search techniques that are not constrained to controlled or small vocabularies (more appropriate for full-text search). However, they are more complex and incur higher communication costs
• Common characteristic: construction of an overlay to organize peers for efficient query routing (semantic overlay networks)
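To make the DHT side of the distinction concrete, here is a minimal Python sketch of key-to-peer mapping on a hash ring. It is a generic consistent-hashing illustration, not the scheme of any particular system; the peer names and the 32-bit ring size are assumptions.

```python
import hashlib
from bisect import bisect_left

def ring_position(key, bits=32):
    """Hash a string onto a ring of size 2**bits (ring size is an assumed value)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

def responsible_peer(key, peers):
    """Return the first peer at or after the key's position, wrapping around."""
    ring = sorted((ring_position(p), p) for p in peers)
    positions = [pos for pos, _ in ring]
    index = bisect_left(positions, ring_position(key)) % len(ring)
    return ring[index][1]

# Hypothetical peer names, used only for illustration.
peers = ["peer-a", "peer-b", "peer-c", "peer-d"]
owner = responsible_peer("library", peers)
```

A lookup like this answers "which peer holds this exact key" and nothing more, which is one way to see why full-text ranking over an open vocabulary is a better fit for the hierarchical design.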
Existing implementations
• PlanetP: each peer uses a TF.IDF algorithm to decide which peers to contact for an information request (Cuenca-Acuna and Nguyen, 2002)
• pSearch, uses the semantic vector of each document (through LSI) to distribute document indices in a structured p2p net (Tang et al., 2003)
Paper contribution
• Revise and adapt methods to solve more efficiently the problems in hierarchical p2p nets
• Develop new approaches (e.g. resource ranking)
• Discriminate between cooperative and uncooperative environments
• Support thesis by extended experimental results
Resource description (1)
• Format: a collection language model (lists of terms and frequencies along with corpus statistics)
• Resource: can be a single provider (dl), a hub (multiple connected providers) or a neighborhood (all peers reachable from a hub)
• Description of providers: cf slide #9
• Description of hubs: aggregation of descriptions of neighboring providers (within 1 hop)
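The description format and the hub aggregation can be sketched in Python as follows. This is a simplified reading of the slide: the dictionary field names are assumptions, and whitespace tokenization stands in for real text processing.

```python
from collections import Counter

def describe_provider(documents):
    """Collection language model: term frequencies plus simple corpus statistics."""
    tf = Counter()
    for text in documents:
        tf.update(text.lower().split())
    return {"tf": tf, "num_docs": len(documents), "num_terms": sum(tf.values())}

def describe_hub(provider_descriptions):
    """Aggregate the descriptions of the providers one hop away from a hub."""
    tf = Counter()
    num_docs = 0
    for description in provider_descriptions:
        tf.update(description["tf"])
        num_docs += description["num_docs"]
    return {"tf": tf, "num_docs": num_docs, "num_terms": sum(tf.values())}

p1 = describe_provider(["a b", "b c"])
p2 = describe_provider(["c c"])
hub = describe_hub([p1, p2])
```

Because a hub description is just a sum of provider descriptions, the same ranking function can be applied to providers and hubs alike.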
Neighborhood description
• Routing indices: terms+freq+path to other docs (Crespo and Garcia-Molina, 2002b)
• Each hub calculates the resource description of its neighborhood and sends it to its hub neighbors
• Total # of documents aggregated in exponential time
• Graph cycles must be detected and avoided because they affect the accuracy of descriptions
Resource selection (2)
Query routing: directing queries to peers that are most likely to contain relevant documents. Cost proportional to # of messages carrying the query
• Flooding technique: accurate but inefficient (exponential # of query messages)
• Random forward technique: relatively efficient but inaccurate
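A back-of-the-envelope Python calculation of the cost difference, assuming a tree-shaped overlay without cycles in which every peer forwards the query to the same number of new neighbours. The fan-out of 4 is an assumed value; the TTL of 6 is the one used in the experiments.

```python
def query_messages(fanout, ttl):
    """Messages sent when each peer forwards a query to `fanout` new peers per hop."""
    return sum(fanout ** hop for hop in range(1, ttl + 1))

flooding = query_messages(fanout=4, ttl=6)        # every neighbour is contacted
random_forward = query_messages(fanout=1, ttl=6)  # one random neighbour per hop
print(flooding, random_forward)  # 5460 6
```

Flooding grows geometrically with the TTL, while random forwarding stays linear but reaches only a handful of peers.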
Then what?
Resource ranking (full-text)
• Providers: use of K-L divergence resource ranking algorithm to calculate P(Pi | Q) (Si and Callan, 2004)
• Hubs: same as above with aggregation over selected neighborhoods
• After ranking, the idea is to select the top-ranked entities by either a) specifying a predetermined number (less suitable for dynamically changing nets) or b) letting the entities learn their own threshold values automatically and autonomously
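A simplified Python sketch of language-model resource ranking in the spirit of the K-L divergence approach. The mixing weight, the uniform prior over resources, and the use of the pooled descriptions as the background model are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
from collections import Counter

def rank_resources(query_terms, descriptions, smoothing=0.5):
    """Rank resources by an estimate of P(resource | query).

    descriptions: resource name -> Counter of term frequencies.
    smoothing: assumed weight of the resource model against the background.
    """
    background = Counter()
    for tf in descriptions.values():
        background.update(tf)
    bg_total = sum(background.values())

    log_scores = {}
    for name, tf in descriptions.items():
        total = sum(tf.values())
        log_p = 0.0
        for term in query_terms:
            p_res = tf[term] / total if total else 0.0
            p_bg = background[term] / bg_total if bg_total else 0.0
            p = smoothing * p_res + (1 - smoothing) * p_bg
            log_p += math.log(p) if p > 0 else math.log(1e-12)
        log_scores[name] = log_p

    # Normalise the query likelihoods into a distribution (uniform prior assumed).
    top = max(log_scores.values())
    weights = {name: math.exp(s - top) for name, s in log_scores.items()}
    norm = sum(weights.values())
    ranked = [(name, w / norm) for name, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

descriptions = {
    "A": Counter({"peer": 5, "search": 5}),
    "B": Counter({"music": 9, "search": 1}),
}
ranking = rank_resources(["peer", "search"], descriptions)
selected = [name for name, p in ranking if p >= 0.5]  # 0.5 is an arbitrary cut-off
```

The last line shows option (b) in miniature: once scores are comparable probabilities, selection reduces to comparing each score against a threshold, which can then be fixed or learned.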
Unsupervised threshold learning method
• Providers estimate ranking scores of relevant and non-relevant documents using the merged retrieval results of a set of training queries (set-based threshold learning)
• Hubs, however, use individual training queries for each member of their neighborhoods (individual-based threshold learning)
Result merging (3)
• Cooperative environments: use of Kirsch algorithm (Kirsch, 1997) modified to the point that it no longer needs global statistics (fewer costs)
• Uncooperative environments: no summary statistics are available, so adapt the Semi-Supervised-Learning algorithm (Si and Callan, 2003a). Use linear regression with local weights and overlapping documents
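The regression idea for the uncooperative case can be sketched in Python as below. This is a plain least-squares illustration of score mapping through overlapping documents, not the modified SSL algorithm itself; the input shapes and the rule of skipping a provider with fewer than two overlapping documents are assumptions.

```python
def fit_line(points):
    """Least-squares fit of y = a * x + b over (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return 0.0, mean_y  # all x equal: fall back to a constant
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov / var_x
    return a, mean_y - a * mean_x

def merge_results(provider_results, reference_scores):
    """Map each provider's local scores onto a common scale, then merge.

    provider_results: provider -> list of (doc_id, local_score).
    reference_scores: doc_id -> comparable score, known only for overlapping
    documents (those that also appear in the merging peer's samples).
    """
    merged = []
    for results in provider_results.values():
        overlap = [(score, reference_scores[doc])
                   for doc, score in results if doc in reference_scores]
        if len(overlap) < 2:
            continue  # assumed simplification: too few training points to fit
        a, b = fit_line(overlap)
        merged.extend((doc, a * score + b) for doc, score in results)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

provider_results = {
    "X": [("x1", 10.0), ("x2", 6.0), ("x3", 2.0)],
    "Y": [("y1", 0.9), ("y2", 0.5), ("y3", 0.1)],
}
reference_scores = {"x1": 1.0, "x3": 0.2, "y1": 0.9, "y3": 0.1}
merged = merge_results(provider_results, reference_scores)
```

Each provider gets its own line, so scores on incompatible local scales become comparable without any provider having to share statistics.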
A real P2P network search model
Source: “Full text federated search in P2P networks”, Lu J., PhD Dissertation, CMU 2007
Test data and evaluations
• Use of WT10g-based testbed collection
• # providers (websites) ~ 2500
• # hubs (content-based clustering) ~ 25
• # documents ~ 1500000
• Queries automatically generated (by extracting key terms from documents)
• Evaluation criteria: a) search accuracy and b) query routing efficiency
Experimental settings
• Four methods for resource selection
– Flooding
– Random selection
– Full-text selection using a fixed threshold (e.g. 1% of the top-ranked neighbouring hubs)
– Full-text selection using learned thresholds
• TTL (time-to-live) value for each query message set to 6
• Query-based sampling for resource representation in uncooperative environments
• # of training points to apply SSL method set to 3
Experimental results (numbers)
Experimental results show …
• Full-text selection performs better than flooding or random selection
• Using learned thresholds for resource selection yields a few more query messages (than using a fixed threshold) but improves search accuracy
• Uncooperative environments exhibit ~10% search performance degradation in comparison to cooperative ones, which is generally considered acceptable
Conclusions and future work
• Enhance hub functionality so as not only to provide sufficient information for its connected providers, but also calculate a path to other, probably useful, peers (provider routing technique)
• Method works well for small/medium sized p2p nets with regulated network structures and organized content distribution. But what happens in larger-scale networks?
• What happens in dynamically/temporally evolved nets? What about load balancing, dynamic clustering and fault tolerance?
Comments …
• Paper contained no intriguing ideas but proposed practical modifications to existing methods
• Writing style showed frequent repetition, verbosity and often vagueness
• It is obvious that the researchers are more inclined towards better empirical results/tools for real-world applications than towards theoretical models
• All references are taken from the reference list of the paper under review
• Any questions?
Thank you for your attention!