Presentation Topic
Searching Distributed Collections With Inference Networks

Keywords: Searching, Distributed Collections, Inference Network

COSC 6341 - Information Retrieval
What is Distributed IR?
Homogeneous collections: a large single collection is partitioned and distributed over a network to improve efficiency (e.g., Google).
Heterogeneous collections: the Internet offers thousands of diverse collections available for searching on demand (e.g., P2P).
Architectural Aspects
P2P - peer-to-peer architecture
How to search such a collection? Consider the entire collection as a single large virtual collection.
[Diagram: collections C1-C6 grouped into one virtual collection]
Search each collection individually
[Diagram: separate searches Search1-Search6 are sent to collections C1-C6 of the virtual collection, and each collection returns its own result list Results1-Results6]
Get the results. But:
- How to merge the results?
- What are the communication costs?
- How much time is required?
Solution: an IR system must automatically do three things (a sketch of the full pipeline follows this list):
- Rank collections
- Collection selection: select the specific collections to be searched
- Merge the ranked results effectively
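To make these three steps concrete, here is a minimal Python sketch of the pipeline. The scoring used here is a toy stand-in of my own; the following slides replace it with CORI collection ranking and weighted-score merging.

```python
# Minimal sketch of the distributed-retrieval pipeline, with toy
# stand-ins for each step. Collections are dicts with a "terms" set
# and a "docs" mapping of doc_id -> set of terms (illustrative format).

def rank_collections(query, collections):
    # Toy collection score: how many query terms the collection contains.
    return sorted(collections,
                  key=lambda c: sum(t in c["terms"] for t in query),
                  reverse=True)

def search_collection(query, collection, n):
    # Toy document score: query-term overlap; return top-n (doc_id, score).
    scored = [(doc_id, sum(t in terms for t in query))
              for doc_id, terms in collection["docs"].items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n]

def distributed_search(query, collections, top_k=2, n=10):
    selected = rank_collections(query, collections)[:top_k]      # steps 1-2
    results = [search_collection(query, c, n) for c in selected] # step 3
    return sorted((r for rs in results for r in rs),             # step 4
                  key=lambda x: x[1], reverse=True)
```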
Ranking Collections
Ranking of collections can be addressed by an inference network. A CORI net (Collection Retrieval Inference Network) is used to rank collections, as opposed to the more common Document Retrieval Inference Network.
Inference Network
[Diagram: inference network with document node dj at the top, index term nodes k1, k2, ..., ki, ..., kt below it, query nodes q, q1 (built from AND/OR operators), and q2, and the information need node I at the bottom]
Document dj has index terms k1, k2, ..., ki. Query q is composed of the index terms k1, k2, ..., ki.
Boolean formulations: q1 = (k1 AND k2) OR ki, and q2 = (k1 AND k2).
Information need: I = q OR q1.
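As a toy illustration of the Boolean formulation above (my own example, not from the paper; the definition of the q node here is an assumption), the query nodes can be evaluated directly over a document's term set:

```python
# Toy evaluation of the Boolean query nodes above over one document.
doc_terms = {"k1", "k2"}     # index terms observed in document dj

k1 = "k1" in doc_terms
k2 = "k2" in doc_terms
ki = "ki" in doc_terms

q1 = (k1 and k2) or ki       # q1 = (k1 AND k2) OR ki
q2 = k1 and k2               # q2 = (k1 AND k2)
q = k1 or k2 or ki           # assumed "q" node: any query term matches
I = q or q1                  # information need I = q OR q1

print(q1, q2, I)             # True True True for this document
```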
Ranking Documents in an Inference Network
Using tf-idf strategies:
Term frequency: tf = f(i, j) = P(ki | dj)
Inverse document frequency: idfi = P(q | ~ki)
In an inference network, the rank of a document is computed as:
P(q ∧ dj) = Cj * (1 / |dj|) * Σi [ f(i, j) * idfi * 1 / (1 - f(i, j)) ]
where,
P(ki | dj) = influence of keyword ki on document dj
P(q | ~k) = influence of the index terms ki on the query node
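A small sketch of this ranking computation, taking the slide's formula at face value (the constant Cj and the dictionary-based inputs are illustrative assumptions):

```python
# Sketch of the inference-network document score from the formula above:
# P(q AND dj) = Cj * (1/|dj|) * sum_i f(i,j) * idf_i * 1/(1 - f(i,j))

def inference_net_score(query_terms, doc_terms, idf, cj=1.0):
    # doc_terms: {term: raw count in document}; idf: {term: idf_i}
    doc_len = sum(doc_terms.values())             # |dj|
    if doc_len == 0:
        return 0.0
    score = 0.0
    for t in query_terms:
        f = doc_terms.get(t, 0) / doc_len         # f(i, j): normalized tf
        f = min(f, 0.99)                          # keep 1/(1 - f) finite
        score += f * idf.get(t, 0.0) / (1.0 - f)
    return cj * score / doc_len
```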
Analogy between a Document Retrieval and a Collection Retrieval Network

Documents:
- tf: term frequency, the number of occurrences of a term in a document
- idf: inverse document frequency, a function of the number of documents containing the term
- dl: document length
- D: number of documents

Collections:
- df: document frequency, the number of documents in a collection containing the term
- icf: inverse collection frequency, a function of the number of collections containing the term
- cl: collection length
- C: number of collections

The tf-idf scheme for documents corresponds to the df-icf scheme for collections.
Comparison between Document Retrieval and Collection Retrieval Inference Networks
- Document Retrieval Inference Network: retrieves documents based on a query.
- Collection Retrieval Inference Network (CORI net): retrieves collections based on a query.

Why use an inference network? One system is used for ranking both documents and collections.
Document retrieval becomes a simple process:
1. Use the query to retrieve a ranked list of collections.
2. Select the top group of collections.
3. Search the top group of collections, in parallel or sequentially.
4. Merge the results from the various collections into a single ranking.
To the retrieval algorithm, a CORI net looks like a document retrieval inference network with very big documents: each "document" is a surrogate for a complete collection.
Interesting facts:
- A CORI net for a 1.2 GB collection occupies about 5 MB (0.4% of the original collection).
- A CORI net built over the well-known CACM collection (about 3,000 documents) shows high values of df and icf, but this does not affect the computational complexity of retrieval.
Experiments on the TREC Collection
The belief P(rk | ci) in collection ci due to observing term rk is given by:
T = d_t + (1 - d_t) * log(df + 0.5) / log(max_df + 1.0)
I = log((|C| + 0.5) / cf) / log(|C| + 1.0)
P(rk | ci) = d_b + (1 - d_b) * T * I
where,
df = number of documents in ci containing term rk
max_df = number of documents containing the most frequent term in ci
|C| = number of collections
cf = number of collections containing term rk
d_t = minimum term frequency component when term rk occurs in collection ci
d_b = minimum belief component when term rk occurs in collection ci
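A direct transcription of these formulas into Python, as a sketch (the default values for d_t and d_b are illustrative assumptions, not values from the paper):

```python
import math

# Sketch of the CORI belief P(rk | ci) from the formulas above.

def cori_belief(df, max_df, cf, num_collections, d_t=0.4, d_b=0.4):
    # T: term-frequency component, scaled by the collection's most
    # frequent term.
    T = d_t + (1 - d_t) * math.log(df + 0.5) / math.log(max_df + 1.0)
    # I: inverse collection frequency component.
    I = (math.log((num_collections + 0.5) / cf)
         / math.log(num_collections + 1.0))
    return d_b + (1 - d_b) * T * I

# Example: the term appears in 120 of ci's documents (max_df = 900)
# and in 3 of 10 collections.
print(cori_belief(df=120, max_df=900, cf=3, num_collections=10))
```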
Effectiveness
This approach to ranking collections was evaluated using the INQUERY retrieval system and the TREC collection. Experiments were conducted with 100 queries (TREC Volume 1, Topics 51-150).
The mean squared error of the collection ranking for a single query is calculated as:
(1 / |C|) * Σ i∈C (Oi - Ri)^2
where,
Oi = optimal rank for collection i, based on the number of relevant documents it contains
Ri = the rank for collection i determined by the retrieval algorithm
C = the set of collections being ranked
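This error measure is straightforward to compute; a short sketch (the example rankings are my own):

```python
# Mean squared difference between the optimal collection ranks and the
# ranks the retrieval algorithm produced, per the formula above.

def ranking_mse(optimal_ranks, system_ranks):
    assert len(optimal_ranks) == len(system_ranks)
    n = len(optimal_ranks)
    return sum((o - r) ** 2 for o, r in zip(optimal_ranks, system_ranks)) / n

print(ranking_mse([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 (perfect ranking)
print(ranking_mse([1, 2, 3, 4], [2, 1, 3, 4]))  # 0.5 (one swapped pair)
```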
Results: the mean squared error averaged over the first 50 queries was 2.3471. The ranking was perfect for almost 75% of the queries; for the remaining 25% the ranks were disorganized.
Reason: ranking collections is not exactly the same as ranking documents.
Scaling df by max_df penalizes collections that contain only small sets of interesting documents. So, the modification: scale df by df + K instead, where
K = k * ((1 - b) + b * cw / c̄w)
cw = number of words in the collection, c̄w = mean number of words over the collections, and k, b are constants (b ∈ [0, 1]).
Thus, T = d_t + (1 - d_t) * df / (df + K)
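A sketch of the modified term-frequency component (the d_t default is an assumption; b = 0.75 and k = 200 are the best-performing constants reported on the next slide):

```python
# Modified T component: the max_df scaling is replaced by the
# collection-length normalizer K = k * ((1 - b) + b * cw / mean_cw).

def modified_T(df, cw, mean_cw, d_t=0.4, k=200, b=0.75):
    K = k * ((1 - b) + b * cw / mean_cw)
    return d_t + (1 - d_t) * df / (df + K)
```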
Modified results
- Best combination of b and k: b = 0.75 and k = 200.
- Mean squared error averaged over the 50 queries = 1.4586 (38% better than the previous result).
- Rankings for 30 queries improved.
- Rankings for 8 queries changed slightly.
- Rankings for 12 queries did not change.
Merging the Results
Four approaches, in order of increasing effectiveness:
- Interleaving
- Raw scores
- Normalized scores
- Weighted scores
[Diagram: six collections C1 (D1-D10), C2 (D11-D20), C3 (D21-D30), C4 (D31-D40), C5 (D41-D50), C6 (D51-D60) viewed as a single large homogeneous collection D1-D60. Step 1: rank the collections. Step 2: merge the per-collection results (e.g., D1, D5, D7; D44, D37; D60, D52, D57, D59) by interleaving ranks 1, 2, 3, ..., giving D1, D60, D44, D52, D5, D37, D57, D7, D59.]
This scheme is not satisfying, because we only have document ranks and collection ranks to work with, not comparable scores.
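A minimal sketch of rank interleaving (the round-robin order across collections is an assumption; the slides do not fix one):

```python
# Rank interleaving: take the rank-1 document from every collection,
# then the rank-2 documents, and so on.
from itertools import zip_longest

def interleave(result_lists):
    merged = []
    for rank_group in zip_longest(*result_lists):   # rank 1, rank 2, ...
        merged.extend(d for d in rank_group if d is not None)
    return merged

# Using the per-collection lists from the diagram above:
print(interleave([["D1", "D5", "D7"],
                  ["D44", "D37"],
                  ["D60", "D52", "D57", "D59"]]))
# ['D1', 'D44', 'D60', 'D5', 'D37', 'D52', 'D7', 'D57', 'D59']
```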
Raw Score Merging
In the same example, the collections return raw document scores (e.g., D1: 10, D5: 12, D7: 37; D44: 29, D37: 69; D60: 90, D52: 32, D57: 25, D59: 1). Sorting all results by raw score gives the merged ranking D60 (90), D37 (69), D7 (37), D52 (32), D44 (29), D57 (25), D5 (12), D1 (10), D59 (1).
But again, these scores from different collections may not be directly comparable...
Normalized Scores
In an inference network, normalizing scores requires a preprocessing step prior to query evaluation.
Preprocessing: the system obtains from each collection statistics about how many documents each query term matches. The statistics are merged to obtain a normalized idf (a global weighting scheme).
Problem: high communication and computational costs (especially in a widely distributed network).
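A sketch of that preprocessing step, assuming each collection can report its per-term document counts (the message format and the idf formula used here are my own assumptions):

```python
import math

# Global idf normalization: each collection reports how many of its
# documents contain each query term; the client merges the counts into
# one idf that is comparable across collections.

def global_idf(query_terms, per_collection_stats, total_docs):
    # per_collection_stats: one {term: doc_count} dict per collection
    merged = {t: sum(stats.get(t, 0) for stats in per_collection_stats)
              for t in query_terms}
    return {t: math.log(total_docs / df) if df else 0.0
            for t, df in merged.items()}

idf = global_idf(["network"],
                 [{"network": 40}, {"network": 10}],
                 total_docs=10000)
print(idf)   # {'network': log(10000/50)} ≈ {'network': 5.298}
```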
Weighted Scores
Weights can be based on a document's score and/or the collection ranking information; this offers computational simplicity. A weight w scales the results from the different collections:
w = 1 + |C| * ((s - s̄) / s̄)
where,
|C| = the number of collections searched
s = the collection's score (not its rank)
s̄ = the mean of the collection scores
Assumption: similar collections have similar weights.
rank(document) = score(document) * weight(collection)
This method favors documents from collections with high scores, but also allows a good document from a poor collection to rank well, which is what we are looking for.
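A sketch of weighted-score merging under these formulas (the example inputs are my own):

```python
# Weighted-score merging: scale each document score by its collection's
# weight w = 1 + |C| * (s - s_mean) / s_mean, then sort globally.

def weighted_merge(results_by_collection, collection_scores):
    num = len(collection_scores)
    s_mean = sum(collection_scores) / num
    merged = []
    for results, s in zip(results_by_collection, collection_scores):
        w = 1 + num * (s - s_mean) / s_mean     # collection weight
        merged.extend((doc, score * w) for doc, score in results)
    return sorted(merged, key=lambda x: x[1], reverse=True)

# Example: collection scores 0.6 and 0.4 give weights 1.4 and 0.6,
# so D44 (29 * 0.6 = 17.4) still outranks D5 (12 * 1.4 = 16.8).
print(weighted_merge([[("D1", 10), ("D5", 12)], [("D44", 29)]],
                     [0.6, 0.4]))
```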
Comparing results of the merging methods
Average precision (11-point recall) by merge technique:
  Normalized: 37.76
  Interleaved: 17.54
  Raw Score: 33.64
  Weighted: 38.18
R-precision (exact precision values) by merge technique:
  Normalized: 41.96
  Interleaved: 25.42
  Raw Score: 38.67
  Weighted: 42.46
Source: TREC Collection Volume 1, Topics 51-100.
Merging Results - Pros and Cons
Interleaving: extremely ineffective, with large losses in average precision. (The reason is that high-ranked documents from non-relevant collections end up next to high-ranked documents from more relevant collections.)
Raw scores: scores from different collections may not be directly comparable (e.g., idf weights differ), so a term that is common within one collection is penalized there even when it is rare elsewhere.
Normalized scores: most closely resembles searching a single collection, but normalizing has significant communication and computational costs when collections are distributed across a wide-area network.
Weighted scores: as effective as normalized scores, but less robust (it introduces deviations in recall and precision).
Collection Selection
Approaches (the first two are sketched in code below):
- Top n collections
- Any collection with a score greater than some threshold
- Top group (based on clustering)
- Cost-based selection
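Toy sketches of the first two selection policies (the scored input format is an assumption; the scores could come from the CORI belief function above):

```python
# scored: a list of (collection, score) pairs.

def select_top_n(scored, n):
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n]

def select_above_threshold(scored, threshold):
    return [(c, s) for c, s in scored if s > threshold]

scored = [("C1", 0.66), ("C2", 0.41), ("C3", 0.58)]
print(select_top_n(scored, 2))              # [('C1', 0.66), ('C3', 0.58)]
print(select_above_threshold(scored, 0.5))  # [('C1', 0.66), ('C3', 0.58)]
```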
Results
Difference between searching the 2 best clusters and searching all 7 collections:

                     Topics 51-100    Topics 101-150
                     (< 5% loss)      (> 5% loss)
R-precision          -0.9%            -5.6%
11-point precision   -2.2%            -9.1%
Document cut-off     500-1000         200

Eliminating collections reduces recall.
Related work on collection selection: Group Manually and Select Manually
- Collections are organized into groups with a common theme, e.g., financial, technology, appellate court decisions.
- Users select which group to search. Found in commercial service providers, and used by experienced users such as librarians.
- Groupings are determined manually.
(-) Time consuming, inconsistent groupings, coarse groupings, not good for unusual information needs.
Groupings determined automatically
- Broker agents maintain a centralized cluster index by periodically querying the collections on each subject.
(+) Automatic creation, better consistency.
(-) Coarse groupings, not good for unusual information needs.
Rule-Based Selection
- The contents of each collection are described in a knowledge base, and a rule-based system selects the collections for a query.
- EXPERT CONIT, a research system, was tested on static and homogeneous collections.
(-) Time consuming to create.
(-) Inconsistent selection if rules change.
(-) Coarse groupings, so not good for unusual information needs.
Optimization: Represent a Collection With a Subset of Terms
- Build the inference network with only the most frequent terms.
- At least the 20% most frequent words must be included.

Proximity Information
- Proximity of terms can be handled by a CORI net, but a CORI net with proximity information for one collection would be about 30% of the size of the original collection (versus the 0.4% we had before).
Retrieving Few Documents
Usually a user is interested in only the first 10, or at most 20, results; the rest are discarded (a waste of resources and time!). Example:
[Diagram: three collections C1 (10 docs), C2 (20 docs), C3 (30 docs) each return their top 10 results; merging the results and selecting the top 10 merged results for the user means 20 documents are thrown out without the user ever looking at them.]
Collections C = 3; ranks of interest n = 10; docs retrieved = C * n = 3 * 10 = 30; docs discarded = (C - 1) * n = (3 - 1) * 10 = 20.
Experiments and Results
The number of documents R(i) retrieved from the ith ranked collection is:
R(i) = M * n * 2(1 + C - i) / (C * (C + 1))
where M ∈ [1, (C + 1) / 2], and M * n is the total number of documents to be retrieved from all collections.
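A sketch of this allocation formula; note that the per-collection fractions it produces for C = 5, M = 2 match the result summary below, and that the allocations sum to M * n:

```python
# Retrieve more documents from higher-ranked collections; the weights
# 2(1 + C - i) / (C(C + 1)) sum to 1 over i = 1..C.

def docs_from_collection(i, C, M, n):
    return M * n * 2 * (1 + C - i) / (C * (C + 1))

C, M, n = 5, 2, 100
alloc = [docs_from_collection(i, C, M, n) for i in range(1, C + 1)]
print([round(a / n, 2) for a in alloc])  # [0.67, 0.53, 0.4, 0.27, 0.13]
print(round(sum(alloc), 6))              # 200.0, i.e. M * n
```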
Result Summary
Collection: TREC Volume 1; C = 5, M = 2

Collection        C1      C2      C3      C4      C5
Docs retrieved    0.67n   0.53n   0.40n   0.27n   0.13n
(n = number of results reported)
For n = 1000: documents retrieved before = 4050, after = 2000, a savings of 51%.
- M = 1: same results for ranks 1-200
- M = 2: identical results for ranks 1-500
- M = 3: better results for ranks 500-1000
Conclusions
- Representing collections by terms and frequencies is effective; controlled vocabularies and schemas are not necessary.
- Collections and documents can be ranked with one algorithm (using different statistics), e.g., GlOSS, inference networks.
- Rankings from different collections can be merged efficiently:
  - with precisely normalized scores (Infoseek's method), or
  - without precisely normalized document scores,
  - with only minimal effort, and
  - with only minimal communication between client and server.
- Large-scale distributed retrieval can be accomplished now.
Open Problems
- Multiple representations: stemming, stopwords, query processing, indexing
- Cheating / spamming
- How to integrate relevance feedback, query expansion, and browsing
References:
Primary source:
- Searching Distributed Collections With Inference Networks. James P. Callan, Zhihong Lu, W. Bruce Croft.
Secondary sources:
1. Distributed Information Retrieval. James Allan, University of Massachusetts Amherst.
2. Methodologies for Distributed Information Retrieval. Owen de Kretser, Alistair Moffat, Tim Shimmin, Justin Zobel. In Proceedings of the 18th International Conference on Distributed Computing Systems.
Questions ??