«Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee» (Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007), A. Ntoulas, J. Cho. Paper presentation: Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH

Paper Outline

• Introduction
• Architecture
• Optimal size of pruned index
• Pruning policies
• Experimental evaluation
• Conclusions

Introduction

• Observation: approximately 80% of users examine at most the first 3 batches of results. That is, 80% of users typically view at most 30 to 60 results for every query that they issue to a search engine

• Contribution: a new answer computation algorithm which guarantees that the top-matching pages (according to the search engine's ranking metric) are always placed at the top of the search results, even though the first batch of answers is computed from the pruned index most of the time

Two-tier Index Architecture

Architecture

Definition 1 (Correctness indicator function): Given a query q, the p-index IP returns the answer A together with a correctness indicator function C. C is set to 1 if A is guaranteed to be identical (i.e. same results in the same order) to the result computed from the full index IF. If it is possible that A is different, C is set to 0

Question 1: How can we compute the correctness indicator function C?

Question 2: How should we prune the index so as to realize maximum cost saving?
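The role of the correctness indicator C in the two-tier architecture can be illustrated with a minimal sketch. The function names (`answer_query`, `toy_p_index`, `toy_full_index`) and the toy data are hypothetical and not taken from the paper; the sketch only assumes that the p-index returns the pair (A, C) from Definition 1 and that the full index is consulted when C = 0.

```python
from typing import Callable, List, Tuple

Answer = List[str]  # ranked list of document ids


def answer_query(
    query: str,
    p_index: Callable[[str], Tuple[Answer, int]],
    full_index: Callable[[str], Answer],
) -> Tuple[Answer, str]:
    """Try the pruned index first; fall back to the full index when C == 0."""
    answer, c = p_index(query)
    if c == 1:
        # A is guaranteed identical to the full-index answer.
        return answer, "p-index"
    # No guarantee: recompute the answer from the full index.
    return full_index(query), "full-index"


# Hypothetical stand-ins for the two tiers, for illustration only.
def toy_p_index(query: str) -> Tuple[Answer, int]:
    guaranteed = {"popular": (["d1", "d2"], 1)}
    return guaranteed.get(query, ([], 0))


def toy_full_index(query: str) -> Answer:
    return ["d7", "d8"]
```

With these stand-ins, a query the first tier can guarantee is served from the p-index, and any other query is forwarded to the full index.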

Pruned Index Size

• Observation: the cost of the two-tier architecture depends on two important parameters: the size of the p-index and the fraction of the queries that can be handled by the first-tier index alone

• Theorem 1: The cost for handling the query load Q is minimal when the size of the p-index, s, satisfies d f(s) / ds = 1, where s is the size of the p-index as a fraction of the full index and f(s) is the fraction of queries the p-index can answer
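One way to see where the condition d f(s) / ds = 1 comes from is the short derivation below. It rests on an assumed cost model that the slides do not spell out: the cost of a query is taken to be proportional to the size of the index that processes it, every query is first sent to the p-index (relative size s), and the fraction 1 - f(s) that cannot be guaranteed is then sent to the full index (relative size 1).

```latex
\begin{align*}
\mathrm{Cost}(s) &\propto |Q| \cdot s + |Q| \cdot \bigl(1 - f(s)\bigr) \\
\frac{d\,\mathrm{Cost}(s)}{ds} &\propto |Q| \left( 1 - \frac{d f(s)}{ds} \right) = 0
\quad\Longrightarrow\quad \frac{d f(s)}{ds} = 1
\end{align*}
```

Under this model the stationary point is a minimum when f(s) is concave, i.e. when enlarging the p-index yields diminishing gains in the fraction of answerable queries.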

Pruning Policies

1) Term (keyword)-locality: the search engine will be able to answer a significant fraction of user queries even if it can handle only a few popular keywords (possibly those that constitute the majority of the query load)

2) Document-locality: as long as search engines can compute the first few top-k answers correctly, users often will not notice that the search engine has not actually computed the correct answer for the remaining results (unless the users explicitly request them). This is what actual commercial search engines do!

Pruning Policies: ranking function assumptions

• The exact ranking function that search engines employ is a closely guarded secret

• Query-dependent relevance: captures how relevant each document is to the query (cosine similarity metric)

• Query-independent document quality: measures the overall “quality” of a document D independent of the particular query issued by the user (e.g. PageRank, HITS)

• The paper adopts as its ranking function a linear combination of the two factors above; the ranking function should be monotonic

Pruning Policies: term (horizontal) pruning

Problem 2 (optimal keyword pruning): Given the query load Q and a goal index size s · |IF | for the pruned index, select the inverted lists IP = {I(t1), . . . , I(th)} such that |IP| ≤ s · | IF | and the fraction of queries that IP can answer (expressed by f(s)) is maximized.

Theorem 2: The problem of calculating the optimal keyword pruning is NP-hard (proven reducible to knapsack or bin-packing problem)

Therefore the paper implements a greedy policy that keeps the items with the maximum benefit per unit cost
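A greedy policy of this kind can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: it assumes the benefit of a term is the number of queries in the load that contain it and the cost is the length of its inverted list. The names `greedy_keyword_pruning` and `answerable_fraction` and all input data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Query = Tuple[Set[str], int]  # (terms of the query, how often it was issued)


def greedy_keyword_pruning(
    list_sizes: Dict[str, int], queries: List[Query], s: float
) -> Set[str]:
    """Keep whole inverted lists in order of benefit per unit cost.

    list_sizes maps a term to the (positive) size of its inverted list;
    s is the target size of the p-index as a fraction of the full index.
    """
    budget = s * sum(list_sizes.values())
    # Assumed benefit of a term: total frequency of the queries containing it.
    benefit = {t: 0 for t in list_sizes}
    for terms, count in queries:
        for t in terms:
            if t in benefit:
                benefit[t] += count
    ranked = sorted(
        list_sizes, key=lambda t: benefit[t] / list_sizes[t], reverse=True
    )
    kept: Set[str] = set()
    used = 0
    for t in ranked:
        if used + list_sizes[t] <= budget:  # skip lists that no longer fit
            kept.add(t)
            used += list_sizes[t]
    return kept


def answerable_fraction(kept: Set[str], queries: List[Query]) -> float:
    """f(s): fraction of the query load whose terms are all in the p-index."""
    total = sum(count for _, count in queries)
    hit = sum(count for terms, count in queries if terms <= kept)
    return hit / total if total else 0.0
```

For example, with list sizes {"a": 10, "b": 50, "c": 40}, a load of 5 queries for "a", 3 for "a c" and 2 for "b", and s = 0.5, the sketch keeps the lists of "a" and "c" and can answer 80% of the load.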

Pruning Policies: document (vertical) pruning

Global and local document pruning algorithms. Neither guarantees the paper's basic assumption (Theorem 3)

Combined policy: extended keyword-specific document pruning

For every inverted list, the paper picks two threshold values. This policy, when combined with a suitably chosen monotonic ranking function, guarantees the paper's assumption (Theorem 4)
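The interplay between two per-list thresholds and a monotonic ranking function can be sketched for a single-keyword query. Everything specific here is an assumption made for illustration rather than the paper's stated algorithm: a posting is kept if its query-independent score pr exceeds a threshold `tau_p` or its term relevance tr exceeds `tau_t`, the ranking function is the sum pr + tr, and C is set conservatively. The names `prune_list` and `top_k_with_indicator` are hypothetical.

```python
from typing import List, Tuple

Posting = Tuple[str, float, float]  # (doc id, pr(D), tr(D, t))


def prune_list(postings: List[Posting], tau_p: float, tau_t: float) -> List[Posting]:
    """Keep a posting if either score exceeds its threshold (assumed rule)."""
    return [p for p in postings if p[1] > tau_p or p[2] > tau_t]


def top_k_with_indicator(
    pruned: List[Posting], tau_p: float, tau_t: float, k: int
) -> Tuple[List[str], int]:
    """Return the top-k doc ids (k >= 1) from the pruned list and the indicator C."""
    scored = sorted(
        ((pr + tr, doc) for doc, pr, tr in pruned),
        key=lambda item: (-item[0], item[1]),
    )
    top = scored[:k]
    # Every pruned posting has pr <= tau_p and tr <= tau_t, so with a
    # monotonic ranking function its score is at most tau_p + tau_t.
    bound = tau_p + tau_t
    # Conservative choice: C = 1 only if k results exist and the k-th one
    # strictly beats the best score any pruned posting could reach.
    c = 1 if len(top) == k and top[-1][0] > bound else 0
    return [doc for _, doc in top], c
```

With postings d1 (0.9, 0.8), d2 (0.2, 0.9), d3 (0.1, 0.1), d4 (0.3, 0.2) and both thresholds at 0.5, only d1 and d2 survive pruning. A top-2 request returns them with C = 1, because no pruned document can score above 1.0, while a top-3 request returns C = 0 and would be forwarded to the full index.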

Experimental setup

• Dataset: 130M pages crawled from the Web (in March 2004), with the ODP homepage as the seed

• Total uncompressed size of web pages: ~2TB

• Full inverted index size: ~1.2TB

• Query set available: ~450M queries [1], only a fraction of them (~5%) processed (average number of terms per query is 2)

• Selected ranking function: r(D, q) = prnorm(D) + trnorm(D, q)

[1] issued to the web site www.looksmart.com

Term vs document pruning performance

1) ~73% of the queries can be answered using 30% of the original index. Furthermore, using keyword pruning only, the optimal index size is s = 0.17

2) For all index sizes larger than 40%, the authors can guarantee the correct answer for about 70% of the queries. The optimal index size here (document pruning) is s = 0.20

3) For p-index sizes below 20% the two approaches are equivalent. For s > 20%, keyword pruning performs much better

Combination

1) First apply term-pruning and subsequently doc-pruning

2) For p-index sizes smaller than 50%, combination does relatively well

Conclusions

• The authors provide a framework for new pruning techniques and answer computation algorithms which guarantee that the top-matching pages are always placed at the top of the search results in the correct order

• A term-pruned index can guarantee correct answers for 73% of the queries at 30% of the size of the full index

• A document-pruned index can guarantee correct answers for 68% of the queries at the same size

• The combination of the two pruning algorithms guarantees correct answers for 60% of the queries with an index size of 16%

• Any questions?

Thank you for your attention!