View
219
Download
0
Category
Tags:
Preview:
Citation preview
Search Engine Caching
Rank-preserving two-level caching for scalable search engines, Paricia Correia Saraiva et al, September 2001 http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/383952.383959
Predictive Caching and Prefetching of Query Results in Search Engines, Ronny Lempel and Shlomo Moran, September 2001
http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/775152.775156
Presented by
Adam "So, is this gonna be on the test?" Edelman
The Problem
• The User: "I want my results now!"
• But...– Over 4 billion web pages– Over 1 million queries per minute
• How do we keep response times down as the web grows?
Search Engine Statistics
• 63.7% of the search phrases appear only once in the billion query log
• The 25 most popular queries in the log account for 1.5% of the submissions
• Considerable time and processing power can be saved through well implemented caching
Search Engine Statistics
• 58% of the users view only the first page of results (the top-10 results)
• No more than 12% of users browse through more than 3 result pages.
• We do not need to cache large result sets for a given query
What do we Cache?
• 36% of all queries have been retrieved before
• Can we apply caching even if the query does not exactly match any previous query?
What do we Cache?
• Saraiva et. al propose a two level cache
• In addition to caching query results, we also cache inverted lists for popular terms
Query Cache Implementation
• Store only the first 50 references per query– ~25KB per query
• Query logs show that the miss ratios do not drastically improve after query result cache exceeds 10 MB
Inverted List Cache Implementation• For this data set 50-75%
of inverted lists contain documents where term appears only once
• Use 4KB inverted list size per term– More work needs to be
done
• Asymptotic behavior is apparent after cache exceeds 200MB
• Use 250MB for IL Cache
Two-Level Cache Implementation
• Combine previous two caches
• 270MB total cache– Accounts for only 6.5% of overall index size
• Tested over a log of 100K queries to TodoBR
Two-Level Cache Results• Compared to caches of
270MB for only query results, only inverted lists and no cache
• Queries processed reduced by 62%– 21% increase compared to
only query result cache
• Page fetches from the database reduced 95%– 3% increase compared to
only inverted list cache
Two-Level Cache Results
• For more than 20 queries per second two-level cache is 20% disk reads of no cache
• Two-level cache can handle 64 queries per second against 22 per second with no cache
How do we cache?
• Saraiva et al use a least recently used (LRU) replacement policy for cache maintenance
• Users search in sessions, the next query will probably be related to the previous query
• Can we use this to improve caching?
Probability Driven Cache (PDC)
• Lempel and Moran propose a cache based on the probability of a page being requested
Page Least Recently Used (PLRU)
• Allocate a page queue that can accommodate a certain number of result pages
• When the queue is full and a new page needs to be cached, the least recently used page is removed from the cach
• Achieves hit ratios around 30% for warm, large caches
Page Segmented LRU (PSLRU)
• Maintains two LRU segments, a protected segment and a probationary segment
• Pages are first placed in the probationary segment, if requested again they are moved to the protected segment
• Pages evicted from the protected segment are moved to the probationary segment
• Pages evicted from the probationary segment are removed from the cache
• Consistently outperforms PLRU although difference is very small
Topic LRU (TLRU)
• Let t(q) denote the topic of the query q
• After the cache is warm, any cached result page of t(q) is moved to the tail of the queue.
• Each topic’s pages will reside contiguously in the queue
Topic SLRU (TSLRU)
• All pages are initially inserted in the probationary segment
• In addition to promoting pages from probationary to protected, we also promote all pages of t(q)
Recommended