Search Engine Caching: Rank-preserving two-level caching for scalable search engines, Patricia Correia Saraiva et al., September 2001

  • Search Engine Caching

    Rank-preserving two-level caching for scalable search engines, Patricia Correia Saraiva et al., September 2001

    Caching and Prefetching of Query Results in Search Engines, Ronny Lempel and Shlomo Moran, September 2001

    by Adam "So, is this gonna be on the test?" Edelman

  • The Problem

    The User: "I want my results now!"

    But...

    Over 4 billion web pages

    Over 1 million queries per minute

    How do we keep response times down as the web grows?

  • Search Engine Statistics

    63.7% of the search phrases appear only once in the billion-query log

    The 25 most popular queries in the log account for 1.5% of the submissions

    Considerable time and processing power can be saved through well-implemented caching

  • Search Engine Statistics

    58% of the users view only the first page of results (the top-10 results)

    No more than 12% of users browse through more than 3 result pages.

    We do not need to cache large result sets for a given query

  • What do we Cache?

    36% of all queries have been retrieved before

    Can we apply caching even if the query does not exactly match any previous query?

  • What do we Cache?

    Saraiva et al. propose a two-level cache

    In addition to caching query results, we also cache inverted lists for popular terms

  • Query Cache Implementation

    Store only the first 50 references per query

    ~25KB per query

    Query logs show that the miss ratios do not drastically improve after the query result cache exceeds 10 MB

  • Inverted List Cache Implementation

    For this data set, 50-75% of inverted lists contain documents where the term appears only once

    Use 4KB inverted list size per term

    More work needs to be done

    Asymptotic behavior is apparent after the cache exceeds 200MB

    Use 250MB for IL Cache

  • Two-Level Cache Implementation

    Combine previous two caches

    270MB total cache

    Accounts for only 6.5% of overall index size

    Tested over a log of 100K queries to TodoBR
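
    A minimal Python sketch of the two-level idea: a query result cache in front of an inverted list cache, both using LRU replacement. The class names, the in-memory `index` dict standing in for the on-disk index, the entry-count capacities (the slides size the caches in MB), and the doc-id ordering used in place of real ranking are all assumptions made for illustration, not the authors' implementation.

```python
from collections import OrderedDict


class LRUCache:
    """Small LRU map; capacity counts entries, not bytes (a simplification)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry


class TwoLevelCache:
    """Hypothetical two-level cache: query results first, inverted lists second."""

    def __init__(self, index, result_capacity, list_capacity, top_k=50):
        self.index = index  # assumed stand-in for the disk index: term -> doc ids
        self.results = LRUCache(result_capacity)
        self.lists = LRUCache(list_capacity)
        self.top_k = top_k  # the slides keep only the first 50 references
        self.disk_reads = 0

    def _inverted_list(self, term):
        postings = self.lists.get(term)
        if postings is None:  # level-2 miss: go to "disk"
            self.disk_reads += 1
            postings = self.index.get(term, [])
            self.lists.put(term, postings)
        return postings

    def search(self, query):
        key = " ".join(query.lower().split())
        cached = self.results.get(key)
        if cached is not None:  # level-1 hit: no query processing at all
            return cached
        docs = None
        for term in key.split():
            postings = set(self._inverted_list(term))
            docs = postings if docs is None else docs & postings
        # Placeholder ranking: sort by doc id and keep the top_k references.
        answer = sorted(docs or [])[: self.top_k]
        self.results.put(key, answer)
        return answer
```

    Even when a query misses the result cache (for example the same terms in a different order), its terms may still hit the inverted list cache, which is the case the second level is meant to cover.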

  • Two-Level Cache Results

    Compared to caches of 270MB for only query results, only inverted lists, and no cache

    Queries processed reduced by 62%

    21% increase compared to only query result cache

    Page fetches from the database reduced by 95%

    3% increase compared to only inverted list cache

  • Two-Level Cache Results

    At more than 20 queries per second, the two-level cache performs 20% of the disk reads needed with no cache

    The two-level cache can handle 64 queries per second, against 22 per second with no cache

  • How do we cache?

    Saraiva et al. use a least recently used (LRU) replacement policy for cache maintenance

    Users search in sessions; the next query will probably be related to the previous query

    Can we use this to improve caching?

  • Probability Driven Cache (PDC)

    Lempel and Moran propose a cache based on the probability of a page being requested

  • Page Least Recently Used (PLRU)

    Allocate a page queue that can accommodate a certain number of result pages

    When the queue is full and a new page needs to be cached, the least recently used page is removed from the cache

    Achieves hit ratios around 30% for warm, large caches
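
    The PLRU policy can be sketched in a few lines of Python. This is an illustrative sketch only: the class name, the `(query, page_no)` key, and the `fetch` callback that stands in for running the query are assumptions, and capacity is counted in pages.

```python
from collections import OrderedDict


class PLRU:
    """Page-level LRU queue keyed by (query, page number)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = OrderedDict()  # oldest page first, most recent last
        self.hits = 0
        self.requests = 0

    def request(self, query, page_no, fetch):
        self.requests += 1
        key = (query, page_no)
        if key in self.queue:
            self.hits += 1
            self.queue.move_to_end(key)  # refresh recency on a hit
            return self.queue[key]
        page = fetch(query, page_no)  # hypothetical call into the search backend
        self.queue[key] = page
        if len(self.queue) > self.capacity:
            self.queue.popitem(last=False)  # drop the least recently used page
        return page
```

    The hit ratio is then `hits / requests`, measured once the cache is warm.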

  • Page Segmented LRU (PSLRU)

    Maintains two LRU segments, a protected segment and a probationary segment

    Pages are first placed in the probationary segment; if requested again, they are moved to the protected segment

    Pages evicted from the protected segment are moved to the probationary segment

    Pages evicted from the probationary segment are removed from the cache

    Consistently outperforms PLRU, although the difference is very small
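
    A rough Python sketch of the segment movement described on this slide. The class and method names, the `fetch` callback, and the choice to re-insert demoted pages at the most recent end of the probationary segment are assumptions for illustration.

```python
from collections import OrderedDict


class PSLRU:
    """Segmented LRU: new pages are probationary, re-requested pages are protected."""

    def __init__(self, protected_capacity, probationary_capacity):
        self.protected_capacity = protected_capacity
        self.probationary_capacity = probationary_capacity
        self.protected = OrderedDict()
        self.probationary = OrderedDict()

    def _to_probationary(self, key, page):
        self.probationary[key] = page
        self.probationary.move_to_end(key)
        if len(self.probationary) > self.probationary_capacity:
            self.probationary.popitem(last=False)  # leaves the cache entirely

    def _to_protected(self, key, page):
        self.protected[key] = page
        self.protected.move_to_end(key)
        if len(self.protected) > self.protected_capacity:
            old_key, old_page = self.protected.popitem(last=False)
            self._to_probationary(old_key, old_page)  # demote, do not discard

    def request(self, key, fetch):
        """Return (page, hit) for the requested result page."""
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key], True
        if key in self.probationary:
            page = self.probationary.pop(key)
            self._to_protected(key, page)  # second request: promote
            return page, True
        page = fetch(key)  # hypothetical call into the search backend
        self._to_probationary(key, page)
        return page, False
```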

  • Topic LRU (TLRU)

    Let t(q) denote the topic of the query q

    After the cache is warm, any cached result page of t(q) is moved to the tail of the queue

    Each topic's pages will reside contiguously in the queue
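
    A small Python sketch of the topic refresh. The slide does not say how t(q) is computed, so the `topic_of` parameter is an assumption (by default, each query string is its own topic); the class name and the `fetch` callback are likewise hypothetical.

```python
from collections import OrderedDict


class TLRU:
    """LRU page queue where a request refreshes every cached page of its topic."""

    def __init__(self, capacity, topic_of=lambda q: q):
        self.capacity = capacity
        self.topic_of = topic_of  # assumed topic function t(q)
        self.queue = OrderedDict()  # (query, page_no) -> page; head is evicted first

    def request(self, query, page_no, fetch):
        """Return (page, hit) for result page page_no of query."""
        key = (query, page_no)
        hit = key in self.queue
        topic = self.topic_of(query)
        # Move all other cached pages of t(q) to the tail, keeping their order.
        related = [k for k in self.queue
                   if k != key and self.topic_of(k[0]) == topic]
        for other in related:
            self.queue.move_to_end(other)
        if hit:
            self.queue.move_to_end(key)
        else:
            self.queue[key] = fetch(query, page_no)  # hypothetical backend call
            while len(self.queue) > self.capacity:
                self.queue.popitem(last=False)
        return self.queue[key], hit
```

    With a capacity of three pages, requesting page 1 of "cats", page 1 of "dogs", page 2 of "cats", and then page 1 of "birds" evicts the "dogs" page, whereas plain page-level LRU would have evicted the first "cats" page.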

  • Topic SLRU (TSLRU)

    All pages are initially inserted in the probationary segment

    In addition to promoting pages from probationary to protected, we also promote all pages of t(q)
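
    One possible reading of TSLRU in Python. Interpreting "promote all pages of t(q)" as moving every cached page of the topic into the protected segment is an assumption, as are the `topic_of` function (by default, each query string is its own topic), the class and method names, and the `fetch` callback.

```python
from collections import OrderedDict


class TSLRU:
    """Segmented LRU in which a request also promotes the pages of its topic."""

    def __init__(self, protected_capacity, probationary_capacity,
                 topic_of=lambda q: q):
        self.protected_capacity = protected_capacity
        self.probationary_capacity = probationary_capacity
        self.topic_of = topic_of  # assumed topic function t(q)
        self.protected = OrderedDict()
        self.probationary = OrderedDict()

    def _to_probationary(self, key, page):
        self.probationary[key] = page
        self.probationary.move_to_end(key)
        if len(self.probationary) > self.probationary_capacity:
            self.probationary.popitem(last=False)  # leaves the cache entirely

    def _to_protected(self, key, page):
        self.protected[key] = page
        self.protected.move_to_end(key)
        if len(self.protected) > self.protected_capacity:
            old_key, old_page = self.protected.popitem(last=False)
            self._to_probationary(old_key, old_page)  # demote

    def _promote(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
        elif key in self.probationary:
            self._to_protected(key, self.probationary.pop(key))

    def request(self, query, page_no, fetch):
        """Return (page, hit) for result page page_no of query."""
        key = (query, page_no)
        topic = self.topic_of(query)
        if key in self.protected:
            page, hit = self.protected[key], True
        elif key in self.probationary:
            page, hit = self.probationary[key], True
        else:
            page, hit = fetch(query, page_no), False  # hypothetical backend call
        if hit:
            self._promote(key)
        else:
            self._to_probationary(key, page)  # new pages start on probation
        # Promote every other cached page that shares the topic t(q).
        related = [k for k in list(self.probationary) + list(self.protected)
                   if k != key and self.topic_of(k[0]) == topic]
        for other in related:
            self._promote(other)
        return page, hit
```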