
Qingqing Gan Torsten Suel

CSE Department

Polytechnic Institute of NYU

Improved Techniques for Result Caching in Web Search Engines

Content of this Talk

Result caching in web search engines

(1) The case of weighted caching: some queries more expensive to recompute than others

- investigate algorithms for this case

- hybrid algorithms, and impact of power laws

(2) Feature-based approach to caching

- improvements for result and index caching

• Query processing is a major performance bottleneck

• Common performance optimizations: caching, index compression, index pruning and early termination, parallel processing

• Multi-level caching: result caching vs. index caching

• Mostly focus on result caching (but also index)

Caching

Query Processing

Inverted index can efficiently identify pages that contain a particular word or set of words

Main challenge for query processing is the significant size of the index data for a query

Need to optimize to scale with users and data

Caching is one such optimization

Result caching: has query occurred before?

List caching: has index data for term been accessed before?

Related Work

• Markatos (WCW 2000) studies query log distributions and compares several basic caching algorithms

• Number of subsequent papers on result caching:

• Baeza-Yates et al. (SPIRE 2003, 2007, SIGIR 2003)

• Fagni et al. (TOIS 2006)

• Lempel/Moran (WWW 2003)

• Saraiva et al. (SIGIR 2001)

• Xie/O'Hallaron (Infocom 2002)

• Fagni et al. propose hybrid methods that combine a dynamic cache with a more static cache

• Baeza-Yates et al. (SPIRE 2007) use some features for cache admission policy

Basics

• Sequence of queries q_1 to q_n

• LRU: least recently used

• LFU: least frequently used

• Can be implemented using basic data structures

Score defined as the time of the last occurrence of the same query in LRU, or the frequency of a query in LFU. Evict query with smallest score

• Recency (LRU) vs. frequency (LFU)

• Various hybrids
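The score-based view of LRU and LFU above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `simulate` is hypothetical, the LFU variant counts frequency over the whole stream seen so far (an assumption), and the victim is found by a linear scan rather than a priority queue. It assumes `capacity >= 1`.

```python
from typing import Dict, List


def simulate(queries: List[str], capacity: int, policy: str) -> int:
    """Count cache hits for a query stream under 'lru' or 'lfu' (capacity >= 1)."""
    cache: Dict[str, int] = {}  # cached query -> score
    freq: Dict[str, int] = {}   # frequency of every query seen so far (assumption)
    hits = 0
    for step, q in enumerate(queries):
        freq[q] = freq.get(q, 0) + 1
        if q in cache:
            hits += 1
        elif len(cache) >= capacity:
            # Evict the cached query with the smallest score.
            victim = min(cache, key=cache.get)
            del cache[victim]
        # LRU score: time of last occurrence. LFU score: frequency.
        cache[q] = step if policy == "lru" else freq[q]
    return hits


stream = ["a", "b", "a", "c", "b", "a"]
print(simulate(stream, 2, "lru"), simulate(stream, 2, "lfu"))
```

On this toy stream LFU keeps the frequent query "a" in the cache, while LRU evicts it once two other queries have arrived.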

SDC (Static and Dynamic Caching)

[Figure: cache split into an LFU part and an LRU part; Alpha = 0.7]

Fagni et al. (TOIS 2006)
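A rough sketch of the static-plus-dynamic idea follows. It assumes the static part is filled once with the most frequent queries of a training log and the dynamic part is managed by LRU. The name `sdc_hits` and the parameter `static_fraction` are hypothetical, and treating the slide's Alpha as such a split is an assumption, not something the slide states.

```python
from collections import Counter, OrderedDict
from typing import List


def sdc_hits(train: List[str], test: List[str], capacity: int,
             static_fraction: float) -> int:
    """Count hits on `test` for a cache with a static and a dynamic (LRU) part."""
    n_static = int(capacity * static_fraction)
    n_dynamic = capacity - n_static
    # Static part: most frequent training queries, never evicted.
    static = {q for q, _ in Counter(train).most_common(n_static)}
    dynamic: "OrderedDict[str, bool]" = OrderedDict()  # LRU order, oldest first
    hits = 0
    for q in test:
        if q in static:
            hits += 1
        elif q in dynamic:
            hits += 1
            dynamic.move_to_end(q)  # mark as most recently used
        elif n_dynamic > 0:
            if len(dynamic) >= n_dynamic:
                dynamic.popitem(last=False)  # evict least recently used
            dynamic[q] = True
    return hits


train_log = ["a", "a", "a", "b", "b", "c"]
test_log = ["a", "d", "d", "b", "a", "d"]
print(sdc_hits(train_log, test_log, 2, 0.5), sdc_hits(train_log, test_log, 2, 0.0))
```

With `static_fraction = 0.0` the sketch degenerates to plain LRU, which gives a convenient baseline for comparison.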

Characteristics of Queries

• Query frequencies follow Zipf distribution

• While a few queries are quite frequent, most queries occur only once or a few times

Characteristics of Queries

• Query traces exhibit some amount of burstiness, i.e., occurrences of queries are often clustered

• A significant part of this burstiness is due to the same user reissuing a query to the engine.

Contributions

• Study result caching as a weighted caching problem

- Hit ratio

- Cost saving

• Hybrid algorithms for weighted caching

• Caching and power laws

• Feature-based cache eviction policies

Weighted Caching

• Assume all cache entries have same size

• Standard caching: all entries also same cost

• Weighted caching: different costs

• Result caching: some queries more expensive to recompute than others

• In fact, costs highly skewed

• Should keep expensive results longer

• Note: throughput vs. latency
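To see why hit ratio and cost saving can diverge when costs are skewed, the two measures can be computed side by side. The definitions below are assumptions made for illustration: hit ratio is hits divided by queries, and cost saving is the cost of the hit queries divided by the cost of all queries. The function name is hypothetical.

```python
from typing import List, Tuple


def hit_ratio_and_cost_saving(trace: List[Tuple[bool, float]]) -> Tuple[float, float]:
    """trace holds one (was_hit, recomputation_cost) pair per query."""
    total_cost = sum(cost for _, cost in trace)
    hits = sum(1 for was_hit, _ in trace if was_hit)
    saved = sum(cost for was_hit, cost in trace if was_hit)
    return hits / len(trace), saved / total_cost


# Three cheap queries hit, one expensive query misses.
trace = [(True, 1.0), (True, 1.0), (True, 1.0), (False, 17.0)]
ratio, saving = hit_ratio_and_cost_saving(trace)
print(ratio, saving)
```

In this toy trace three of four queries hit, yet only 3 of 20 cost units are saved, because the single miss is the expensive query.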

Weighted Caching Algorithms

• LFU_w: evict entry with smallest value of past frequency * cost (weighted version of LFU)

• Landlord: weighted version of LRU (Young, Cao/Irani 1998)

- On insertion, give entry a deadline equal to its cost

- Evict entry with smallest deadline, and deduct this deadline from all other deadlines in the cache

• Clairvoyant: no polynomial-time optimal offline algorithm known

- We cook up an estimate

• Assume system returns cost of each computed query
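The Landlord rule described above can be sketched as follows. This is a literal, unoptimized version: the deduction is applied to every remaining entry on each eviction, which costs O(c) per eviction. Resetting the deadline to the cost on a hit is an assumption here, since the slide only describes insertion and eviction. The name `landlord_cost` is hypothetical, and `capacity >= 1` is assumed.

```python
from typing import Dict, List


def landlord_cost(queries: List[str], costs: Dict[str, float], capacity: int) -> float:
    """Return the total recomputation cost paid on misses (capacity >= 1)."""
    deadline: Dict[str, float] = {}  # cached query -> remaining deadline
    paid = 0.0
    for q in queries:
        if q not in deadline:
            paid += costs[q]  # miss: result must be recomputed
            if len(deadline) >= capacity:
                # Evict the smallest deadline and deduct it from all others.
                victim = min(deadline, key=deadline.get)
                d = deadline.pop(victim)
                for other in deadline:
                    deadline[other] -= d
        # On insertion, and (assumption) on a hit, the deadline is set to the cost.
        deadline[q] = costs[q]
    return paid


costs = {"a": 10.0, "b": 1.0, "c": 1.0}
print(landlord_cost(["a", "b", "c", "a"], costs, 2))
```

In the example the expensive result "a" survives the arrival of two cheap queries, whereas plain LRU with two slots would have evicted it and paid for it again.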

Dataset

• 2006 AOL query log with 36 million queries

• Queries which consist of only stop words are removed

• Requests for further result pages are removed

Hit Ratio of Basic Algorithms

Cost Reduction

New Hybrid Algorithms

• SDC

• lru_lfu

• landlord_lfu_w

Weighted Caching and Power Laws

• Problem with weighted caching with high skew

• Suppose q_1 has occurred once and has cost 10, and q_2 has occurred 10 times and has cost 1

• LFU_w gives both the same priority: is that right?

• Lottery:

- Multiple rounds, one winner per round

- Some people buy more tickets than others

- But each person buys same number each week

- Given past history, guess future winners

- Suppose ticket sales are Zipfian

Weighted Caching and Power Laws

• Compare: smoothing techniques in language models

• Three solutions:

- Good-Turing estimator

- Estimator derived from power law

- Pragmatic: fit correction factors from real data

• Last solution subsumes others
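For reference, the Good-Turing estimator named above replaces an observed count r with an adjusted count. The notation below is the standard textbook form and is not taken from the slides. How the talk applies it to query frequencies is not stated here, so reading N_r as a count of queries is an assumption.

```latex
r^{*} = (r + 1)\,\frac{N_{r+1}}{N_{r}}, \qquad P_{\text{unseen}} \approx \frac{N_{1}}{N}
```

Here N_r is the number of distinct queries seen exactly r times and N is the total number of query occurrences. When N_2 is much smaller than N_1, as in a log with many singleton queries, the adjusted count for r = 1 falls well below 1, which is the kind of discount the correction factors are meant to capture.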

Weighted Zipfian Caching

Frequency    g()
1            0.05
2            0.25
3            0.35
4            0.75
>=5          1.0

E.g., in LFU_w, priority score = cost * frequency * g(frequency)
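Applying the correction factors to the earlier q_1 / q_2 example makes the effect concrete. The sketch below uses the g() values from the table. Treating g as a function of the observed frequency, and the names `g` and `lfu_w_priority`, are assumptions made for illustration. Frequencies are assumed to be at least 1.

```python
# Correction factors from the table; frequencies of 5 or more use 1.0.
G = {1: 0.05, 2: 0.25, 3: 0.35, 4: 0.75}


def g(frequency: int) -> float:
    return G.get(frequency, 1.0)


def lfu_w_priority(cost: float, frequency: int, corrected: bool = True) -> float:
    """LFU_w priority, optionally multiplied by the correction factor."""
    base = cost * frequency
    return base * g(frequency) if corrected else base


# q_1: seen once, cost 10.  q_2: seen 10 times, cost 1.
print(lfu_w_priority(10, 1, corrected=False), lfu_w_priority(1, 10, corrected=False))
print(lfu_w_priority(10, 1), lfu_w_priority(1, 10))
```

Without the correction both queries tie at priority 10. With it, the query seen only once drops to about 0.5 while the query seen ten times stays at 10, so the one-off expensive query is evicted first.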

Hybrid Algorithms After Adding Correction

Feature-Based Caching

• Most standard algorithms view input as sequence of object IDs

• Hides many application details!

• E.g., query length, frequency of query terms in query logs or in collection, click behavior, navigational vs. informational query

• But these could be very useful for caching!

• So, can/should we use more features in caching?

• … and, should we keep using “explicit” algorithms, or rely on machine learning?

• Compare: ranking functions in IR

• Previous work: Baeza-Yates et al. (SPIRE 2007)

Features

• F1: steps to last occurrence of this query;

• F2: steps between last two occurrences of this query, if a query occurs at least twice;

• F3: query frequency so far;

• F4: query length;

• F5: length of shortest inverted index list of all query terms in the query;

• F6: the frequency of the rarest query term;

• F7: the number of users who issue this query;

• F8: among F7, the gap between the last two queries issued by the most recently active user;

• F9: average number of clicks per query;

• F10: the query frequency of the rarest pair of terms in the query.
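The first four features depend only on the query stream, so they can be sketched without index or click data. The function name `basic_features` is hypothetical. Measuring F4 in whitespace-separated terms, and returning `None` when a feature is undefined, are assumptions made for this sketch.

```python
from typing import List, Optional, Tuple


def basic_features(history: List[str], query: str) -> Tuple[Optional[int], Optional[int], int, int]:
    """Return (F1, F2, F3, F4) for `query`, given the queries seen so far."""
    now = len(history)
    positions = [i for i, q in enumerate(history) if q == query]
    # F1: steps since the last occurrence (None if never seen).
    f1 = now - positions[-1] if positions else None
    # F2: steps between the last two occurrences (None if seen fewer than twice).
    f2 = positions[-1] - positions[-2] if len(positions) >= 2 else None
    # F3: frequency so far.
    f3 = len(positions)
    # F4: query length in terms (assumption: whitespace tokenization).
    f4 = len(query.split())
    return f1, f2, f3, f4


history = ["nyu poly", "web search", "nyu poly", "caching", "web search"]
print(basic_features(history, "nyu poly"))
```

A real system would maintain these values incrementally per query instead of rescanning the history, which this sketch does only for clarity.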

Caching Algorithm

• Trivial machine learning approach (i.e., counting)

• Split each feature into a few bins, thus placing each cache entry into one bin

• For each bin, estimate likelihood of reoccurrence using past queries

• During caching (online), can efficiently move entries between bins until eviction

• O(lg c) cost per element (c is cache size)
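The counting approach above can be sketched with two features. All names here (`bin_of`, `train`, `run`) are hypothetical, and several simplifications are assumptions: only frequency so far and query length are binned, "reoccurrence" means the query appears again anywhere later in the training log, and eviction uses a linear scan instead of a structure with O(lg c) cost per element. `capacity >= 1` is assumed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Bin = Tuple[int, int]


def bin_of(freq: int, query: str) -> Bin:
    """Coarse bin from frequency so far and query length in terms, both capped at 3."""
    return (min(freq, 3), min(len(query.split()), 3))


def train(log: List[str]) -> Dict[Bin, float]:
    """Estimate, per bin, the fraction of occurrences that were followed by a repeat."""
    last_index = {q: i for i, q in enumerate(log)}
    seen: Dict[str, int] = defaultdict(int)
    total: Dict[Bin, int] = defaultdict(int)
    again: Dict[Bin, int] = defaultdict(int)
    for i, q in enumerate(log):
        seen[q] += 1
        b = bin_of(seen[q], q)
        total[b] += 1
        if i < last_index[q]:  # the query occurs again later in the log
            again[b] += 1
    return {b: again[b] / total[b] for b in total}


def run(log: List[str], probs: Dict[Bin, float], capacity: int) -> int:
    """Count hits when evicting the entry with the lowest estimated probability."""
    freq: Dict[str, int] = defaultdict(int)
    cache: Dict[str, float] = {}  # cached query -> estimated reoccurrence probability
    hits = 0
    for q in log:
        freq[q] += 1
        if q in cache:
            hits += 1
        elif len(cache) >= capacity:
            del cache[min(cache, key=cache.get)]
        # Bins never seen in training get probability 0.0 (assumption).
        cache[q] = probs.get(bin_of(freq[q], q), 0.0)
    return hits


probs = train(["a", "b", "a", "c", "a", "d"])
print(probs, run(["x", "y", "x", "z", "x"], probs, 2))
```

Features that change between accesses, such as steps since last occurrence, would require moving entries between bins over time, which this sketch leaves out.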

Experimental Results – Hit Ratio

Experimental Results – Hit Ratio (cont.)

Experimental Results – Cost Savings

Priority score = probability score * cost of this query

Experimental Results – List Caching

Discussion

• A bunch of results on caching, in two parts …

• Note: feature-based beats the stuff in first part!

• Open: Cache size versus cache freshness issue

• Other apps of feature-based approach

Questions?
