Arnd Christian König
Data Management, Exploration and Mining Group, Microsoft Research

Joint work with:
Data Management, Exploration and Mining Group, MSR: Sanjay Agrawal, Surajit Chaudhuri, Kaushik Chakrabarti, Venkatesh Ganti, Dong Xin
Text Mining, Search and Navigation Group, MSR: Kenneth W. Church, Qiang Wu
Natural Language Processing Group, MSR: Michael Gamon
Microsoft AdCenter: Martin Markov
Microsoft Search: Liying Sui
Sponsored Search Ads:
- Separate data store / index; ads associated with 'bid phrases'.
- Retrieval by matching (modified) queries with bid phrases.
- Ranking using a combination of relevance, bid amount, and expected CTR.

Vertical Sub-collections:
- Examples: products, news, images, ...
- Separate data store / index; the index may differ from the web index.
- A different ranking function for each vertical.
- Many verticals => a vertical-selection problem.
Retrieval Overhead
Retrieval Quality
Vertical Selection
Verticals not provisioned to handle 100% of traffic. => fast, initial filter on queries.
Some queries may have relevant results in many verticals.
Can the specific ranking function be indexed efficiently?
Different retrieval engine for each ‘vertical’ and ads.
Retrieval processing involves matching and ranking; match processing is kept independent of the ranking function:
- Multiple ranking functions can be tested in parallel.
- Allows for arbitrarily complex ranking functions.
- Organizational boundaries.
Ranking of results is not a monotone function of single-word scores, so top-k optimizations do not apply. Long latency for some queries with a large number of matches.
Setup: index 20M product descriptions from the Product vertical. Using a commercial full-text engine, measure latency for 2-word queries, for words of different frequencies.
[Chart: latency by keyword-frequency class. Users perceive significant latency for some classes; there is a 3-orders-of-magnitude latency difference across classes.]
- F = frequent keywords (>800K postings)
- M = medium-frequency keywords (≈50K postings)
- L = low-frequency keywords (<1K postings)
Query logs: combinations of frequent keywords are common, e.g. "BMW bike", "Book jewelry", "Book Dumbledore"; such combinations account for >2.1% of queries in the log.
Different retrieval engine for each ‘vertical’ and ads. Retrieval processing involves matching and ranking. Match-processing independent of ranking function
Multiple ranking functions to be tested in parallel. Allows for arbitrarily complex ranking functions. Organizational boundaries.
Ranking of results is not monotone function of single-word
scores: Top-k optimizations do not apply.
Search latency is crucial (via Dries Buytaert's blog):
- Amazon: 100 ms of extra load time caused a 1% drop in sales. (Source: Greg Linden, Amazon)
- Google: 500 ms of extra load time caused 20% fewer searches. (Source: Marissa Mayer, Google)
- Yahoo!: 400 ms of extra load time caused a 5 to 9% increase in the number of people who clicked "back" before the page even loaded. (Source: Nicole Sullivan, Yahoo!)
Latency issues addressed through parallelism, caching, but also specialized data structures.
Is this efficient?
Augmented inverted index. Vocabulary: "cheap", "used", "books"; each word maps to a posting list of (Bid-ID, #words) pairs, e.g.:
- "cheap" → (177, 2), (2090, 3), ...
- "used" → (11, 2), (99, 1), ...
- "books" → (2004, 1), (2090, 3), ...
Query: {cheap books}
Problem: nearly all processed bids do not match the query. Indexing by word is not selective, and there is no early termination. Some improvement via non-redundant indexing.
Alternative approach: index bids at fine granularity.
"Vocabulary" (hash table) and bid lists:
- Hash({"cheap", "books"}) → bid phrase "Cheap books": B1, B4
- Hash({"cheap", "used", "books"}) → bid phrase "Cheap used books": B3
- Hash({"new", "books"}) → bid phrase "New books": B2
Query: {cheap books}
Problem: the number of hash lookups becomes significant for long queries. A query containing n words requires 2^n − 1 hash lookups (one per non-empty subset of its words). Unacceptable for long queries, given the tight latency constraints of sponsored search.
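The subset-enumeration scheme above can be sketched as follows; the phrases, bid IDs, and helper names are illustrative stand-ins, not the production implementation:

```python
from itertools import combinations

# Toy exact-match bid index: each bid phrase is stored under a key derived
# from its (order-independent) word set; retrieval enumerates all non-empty
# subsets of the query's words, one hash lookup per subset.
def make_key(words):
    return frozenset(words)

def build_index(bids):
    index = {}
    for bid_id, phrase in bids:
        index.setdefault(make_key(phrase.lower().split()), []).append(bid_id)
    return index

def lookup(index, query):
    words = query.lower().split()
    matches, lookups = [], 0
    # A query with n words requires 2^n - 1 hash lookups.
    for size in range(1, len(words) + 1):
        for subset in combinations(words, size):
            lookups += 1
            matches.extend(index.get(make_key(subset), []))
    return matches, lookups

bids = [("B1", "cheap books"), ("B2", "new books"),
        ("B3", "cheap used books"), ("B4", "cheap books")]
index = build_index(bids)
matches, lookups = lookup(index, "cheap used books")
# 3-word query -> 2^3 - 1 = 7 lookups; matches B1, B3, B4.
```

The exponential lookup count in the query length is exactly the problem the merged-vocabulary idea below addresses.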
Idea: a query that accesses a set of words also accesses all of its subsets, so bids for several word sets can be stored together in merged vocabulary nodes [ICDE 2009].
Why does this help?
- Trades random access for sequential access (in main memory).
- Large reduction in page walks (via TLB misses).
- Fewer (worst-case) lookups:
Let k be the number of words in the largest vocabulary node. An n-word query then requires ∑_{i=1}^{k} C(n, i) lookups, instead of 2^n − 1.
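A quick sanity check of the two lookup counts (a sketch; `lookups_flat` and `lookups_merged` are hypothetical helper names):

```python
from math import comb

def lookups_flat(n):
    # one lookup per non-empty subset of the query's n words
    return 2**n - 1

def lookups_merged(n, k):
    # only subsets of size at most k (the largest vocabulary node)
    return sum(comb(n, i) for i in range(1, k + 1))

# For a 10-word query with nodes of at most 3 words, the
# 2^10 - 1 = 1023 lookups shrink to C(10,1)+C(10,2)+C(10,3) = 175.
```

With k = n the bound degenerates back to 2^n − 1, which matches the "too much merging can hurt" caveat below.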
Can it hurt? Yes, if we do too much merging…
Task: can we compute an 'optimal' assignment of bids to bid lists?
Optimization problem: given a cost model and a query workload, compute bid lists that minimize query cost.
Workload:
- The relative frequencies of the most frequent queries in search logs are stable.
- The assignment is computed off-line and refreshed periodically.
Cost model:
- A simple model, decomposing access cost into the cost of random memory accesses and the cost of sequential memory scans (monotone in the number of bytes read).
Example nodes: Hash({w1, w2}) → "w1 w2": B1, B4; Hash({w1, w2, w3}) → "w1 w3 w2": B3; Hash({w4, w1}) → "w1 w4": B2.
Solution sketch:
- Model the assignment as a grouping of bids.
- With k = MAX(#words in an entry), a query q requires ∑_{i=1}^{k} C(|q|, i) lookups, plus the cost of scanning the bid lists; for each value of k and workload, this assigns a cost to each set of bids.
- Mapping selection ≈ a set-covering problem. Let l = MAX(#distinct bids in a single node); l is small, and greedy selection gives a log l approximation.
Possible mappings: (a) {B1, B4}, {B3}, {B2}; (b) {B1, B4, B3}, {B2}; (c) ...
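The set-covering step can be illustrated with a generic greedy weighted set cover; the candidate nodes, costs, and helper names below are made up for illustration and stand in for the richer workload-driven cost model above:

```python
# Minimal greedy weighted set cover, standing in for the bid-to-node
# assignment step. Greedy selection gives the classic logarithmic
# approximation in the size of the largest candidate set.
def greedy_cover(universe, candidates):
    """candidates: dict name -> (set of bids, cost)."""
    uncovered, chosen = set(universe), []
    while uncovered:
        # pick the candidate with the best cost per newly covered bid
        name, (bids, cost) = min(
            ((n, c) for n, c in candidates.items() if c[0] & uncovered),
            key=lambda nc: nc[1][1] / len(nc[1][0] & uncovered))
        chosen.append(name)
        uncovered -= bids
    return chosen

# Hypothetical candidate nodes grouping bids B1..B4, with made-up costs:
candidates = {
    "{w1,w2}":    ({"B1", "B4"}, 1.0),
    "{w1,w2,w3}": ({"B1", "B3", "B4"}, 1.5),
    "{w1,w4}":    ({"B2"}, 1.0),
}
plan = greedy_cover({"B1", "B2", "B3", "B4"}, candidates)
```

Each iteration picks the node covering the most still-unassigned bids per unit of cost, mirroring the greedy log l-approximation named on the slide.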
Other examples:
- WAND processing [Broder et al., CIKM'03]
- Keyword search in spatial data, e.g., [De Felipe et al., ICDE'08], [Zhang et al., ICDE'09]
- Entity search, e.g., [Balmin, VLDB'04], [S. Chakrabarti et al., PKDD'04, WWW'06, WWW'07], [Agrawal, WWW'09]
- (Approximate) auto-completion [Bast et al., SIGIR'06], [Nandi et al., SIGMOD'07, VLDB'07], [Chaudhuri et al., SIGMOD'09]
- ...
Techniques are mostly modifications of IR-style processing or string matching.
Search over relational objects in an RDBMS is a well-studied problem (e.g., DBExplorer, Discover, BANKS, etc.), but in vertical search the join paths (= business objects) are known up front and can be pre-materialized and indexed.
Retrieval Overhead
Retrieval Quality
Vertical Selection
Q: [canon rebel xti]
The Canon EOS Digital Rebel XTi offers an unbeatable combination of performance, ease-of-use and value. It has a newly designed 10.1 MP Canon CMOS sensor plus a host of new features including a 2.5-inch LCD monitor, the exclusive... More...
Canon EOS Digital Rebel XTi offers an unbeatable combination of performance, ease-of-use and value. It has a newly designed 10.1 Mega Pixel Canon CMOS sensor plus a host of new features including a 2.5-inch LCD monitor, the... More...
The ultra-powerful 12x optical zoom on the PowerShot S5 IS means you'll get the shot you want with no compromise, yet that's only the beginning of what makes this camera so exciting. The S5 IS is…
Retrieval Semantics: Keyword-search over product descriptions
Q: [low light camera]
Observation I: many web documents mention instances of low-light digital cameras in close proximity to the query keywords {low light, digital camera}.
Observation II (pseudo-relevance): the top web search results will contain mostly relevant pages. Hence, we can identify the most relevant entities by:
1. Submitting the query to a search engine
2. Identifying mentions of entities in the top returned documents
3. Aggregating scores for these entities
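The aggregation step can be sketched as below; `top_docs` and the entity lists stand in for the search engine and entity extractor, and the rank-based weighting is an assumption for illustration, not the exact scoring from the paper:

```python
from collections import Counter

# Pseudo-relevance sketch: score entities by aggregating their mentions
# across the top-ranked documents returned for a query.
def rank_entities(top_docs, weight=lambda rank: 1.0 / (rank + 1)):
    """top_docs: list of per-document entity-mention lists, best result first."""
    scores = Counter()
    for rank, entities in enumerate(top_docs):
        for entity in set(entities):          # count each entity once per doc
            scores[entity] += weight(rank)    # higher-ranked docs count more
    return scores.most_common()

# Hypothetical extracted entities for the top three results:
top_docs = [
    ["Canon PowerShot S5", "Canon EOS Rebel XTi"],
    ["Canon EOS Rebel XTi"],
    ["Nikon D90", "Canon EOS Rebel XTi"],
]
ranking = rank_entities(top_docs)
# "Canon EOS Rebel XTi" appears in all three documents, so it tops the list.
```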
Results [WWW 2009]: significant improvement in retrieval precision and recall. Low overhead by piggy-backing on search engine components:
- Entities extracted as part of the page-crawl pipeline.
- Entity indexing and retrieval in snippet generation.
General approach:
- Issue the search query against a document corpus.
- Identify relevant sub-components of the top results (e.g., titles, captions, tags, categories, entities, etc.).
- Aggregate over the components.
Example scoring (query clarity): score(q) = ∑_w P(w|q) · log₂( P(w|q) / P(w|C) ), where P(w|q) is estimated from the documents d ∈ Result_q and P(w|C) is the background corpus model.
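Assuming the standard query-clarity formulation, a minimal sketch (the toy word distributions below are made up):

```python
from math import log2

# Query-clarity sketch: KL divergence between the query language model
# (estimated from the top results) and the background corpus model.
def clarity(p_wq, p_wc):
    """p_wq: P(w|q) from the top results; p_wc: P(w|C) background model."""
    return sum(p * log2(p / p_wc[w]) for w, p in p_wq.items() if p > 0)

# A focused query concentrates mass on few contentful words -> higher clarity.
p_wc = {"camera": 0.02, "light": 0.03, "the": 0.95}
focused = {"camera": 0.6, "light": 0.3, "the": 0.1}
diffuse = {"camera": 0.05, "light": 0.05, "the": 0.9}
```

A query whose result model closely matches the background corpus scores near zero, signalling an unfocused query.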
Approach (query classification for advertising):
- Retrieve the top ~50 documents from a web search engine.
- Categorize each document into a commercial taxonomy.
- Use the combination of categories Cat(D1), Cat(D2), Cat(D3), ... to characterize the query for advertising.
Approach (news-intent detection):
- Retrieve the top news documents from a web search engine.
- Extract the publish date / order of each document: Date(D1), Date(D2), Date(D3), ...
- Count how many of the retrieved documents were among the k most recently published ones.
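The recency count described above can be sketched in a few lines (dates and helper names are illustrative):

```python
# Recency feature sketch: how many of the documents retrieved for a query
# fall among the k most recently published documents overall?
def recency_score(retrieved_dates, all_dates, k):
    cutoff = sorted(all_dates, reverse=True)[:k]   # the k newest dates
    threshold = min(cutoff)
    return sum(1 for d in retrieved_dates if d >= threshold)

# Hypothetical publish dates encoded as YYYYMMDD integers:
all_dates = [20090101, 20090310, 20090611, 20090612, 20090613]
retrieved = [20090613, 20090611, 20090101]
# With k = 3 the cutoff date is 20090611, so two retrieved docs qualify.
```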
Additional examples:
- [Shen et al., SIGKDD Explorations 2005]: query classification, using title, snippet and category information from each document.
- [Collins-Thompson et al., SIGIR'09]: query-difficulty prediction; documents are represented as low-dimensional feature vectors.
Many isolated variations on this general approach. What is the right abstraction or infrastructure? Neither the corpus nor the retrieval depth needs to correspond to a 'normal' web search result. => Integration of pre-computed information into retrieval, and aggregation over this data.
Q: [Sony]
Similar problem: many verticals with relevant answers. Example: the query "Harry Potter" may trigger products, images, movies, etc.
Retrieval Overhead
Retrieval Quality
Vertical Selection
Once instances of a query have been observed, its CTR can be tracked; but how do we deal with unseen queries?
Task: estimate Pr(Click | Query, News-Results). News results compete for space with web results and ads => trigger only for queries with a likely click.
News CTR is not primarily a function of document relevance:
- Relevant documents are necessary, but not sufficient, for high CTR.
- CTR for an ongoing news story (often) remains stable, even as the underlying documents change.
- "Buzz"/attention around a story makes a difference.
Identifying news queries is not a (binary) query-classification task:
- Many queries are inherently ambiguous, e.g. 'Georgia'.
- Human labeling of training data is difficult:
  'Voter Registration' 1.5%–5% CTR; 'Oil Prices' 22%–29% CTR; 'Caylee Anthony' 63%–69% CTR.
News click-through rates change (rapidly) over time.
Query-text n-grams are unlikely to yield good features.
Queries w/o news intent may still receive clicks.
CTR varies significantly among news-queries.
Keywords that are specific to a news event receive higher CTR.
Supervised learning, using collected click data. Model Pr(Click | Query, News-Results) as
Pr(Click | relevance of the top news result(s), attention/buzz around the keywords, "cohesion" of the retrieved stories, query surface properties).
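As an illustration only, the feature groups above could feed a simple linear model with a sigmoid output; the weights and feature values below are made-up placeholders, not the trained model from the talk:

```python
from math import exp

# Illustrative click-probability model over the named feature groups
# (relevance, attention/buzz, cohesion, query surface properties).
def predict_click(features, weights, bias=-2.0):
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))   # sigmoid squashes the score into (0, 1)

weights = {"relevance": 1.5, "buzz": 2.0, "cohesion": 1.0, "query_len": -0.1}
features = {"relevance": 0.8, "buzz": 0.9, "cohesion": 0.7, "query_len": 2}
p = predict_click(features, weights)
```

In practice the weights would be learned from the collected click data rather than set by hand.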
News crawl: partition news articles by crawl date; within each partition, keep titles, first paragraphs, and text bodies separate.
Now track "attention" in the news by measuring the incidence of query keywords in each partition. Each query generates an array of counters.
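A sketch of the per-partition counters, assuming tokenized sections per crawl-date partition (names and data are illustrative):

```python
from collections import defaultdict

# For each crawl-date partition and document section (title, first
# paragraph, body), count occurrences of the query's keywords,
# yielding an array of counters per query.
def count_attention(query_words, partitions):
    """partitions: list of dicts mapping section -> token list, one per date."""
    words = set(query_words)
    counters = []
    for part in partitions:
        counts = defaultdict(int)
        for section, tokens in part.items():
            counts[section] = sum(1 for t in tokens if t in words)
        counters.append(dict(counts))
    return counters

# Two hypothetical crawl-date partitions:
partitions = [
    {"title": ["hurricane", "ike", "lands"], "body": ["storm", "ike"]},
    {"title": ["markets", "fall"], "body": ["oil", "prices"]},
]
counters = count_attention(["hurricane", "ike"], partitions)
# Day 1 mentions the query in both title and body; day 2 not at all.
```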
Issues: the occurrence of query keywords in news tracks coverage, not attention. How can keywords that are 'globally' frequent be differentiated from those in new news headlines?
Multiple corpora: blogs and news complement each other, capturing coverage vs. attention. Using a 'background' corpus allows us to identify keywords indicative of news.
Query "President Obama" vs. query "Hurricane Ike": the former occurs often, but across several different news events, so Pr(click) is lower; the latter occurs less often and uniquely identifies a specific news event, so Pr(click) is larger.
How similar are the documents the query terms occur in?
Approach: for all (subsets of) query terms:
- Retrieve the matching documents.
- Compute a language model of the contexts the terms occur in.
- Compute the similarity of these models.
Similarity metric: Jensen–Shannon divergence.
See Carmel, Yom-Tov, Darlow and Pelleg, 'What makes a query difficult?', SIGIR 2006.
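The Jensen–Shannon divergence used as the similarity metric can be computed as follows (a self-contained sketch over toy word distributions):

```python
from math import log2

# Jensen-Shannon divergence between two word distributions: low divergence
# means the query terms occur in similar contexts (high cohesion).
def kl(p, q):
    return sum(p[w] * log2(p[w] / q[w]) for w in p if p[w] > 0)

def jsd(p, q):
    vocab = set(p) | set(q)
    p = {w: p.get(w, 0.0) for w in vocab}
    q = {w: q.get(w, 0.0) for w in vocab}
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}   # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = {"storm": 0.5, "wind": 0.5}
other = {"election": 0.5, "vote": 0.5}
# Identical distributions diverge by 0; disjoint ones hit the maximum of 1 bit.
```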
Baseline: CTR > 10%: 70.1%; CTR > 15%: 75.9%; CTR > 20%: 81.8%. For 82.5% of queries, the prediction is within an 'error band' of ±10%.
Using (relevance) scores for single verticals is limiting. => How indicative is a query of a vertical?
Additional sources of evidence:
- Query logs (e.g., [Arguello et al., SIGIR'09], [Diaz, WSDM'09]): sets of queries issued against, or resulting in clicks for, a given vertical; generalization through language models.
- Non-web text corpora (e.g., [Arguello et al., SIGIR'09]): collections representative of verticals or concepts (e.g., via Wikipedia); measures: clarity/cohesion, expected number of results, trends over time.
- Document categories (e.g., [Collins-Thompson et al., SIGIR'09]).
- Concept graphs (e.g., [Diemert et al., WWW'09]): based on co-references between concepts; extracted automatically, leveraging the search engine to take advantage of its relevance model and spam filtering.
Query-text-based classification performs well given large training data sets. Automatic generation of queries/labels (e.g., [Li et al., SIGIR'08], [Fuxman et al., SIGKDD'09]).
Retrieval processing:
- Loose coupling between retrieval processing and ranking.
- (Worst-case) latency matters.
- Faster retrieval by leveraging data distributions and matching semantics.
Novel retrieval problems: integration of web search and ad/vertical retrieval.
- Search provides context and can be used to enrich both the query and the text associated with items in a vertical.
- The search, extract/pre-compute, and aggregate approach appears to apply in many scenarios.
- Extends to additional (non-web) corpora, query logs, etc.; combining evidence from multiple sources.
Challenges / next steps:
- Identifying common abstractions / operators.
- What is the correct system infrastructure?