Arnd Christian König
Data Management, Exploration and Mining Group, Microsoft Research

Joint work with:
Data Management, Exploration and Mining Group, MSR: Sanjay Agrawal, Surajit Chaudhuri, Kaushik Chakrabarti, Venkatesh Ganti, Dong Xin
Text Mining, Search and Navigation Group, MSR: Kenneth W. Church, Qiang Wu
Natural Language Processing Group, MSR: Michael Gamon
Microsoft AdCenter: Martin Markov
Microsoft Search: Liying Sui
Sponsored Search Ads:
- Separate data store / index; ads associated with 'bid phrases'.
- Retrieval by matching (modified) queries with bid phrases.
- Ranking using a combination of relevance, bid amount, and expected CTR.

Vertical Sub-collections:
- Examples: products, news, images, ...
- Separate data store / index; the index may differ from the web index.
- A different ranking function for each vertical.
- Many verticals => a vertical-selection problem.
Retrieval Overhead
Retrieval Quality
Vertical Selection
Verticals not provisioned to handle 100% of traffic. => fast, initial filter on queries.
Some queries may have relevant results in many verticals.
Can the specific ranking function be indexed efficiently?
Different retrieval engine for each ‘vertical’ and ads.
Retrieval processing involves matching and ranking; match processing is kept independent of the ranking function:
- Multiple ranking functions can be tested in parallel.
- Allows for arbitrarily complex ranking functions.
- Organizational boundaries.
Ranking of results is not a monotone function of single-word scores, so top-k optimizations do not apply. Long latency for some queries with a large number of matches.
Setup: index 20M product descriptions from the Product vertical. Using a commercial full-text engine, measure latency for 2-word queries, for words of different frequencies.
[Chart: latency by keyword-frequency class. Users perceive significant latency for some classes; there is a 3-orders-of-magnitude latency difference across classes.]
- F = frequent keywords (>800K postings)
- M = medium-frequency keywords (≈50K postings)
- L = low-frequency keywords (<1K postings)
Query logs: combinations of frequent keywords are common, e.g. "BMW bike", "Book jewelry", "Book Dumbledore"; such combinations account for >2.1% of queries in the log.
Different retrieval engine for each ‘vertical’ and ads. Retrieval processing involves matching and ranking. Match-processing independent of ranking function
Multiple ranking functions to be tested in parallel. Allows for arbitrarily complex ranking functions. Organizational boundaries.
Ranking of results is not monotone function of single-word
scores: Top-k optimizations do not apply.
Search latency is crucial (via Dries Buytaert's blog):
- Amazon: 100 ms of extra load time caused a 1% drop in sales. (Source: Greg Linden, Amazon)
- Google: 500 ms of extra load time caused 20% fewer searches. (Source: Marissa Mayer, Google)
- Yahoo!: 400 ms of extra load time caused a 5 to 9% increase in the number of people who clicked "back" before the page even loaded. (Source: Nicole Sullivan, Yahoo!)
Latency issues addressed through parallelism, caching, but also specialized data structures.
Is this efficient?
Augmented inverted index. Vocabulary: "cheap", "used", "books"; each word maps to a posting list of (Bid-ID, #words) pairs, e.g.:
- "cheap" → (177, 2), (2090, 3), ...
- "used" → (11, 2), (99, 1), ...
- "books" → (2004, 1), (2090, 3), ...
Query: {cheap books}
Problem: nearly all processed bids do not match the query. Indexing by word is not selective, and there is no early termination. Some improvement via non-redundant indexing.
Alternative approach: index bids at fine granularity.
"Vocabulary" (hash table) and bid lists:
- Hash({"cheap", "books"}) → bid phrase "Cheap books": B1, B4
- Hash({"cheap", "used", "books"}) → bid phrase "Cheap used books": B3
- Hash({"new", "books"}) → bid phrase "New books": B2
Query: {cheap books}
Problem: the number of hash lookups becomes significant for long queries. A query containing n words requires 2^n − 1 hash lookups (one per non-empty subset of its words). Unacceptable for long queries, given the tight latency constraints of sponsored search.
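The subset-enumeration scheme above can be sketched as follows; the phrases, bid IDs, and helper names are illustrative stand-ins, not the production implementation:

```python
from itertools import combinations

# Toy exact-match bid index: each bid phrase is stored under a key derived
# from its (order-independent) word set; retrieval enumerates all non-empty
# subsets of the query's words, one hash lookup per subset.
def make_key(words):
    return frozenset(words)

def build_index(bids):
    index = {}
    for bid_id, phrase in bids:
        index.setdefault(make_key(phrase.lower().split()), []).append(bid_id)
    return index

def lookup(index, query):
    words = query.lower().split()
    matches, lookups = [], 0
    # A query with n words requires 2^n - 1 hash lookups.
    for size in range(1, len(words) + 1):
        for subset in combinations(words, size):
            lookups += 1
            matches.extend(index.get(make_key(subset), []))
    return matches, lookups

bids = [("B1", "cheap books"), ("B2", "new books"),
        ("B3", "cheap used books"), ("B4", "cheap books")]
index = build_index(bids)
matches, lookups = lookup(index, "cheap used books")
# 3-word query -> 2^3 - 1 = 7 lookups; matches B1, B3, B4.
```

The exponential lookup count in the query length is exactly the problem the merged-vocabulary idea below addresses.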
Idea: a query that accesses a set of words also accesses all of its subsets, so bids for several word sets can be stored together in merged vocabulary nodes [ICDE 2009].
Why does this help?
- Trades random access for sequential access (in main memory).
- Large reduction in page walks (via TLB misses).
- Fewer (worst-case) lookups:
Let k be the number of words in the largest vocabulary node. An n-word query then requires ∑_{i=1}^{k} C(n, i) lookups, instead of 2^n − 1.
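A quick sanity check of the two lookup counts (a sketch; `lookups_flat` and `lookups_merged` are hypothetical helper names):

```python
from math import comb

def lookups_flat(n):
    # one lookup per non-empty subset of the query's n words
    return 2**n - 1

def lookups_merged(n, k):
    # only subsets of size at most k (the largest vocabulary node)
    return sum(comb(n, i) for i in range(1, k + 1))

# For a 10-word query with nodes of at most 3 words, the
# 2^10 - 1 = 1023 lookups shrink to C(10,1)+C(10,2)+C(10,3) = 175.
```

With k = n the bound degenerates back to 2^n − 1, which matches the "too much merging can hurt" caveat below.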
Can it hurt? Yes, if we do too much merging…
Task: can we compute an 'optimal' assignment of bids to bid lists?
Optimization problem: given a cost model and a query workload, compute bid lists that minimize query cost.
Workload:
- The relative frequencies of the most frequent queries in search logs are stable.
- The assignment is computed off-line and refreshed periodically.
Cost model:
- A simple model, decomposing access cost into the cost of random memory accesses and the cost of sequential memory scans (monotone in the number of bytes read).
Example nodes: Hash({w1, w2}) → "w1 w2": B1, B4; Hash({w1, w2, w3}) → "w1 w3 w2": B3; Hash({w4, w1}) → "w1 w4": B2.
Solution sketch:
- Model the assignment as a grouping of bids.
- With k = MAX(#words in an entry), a query q requires ∑_{i=1}^{k} C(|q|, i) lookups, plus the cost of scanning the bid lists; for each value of k and workload, this assigns a cost to each set of bids.
- Mapping selection ≈ a set-covering problem. Let l = MAX(#distinct bids in a single node); l is small, and greedy selection gives a log l approximation.
Possible mappings: (a) {B1, B4}, {B3}, {B2}; (b) {B1, B4, B3}, {B2}; (c) ...
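The set-covering step can be illustrated with a generic greedy weighted set cover; the candidate nodes, costs, and helper names below are made up for illustration and stand in for the richer workload-driven cost model above:

```python
# Minimal greedy weighted set cover, standing in for the bid-to-node
# assignment step. Greedy selection gives the classic logarithmic
# approximation in the size of the largest candidate set.
def greedy_cover(universe, candidates):
    """candidates: dict name -> (set of bids, cost)."""
    uncovered, chosen = set(universe), []
    while uncovered:
        # pick the candidate with the best cost per newly covered bid
        name, (bids, cost) = min(
            ((n, c) for n, c in candidates.items() if c[0] & uncovered),
            key=lambda nc: nc[1][1] / len(nc[1][0] & uncovered))
        chosen.append(name)
        uncovered -= bids
    return chosen

# Hypothetical candidate nodes grouping bids B1..B4, with made-up costs:
candidates = {
    "{w1,w2}":    ({"B1", "B4"}, 1.0),
    "{w1,w2,w3}": ({"B1", "B3", "B4"}, 1.5),
    "{w1,w4}":    ({"B2"}, 1.0),
}
plan = greedy_cover({"B1", "B2", "B3", "B4"}, candidates)
```

Each iteration picks the node covering the most still-unassigned bids per unit of cost, mirroring the greedy log l-approximation named on the slide.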
Other examples:
- WAND processing [Broder et al., CIKM'03]
- Keyword search in spatial data, e.g., [De Felipe et al., ICDE'08], [Zhang et al., ICDE'09]
- Entity search, e.g., [Balmin, VLDB'04], [S. Chakrabarti et al., PKDD'04, WWW'06, WWW'07], [Agrawal, WWW'09]
- (Approximate) auto-completion [Bast et al., SIGIR'06], [Nandi et al., SIGMOD'07, VLDB'07], [Chaudhuri et al., SIGMOD'09]
- ...
Techniques are mostly modifications of IR-style processing or string matching.
Search over relational objects in an RDBMS is a well-studied problem (e.g., DBExplorer, Discover, BANKS, etc.), but in vertical search the join paths (= business objects) are known up front and can be pre-materialized and indexed.
Retrieval Overhead
Retrieval Quality
Vertical Selection
Q: [canon rebel xti]
The Canon EOS Digital Rebel XTi offers an unbeatable combination of performance, ease-of-use and value. It has a newly designed 10.1 MP Canon CMOS sensor plus a host of new features including a 2.5-inch LCD monitor, the exclusive... More...
Canon EOS Digital Rebel XTi offers an unbeatable combination of performance, ease-of-use and value. It has a newly designed 10.1 Mega Pixel Canon CMOS sensor plus a host of new features including a 2.5-inch LCD monitor, the... More...
The ultra-powerful 12x optical zoom on the PowerShot S5 IS means you'll get the shot you want with no compromise, yet that's only the beginning of what makes this camera so exciting. The S5 IS is…
Retrieval Semantics: Keyword-search over product descriptions
Q: [low light camera]
Observation I: many web documents mention instances of low-light digital cameras in close proximity to the query keywords {low light, digital camera}.
Observation II (pseudo-relevance): the top web search results will contain mostly relevant pages. Hence, we can identify the most relevant entities by:
1. Submitting the query to a search engine
2. Identifying mentions of entities in the top returned documents
3. Aggregating scores for these entities
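The aggregation step can be sketched as below; `top_docs` and the entity lists stand in for the search engine and entity extractor, and the rank-based weighting is an assumption for illustration, not the exact scoring from the paper:

```python
from collections import Counter

# Pseudo-relevance sketch: score entities by aggregating their mentions
# across the top-ranked documents returned for a query.
def rank_entities(top_docs, weight=lambda rank: 1.0 / (rank + 1)):
    """top_docs: list of per-document entity-mention lists, best result first."""
    scores = Counter()
    for rank, entities in enumerate(top_docs):
        for entity in set(entities):          # count each entity once per doc
            scores[entity] += weight(rank)    # higher-ranked docs count more
    return scores.most_common()

# Hypothetical extracted entities for the top three results:
top_docs = [
    ["Canon PowerShot S5", "Canon EOS Rebel XTi"],
    ["Canon EOS Rebel XTi"],
    ["Nikon D90", "Canon EOS Rebel XTi"],
]
ranking = rank_entities(top_docs)
# "Canon EOS Rebel XTi" appears in all three documents, so it tops the list.
```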
Results [WWW 2009]: significant improvement in retrieval precision and recall. Low overhead by piggy-backing on search engine components:
- Entities extracted as part of the page-crawl pipeline.
- Entity indexing and retrieval in snippet generation.
General approach:
- Issue the search query against a document corpus.
- Identify relevant sub-components of the top results (e.g., titles, captions, tags, categories, entities, etc.).
- Aggregate over the components.
Example scoring (query clarity): score(q) = ∑_w P(w|q) · log₂( P(w|q) / P(w|C) ), where P(w|q) is estimated from the documents d ∈ Result_q and P(w|C) is the background corpus model.
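Assuming the standard query-clarity formulation, a minimal sketch (the toy word distributions below are made up):

```python
from math import log2

# Query-clarity sketch: KL divergence between the query language model
# (estimated from the top results) and the background corpus model.
def clarity(p_wq, p_wc):
    """p_wq: P(w|q) from the top results; p_wc: P(w|C) background model."""
    return sum(p * log2(p / p_wc[w]) for w, p in p_wq.items() if p > 0)

# A focused query concentrates mass on few contentful words -> higher clarity.
p_wc = {"camera": 0.02, "light": 0.03, "the": 0.95}
focused = {"camera": 0.6, "light": 0.3, "the": 0.1}
diffuse = {"camera": 0.05, "light": 0.05, "the": 0.9}
```

A query whose result model closely matches the background corpus scores near zero, signalling an unfocused query.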
Approach (query classification for advertising):
- Retrieve the top ~50 documents from a web search engine.
- Categorize each document into a commercial taxonomy.
- Use the combination of categories Cat(D1), Cat(D2), Cat(D3), ... to characterize the query for advertising.
Approach (news-intent detection):
- Retrieve the top news documents from a web search engine.
- Extract the publish date / order of each document: Date(D1), Date(D2), Date(D3), ...
- Count how many of the retrieved documents were among the k most recently published ones.
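The recency count described above can be sketched in a few lines (dates and helper names are illustrative):

```python
# Recency feature sketch: how many of the documents retrieved for a query
# fall among the k most recently published documents overall?
def recency_score(retrieved_dates, all_dates, k):
    cutoff = sorted(all_dates, reverse=True)[:k]   # the k newest dates
    threshold = min(cutoff)
    return sum(1 for d in retrieved_dates if d >= threshold)

# Hypothetical publish dates encoded as YYYYMMDD integers:
all_dates = [20090101, 20090310, 20090611, 20090612, 20090613]
retrieved = [20090613, 20090611, 20090101]
# With k = 3 the cutoff date is 20090611, so two retrieved docs qualify.
```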
Additional examples:
- [Shen et al., SIGKDD Explorations 2005]: query classification, using title, snippet and category information from each document.
- [Collins-Thompson et al., SIGIR'09]: query-difficulty prediction; documents are represented as low-dimensional feature vectors.
Many isolated variations on this general approach. What is the right abstraction or infrastructure? Neither the corpus nor the retrieval depth needs to correspond to a 'normal' web search result. => Integration of pre-computed information into retrieval, and aggregation over this data.
Q: [Sony]
Similar problem: many verticals with relevant answers. Example: the query "Harry Potter" may trigger products, images, movies, etc.
Retrieval Overhead
Retrieval Quality
Vertical Selection
Once instances of a query have been observed, its CTR can be tracked; but how do we deal with unseen queries?
Task: estimate Pr(Click | Query, News-Results). News results compete for space with web results and ads => trigger only for queries with a likely click.
News CTR is not primarily a function of document relevance:
- Relevant documents are necessary, but not sufficient, for high CTR.
- CTR for an ongoing news story (often) remains stable, even as the underlying documents change.
- "Buzz"/attention around a story makes a difference.
Identifying news queries is not a (binary) query-classification task:
- Many queries are inherently ambiguous, e.g. 'Georgia'.
- Human labeling of training data is difficult:
  'Voter Registration' 1.5%–5% CTR; 'Oil Prices' 22%–29% CTR; 'Caylee Anthony' 63%–69% CTR.
News click-through rates change (rapidly) over time.
Query-text n-grams are unlikely to yield good features.
Queries w/o news intent may still receive clicks.
CTR varies significantly among news-queries.
Keywords that are specific to a news event receive higher CTR.
Supervised learning, using collected click data. Model Pr(Click | Query, News-Results) as
Pr(Click | relevance of the top news result(s), attention/buzz around the keywords, "cohesion" of the retrieved stories, query surface properties).
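As an illustration only, the feature groups above could feed a simple linear model with a sigmoid output; the weights and feature values below are made-up placeholders, not the trained model from the talk:

```python
from math import exp

# Illustrative click-probability model over the named feature groups
# (relevance, attention/buzz, cohesion, query surface properties).
def predict_click(features, weights, bias=-2.0):
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))   # sigmoid squashes the score into (0, 1)

weights = {"relevance": 1.5, "buzz": 2.0, "cohesion": 1.0, "query_len": -0.1}
features = {"relevance": 0.8, "buzz": 0.9, "cohesion": 0.7, "query_len": 2}
p = predict_click(features, weights)
```

In practice the weights would be learned from the collected click data rather than set by hand.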
News crawl: partition news articles by crawl date; within each partition, keep titles, first paragraphs, and text bodies separate.
Now track "attention" in the news by measuring the incidence of query keywords in each partition. Each query generates an array of counters.
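A sketch of the per-partition counters, assuming tokenized sections per crawl-date partition (names and data are illustrative):

```python
from collections import defaultdict

# For each crawl-date partition and document section (title, first
# paragraph, body), count occurrences of the query's keywords,
# yielding an array of counters per query.
def count_attention(query_words, partitions):
    """partitions: list of dicts mapping section -> token list, one per date."""
    words = set(query_words)
    counters = []
    for part in partitions:
        counts = defaultdict(int)
        for section, tokens in part.items():
            counts[section] = sum(1 for t in tokens if t in words)
        counters.append(dict(counts))
    return counters

# Two hypothetical crawl-date partitions:
partitions = [
    {"title": ["hurricane", "ike", "lands"], "body": ["storm", "ike"]},
    {"title": ["markets", "fall"], "body": ["oil", "prices"]},
]
counters = count_attention(["hurricane", "ike"], partitions)
# Day 1 mentions the query in both title and body; day 2 not at all.
```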
Issues: the occurrence of query keywords in news tracks coverage, not attention. How can keywords that are 'globally' frequent be differentiated from those in new news headlines?
Multiple corpora: blogs and news complement each other, capturing coverage vs. attention. Using a 'background' corpus allows us to identify keywords indicative of news.
Query "President Obama" vs. query "Hurricane Ike": the former occurs often, but across several different news events, so Pr(click) is lower; the latter occurs less often and uniquely identifies a specific news event, so Pr(click) is larger.
How similar are the documents the query terms occur in?
Approach: for all (subsets of) query terms:
- Retrieve the matching documents.
- Compute a language model of the contexts the terms occur in.
- Compute the similarity of these models.
Similarity metric: Jensen–Shannon divergence.
See Carmel, Yom-Tov, Darlow and Pelleg, 'What makes a query difficult?', SIGIR 2006.
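The Jensen–Shannon divergence used as the similarity metric can be computed as follows (a self-contained sketch over toy word distributions):

```python
from math import log2

# Jensen-Shannon divergence between two word distributions: low divergence
# means the query terms occur in similar contexts (high cohesion).
def kl(p, q):
    return sum(p[w] * log2(p[w] / q[w]) for w in p if p[w] > 0)

def jsd(p, q):
    vocab = set(p) | set(q)
    p = {w: p.get(w, 0.0) for w in vocab}
    q = {w: q.get(w, 0.0) for w in vocab}
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}   # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = {"storm": 0.5, "wind": 0.5}
other = {"election": 0.5, "vote": 0.5}
# Identical distributions diverge by 0; disjoint ones hit the maximum of 1 bit.
```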
Baseline: CTR > 10%: 70.1%; CTR > 15%: 75.9%; CTR > 20%: 81.8%. For 82.5% of queries, the prediction is within an 'error band' of ±10%.
Using (relevance) scores for single verticals is limiting. => How indicative is a query of a vertical?
Additional sources of evidence:
- Query logs (e.g., [Arguello et al., SIGIR'09], [Diaz, WSDM'09]): sets of queries issued against, or resulting in clicks for, a given vertical; generalization through language models.
- Non-web text corpora (e.g., [Arguello et al., SIGIR'09]): collections representative of verticals or concepts (e.g., via Wikipedia); measures: clarity/cohesion, expected number of results, trends over time.
- Document categories (e.g., [Collins-Thompson et al., SIGIR'09]).
- Concept graphs (e.g., [Diemert et al., WWW'09]): based on co-references between concepts; extracted automatically, leveraging the search engine to take advantage of its relevance model and spam filtering.
Query-text-based classification performs well given large training data sets. Automatic generation of queries/labels (e.g., [Li et al., SIGIR'08], [Fuxman et al., SIGKDD'09]).
Retrieval processing:
- Loose coupling between retrieval processing and ranking.
- (Worst-case) latency matters.
- Faster retrieval by leveraging data distributions and matching semantics.
Novel retrieval problems: integration of web search and ad/vertical retrieval.
- Search provides context and can be used to enrich both the query and the text associated with items in a vertical.
- The search, extract/pre-compute, and aggregate approach appears to apply in many scenarios.
- Extends to additional (non-web) corpora, query logs, etc.; combining evidence from multiple sources.
Challenges / next steps:
- Identifying common abstractions / operators.
- What is the correct system infrastructure?