Evaluation of (Search) Results
How do we know if our results are any good? Evaluating a search engine:
Benchmarks
Precision and recall
Results summaries: making our good results usable to a user
Architecture of a Search Engine
[Figure: architecture of a search engine. A web spider crawls the Web and feeds documents to an indexer, which builds the indexes; the search component uses these indexes, plus ad indexes, to answer user queries. Illustrated with a Google results page for the query "miele", showing ranked web results and sponsored links.]
Benchmarks for a search engine
Index building: how fast does it index?
Number of documents per hour (at some average document size)
Search speed: how fast does it search?
Latency as a function of index size
Expressiveness: how expressive is its query language?
Ability to express complex information needs
Speed on complex queries
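As a rough illustration, here is a minimal Python sketch of how the first two benchmarks might be measured; `build_index` and `search` are hypothetical stand-ins for a real engine's API, not part of any particular system.

```python
import time

def indexing_throughput(docs, build_index):
    """Measure index-building speed as documents indexed per hour."""
    start = time.perf_counter()
    index = build_index(docs)          # hypothetical engine call
    elapsed = time.perf_counter() - start
    return index, len(docs) / elapsed * 3600.0

def mean_query_latency(index, queries, search):
    """Average seconds per query at the current index size."""
    start = time.perf_counter()
    for q in queries:
        search(index, q)               # hypothetical engine call
    return (time.perf_counter() - start) / len(queries)
```

Repeating the latency measurement at several index sizes gives the "latency as a function of index size" curve.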
Measures for a search engine
All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise
But what is the most important measure? User happiness.
What is it? Speed of response and size of index are factors,
but blindingly fast, useless answers won't make a user happy.
We need a way of quantifying user happiness.
Who is the user we are trying to make happy?
Web search engine: users find what they want and return to the engine
eCommerce site: users find what they want and make a purchase
Is it the end-user, or the eCommerce site, whose happiness we measure?
Enterprise (company/government/academic): increased “user productivity”
How much time do users save when looking for information?
Can measure the rate at which users return to the engine.
Measure time to purchase, or the fraction of searchers who become buyers?
Many other criteria, e.g., breadth of access, secure access, etc.
Happiness: elusive to measure
Commonest proxy: relevance of search results.
But how do you measure relevance?
We will detail a methodology, then examine its issues.
Relevance measurement requires three elements:
A benchmark document collection
A benchmark suite of queries
A binary assessment of either Relevant or Irrelevant for each query-doc pair
Some assessments work with more-than-binary judgments, but binary is the standard.
Evaluating an IR system
Note: the information need is translated into a query
Relevance is assessed relative to the information need, not the query.
E.g., Information need: I'm looking for information on what types of computer applications are available at Wellesley College.
Query: computer applications available wellesley college
You evaluate whether the documents found address the information need,
not whether they contain the query keywords.
Standard relevance benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters and other benchmark doc collections are used.
"Retrieval tasks" are specified, sometimes as queries.
Human experts mark, for each query and for each doc, Relevant or Irrelevant,
or at least for the subset of docs that some system returned for that query.
Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
Precision P = tp / (tp + fp)
Recall R = tp / (tp + fn)
               | Relevant             | Not Relevant
Retrieved      | tp (true positive)   | fp (false positive)
Not Retrieved  | fn (false negative)  | tn (true negative)
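As a concrete illustration of these definitions, a small Python sketch that computes unranked precision and recall from sets of document IDs (the IDs below are made up):

```python
def precision_recall(retrieved, relevant):
    """Compute unranked precision and recall from two sets of doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant docs that were retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    p = tp / (tp + fp) if retrieved else 0.0
    r = tp / (tp + fn) if relevant else 0.0
    return p, r

# Example: 3 of the 4 retrieved docs are relevant; 6 docs are relevant in total.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9}))  # (0.75, 0.5)
```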
Accuracy
Given a query, an engine classifies each doc as "Relevant" or "Irrelevant".
Accuracy of an engine: the fraction of these classifications that is correct:
Accuracy = (tp + tn) / (tp + fp + fn + tn)
Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget…
People using an IR system want to find something and have a certain tolerance for junk.
[Mock screenshot: a search box that returns "0 matching results found." for every query; since almost every doc in a large corpus is irrelevant to any given query, this engine is almost always "correct".]
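To make the joke precise, a small worked example (the corpus size and relevance counts are invented for illustration):

```python
# With a 10-million-doc corpus and only 10 relevant docs, an engine that
# retrieves *nothing* gets tp=0, fp=0, fn=10, tn=9,999,990.
tp, fp, fn, tn = 0, 0, 10, 10_000_000 - 10
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"{accuracy:.6%}")  # 99.999900% accurate, yet useless
```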
Precision vs Recall
You can get high recall (but low precision) by retrieving all docs for all queries!
Recall can only go up as the number of docs retrieved increases.
But as either the number of docs retrieved or recall increases,
precision goes down (in a good system).
A fact with strong empirical confirmation
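A toy illustration of the tradeoff, computing precision and recall at increasing cutoffs over one ranked list (the relevance flags below are invented):

```python
# 1 = relevant, 0 = not relevant, in rank order.
ranked_relevance = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
total_relevant = 5   # assume one relevant doc was never retrieved

for k in range(1, len(ranked_relevance) + 1):
    tp = sum(ranked_relevance[:k])
    print(f"k={k:2d}  P={tp / k:.2f}  R={tp / total_relevant:.2f}")
# Recall never decreases as k grows; precision tends to fall.
```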
Difficulties in using precision/recall
We should average over large corpus/query ensembles
We need human relevance assessments, and people aren't reliable assessors.
Assessments have to be binary; what about nuanced assessments?
Results are heavily skewed by corpus/authorship, and may not translate from one domain to another.
A combined measure: F
Combined measure that assesses this tradeoff is F measure (weighted harmonic mean):
People usually use the balanced F1 measure, i.e., with β = 1 (equivalently, α = ½).
The harmonic mean is a conservative average.
See C. J. van Rijsbergen, Information Retrieval.
F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \quad \text{where } \beta^2 = \frac{1 - \alpha}{\alpha}
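A small Python sketch of the weighted F measure in the β form above; with the precision/recall values from the earlier example it gives the balanced F1:

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    if p == 0.0 and r == 0.0:
        return 0.0
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

print(f_measure(0.75, 0.5))           # balanced F1 = 0.6
print(f_measure(0.75, 0.5, beta=2))   # recall-weighted F2 ≈ 0.536
```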
F1 and other averages
[Figure: F1 and other averages of precision and recall plotted as precision varies, with recall fixed at 70%.]
Result Summaries
Having ranked the documents matching a query, we wish to present a results list.
Most commonly: the document title plus a short summary.
The title is typically automatically extracted from document metadata.
What about the summaries?
Summaries
Two basic kinds: static and dynamic.
A static summary of a document is always the same, regardless of the query that hit the doc.
Dynamic summaries are query-dependent: they attempt to explain why the document was retrieved for the query at hand.
Static summaries
In typical systems, the static summary is a subset of the document
Simplest heuristic: the first 50 (or so) words of the document
Summary cached at indexing time
More sophisticated: extract from each document a set of “key” sentences
Simple NLP heuristics score each sentence; the summary is made up of the top-scoring sentences.
Most sophisticated: NLP used to synthesize a summary
Seldom used in IR; cf. text summarization work
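A minimal sketch of the simplest static-summary heuristic mentioned above (first 50 or so words, computed once at indexing time); the word count and tokenization are arbitrary choices:

```python
def static_summary(text, max_words=50):
    """Cache-friendly static summary: the first max_words words of the doc."""
    words = text.split()
    summary = " ".join(words[:max_words])
    return summary + (" ..." if len(words) > max_words else "")
```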
Dynamic summaries
Present one or more “windows” within the document that contain several of the query terms
“KWIC” snippets: Keyword in Context presentation
Generated in conjunction with scoring.
If the query is found as a phrase, show the/some occurrences of the phrase in the doc.
If not, show windows within the doc that contain multiple query terms.
The summary itself gives the entire content of the window – all terms, not only the query terms – how?
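One way such a window could be chosen, sketched in Python: slide a fixed-width window of words over the document and keep the one covering the most distinct query terms. The window size and tokenization here are arbitrary simplifications, not any particular engine's method.

```python
def kwic_snippet(text, query_terms, window=10):
    """Return the window of `window` words covering the most query terms."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    for i in range(max(1, len(words) - window + 1)):
        # Count distinct query terms appearing in this window.
        hits = len(terms & {w.lower().strip(".,!?") for w in words[i:i + window]})
        if hits > best_hits:
            best_start, best_hits = i, hits
    # The snippet contains the window's entire content, not only query terms.
    return " ".join(words[best_start:best_start + window])
```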
Dynamic summaries
Producing good dynamic summaries is tricky.
The real estate for the summary is normally small and fixed.
We want short items, so as to show as many KWIC matches as possible, perhaps along with other things like the title.
We want snippets long enough to be useful.
We want linguistically well-formed snippets: users prefer snippets that contain complete phrases.
We want snippets maximally informative about the document.
But users really like snippets, even if they complicate IR system design