Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
"Give us your ranking, we'll have it clicked!"
http://living-labs.net | @livinglabsnet

Krisztian Balog, University of Stavanger
Liadh Kelly, Trinity College Dublin
Anne Schuth, Blendle

7th International Conference of the CLEF Association (CLEF 2016) | Évora, Portugal, 2016
Motivation
- Overall goal: make information retrieval evaluation more realistic
[Diagram: a new retrieval method serving users on the live site, producing interaction data]
#1: How to test a new method with real users in their natural task environment (i.e., on the live site)?
#2: How to make interaction data available for method development?
Key idea
[Diagram: an API connects new retrieval methods with the users of the live site and with the site's data (docs/products, logs, etc.)]
K. Balog, L. Kelly, and A. Schuth. Head First: Living Labs for Ad-hoc Search Evaluation. CIKM'14
#1: An API orchestrates all data exchange between the live site and the experimental systems.
#2: Focus on frequent (head) queries: ranked result lists can be generated offline, and there is enough traffic on them (historical and live).
#3: Target sites are medium to large organizations with a fair amount of search volume that typically lack their own R&D department.
Methodology
1. Queries, candidate documents, and historical search and click data are made available through the API:

    { "queries": [
        { "creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
          "qid": "R-q1",
          "qstr": "monster high",
          "type": "train" },
        { "creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
          "qid": "R-q51",
          "qstr": "puzzle",
          ...
Candidate documents for each query:

    { "doclist": [
        { "docid": "R-d1291",
          "site_id": "R",
          "title": "LEGO DUPLO Hamupip\u0151ke hint\u00f3ja 6153" },
        { "docid": "R-d1306",
          "site_id": "R",
          "title": "LEGO Rend\u0151rkapit\u00e1nys\u00e1g 5681" },
        ...
Document content (product record):

    { "content": {
        "age_max": 3,
        "age_min": 1,
        "arrived": "2014-08-28",
        "available": 0,
        "brand": "Lego",
        "category": "LEGO",
        "category_id": "38",
        "characters": [],
        "description": "Lego Duplo - \u00c9p\u00edt\u0151- \u00e9s j\u00e1t\u00e9kkock\u00e1k kicsiknek 10553<br />[…]",
        ...
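The exposed data can be consumed programmatically. A minimal sketch, using abbreviated stand-ins for the JSON structures shown above (not real API responses; the field names mirror the examples on the slides):

```python
import json

# Abbreviated stand-ins mirroring the query list and a product record
# from the slides above.
queries_json = '''{"queries": [
    {"qid": "R-q1", "qstr": "monster high", "type": "train"},
    {"qid": "R-q51", "qstr": "puzzle", "type": "train"}]}'''
doc_json = '''{"docid": "R-d43", "site_id": "R",
    "content": {"brand": "Mattel", "price": 8675.0,
                "queries": {"monster high": "0.741", "monster": "0.222"}},
    "title": "Monster High Scaris doll"}'''

queries = json.loads(queries_json)["queries"]
doc = json.loads(doc_json)

# Collect the training queries, and index the historical per-query
# click weights of a document -- a simple feature one might use when
# ranking candidate documents.
train_queries = [q["qstr"] for q in queries if q["type"] == "train"]
click_weight = {q: float(w) for q, w in doc["content"]["queries"].items()}

print(train_queries)               # ['monster high', 'puzzle']
print(click_weight["monster high"])  # 0.741
```

In practice these structures are fetched from the API rather than embedded as literals; the parsing logic is the same.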
2. Rankings are generated for each query and uploaded through the API:

    { "qid": "U-q22",
      "runid": "82",
      "creation_time": "Wed, 04 Jun 2014 15:03:56 -0000",
      "doclist": [
        { "docid": "U-d4" },
        { "docid": "U-d2" },
        ... ],
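Assembling a run in this format before upload can be sketched as follows (the query and document ids are illustrative; the actual HTTP upload against the API is omitted):

```python
import json

def make_run(qid, runid, ranking):
    """Assemble a run submission in the JSON shape shown above."""
    return {"qid": qid,
            "runid": runid,
            "doclist": [{"docid": d} for d in ranking]}

run = make_run("U-q22", "82", ["U-d4", "U-d2", "U-d9"])
payload = json.dumps(run)  # serialized body of the upload request
print(run["doclist"][0])   # {'docid': 'U-d4'}
```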
3. When any of the test queries is fired, the live site requests rankings from the API and interleaves them with those of the production system.
Interleaving
- The site provides the set of candidate items that can be re-ranked (a safety mechanism)
- The experimental ranking is interleaved with the production ranking
- Needs 1-2 orders of magnitude less data than A/B testing (also, it is a within-subject as opposed to between-subject design)

Example:
system A: doc 1, doc 2, doc 3, doc 4, doc 5
system B: doc 2, doc 4, doc 7, doc 1, doc 3
interleaved list: doc 1, doc 2, doc 4, doc 3, doc 7
Inference: A > B
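The feedback records later in the deck carry "type": "tdi", i.e. team-draft interleaving. A simplified, deterministic sketch of that algorithm (the real variant flips a coin for which team picks first in each round, so its output on the lists above may differ from this one):

```python
def team_draft_interleave(list_a, list_b):
    """Simplified team-draft interleaving: teams alternate picking their
    highest-ranked document not yet in the interleaved list, and each
    picked document is credited to the team that contributed it."""
    interleaved, teams = [], {}
    turn_a = True  # deterministic here; randomized per round in practice
    total = len(set(list_a) | set(list_b))
    while len(interleaved) < total:
        source, team = (list_a, "A") if turn_a else (list_b, "B")
        for doc in source:
            if doc not in teams:
                interleaved.append(doc)
                teams[doc] = team
                break
        turn_a = not turn_a
    return interleaved, teams

a = ["doc 1", "doc 2", "doc 3", "doc 4", "doc 5"]
b = ["doc 2", "doc 4", "doc 7", "doc 1", "doc 3"]
ranking, teams = team_draft_interleave(a, b)
print(ranking)  # ['doc 1', 'doc 2', 'doc 3', 'doc 4', 'doc 5', 'doc 7']
```

Clicks on the interleaved list are then credited to the team that contributed the clicked document, which is what makes the A-vs-B inference possible.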
4. Participants get detailed feedback on user interactions (clicks):

    { "feedback": [
        { "qid": "S-q1",
          "runid": "baseline",
          "type": "tdi",
          "doclist": [
            { "docid": "S-d1",
              "clicked": true,
              "team": "site" },
            ...
5. The ultimate measure is the number of "wins" against the production system (aggregated over a period of time):

    Outcome = #Wins / (#Wins + #Losses)
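Credit assignment and the outcome measure can be sketched as follows. The feedback records are illustrative, in the shape of the step-4 example; the team labels "participant" and "site" are assumptions for the sketch. A query impression is a win for the participant if their documents attract more clicks than the site's:

```python
def query_result(doclist):
    """Count clicks per team for one interleaved impression."""
    clicks = {"participant": 0, "site": 0}
    for d in doclist:
        if d.get("clicked"):
            clicks[d["team"]] += 1
    if clicks["participant"] > clicks["site"]:
        return "win"
    if clicks["participant"] < clicks["site"]:
        return "loss"
    return "tie"

def outcome(feedback):
    """Outcome = #Wins / (#Wins + #Losses); ties are ignored."""
    results = [query_result(f["doclist"]) for f in feedback]
    wins, losses = results.count("win"), results.count("loss")
    return wins / (wins + losses) if wins + losses else None

feedback = [  # illustrative impressions, not real data
    {"qid": "S-q1", "doclist": [
        {"docid": "S-d1", "clicked": True, "team": "participant"},
        {"docid": "S-d2", "clicked": False, "team": "site"}]},
    {"qid": "S-q2", "doclist": [
        {"docid": "S-d3", "clicked": True, "team": "site"}]},
    {"qid": "S-q3", "doclist": [
        {"docid": "S-d4", "clicked": True, "team": "participant"}]},
]
print(outcome(feedback))  # 2 wins vs. 1 loss
```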
What is in it for participants?
- Access to privileged commercial data (search and click-through data)
- Opportunity to test IR systems with real, unsuspecting users in a live setting (not the same as crowdsourcing!)
- Continuous evaluation is possible, not limited to a yearly evaluation cycle
Benchmark organization
- Training period
  - train queries: feedback available, individual feedback, updates possible
  - test queries: feedback available, no individual feedback, updates possible
- Test period
  - test queries: no feedback available, no individual feedback, updates not possible
Product search
- Ad-hoc retrieval over a product catalog
- Several thousand products
- Limited amount of text, lots of structure (categories, characters, brands, etc.)
Product data
- Product name
- Price / bonus price
- Short description
- Long description
- Recommended age from/to
- Gender recommendation
- Categories
- Brands
- (Links to) photos
- Frequent queries that led to the product

    { "content": {
        "age_max": 10,
        "age_min": 6,
        "arrived": "2014-08-28",
        "available": 1,
        "brand": "Mattel",
        "category": "Bab\u00e1k, kell\u00e9kek",
        "category_id": "25",
        "characters": [],
        "description": "A Monster High\u00ae iskola sz\u00f6rnycsemet\u00e9i […]",
        "gender": 2,
        "main_category": "Baba, babakocsi",
        "main_category_id": "3",
        "photos": [
          "http://regiojatek.hu/data/regio_images/normal/20777_0.jpg",
          "http://regiojatek.hu/data/regio_images/normal/20777_1.jpg",
          […] ],
        "price": 8675.0,
        "product_name": "Monster High Scaris Parav\u00e1rosi baba t\u00f6bbf\u00e9le",
        "queries": {
          "clawdeen": "0.037",
          "monster": "0.222",
          "monster high": "0.741" },
        "short_description": "A Monster High\u00ae iskola sz\u00f6rnycsemet\u00e9i els\u0151 k\u00fclf\u00f6ldi \u00fatjukra indulnak..." },
      "creation_time": "Mon, 11 May 2015 04:52:59 -0000",
      "docid": "R-d43",
      "site_id": "R",
      "title": "Monster High Scaris Parav\u00e1rosi baba t\u00f6bbf\u00e9le" }

The "queries" field lists the frequent queries that led to the product.
Queries
- Typically very short
- Examples: monster high, magnetiz, duplo, lego friends, geomag, trash+pack, barbie, monopoly, lego duplo, transformers, star wars, nerf, carrera, baba
Results (2015)
[Figure: outcome per evaluation round (rounds 0-5, outcome ranging from 0 to 0.6) for the Baseline, UiS, GESIS, and IRIT systems]
Inventory changes
[Figure: daily number of products that newly arrived, became available, or became unavailable, over the period 05-01 to 05-15]
Summary
- Successes
  - Experimental methodology
  - Many interesting opportunities to address current limitations (come to the NewsREEL & LL4IR session tomorrow)
  - The living labs platform: open source, can be used for a variety of tasks
  - Some interesting work for product search (see the best of the labs session)
- Lack of success
  - Raising sufficient interest in the use-cases at CLEF
Limitations / Open issues
- Head queries only: a considerable portion of traffic, but only popular information needs
- Lack of context: no knowledge of the searcher's location, previous searches, etc.
- No real-time feedback: the API provides detailed feedback, but it is not immediate
- Limited control: experimentation is limited to single searches, where results are interleaved with those of the production system; there is no control over the entire result list
- Ultimate measure of success: search is only a means to an end, it is not the ultimate goal
TREC Open Search
http://trec-open-search.org/
- Use-case: academic search (ad-hoc document search)
- Sites: CiteSeerX, SSOAR (German social sciences), Microsoft Academic Search
- Round #3 runs from Oct 1 to Nov 15