Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
http://living-labs.net | @livinglabsnet
“Give us your ranking, we’ll have it clicked!”
Krisztian Balog (University of Stavanger), Liadh Kelly (Trinity College Dublin), Anne Schuth (Blendle)
7th International Conference of the CLEF Association (CLEF 2016) | Évora, Portugal, 2016


Living Labs for IR Evaluation

Motivation
- Overall goal: make information retrieval evaluation more realistic

[Diagram: new retrieval method, users, live site, interaction data]

#1 How to test a new method with real users in their natural task environment (i.e., on the live site)?

#2 How to make interaction data available for method development?

Key idea

[Diagram: new retrieval methods, API, live site (users), data (docs/products, logs, etc.)]

K. Balog, L. Kelly, and A. Schuth. Head First: Living Labs for Ad-hoc Search Evaluation. CIKM'14

Key idea #1: An API orchestrates all data exchange between the live site and experimental systems

Key idea #2: Focus on frequent (head) queries
- Ranked result lists can be generated offline
- Enough traffic on them (historical & live)

Key idea #3: Medium to large organizations with a fair amount of search volume
- Typically lack their own R&D department

Methodology
1. Queries, candidate documents, historical search and click data made available

API
{
  "queries": [
    {
      "creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
      "qid": "R-q1",
      "qstr": "monster high",
      "type": "train"
    },
    {
      "creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
      "qid": "R-q51",
      "qstr": "puzzle",
      […]

Methodology (step 1, continued)

API
{
  "doclist": [
    {
      "docid": "R-d1291",
      "site_id": "R",
      "title": "LEGO DUPLO Hamupip\u0151ke hint\u00f3ja 6153"
    },
    {
      "docid": "R-d1306",
      "site_id": "R",
      "title": "LEGO Rend\u0151rkapit\u00e1nys\u00e1g 5681"
    },
    […]

Methodology (step 1, continued)

API
{
  "content": {
    "age_max": 3,
    "age_min": 1,
    "arrived": "2014-08-28",
    "available": 0,
    "brand": "Lego",
    "category": "LEGO",
    "category_id": "38",
    "characters": [],
    "description": "Lego Duplo - \u00c9p\u00edt\u0151-\u00e9s j\u00e1t\u00e9kkock\u00e1k kicsiknek 10553<br />[…]",
    […]
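The query list from step 1 can be consumed with a few lines of Python. This is a minimal sketch that assumes the response body has already been fetched as a string; the second query in the sample is a made-up placeholder (the second entry on the slide is truncated), and `split_by_type` is a hypothetical helper name, not part of the living labs API.

```python
import json

# Sample payload shaped like the "queries" response on the slide.
# The second entry is a made-up placeholder, not real lab data.
SAMPLE = """
{"queries": [
  {"creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
   "qid": "R-q1", "qstr": "monster high", "type": "train"},
  {"creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
   "qid": "X-q0", "qstr": "example query", "type": "test"}
]}
"""


def split_by_type(payload):
    """Group query ids by their "type" field (e.g. "train" or "test")."""
    groups = {}
    for query in json.loads(payload)["queries"]:
        groups.setdefault(query["type"], []).append(query["qid"])
    return groups


groups = split_by_type(SAMPLE)
print(groups)
```

Separating the two query types early is convenient because they are treated differently during the benchmark periods.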

Methodology
2. Rankings are generated for each query and uploaded through an API

API
{
  "qid": "U-q22",
  "runid": "82",
  "creation_time": "Wed, 04 Jun 2014 15:03:56 -0000",
  "doclist": [
    { "docid": "U-d4" },
    { "docid": "U-d2" },
    ...
  ],
  […]
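A run for step 2 can be assembled as a plain dictionary before it is serialized and uploaded. The sketch below only builds the request body; `build_run` is a hypothetical helper, and leaving out `creation_time` rests on the assumption that the server fills it in. The upload call itself is omitted because the endpoint is not shown on the slide.

```python
import json


def build_run(qid, runid, ranked_docids):
    """Return a run document with the fields shown on the slide.

    creation_time is left out on the assumption that the server sets it.
    """
    if len(set(ranked_docids)) != len(ranked_docids):
        raise ValueError("a ranking must not contain duplicate docids")
    return {
        "qid": qid,
        "runid": runid,
        # Order matters: the first entry is the top-ranked document.
        "doclist": [{"docid": docid} for docid in ranked_docids],
    }


run = build_run("U-q22", "82", ["U-d4", "U-d2"])
body = json.dumps(run)  # request body for the (omitted) upload call
print(body)
```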

Methodology
3. When any of the test queries is fired, the live site requests rankings from the API and interleaves them with those of the production system

API

Interleaving
- Site provides the set of candidate items that can be re-ranked (safety mechanism)
- Experimental ranking is interleaved with the production ranking
- Needs 1-2 orders of magnitude less data than A/B testing (also, it is a within-subject as opposed to a between-subject design)

System A: doc 1, doc 2, doc 3, doc 4, doc 5
System B: doc 2, doc 4, doc 7, doc 1, doc 3
Interleaved list: doc 1, doc 2, doc 4, doc 3, doc 7
Inference: A > B
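The interleaving example can be reproduced in spirit with team-draft interleaving, which the `"type": "tdi"` value in the feedback sample suggests is the method in use. The following is a minimal sketch of the general technique, not the platform's own code; the function names and the optional `rng` parameter are assumptions made for illustration.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Interleave two rankings; return the list and each doc's team."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, teams = [], {}
    while len(interleaved) < length:
        # Documents each system could still contribute.
        remaining = {
            name: [doc for doc in ranking if doc not in teams]
            for name, ranking in rankings.items()
        }
        if not remaining["A"] and not remaining["B"]:
            break
        # The team with fewer picks goes next; ties are broken by coin flip.
        if picks["A"] != picks["B"]:
            turn = "A" if picks["A"] < picks["B"] else "B"
        else:
            turn = "A" if rng.random() < 0.5 else "B"
        if not remaining[turn]:
            turn = "B" if turn == "A" else "A"
        doc = remaining[turn][0]  # that team's best unused document
        teams[doc] = turn
        picks[turn] += 1
        interleaved.append(doc)
    return interleaved, teams


def infer_winner(teams, clicked_docs):
    """Credit each click to the team that contributed the document."""
    a = sum(1 for doc in clicked_docs if teams.get(doc) == "A")
    b = sum(1 for doc in clicked_docs if teams.get(doc) == "B")
    return "A" if a > b else "B" if b > a else "tie"


system_a = ["doc 1", "doc 2", "doc 3", "doc 4", "doc 5"]
system_b = ["doc 2", "doc 4", "doc 7", "doc 1", "doc 3"]
interleaved, teams = team_draft_interleave(system_a, system_b, 5, random.Random(0))
print(interleaved, teams)
```

Because the coin flips are random, different runs can yield different interleaved lists for the same pair of rankings; the list on the slide is one possible result.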

Methodology
4. Participants get detailed feedback on user interactions (clicks)

API
{
  "feedback": [
    {
      "qid": "S-q1",
      "runid": "baseline",
      "type": "tdi",
      "doclist": [
        {
          "docid": "S-d1",
          "clicked": true,
          "team": "site"
        },
        […]
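One way to read a feedback entry from step 4 is to count clicks per team and label the impression as a win, loss, or tie for the participant. This is a sketch under assumptions: the `"participant"` team label and the second document in the sample are hypothetical (the slide only shows a `"site"` entry), and `impression_result` is an illustrative name.

```python
def impression_result(impression, participant_team="participant", site_team="site"):
    """Label one feedback entry from the participant's point of view."""
    ours = theirs = 0
    for doc in impression["doclist"]:
        if not doc.get("clicked"):
            continue
        if doc.get("team") == participant_team:
            ours += 1
        elif doc.get("team") == site_team:
            theirs += 1
    if ours > theirs:
        return "win"
    if theirs > ours:
        return "loss"
    return "tie"


# First document as on the slide; the second one is a made-up placeholder.
sample = {
    "qid": "S-q1",
    "runid": "baseline",
    "type": "tdi",
    "doclist": [
        {"docid": "S-d1", "clicked": True, "team": "site"},
        {"docid": "X-d0", "clicked": False, "team": "participant"},
    ],
}
result = impression_result(sample)
print(result)
```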

Methodology
5. Ultimate measure is the number of “wins” against the production system (aggregated over a period of time)

Outcome = #Wins / (#Wins + #Losses)
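The outcome measure is simple to compute from a list of per-impression results. The sketch below assumes each impression has already been labeled `"win"`, `"loss"`, or `"tie"` (these labels are an illustrative convention); ties do not enter the formula, and returning `None` when there are no wins or losses is a design choice made here to avoid dividing by zero.

```python
def outcome(results):
    """Compute #Wins / (#Wins + #Losses); ties are ignored."""
    wins = results.count("win")
    losses = results.count("loss")
    if wins + losses == 0:
        return None  # undefined without at least one win or loss
    return wins / (wins + losses)


score = outcome(["win", "loss", "tie", "win"])
print(score)
```

Under this definition, a value above 0.5 means the experimental system won more impressions than it lost against the production system.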

What is in it for participants?

- Access to privileged commercial data
  - (Search and click-through data)
- Opportunity to test IR systems with real, unsuspecting users in a live setting
  - (Not the same as crowdsourcing!)
- (Continuous evaluation is possible, not limited to yearly evaluation cycle)

The Living Labs Platform

Source code: https://bitbucket.org/living-labs/ll-api

Documentation: http://doc.living-labs.net/

Dashboard: http://dashboard.living-labs.net/

CLEF LL4IR

Use-cases

• Product search (REGIO Játék)

• Web search (Seznam)

Benchmark organization (by query type and period)

train queries
- feedback available
- individual feedback
- update possible

test queries, training period
- feedback available
- no individual feedback
- update possible

test queries, test period
- no feedback available
- no individual feedback
- update not possible

Product search
- Ad-hoc retrieval over a product catalog
- Several thousand products
- Limited amount of text, lots of structure
  - Categories, characters, brands, etc.

Product data
- Product name
- Price / bonus price
- Short description
- Recommended age from/to
- Gender recommendation
- Categories
- Brands
- Long description
- (Links to) photos

{
  "content": {
    "age_max": 10,
    "age_min": 6,
    "arrived": "2014-08-28",
    "available": 1,
    "brand": "Mattel",
    "category": "Bab\u00e1k, kell\u00e9kek",
    "category_id": "25",
    "characters": [],
    "description": "A Monster High\u00ae iskola sz\u00f6rnycsemet\u00e9i […]",
    "gender": 2,
    "main_category": "Baba, babakocsi",
    "main_category_id": "3",
    "photos": [
      "http://regiojatek.hu/data/regio_images/normal/20777_0.jpg",
      "http://regiojatek.hu/data/regio_images/normal/20777_1.jpg",
      […]
    ],
    "price": 8675.0,
    "product_name": "Monster High Scaris Parav\u00e1rosi baba t\u00f6bbf\u00e9le",
    "queries": {
      "clawdeen": "0.037",
      "monster": "0.222",
      "monster high": "0.741"
    },
    "short_description": "A Monster High\u00ae iskola sz\u00f6rnycsemet\u00e9i els\u0151 k\u00fclf\u00f6ldi \u00fatjukra indulnak..."
  },
  "creation_time": "Mon, 11 May 2015 04:52:59 -0000",
  "docid": "R-d43",
  "site_id": "R",
  "title": "Monster High Scaris Parav\u00e1rosi baba t\u00f6bbf\u00e9le"
}

The "queries" field: frequent queries that led to the product
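The `queries` field could serve as a simple historical signal when ranking products. The sketch below only shows how to read it; note that the weights in the sample document are stored as strings, so they need converting. `historical_query_weight` is a hypothetical helper, and what exactly the numbers measure is not stated on the slide.

```python
def historical_query_weight(doc, qstr):
    """Return the weight recorded for qstr in the doc's "queries" field."""
    queries = doc.get("content", {}).get("queries", {})
    return float(queries.get(qstr, 0.0))  # weights are stored as strings


# Trimmed copy of the product document shown above.
doc = {
    "docid": "R-d43",
    "content": {
        "queries": {
            "clawdeen": "0.037",
            "monster": "0.222",
            "monster high": "0.741",
        }
    },
}
weight = historical_query_weight(doc, "monster high")
print(weight)
```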

Queries
- Typically very short

monster high magnetiz duplo lego friends geomag trash+pack barbie

monopoly lego duplo transformers star wars nerf carrera baba

Results (2015)

[Figure: Outcome (y-axis, 0 to 0.6) per evaluation round (x-axis, 0 to 5) for Baseline, UiS, GESIS, and IRIT]

Inventory changes

[Figure: #Products (y-axis) per day (x-axis, 05-01 to 05-15), by change type: New arrival, Became available, Became unavailable]

Summary and Outlook

Summary
- Successes
  - Experimental methodology
    - Many interesting opportunities to address current limitations (come to NewsREEL & LL4IR session tomorrow)
  - The living labs platform
    - Open source, can be used for a variety of tasks
  - Some interesting work for product search
    - See best of the labs session
- Lack of success
  - Raising sufficient interest in the use-cases at CLEF

Limitations / Open issues
- Head queries only: Considerable portion of traffic, but only popular info needs
- Lack of context: No knowledge of the searcher’s location, previous searches, etc.
- No real-time feedback: API provides detailed feedback, but it’s not immediate
- Limited control: Experimentation is limited to single searches, where results are interleaved with those of the production system; no control over the entire result list
- Ultimate measure of success: Search is only a means to an end, it is not the ultimate goal

TREC Open Search
http://trec-open-search.org/
- Use-case: academic search
  - Ad-hoc document search
- Sites
  - CiteSeerX
  - SSOAR (German Social Sciences)
  - Microsoft Academic Search
- Round #3 runs from Oct 1 to Nov 15

We you!
living-labs.net

Thanks to