Overview of the Living Labs for IR Evaluation (LL4IR) CLEF Lab
"Give us your ranking, we'll have it clicked!"
http://living-labs.net | @livinglabsnet

Krisztian Balog, University of Stavanger
Liadh Kelly, Trinity College Dublin
Anne Schuth, Blendle

7th International Conference of the CLEF Association (CLEF 2016) | Évora, Portugal, 2016
Motivation
- Overall goal: make information retrieval evaluation more realistic
[Diagram: a new retrieval method serving users on the live site, producing interaction data]
#1: How to test a new method with real users in their natural task environment (i.e., on the live site)?
#2: How to make interaction data available for method development?
Key idea
[Diagram: an API connects new retrieval methods with the users of the live site and with the site's data (docs/products, logs, etc.)]
K. Balog, L. Kelly, and A. Schuth. Head First: Living Labs for Ad-hoc Search Evaluation. CIKM'14
#1: An API orchestrates all data exchange between the live site and the experimental systems.
#2: Focus on frequent (head) queries: ranked result lists can be generated offline, and there is enough traffic on them (historical and live).
#3: Target sites are medium to large organizations with a fair amount of search volume that typically lack their own R&D department.
Methodology
1. Queries, candidate documents, and historical search and click data are made available through the API:

    { "queries": [
        { "creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
          "qid": "R-q1",
          "qstr": "monster high",
          "type": "train" },
        { "creation_time": "Wed, 22 Apr 2015 09:15:41 -0000",
          "qid": "R-q51",
          "qstr": "puzzle",
          ...
Candidate documents for each query:

    { "doclist": [
        { "docid": "R-d1291",
          "site_id": "R",
          "title": "LEGO DUPLO Hamupip\u0151ke hint\u00f3ja 6153" },
        { "docid": "R-d1306",
          "site_id": "R",
          "title": "LEGO Rend\u0151rkapit\u00e1nys\u00e1g 5681" },
        ...
Document content (product record):

    { "content": {
        "age_max": 3,
        "age_min": 1,
        "arrived": "2014-08-28",
        "available": 0,
        "brand": "Lego",
        "category": "LEGO",
        "category_id": "38",
        "characters": [],
        "description": "Lego Duplo - \u00c9p\u00edt\u0151- \u00e9s j\u00e1t\u00e9kkock\u00e1k kicsiknek 10553<br />[…]",
        ...
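The exposed data can be consumed programmatically. A minimal sketch, using abbreviated stand-ins for the JSON structures shown above (not real API responses; the field names mirror the examples on the slides):

```python
import json

# Abbreviated stand-ins mirroring the query list and a product record
# from the slides above.
queries_json = '''{"queries": [
    {"qid": "R-q1", "qstr": "monster high", "type": "train"},
    {"qid": "R-q51", "qstr": "puzzle", "type": "train"}]}'''
doc_json = '''{"docid": "R-d43", "site_id": "R",
    "content": {"brand": "Mattel", "price": 8675.0,
                "queries": {"monster high": "0.741", "monster": "0.222"}},
    "title": "Monster High Scaris doll"}'''

queries = json.loads(queries_json)["queries"]
doc = json.loads(doc_json)

# Collect the training queries, and index the historical per-query
# click weights of a document -- a simple feature one might use when
# ranking candidate documents.
train_queries = [q["qstr"] for q in queries if q["type"] == "train"]
click_weight = {q: float(w) for q, w in doc["content"]["queries"].items()}

print(train_queries)               # ['monster high', 'puzzle']
print(click_weight["monster high"])  # 0.741
```

In practice these structures are fetched from the API rather than embedded as literals; the parsing logic is the same.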
2. Rankings are generated for each query and uploaded through the API:

    { "qid": "U-q22",
      "runid": "82",
      "creation_time": "Wed, 04 Jun 2014 15:03:56 -0000",
      "doclist": [
        { "docid": "U-d4" },
        { "docid": "U-d2" },
        ... ],
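Assembling a run in this format before upload can be sketched as follows (the query and document ids are illustrative; the actual HTTP upload against the API is omitted):

```python
import json

def make_run(qid, runid, ranking):
    """Assemble a run submission in the JSON shape shown above."""
    return {"qid": qid,
            "runid": runid,
            "doclist": [{"docid": d} for d in ranking]}

run = make_run("U-q22", "82", ["U-d4", "U-d2", "U-d9"])
payload = json.dumps(run)  # serialized body of the upload request
print(run["doclist"][0])   # {'docid': 'U-d4'}
```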
3. When any of the test queries is fired, the live site requests rankings from the API and interleaves them with those of the production system.
Interleaving
- The site provides the set of candidate items that can be re-ranked (a safety mechanism)
- The experimental ranking is interleaved with the production ranking
- Needs 1-2 orders of magnitude less data than A/B testing (also, it is a within-subject as opposed to between-subject design)

Example:
system A: doc 1, doc 2, doc 3, doc 4, doc 5
system B: doc 2, doc 4, doc 7, doc 1, doc 3
interleaved list: doc 1, doc 2, doc 4, doc 3, doc 7
Inference: A > B
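The feedback records later in the deck carry "type": "tdi", i.e. team-draft interleaving. A simplified, deterministic sketch of that algorithm (the real variant flips a coin for which team picks first in each round, so its output on the lists above may differ from this one):

```python
def team_draft_interleave(list_a, list_b):
    """Simplified team-draft interleaving: teams alternate picking their
    highest-ranked document not yet in the interleaved list, and each
    picked document is credited to the team that contributed it."""
    interleaved, teams = [], {}
    turn_a = True  # deterministic here; randomized per round in practice
    total = len(set(list_a) | set(list_b))
    while len(interleaved) < total:
        source, team = (list_a, "A") if turn_a else (list_b, "B")
        for doc in source:
            if doc not in teams:
                interleaved.append(doc)
                teams[doc] = team
                break
        turn_a = not turn_a
    return interleaved, teams

a = ["doc 1", "doc 2", "doc 3", "doc 4", "doc 5"]
b = ["doc 2", "doc 4", "doc 7", "doc 1", "doc 3"]
ranking, teams = team_draft_interleave(a, b)
print(ranking)  # ['doc 1', 'doc 2', 'doc 3', 'doc 4', 'doc 5', 'doc 7']
```

Clicks on the interleaved list are then credited to the team that contributed the clicked document, which is what makes the A-vs-B inference possible.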
4. Participants get detailed feedback on user interactions (clicks):

    { "feedback": [
        { "qid": "S-q1",
          "runid": "baseline",
          "type": "tdi",
          "doclist": [
            { "docid": "S-d1",
              "clicked": true,
              "team": "site" },
            ...
5. The ultimate measure is the number of "wins" against the production system (aggregated over a period of time):

    Outcome = #Wins / (#Wins + #Losses)
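Credit assignment and the outcome measure can be sketched as follows. The feedback records are illustrative, in the shape of the step-4 example; the team labels "participant" and "site" are assumptions for the sketch. A query impression is a win for the participant if their documents attract more clicks than the site's:

```python
def query_result(doclist):
    """Count clicks per team for one interleaved impression."""
    clicks = {"participant": 0, "site": 0}
    for d in doclist:
        if d.get("clicked"):
            clicks[d["team"]] += 1
    if clicks["participant"] > clicks["site"]:
        return "win"
    if clicks["participant"] < clicks["site"]:
        return "loss"
    return "tie"

def outcome(feedback):
    """Outcome = #Wins / (#Wins + #Losses); ties are ignored."""
    results = [query_result(f["doclist"]) for f in feedback]
    wins, losses = results.count("win"), results.count("loss")
    return wins / (wins + losses) if wins + losses else None

feedback = [  # illustrative impressions, not real data
    {"qid": "S-q1", "doclist": [
        {"docid": "S-d1", "clicked": True, "team": "participant"},
        {"docid": "S-d2", "clicked": False, "team": "site"}]},
    {"qid": "S-q2", "doclist": [
        {"docid": "S-d3", "clicked": True, "team": "site"}]},
    {"qid": "S-q3", "doclist": [
        {"docid": "S-d4", "clicked": True, "team": "participant"}]},
]
print(outcome(feedback))  # 2 wins vs. 1 loss
```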
What is in it for participants?
- Access to privileged commercial data (search and click-through data)
- Opportunity to test IR systems with real, unsuspecting users in a live setting (not the same as crowdsourcing!)
- Continuous evaluation is possible, not limited to a yearly evaluation cycle
Benchmark organization
- Training period
  - train queries: feedback available, individual feedback, updates possible
  - test queries: feedback available, no individual feedback, updates possible
- Test period
  - test queries: no feedback available, no individual feedback, updates not possible
Product search
- Ad-hoc retrieval over a product catalog
- Several thousand products
- Limited amount of text, lots of structure (categories, characters, brands, etc.)
Product data
- Product name
- Price / bonus price
- Short description
- Long description
- Recommended age from/to
- Gender recommendation
- Categories
- Brands
- (Links to) photos
- Frequent queries that led to the product

    { "content": {
        "age_max": 10,
        "age_min": 6,
        "arrived": "2014-08-28",
        "available": 1,
        "brand": "Mattel",
        "category": "Bab\u00e1k, kell\u00e9kek",
        "category_id": "25",
        "characters": [],
        "description": "A Monster High\u00ae iskola sz\u00f6rnycsemet\u00e9i […]",
        "gender": 2,
        "main_category": "Baba, babakocsi",
        "main_category_id": "3",
        "photos": [
          "http://regiojatek.hu/data/regio_images/normal/20777_0.jpg",
          "http://regiojatek.hu/data/regio_images/normal/20777_1.jpg",
          […] ],
        "price": 8675.0,
        "product_name": "Monster High Scaris Parav\u00e1rosi baba t\u00f6bbf\u00e9le",
        "queries": {
          "clawdeen": "0.037",
          "monster": "0.222",
          "monster high": "0.741" },
        "short_description": "A Monster High\u00ae iskola sz\u00f6rnycsemet\u00e9i els\u0151 k\u00fclf\u00f6ldi \u00fatjukra indulnak..." },
      "creation_time": "Mon, 11 May 2015 04:52:59 -0000",
      "docid": "R-d43",
      "site_id": "R",
      "title": "Monster High Scaris Parav\u00e1rosi baba t\u00f6bbf\u00e9le" }

The "queries" field lists the frequent queries that led to the product.
Queries
- Typically very short
- Examples: monster high, magnetiz, duplo, lego friends, geomag, trash+pack, barbie, monopoly, lego duplo, transformers, star wars, nerf, carrera, baba
Results (2015)
[Figure: outcome per evaluation round (rounds 0-5, outcome ranging from 0 to 0.6) for the Baseline, UiS, GESIS, and IRIT systems]
Inventory changes
[Figure: daily number of products that newly arrived, became available, or became unavailable, over the period 05-01 to 05-15]
Summary
- Successes
  - Experimental methodology
  - Many interesting opportunities to address current limitations (come to the NewsREEL & LL4IR session tomorrow)
  - The living labs platform: open source, can be used for a variety of tasks
  - Some interesting work for product search (see the best of the labs session)
- Lack of success
  - Raising sufficient interest in the use-cases at CLEF
Limitations / Open issues
- Head queries only: a considerable portion of traffic, but only popular information needs
- Lack of context: no knowledge of the searcher's location, previous searches, etc.
- No real-time feedback: the API provides detailed feedback, but it is not immediate
- Limited control: experimentation is limited to single searches, where results are interleaved with those of the production system; there is no control over the entire result list
- Ultimate measure of success: search is only a means to an end, it is not the ultimate goal
TREC Open Search
http://trec-open-search.org/
- Use-case: academic search (ad-hoc document search)
- Sites: CiteSeerX, SSOAR (German social sciences), Microsoft Academic Search
- Round #3 runs from Oct 1 to Nov 15