TREC Session Track
Ben Carterette, Paul Clough, Evangelos Kanoulas, Mark Sanderson




Page 1

TREC Session Track

Ben Carterette, Paul Clough, Evangelos Kanoulas, Mark Sanderson

Page 2

Vision

Page 3

Vision

Put the user in the evaluation loop

Cranfield Paradigm
• Simple user model
• Controlled experiments
• Reusable but static test collections

Interactive Evaluation / Live Labs
• Full user participation
• Many degrees of freedom
• Unrepeatable experiments

Session Track: a middle ground between the two

Page 4

Vision

Simulate an interactive experiment
◦ Compare IIR systems by controlling user interactions
◦ Build a collection that is portable and reusable
◦ Devise measures to evaluate the utility a user obtains throughout a session

Test systems in a large variety of cases that lead to sessions

Page 5

Sessions

Categorization on a “who’s to blame” basis?
◦ Corpus: there is no composite document that fulfills the user’s need
◦ User: users cannot express their need with the appropriate query; users learn about their need throughout the session
◦ System

Page 6

IIR systems for sessions

Accumulate information through the session
◦ Learn from the past: improve on the current query
◦ Look into the future

Page 7

Goals

(G1) Test the ability of retrieval systems to use the history of user interactions (past queries,…) to improve performance on the current query

(G2) Evaluate system performance over an entire session instead of a single query
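G2 implies a session-level effectiveness measure. As a minimal sketch, assuming graded gains per query in a session are available, the function below computes a session-weighted DCG in the spirit of Järvelin et al.’s sDCG (an illustration, not necessarily the track’s official metric): later queries in the session are discounted, and standard log-rank discounting applies within each ranked list.

import math

def session_dcg(session_gains, bq=4.0, b=2.0):
    # session_gains: one list of graded gains per query in the session,
    # e.g. [[2, 0, 1], [1, 1, 0]] for a two-query session (toy data).
    total = 0.0
    for q, gains in enumerate(session_gains, start=1):
        query_discount = 1.0 + math.log(q, bq)        # later queries count less
        dcg = sum(g / (1.0 + math.log(r, b))          # within-list rank discount
                  for r, g in enumerate(gains, start=1))
        total += dcg / query_discount
    return total

print(session_dcg([[2, 0, 1], [1, 1, 0]]))   # toy example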

Page 8

Reality

Page 9

Task 1: If history can teach us anything…

All participants should be provided with a fixed history of user interactions:

Q1 → RL1 → … → Qn → RLn → Qn+1
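To make the fixed history concrete: each session supplies the current query Qn+1 together with the past queries, the ranked URLs returned for them, and the recorded clicks and dwell times. The record below is a hedged sketch of that information; the field names and values are illustrative, not the track’s actual distribution format.

# Illustrative only: hypothetical field names, not the track's actual schema.
session = {
    "past_queries": ["hawaii volcanoes", "kilauea eruption history"],   # Q1..Qn
    "ranked_urls": [                      # results returned for each past query
        ["http://example.org/a", "http://example.org/b"],
        ["http://example.org/c", "http://example.org/d"],
    ],
    "clicks": [                           # (query index, clicked URL, dwell time in s)
        (2, "http://example.org/c", 35.0),
    ],
    "current_query": "kilauea eruption 1984",   # Qn+1: the query to be answered
}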

Page 10

Participants’ Task

For each session, participants submitted 4 ranked lists (RLs):

(RL1) current query only

(RL2) current query & past queries in the session

(RL3) current query & past queries in the session, and the ranked URLs returned for them

(RL4) current query & past queries in the session, ranked URLs, clicks, and dwell times
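As a hedged sketch of what a participant might do for RL2 beyond RL1 (one simple baseline among many, not a prescribed method): combine the current query’s terms with down-weighted terms from the past queries in the session before retrieval.

from collections import Counter

def rl2_term_weights(current_query, past_queries, decay=0.5):
    # Baseline sketch: current-query terms weigh 1.0; past-query terms are added
    # with geometrically decaying weights, most recent query first.
    weights = Counter()
    for term in current_query.lower().split():
        weights[term] += 1.0
    for age, query in enumerate(reversed(past_queries), start=1):
        for term in query.lower().split():
            weights[term] += decay ** age
    return weights

# RL1 would score documents against the current query only;
# RL2 could instead use these history-weighted terms.
print(rl2_term_weights("kilauea eruption 1984",
                       ["hawaii volcanoes", "kilauea eruption history"]))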

Page 11

Challenges

Come up with topics that can lead to sessions with more than one query.
◦ In 2011 we generated about 100 topics and collected about 1000 sessions: 90% were single-query sessions.
◦ In 2012 we got about the same percentage of single-query sessions.

Page 12

Challenges

Come up with topics that can lead to different search task types.
◦ In 2011 we generated factual tasks with specific goal(s).
◦ In 2012 we generated a variety of task types: factual / intellectual tasks, specific / amorphous goals.

Page 13

More challenges

Evaluating retrieval systems:
◦ The notion of relevance may change throughout the session
◦ Novelty needs to be considered
◦ Different levels of diversification should be applied throughout the session

Page 14

Task 2: Looking into the future…

Whole-session evaluation

Page 15

Task 2: Looking into the future…

Whole-session evaluation

Simulate user interactions
◦ Given a topic, a ranked list, and relevance information:
• Simulate browsing behavior
• Simulate clicks, dwell times, etc.
• Simulate query reformulation

Quantify utility
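A minimal sketch of one way to do the simulation above, assuming per-rank relevance grades are available (a cascade-style scan with assumed click and stop probabilities; the track does not prescribe these values):

import random

def simulate_clicks(relevance, p_click_rel=0.7, p_click_nonrel=0.1, p_stop=0.3, seed=0):
    # Cascade-style sketch: the simulated user scans the ranked list top-down,
    # clicks with a relevance-dependent probability, and may stop after a click.
    rng = random.Random(seed)
    clicks = []
    for rank, rel in enumerate(relevance, start=1):
        p_click = p_click_rel if rel > 0 else p_click_nonrel
        if rng.random() < p_click:
            clicks.append(rank)
            if rng.random() < p_stop:
                break
    return clicks

print(simulate_clicks([1, 0, 2, 0, 0, 1, 0, 0, 0, 1]))   # toy relevance grades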

Page 16

Conclusions

Built a static test collection
Evaluate retrieval performance when session information is available
Still far from a whole-session evaluation
◦ Is it possible to build reusable and portable but dynamic test collections?
◦ Is it worth it?